CN113407669B

CN113407669B - Semantic track query method based on activity influence

Info

Publication number: CN113407669B
Application number: CN202110674824.7A
Authority: CN
Inventors: 袁野; 李翀; 马德龙
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2022-11-11
Anticipated expiration: 2041-06-18
Also published as: CN113407669A

Abstract

The invention provides a semantic track query method based on activity influence, covering the index structure, query processing algorithm and query optimization technique for semantic track data. Specifically, the invention introduces the concept of activity influence in semantic track data and, based on this concept, defines a semantic track query based on activity influence. To process the query efficiently, the invention designs a Hybrid Grid Index (HGI) structure that simultaneously integrates the spatial positions, activity keywords and activity influence of semantic tracks, and designs and implements an efficient heuristic search framework on top of this index. The framework finds, in a semantic track data set, the top-k tracks that contain the user's query keywords within a user-specified threshold, giving priority to tracks with greater activity influence.

Description

Semantic track query method based on activity influence

Technical Field

The invention belongs to the technical field of spatial data processing, and particularly relates to a semantic track query method based on activity influence.

Background

With the continuous development of mobile social networks and location-based service applications, a great deal of semantic track data is formed. The semantic track not only contains longitude and latitude and timestamp information in the traditional spatiotemporal track, but also is accompanied by text information describing user behavior activities. Query studies on semantic track data have generated many application values, for example: recommending friends with the same interest or similar habits in a social network, providing suitable advertising targets or selecting suitable advertising topics for merchants in advertising marketing, providing personalized travel route recommendations for tourists on a travel application, and the like.

Efficient querying of trajectories relies on an efficient indexing mechanism, and IR-trees are often used to index semantic trajectory data. The essence of the IR-Tree is to extend the inverted index on the basis of the R-Tree, and at each node of the IR-Tree, there is a pointer to the inverted file describing all the keywords contained within the minimum bounding rectangle represented by the node. Each inverted file associated with the intermediate node records all sub-nodes with the keywords; the inverted list in the inverted file associated with the leaf node records the specific track point. There are also several variants of IR trees, such as DIR trees, CIR trees, etc. In the construction process of the DIR tree, the spatial information and the semantic information of the spatial text object are considered at the same time, so that the objects in the same node are similar as much as possible in semantic attributes. The CIR tree divides the objects into different clusters according to the spatial proximity of all the objects in the nodes, and records the distribution condition of the keywords in each cluster.

The existing patent 'reverse nearest neighbor query method and device based on semantic track big data' designs a query method for reverse nearest neighbors in semantic track data; that method is not suitable for the query problem addressed by the present invention.

Fast query processing of semantic tracks still faces a number of difficulties: (1) With the continuous development of location-based service applications, users raise more personalized and diversified query requirements, and traditional query techniques cannot satisfy increasingly complex query requests. (2) Index structures such as the IR-tree are not efficient enough for massive customized queries, because they do not make full use of the various kinds of information in semantic tracks for pruning. (3) Existing query processing frameworks and optimization mechanisms for traditional tracks are not suitable for semantic track queries, and the existing algorithm frameworks and optimization mechanisms for semantic track queries are not fully applicable to the query problem posed by the invention.

Disclosure of Invention

In order to solve the problems, the invention provides a semantic track query method based on activity influence, which can search tracks with greater activity influence preferentially.

A semantic track query method based on activity influence comprises the following steps:

S1: acquiring basic information of a semantic track data set D, wherein the basic information comprises the track numbers of all semantic tracks T in the semantic track data set D and the activity influence of the keywords corresponding to all track points p on the semantic tracks T; each track point corresponds to at least one keyword, the keywords corresponding to the track points are not completely identical, and the activity influence of the same keyword at different track points is not completely identical; the activity influence of a keyword at any track point is the number of occurrences of the keyword at that track point;

S2: constructing a hybrid grid index HGI based on the basic information of the semantic track data set D, wherein the hybrid grid index HGI consists of a hybrid quadtree index HQ-tree and a hybrid inverted index HIF;

S3: for a user-specified query requirement Q = (loc, acts, d_max), finding the k semantic tracks that best match the query requirement Q in the semantic track data set D based on the hybrid grid index HGI, wherein loc is the query position set by the user, acts is the keyword set specified by the user, d_max is the user-set maximum expected travel distance, and k is at least 3.

Further, the method for constructing the mixed quadtree index HQ-tree in step S2 is as follows:

S21: dividing the real geographic area in which the semantic track data set D is located into grids of d levels, wherein the d-th level comprises 2^(d-1) × 2^(d-1) grids, d is at least 3, and each grid corresponds to a grid number;

S22: judging, for each grid of each level, whether it contains a track point, eliminating the grids that contain no track point, and constructing a quadtree from the remaining grids according to the hierarchical containment relation of the grids, wherein nodes without child nodes are leaf nodes and the remaining nodes are non-leaf nodes;

S23: associating an index entry and bitmap information with each node of the quadtree, specifically:

for any non-leaf node g_0, the associated index entry is a triple (Gid_0, {g' ∈ g_0.subgrids}, ifile), where Gid_0 is the number of the grid corresponding to g_0, {g' ∈ g_0.subgrids} is a list of pointers to all child nodes of g_0, and ifile is a first inverted index over the keywords of all track points in the grid corresponding to g_0, which records, for each keyword, its activity influence in the grid corresponding to g_0 and the grid numbers of the child nodes of g_0 in which the keyword appears;

for any leaf node g_1, the associated index entry is a pair (Gid_1, ilist), where Gid_1 is the number of the grid corresponding to g_1, and ilist is a second inverted index over the keywords of all track points in the grid corresponding to g_1, which records, for each keyword, its activity influence in the grid corresponding to g_1 and the numbers of the semantic tracks in which the keyword appears;

for any node of the quadtree, the bitmap information is a data sequence consisting of 0 and 1, and each data bit on the data sequence corresponds to a semantic track, wherein 0 indicates that the semantic track corresponding to the data bit does not pass through the grid corresponding to the current node, and 1 indicates that the semantic track corresponding to the data bit passes through the grid corresponding to the current node.

Further, the method for calculating the activity influence of the keyword in the grid corresponding to any node comprises the following steps:

and acquiring the activity influence of the keyword at all track points in the grid corresponding to the current node, and taking the maximum value as the activity influence of the keyword in the grid corresponding to the current node.

Further, the hybrid inverted index HIF in step S2 is composed of an activity list, an activity inverted arrangement list and a sub-track length list, where the activity list is used to store keywords corresponding to all track points on the semantic track, the activity inverted arrangement list is used to store association relations between the keywords and the track points where the keywords appear, and the sub-track length list is used to store lengths from each track point to the initial track point on the semantic track.

Further, the step S3 of finding k semantic tracks that are most matched with the query requirement Q in the semantic track data set D based on the hybrid grid index HGI specifically includes:

S31: setting up a heap set and traversing the hybrid quadtree index HQ-tree from top to bottom: first putting the root node into the heap set, obtaining the child nodes of the root node, then removing the root node from the heap set, and adding to the heap set those child nodes of the root node that satisfy a set pruning rule;

S32: finding the node with the maximum activity influence in the current heap set, removing that node from the heap set, and adding to the heap set those of its child nodes that satisfy the set pruning rule; repeating this step until a leaf node is reached;

S33: according to the index entry associated with the leaf node reached by the traversal, obtaining the semantic tracks in the leaf node that contain the keywords acts set by the user, and taking the obtained semantic tracks as tracks to be verified; computing the activity influence of each track to be verified, and taking the k tracks with the largest activity influence as candidate tracks;

S34: removing the leaf node of step S33 from the heap set, and repeating steps S32 to S33 on the current heap set until the early termination condition is met or all nodes have been traversed; the k candidate tracks finally obtained are the semantic tracks that best match the user-given query requirement Q = (loc, acts, d_max).

Further, the pruning rule in step S31 is that a child node is kept only if: the distance between at least one track point in the grid corresponding to the child node and the query position loc set by the user is not more than d_max; the activity influence of the child node is not 0; and at least one of the semantic tracks passing through the grid corresponding to the child node has not yet been accessed.

Further, the method for calculating the activity influence of any node in step S32 includes:

finding out keywords belonging to keywords acts set by a user from all keywords corresponding to the node; and summing the maximum value of the activity influence of the found keywords in the node, and dividing the sum value by the total number of the keywords in the word set acts to obtain the activity influence of the node.
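As a rough illustration of steps S31 to S32 together with the node activity influence just described, the following Python sketch orders nodes by that score with a heap. The dictionary layout of a node (`gid`, `max_inf`, `children`) and the `keep` predicate standing in for the pruning rule are assumptions made for the example; they are not structures named in the text, and the verification of tracks in step S33 is left to the caller.

```python
import heapq
from itertools import count


def node_influence(max_inf, acts):
    """Sum the node's stored maximum influence of each query keyword it
    contains, then divide by the number of query keywords."""
    return sum(max_inf[w] for w in acts if w in max_inf) / len(acts)


def best_first_leaves(root, acts, keep):
    """Yield leaf nodes in order of decreasing node influence.

    root: dict with keys 'gid', 'max_inf' (keyword -> influence), 'children'.
    keep: hypothetical pruning predicate applied to each child node.
    """
    tie = count()  # tie-breaker so the heap never compares two dicts
    heap = [(-node_influence(root["max_inf"], acts), next(tie), root)]
    while heap:
        _, _, node = heapq.heappop(heap)  # node with the largest influence
        if not node["children"]:
            yield node
            continue
        for child in node["children"]:
            if keep(child):
                score = node_influence(child["max_inf"], acts)
                heapq.heappush(heap, (-score, next(tie), child))
```

Python's `heapq` is a min-heap, so the scores are negated to pop the most influential node first.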

Further, the method for calculating the activity influence of each to-be-verified track in step S33 is as follows:

S331: initializing the window endpoints s and e, and denoting the sub-track extracted by the window each time as T[s, e];

S332: keeping the left endpoint s fixed at the first track point, and moving the right endpoint e forward from the first track point along the track to be verified; each time the right endpoint advances, a sub-track T[s, e] is extracted, and it is judged whether the distance between the currently extracted sub-track T[s, e] and the query position loc set by the user is greater than d_max; if so, keeping the current position of the right endpoint e unchanged and advancing the left endpoint s until the distance between the sub-track T[s, e] and the query position loc is again not more than d_max, and then calculating the activity influence of the sub-track T[s, e] according to the following formula; if not, directly calculating the activity influence of the sub-track T[s, e] according to the following formula;

wherein Inf (T [ s, e ]]Q) denotes the sub-track T [ s, e ]]Act | represents the total number of keywords in the word set acts, inf _p.poi (w) is represented in the sub-track T [ s, e ]]Activity influence of keywords appearing in and belonging to the word set acts, inf _max (w) represents the maximum value of the activity influence of the keywords belonging to the word set acts at all locus points, Q.acts is a keyword set by the user, w is the keyword belonging to Q.acts, and p is the sub-locus T [ s, e ]]The track points appearing in the picture;

S333: keeping the current position of the left endpoint s unchanged, continuing to advance the right endpoint e along the track to be verified, continuing to apply the condition judgment to each extracted sub-track and obtaining the activity influence of the sub-tracks that satisfy the condition, and repeating until the right endpoint e reaches the last track point;

S334: taking the maximum of the activity influences of all the sub-tracks obtained as the activity influence of the current track to be verified.

Further, in step S332, the distance between the sub-track T[s, e] and the query position loc set by the user is calculated as:

$$d(T[s,e],Q)=dist(Q.loc,p_s)+\sum_{i=s}^{e-1}dist(p_i,p_{i+1})$$

wherein d(T[s, e], Q) is the distance between the sub-track T[s, e] and the query position loc set by the user, dist(Q.loc, p_s) is the Euclidean distance between the query position loc and the track point at the left endpoint s, and dist(p_i, p_{i+1}) is the Euclidean distance between any two adjacent track points p_i and p_{i+1} of the sub-track T[s, e], with p_i ranging over all track points except the right endpoint e.

Further, the early termination condition in step S34 is that a first sum is smaller than the minimum track activity influence among the k candidate tracks currently obtained, wherein the first sum is calculated as follows:

taking each keyword in the keyword set acts specified by the user in turn as the current keyword, finding the maximum activity influence of the current keyword over all track points covered by the nodes in the current heap set, and taking this maximum as the maximum influence of the current keyword;

summing the maximum influences of all keywords in acts, and dividing the sum by the total number of keywords in acts to obtain the first sum.
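The early termination test can be illustrated with a short Python sketch. It assumes each node remaining in the heap set is represented by a dictionary mapping keywords to their maximum activity influence in that node, and that these values are on the same scale as the track scores they are compared with; both the representation and the function names are hypothetical.

```python
def first_sum(heap_nodes, acts):
    """Upper bound on the influence of any track still reachable.

    heap_nodes: list of dicts, keyword -> maximum influence in that node.
    """
    total = sum(max((n.get(w, 0) for n in heap_nodes), default=0) for w in acts)
    return total / len(acts)


def can_terminate(heap_nodes, acts, candidate_scores, k):
    """True once k candidates exist and no remaining node can beat the worst."""
    if len(candidate_scores) < k:
        return False
    return first_sum(heap_nodes, acts) < min(candidate_scores)
```

The check is only meaningful after k candidates have been collected, which is why the sketch returns `False` before that point.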

Advantageous effects:

1. The invention provides a semantic track query method based on activity influence, covering the index structure, query processing algorithm and query optimization technique for semantic track data. Specifically, the invention introduces the concept of activity influence in semantic track data and defines a semantic track query requirement based on it. To process semantic track queries efficiently, the invention designs a hybrid grid index structure that simultaneously integrates the spatial positions, keywords and activity influence of semantic tracks; compared with existing index techniques it has stronger pruning capability, and it allows tracks with greater activity influence to be searched first.

2. The invention provides a semantic track query method based on activity influence that implements an efficient heuristic search method on top of the hybrid grid index structure. The method traverses the hybrid grid index structure using a heap set and quickly finds the k semantic tracks that best match the query requirement Q; that is, among the tracks in the semantic track data set that contain the user's query keywords within the user-specified threshold, it preferentially returns the k tracks with the largest activity influence, which greatly improves the processing efficiency of track queries.

3. The invention provides a semantic track query method based on activity influence, which introduces an early termination condition in a heuristic search process and can accelerate a query processing process.

Drawings

FIG. 1 is a flowchart of a semantic track query method based on activity influence according to the present invention;

FIG. 2 is a schematic diagram of a semantic track data set based on activity influence according to the present invention;

FIG. 3 is a schematic diagram of two-dimensional geospatial meshing provided by the present invention;

FIG. 4 (a) is a schematic diagram of a quad-tree provided by the present invention;

FIG. 4 (b) is a diagram illustrating an index entry of a non-leaf node according to the present invention;

FIG. 4 (c) is a diagram illustrating index entries of a leaf node according to the present invention;

FIG. 5 is a schematic diagram of a hybrid inverted file provided by the present invention;

FIG. 6 is a heuristic search framework provided by the present invention;

FIG. 7 is a schematic structural diagram of a main stack and an auxiliary stack provided by the present invention.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

The invention provides and solves a novel semantic track query in large-scale semantic track data, and deeply researches an index structure, a query processing algorithm and a query optimization technology of the semantic track data. Specifically, the invention provides a concept of Activity Influence in semantic track data, and defines a semantic track Query model (AITQ) based on the Activity Influence according to the concept. In order to realize the efficient processing of the query, the invention designs a mixed grid index structure which simultaneously integrates multiple information of semantic track space positions, activity keywords and activity influence, and designs and realizes a heuristic search framework based on the index, wherein the framework can preferentially search tracks with larger activity influence. In the heuristic search process, the invention introduces an early termination condition to accelerate the query processing process.

First, the symbols and meanings adopted in the present invention as shown in table 1 are given.

TABLE 1 symbols and meanings of the invention

Some basic definitions are given below:

Definition 1: a semantic track T containing n POI (Point of Interest) positions is defined as T = {p_1, p_2, …, p_n}, where p_i = (poi, act, t); poi represents a two-dimensional spatial location that can be semantically labeled, act represents a text record consisting of keywords that describe the user's behavioral activity, and t represents a timestamp.

Definition 2: given a semantic track T = (p_1, p_2, …, p_n) consisting of n track points, the point sequence formed by the consecutive track points (p_s, p_{s+1}, …, p_e), where 1 ≤ s ≤ e ≤ n, is called a sub-track of the track T and is denoted T[s, e]. W[s, e] denotes all activity keywords contained in the sub-track T[s, e], i.e.

$$W[s,e]=\bigcup_{i=s}^{e}p_i.act$$

Definition 3: a user-specified query is given as Q = (loc, acts, d_max), where loc is the query location expressed in latitude and longitude coordinates; acts is a set of activity keywords describing a series of activities that the user intends to perform; and d_max is the maximum distance the user expects to travel from the query location to complete all the activities.

Definition 4 (distance between a query and a sub-track): given a query Q = (loc, acts, d_max) and a sub-track T[s, e] of a semantic track T, the distance between the query Q and the sub-track T[s, e] is denoted d(T[s, e], Q), and

$$d(T[s,e],Q)=dist(Q.loc,p_s)+\sum_{i=s}^{e-1}dist(p_i,p_{i+1})$$

where dist denotes the Euclidean distance between two positions.

Definition 5: given a semantic track T and a query Q = (loc, acts, d_max), if a sub-track T[s, e] of the track T satisfies (1) Q.acts ⊆ W[s, e] and (2) d(T[s, e], Q) ≤ d_max, then the sub-track T[s, e] is said to match the query Q.

Definition 6: given a query Q, if there is at least one sub-track T [ s, e ] in a track T in the semantic track data set D that matches the query Q, the track T is referred to as a candidate track for the query Q.

In FIG. 2, the set of query activities currently submitted by the user is {w_1, w_2, w_4}, and d is the maximum travel distance. Track T_1 does not contain activity w_4, so no sub-track of T_1 matches the query and T_1 is not a candidate track of the query. In T_2, only the sub-track T_2[1, 5] satisfies the activity requirement of the query, and the figure shows that the distance d(T_2[1, 5], Q) between this sub-track and Q is large; assuming d(T_2[1, 5], Q) > d, T_2 is not a candidate track either. In track T_3, the sub-track T_3[3, 4] matches Q, because its activity set W[3, 4] = {w_1, w_2, w_3, w_4} contains the query activity set {w_1, w_2, w_4} and its distance to Q is small enough to satisfy the travel limit (i.e. d(T_3[3, 4], Q) < d); thus T_3 is a candidate track of Q.
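Definitions 2 to 6 can be exercised with a brute-force Python sketch. It is meant only to make the matching conditions concrete: the representation of a track point as a dictionary with `poi` and `act` keys is an assumption, indices are 1-based as in the definitions, and every sub-track is enumerated without any index support.

```python
from math import dist


def keywords(track, s, e):
    """W[s, e]: union of the keyword sets of points p_s..p_e (1-based, inclusive)."""
    return set().union(*(p["act"] for p in track[s - 1:e]))


def distance(track, s, e, loc):
    """d(T[s, e], Q): loc to p_s, plus the path length from p_s to p_e."""
    pts = [p["poi"] for p in track[s - 1:e]]
    return dist(loc, pts[0]) + sum(dist(a, b) for a, b in zip(pts, pts[1:]))


def matches(track, s, e, loc, acts, d_max):
    """Definition 5: keyword containment and the travel limit both hold."""
    return set(acts) <= keywords(track, s, e) and distance(track, s, e, loc) <= d_max


def is_candidate(track, loc, acts, d_max):
    """Definition 6: at least one sub-track matches the query."""
    n = len(track)
    return any(matches(track, s, e, loc, acts, d_max)
               for s in range(1, n + 1) for e in range(s, n + 1))
```

Enumerating all sub-tracks is quadratic in the track length, which is acceptable for checking small examples by hand.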

Definition 7: given a semantic track data set D, consider the set of all POIs in D and the set of all activity keywords in D. For an activity represented by a keyword w and a location poi, the influence of the activity w at poi is defined as the number of semantic tracks in D that contain the activity keyword w at this poi, i.e.:

$$\mathrm{Inf}_{poi}(w)=\left|\{T\in D \mid \exists\,p\in T:\ p.poi=poi \wedge w\in p.act\}\right|$$
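Counting the influence of Definition 7 amounts to grouping tracks by (poi, keyword) pairs. The Python sketch below assumes a data set given as a dictionary from track id to a list of (poi, keywords) pairs; this layout and the function names are illustrative only. The second function derives the per-keyword maximum over all POIs, which is the quantity written Inf_max(w) in the text.

```python
from collections import defaultdict


def activity_influence(dataset):
    """Return (poi, keyword) -> number of distinct tracks with that keyword at that poi."""
    tracks_at = defaultdict(set)
    for tid, points in dataset.items():
        for poi, acts in points:
            for w in acts:
                tracks_at[(poi, w)].add(tid)  # a track counts once per (poi, w)
    return {key: len(tids) for key, tids in tracks_at.items()}


def max_influence(inf):
    """Return keyword -> largest influence of that keyword over all POIs."""
    best = defaultdict(int)
    for (_, w), value in inf.items():
        best[w] = max(best[w], value)
    return dict(best)
```

Using a set of track ids per pair ensures that a track visiting the same POI twice is still counted once.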

Definition 8: given a query Q = (loc, acts, d_max) and a candidate track T, if the sub-track T[s, e] matches the query Q, the activity influence of the sub-track under the query Q is defined as:

$$\mathrm{Inf}(T[s,e],Q)=\frac{1}{|Q.acts|}\sum_{w\in Q.acts}\ \max_{p\in T[s,e],\ w\in p.act}\frac{\mathrm{Inf}_{p.poi}(w)}{\mathrm{Inf}_{max}(w)}$$

where Inf_max(w) represents the maximum influence of the activity w over all POIs, i.e.

$$\mathrm{Inf}_{max}(w)=\max_{poi}\ \mathrm{Inf}_{poi}(w)$$

Because the influences of the activities in the query Q may differ greatly, the sub-track activity influence could otherwise depend heavily on a few activities; Inf_max(w) is therefore used to obtain relative activity influences, so that the influences of all query activities are aggregated on an equal footing and the activity influence of a sub-track is normalized to [0, 1]. A track T may contain multiple sub-tracks that match the query Q, and the activity influence of the candidate track T under the query Q is defined as:

$$\mathrm{Inf}(T,Q)=\max_{1\le s\le e\le n,\ T[s,e]\ \text{matches}\ Q}\ \mathrm{Inf}(T[s,e],Q)\qquad(4)$$

Referring to FIG. 2, only one candidate track, T_3, exists for the given query Q, and among all the sub-tracks of T_3 only the sub-track T_3[3, 4] matches Q. In the data set consisting of these three tracks, the maximum activity influences of the keywords w_1, w_2 and w_4 are Inf_max(w_1) = 1, Inf_max(w_2) = 2 and Inf_max(w_4) = 1.

According to Definition 8, Inf(T_3, Q) = Inf(T_3[3, 4], Q) = (1/1 + 1/2 + 1/1)/3 ≈ 0.833, i.e. the activity influence of the candidate track T_3 under the query Q is 0.833.
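The arithmetic of this example can be checked with a few lines of Python. The per-keyword ratios used below (1/1, 1/2 and 1/1 for w_1, w_2 and w_4) are an assumption inferred from the stated Inf_max values and the stated result of 0.833; the figure from which the individual influences would be read is not reproduced here.

```python
from fractions import Fraction

# Assumed relative influences Inf_{p.poi}(w) / Inf_max(w) within T_3[3, 4].
ratios = {
    "w1": Fraction(1, 1),
    "w2": Fraction(1, 2),
    "w4": Fraction(1, 1),
}

# Average over the three query keywords.
influence = sum(ratios.values()) / len(ratios)
rounded = round(float(influence), 3)
```

With these ratios the exact value is 5/6, which rounds to the 0.833 given in the text.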

Definition 9: given a query Q = (loc, acts, d_max), a semantic track data set D and an input positive integer k, suppose the set C ⊆ D contains all candidate tracks of the query Q in the data set D. An activity-influence-based semantic track query (AITQ) returns from the candidate set C a result set RS that satisfies: (1) RS ⊆ C and |RS| = k; (2) for all T ∈ RS and all T' ∈ C \ RS, Inf(T, Q) ≥ Inf(T', Q).
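One common way to maintain such a result set while candidates are scored one by one is a bounded min-heap, sketched below in Python. The class name `TopK` and its methods are hypothetical; the text does not prescribe this structure. Its `threshold` value is the smallest score currently in the result set, which is the quantity an early termination test would compare against.

```python
import heapq


class TopK:
    """Keep the k highest-scoring (score, track id) pairs seen so far."""

    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap: the weakest kept candidate is at index 0

    def offer(self, score, tid):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, tid))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, tid))  # evict the weakest

    def threshold(self):
        """Smallest kept score, or 0.0 while fewer than k candidates exist."""
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

    def result(self):
        return sorted(self.heap, reverse=True)
```

Ties at the k-th position are broken arbitrarily here, which is consistent with condition (2) since it only requires a non-strict inequality.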

Specifically, as shown in fig. 1, a semantic track query method based on activity influence includes the following steps:

s1: obtaining basic information of a semantic track data set D, wherein the basic information comprises track numbers of all semantic tracks T in the semantic track data set D and the activity influence of keywords corresponding to all track points p on all the semantic tracks T, each track point corresponds to at least one keyword, the keywords corresponding to all the track points are not identical, and the activity influence of the same keyword on different track points is not identical.

It should be noted that the activity influence of the keyword on any track point is the number of occurrences of the keyword on the current track point.

S2: constructing a hybrid grid index HGI based on the basic information of the semantic track data set D, wherein the hybrid grid index HGI consists of a hybrid quadtree index (Hybrid Quad-tree, HQ-tree) and a hybrid inverted index (HIF).

It should be noted that the HGI index takes the track as an index object, and integrates the space, the activity keyword and the activity influence into the index structure. The hybrid quadtree index HQ-tree resides in memory, and the hybrid inverted index HIF is stored on disk.

S3: for a user-specified query requirement Q = (loc, acts, d_max), finding the k semantic tracks that best match the query requirement Q in the semantic track data set D based on the hybrid grid index HGI, wherein loc is the query location set by the user, acts is the set of keywords specified by the user, d_max is the user-set maximum expected travel distance, and k is at least 3.

The construction method of the hybrid grid index HGI is set forth in detail below.

In a first aspect, the hybrid quadtree index HQ-tree is constructed as follows:

S21: dividing the real geographic area in which the semantic track data set D is located into grids of d levels, wherein the d-th level comprises 2^(d-1) × 2^(d-1) grids, each grid corresponds to a grid number, and d is at least 3.

That is, the HQ-tree organizes semantic track information by spatial grid division: the whole two-dimensional space is recursively divided into four equal subspaces, forming in turn the hierarchy 1-Grid, 2-Grid, …, (d-1)-Grid, d-Grid, until the last level contains 2^(d-1) × 2^(d-1) grids. All grids are then indexed by a Quad-tree, and the track information, activity keywords and activity influence are embedded into the Quad-tree index to form the HQ-tree index. Each node in the HQ-tree index represents the MBR (minimum bounding rectangle) of a grid area, and each node is associated with an inverted index of activity keywords.

FIG. 2 is a schematic diagram of the real geographic region in which the semantic track data set D is located. There are 3 semantic tracks T_1 to T_3 in the region. The figure marks the 4 track points of track T_1, the 4 track points of track T_2 and the 5 track points of track T_3, and w_1 to w_6 denote the keywords corresponding to the track points on the semantic tracks T_1 to T_3.

Referring to fig. 3, a schematic diagram of meshing is shown. The whole two-dimensional space is divided into 3-Grid, all grids have a Grid number Gid,1-Grid is represented by Grid 0, grid 0 is divided into four grids (grids 1-4) to form 2-Grid, and grids 5-20 form 3-Grid.
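The numbering in this example (grid 0, then grids 1 to 4, then grids 5 to 20) is consistent with a scheme in which the children of grid g are numbered 4g + 1 to 4g + 4. The Python sketch below uses that scheme to map a point to its grid number at every level. The scheme itself, the order of the four quadrants and the unit-square bounds are assumptions chosen for illustration; the text does not state how numbers are assigned within a level.

```python
def level_first_gid(level):
    """First grid number of a 1-based level: 0, 1, 5, 21, ..."""
    return (4 ** (level - 1) - 1) // 3


def grid_path(x, y, depth, bounds=(0.0, 0.0, 1.0, 1.0)):
    """Grid numbers from the root (1-Grid) down to the depth-Grid cell holding (x, y)."""
    x0, y0, x1, y1 = bounds
    gid, path = 0, [0]
    for _ in range(depth - 1):
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        # Quadrant index 0..3 (assumed order: low-x/low-y first).
        quadrant = (1 if x >= xm else 0) + (2 if y >= ym else 0)
        gid = 4 * gid + 1 + quadrant
        path.append(gid)
        x0, x1 = (xm, x1) if x >= xm else (x0, xm)
        y0, y1 = (ym, y1) if y >= ym else (y0, ym)
    return path
```

Under this numbering the parent of any grid g > 0 is (g - 1) // 4, so the hierarchy can be navigated without storing parent pointers.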

For any non-leaf node g_0, the associated index entry is a triple (Gid_0, {g' ∈ g_0.subgrids}, ifile), where Gid_0 is the number of the grid corresponding to g_0, {g' ∈ g_0.subgrids} is a list of pointers to all child nodes of g_0, and ifile is a first inverted index over the keywords of all track points in the grid corresponding to g_0, which records, for each keyword, its activity influence in the grid corresponding to g_0 and the grid numbers of the child nodes of g_0 in which the keyword appears.

It should be noted that the first inverted index is implemented by two hash tables, both keyed by keyword. The value in the first hash table records the maximum influence of the current key (denoted w) in g

Namely, it is

where Inf_max(w) represents the maximum influence of w over all POIs; the value in the second hash table is a list of pointers recording the child nodes that contain the activity.

Meanwhile, the method for calculating the activity influence of the keyword in the grid corresponding to any node comprises the following steps: and acquiring the activity influence of the keyword on all track points in the grid corresponding to the current node, and taking the maximum value as the activity influence of the keyword in the grid corresponding to the current node.

For any leaf node g_1, the associated index entry is a pair (Gid_1, ilist), where Gid_1 is the number of the grid corresponding to g_1, and ilist is a second inverted index over the keywords of all track points in the grid corresponding to g_1, which records, for each keyword, its activity influence in the grid corresponding to g_1 and the numbers of the semantic tracks in which the keyword appears.

It should be noted that the ilist of a leaf node differs from the ifile of a non-leaf node in that the values of the second hash table of the inverted list pointed to by ilist store the ids of the tracks that contain the keyword key within the MBR of the current node; a track can be fetched quickly from disk by its track id. Both ifile and ilist are implemented with hash tables, so the inverted index of each node can be accessed in O(1) time.

For any node of the quadtree, the bitmap information is a data sequence consisting of 0 and 1, and each data bit in the data sequence corresponds to a semantic track, wherein 0 indicates that the semantic track corresponding to the data bit does not pass through the grid corresponding to the current node, and 1 indicates that the semantic track corresponding to the data bit passes through the grid corresponding to the current node.

That is, each node of the present invention maintains a bitmap SIG that marks all tracks passing through the node. Specifically, the id of each track is mapped to one bit of SIG by a hash function: if the track T has a track point in the grid represented by the node and the hash function maps the id of T to i, i.e. hash(T) = i, then the i-th bit of SIG in the node is set to 1 (all bits are initially 0).

For example, FIG. 4(a) shows the index structure obtained by indexing the grid of FIG. 3 with an HQ-tree. Grid 7 does not contain any track, so its node is set to null in the index structure; null nodes are not shown in FIG. 4(a). FIG. 4(b) shows the structure of non-leaf node 2, which has two child nodes, node 9 and node 11, and also shows the inverted lists of the activities w₁ and w₂ indexed in node 2. FIG. 4(c) shows the specific structure of leaf node 11, which is one of the child nodes of node 2. There are three tracks in the whole grid, so the length of the bitmap is set to 3. If hash(T₁) = 1 and hash(T₂) = 2, then, since the only tracks passing through nodes 2 and 11 are T₁ and T₂, the bitmap SIG of both node 2 and node 11 is 110.
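The bitmap of the FIG. 4 example can be reproduced with a small sketch. It only illustrates the idea: the helper `make_sig`, the list-of-bits representation and the slot assigned to the third track are assumptions, since the text fixes only hash(T₁) = 1 and hash(T₂) = 2.

```python
def make_sig(track_ids, num_bits, slot_of):
    """Build a node bitmap: bit i is 1 if a track hashed to slot i passes through the node."""
    sig = [0] * num_bits
    for tid in track_ids:
        sig[slot_of(tid) - 1] = 1  # slots are 1-based in the text
    return sig


def slot_of(track_id):
    """Hypothetical hash: "T1" -> 1, "T2" -> 2, "T3" -> 3 (the slot of T3 is assumed)."""
    return int(track_id[1:])


# Only T1 and T2 pass through nodes 2 and 11, so both bitmaps come out as 110.
sig_node_2 = make_sig(["T1", "T2"], 3, slot_of)
sig_node_11 = make_sig(["T1", "T2"], 3, slot_of)
print(sig_node_2, sig_node_11)
```

A real index would more likely pack the bits into machine words; the list form is used here only to keep the bit positions readable.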

In general, the offline construction of the HQ-tree index over the track data set can be summarized as the following steps 2-1-1 to 2-1-3:

Step 2-1-1: Divide the two-dimensional space covered by the data set into grids and, after the division is finished, organize the grids of all levels with a Quad-tree.

Step 2-1-2: Process each track point of each track in the data set in turn through an insertion operation, updating the ifile and SIG of the intermediate nodes from top to bottom until the track point reaches a leaf node, and return after updating the ilist and SIG of that leaf node.

Step 2-1-3: After all tracks are processed and all nodes are updated, set the leaf nodes that contain no track point to null, then apply the same processing bottom-up to intermediate nodes whose child nodes are all null, until the root node is reached. Finally, the inverted file of the root node contains all activity keywords together with their maximum activity influence, and every bit of the root node bitmap SIG is set to 1.
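The three construction steps can be sketched as follows. This is a simplified illustration under several assumptions that are not in the text: coordinates are normalised to the unit square, a cell is keyed by (level, ix, iy) rather than by a grid number, the bitmap is a Python integer, and a track's bit position is simply its position in the input list instead of a hash value. Cells that are never touched are absent from the dictionary, which plays the role of the null nodes of step 2-1-3.

```python
from collections import defaultdict


def build_index(tracks, depth):
    """Insert every track point top-down, updating keyword maxima and SIG per cell."""
    nodes = {}
    for slot, track in enumerate(tracks):
        for x, y, keyword, influence in track:
            for level in range(1, depth + 1):
                side = 2 ** (level - 1)  # level d has 2^(d-1) x 2^(d-1) cells
                cell = (level, int(x * side), int(y * side))
                node = nodes.setdefault(
                    cell, {"max_inf": {}, "sig": 0, "tracks": defaultdict(set)}
                )
                # Inverted index entry: maximum influence of the keyword in this cell.
                previous = node["max_inf"].get(keyword, 0.0)
                node["max_inf"][keyword] = max(previous, influence)
                node["sig"] |= 1 << slot  # mark the track in the bitmap
                if level == depth:
                    # Leaf level only: remember which tracks contain the keyword.
                    node["tracks"][keyword].add(slot)
    return nodes


# Two hypothetical tracks; each point is (x, y, keyword, influence).
tracks = [
    [(0.1, 0.1, "w1", 0.5), (0.6, 0.1, "w2", 0.3)],
    [(0.1, 0.2, "w1", 0.9)],
]
index = build_index(tracks, depth=3)
root = index[(1, 0, 0)]
print(root["max_inf"], bin(root["sig"]))
```

As in the text, the root ends up holding every keyword with its maximum influence and a bitmap in which every track bit is set.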

In a second aspect, the hybrid inverted index HIF is composed of an activity list, an activity inverted arrangement list and a sub-track length list, wherein the activity list is used for storing keywords corresponding to all track points on a semantic track, the activity inverted arrangement list is used for storing association relations between the keywords and track points where the keywords appear, and the sub-track length list is used for storing the lengths from the track points on the semantic track to an initial track point.

It should be noted that the first list in the HIF index contains all the activities in T, arranged in ascending order, and is used to quickly filter out tracks that do not meet the activity requirements of the query; the second list stores the inverted index from activities to the track points of T, and can be used to filter out track points that do not need to be processed; the third list stores the lengths of all sub-tracks of T that start at p₁, i.e. len(T[1, e]), from which the length of a target sub-track can be obtained quickly.

Referring to fig. 5, fig. 5 is a schematic diagram of a mixed inverted file of two tracks on a disk. If the currently matched sub-track is T [3,5], len (T [3,5 ]) = len (T [1,5 ]) -len (T [1,3 ]). When the track data set is small, the active list can be loaded into the memory in advance for rapid filtering, and other lists are loaded from the disk when tracks needing further verification are obtained.
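The sub-track length list behaves like a prefix-sum array. The sketch below illustrates the identity len(T[3,5]) = len(T[1,5]) - len(T[1,3]); the function names and the sample points are hypothetical, and planar Euclidean distance is assumed in place of whatever geographic distance an implementation would use.

```python
import math


def prefix_lengths(points):
    """prefix[i] holds len(T[1, i + 1]), the path length from p1 to point i + 1."""
    prefix = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        prefix.append(prefix[-1] + math.hypot(x1 - x0, y1 - y0))
    return prefix


def sub_track_length(prefix, s, e):
    """len(T[s, e]) for 1-based inclusive indices, obtained by one subtraction."""
    return prefix[e - 1] - prefix[s - 1]


points = [(0, 0), (3, 4), (3, 10), (11, 10), (11, 12)]
prefix = prefix_lengths(points)
print(prefix)                          # lengths len(T[1, e]) for e = 1..5
print(sub_track_length(prefix, 3, 5))  # len(T[1,5]) - len(T[1,3])
```

Storing only the lengths from the first point keeps the list linear in the track size while still answering any sub-track length in constant time.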

The following details how heuristic search is performed on the hybrid grid index HGI when a user submits an AITQ query Q online. To process AITQ queries, the invention designs a heuristic query framework based on the HGI index: the framework traverses the HQ-tree heuristically so that tracks to be verified with larger activity influence are obtained first, uses the HIF index in the track verification stage to quickly calculate the actual activity influence of each track, and can terminate processing early according to a termination strategy during the traversal.

Specifically, referring to fig. 6, fig. 6 is a schematic diagram of the heuristic search framework. The inputs are a track data set D, a query Q = (loc, acts, d_max), a positive integer k, the HQ-tree and the HIF. First, a result set RS and a to-be-verified set VS are initialized, and a bitmap BM is initialized to record the tracks already searched. While traversing the HQ-tree nodes, a max-heap Heap orders the nodes that still need to be visited by their F(node, Q) values, so that the heap-top node has the largest F(node, Q) value; the root node of the HQ-tree is pushed onto the heap first.

Further, the step S3 of finding k semantic tracks that are most matched with the query requirement Q in the semantic track data set D based on the hybrid grid index HGI specifically includes the following steps:

S31: Set up a heap set and traverse the mixed quadtree index HQ-tree from top to bottom: first put the root node into the heap set and obtain the child nodes of the root node, then remove the root node from the heap set and add those child nodes of the root node that satisfy the set pruning rules to the heap set.

The set pruning rules are as follows: (1) the distance between at least one track point in the grid corresponding to the child node and the query position loc set by the user is not more than d_max; (2) the activity influence of the child node is not 0; (3) at least one of the semantic tracks passing through the grid corresponding to the child node has not been accessed.

Note that, for pruning rule 1, if d(Q, node) > d_max, no track point in the node can satisfy the distance constraint, and all tracks in the node can be safely pruned. For pruning rule 2, if Inf(node, Q) calculated by formula (1) is 0, the node contains none of the activity keywords in Q.acts and is pruned. For pruning rule 3, a node is pruned if all tracks passing through it have already been accessed. Whether a node contains an unaccessed track is checked by the function SigCheck(BM, node.SIG): if the result of the bitwise operation SIG & (BM ^ SIG) on the node's SIG has any bit equal to 1, the node contains unaccessed tracks and SigCheck returns false; otherwise SigCheck returns true and the node may be pruned. For example, when BM = [1,0,1,0,1] and the unaccessed tracks are those mapped to the 2nd and 4th bits of SIG, SIG & (BM ^ SIG) = [0,1,0,1,0], so the node obviously needs to be accessed.
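The check for pruning rule 3 can be illustrated with a short sketch. The function names are hypothetical, the bitmaps are modelled as lists of bits, and the SIG value used below is an assumed one, chosen only so that the result agrees with the [0,1,0,1,0] outcome stated for BM = [1,0,1,0,1].

```python
def unvisited_bits(bm, sig):
    """Bitwise SIG & (BM ^ SIG): 1 where a track passes the node but is not yet in BM."""
    return [s & (b ^ s) for b, s in zip(bm, sig)]


def sig_check(bm, sig):
    """True if every track of the node has been accessed, i.e. the node may be pruned."""
    return not any(unvisited_bits(bm, sig))


BM = [1, 0, 1, 0, 1]
SIG = [1, 1, 1, 1, 0]  # assumed value, consistent with the stated result

print(unvisited_bits(BM, SIG))  # bits 2 and 4 are set
print(sig_check(BM, SIG))       # the node still has unaccessed tracks
```

Because the test is a pair of bitwise operations, it costs time proportional to the bitmap length and needs no access to the tracks themselves.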

It should be noted that whether the semantic tracks passing through the grid corresponding to the child node have been accessed is judged quickly through the bitmap information in the HQ-tree index.

S32: Find the node with the largest activity influence in the current heap set, remove that node from the heap set, add those of its child nodes that satisfy the set pruning rules to the heap set, and repeat this step until leaf nodes are reached.

That is to say, while the heap set Heap is not empty, the heap-top node is popped; if the node is a leaf node, a batch of tracks to be verified is obtained from it; if the node is a non-leaf node, its child nodes are obtained and those that pass the three pruning rules are put into the Heap.
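The pop-and-expand loop described above can be sketched as a best-first traversal. This is only an illustration: the tree, its scores and the `keep` predicate (a stand-in for the three pruning rules) are hypothetical, and node ids 2, 9 and 11 merely echo the numbering of FIG. 4. Python's `heapq` is a min-heap, so scores are negated to obtain max-heap behaviour.

```python
import heapq
from itertools import count


def traverse(root, score, keep):
    """Always expand the node with the largest score; return leaf ids in pop order."""
    tie = count()  # tie-breaker so that node dictionaries are never compared
    heap = [(-score(root), next(tie), root)]
    leaves = []
    while heap:
        _, _, node = heapq.heappop(heap)
        children = node.get("children", [])
        if not children:
            leaves.append(node["id"])  # a leaf yields tracks to be verified
            continue
        for child in children:
            if keep(child):  # stand-in for the three pruning rules
                heapq.heappush(heap, (-score(child), next(tie), child))
    return leaves


tree = {"id": 0, "f": 1.0, "children": [
    {"id": 2, "f": 0.8, "children": [{"id": 9, "f": 0.3}, {"id": 11, "f": 0.7}]},
    {"id": 3, "f": 0.5},
    {"id": 4, "f": 0.0},
]}


def score(node):
    return node["f"]


def keep(node):
    return node["f"] > 0


print(traverse(tree, score, keep))
```

In this toy tree node 4 is pruned, and leaf 11 is reached before leaf 3 because its parent carries the larger score.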

Further, the method for calculating the activity influence of any node comprises the following steps: finding out keywords belonging to keywords acts set by a user from all keywords corresponding to the node; and summing the maximum value of the activity influence of the found keywords in the node, and dividing the sum by the total number of the keywords in the word set acts to obtain the activity influence of the node.

It should be noted that a keyword may appear at multiple track points in the grid corresponding to the node, and its activity influence (i.e., its number of occurrences) is not necessarily the same at different track points; therefore, the largest activity influence of the keyword over these track points is taken as the maximum activity influence of the keyword in the current node.

Further, the activity influence of any node is calculated as follows:

$$Inf(node, Q) = \frac{1}{|Q.acts|} \sum_{w \in Q.acts} Inf_w(node, Q) \qquad (1)$$

where Inf_w(node, Q) denotes the maximum influence of activity keyword w in the node (0 if w does not occur in the node) and can be obtained from the inverted index of the node. The maximum influence of each activity keyword w in Q.acts in the node is obtained in turn, and Inf(node, Q) is then calculated by formula (1). To take both the spatial proximity of the node to the query Q and the activity influence of the node under Q into account during heuristic traversal, a constant c is used to combine the spatial distance between the node and Q with the activity influence into the function F(node, Q):

where d(Q, node) represents the minimum distance between the MBR of the node and Q.loc (if Q.loc lies within the node, the distance is taken to be 0) and is normalized by d_max; c ∈ [0,1] controls the impact of spatial proximity, with default c = 0.2. While traversing the HQ-tree nodes, nodes with larger values of F are visited first, in a heuristic manner.
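The node score can be sketched in a few lines. `node_influence` follows the verbal description of the node activity influence (sum of the per-keyword maxima divided by the number of query keywords). The combination used in `f_score` is an assumption: the text states only that a constant c in [0,1] blends a distance term normalised by d_max with the activity influence, so the linear form below is one plausible reading and not the patent's formula.

```python
def node_influence(node_max_inf, acts):
    """Sum of the node's per-keyword maximum influence over the query keywords, over |acts|."""
    return sum(node_max_inf.get(w, 0.0) for w in acts) / len(acts)


def f_score(node_max_inf, acts, dist, d_max, c=0.2):
    """ASSUMED form of F(node, Q): c * proximity + (1 - c) * influence."""
    proximity = 1.0 - min(dist, d_max) / d_max  # 1 when Q.loc lies inside the node
    return c * proximity + (1.0 - c) * node_influence(node_max_inf, acts)


# Hypothetical inverted-index content of one node.
node_max_inf = {"w1": 0.5, "w2": 0.25}
acts = ["w1", "w2"]

print(node_influence(node_max_inf, acts))
print(f_score(node_max_inf, acts, dist=0.0, d_max=10.0))
```

With the default c = 0.2 the activity influence dominates the ordering, and proximity mainly separates nodes of similar influence.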

S33: From the index entries associated with the leaf nodes reached by the traversal, obtain the semantic tracks in the leaf nodes that contain the keywords acts set by the user and take them as tracks to be verified; then obtain the activity influence of each track to be verified and take the k tracks with the largest activity influence as candidate tracks.

It should be noted that, for all currently obtained tracks to be verified, the searched tracks are marked in the bitmap BM, all the tracks that have not been verified are screened out by using the HIF index, and then verification in S331 to S334 is performed, that is, the activity influence of each track to be verified is calculated, and the top k tracks are selected. If the activity list of a certain track to be verified does not contain all the activity keywords in Q.acts, the track to be verified is directly skipped over, and the activity influence is not calculated.

Further, the method for calculating the activity influence of each track to be verified is as follows:

S332: The left end point s is initially fixed at the first track point, and the right end point e advances from the first track point along the direction of the track to be verified. Each time the right end point advances, a sub-track T[s, e] is intercepted, and it is judged whether the distance between the currently intercepted sub-track T[s, e] and the query position loc set by the user is greater than d_max. If it is greater, the current position of the right end point e is kept unchanged and the left end point s is advanced until the distance between the newly obtained sub-track T[s, e] and the query position loc set by the user is not greater than d_max, and the activity influence of the sub-track T[s, e] is then calculated according to formula (7); if it is not greater, the activity influence of the sub-track T[s, e] is calculated directly according to formula (7);

wherein Inf(T[s, e], Q) denotes the activity influence of the sub-track T[s, e], |Q.acts| represents the total number of keywords in the word set acts, Inf_p.poi(w) represents the activity influence of a keyword that appears in the sub-track T[s, e] and belongs to the word set acts, Inf_max(w) represents the maximum value of the activity influence of a keyword belonging to the word set acts over all track points, Q.acts is the set of keywords set by the user, w is a keyword belonging to Q.acts, and p is a track point appearing in the sub-track T[s, e];

meanwhile, in step S332, the calculation formula of the distance between the sub-track T[s, e] and the query position loc set by the user is:

S333: keeping the current position of the left end point s unchanged, continuously increasing the right end point e along the direction of the track to be verified, continuously carrying out condition judgment on the intercepted sub-track, obtaining the activity influence of the sub-track meeting the conditions, and repeating the steps until the right end point e reaches the last track point;

It should be noted that, each time steps S32 to S33 are executed, k candidate tracks are obtained. And after the heuristic search process is finished, returning the top-k track with the maximum activity influence under the query Q, and finishing the query processing method.

Further, the early termination condition is: the first sum is smaller than the minimum value of the track activity influence in the k candidate tracks obtained by current calculation, wherein the calculation method of the first sum is as follows:

each keyword in the word set acts set by the user is taken in turn as the current keyword and the following operation is executed: among all track points corresponding to all nodes in the current heap set, find the maximum value of the activity influence of the current keyword, and take this maximum value as the maximum influence corresponding to the current keyword;

That is, in order to end the traversal of nodes as early as possible, before the heap Heap is empty, the present invention attempts to find an upper bound Inf_ub(Heap, Q) on the activity influence of all tracks in the nodes within the Heap. The search process maintains the k-th largest activity influence Inf_k in the result set RS; if Inf_k ≥ Inf_ub(Heap, Q), the search process can be ended early.

Definition 10: Given a query Q = (loc, acts, d_max) and the nodes of the HQ-tree held in the Heap, the activity influence of the Heap under query Q, Inf(Heap, Q), is defined by formula (9), where Inf_w(Heap, Q) represents the maximum influence of activity w in the Heap.

The maximum influence of activity w within a node can be obtained from the inverted index of the node.

Theorem 1: Inf(Heap, Q) is an upper bound on the activity influence of all tracks not yet verified, i.e. for any track T not yet verified: Inf(T, Q) ≤ Inf(Heap, Q).

Proof: Suppose the track T has not been verified. Then T must pass through at least one node in the Heap, and for any w ∈ Q.acts, any track point of T that may match w must lie in some node in the Heap, so Inf_T(w) ≤ Inf_w(Heap, Q). By Definition 7 and Definition 10, Inf(T, Q) ≤ Inf(Heap, Q). This completes the proof of Theorem 1.

To update Inf(Heap, Q) efficiently while traversing the HQ-tree nodes, a max-heap Heap_w is maintained for each w ∈ Q.acts, and the nodes in each Heap_w are sorted by the maximum influence of activity w in the node. The nodes in these heaps are exactly the nodes in the main Heap, only in a different order. So that a node in the Heap can be located immediately from these heaps, each node in these heaps is linked to the same node in the Heap. Whenever a node is popped from the Heap, the nodes linked with it are popped from all Heap_w at the same time; when a child node is added to the Heap, it must also be added to each Heap_w. For any w ∈ Q.acts, the maximum influence of w in the heap-top node of Heap_w is the maximum influence Inf_w(Heap, Q) of activity w in the main heap Heap. At any stage of the traversal, the corresponding Inf_w(Heap, Q) can be obtained from the heap-top node of each Heap_w, and the activity influence Inf(Heap, Q) of the heap Heap can be calculated by formula (9); when the k-th largest activity influence Inf_k found so far satisfies Inf_k ≥ Inf(Heap, Q), the search process can be ended.

Referring to FIG. 7, FIG. 7 is a schematic diagram of the structure of the main heap and the subsidiary heaps. In addition to the heap Heap there are two subsidiary heaps, Heap_w₁ and Heap_w₂. All three heaps contain the same nodes N₁-N₅; the order of the nodes differs from heap to heap, and the nodes in both subsidiary heaps are linked to the same nodes in the main heap. In Heap_w₁, the influence of activity w₁ in the heap-top node N₄ is the maximum influence of activity w₁ in the Heap, i.e. Inf_w₁(Heap, Q); the upper bound Inf(Heap, Q) on the activity influence of the unverified tracks is therefore obtained from Inf_w₁(Heap, Q) and Inf_w₂(Heap, Q).

That is, the main Heap in the middle orders the nodes from large to small by their total activity influence, the left heap Heap_w₁ orders the nodes from large to small by the activity influence of keyword w₁, and the right heap Heap_w₂ orders the nodes from large to small by the activity influence of keyword w₂. Assuming that w₁ and w₂ are the keywords given by the user, the activity influence of keyword w₁ in node N₄ and the activity influence of keyword w₂ in node N₃ are added to obtain the maximum activity influence over all nodes not yet searched; if this maximum activity influence is less than the minimum of the activity influences of the k candidate tracks calculated so far, the iteration ends early.
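The early-termination test can be illustrated with the five nodes of FIG. 7. All influence values below are made up for the example (chosen so that N₄ is the top node for w₁ and N₃ for w₂), and dividing the sum by the number of query keywords is an assumption that mirrors the node-level influence described earlier. For brevity the sketch finds each per-keyword maximum by scanning the nodes; the text instead keeps one heap per keyword so that the maxima are available at the heap tops.

```python
def heap_upper_bound(nodes, acts):
    """Upper bound on the influence of any unverified track, plus the per-keyword maxima."""
    per_keyword = {
        w: max((inf.get(w, 0.0) for inf in nodes.values()), default=0.0)
        for w in acts
    }
    # ASSUMPTION: the sum of the maxima is normalised by |acts|.
    return sum(per_keyword.values()) / len(acts), per_keyword


def can_stop(inf_k, upper_bound):
    """Stop once the k-th best verified influence reaches the bound."""
    return inf_k >= upper_bound


# Hypothetical per-node maximum influence of each keyword.
nodes = {
    "N1": {"w1": 0.25, "w2": 0.25},
    "N2": {"w1": 0.125, "w2": 0.5},
    "N3": {"w1": 0.25, "w2": 0.75},
    "N4": {"w1": 1.0, "w2": 0.125},
    "N5": {"w1": 0.5, "w2": 0.5},
}

bound, per_keyword = heap_upper_bound(nodes, ["w1", "w2"])
print(per_keyword, bound)
print(can_stop(0.9, bound), can_stop(0.5, bound))
```

The bound combines maxima that may come from different nodes, which is why it can overestimate any single track yet never underestimates one.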

In conclusion, the invention designs a novel index structure aiming at semantic track query based on activity influence, the structure integrates geographic position information, activity keyword information and activity influence information, and the indexing technology has stronger pruning capability compared with the existing indexing technology. In addition, the query processing framework and the optimization mechanism designed by the index structure can efficiently process semantic track query based on activity influence, and the prior art cannot achieve the processing efficiency.

The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it will be understood by those skilled in the art that various changes and modifications may be made herein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A semantic track query method based on activity influence is characterized by comprising the following steps:

S2: constructing a mixed grid index HGI based on basic information of a semantic track data set D, wherein the mixed grid index HGI consists of a mixed quadtree index HQ-tree and a mixed inverted index HIF; the construction method of the mixed quadtree index HQ-tree comprises the following steps:

S21: dividing the real geographic area in which the semantic track data set D is located into grids of d levels, wherein each level d comprises 2^(d-1) × 2^(d-1) grids, d is at least 3, and each grid corresponds to a grid number;

S22: judging, for each grid of each level, whether it contains a track point, eliminating the grids that contain no track point, and constructing a quadtree from the remaining grids according to the hierarchical containment relation of the grids, wherein nodes without child nodes are leaf nodes and the remaining nodes are non-leaf nodes;

for any non-leaf node g₀, its associated index entry is a triple (Gid₀, {g' ∈ g.subgrids}, ifile), wherein Gid₀ represents the number of the grid corresponding to the non-leaf node g₀, {g' ∈ g.subgrids} represents a list of pointers to all child nodes of the non-leaf node g₀, and ifile represents a first inverted index of the keywords corresponding to all track points in the grid corresponding to the non-leaf node g₀, the first inverted index comprising the following information: the activity influence of each keyword in the grid corresponding to the non-leaf node g₀, and the grid numbers of the child nodes of the non-leaf node g₀ in which each keyword appears;

for any leaf node g₁, its associated index entry is a tuple (Gid₁, ilist), wherein Gid₁ represents the number of the grid corresponding to the leaf node g₁, and ilist represents a second inverted index of the keywords corresponding to all track points in the grid corresponding to the leaf node g₁, the second inverted index comprising the following information: the activity influence of each keyword in the grid corresponding to the leaf node g₁, and the numbers of the semantic tracks in which each keyword appears;

for any node of the quadtree, the bitmap information is a data sequence consisting of 0 and 1, and each data bit in the data sequence corresponds to a semantic track, wherein 0 represents that the semantic track corresponding to the data bit does not pass through the grid corresponding to the current node, and 1 represents that the semantic track corresponding to the data bit passes through the grid corresponding to the current node;

the mixed inverted index HIF consists of an activity list, an activity inverted arrangement list and a sub-track length list corresponding to each semantic track in the semantic track data set D, wherein the activity list is used for storing the keywords corresponding to all track points on the semantic track, the activity inverted arrangement list is used for storing the association relations between the keywords and the track points where the keywords appear, and the sub-track length list is used for storing the length from each track point on the semantic track to the initial track point;

S3: for a user-specified query requirement Q = (loc, acts, d_max), finding out the k semantic tracks that best match the query requirement Q in the semantic track data set D based on the hybrid grid index HGI, wherein loc is the query position set by the user, acts is the keywords set by the user, d_max is the maximum value of the expected journey set by the user, and k is at least 3.

2. The semantic track query method based on activity influence according to claim 1, wherein the activity influence of the keyword in the grid corresponding to any node is calculated by:

and acquiring the activity influence of the keyword on all track points in the grid corresponding to the current node, and taking the maximum value as the activity influence of the keyword in the grid corresponding to the current node.

3. The activity influence-based semantic track query method according to claim 1, wherein the step S3 of finding k semantic tracks that best match the query requirement Q in the semantic track data set D based on the hybrid grid index HGI specifically comprises:

S31: setting a heap set, traversing the mixed quadtree index HQ-tree from top to bottom, firstly putting the root node into the heap set and acquiring the child nodes of the root node, then removing the root node from the heap set, and adding the child nodes of the root node that satisfy a set pruning rule into the heap set;

4. The activity influence-based semantic track query method according to claim 3, wherein the pruning rule set in step S31 is: the distance between at least one track point in the grid corresponding to the child node and the query position loc set by the user is not more than d_max; the activity influence of the child node is not 0; at least one of all semantic tracks passing through the grid corresponding to the child node has not been accessed.

5. The semantic track query method based on activity influence according to claim 3, wherein the activity influence of any node in step S32 is calculated by a method comprising:

6. The semantic track query method based on activity influence according to claim 3, wherein the activity influence of each track to be verified in step S33 is calculated as follows:

S332: the left end point s is initially fixed at the first track point, the right end point e advances from the first track point along the direction of the track to be verified, and each time the right end point advances a sub-track T[s, e] is intercepted; it is judged whether the distance between the currently intercepted sub-track T[s, e] and the query position loc set by the user is greater than d_max; if it is greater, the current position of the right end point e is kept unchanged and the left end point s is advanced until the distance between the newly obtained sub-track T[s, e] and the query position loc set by the user is not greater than d_max, and the activity influence of the sub-track T[s, e] is then calculated according to the following formula; if it is not greater, the activity influence of the sub-track T[s, e] is calculated directly according to the following formula;

wherein Inf(T[s, e], Q) denotes the activity influence of the sub-track T[s, e], |Q.acts| represents the total number of keywords in the word set acts, Inf_p.poi(w) represents the activity influence of a keyword that appears in the sub-track T[s, e] and belongs to the word set acts, Inf_max(w) represents the maximum value of the activity influence of a keyword belonging to the word set acts over all track points, Q.acts is the set of keywords set by the user, w is a keyword belonging to Q.acts, and p is a track point appearing in the sub-track T[s, e];

S334: taking the maximum value of the activity influence obtained over all the sub-tracks as the activity influence of the current track to be verified.

7. The activity influence-based semantic track query method according to claim 6, wherein the calculation formula of the distance between the sub-track T[s, e] and the query position loc set by the user in step S332 is:

8. The activity impact-based semantic track query method according to claim 3, wherein the early termination condition in step S34 is: the first sum is smaller than the minimum value of the track activity influence in the k candidate tracks obtained by current calculation, wherein the calculation method of the first sum is as follows: