CN107220285B

CN107220285B - Space-time index construction method for massive trajectory point data

Info

Publication number: CN107220285B
Application number: CN201710270989.1A
Authority: CN
Inventors: 陈昭; 王磊; 刁博宇; 徐勇军
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2017-04-24
Filing date: 2017-04-24
Publication date: 2020-01-21
Anticipated expiration: 2037-04-24
Also published as: CN107220285A

Abstract

The invention relates to a parallel spatio-temporal index construction method for massive track point data. The method takes track point data files as index units, which reduces the storage consumption of the index and makes the index structure highly scalable. In addition, a Hilbert curve is used to partition the data files; compared with other multi-dimensional to one-dimensional mappings, the Hilbert curve partitions the data more evenly owing to its space-filling properties and reduces the probability of data skew.

Description

Space-time index construction method for massive trajectory point data

Technical Field

The invention relates to the field of information retrieval, in particular to a space-time index construction method for massive trajectory point data.

Background

With the development of science and technology, the world has entered the era of big data. As data volumes grow rapidly, big data is expected to be globally expressive, and spatio-temporal big data has become one of the important kinds of big data because it captures the associations among time, space, and objects. However, the relatively complex relationships within spatio-temporal big data and their dynamic evolution also make retrieval and querying difficult. Track point data is a kind of spatio-temporal big data; it refers specifically to the data obtained by sampling the motion of a moving object in a spatio-temporal environment. In recent years, with the rapid development of satellites, wireless networks, and positioning devices, the volume of track point data from large numbers of moving objects has grown rapidly, and index construction and query optimization for track point data have become active research topics.

Hadoop is a currently popular distributed computing framework that suits a wide range of large-scale data processing scenarios and has a broad application base. Several spatio-temporal index methods have been proposed on top of this framework and its ecosystem, such as an HBase-based Q-tree spatio-temporal index and an HBase-based hybrid grid and R-tree spatio-temporal index. Most existing spatio-temporal index construction methods use individual data records as index units, which leads to high storage consumption and low index construction efficiency and cannot keep up with the rapid growth of different types of spatio-temporal big data.

Disclosure of Invention

The invention aims to provide a spatio-temporal index construction method for massive track point data that overcomes the above defects of the prior art and can construct spatio-temporal indexes of track point data in parallel, with higher efficiency, in a distributed environment; because the data file is used as the index unit, the index structure can be extended flexibly.

The technical scheme adopted by the invention is as follows: a space-time index construction method for massive trajectory point data comprises the following steps:

step 1), storing track point data in a track point data file;

and step 2), constructing an index tree by taking the track point data file in the step 1) as an index unit.

Preferably, the track point data in step 1) includes at least time information and two-dimensional position information.

Preferably, the step 2) further comprises:

step 21), dividing the track point data file into at least one computing unit;

step 22), the computing unit constructs a space-time index based on the space index structure.

Preferably, when the computing unit is a plurality of parallel computing units, the track point data file is divided into ordered partitions in step 21).

Preferably, the ordered division of step 21) is implemented by using a space-filling curve.

Preferably, the space-filling curve is a Hilbert curve.

Preferably, the step 21) further comprises:

step 211) calculating a two-dimensional Hilbert value of two-dimensional space information for representing the track point data file;

step 212) calculating a three-dimensional Hilbert value used for representing the three-dimensional space information of the track point data file according to the two-dimensional Hilbert value calculated in the step 211);

step 213) dividing the track point data file according to the three-dimensional Hilbert value calculated in the step 212).

Preferably, the spatial index structure in step 22) is an R-tree structure.

Preferably, the construction of the multi-level spatiotemporal index tree can be realized based on a MapReduce or Spark programming framework.

According to another aspect of the present invention, a method for querying trajectory point data based on the index tree constructed by the above method is provided, including:

step a), traversing the root nodes of the index tree to obtain a root node list;

step b), inquiring the root node list obtained in the step a) to obtain a child node list;

and c) traversing the child node list obtained in the step b) in parallel to obtain a track point data file list.

Preferably, the query method can be implemented based on a MapReduce or Spark programming framework.

Beneficial effects: in the spatio-temporal index construction method for massive track point data provided by the invention, the data file containing motion information is used as the index unit, which reduces the storage consumption of the index; the way the data files are stored can be adjusted as required, so the index structure is highly scalable. In addition, a Hilbert curve is used to partition the data files; compared with other multi-dimensional to one-dimensional mappings, such as mapping longitude and latitude to grid numbers, the Hilbert curve partitions the data more evenly owing to its space-filling properties and reduces the probability of data skew.

Drawings

FIG. 1 is a schematic view of the spatio-temporal cube of a track point data file according to the present invention.

FIG. 2 is a schematic diagram of the index tree structure for massive track point data according to the present invention.

FIG. 3 shows the parallel construction process of the R-tree-based multi-level spatio-temporal index tree, implemented with the MapReduce programming framework.

FIG. 4 shows the parallel query process over the spatio-temporal index tree constructed as in FIG. 3, implemented with the MapReduce programming framework.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly apparent, the spatio-temporal index construction method for massive trajectory point data provided in the embodiments of the present invention is further described in detail below with reference to the accompanying drawings.

Track point data is a series of data records obtained by sampling the real-time position of a moving object. Each record contains a sampling timestamp, two-dimensional position information, other motion information such as speed and direction, and other related information such as sampling time. Track point data is structured data: the number of columns is fixed, and so are the meaning, type, and value range of each column. Moreover, once sampled track point data has been generated it is no longer modified or deleted, and it accumulates periodically in batches as it moves from online collection to offline analysis.

The inventors found that, given these characteristics, track point data can be stored and organized into a plurality of data files, and a spatio-temporal index can be built over the track point data by taking each data file that stores track point records as a unit.

When recording track point data, each track point data record can be abstracted into an n-dimensional vector, and the ith record can be described as:

$(t_i, x_i, y_i, o_{i1}, \ldots, o_{i(n-3)})$

where $t_i$ is the timestamp at the sampling time, $x_i$ and $y_i$ are the two-dimensional position information (typically latitude and longitude) of the moving object at time $t_i$, and $o_{i1}, \ldots, o_{i(n-3)}$ are the remaining $n-3$ dimensions of information.

FIG. 1 shows a schematic diagram of the spatio-temporal cube of a track point data file in the method provided by the present invention. As shown in fig. 1, a plurality of track point data records are stored as one track point data file, and the track point data contained in each file spans a three-dimensional spatio-temporal value range. Three vertices that characterize the extent of this cube are:

$(t_{\min}, x_{\min}, y_{\min})$, $(t_{\min}, x_{\max}, y_{\max})$, $(t_{\max}, x_{\max}, y_{\max})$

Track point data can be stored in various ways as required, for example in ascending or descending order of time, by spatial region grid, or in ascending or descending order of time within a single spatial region grid. The following describes the construction of the spatio-temporal index, taking track point data files stored in ascending order of time as an example.

According to one embodiment of the invention, a space-time index construction method for massive trajectory point data is provided. The method comprises two steps of dividing a track point data file and constructing an index tree, and comprises the following specific contents:

S10, dividing the track point data files among at least one computing unit;

Depending on the number of track point data files, when the index tree is constructed the files can either be assigned to a single computing unit or be distributed to a plurality of computing units that run in parallel, which improves processing speed and efficiency. Taking a division into w computing units as an example, all track point data files are distributed to the w computing units; for each track point data file, the computing unit traverses all of its records and computes the temporal and spatial value range of the records contained in the file:

$(t_{\min}, t_{\max})$, $(x_{\max}, x_{\min})$, $(y_{\max}, y_{\min})$

Taking the center point of the spatio-temporal cube of the track point data file as its identifier, each track point data file can be characterized by a three-dimensional coordinate:

$((t_{\min}+t_{\max})/2, (x_{\min}+x_{\max})/2, (y_{\min}+y_{\max})/2)$
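The range statistics and the center-point identifier described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Record`, `Range3D`, `value_range`, and `center` are hypothetical, and the fields are ordered (min, max) purely for readability.

```python
from typing import Iterable, NamedTuple, Tuple


class Record(NamedTuple):
    """One track point record; extra columns are omitted (assumption)."""
    t: float  # sampling timestamp
    x: float  # first position coordinate, e.g. longitude
    y: float  # second position coordinate, e.g. latitude


class Range3D(NamedTuple):
    """Space-time value range of one track point data file."""
    t_min: float
    t_max: float
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def value_range(records: Iterable[Record]) -> Range3D:
    """Scan every record once and return the file's space-time range."""
    it = iter(records)
    first = next(it, None)
    if first is None:
        raise ValueError("a track point data file must contain records")
    t_min = t_max = first.t
    x_min = x_max = first.x
    y_min = y_max = first.y
    for r in it:
        t_min, t_max = min(t_min, r.t), max(t_max, r.t)
        x_min, x_max = min(x_min, r.x), max(x_max, r.x)
        y_min, y_max = min(y_min, r.y), max(y_max, r.y)
    return Range3D(t_min, t_max, x_min, x_max, y_min, y_max)


def center(rng: Range3D) -> Tuple[float, float, float]:
    """Center of the space-time cube, used to identify the file."""
    return ((rng.t_min + rng.t_max) / 2,
            (rng.x_min + rng.x_max) / 2,
            (rng.y_min + rng.y_max) / 2)
```

Because each file is reduced to six numbers plus its path, the later stages never need to touch the individual records again.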

According to an embodiment of the invention, when multiple computing units execute in parallel, a space-filling curve can be used to measure how close the center points are in three-dimensional space, so that track point data files that are close in time and in geographic space are assigned to the same computing unit. This reduces the parallel traversal overhead at query time and improves index performance.

A Hilbert curve is a space-filling curve that maps points in a two-dimensional space to one-dimensional values, called Hilbert values; points whose Hilbert values are close tend to lie close together in the two-dimensional space. The following description takes the Hilbert curve as an example, and the specific steps are as follows:

S101, taking the center point of the spatio-temporal cube of each track point data file as the vector that identifies the file. Let the identification vector of the i-th track point data file be $(t'_i, x'_i, y'_i)$. First, the two-dimensional Hilbert value of $(x'_i, y'_i)$ is calculated; it represents the position of the track point data file in two-dimensional space. Then, from $t'_i$ and this two-dimensional Hilbert value, the three-dimensional Hilbert value is calculated; it represents the position of the track point data file in three-dimensional space and is used as the Hilbert value of the file, written $h^{ts}_i$ in line with the notation $h^{ts}$ of step 201.

S102, sampling all track point data files at a sampling rate p. Assume that the number of samples is m and the preset number of parallel computing units is w. The Hilbert values of the m sampled data files are sorted in ascending order, the m samples are divided approximately evenly into w sets, and the maximum Hilbert value of each set (except the last set) is taken as a division point, giving w-1 division points in total. The sampling rate lies in the range $p \in (0, 1]$: the larger p is, the closer the sample is to the distribution of the real data files and the better the division; the smaller p is, the faster the method runs.

S103, traversing all track point data files and determining the interval, delimited by the division points, in which the Hilbert value of the i-th data file lies:

if the value lies below the first division point, the i-th file is assigned to the 1st computing unit;

if the value lies between the (j-1)-th and the j-th division points, the i-th file is assigned to the j-th computing unit;

if the value lies above the last, (w-1)-th division point, the i-th file is assigned to the w-th computing unit.
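Steps S102 and S103 amount to sample-based range partitioning, which can be sketched as below. The function names, the `seed` parameter, and the treatment of a value that equals a division point (it goes to the lower-numbered unit) are assumptions made for this illustration; the source does not preserve the exact boundary conditions.

```python
import bisect
import random
from typing import List, Sequence


def division_points(hilbert_values: Sequence[int], w: int, p: float,
                    seed: int = 0) -> List[int]:
    """Sample at rate p, sort, and return the w-1 division points (S102)."""
    if not 0 < p <= 1:
        raise ValueError("sampling rate p must lie in (0, 1]")
    rng = random.Random(seed)
    sample = sorted(h for h in hilbert_values if rng.random() < p)
    m = len(sample)
    if m < w:
        raise ValueError("too few samples for w computing units")
    # Set k (1-based) covers sample[(k-1)*m//w : k*m//w]; its maximum is
    # the k-th division point. The last set contributes no point.
    return [sample[(k * m) // w - 1] for k in range(1, w)]


def assign_unit(h: int, points: Sequence[int]) -> int:
    """Return the 1-based computing unit for Hilbert value h (S103).

    Assumption: a value equal to a division point stays in the lower unit.
    """
    return bisect.bisect_left(points, h) + 1
```

With `p = 1.0` every file is sampled, so the division points reflect the true distribution; smaller values of `p` trade partition balance for speed, as the text notes.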

The Hilbert curve has good space-filling properties, so using it to divide the track point data files reduces the probability of data skew. Although the above embodiment divides the track point data files with a Hilbert curve, those skilled in the art will understand that in other embodiments the files may be divided in an ordered manner by other means, for example in chronological order or with other types of space-filling curves.
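For readers unfamiliar with Hilbert values, the sketch below uses the widely known iterative cell-to-index conversion for a $2^{order} \times 2^{order}$ grid and then composes it in two stages, first over position and then over time, in the manner the description outlines. The grid orders, the scaling parameters `a` and `b`, and the function names are illustrative assumptions; inputs are assumed to be already shifted into non-negative grid coordinates.

```python
def hilbert_2d(order: int, x: int, y: int) -> int:
    """Map cell (x, y) of a 2^order x 2^order grid to its Hilbert index."""
    n = 1 << order
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("cell outside the 2^order x 2^order grid")
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_ts(t: float, x: float, y: float, a: float, b: float,
               order_xy: int, order_ts: int) -> int:
    """Two-stage value: Hilbert index of (time, Hilbert index of position).

    a and b rescale time and the spatial Hilbert value into the second
    grid; choosing them is left to the caller (assumption).
    """
    h_xy = hilbert_2d(order_xy, round(x), round(y))
    return hilbert_2d(order_ts, round(a * t), round(b * h_xy))
```

Note that Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rounding function intended in the source.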

S20, constructing a multi-level spatio-temporal index tree based on the R*-tree;

The R*-tree is a variant of the spatial index structure R-tree. The R-tree is a height-balanced tree that extends the B-tree to multidimensional space. Compared with the R-tree, the R*-tree changes little structurally; the main difference is that the R*-tree takes overlap into account when inserting into the index tree and selectively reinserts entries that were inserted earlier, thereby optimizing the index tree.

After step S10 is completed, each computing unit takes the track point data files assigned to it as index units and, in parallel, constructs its own multi-level spatio-temporal index tree based on the R*-tree.

Fig. 2 is a schematic diagram of the index tree structure provided by the present invention. As shown in fig. 2, the index tree is constructed in the same way as a basic R-tree. The index tree uses, as an index unit, the storage path of each track point data file together with the range of that file, i.e. its spatio-temporal value range $(t_{\min}, t_{\max})$, $(x_{\max}, x_{\min})$, $(y_{\max}, y_{\min})$. The specific steps are as follows:

S201, constructing leaf nodes: each leaf node contains at least one index unit and the minimum spatio-temporal rectangle that encloses all of its index units;

The minimum spatio-temporal rectangle of a node is the spatio-temporal value range covering all track point data files contained in that node; the same applies below.

S202, constructing non-leaf nodes: each non-leaf node contains a pointer array of its child nodes and the minimum spatio-temporal rectangle that encloses all of its child nodes;

S203, constructing the root node of the index subtree: the index subtree root node on each computing unit contains a pointer array of the child nodes of the subtree root and the minimum spatio-temporal rectangle that encloses all child nodes of the root node; if the index subtree root node is a leaf node, it contains the spatio-temporal value ranges of all track point data files on that computing unit;

S204, constructing the root node of the index tree: each index tree root node contains the record paths of the track point data files on all computing units and the minimum spatio-temporal rectangle that encloses all child nodes of the root node.

S205, constructing the total index tree file: because the total volume of track point data or the accumulation period varies, index tree construction generally has to be executed in batches. The index tree root nodes generated by each construction run can be stored in ascending order of the time range they cover; this forms the highest-level total index tree file.
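The node layout of steps S201 to S203 can be illustrated with the sketch below. It packs index units bottom-up in fixed-size groups, which is a simplification chosen for brevity and is not the R-tree or R*-tree insertion procedure; the `fanout` value and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

# (t_min, t_max, x_min, x_max, y_min, y_max); ordering is an assumption.
Box = Tuple[float, float, float, float, float, float]


def enclose(boxes: Sequence[Box]) -> Box:
    """Minimum spatio-temporal rectangle that frames all given boxes."""
    return (min(b[0] for b in boxes), max(b[1] for b in boxes),
            min(b[2] for b in boxes), max(b[3] for b in boxes),
            min(b[4] for b in boxes), max(b[5] for b in boxes))


@dataclass
class IndexUnit:
    path: str  # storage path of one track point data file
    box: Box   # space-time value range of that file


@dataclass
class Node:
    entries: list  # IndexUnit objects in a leaf, Node objects otherwise
    is_leaf: bool
    box: Box = field(init=False)

    def __post_init__(self) -> None:
        self.box = enclose([e.box for e in self.entries])


def build_subtree(units: List[IndexUnit], fanout: int = 4) -> Node:
    """Pack units into leaves, then pack nodes upward until one root remains."""
    if not units:
        raise ValueError("at least one index unit is required")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [Node(units[i:i + fanout], True)
             for i in range(0, len(units), fanout)]
    while len(level) > 1:
        level = [Node(level[i:i + fanout], False)
                 for i in range(0, len(level), fanout)]
    return level[0]
```

If the units are passed in Hilbert order, neighbouring files end up in the same leaf, which keeps the enclosing rectangles small.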

Compared with the traditional method of taking each data record as an index unit, the method of taking the track point data file as the index unit provided by the invention greatly reduces the construction complexity of the index tree, improves the running speed, saves the system overhead, can obviously improve the track point data management efficiency and the query efficiency, and can meet the requirement of continuously accumulating track point data.

Track point data is stored inside the track point data files, so the size of a track point data file affects the depth of the index tree that is built: the larger the files, the shallower the index tree and the faster it runs; conversely, the smaller the files, the deeper the index tree and the higher the query precision.
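The relationship between file size and tree depth can be made concrete with a small calculation. The figures below (a 256 GiB data set, a node fanout of 16, binary size units) are hypothetical values chosen only to illustrate the trend; they do not come from the source.

```python
def min_tree_height(num_units: int, fanout: int) -> int:
    """Smallest height of a tree with the given fanout that holds num_units leaves entries."""
    if num_units < 1 or fanout < 2:
        raise ValueError("need num_units >= 1 and fanout >= 2")
    height, capacity = 1, fanout
    while capacity < num_units:
        capacity *= fanout
        height += 1
    return height


TOTAL_BYTES = 256 * 2**30                      # hypothetical data set size
files_128mib = TOTAL_BYTES // (128 * 2**20)    # number of 128 MiB files
files_1gib = TOTAL_BYTES // 2**30              # number of 1 GiB files

height_small_files = min_tree_height(files_128mib, fanout=16)
height_large_files = min_tree_height(files_1gib, fanout=16)
```

Under these assumptions the 128 MiB layout yields 2048 index units and a tree of height 3, while the 1 GiB layout yields 256 index units and a tree of height 2; the coarser files give a shallower tree but each matching file covers more data that must then be scanned.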

In another embodiment of the present invention, the construction of the index tree may be implemented based on the MapReduce programming model. Taking parallel computing units as an example, fig. 3 shows a flowchart of index tree construction implemented with the MapReduce programming framework. As shown in fig. 3, the specific flow is as follows:

Step 101: the track point data files are distributed evenly among the parallel computing units. Each parallel computing unit takes the storage path $p_i$ of one track point data file at a time as Map-side input, traverses all records in that file, and computes the spatio-temporal value range of the track point data it contains: $(t_{i\min}, t_{i\max})$, $(x_{i\max}, x_{i\min})$, $(y_{i\max}, y_{i\min})$;

Step 102: after the statistics are complete, the parallel computing unit combines the storage path of the track point data file and its spatio-temporal value range from step 101 into a tuple $(p_i, (t_{i\min}, t_{i\max}, x_{i\max}, x_{i\min}, y_{i\max}, y_{i\min}))$, which is emitted as Map-side output;

Step 103: the parallel computing unit takes the tuples of all track point data files from step 102, i.e. the Map-side output, as Reduce-side input and stores them as the index record file of the track point data files, i.e. the index units used when the index tree is constructed.

Step 201: the parallel computing unit traverses all track point data files, taking the tuples from step 102 as Map-side input. For each track point data file i, the center point of its spatio-temporal cube, $((t_{i\min}+t_{i\max})/2, (x_{i\min}+x_{i\max})/2, (y_{i\min}+y_{i\max})/2)$, serves as the identifier of the file and is denoted $(t^f, x^f, y^f)$. The Hilbert value $h^{ts}$ of the data file is calculated from this center point: let $H(x, y)$ be the original Hilbert function and $\mathrm{round}(x)$ a rounding function; the Hilbert value function $H^{ts}$ of the data file is then defined as:

$h^{ts} = H^{ts}(T, X, Y) = H(\mathrm{round}(aT), \mathrm{round}(b \cdot H(\mathrm{round}(X), \mathrm{round}(Y))))$

where a and b are tuning parameters calculated as required to optimize the computation of the Hilbert value. During the calculation, a random number r in [0, 1] is generated at the same time; if r is less than or equal to 0.1, $(h^{ts}, 1)$ is emitted as Map-side output, otherwise nothing is emitted, which realizes the sampling. The resulting $h^{ts}$ values are sorted in ascending order, and specific $h^{ts}$ values are taken as division points as required (same as step S102);

Step 202: the parallel computing unit traverses all track point data files, taking each tuple from step 102 as Map-side input, calculates the Hilbert value $h^{ts}$ of each track point data file, divides the files as in step S103, and emits the computing unit to which each file is assigned in step S103 as Map-side output;

Step 203: the parallel computing unit stores the Map-side output of step 202, i.e. the Reduce-side input, as index record files of the track point data files, one per parallel computing unit number; each row is a record of the form $(p_i, t_{i\min}, t_{i\max}, x_{i\max}, x_{i\min}, y_{i\max}, y_{i\min})$;

Step 301: each parallel computing unit respectively constructs an index subtree (the same as the steps S201-S203);

Step 401: each index subtree is added to the index subtree file, which stores the index subtree root nodes, and an index tree root node record is written. The record comprises the overall time range of this batch of track point data, the path of the file that stores the index subtree root nodes, and the offsets of the index subtree root node records within that file (same as step S204);

Step 501: the index tree root record of this batch is appended to the index tree file that stores the existing index tree root nodes, and all index tree root nodes are then sorted in ascending order of the maximum value of their time range (same as step S205).
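The batch bookkeeping of steps 401 and 501 can be sketched as below. The record layout (`t_min`, `t_max`, `subtree_file`, `offsets`) is a hypothetical rendering of the three pieces of information the text lists, and an in-memory list stands in for the index tree file.

```python
from typing import List, NamedTuple, Tuple


class RootRecord(NamedTuple):
    """One index tree root record for a construction batch (assumed layout)."""
    t_min: float               # start of the batch's overall time range
    t_max: float               # end of the batch's overall time range
    subtree_file: str          # file holding the index subtree root nodes
    offsets: Tuple[int, ...]   # offsets of the subtree root records in that file


def append_batch(index_file: List[RootRecord],
                 batch: RootRecord) -> List[RootRecord]:
    """Add the new batch and keep all roots ordered by their maximum time."""
    return sorted(index_file + [batch], key=lambda r: r.t_max)
```

Keeping the roots ordered by time is what later allows a query to narrow down the relevant batches quickly instead of scanning every root.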

In the inventors' study, a track point data set of 1 TB was taken as an example and stored as a plurality of track point data files, each the size of the default HDFS block, i.e. 128 MB. Constructing the spatio-temporal index tree over these track point data files with the method provided by the invention produces about 9000 index records in total. The method thus greatly reduces the volume of data involved in spatio-temporal index construction and also speeds up construction.

In another embodiment of the invention, a method for performing parallel query on a large amount of trace point data based on the index tree constructed above is also provided. For example, the spatio-temporal value ranges of a large amount of trajectory point data to be queried are:

$\{\, t \in [t_{\min}, t_{\max}] \cap x \in [x_{\min}, x_{\max}] \cap y \in [y_{\min}, y_{\max}] \,\}$

wherein t is a time value condition, and x and y are two-dimensional geographic space value conditions. The specific query steps are as follows:

S30, traversing the root nodes of all index trees, comparing the time range of each with $[t_{\min}, t_{\max}]$, and adding the index tree root nodes whose time range intersects it to the list of root nodes to be queried;

S40, performing a binary search, based on $t_{\min}$, over the root node list obtained in step S30, and adding the index subtree root nodes of those index tree root nodes whose spatio-temporal value range intersects the query to the list of child nodes to be traversed in parallel;

S50, traversing in parallel each node in the child node list obtained in step S40: if the node is a non-leaf node, all of its child nodes are traversed and those whose spatio-temporal value range intersects the query are added to the list of child nodes to be traversed in parallel; if the node is a leaf node, all of its records are traversed and the paths of the data files whose spatio-temporal value range intersects that of the query condition are added to the list of files to be queried;

S60, when the list of child nodes to be traversed becomes empty, the list of files to be queried is the set of track point data files containing all the track points sought, and the required track point data is then queried within these files.
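The traversal in steps S30 to S60 can be sketched sequentially as follows. This is an illustration only: it runs on one thread rather than in parallel, it filters the roots with a linear scan instead of the binary search of S40, and the dictionary-based node layout and the names `intersects` and `query_files` are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple

# (t_min, t_max, x_min, x_max, y_min, y_max); ordering is an assumption.
Box = Tuple[float, float, float, float, float, float]


def intersects(a: Box, b: Box) -> bool:
    """True when two closed space-time boxes overlap in t, x and y."""
    return all(a[2 * k] <= b[2 * k + 1] and b[2 * k] <= a[2 * k + 1]
               for k in range(3))


def query_files(roots: List[Dict], q: Box) -> List[str]:
    """Return the paths of data files whose range intersects query box q.

    A node is {"box": Box, "leaf": bool, "entries": [...]}; a leaf's
    entries are (path, Box) pairs, a non-leaf's entries are child nodes.
    """
    files: List[str] = []
    pending = deque(r for r in roots if intersects(r["box"], q))
    while pending:  # finished when the to-traverse list is empty (S60)
        node = pending.popleft()
        for entry in node["entries"]:
            if node["leaf"]:
                path, box = entry
                if intersects(box, q):
                    files.append(path)
            elif intersects(entry["box"], q):
                pending.append(entry)
    return files
```

The result is a list of candidate files, not of track points; the records inside those files still have to be filtered against the query range.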

In another embodiment of the present invention, the query method may be implemented based on a MapReduce programming model, and fig. 4 shows a flowchart of the query method implemented based on a MapReduce programming framework, as shown in fig. 4, the specific flow is as follows:

Step 101': the index tree file is read. For the record in line i, if its time range and the query time range $[t_{\min}, t_{\max}]$ intersect, the record is put into the queue $Q_0$ of root nodes to be traversed (same as step S30).

Step 102': the index subtree file is read to obtain the index subtree root nodes of all root nodes in the queue $Q_0$ from step 101'. Each index subtree root node is tested for overlap against the query condition, and those that satisfy the overlap condition are put into the queue $Q_{node}$ of nodes to be traversed.

The overlap test checks whether the spatio-temporal value range of an index subtree root node E has a part in common with the spatio-temporal rectangle formed by the query conditions, i.e. whether the two ranges overlap in the time dimension, and in the x dimension, and in the y dimension.
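One standard way to write such an overlap condition, assuming closed intervals, is given below. The symbols $E.t_{\min}$, $E.t_{\max}$ and so on for the range of node E are notation introduced here for illustration; the query bounds follow the query range defined earlier.

```latex
\begin{aligned}
& E.t_{\min} \le t_{\max} \;\wedge\; E.t_{\max} \ge t_{\min} \\
\wedge\;& E.x_{\min} \le x_{\max} \;\wedge\; E.x_{\max} \ge x_{\min} \\
\wedge\;& E.y_{\min} \le y_{\max} \;\wedge\; E.y_{\max} \ge y_{\min}
\end{aligned}
```

Each line states that two one-dimensional intervals share at least one point; the boxes overlap only if this holds in all three dimensions.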

Step 201': the index subtree root nodes in $Q_{node}$ are taken as Map input. For each node E: if E is a non-leaf node containing m child nodes, each of its m child nodes is emitted as Map-side output with Key 0; if E is a leaf node, each index unit it contains is emitted as Map-side output with Key 1.

Step 202': for a record whose Reduce-side input Key is 0, i.e. a child node record of a non-leaf node, if its spatio-temporal value range overlaps the spatio-temporal rectangle formed by the query conditions (as defined in step 102'), the record is emitted as Reduce output; for a record whose Reduce-side input Key is 1, i.e. an index unit of a leaf node, an SQL query is executed over the corresponding set of file paths, and the query results are stored in a temporary query table.

Step 301': steps 201' and 202' form one parallel MapReduce round. After a round finishes, the Reduce output is used as the Map input and step 201' is entered again, until the Reduce output is empty after step 202' completes.

Step 401': and returning the content in the query temporary table to obtain a final track point data file query result.

According to the inventors' study, the spatio-temporal index construction and query method provided by the invention involves only projection and aggregation operations when traversing index tree nodes and does not involve complex operations such as sorting the track point data within a single file. This reduces system overhead and speeds up construction and querying, and taking the track point data file as the unit makes the method more flexible and scalable.

In another embodiment of the present invention, the construction and query of the index tree can be implemented based on a Spark programming framework. The RDD abstract data structure programming model provided by Spark is mainly realized based on memory operation, and can optimize iterative workload besides providing interactive query.

Although in the foregoing embodiment, an R-tree-based structure is used to construct an index tree for a mass of track point data files, those skilled in the art will understand that in other embodiments, a variety of spatial index structures may be used to implement the method for constructing an index by using a track point data file as an index unit.

Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims

1. A space-time index construction method for massive trajectory point data comprises the following steps:

step 1), storing track point data into a plurality of track point data files, wherein the track point data at least comprises time information and two-dimensional position information;

step 2), acquiring a space-time value range of the trace point data contained in each trace point data file;

step 3), constructing an index tree by taking the track point data file as an index unit;

wherein the index tree is constructed by the following steps:

1) constructing leaf nodes: each leaf node comprises at least one index unit and a minimum space-time rectangle which can frame all the index units; the minimum space-time rectangle refers to the space-time value range of all track point data files contained in the minimum space-time rectangle;

2) construction of non-leaf nodes: each non-leaf node comprises a pointer array of its child nodes and a minimum spatio-temporal rectangle that can frame all its child nodes;

3) constructing a root node of the index subtree: the index subtree root node on each computing unit comprises a pointer array of the subtree root node and a minimum space-time rectangle which can frame all the subnodes of the root node, and if the index subtree root node is a leaf node, the index subtree root node comprises the space-time value range of all track point data files on the computing unit;

4) constructing a root node of the index tree: each index tree root node contains the recording paths of the trace point data files on all the computing units and the minimum space-time rectangle which can frame all the child nodes of the root node.

2. The method for constructing the spatio-temporal index for the massive amounts of trajectory point data as claimed in claim 1, wherein said step 3) further comprises:

step 31), dividing the track point data file into at least one computing unit;

step 32), the computing unit constructs a space-time index based on the space index structure.

3. The method for constructing the spatio-temporal index for the mass trace point data according to claim 2, wherein when the computing unit is a plurality of parallel computing units, the track point data file is divided into the ordered partitions in the step 31).

4. The method for constructing the spatio-temporal index for the massive trace point data as claimed in claim 3, wherein the ordered division of the step 31) is realized by using a space filling curve.

5. The method for constructing the spatio-temporal index for the massive amounts of trajectory point data as claimed in claim 4, wherein the space filling curve is a Hilbert curve.

6. The method for constructing the spatio-temporal index for the massive amounts of trajectory point data as claimed in claim 5, wherein the step 31) further comprises:

step 311) calculating a two-dimensional Hilbert value of two-dimensional space information for representing the track point data file;

step 312) calculating a three-dimensional Hilbert value used for representing the three-dimensional space information of the track point data file according to the two-dimensional Hilbert value calculated in the step 311);

step 313) dividing the track point data file according to the three-dimensional Hilbert value calculated in the step 312).

7. The method for constructing spatio-temporal index facing mass trajectory point data according to any one of claims 3 to 6, wherein the spatial index structure in the step 32) is an R-tree structure.

8. The method for constructing the spatio-temporal index for the massive trajectory point data as claimed in any one of claims 3 to 6, wherein the construction of the index tree can be realized based on a MapReduce or Spark programming framework.

9. A method of querying trajectory point data using an index tree constructed as claimed in any one of claims 1 to 8, comprising:

10. The method of claim 9, wherein the method is implemented based on MapReduce or Spark programming framework.