CN113032391B - Distributed sub-track connection query processing method - Google Patents
- Publication number: CN113032391B
- Application number: CN202110162264.7A
- Authority: CN (China)
- Prior art keywords: track, time, partition, query, space
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Navigation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a distributed sub-track connection query processing method. First, the track data undergo hybrid partitioning: they are partitioned by time based on their time information, and the track data within each time partition are then partitioned by space based on their spatial position information. An index is built in each time partition. During a query, the query track is first split using the same time interval, and the resulting sub-queries are run in parallel in the corresponding time partitions to obtain a set of candidate tracks. The spatial partition data corresponding to each candidate track are then loaded into memory and verified one by one. Finally, the results obtained from the individual time partitions are merged. The method supports queries over city-scale GPS point data, reduces I/O and CPU processing overhead, and speeds up query processing.
Description
Technical Field
The invention belongs to the technical field of spatial database systems, and particularly relates to a distributed sub-track connection query processing method for GPS track data.
Background
In public health, close-contact tracing is the process of identifying people who have been in close contact with infected patients, and it plays a key role in preventing the further spread of infectious diseases. Because of its high identification accuracy, it is widely used in infectious disease prevention and control to trace close contacts between healthy people and confirmed patients. Finding people who were in sustained contact with an infected patient can be formalized as a sub-track connection (that is, a sub-trajectory join). To support close-contact tracing at the scale of a modern city, sub-track connection queries must be run against a large track database holding weeks of GPS data for millions of users.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a distributed sub-track connection query processing method. The query it handles is as follows: given an input track, return all tracks in the track database that were in close contact with it for some continuous period of time. The method supports queries over city-scale GPS track points, reduces disk I/O and CPU computation cost, and speeds up processing.
The purpose of the invention is achieved by the following technical solution: a distributed sub-track connection query processing method, comprising the following steps.
First, the design of the storage part is described:
(1) Hybrid partitioning is applied to the raw track data. The track data are first partitioned by time, based on the time information of the tracks, to obtain a series of time partitions; each time partition is then processed in a distributed, parallel fashion.
(2) Each time partition is further partitioned by space, based on the spatial position information of the track points. The minimum bounding rectangle (MBR) of each track within the time period is computed, the center point of each track's MBR is obtained, and the center points are then ordered along a Hilbert space-filling curve and split in that order into a series of spatial partitions, each of which is stored as an independent file.
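The spatial partitioning in step (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `bounds` argument, the quantization of center points onto a 2^order × 2^order grid, and the use of the textbook coordinate-to-Hilbert-index conversion are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def mbr_center(points: List[Point]) -> Point:
    """Center of the minimum bounding rectangle of one track segment."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve on a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so that consecutive indexes stay adjacent.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def spatial_partitions(tracks: Dict[int, List[Point]], num_partitions: int,
                       bounds: Tuple[float, float, float, float],
                       order: int = 10) -> List[List[int]]:
    """Sort track IDs by the Hilbert index of their MBR center, then cut the
    sorted list into chunks of at most ceil(len(tracks) / num_partitions) IDs."""
    min_x, min_y, max_x, max_y = bounds
    n = 1 << order

    def cell(value: float, lo: float, hi: float) -> int:
        # Quantize a coordinate to a grid cell, clamped to the grid (assumption).
        if hi <= lo:
            return 0
        return min(max(int((value - lo) / (hi - lo) * n), 0), n - 1)

    keyed = []
    for track_id, pts in tracks.items():
        cx, cy = mbr_center(pts)
        key = hilbert_index(order, cell(cx, min_x, max_x), cell(cy, min_y, max_y))
        keyed.append((key, track_id))
    keyed.sort()
    ordered = [track_id for _, track_id in keyed]
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

Tracks whose MBR centers are close in space tend to receive nearby Hilbert indexes, so they are likely to end up in the same storage file; that locality is what lets the verification stage load only a few spatial partitions.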
Next, the construction of the index is described:
(1) A precise index is built at the edges of each time partition. For each time partition, an R-Tree is built over the exact longitude and latitude of all tracks at the first and last moments of the partition.
(2) A coarse index is built inside each time partition. At each moment in the time partition, a grid index divides the whole two-dimensional space uniformly, and a high-dimensional vector is derived from the spatial distribution of all track points at that moment. A parameter N is introduced, and the K-Means clustering algorithm groups the high-dimensional vectors of all moments in the time partition into N classes, yielding grid indexes at N moments.
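As an illustration of the per-moment vector used by the coarse index, the following Python sketch counts the track points that fall into each cell of an m × m grid and flattens the counts into one vector. The function name `grid_vector`, the `bounds` tuple, the row-major layout, and the clamping of points on the upper boundary are assumptions, not details given in the text.

```python
from typing import List, Sequence, Tuple

Bounds = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def grid_vector(points: Sequence[Tuple[float, float]], bounds: Bounds, m: int) -> List[int]:
    """Return a flattened m*m vector for one moment.

    Element row * m + col holds the number of track points lying in the
    grid cell at (row, col); the grid divides `bounds` uniformly.
    """
    min_x, min_y, max_x, max_y = bounds
    counts = [0] * (m * m)
    for x, y in points:
        # Clamp so that points on the upper boundary land in the last cell.
        col = min(max(int((x - min_x) / (max_x - min_x) * m), 0), m - 1)
        row = min(max(int((y - min_y) / (max_y - min_y) * m), 0), m - 1)
        counts[row * m + col] += 1
    return counts
```

One such vector is produced per sampled moment; the vectors of all moments in a time partition are the input to the clustering step.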
Finally, the query steps are described:
(1) Filtering. The input query track is split along the time dimension using the same time interval, and each resulting piece is queried in parallel in the time partition that covers its time period.
(2) Verification. Based on the candidate tracks obtained in the filtering stage, the corresponding spatial partitions are loaded into memory and the candidates are verified in order of candidate track ID.
(3) Merging. The results of the verification stage are processed as follows: a track that can be determined to be in close contact within a single partition is directly reported as a close contact; for a track whose close contact cannot be determined within a single partition, the two adjacent time partitions (the one before and the one after) are combined to assist the judgment.
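The first part of the filtering step, splitting the query track with the same interval as the stored data, can be sketched as follows. The sample layout `(timestamp_s, x, y)`, the function name, and the half-open assignment of boundary samples to the later partition are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float, float]  # (timestamp in seconds, x, y)


def split_by_time(track: List[Sample], interval_s: int = 1800) -> Dict[int, List[Sample]]:
    """Group the samples of one track by time-partition index.

    Partition k covers the half-open range [k * interval_s, (k + 1) * interval_s).
    """
    parts: Dict[int, List[Sample]] = {}
    for t, x, y in track:
        parts.setdefault(t // interval_s, []).append((t, x, y))
    return parts


# A hypothetical one-day query track sampled every 10 s at a fixed position.
query_track = [(t, 120.0, 30.0) for t in range(0, 86400, 10)]
sub_queries = split_by_time(query_track)
```

Each key of `sub_queries` identifies the time partition to which that sub-query would be dispatched, so the sub-queries can be evaluated independently and in parallel.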
Compared with the prior art, the invention has the following beneficial effects. The processing method is designed for a distributed system and is therefore naturally parallel, and its filtering stage prunes quickly and effectively. It consequently has the following advantages:
1) The method can run in parallel across a large cluster and is highly scalable.
2) The method follows the filter-and-verify approach, prunes well, reduces the disk I/O and CPU load, and thereby improves system performance.
Drawings
FIG. 1 is a flow diagram of the hybrid partitioning part;
FIG. 2 is a schematic diagram of the index construction within each partition;
FIG. 3 is an overall processing flow diagram of the present invention.
Detailed Description
The technical solutions of the present invention are further described below with reference to the accompanying drawings. It should be understood that the specific examples described herein only serve to explain the present invention and are not intended to limit it.
The accompanying drawings show the processing flow of the invention, which comprises the following steps.
First, the storage design and the index construction are described:
(1) Hybrid partitioning is applied to the raw track data. All track data are first partitioned by time, based on the time information of the tracks, to obtain a series of time partitions. For example, given one day of raw track data for 10000 users and a partition length of 30 minutes, all track data are divided into 48 time partitions: the 10000 users' data for 00:00:00-00:30:00 form one partition, their data for 00:30:00-01:00:00 form another, …, and their data for 23:30:00-24:00:00 form the last.
(2) Each time partition is further partitioned by space, based on the spatial position information of the track points. The center point of each track's minimum bounding rectangle is computed, a visiting order is derived from the two-dimensional coordinates of the center points using a Hilbert space-filling curve, and the tracks are split in that order into a series of spatial partitions, each of which is stored as an independent file. For example, for the single time partition of 10000 users over 00:00:00-00:30:00 obtained in step (1), the minimum bounding rectangle of each user's track within the time period is computed, giving 10000 rectangles and hence 10000 center points. The Hilbert curve connects these 10000 two-dimensional points and so defines an order over them. If 10 spatial partitions are wanted, the 10000 tracks are divided into 10 parts in that order: users 0-999 for 00:00:00-00:30:00 form one partition, the next 1000 users form the next, …, and the last 1000 users form the tenth.
(3) A precise index is built at the edges of each partition. For each time partition, an R-Tree is built over the exact longitude and latitude of all track points at the first and last moments of the partition. For the 10000-user partition 00:00:00-00:30:00 obtained in step (1), an R-Tree is built at 00:00:00 and at 00:30:00 over the 10000 track points present at each of these two moments. Once built, the indexes are cached in memory, so later queries run directly in memory.
(4) A coarse index is built inside each partition. At each moment in the time partition, a grid index divides the whole two-dimensional space uniformly, and a high-dimensional vector is derived from the spatial distribution of all track points at that moment. A parameter N is introduced, and the K-Means clustering algorithm groups the high-dimensional vectors of all moments in the time partition into N classes, yielding grid indexes at N moments. For the 10000-user partition 00:00:00-00:30:00 obtained in step (1), suppose the GPS sampling interval of the tracks is 10 s and the grid index size is 100 × 100. A 100 × 100 grid is then laid over the space every 10 s, each track falls into one grid cell at each moment, and each moment yields a 100 × 100 = 10000-dimensional vector in which each element is the number of tracks falling into the corresponding cell at that moment. K-Means then clusters the vectors of all moments into N classes, two-dimensional grid indexes are built at the N cluster-center moments, and the indexes are cached in memory so that later queries run directly in memory.
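The selection of the N moments in step (4) can be sketched with a small pure-Python K-Means (Lloyd's algorithm). The text does not say how a "cluster-center moment" is chosen, so this sketch assumes it is the moment whose vector lies closest to each centroid; the evenly spaced initial centers, the fixed iteration count, and all names are likewise assumptions.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_representatives(vectors: List[Sequence[float]], n: int,
                           iterations: int = 20) -> List[int]:
    """Cluster per-moment vectors into n classes and return, for each
    non-empty class, the index of the moment nearest to its centroid."""
    if not 1 <= n <= len(vectors):
        raise ValueError("n must be between 1 and the number of moments")
    # Deterministic initialization: evenly spaced moments (assumption).
    step = len(vectors) / n
    centers = [list(vectors[int(i * step)]) for i in range(n)]
    assign = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each moment joins its nearest center.
        assign = [min(range(n), key=lambda k: sq_dist(v, centers[k])) for v in vectors]
        # Update step: each center moves to the mean of its members.
        for k in range(n):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    reps = []
    for k in range(n):
        member_ids = [i for i, a in enumerate(assign) if a == k]
        if member_ids:
            reps.append(min(member_ids, key=lambda i: sq_dist(vectors[i], centers[k])))
    return sorted(reps)
```

With a 10 s sampling interval, a 30-minute partition has on the order of 180 moments, so keeping grid indexes only at N representative moments reduces the memory needed for the cached coarse index.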
The query processing is implemented in the following steps:
(1) Filtering. The input query track is split along the time dimension using the same time interval, and each resulting piece is queried in parallel in the time partition that covers its time period. Suppose the query track lasts one day. It is first cut along the time dimension into 30-minute pieces, giving 48 sub-queries. These are then processed in a distributed, parallel fashion: the 1st sub-query is sent to time partition 00:00:00-00:30:00, the 2nd to time partition 00:30:00-01:00:00, …, and the 48th to time partition 23:30:00-24:00:00. Within each partition, a range query is first run on the R-Trees at the first and last moments, and its results are taken directly as candidate tracks. For the other moments in the partition, a range query is run on the two-dimensional grid index of each moment, the results of the range queries at several consecutive moments are intersected, and the union of all these intersections is taken, producing the candidate tracks for the partition.
(2) Verification. Based on the candidate tracks obtained in the filtering stage, the corresponding spatial partitions are loaded into memory and the candidates are verified in order of candidate track ID. Specifically, all spatial partitions are processed in parallel on the cluster. When a single partition is processed, the track points of each track for that time period are read according to the track IDs that the filtering stage marked for verification, and they are compared one by one against the query track points to obtain the verification result.
(3) Merging. The results of the verification stage are processed as follows: a track that can be determined to be in close contact within a single partition is directly reported as a close contact; for a track whose close contact cannot be determined within a single partition, the two adjacent time partitions (the one before and the one after) are combined to assist the judgment. Suppose the time window of the query is 20 minutes. If the verification stage finds that the track with ID 1 was in close contact with the query track for 25 minutes within partition 00:30:00-01:00:00, the track can be directly reported as a close-contact track. If the verification stage finds that the track with ID 2 was in close contact with the query track during 00:30:00-00:45:00 within partition 00:30:00-01:00:00, these 15 minutes alone cannot decide whether the track is a close contact, so the contact during 00:25:00-00:30:00 in partition 00:00:00-00:30:00 must also be examined.
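The merging rule in step (3) can be sketched in Python under one assumption about the data layout: the verification stage is taken to report, for each candidate track, the intervals (in seconds since midnight) during which it was in close contact with the query track inside each time partition. That representation and the function name are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_s, end_s) of one continuous contact period


def in_close_contact(intervals: List[Interval], window_s: int) -> bool:
    """Join contact intervals that touch or overlap (for example across a
    partition boundary) and test whether any joined run lasts at least
    window_s seconds."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return any(end - start >= window_s for start, end in merged)


WINDOW = 20 * 60  # the 20-minute query window from the example

# Track 1: 25 minutes of contact inside partition 00:30:00-01:00:00
# (the exact start time is made up for illustration).
track_1 = [(1920, 3420)]
# Track 2: contact during 00:30:00-00:45:00 only.
track_2_single = [(1800, 2700)]
# Track 2 with contact during 00:25:00-00:30:00 from the preceding partition.
track_2_merged = [(1500, 1800), (1800, 2700)]
```

Here `track_1` satisfies the window on its own, `track_2_single` does not (15 minutes), and `track_2_merged` reaches exactly 20 minutes once the adjacent partition is taken into account.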
Compared with the prior art, the method can run in parallel across a large cluster and scales well. It follows the filter-and-verify approach, prunes well, reduces the disk I/O and CPU load, and thereby improves system performance.
Claims (1)
1. A distributed sub-track connection query processing method is characterized by comprising the following steps:
(1) a storage section designing step, comprising the substeps of:
(1.1) firstly, time partitioning is carried out on track data based on time information of track points, and track segments are segmented according to equal time intervals to obtain a series of time partitions;
(1.2) solving a corresponding minimum rectangular frame for each track in each time partition, and calculating to obtain a central point of the minimum rectangular frame of each track;
(1.3) sequencing the central points of the minimum rectangular frames of each track by using a space filling curve Hilbert curve in each time partition, and segmenting according to sequencing results to obtain a series of space partitions, wherein each space partition is an independent storage file;
(2) the index part constructing step comprises the following substeps:
(2.1) for each time partition, establishing indexes for the accurate longitude and latitude of all tracks at the current time by using an index structure R-Tree at the head and tail moments of the time partition;
(2.2) uniformly dividing the whole two-dimensional space into m × m two-dimensional grids by using a grid index at each time of time partition, and then obtaining an m × m-dimensional high-dimensional vector according to the spatial distribution of all track points at the time;
(2.3) introducing a parameter N, and clustering the m × m-dimensional high-dimensional vectors of all moments in the time partition into N classes by using a clustering algorithm K-Means to finally obtain grid indexes of N moments;
(3) a query step, comprising the following substeps:
(3.1) filtering: for the input query track, performing data segmentation on the time dimension according to the same time interval, and then performing parallel query in corresponding time partitions according to corresponding time periods;
(3.2) verifying: loading the corresponding space partition data into a memory according to the candidate track ID obtained in the filtering stage, and sequentially verifying according to the candidate track ID;
(3.3) merging: processing the result obtained by the verification part, namely, for a track that can be determined to be in close contact within a single partition, directly determining the track to be in close contact; and for a track that cannot be determined to be in close contact within a single partition, combining the two adjacent partitions before and after the time partition to perform auxiliary judgment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110162264.7A CN113032391B (en) | 2021-02-05 | 2021-02-05 | Distributed sub-track connection query processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113032391A (en) | 2021-06-25 |
CN113032391B (en) | 2022-04-12 |
Family
ID=76460107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110162264.7A Active CN113032391B (en) | 2021-02-05 | 2021-02-05 | Distributed sub-track connection query processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113032391B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US2426217A (en) * | 1942-09-14 | 1947-08-26 | Standard Telephones Cables Ltd | Direction and distance indicating system |
CN102567497A (en) * | 2011-12-23 | 2012-07-11 | 浙江大学 | Inquiring method of best matching with fuzzy trajectory problems |
CN111652446A (en) * | 2020-06-15 | 2020-09-11 | 深圳前海微众银行股份有限公司 | Method, apparatus and storage medium for predicting risk of infection of infectious disease |
Non-Patent Citations (2)
Title |
---|
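A Smart Low-consumption IoT Framework for Location Tracking and Its Real Application; Hao Tang et al.; 2016 6th International Conference on Electronics Information and Emergency Communication (ICEIEC); 2016-10-13; full text * |
A semantic trajectory extraction framework for indoor spaces (面向室内空间的语义轨迹提取框架); Luo Xinyuan et al.; Journal of Tsinghua University (Science and Technology); 2019-12-31; full text * |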
Also Published As
Publication number | Publication date |
---|---|
CN113032391A (en) | 2021-06-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |