CN113032391B

CN113032391B - Distributed sub-track connection query processing method

Info

Publication number: CN113032391B
Application number: CN202110162264.7A
Authority: CN
Inventors: 陈刚; 常志豪; 张东祥; 陈珂; 寿黎但; 伍赛
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2022-04-12
Anticipated expiration: 2041-02-05
Also published as: CN113032391A

Abstract

The invention discloses a distributed sub-track connection query processing method. First, hybrid partitioning is applied to the track data: the data are partitioned by time based on their time information, and the data within each time partition are then partitioned by space based on their spatial position information. An index is built in each time partition. During a query, the query track is first split using the same time interval, and the resulting sub-queries are run in parallel in the corresponding time partitions to obtain a series of candidate tracks. The spatial partition data corresponding to each candidate track are then loaded into memory and verified one by one. Finally, the results obtained from the time partitions are merged. The method supports queries over city-scale GPS point data, effectively reduces I/O and CPU processing overhead, and speeds up query processing.

Description

Distributed sub-track connection query processing method

Technical Field

The invention belongs to the technical field of space database systems, and particularly relates to a distributed sub-track connection query processing method on GPS track data.

Background

In the public health field, close contact tracing is the process of identifying people who have been in close contact with infected patients, and it plays a key role in preventing the further spread of infectious diseases. Because of its high identification accuracy, it is widely used in infectious disease prevention and control to trace close contacts between healthy people and confirmed patients. To find people who have been in prolonged contact with an infected patient, close contact tracing can be formalized as a sub-track connection query. To support close contact tracing at the scale of a modern city, sub-track connection queries must be run over a large-scale track database holding weeks of GPS data from millions of users.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a distributed sub-track connection query processing method. The query handled by the method is as follows: given an input track, return all tracks in the track database that are in close contact with it over some continuous period of time. The method supports queries over city-scale GPS track points, effectively reduces disk I/O and CPU computation cost, and speeds up processing.

The purpose of the invention is realized by the following technical scheme: a distributed sub-track connection query processing method specifically comprises the following steps:

First, we describe the design steps of the storage part:

(1) Perform hybrid partitioning on the raw track data. The track data are first partitioned by time, based on the time information of the tracks, to obtain a series of time partitions; each time partition is then processed in a distributed, parallel manner.

(2) Each time partition is further partitioned by space, based on the spatial position information of the track points. For each track within the time period, its minimum bounding rectangle (MBR) is computed, followed by the center point of that rectangle. The center points are then ordered along a Hilbert space-filling curve and split in that order to obtain a series of spatial partitions, each stored as an independent file.

Then we describe the construction process of the index as follows:

(1) Precise indexes are built at the edges of each time partition. For each time partition, an R-Tree is built at its head and tail moments over the exact longitude and latitude of all tracks at those moments.

(2) A coarse index is built inside each time partition. At each moment within the time partition, the whole two-dimensional space is divided uniformly by a grid index, and a high-dimensional vector is derived from the spatial distribution of all track points at that moment. A parameter N is introduced, and the K-Means clustering algorithm groups the high-dimensional vectors of all moments in the time partition into N classes, yielding grid indexes at N moments.

Finally, we describe the query steps as follows:

(1) Filtering. The input query track is split along the time dimension using the same time interval as the time partitions, and each resulting piece is then queried in parallel in the time partition that covers its time period.

(2) Verification. The spatial partitions corresponding to the candidate tracks obtained in the filtering stage are loaded into memory, and the candidates are verified in order of candidate track ID.

(3) Merging. The results of the verification stage are processed as follows: a track found to be in close contact within a single partition is directly determined to be a close contact track; for a track whose close contact cannot be determined within a single partition, the two time partitions adjacent to that partition (before and after it) are combined with it for an auxiliary judgment.

Compared with the prior art, the invention has the following beneficial effects. The method is designed for a distributed system and is therefore naturally parallel, and its filtering stage prunes quickly and effectively. It thus has the following advantages:

1) The method can be processed in parallel across a large-scale cluster and is highly scalable.

2) The method follows the filter-and-verify approach, prunes effectively, reduces the disk I/O and CPU load, and improves system performance.

Drawings

FIG. 1 is a flow diagram of a hybrid partitioning section;

FIG. 2 is a schematic diagram of the index construction within each partition;

FIG. 3 is an overall process flow diagram of the present invention.

Detailed Description

The technical solutions of the present invention are further described below with reference to the accompanying drawings, and it should be understood that the specific examples described herein are only for the purpose of explaining the present invention and are not intended to limit the present invention.

The accompanying drawings show the processing flow of the invention. The method specifically comprises the following steps:

First, we describe the storage design and the index construction process:

(1) Perform hybrid partitioning on the raw track data. All track data are first partitioned by time, based on the time information of the tracks, to obtain N time partitions. Suppose the raw data cover 10000 users for 1 day. With a partition length of 30 minutes, all track data are divided into 48 time partitions: one partition for the 10000 users during 00:00:00-00:30:00, one for 00:30:00-01:00:00, …, and one for 23:30:00-24:00:00.
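
The time partitioning in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the tuple layout `(user_id, t, lon, lat)`, and the choice of half-open intervals (a point at exactly 00:30:00 goes to the second partition) are assumptions made for the example.

```python
from collections import defaultdict

PARTITION_SECONDS = 30 * 60  # 30-minute partitions, as in the example above


def time_partition_id(seconds_since_midnight, width=PARTITION_SECONDS):
    """Return the index of the time partition that contains a timestamp."""
    return int(seconds_since_midnight // width)


def partition_by_time(points, width=PARTITION_SECONDS):
    """Group (user_id, t, lon, lat) points into {partition_id: {user_id: [points]}}."""
    partitions = defaultdict(lambda: defaultdict(list))
    for user_id, t, lon, lat in points:
        partitions[time_partition_id(t, width)][user_id].append((t, lon, lat))
    return partitions


# Hypothetical sample points; t is seconds since midnight.
points = [
    (1, 0, 120.10, 30.20),      # 00:00:00, first partition
    (1, 1799, 120.11, 30.21),   # 00:29:59, first partition
    (1, 1800, 120.12, 30.22),   # 00:30:00, second partition (half-open intervals)
    (2, 86399, 120.30, 30.40),  # 23:59:59, last of the 48 partitions
]
parts = partition_by_time(points)
```

With a 30-minute width, one day of data yields partition indexes 0 to 47, which matches the 48 partitions in the example.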

(2) Each time partition is further partitioned by space, based on the spatial position information of the track points. The minimum bounding rectangle (MBR) of each track within the time period is computed, followed by the center point of each rectangle. A visiting order is then derived from the two-dimensional coordinates of the center points using a Hilbert space-filling curve, and the tracks are split in that order to obtain a series of spatial partitions, each stored as an independent file. Take the single time partition obtained in step (1) for the 10000 users during 00:00:00-00:30:00. Each of the 10000 users yields one MBR for its track in this time period, which gives 10000 center points. Connecting these 10000 two-dimensional points with a Hilbert curve fixes an order on them. If 10 spatial partitions are wanted, the 10000 tracks are cut into 10 parts in this order, giving 10 spatial partitions: one for users 0-999 during 00:00:00-00:30:00, one for users 1000-1999 during 00:00:00-00:30:00, …, and one for users 9000-9999 during 00:00:00-00:30:00.
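
A minimal sketch of this spatial partitioning step is given below. It uses the well-known iterative mapping from grid cell to Hilbert-curve position. The grid resolution (`order`), the bounding box `bounds`, the equal-size chunking, and all function names are assumptions chosen for illustration; the patent does not fix them.

```python
def hilbert_index(order, x, y):
    """Map integer cell (x, y) of a 2^order x 2^order grid to its Hilbert-curve position."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def mbr_center(track):
    """Center of the minimum bounding rectangle of a list of (lon, lat) points."""
    lons = [p[0] for p in track]
    lats = [p[1] for p in track]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


def spatial_partitions(tracks, bounds, num_partitions, order=16):
    """Sort track IDs by the Hilbert position of their MBR centers, then cut into chunks."""
    min_lon, min_lat, max_lon, max_lat = bounds
    n = 1 << order

    def key(track_id):
        cx, cy = mbr_center(tracks[track_id])
        gx = min(int((cx - min_lon) / (max_lon - min_lon) * n), n - 1)
        gy = min(int((cy - min_lat) / (max_lat - min_lat) * n), n - 1)
        return hilbert_index(order, gx, gy)

    ordered = sorted(tracks, key=key)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical tracks inside the unit square, one per quadrant.
tracks = {
    "d": [(0.9, 0.1)],
    "b": [(0.1, 0.9)],
    "a": [(0.05, 0.05), (0.15, 0.15)],
    "c": [(0.9, 0.9)],
}
parts = spatial_partitions(tracks, (0.0, 0.0, 1.0, 1.0), num_partitions=2)
```

Because tracks that are close on the Hilbert curve tend to be close in space, each chunk groups spatially nearby tracks, which is what lets the verification stage load only a few partition files.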

(3) Precise indexes are built at the edges of each partition. For each time partition, an R-Tree is built at its head and tail moments over the exact longitude and latitude of all track points at those moments. For the partition of the 10000 users during 00:00:00-00:30:00 obtained in step (1), R-Trees are built over the track points of these users at the two moments 00:00:00 and 00:30:00. Once built, the indexes are cached in memory so that later queries run directly in memory.
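
The patent names the R-Tree but does not describe how it is built. The following is a small self-contained sketch of a static R-Tree over points, bulk-loaded with Sort-Tile-Recursive packing; the packing strategy, the node capacity, and every name in the code are assumptions for illustration, and a production system would more likely use an existing spatial index library.

```python
import math


class RTreeNode:
    def __init__(self, mbr, children=None, entries=None):
        self.mbr = mbr            # (min_x, min_y, max_x, max_y)
        self.children = children  # internal node: list of RTreeNode
        self.entries = entries    # leaf node: list of (track_id, x, y)


def _union_mbr(mbrs):
    mbrs = list(mbrs)
    return (min(m[0] for m in mbrs), min(m[1] for m in mbrs),
            max(m[2] for m in mbrs), max(m[3] for m in mbrs))


def _str_groups(items, capacity, x_of, y_of):
    """Sort-Tile-Recursive grouping: vertical slices by x, then runs by y."""
    num_groups = math.ceil(len(items) / capacity)
    slice_size = math.ceil(math.sqrt(num_groups)) * capacity
    items = sorted(items, key=x_of)
    groups = []
    for i in range(0, len(items), slice_size):
        strip = sorted(items[i:i + slice_size], key=y_of)
        groups.extend(strip[j:j + capacity] for j in range(0, len(strip), capacity))
    return groups


def build_rtree(entries, capacity=4):
    """Bulk-load a non-empty list of (track_id, x, y) points into an R-Tree."""
    level = [RTreeNode(_union_mbr((x, y, x, y) for _, x, y in g), entries=g)
             for g in _str_groups(entries, capacity, lambda e: e[1], lambda e: e[2])]
    while len(level) > 1:
        groups = _str_groups(level, capacity,
                             lambda n: n.mbr[0] + n.mbr[2],
                             lambda n: n.mbr[1] + n.mbr[3])
        level = [RTreeNode(_union_mbr(n.mbr for n in g), children=g) for g in groups]
    return level[0]


def range_query(node, rect):
    """Return IDs of all points inside rect = (min_x, min_y, max_x, max_y)."""
    qx0, qy0, qx1, qy1 = rect
    x0, y0, x1, y1 = node.mbr
    if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
        return []  # prune: this subtree cannot contain any hit
    if node.entries is not None:
        return [tid for tid, x, y in node.entries
                if qx0 <= x <= qx1 and qy0 <= y <= qy1]
    hits = []
    for child in node.children:
        hits.extend(range_query(child, rect))
    return hits


# Hypothetical points at one boundary moment (e.g. 00:00:00 of a partition).
entries = [(i, (i * 37) % 100, (i * 53) % 100) for i in range(200)]
tree = build_rtree(entries)
hits = range_query(tree, (20, 30, 60, 70))
```

One such tree would be built for the head moment and one for the tail moment of every time partition.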

(4) A coarse index is built inside each partition. At each moment within the time partition, the whole two-dimensional space is divided uniformly by a grid index, and a high-dimensional vector is derived from the spatial distribution of all track points at that moment. A parameter N is introduced, and the K-Means clustering algorithm groups the high-dimensional vectors of all moments in the time partition into N classes, yielding grid indexes at N moments. Take the partition of the 10000 users during 00:00:00-00:30:00 obtained in step (1), and suppose the GPS sampling interval of the tracks is 10 s and the grid index size is 100 × 100. A 100 × 100 grid index is built every 10 s, so each track maps to one grid cell number at each moment, and each moment yields a 100 × 100 = 10000-dimensional vector whose elements give the number of tracks falling into each grid cell at that moment. The parameter N is then introduced, K-Means clusters the vectors of all moments into N classes, and two-dimensional grid indexes are built at the N cluster-center moments. Once built, the indexes are cached in memory so that later queries run directly in memory.
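
The coarse index construction can be sketched as below, using a tiny 2 × 2 grid in place of the 100 × 100 grid of the example. The K-Means here is plain Lloyd iteration with evenly spaced initial centers, and a "cluster-center moment" is read as the moment whose vector lies closest to its cluster center; both choices, along with all names, are assumptions rather than details stated in the patent.

```python
def grid_vector(points, bounds, m):
    """Count points per cell of an m x m grid, flattened to a vector of length m * m."""
    min_x, min_y, max_x, max_y = bounds
    vec = [0] * (m * m)
    for x, y in points:
        col = min(int((x - min_x) / (max_x - min_x) * m), m - 1)
        row = min(int((y - min_y) / (max_y - min_y) * m), m - 1)
        vec[row * m + col] += 1
    return vec


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(vectors, n, iterations=20):
    """Plain Lloyd K-Means with evenly spaced initial centers; returns (centers, labels)."""
    step = len(vectors) / n
    centers = [list(vectors[int(i * step)]) for i in range(n)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest center for every moment's vector.
        labels = [min(range(n), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        # Update step: each center becomes the mean of its members.
        for c in range(n):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def representative_moments(vectors, n):
    """For each cluster, pick the moment whose vector is closest to the cluster center."""
    centers, labels = kmeans(vectors, n)
    reps = []
    for c in range(n):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda i: sq_dist(vectors[i], centers[c])))
    return sorted(reps)


# Hypothetical vectors for 6 moments on a 2 x 2 grid: the crowd moves from
# the first cell to the last cell halfway through the partition.
vectors = [
    [10, 0, 0, 0], [9, 1, 0, 0], [10, 0, 0, 0],
    [0, 0, 0, 10], [0, 0, 1, 9], [0, 0, 0, 10],
]
reps = representative_moments(vectors, n=2)
```

Only the moments in `reps` would then receive a grid index, so N controls the trade-off between index size and how finely the filtering stage can prune.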

For the query processing process, the specific implementation steps are as follows:

(1) Filtering. The input query track is split along the time dimension using the same time interval as the time partitions, and each piece is queried in parallel in the time partition covering its time period. Suppose the query track lasts one day. It is first split along the time dimension into 30-minute pieces, giving 48 sub-queries. These are then processed in a distributed, parallel manner: the 1st sub-query is sent to time partition 00:00:00-00:30:00, the 2nd to time partition 00:30:00-01:00:00, …, and the 48th to time partition 23:30:00-24:00:00. Within each partition, a Range Query is first run on the R-Trees at the head and tail moments, and its results are taken directly as candidate tracks. For the other moments in the partition, a Range Query is run on the two-dimensional grid index of each indexed moment; the results of the Range Queries at several consecutive moments are intersected, and the union of all such intersections in the partition is taken. This yields the candidate tracks for the partition.
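
The set logic of the filtering stage can be sketched as follows. The patent does not say how many consecutive moments are intersected, so `run_length` is a hypothetical parameter; the per-moment hit sets are assumed to come from Range Queries on the grid indexes, and the function names are illustrative only.

```python
from collections import defaultdict


def split_query(query_points, width=1800):
    """Split (t, lon, lat) query points into sub-queries keyed by time-partition index."""
    subqueries = defaultdict(list)
    for t, lon, lat in query_points:
        subqueries[int(t // width)].append((t, lon, lat))
    return dict(subqueries)


def interior_candidates(hits_per_moment, run_length):
    """Intersect hits over each run of consecutive indexed moments, then union the runs.

    hits_per_moment is a time-ordered list of sets of track IDs, one set per
    indexed moment; run_length (>= 1) is an assumed tuning parameter.
    """
    result = set()
    for start in range(len(hits_per_moment) - run_length + 1):
        result |= set.intersection(*hits_per_moment[start:start + run_length])
    return result


def filter_partition(head_hits, tail_hits, hits_per_moment, run_length):
    """Candidates of one partition: boundary R-Tree hits plus interior grid-index hits."""
    return set(head_hits) | set(tail_hits) | interior_candidates(hits_per_moment, run_length)


# Hypothetical Range Query results at four consecutive indexed moments.
hits = [{1, 2, 3}, {2, 3}, {3, 4}, {4}]
candidates = filter_partition({9}, set(), hits, run_length=2)
```

Intersecting over a run keeps only tracks that stay near the query for several moments in a row, which is what removes tracks that merely pass by.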

(2) Verification. The spatial partitions corresponding to the candidate tracks obtained in the filtering stage are loaded into memory, and the candidates are verified in order of candidate track ID. Specifically, all spatial partitions are processed in parallel on the cluster. When a single partition is processed, the track points of each track in that time period are read according to the track IDs that the filtering stage marked for verification, and they are compared with the query track points one by one to obtain the verification result.
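
A point-by-point verification of one candidate against the query might look like the sketch below. The patent does not state the distance measure or the contact threshold, so the haversine great-circle distance and the 5 m threshold used here are assumptions, as are the function names and the `{t: (lon, lat)}` layout.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def contact_moments(query, candidate, max_dist_m):
    """Return the sorted timestamps at which both tracks are within max_dist_m.

    query and candidate map a timestamp to a (lon, lat) position.
    """
    return sorted(
        t for t, (qlon, qlat) in query.items()
        if t in candidate and haversine_m(qlon, qlat, *candidate[t]) <= max_dist_m
    )


# Hypothetical positions sampled every 10 s.
query = {0: (120.0, 30.0), 10: (120.0, 30.0), 20: (120.0, 30.0)}
candidate = {0: (120.0, 30.00001), 10: (121.0, 30.0), 20: (120.0, 30.0)}
moments = contact_moments(query, candidate, max_dist_m=5.0)
```

The resulting contact moments can be grouped into continuous intervals and passed on to the merging stage.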

(3) Merging. The results of the verification stage are processed as follows: a track found to be in close contact within a single partition is directly determined to be a close contact track; for a track whose close contact cannot be determined within a single partition, the two time partitions adjacent to that partition (before and after it) are combined with it for an auxiliary judgment. Suppose the time window of the query is 20 minutes. If the verification stage finds that the track with ID 1 is in close contact with the query track for 25 minutes within partition 00:30:00-01:00:00, that track can be directly determined to be a close contact track. If the verification stage finds that the track with ID 2 is in close contact with the query track only during 00:30:00-00:45:00 within partition 00:30:00-01:00:00, close contact cannot yet be determined, because 15 minutes is shorter than the 20-minute window; the contact during 00:25:00-00:30:00 in partition 00:00:00-00:30:00 must then also be taken into account to reach a decision.
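
The two-stage decision in this example can be sketched as follows, with contact periods given as `(start, end)` intervals in seconds since midnight. The data layout and function names are assumptions; the numbers reproduce the 20-minute window and the two tracks of the example.

```python
def merge_intervals(intervals):
    """Merge touching or overlapping (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_close_contact(contacts, partition, window):
    """Decide close contact for one candidate track.

    contacts maps a time-partition index to the contact intervals found for
    this track in that partition during verification.
    """
    own = contacts.get(partition, [])
    # Stage 1: the partition alone already contains a long enough contact.
    if any(end - start >= window for start, end in own):
        return True
    # Stage 2: join with the partitions just before and just after.
    joined = own + contacts.get(partition - 1, []) + contacts.get(partition + 1, [])
    return any(end - start >= window for start, end in merge_intervals(joined))


WINDOW = 20 * 60  # 20-minute query window

# Track 1: 25 minutes of contact inside partition 1 (00:30:00-01:00:00).
track1 = {1: [(1800, 3300)]}
# Track 2: 00:30:00-00:45:00 in partition 1 plus 00:25:00-00:30:00 in partition 0.
track2 = {0: [(1500, 1800)], 1: [(1800, 2700)]}

result1 = is_close_contact(track1, partition=1, window=WINDOW)
result2 = is_close_contact(track2, partition=1, window=WINDOW)
```

For track 2 the two intervals touch at the partition boundary and merge into one 20-minute contact, so the auxiliary judgment succeeds even though neither partition alone reaches the window.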

Compared with the prior art, the method provided by the invention can be processed in parallel across a large-scale cluster and scales well. It follows the filter-and-verify approach, prunes effectively, reduces the disk I/O and CPU load, and improves system performance.

Claims

1. A distributed sub-track connection query processing method is characterized by comprising the following steps:

(1) a storage section designing step, comprising the substeps of:

(1.1) firstly, time partitioning is carried out on track data based on time information of track points, and track segments are segmented according to equal time intervals to obtain a series of time partitions;

(1.2) solving a corresponding minimum rectangular frame for each track in each time partition, and calculating to obtain a central point of the minimum rectangular frame of each track;

(1.3) sequencing the central points of the minimum rectangular frames of each track by using a space filling curve Hilbert curve in each time partition, and segmenting according to sequencing results to obtain a series of space partitions, wherein each space partition is an independent storage file;

(2) the index part constructing step comprises the following substeps:

(2.1) for each time partition, establishing indexes for the accurate longitude and latitude of all tracks at the current time by using an index structure R-Tree at the head and tail moments of the time partition;

(2.2) uniformly dividing the whole two-dimensional space into m × m two-dimensional grids by using a grid index at each time of time partition, and then obtaining an m × m-dimensional high-dimensional vector according to the spatial distribution of all track points at the time;

(2.3) introducing a parameter N, and clustering the m × m-dimensional high-dimensional vectors of all moments in the time partition into N classes by using a clustering algorithm K-Means to finally obtain grid indexes of N moments;

(3) a query step, comprising the following substeps:

(3.1) filtering: for the input query track, performing data segmentation on the time dimension according to the same time interval, and then performing parallel query in corresponding time partitions according to corresponding time periods;

(3.2) verifying: loading the corresponding space partition data into a memory according to the candidate track ID obtained in the filtering stage, and sequentially verifying according to the candidate track ID;

(3.3) merging: processing the result obtained in the verification stage, namely directly determining a track to be a close contact track when it is determined to be in close contact within a single partition; and for a track whose close contact cannot be determined within a single partition, combining the two adjacent time partitions before and after that partition to perform auxiliary judgment.