CN106407378B

CN106407378B - Method for re-representing road network track data

Info

Publication number: CN106407378B
Application number: CN201610817878.3A
Authority: CN
Inventors: 孙未未; 韩韵衡
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2016-09-11
Filing date: 2016-09-11
Publication date: 2020-05-26
Anticipated expiration: 2036-09-11
Also published as: CN106407378A

Abstract

The invention belongs to the technical field of track data calculation, and particularly relates to a method for re-representing road network track data. Road network track data obtained by raw GPS sampling is not easy to encode and compress, so the original three-dimensional real-number sequence needs to be transformed before compression. The method decomposes a map-matched road network track into spatial data and time data, wherein the spatial data is a sequence of road network roads and the time data is a sequence of distance-time pairs; the data before and after decomposition can be converted into each other losslessly in linear time. In track calculation, the invention can reduce the cost of storing and querying tracks in a database.

Description

Method for re-representing road network track data

Technical Field

The invention belongs to the technical field of trajectory data calculation, and particularly relates to a method for re-representing road network trajectory data.

Background

Trajectory data is a basic kind of spatiotemporal data, generally defined as a function of position with respect to time. The track points sampled by a vehicle-mounted positioning device are represented by $(x, y, t)$ triples, where $x$ and $y$ are the longitude and latitude, respectively, and $t$ is the timestamp of the sampling point. The original road network trajectory can then be represented by a sequence of triples, i.e. $\langle (x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n) \rangle$, where $n$ is the length of the trajectory and $(x_i, y_i)$ is the location of the vehicle at time $t_i$. With the popularization of vehicle-mounted positioning equipment, vehicles in cities generate massive numbers of road network trajectories. Road network trajectory data carries a large amount of information and is often used as an important decision basis and information source in problems such as analyzing urban traffic conditions, mining people's behavior patterns, and predicting vehicle flow directions. The urban road network is represented by a directed graph $G = (V, E)$, where $V$ is the set of road intersections and $E$ is the set of road segments between intersections.

Data compression algorithms are classified into lossless compression and lossy compression. Lossless compression causes no information loss, i.e. the compressed data can be completely restored to the original data. In contrast, lossy compression achieves a higher compression rate by directly discarding portions of the data that do not affect the accuracy requirement, but the data loses information after compression. Lossless compression algorithms are divided into entropy coding and dictionary coding. Commonly used entropy coding includes Huffman coding and arithmetic coding; commonly used dictionary coding includes Lempel-Ziv coding and the algorithms derived from it. Lossy compression algorithms are specialized algorithms designed for specific kinds of data, such as JPEG compression for images and MPEG compression for audio; unlike lossless compression, these methods apply only to the specific data they target. Dedicated lossy compression algorithms also exist for general trajectory data and road network trajectory data; these methods usually directly delete sampling points from the original trajectory that do not affect data precision.

Because it is sampled from a mobile positioning device, raw road network trajectory data is typically represented as a triple sequence $T = \langle (x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n) \rangle$. However, this representation contains unnecessary redundancy, which is detrimental to data compression. In view of the limitations of the original data representation, a new trajectory format and a corresponding data decomposition method are provided to directly reduce data redundancy, so that the data is easily processed by a compression algorithm.

Disclosure of Invention

The compression rate is one of the key indicators for measuring the performance of a data compression algorithm, and is generally defined as the ratio of the size of the original data to the size of the compressed data. Given an original trajectory $T$ of size $|T|$ and a compressed trajectory $T^c$ of size $|T^c|$, the compression ratio is

$$r = \frac{|T|}{|T^c|}.$$

For example, if the original data size is 2 KB and the compressed data size is 1 KB, the data compression rate is 2.
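By way of illustration only, the definition of the compression rate can be written as a small Python function. The function name `compression_ratio` is an assumption made for this sketch and does not appear in the original text.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Return original size divided by compressed size (same unit for both)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


# The example from the text: 2 KB of original data compressed to 1 KB.
print(compression_ratio(2 * 1024, 1 * 1024))  # prints 2.0
```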

First consider applying a general lossless compression algorithm (source coding) to the original trajectory. Since the theoretical basis of data compression is information theory, the problem of trajectory data compression can be analyzed from the perspective of information entropy. It can be proved that if a classical lossless compression algorithm, whether entropy coding or dictionary coding, is applied directly to the original trajectory data, its compression ratio becomes very low as the precision of the real-valued data increases.

Theorem: when the precision of real number data is improved, both entropy encoding and dictionary encoding tend to have a compression rate of 1 for trajectory data.

Proof: We first show that entropy coding is inefficient for high-precision real data. Let $X$ be a continuous random variable with probability density function $p(x)$. To calculate the entropy of $X$, we first divide the sample space $\Omega = [a, b)$ of $X$ into $n$ equal parts, each cell having length $\Delta = (b - a)/n$. Let $[a, b)$ be divided into $\{[x_0, x_1), [x_1, x_2), \ldots, [x_{n-1}, x_n)\}$ with $x_0 = a$ and $x_n = b$. The probabilities of $X$ falling within each interval constitute a discrete distribution, whose probabilities can be calculated by integration:

$$p_i = \int_{x_{i-1}}^{x_i} p(x)\,dx, \quad i = 1, 2, \ldots, n.$$

According to the mean value theorem for integrals, there must exist

$$\xi_i \in [x_{i-1}, x_i]$$

such that:

$$p_i = p(\xi_i)\,\Delta.$$

In other words, using $\sum_{i=1}^{n} p_i = 1$, the discrete distribution has an entropy (with logarithms taken to base 2, so that entropy is measured in bits) of

$$H_\Delta(X) = -\sum_{i=1}^{n} p_i \log p_i = -\sum_{i=1}^{n} p(\xi_i)\,\Delta \log p(\xi_i) - \log \Delta.$$

If the function $p(x) \log p(x)$ is Riemann integrable, then

$$\lim_{n \to \infty} \bigl( H_\Delta(X) + \log \Delta \bigr) = -\int_a^b p(x) \log p(x)\,dx = h(X),$$

where $h(X)$ is the differential entropy of the continuous distribution $X$.

Here $n$ is the precision of the data, because the finer the interval division, the more symbols are used to represent different data, i.e. the higher the precision of the data. If the data is not compressed,

$$\log n$$

bits can be used directly to store each symbol. According to Shannon's source coding theorem and the optimality of entropy coding algorithms, and since $-\log \Delta = \log n - \log(b - a)$, when $n$ goes to infinity:

$$\lim_{n \to \infty} r = \lim_{n \to \infty} \frac{\log n}{H_\Delta(X)} = \lim_{n \to \infty} \frac{\log n}{h(X) + \log n - \log(b - a)} = 1.$$

In summary, when the data precision is improved, the compression rate of entropy coding approaches 1, i.e. the data cannot be compressed.
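The limiting behaviour in the entropy-coding part of the proof can be checked numerically. The following is a minimal sketch under stated assumptions: the density $p(x) = 2x$ on $[0, 1)$ is chosen here only for illustration and is not taken from the original text, and the function names are hypothetical. For this density, the probability of the $i$-th of $n$ equal cells is $(2i - 1)/n^2$.

```python
import math


def discretized_entropy_bits(n: int) -> float:
    """Entropy in bits of the density p(x) = 2x on [0, 1) quantized into n equal cells."""
    total = 0.0
    for i in range(1, n + 1):
        p_i = (2 * i - 1) / (n * n)  # integral of 2x over [(i-1)/n, i/n)
        total -= p_i * math.log2(p_i)
    return total


def entropy_coding_ratio(n: int) -> float:
    """Uncompressed cost log2(n) divided by the entropy lower bound per symbol."""
    return math.log2(n) / discretized_entropy_bits(n)


# Raising the precision n = 2**bits drives the best achievable ratio toward 1.
for bits in (4, 8, 12, 16):
    print(bits, round(entropy_coding_ratio(2 ** bits), 4))
```

The printed ratios decrease toward 1 as the number of bits grows, which matches the statement that high-precision real data is barely compressible by entropy coding alone.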

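Unlike entropy coding, analyzing the compression effect of dictionary coding requires the joint entropy of the source distribution, but the proof proceeds similarly. Given any $k$ characters, let $p(x_1, x_2, \ldots, x_k)$ be the joint probability density function of $X_1, X_2, \ldots, X_k$. The optimal average code length $L_k$ necessarily satisfies:

$$H(X_1, X_2, \ldots, X_k) \le L_k < H(X_1, X_2, \ldots, X_k) + 1,$$

i.e. at least $H(X_1, X_2, \ldots, X_k)$ bits are needed to represent $k$ characters.

Similarly, we divide the sample space $\Omega = [a_1, b_1) \times [a_2, b_2) \times \cdots \times [a_k, b_k)$ into $n^k$ parts, each block of size

$$\Delta = \prod_{i=1}^{k} \frac{b_i - a_i}{n}.$$

The entropy of the discrete distribution is then calculated as:

$$H_\Delta(X_1, X_2, \ldots, X_k) = -\sum_{j=1}^{n^k} p(\xi_j)\,\Delta \log\bigl(p(\xi_j)\,\Delta\bigr) = -\sum_{j=1}^{n^k} p(\xi_j)\,\Delta \log p(\xi_j) - \log \Delta,$$

where $\xi_j$ is a point of the $j$-th block given by the mean value theorem.

Therefore, if there are $k$ items of data and their precision is $n$, at least $H_\Delta(X_1, X_2, \ldots, X_k)$ bits are needed to encode the data. If $p(x_1, x_2, \ldots, x_k) \log p(x_1, x_2, \ldots, x_k)$ is Riemann integrable, then:

$$\lim_{n \to \infty} \bigl( H_\Delta(X_1, X_2, \ldots, X_k) + \log \Delta \bigr) = h(X_1, X_2, \ldots, X_k),$$

where $h(X_1, X_2, \ldots, X_k)$ is the joint differential entropy. Finally, since $-\log \Delta = k \log n - \sum_{i=1}^{k} \log(b_i - a_i)$, we can calculate the compression ratio $r$:

$$\lim_{n \to \infty} r = \lim_{n \to \infty} \frac{k \log n}{H_\Delta(X_1, X_2, \ldots, X_k)} = \lim_{n \to \infty} \frac{k \log n}{h(X_1, X_2, \ldots, X_k) + k \log n - \sum_{i=1}^{k} \log(b_i - a_i)} = 1.$$

This completes the proof.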

Although lossy compression algorithms can achieve very high compression rates, they do so at the expense of data accuracy. Most existing algorithms delete sample points directly from the original trajectory, which results in a large deviation between the original trajectory and the compressed trajectory. If the accuracy requirement is high, i.e. the information loss caused by compression is tightly bounded, the compression rate of these lossy compression algorithms is also low. In the extreme case where zero information loss is required, lossy compression is equivalent to lossless compression, and these algorithms are as inefficient as the common lossless compression algorithms.

Based on the above discussion, it can be seen that the key factor limiting the data compression rate is the representation of the trajectory data, not the compression algorithm used. In fact, both entropy coding algorithms and dictionary coding algorithms have been proven to be optimal compression algorithms, i.e. they both reach the entropy (or entropy rate) of the source. In information theory, information entropy measures the uncertainty of a source: the higher the uncertainty of the source, the more information is obtained from its output, and the more data is needed to encode it. Consider the existing triple representation of a trajectory, which is suitable for representing an arbitrary trajectory in two-dimensional space. The shape of a road network trajectory, however, is severely constrained by the roads, and its uncertainty is significantly smaller than that of an arbitrary two-dimensional trajectory. In other words, the original trajectory representation introduces unnecessary uncertainty (unnecessary information), which makes trajectory data represented by the original triples difficult to compress.

The invention removes unnecessary uncertainty in the data by reducing the dimensionality of the data. If the three-dimensional trajectory data $(x_i, y_i, t_i)$ is instead represented in the two-dimensional form $(d_i, t_i)$, its base compression ratio already reaches 1.5, since three values per sampling point are reduced to two. Note that this conversion must be lossless, i.e. there must be a one-to-one correspondence between the data before and after conversion; otherwise it is equivalent to applying lossy compression directly to the data. Transforming the representation of the data in advance, before data compression, directly improves the compression rate and makes the data easy to process by a subsequent compression algorithm.

In the original trajectory, the sample point $(x_i, y_i, t_i)$ indicates that the target is at position $(x_i, y_i)$ at time $t_i$. Let $(x_1, y_1)$ be the starting sample point of the trajectory; the distance $d_i$ traveled from the starting position $(x_1, y_1)$ to the current position $(x_i, y_i)$ is then determined. Conversely, if only the distance traveled from the starting point is known, the corresponding position $(x_i, y_i)$ is difficult to determine. Therefore, in order to establish a one-to-one correspondence between the original trajectory and the decomposed trajectory, the road sequence $\langle e_1, e_2, \ldots, e_m \rangle$ must additionally be stored, where each $e_i$ is an edge in $E$ and $m$ is the number of roads traversed by the trajectory.

Up to this point, the trajectory has been decomposed into two parts, namely spatial data (the road sequence) and temporal data (the distance-time sequence); this is the new format of trajectory data. The road sequence of the trajectory $T$ is the series of consecutive roads traversed by $T$ in the road network $G = (V, E)$, i.e. $SP_T = \langle e_1, e_2, \ldots, e_m \rangle$. Here $V$ is the set of graph vertices (i.e. road intersections) and $E$ is the set of edges connecting graph vertices (i.e. road segments connecting intersections), with the vertices written as $\langle v_0, v_1, v_2, \ldots, v_m \rangle$ and the edges as $\langle e_1, e_2, \ldots, e_m \rangle$, where $v_i$ is the end point of the directed edge $e_{i-1}$, or equivalently the start point of the edge $e_i$. The distance-time sequence of the trajectory $T$ is a series of pairs $(d_i, t_i)$, where $d_i$ is the total distance the target has moved from the starting point up to time $t_i$, i.e. $TS_T = \langle (d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n) \rangle$.

Given any road network trajectory $T$, both the decomposition of the original trajectory and the restoration of the original trajectory from its decomposition can be completed in $O(|T|)$ time. After data decomposition, the trajectory data is converted into a road sequence and a distance-time sequence. Next, lossless compression is applied to the road sequence and lossy compression to the distance-time sequence. Lossless compression is used for the road sequence because it is an integer sequence with low information entropy, whereas the distance-time sequence still consists of real-valued data with high information entropy, so lossy compression is required.
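The linear-time restoration can be illustrated by a short Python sketch. It is not the patented implementation: it assumes that each distance is measured from the start point of the first road in the road sequence, that the distances are non-decreasing, and that the road lengths are available from the road network. The names `restore_positions` and `edge_length`, and the lengths used, are hypothetical. A sample lying exactly on an intersection is assigned to the next road. Converting an (edge, offset) pair back to coordinates additionally needs the road geometry and is omitted here.

```python
from typing import Dict, List, Tuple


def restore_positions(
    road_seq: List[str],
    dist_time: List[Tuple[float, float]],
    edge_length: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Map each (d_i, t_i) back to (edge id, offset from edge start, t_i)."""
    positions = []
    k = 0            # index of the current edge in road_seq
    consumed = 0.0   # total length of the edges before road_seq[k]
    for d, t in dist_time:
        # Move forward while d reaches the end of the current edge;
        # a point exactly on the boundary belongs to the next edge.
        while k + 1 < len(road_seq) and d >= consumed + edge_length[road_seq[k]]:
            consumed += edge_length[road_seq[k]]
            k += 1
        positions.append((road_seq[k], d - consumed, t))
    return positions


# Hypothetical road lengths and samples, for illustration only.
lengths = {"e15": 100, "e16": 80, "e13": 60}
pairs = [(0, 0), (130, 10), (180, 20), (200, 30)]
print(restore_positions(["e15", "e16", "e13"], pairs, lengths))
```

Both indices only move forward, so the work is proportional to the number of samples plus the number of roads.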

According to the analysis, the method for re-representing road network track data provided by the invention is to decompose the road network track data into two parts of spatial data and time data; wherein:

(1) the original GPS sampling trajectory format is $T = \langle (x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n) \rangle$, where the sampling point $(x_i, y_i, t_i)$ indicates that the moving object is at the two-dimensional coordinate position $(x_i, y_i)$ at time $t_i$; the coordinate values $x_i$, $y_i$ and the timestamp $t_i$ are all real-valued data;

(2) the trajectory is broken into two parts: spatial data and temporal data;

the spatial data is a road number sequence and is used for representing the spatial shape of the track;

the time data is a sequence of distance-time pairs and is used for representing the change of speed along the track;

the road number sequence is obtained as follows:

(1) the trajectory data after map matching no longer contains GPS sampling errors, i.e. the track points have been corrected so that the positions of the sampling points do not deviate from the corresponding map roads;

(2) after map matching, each sampling point lies on a map road, so the road number corresponding to each sampling point can be obtained. The road number sequence $SP_T = \langle e_1, e_2, \ldots, e_m \rangle$ corresponding to the original sampling point sequence is the spatial data after decomposition, where $e_i$ is an edge in $E$ and $m$ is the number of roads through which the trajectory passes;

(3) the spatial data can also be represented by a sequence of map vertices, i.e. $SP_T = \langle v_0, v_1, v_2, \ldots, v_m \rangle$, where $v_i$ is the end point of the directed edge $e_{i-1}$, or equivalently the start point of the edge $e_i$; the vertex representation and the edge representation of the road sequence are equivalent.
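The equivalence of the edge representation and the vertex representation can be sketched as two conversions. This is an illustrative sketch only: the tables `endpoints` and `edge_of`, the function names and the vertex labels are hypothetical, and the sketch assumes that at most one directed edge exists for each ordered pair of vertices.

```python
from typing import Dict, List, Tuple


def edges_to_vertices(road_seq: List[str], endpoints: Dict[str, Tuple[str, str]]) -> List[str]:
    """Turn a connected edge sequence into the vertex sequence it passes through."""
    if not road_seq:
        return []
    vertices = [endpoints[road_seq[0]][0]]  # start vertex of the first edge
    for e in road_seq:
        start, end = endpoints[e]
        if start != vertices[-1]:
            raise ValueError(f"edge {e} does not continue from {vertices[-1]}")
        vertices.append(end)
    return vertices


def vertices_to_edges(vertex_seq: List[str], edge_of: Dict[Tuple[str, str], str]) -> List[str]:
    """Turn a vertex sequence back into the edge sequence (one edge per vertex pair)."""
    return [edge_of[(u, v)] for u, v in zip(vertex_seq, vertex_seq[1:])]


# Hypothetical fragment of a road network: edge id -> (start vertex, end vertex).
endpoints = {"e15": ("v1", "v2"), "e16": ("v2", "v3"), "e13": ("v3", "v4")}
edge_of = {pair: e for e, pair in endpoints.items()}

vs = edges_to_vertices(["e15", "e16", "e13"], endpoints)
print(vs)
print(vertices_to_edges(vs, edge_of))
```

A sequence of m edges always yields m + 1 vertices, so either form can be stored and the other recovered in one pass.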

The distance-time pair has the form $(d_i, t_i)$, where $d_i$ is the total distance the target has moved from the starting point up to time $t_i$; the sequence of pairs $TS_T = \langle (d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n) \rangle$ is the time data after decomposition.

The track decomposition method for decomposing the original track into the format comprises the following specific steps:

(1) map-match the input trajectory so that each sampling point corresponds to a road;

(2) output the road number $e_i$ corresponding to each sampling point $(x_i, y_i, t_i)$; for consecutive repeated entries, only one is retained;

(3) calculate the distance traveled in the road network between every two adjacent sampling points $(x_{i-1}, y_{i-1})$ and $(x_i, y_i)$ of the trajectory, denoted $l_i$, where $l_1 = 0$;

(4) for each sampling point $(x_i, y_i, t_i)$, output

$$d_i = \sum_{j=1}^{i} l_j$$

as the $d_i$ of the distance-time pair $(d_i, t_i)$; the timestamp is unchanged.
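Steps (2) to (4) above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the definitive implementation: map matching (step (1)) is assumed to have been done already, each matched sample is assumed to be given as (edge id, offset from the start of that edge, timestamp), and consecutive samples are assumed to lie on the same road or on directly connected roads. The function name `decompose` and all lengths and offsets are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float, float]  # (edge id, offset from edge start, timestamp)


def decompose(samples: List[Sample], edge_length: Dict[str, float]):
    """Split map-matched samples into a road sequence and a distance-time sequence."""
    road_seq: List[str] = []
    dist_time: List[Tuple[float, float]] = []
    d = 0.0
    prev_edge, prev_offset = None, 0.0
    for edge, offset, t in samples:
        if prev_edge is None:
            step = 0.0                       # step (3): l_1 = 0
        elif edge == prev_edge:
            step = offset - prev_offset      # both samples on the same road
        else:
            # Assumed: `edge` directly follows `prev_edge` in the network.
            step = edge_length[prev_edge] - prev_offset + offset
        d += step                            # step (4): d_i = l_1 + ... + l_i
        if not road_seq or road_seq[-1] != edge:
            road_seq.append(edge)            # step (2): keep one of repeated entries
        dist_time.append((d, t))
        prev_edge, prev_offset = edge, offset
    return road_seq, dist_time


# Hypothetical lengths and offsets, for illustration only.
edge_length = {"e15": 100.0, "e16": 80.0, "e13": 60.0, "e6": 90.0, "e3": 50.0}
samples = [("e15", 0.0, 0.0), ("e16", 30.0, 10.0), ("e13", 0.0, 20.0),
           ("e6", 40.0, 30.0), ("e3", 20.0, 40.0)]
road_seq, dist_time = decompose(samples, edge_length)
print(road_seq)
print(dist_time)
```

The loop visits each sample once, in keeping with the linear-time property stated for the decomposition. Sparse sampling that skips whole roads would need a path search between samples, which this sketch does not attempt.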

In the track calculation, the method can reduce the track storage and query cost in the database.

Drawings

FIG. 1 is a sample road network, including 12 intersections and 17 roads.

FIG. 2 shows two sample traces on the road network.

Detailed Description

The data format and trajectory decomposition method are described below in conjunction with example road networks and trajectories.

As shown in FIG. 1, the given road network contains 12 vertices (intersections) and 17 edges (roads). Consider trajectory $T_1$ (the blue trajectory); since the trajectories have all been map matched, all sampling points have been mapped onto roads. In FIG. 2, sampling point $t_{11}$ corresponds to edge $e_{15}$; sampling point $t_{12}$ corresponds to edge $e_{16}$; sampling point $t_{13}$ corresponds to edge $e_{13}$; sampling point $t_{14}$ corresponds to edge $e_6$; sampling point $t_{15}$ corresponds to edge $e_3$. Note that if a sampling point happens to fall on an intersection, the next edge rather than the previous edge should uniformly be taken as the corresponding road sequence item; for example, sampling point $t_{13}$ corresponds to edge $e_{13}$ rather than $e_{16}$. Therefore, the road sequence of $T_1$ is $SP_1 = \langle e_{15}, e_{16}, e_{13}, e_6, e_3 \rangle$.

To calculate the corresponding distance-time sequence, the trajectory decomposition method is applied. From the calculated road sequence and the road shapes, the road network distance between each two adjacent sampling points can be calculated in turn. For example, in FIG. 1 the distance between $t_{11}$ and $t_{12}$ is $w(e_{15}) + \Delta_{11}$, where $w(e_{15})$ is the total length of road $e_{15}$ and $\Delta_{11}$ is the distance of $t_{12}$ from the start point of $e_{16}$. To calculate the distance between two adjacent points, the geographical shape of the road must be known. Roads in a road network are generally stored as polylines consisting of several two-dimensional coordinate points, and the shape of the actual road can be approximated by linking these points in order. The distance between sampling points on a road can be calculated from the shape of the road; only the Euclidean distance, or the spherical distance when longitude and latitude coordinates are used, between two-dimensional coordinates needs to be computed. As shown in FIG. 1, the distance between $t_{11}$ and $t_{12}$ is $l_1 = w(e_{15}) + \Delta_{11}$; the distance between $t_{12}$ and $t_{13}$ is $l_2 = w(e_{16}) - \Delta_{11}$; the distance between $t_{13}$ and $t_{14}$ is $l_3 = w(e_{13}) + \Delta_{12}$; the distance between $t_{14}$ and $t_{15}$ is $l_4 = w(e_6) - \Delta_{12} + \Delta_{13}$.
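The polyline distance computation described above can be illustrated as follows. This is a sketch only: the function names are hypothetical, points are assumed to be (x, y) tuples in a planar coordinate system for the Euclidean case and (longitude, latitude) tuples in degrees for the spherical case, and the haversine formula with a mean Earth radius of 6,371,000 m is one common choice of spherical distance, not one prescribed by the original text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polyline_length(points: List[Point]) -> float:
    """Length of a polyline in planar coordinates (sum of Euclidean segment lengths)."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float,
                radius_m: float = 6_371_000.0) -> float:
    """Great-circle distance in metres between two longitude/latitude points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def polyline_length_lonlat(points: List[Point]) -> float:
    """Length in metres of a polyline given as (longitude, latitude) points."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# A hypothetical road stored as three planar points: segments of length 5 and 6.
print(polyline_length([(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]))  # prints 11.0
```

The total length of a road, and the offset of a sampling point along it, are both sums of such segment lengths.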

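Then, according to the trajectory decomposition method, the distances between adjacent sampling points are accumulated starting from the first sampling point, whose distance is 0.

The summation yields:

$$TS_1 = \langle (0, t_{11}), (l_1, t_{12}), (l_1 + l_2, t_{13}), (l_1 + l_2 + l_3, t_{14}), (l_1 + l_2 + l_3 + l_4, t_{15}) \rangle.$$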

In practice, to facilitate processing of the time data, the start point of the road on which the first sampling point lies may be used as the start point of the whole trajectory. For example, for $T_2$ (the red trajectory) in FIG. 2, the distance of each sampling point is calculated from $v_5$ rather than from sampling point $t_{21}$. Thus we obtain

And time data

Claims

1. A method for re-representing road network track data is characterized in that road network track data is decomposed into two parts of spatial data and time data; wherein:

(1) setting the original GPS sampling track format as $T = \langle (x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n) \rangle$, where $n$ is the length of the track and the sampling point $(x_i, y_i, t_i)$ indicates that the moving object is at the two-dimensional coordinate position $(x_i, y_i)$ at time $t_i$; the coordinate values $x_i$, $y_i$ and the timestamp $t_i$ are all real-valued data;

(2) the trajectory is broken into two parts: spatial data and temporal data;

the road number sequence is specifically represented in the form:

(2) after map matching, each sampling point lies on a map road, and the road number corresponding to each sampling point can be obtained; the road number sequence $SP_T = \langle e_1, e_2, \ldots, e_m \rangle$ corresponding to the original sampling point sequence is the spatial data after decomposition; wherein $e_i$ is an edge in $E$; $m$ is the number of roads passed by the track; $E$ is the set of edges connecting the vertices of the graph, i.e. the road segments connecting intersections;

(3) representing the spatial data by a sequence of map vertices, i.e. $SP_T = \langle v_0, v_1, v_2, \ldots, v_m \rangle$, wherein $v_i$ is the end point of the directed edge $e_{i-1}$, or the start point of the edge $e_i$;

the distance-time pair has the form $(d_i, t_i)$, wherein $d_i$ is the total distance the target has moved from the starting point up to time $t_i$; the sequence of pairs $TS_T = \langle (d_1, t_1), (d_2, t_2), \ldots, (d_n, t_n) \rangle$ is the time data after decomposition.

2. The method for re-representing road network trajectory data according to claim 1, wherein said decomposing road network trajectory data into two parts of spatial data and temporal data comprises the following steps:

(2) outputting the road number $e_i$ corresponding to each sampling point $(x_i, y_i, t_i)$; for consecutive repeated entries, only one is retained;

(3) calculating the distance traveled in the road network between every two adjacent sampling points $(x_{i-1}, y_{i-1})$ and $(x_i, y_i)$ of the track, denoted $l_i$, wherein $l_1 = 0$;

(4) for each sampling point $(x_i, y_i, t_i)$, outputting

$$d_i = \sum_{j=1}^{i} l_j$$

as the $d_i$ of the distance-time pair $(d_i, t_i)$; the timestamp is unchanged.