CN103136239B

CN103136239B - Transportation data loss recovery method based on tensor reconstruction

Info

Publication number: CN103136239B
Application number: CN201110384954.3A
Authority: CN
Inventors: 谭华春; 王武宏; 冯广东; 冯建帅; 成斌; 夏红卫; 吴艳新; 朱湧; 阳钟兴
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2011-11-29
Filing date: 2011-11-29
Publication date: 2015-03-25
Anticipated expiration: 2031-11-29
Also published as: CN103136239A

Abstract

The invention discloses a transportation data loss recovery method based on tensor reconstruction. The method addresses two shortcomings of conventional vector-based and matrix-based recovery methods: low precision, and the inability to handle data lost over several days. The method comprises: (a) constructing the transportation data in multi-dimensional tensor form, with lost tensor data indicated by a mark tensor; (b) unfolding the tensor data along each mode, calculating the correlation within each mode, and obtaining a weight for each mode; and (c) establishing an objective function for recovering the lost values and solving it using the constructed tensor data and the mode weights. Because the method is built on a multi-dimensional tensor model, it retains all spatio-temporal transportation information, makes full use of multi-mode correlation, and preserves the original multi-dimensional structure of the transportation data. Its recovery precision is clearly superior to that of conventional vector-based or matrix-based methods, and it handles well the extreme case in which several days of data are lost.

Description

Traffic data loss recovery method based on tensor reconstruction

Technical Field

The invention belongs to the field of intelligent traffic, and particularly relates to a traffic data loss recovery method.

Background

Traffic data loss recovery is an important problem in intelligent traffic systems, and recovering lost traffic data improves the functioning of such systems. For example, traffic information distribution systems and traffic management systems all need complete and accurate traffic data. In practice, however, traffic data is often incomplete because of equipment failures, transmission errors and the like; related research reports loss rates of 16% to 93%. As a result, some intelligent traffic subsystems cannot work normally, so the lost values of incomplete traffic data need to be estimated.

At present, recovery methods for lost traffic data fall mainly into two types. Vector-based methods arrange the traffic data as a vector and recover the lost values by interpolation or regression. Matrix-based methods arrange the traffic data as a matrix and recover the lost values using matrix reconstruction theory. Both types have limitations. The former can only be used when the loss rate is extremely low, relies solely on information from points near the lost point within a single mode, and has low recovery accuracy. The latter improves recovery accuracy to some extent by exploiting the correlation of traffic data across two modes, but cannot be used when the loss rate is high. In addition, neither type makes full use of the multi-dimensional correlation of traffic data, which severely limits any improvement in recovery precision, and neither can handle traffic data lost for one or even several days.

Tensor-reconstruction-based recovery of lost data studies the multi-dimensional correlation characteristics of traffic data, establishes a correlation criterion, and determines the weight of each mode, so that the multi-mode correlation and the spatio-temporal traffic information are fully used to obtain the best estimate of each lost point and thereby complete the traffic data. Compared with the prior art, it preserves the original structure of the traffic data, gives more accurate results, and still performs well when one or more days of data are lost. Some scholars estimate lost values in matrix-form traffic data using probabilistic models based on Principal Component Analysis (PCA); the most notable is the matrix-based traffic data loss recovery method using Bayesian Principal Component Analysis (BPCA) recently adopted by Qu et al. For an introduction to this method, see the paper "A BPCA Based Missing Value Imputing Method for Traffic Flow Volume Data" (Qu et al., 2008 IEEE Intelligent Vehicles Symposium, June 2008) and the paper "PPCA-Based Missing Data Imputation for Traffic Flow Volume: A Systematic Approach" (Qu et al., IEEE Transactions on Intelligent Transportation Systems (ITS), September 2009). The core of the method is to capture the principal components and global structure of the matrix-form traffic data through PCA, but the method is effective only when the loss rate is below 50%, so it cannot meet the need to recover traffic data in the extreme case of one or more lost days.

Against this background, it is important to develop a recovery method that improves recovery accuracy and adapts to the various loss situations.

Disclosure of Invention

Aiming at the limitation of the existing traffic data loss recovery method, the invention aims to provide a traffic data loss recovery method based on tensor reconstruction, which can improve the recovery precision of lost values and can handle special conditions of random loss of up to 90 percent, loss of one day or more and the like.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a traffic data loss recovery method based on tensor reconstruction comprises the following steps:

A. according to the multiple distribution rule of the traffic data, the traffic data is constructed into a multi-dimensional tensor data form, and the lost points of the traffic data are represented by marks;

B. the correlation of the data in each mode is calculated, and the correlation coefficients of the modes are normalized to obtain the weight of each mode;

C. a traffic data loss recovery objective function in tensor form is established, the objective function is transformed using tensor reconstruction theory, and, by combining the lost-point marks and the mode weights, a traffic data recovery model based on tensor reconstruction theory is constructed; the model achieves a good recovery effect for both random loss and special loss.

The expression for solving the complete traffic data is as follows:

constraint conditions are as follows:

wherein A represents the original traffic data and Â represents the recovered traffic data; Ω is the mark tensor, which marks the lost points: its elements are 0 where traffic data is lost and 1 elsewhere. C represents the maximum traffic capacity and ensures that the recovered values are consistent with actual traffic conditions.

Transforming the objective function by the Lagrangian method gives:

$$f_\Omega:\ \min_{\hat{A}} \left\| \Omega * (A - \hat{A}) \right\|_F^2 + \frac{\lambda}{2} \left\| \hat{A} - C \right\|_F^2 \qquad (2)$$

In this method, to avoid singular value decomposition, a Tucker decomposition, which is one form of tensor decomposition, is applied to the data to be recovered, Â.

Here S ×_1 X ×_2 Y ×_3 Z denotes the Tucker decomposition of the tensor (S is the core tensor and X, Y, Z are the factor matrices of the three modes), and λ is the Lagrange coefficient. The Tucker decomposition handles the tensor rank problem well: it works with the mode ranks of the tensor, that is, the ranks of the matrices obtained by unfolding the tensor along each mode, and in practical applications a Tucker model expresses the low-rank property of multi-dimensional data more easily.
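As an illustration of the notation only, the following minimal Python/NumPy sketch rebuilds a tensor from a Tucker core and three factor matrices and evaluates the penalized objective of formula (2). The helper names (`tucker_to_tensor`, `objective`), the tensor sizes, the ranks, the capacity value and the value of λ are assumptions chosen for the example; they are not taken from the patent.

```python
import numpy as np

def tucker_to_tensor(S, X, Y, Z):
    """Return S x_1 X x_2 Y x_3 Z: the core multiplied by one factor per mode."""
    # S: (r1, r2, r3); X: (n1, r1); Y: (n2, r2); Z: (n3, r3)
    return np.einsum("abc,ia,jb,kc->ijk", S, X, Y, Z)

def objective(A, A_hat, omega, C, lam):
    """||omega * (A - A_hat)||_F^2 + (lam / 2) * ||A_hat - C||_F^2, as in formula (2)."""
    fit = np.sum((omega * (A - A_hat)) ** 2)          # error on observed entries only
    capacity = 0.5 * lam * np.sum((A_hat - C) ** 2)   # penalty tied to the capacity C
    return fit + capacity

rng = np.random.default_rng(0)
S = rng.normal(size=(2, 2, 2))            # hypothetical core with mode ranks (2, 2, 2)
X = rng.normal(size=(16, 2))
Y = rng.normal(size=(12, 2))
Z = rng.normal(size=(24, 2))
A_hat = tucker_to_tensor(S, X, Y, Z)      # candidate recovery, shape (16, 12, 24)

A = A_hat + rng.normal(scale=0.1, size=A_hat.shape)     # synthetic "original" data
omega = (rng.random(A.shape) > 0.3).astype(float)       # 0 marks a lost point
C = np.full(A.shape, 5.0)                               # assumed capacity tensor
value = objective(A, A_hat, omega, C, lam=0.1)
```

Every unfolding of `A_hat` has rank at most 2 here, which is the mode-rank property described above.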

In addition, the multi-dimensional traffic data is decomposed into the modes X, Y and Z by the Tucker decomposition, and each mode is given a different weight according to the weight distribution algorithm, so that the multi-mode information is used optimally; the problem is then solved by a first-order (or second-order) optimization method.

Because the ordinary Lagrangian method suffers from sublinear convergence, the invention uses the Augmented Lagrange Multiplier (ALM) method for optimization. The ALM method has been applied successfully in matrix recovery algorithms, its convergence has been proven to be Q-linear, and it is markedly faster than the ordinary Lagrangian method.

First, the road traffic capacity constraint in formula (1) can be suitably relaxed and transformed into the following form:

L-C≤0 (4)

where each element of C is the maximum traffic capacity of the road. Formula (1) can then be converted to:

min_{L,S}: ||L||_* + η||S||_1 (5)

constraint conditions: A = L + S, L − C ≤ 0

Its augmented Lagrangian function can be expressed as:

wherein L represents the finally recovered traffic data (low-rank part), S represents the noise-contaminated data (sparse part), A is the distorted traffic data, α and β are positive numbers, and M and N are Lagrange multipliers. Compared with the ordinary Lagrangian method, introducing the Lagrange multipliers in this formula greatly improves the convergence rate.
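To make the low-rank plus sparse split of formula (5) concrete, the sketch below applies the well-known inexact augmented Lagrange multiplier scheme for robust PCA to a matrix. It is a simplified stand-in, not the patent's solver: it works on a matrix rather than a tensor, it omits the capacity constraint L − C ≤ 0, and it uses singular value thresholding, which the method described here avoids through the Tucker decomposition. The names `Y`, `mu` and `rho` and all parameter values are assumptions and do not map one-to-one onto the M, N, α and β of the text.

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise shrinkage: the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def rpca_inexact_alm(A, eta=None, rho=1.5, tol=1e-7, max_iter=500):
    """Approximately solve min ||L||_* + eta * ||S||_1 subject to A = L + S."""
    m, n = A.shape
    if eta is None:
        eta = 1.0 / np.sqrt(max(m, n))           # common default, an assumption here
    spectral = np.linalg.norm(A, 2)
    Y = A / max(spectral, np.abs(A).max() / eta)  # multiplier for A = L + S
    mu = 1.25 / spectral                          # penalty parameter, grown each step
    mu_max = mu * 1e7
    L = np.zeros_like(A)
    S = np.zeros_like(A)
    for _ in range(max_iter):
        L = singular_value_threshold(A - S + Y / mu, 1.0 / mu)
        S = soft_threshold(A - L + Y / mu, eta / mu)
        residual = A - L - S
        Y = Y + mu * residual                     # multiplier update
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(residual) <= tol * np.linalg.norm(A):
            break
    return L, S

# Synthetic check: a rank-2 matrix with 5% of its entries grossly corrupted.
rng = np.random.default_rng(0)
L_true = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 100))
S_true = np.zeros((100, 100))
corrupt = rng.random((100, 100)) < 0.05
S_true[corrupt] = rng.uniform(-10, 10, size=corrupt.sum())
A = L_true + S_true
L_est, S_est = rpca_inexact_alm(A)
rel_err = np.linalg.norm(L_est - L_true) / np.linalg.norm(L_true)
```

The multiplier update is what distinguishes this scheme from a plain penalty method: the constraint A = L + S is enforced progressively rather than only in the limit of an infinite penalty.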

The step A is as follows:

Traffic data is usually collected as a time series and is commonly treated as a one-dimensional vector or a two-dimensional matrix. However, traffic data has multi-dimensional correlations: for example, high similarity (approximate periodicity) exists from day to day, from week to week, and from hour to hour. To exploit these correlations as fully as possible when estimating traffic data, the traffic data needs to be constructed as a multi-dimensional tensor.

The traffic data multidimensional tensor data form is determined according to the following expression:

A ∈ R^{Week×Day×Hour} (7)

wherein A represents traffic data; week denotes "Week" mode; day represents "Day" mode; hour stands for "hours" mode.

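Traffic data in which loss occurs is represented as:

A′ = Ω*A

wherein each element of Ω is 0 where the corresponding traffic data is lost and 1 elsewhere,

and wherein Ω is the mark tensor, A represents the traffic data, and A′ represents the traffic data with loss.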
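Step A can be sketched in a few lines of Python/NumPy. The example assumes a hypothetical hourly series covering 4 weeks, with lost readings stored as NaN; the sizes, the synthetic values and the variable names are illustrative assumptions only.

```python
import numpy as np

weeks, days, hours = 4, 7, 24
rng = np.random.default_rng(0)

# Synthetic hourly traffic volumes in time order; about 20% of readings are lost.
series = rng.uniform(0, 100, size=weeks * days * hours)
lost = rng.random(series.size) < 0.2
series[lost] = np.nan

# Fold the time series into a Week x Day x Hour tensor.
A = series.reshape(weeks, days, hours)

# Mark tensor: 0 where the reading is lost, 1 elsewhere.
omega = (~np.isnan(A)).astype(float)

# Masked data omega * A; np.where avoids NaN * 0, which would stay NaN.
A_obs = np.where(omega == 1, A, 0.0)
```

Because the series is in time order, the last axis varies fastest, so `A[w, d, h]` is the reading for hour `h` of day `d` in week `w`.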

The step B comprises the following steps:

B1, unfolding the multi-dimensional traffic data along each mode and calculating the correlation of each mode using a similarity coefficient;

B2, normalizing the correlation of each mode and assigning a weight to each mode;

the weights of the modes are calculated according to the following expressions:

wherein w_k represents the weight on the k-th mode (0 ≤ w_k ≤ 1), and s_k denotes the similarity coefficient of the k-th mode (0 ≤ s_k ≤ 1).

The step B1 includes:

the tensor traffic data is unfolded along each mode, and the similarity coefficient of each mode is calculated according to the following expression:

$$s_k = \frac{\sum_{n_k \ge i > j \ge 1} R_k(i, j)}{n_k (n_k - 1)/2} \qquad (10)$$

wherein R_k(i, j) denotes the correlation coefficient matrix of the k-th mode; n_k is the number of data points of the k-th mode; and s_k denotes the similarity of the k-th mode, 0 ≤ s_k ≤ 1.
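Step B can be sketched as follows in Python/NumPy. Two points are assumptions rather than statements from the text: the weights are taken to be the similarity coefficients divided by their sum, and the correlations are computed on a complete tensor (the text does not say how lost entries are treated at this stage). The synthetic data is built so that all pairwise correlations are positive, which keeps s_k within [0, 1].

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def similarity_coefficient(T, mode):
    """Mean of the pairwise correlations R_k(i, j) with i > j, as in formula (10)."""
    R = np.corrcoef(unfold(T, mode))           # n_k x n_k correlation matrix
    n_k = R.shape[0]
    below_diagonal = R[np.tril_indices(n_k, k=-1)]
    return below_diagonal.sum() / (n_k * (n_k - 1) / 2)

def mode_weights(T):
    """Assumed normalization: w_k = s_k / sum of all s_k."""
    s = np.array([similarity_coefficient(T, k) for k in range(T.ndim)])
    return s / s.sum()

# Synthetic 16 x 12 x 24 tensor: a daily profile scaled per day and per interval.
rng = np.random.default_rng(0)
hour_profile = 100 + 50 * np.sin(np.linspace(0, 2 * np.pi, 24))
day_factor = rng.uniform(0.8, 1.2, size=16)
interval_factor = rng.uniform(0.9, 1.1, size=12)
A = (day_factor[:, None, None]
     * interval_factor[None, :, None]
     * hour_profile[None, None, :]
     + rng.normal(scale=1.0, size=(16, 12, 24)))

weights = mode_weights(A)   # one weight per mode, summing to 1
```

A mode whose slices resemble one another more closely receives a larger weight, so it contributes more to the recovery.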

The invention has the beneficial effects that:

The recovery method provided by the invention has the following advantages:

The method runs fast and performs well, can process traffic data with a loss rate of 99 percent, can handle the extreme case of one or more lost days, is broadly applicable, and greatly improves recovery precision over existing recovery methods. Actual traffic data of size 16 × 12 × 24 was selected, and comparative experiments were carried out in Matlab 7.0 on a Pentium(R) D PC. Fig. 4 and fig. 5 compare the method of the invention with the conventional vector-based recovery method and with the most recent traffic data loss recovery methods, such as those published in ITS '09 and ITS '10. In terms of recovery accuracy, the conventional vector-based method fails when the loss rate is large and cannot handle the extreme loss case. The recently proposed Zhang method suffers a sharp drop in recovery precision when the loss rate exceeds 50% and recovers poorly in extreme cases. The method of the invention is built on tensor reconstruction theory: its recovery precision is slightly better than that of the Zhang method when the loss rate is below 50% and remains good when the loss rate is above 50%, so precision is improved to varying degrees at both low and high loss rates, and the method performs better in extreme cases.

In addition, the method of the invention adopts the weight distribution method based on the correlation, thus greatly improving the utilization efficiency of the correlation of each mode and further improving the recovery precision.

Drawings

Fig. 1 is a flowchart of a traffic data loss recovery method based on tensor reconstruction in an embodiment of the present invention;

FIG. 2 is a flowchart of obtaining weights of modes in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a multi-dimensional traffic data tensor form of step A of the invention;

FIG. 4a and FIG. 4b show the random loss of traffic data and the effect of the recovery methods in embodiment 1, wherein fig. 4a is the original traffic data and fig. 4b compares the recovery effect of the Drift method with that of the method of the present invention; RMSE (Root Mean Square Error) is the root mean square error, and a smaller value indicates a better effect;

FIG. 5 compares the processing effect of the recovery method of the present invention with that of the Evrim Acar recovery method when traffic data is lost for one or more days in embodiment 2; RMSE (Root Mean Square Error) is the root mean square error, and a smaller value indicates a better effect.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

Example 1

In this embodiment, for the random loss of traffic data, as shown in fig. 1, the recovery process is performed in the following three steps:

1. constructing traffic data into tensor form and marking lost points

Since the traffic data of fig. 4(a) covers 16 days, it can be constructed as tensor traffic data by the following expression:

A ∈ R^{16×12×24} (11)

wherein A covers 16 days, 24 hours per day, and 12 five-minute intervals per hour. Since A has size 16 × 12 × 24, the mark tensor Ω also has size 16 × 12 × 24, and the traffic data with loss is determined by the following expression:

A′ = Ω*A (12)

where A′ represents the traffic data with loss.

2. Determining weights of modes

Please refer to fig. 2 for a detailed flowchart of the process, which includes the following steps:

First, the tensor traffic data is unfolded along each mode. Since the traffic data in this embodiment is 3-dimensional, it can be unfolded along 3 modes, giving matrices of size 16 × 288, 12 × 384 and 192 × 24 respectively.
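The unfolding sizes can be checked with a short Python/NumPy sketch. The `unfold` helper is an illustrative assumption; it places the chosen mode on the rows, so each unfolding has as many columns as the product of the other two dimensions (12·24 = 288, 16·24 = 384, 16·12 = 192).

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

A = np.arange(16 * 12 * 24, dtype=float).reshape(16, 12, 24)
shapes = [unfold(A, k).shape for k in range(3)]
# shapes == [(16, 288), (12, 384), (24, 192)]
```

With this convention the third unfolding is 24 × 192, the transpose of the 192 × 24 layout mentioned in the text; both hold the same entries.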

Then, the similarity coefficient of each of the three mode unfoldings is calculated. This step can be carried out with statistical software such as SPSS, and the calculation follows the expression:

$$s_k = \frac{\sum_{n_k \ge i > j \ge 1} R_k(i, j)}{n_k (n_k - 1)/2} \qquad (13)$$

And finally, carrying out normalization processing on the 3 mode similarity coefficients to obtain the weight of each mode.

3. Estimating missing values

Using the calculated mode weights, the contribution of each mode after Tucker decomposition is weighted in the lost-value objective function, and the transformed objective function is then solved by optimization, giving the recovered traffic data shown in fig. 4(b).

The expression for finding the missing values is as follows:

$$f_\Omega:\ \min_{\hat{A}} \left\| \Omega * (A - \hat{A}) \right\|_F^2 + \frac{\lambda}{2} \left\| \hat{A} - C \right\|_F^2 \qquad (14)$$

wherein A represents the original traffic data and Â represents the recovered traffic data; Ω is the mark tensor, which marks the lost points: its elements are 0 where traffic data is lost and 1 elsewhere; C represents the maximum traffic capacity and ensures that the recovered values are consistent with actual traffic conditions.
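For readers who want to experiment with low-rank tensor recovery, the sketch below fills lost entries by repeatedly projecting onto a low mode-rank Tucker model. It is a swapped-in, simplified technique and not the solver described here: it uses truncated higher-order SVD (so it does rely on singular value decomposition), it ignores the mode weights, and it enforces the capacity limit by simple clipping. The function names, ranks, iteration count and synthetic data are all assumptions.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def truncated_hosvd(T, ranks):
    """Project T onto the leading left singular vectors of each unfolding."""
    X, Y, Z = [np.linalg.svd(unfold(T, k), full_matrices=False)[0][:, :r]
               for k, r in enumerate(ranks)]
    core = np.einsum("ijk,ia,jb,kc->abc", T, X, Y, Z)
    return np.einsum("abc,ia,jb,kc->ijk", core, X, Y, Z)

def complete(A_obs, omega, ranks, capacity, n_iter=200):
    """Alternate between a low-rank projection and restoring observed entries."""
    observed = omega.astype(bool)
    A_hat = np.where(observed, A_obs, A_obs[observed].mean())  # start from mean fill
    for _ in range(n_iter):
        low_rank = np.clip(truncated_hosvd(A_hat, ranks), 0.0, capacity)
        A_hat = np.where(observed, A_obs, low_rank)   # keep measured values fixed
    return A_hat

# Synthetic 16 x 12 x 24 tensor with mode ranks (2, 2, 2) and about 30% of entries lost.
rng = np.random.default_rng(0)
core = rng.uniform(0.5, 1.5, size=(2, 2, 2))
X, Y, Z = (rng.uniform(0.5, 1.5, size=(n, 2)) for n in (16, 12, 24))
A_true = np.einsum("abc,ia,jb,kc->ijk", core, X, Y, Z)
omega = (rng.random(A_true.shape) > 0.3).astype(float)
A_obs = omega * A_true

A_hat = complete(A_obs, omega, ranks=(2, 2, 2), capacity=A_true.max())

missing = omega == 0
rmse = np.sqrt(np.mean((A_hat - A_true)[missing] ** 2))
rmse_mean_fill = np.sqrt(np.mean((A_obs[omega == 1].mean() - A_true)[missing] ** 2))
```

Comparing `rmse` with `rmse_mean_fill` shows how much the low-rank structure contributes beyond filling every gap with the observed mean.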

Further, for the case where one or more days of traffic data are lost, the processing according to embodiment 2 is sufficient.

Example 2

For traffic data of size 10 × 12 × 24, the tensor form is A ∈ R^{10×12×24}, and the corresponding mark tensor Ω has the same size. The values of the mark tensor when k days are lost are determined according to the following expression:

wherein,

wherein k represents the number of lost days; the traffic tensor data with loss can then be obtained.
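A mark tensor for whole-day loss can be built as in the following Python/NumPy sketch. The helper name `mask_lost_days` and the choice of which days are lost are assumptions for illustration; the sketch simply zeroes every entry of the selected days.

```python
import numpy as np

def mask_lost_days(shape, lost_days):
    """Mark tensor with all entries of the given day indices set to 0."""
    omega = np.ones(shape)
    omega[list(lost_days), :, :] = 0.0   # an entire day is lost
    return omega

# Hypothetical case: k = 2 lost days (indices 3 and 4) out of 10.
omega = mask_lost_days((10, 12, 24), lost_days=[3, 4])
observed_count = int(omega.sum())   # (10 - 2) * 12 * 24 = 2304
```

With whole days missing, no entry of the lost day is observed, so recovery must rely on the correlation with other days and on the other two modes.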

The remaining steps are the same as in embodiment 1: the weights of the modes are calculated and substituted into the objective function, and the lost values are finally estimated, giving the recovery effect shown in fig. 5. The recovery results in fig. 4 and fig. 5 were both obtained in a Matlab environment; if the method of the present invention is implemented in C++, the running time will be greatly reduced, enabling automatic, real-time recovery of lost traffic data.

It should be noted that the above disclosure is only specific examples of the present invention, and those skilled in the art can devise various modifications according to the spirit and scope of the present invention.

Claims

1. A traffic data loss recovery method based on tensor reconstruction comprises the following steps:

A. according to the multiple distribution rule of the traffic data, the traffic data is constructed into a multi-dimensional tensor data form, and the lost points of the traffic data are represented by marks, wherein the expression of the multi-dimensional tensor data is as follows:

A ∈ R^{Week×Day×Hour}

wherein A represents the multidimensional tensor data; Week stands for the "Week" mode; Day stands for the "Day" mode; Hour stands for the "hours" mode;

the traffic data for which loss occurs is expressed as follows:

wherein Ω is the mark tensor and A represents the traffic data;

B. normalizing the correlation coefficient of each mode by calculating the correlation of each mode data to obtain each mode weight, wherein each mode weight expression is as follows:

wherein w_k represents the weight on the k-th mode, 0 ≤ w_k ≤ 1, and s_k denotes the similarity coefficient of the k-th mode, 0 ≤ s_k ≤ 1;

C. establishing a traffic data loss recovery objective function in tensor form, transforming the objective function by the augmented Lagrange multiplier method, and constructing a traffic data recovery model based on tensor reconstruction theory by combining the lost-point marks and the mode weights; wherein the transformed objective function is as follows:

wherein A represents the original traffic data and Â represents the recovered traffic data; Ω is the mark tensor, which marks the lost points: its elements are 0 where traffic data is lost and 1 elsewhere; C represents the maximum traffic capacity;

to avoid singular value decomposition, a Tucker decomposition is performed on the data to be recovered, Â, to obtain:

wherein S ×_1 X ×_2 Y ×_3 Z represents the Tucker decomposition of the tensor; λ is the Lagrange coefficient.