CN110837888A

CN110837888A - Traffic missing data completion method based on bidirectional cyclic neural network

Info

Publication number: CN110837888A
Application number: CN201911106967.7A
Authority: CN
Inventors: 申彦明; 徐文权; 齐恒; 尹宝才
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-02-25

Abstract

The invention provides a method for completing missing traffic data based on a bidirectional recurrent neural network, and belongs to the field of traffic. The method exploits the time-series character of the data and considers the influence of the data both before and after the completion time point on the current time point, which greatly improves data utilization and completion accuracy. It also takes into account the influence of external features and of adjacent sensor data on the current sensor data, and adds both to the completion model, which further improves completion accuracy. The method improves completion accuracy not only at low data loss rates but also at high data loss rates.

Description

Method for completing missing traffic data based on a bidirectional recurrent neural network

Technical Field

The invention belongs to the field of traffic, and in particular relates to a method for completing missing traffic data based on a bidirectional recurrent neural network.

Background

Traffic data from road loop detectors is periodic, time-dependent and trending. At present, methods for completing traffic data are mainly based on the time series.

Time-series-based completion of traffic flow data takes the data from a period of time before the current missing point and fills in the missing point through a neural network. For example, to complete the traffic flow data at 16:00 today, the data from 8:00 to 15:00 of the same day are taken as input, and the data at the next time point, 16:00, is obtained through a recurrent neural network. This history-based approach makes good use of the time-series character of the data and gives relatively good results, but it has a limitation. When a special event occurs, the current missing point may itself be preceded by a run of missing points. For example, a power outage can cause the loss of a continuous segment of data; when the last missing point of that segment is completed, the input data is itself largely missing, so the completion is very poor.

Neural networks were originally inspired by, and designed to simulate, biological nervous systems; they consist of a large number of interconnected nodes (neurons). A neural network adjusts its weights in response to its inputs, improving its behavior and automatically learning a model that solves the problem. The LSTM (long short-term memory network) is a special form of RNN (recurrent neural network) that effectively mitigates the vanishing-gradient and exploding-gradient problems in training multi-layer networks and can process sequences with long-term dependencies. The LSTM can capture the time-series characteristics of the traffic flow data, and using an LSTM model can effectively improve completion accuracy.

An LSTM network consists of LSTM units, and each LSTM unit consists of a cell, an input gate, an output gate and a forget gate.

Forget gate: decides how much information to discard from the output state of the previous cell. The formula is as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

where $f_t$ is the output of the forget gate, $x_t$ is the input sequence, $h_{t-1}$ is the output of the previous cell, $\sigma_g$ denotes the sigmoid function, $W_f$ is the weight parameter matrix for the input, $U_f$ is the weight parameter matrix for the output of the previous cell, and $b_f$ is the bias parameter vector.

Input gate: decides how much new information to add to the cell state and updates the cell state c. The formulas are as follows:

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$$

where $c_t$ is the cell state of the current cell, $\sigma_g$ and $\sigma_c$ denote sigmoid functions, $\circ$ denotes the element-wise matrix product, $W_i$ is the weight parameter matrix for the input, $U_i$ is the weight parameter matrix for the output of the previous cell, $b_i$ is the bias parameter vector, $f_t$ is the output of the forget gate, $c_{t-1}$ is the cell state of the previous cell, $W_c$ is the weight parameter matrix for the input, $U_c$ is the weight parameter matrix for the output of the previous cell, and $b_c$ is the bias parameter vector.

Output gate: outputs the result based on the current cell state.

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

$$h_t = o_t \circ \sigma_h(c_t)$$

where $h_t$ is the output of the current cell, $\sigma_g$ and $\sigma_h$ denote sigmoid functions, $\circ$ denotes the element-wise matrix product, $W_o$ is the weight parameter matrix for the input, $U_o$ is the weight parameter matrix for the output of the previous cell, and $b_o$ is the bias parameter vector.
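The three gates can be read together as a single cell update. The following is a minimal scalar sketch of that update, not the patented implementation. The names `lstm_step` and `params` are hypothetical, the parameter values are arbitrary, and the use of tanh for $\sigma_c$ and $\sigma_h$ is an assumption taken from the common LSTM formulation (the text above describes them as sigmoid functions; pass `act=sigmoid` to follow the text literally).

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used for all three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p, act=math.tanh):
    """One scalar LSTM cell update.

    p maps each gate name ('f', 'i', 'o', 'c') to a (W, U, b) triple.
    act plays the role of sigma_c and sigma_h (tanh here is an assumption).
    """
    def pre(gate):
        W, U, b = p[gate]
        return W * x_t + U * h_prev + b

    f_t = sigmoid(pre('f'))                     # forget gate
    i_t = sigmoid(pre('i'))                     # input gate
    o_t = sigmoid(pre('o'))                     # output gate
    c_t = f_t * c_prev + i_t * act(pre('c'))    # new cell state
    h_t = o_t * act(c_t)                        # new cell output
    return h_t, c_t


# Hypothetical parameters, identical for every gate, for illustration only.
params = {gate: (0.5, 0.5, 0.0) for gate in 'fioc'}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:
    h, c = lstm_step(x, h, c, params)
```

In a real network the scalars become vectors and matrices, and the parameters are learned rather than fixed.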

Disclosure of Invention

The invention provides a method for completing missing traffic data based on a bidirectional recurrent neural network. It is a deep learning completion method that exploits the temporal, periodic and spatial characteristics of the data, and it aims to improve the completion accuracy of road traffic flow data.

The technical scheme of the invention is as follows:

A method for completing missing traffic data based on a bidirectional recurrent neural network comprises the following steps:

First, preprocess the traffic flow data.

The preprocessing comprises time granularity division and data standardization.

Second, randomly remove data points from the preprocessed data to construct a data set with missing points, and record the position information of the missing points so that the removed data can serve as verification values for checking the completion effect of the method.
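The construction of a data set with missing points can be sketched as follows. This is an illustrative sketch only: the function name `mask_random_points`, the use of `None` as the missing label, the fixed seed and the sample values are all assumptions, not details given in the text.

```python
import random


def mask_random_points(values, missing_rate, seed=0):
    """Randomly mark a share of points as missing.

    Returns the masked series (None marks a missing point) and a dict
    mapping each missing position to its original value, which serves
    as the verification value later.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * missing_rate)
    missing_idx = sorted(rng.sample(range(len(values)), n_missing))
    held_out = {i: values[i] for i in missing_idx}
    masked = list(values)
    for i in missing_idx:
        masked[i] = None
    return masked, held_out


# Hypothetical normalized or raw flow values for one sensor.
flows = [120.0, 135.0, 150.0, 142.0, 128.0, 110.0, 95.0, 101.0, 117.0, 130.0]
masked, held_out = mask_random_points(flows, missing_rate=0.2)
```

Keeping the held-out values keyed by position makes it straightforward to compare completed values with the originals at exactly the removed points.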

Meanwhile, a time dimension influence attenuation matrix is constructed. Missing data can be consecutive; for example, damage to a sensor's power supply element can cause data loss over a period of time. As the gap lengthens, the influence of historical data on the data at the missing point becomes smaller and smaller, which affects completion accuracy, so the attenuation of influence along the time dimension needs to be recorded. The time dimension influence attenuation matrix is defined as follows:

where $n_t$ denotes the current time; the accompanying term is defined as follows:

Third, divide the traffic data after loss processing into a training set, a verification set and a test set. In each data set, the different modules use the following types of data:

data for the forward time series deep learning module:

data for the reverse time series deep learning module:

external feature data used in the external feature module: $F_n$;

periodic sequence data used in the periodic feature module:

where n denotes the current time, t denotes the step length of the time series, and p denotes the step length of the periodic series. S denotes the traffic flow data, and T denotes the reversed sequence of S in the time dimension. $s_i$ denotes the traffic flow data at time n. The remaining symbols denote, in order, the traffic flow data at the same time of day i days before time n; the set of traffic flow data for the t time points before time n; and the set of traffic flow data at the same time of day on the p days before time n. $F_n$ denotes the external features at time n, including holidays, location area, weather and air temperature.
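The three sequence inputs described above can be sketched as slices of a single sensor series. This is a sketch under stated assumptions: the function name `build_inputs`, the parameter `steps_per_day`, the choice to exclude the current day from the periodic sequence, and the ordering of each sequence are illustrative and are not fixed by the text.

```python
def build_inputs(series, n, t, p, steps_per_day):
    """Build forward, reverse and periodic inputs for time index n.

    forward : the t points before n, oldest first
    reverse : the t points after n, latest first (time-reversed)
    periodic: the value at the same time of day on each of the p previous days
    """
    if n - t < 0 or n + t >= len(series) or n - p * steps_per_day < 0:
        raise ValueError("not enough data around index n")
    forward = series[n - t:n]
    reverse = list(reversed(series[n + 1:n + t + 1]))
    periodic = [series[n - day * steps_per_day] for day in range(1, p + 1)]
    return forward, reverse, periodic


# Toy series: 5 "days" of 4 steps each, values equal to their index.
series = list(range(20))
forward, reverse, periodic = build_inputs(series, n=10, t=3, p=2, steps_per_day=4)
# forward  -> [7, 8, 9]
# reverse  -> [13, 12, 11]
# periodic -> [6, 2]
```

With 5-minute granularity, `steps_per_day` would be 288.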

Fourth, construct the completion model. The completion model comprises a forward time series deep learning module, a reverse time series deep learning module, a periodic feature module and an external feature module. The structure and training mechanism of each module are as follows:

(1) Forward time series deep learning module: an LSTM model that combines a linear regression network with a multi-layer long short-term memory network. One layer of linear regression network adds information about the continuity of the current missing point in time, in order to handle long runs of missing data and thereby improve completion accuracy.

Implementation details of the forward time series deep learning module: the time dimension attenuation matrix is input into the linear regression network, and then the output of the linear regression network and the forward time series data are input into the LSTM network. For an input value $x_t$, if the data point is not missing it is input directly; if the data point is missing, the hidden state of the previous time step is taken as the input of the current time step. After the input is processed in this way, the deep learning network is trained, and the final output of the forward time series deep learning module is obtained through continued iterative updating.

(2) Reverse time series deep learning module: the network structure is the same as that of the forward time series deep learning module; the difference is that the input of the forward time series deep learning module is reversed in the time dimension and used as the input of this module.

(3) Periodic feature learning module: a module formed of three fully connected layers. It extracts features from the periodic data to capture how the traffic flow of the same sensor in the same time period varies across the historical data, and outputs the extracted features. Implementation details: the periodic sequence data is input into the fully connected layers, the time-series features of the periodic data are extracted through the three fully connected layers, and the result is output.

(4) External feature module: consists of two parts. The first part processes the holiday and weather features and is a feature encoding layer. Implementation details: the external feature data is input into the feature encoding layer, which converts the data into vector form; the resulting vector is then combined with the outputs of the other three modules.

The second part processes the spatial features. To take information about the road space into account, all sensors on the road section are input into the second part simultaneously. The hidden states of the other sensors at the time of the current sensor's missing point are then used as input, weights are computed through a Softmax network to obtain the output, and the output is fed into the forward and reverse time series deep learning modules.

Finally, the outputs of the four modules are combined into a one-dimensional vector, and the final completion result is obtained through one fully connected layer.
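The final fusion step, concatenating the module outputs and applying one fully connected layer, can be sketched as follows. The function name `fuse`, the vector sizes and all numeric values are hypothetical; in the method itself the weights are learned during training.

```python
def fuse(module_outputs, weights, bias):
    """Concatenate module outputs into one vector and apply a single
    fully connected layer with one output unit (the completed value)."""
    flat = [v for out in module_outputs for v in out]
    if len(flat) != len(weights):
        raise ValueError("weight vector must match the concatenated length")
    return sum(w * v for w, v in zip(weights, flat)) + bias


# Hypothetical outputs of the four modules.
forward_out = [0.2, 0.4]
reverse_out = [0.1, 0.3]
periodic_out = [0.5]
external_out = [1.0, 0.0]

completed = fuse(
    [forward_out, reverse_out, periodic_out, external_out],
    weights=[0.1] * 7,
    bias=0.05,
)
```

The concatenated vector here has length 7; the layer reduces it to a single completed flow value.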

Fifth, use the training set data to pre-train the pre-training parts of the forward and reverse time series deep learning modules, optimizing the parameters of the time series deep learning modules in advance so that overall training does not drive them to a local optimum.

Sixth, use the training set data and the verification set data to train the four modules established in the fourth step as a whole:

The preprocessed data is input into the corresponding modules, and all modules are trained together. After each training pass, the loss function value between the completed values and the true values of the traffic flow data is calculated, and the model parameters are trained toward their target values. The hyper-parameters of the model are tuned continually according to the model's performance on the training set and the verification set, improving completion accuracy while reducing overfitting.

The input data comprises: forward time series data (traffic flow data for the preceding $t_1$ hours), reverse time series data (traffic flow data for the following $t_2$ hours), periodic sequence data (traffic flow data at the same time of day for the preceding $t_3$ days), the time dimension influence attenuation matrix, external feature data $F_n$ (external feature data on holidays, region, weather and air temperature at time n), and the true value of the traffic flow data (the traffic flow data at the current time).

After one iteration, the traffic flow data after one completion pass is obtained. This data is used as the input of the next iteration. The previously missing points now carry completed values but are still labeled as missing, so in subsequent iterations the goal is still to complete the data at those points. Because the completed values are already relatively close to the true values, they provide prior knowledge, which can improve the convergence speed and completion accuracy of the model.
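The feedback loop described above, in which completed values are fed back while the missing labels are retained, can be sketched as follows. This is only an illustration of the control flow: the neighbor-averaging estimator stands in for the trained model and is an assumption, as are the function name `iterate_completion` and the sample data.

```python
def iterate_completion(values, is_missing, n_iters=3):
    """Repeatedly re-estimate the points labeled missing.

    Observed points never change. Points labeled missing keep their label,
    so every iteration re-estimates them, using the previous iteration's
    completed values as part of the input.
    """
    observed = [v for v, m in zip(values, is_missing) if not m]
    initial = sum(observed) / len(observed)      # crude first completion
    current = [initial if m else v for v, m in zip(values, is_missing)]
    for _ in range(n_iters):
        previous = list(current)
        for i, missing in enumerate(is_missing):
            if not missing:
                continue
            # Stand-in estimator: mean of the two neighbors from the
            # previous iteration (mirrored at the series boundaries).
            left = previous[i - 1] if i > 0 else previous[i + 1]
            right = previous[i + 1] if i < len(previous) - 1 else previous[i - 1]
            current[i] = (left + right) / 2
    return current


completed = iterate_completion(
    [10.0, None, None, 40.0],
    [False, True, True, False],
    n_iters=50,
)
```

In this toy case the two missing points settle near 20 and 30, showing how earlier completions act as a starting point for later iterations.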

Seventh, use the test set to complete the traffic flow data with the model trained in the sixth step.

The input data is: forward time series data, reverse time series data, periodic sequence data, the time dimension influence attenuation matrix, external feature data, and the true value of the traffic flow data.

The completed values of the missing traffic flow data are obtained through the model from the sixth step and compared with the verification values recorded during the loss processing in the second step, to verify the completion effect of the model.

In the first step, the specific preprocessing procedure is as follows:

(1) Time granularity division: all traffic flow data is processed into traffic flow data per k minutes, at a time granularity of k minutes;

(2) Data standardization: the traffic flow data is normalized using the minimum and maximum values, as follows:

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} (max - min) + min$$

where x denotes the original value, $x_{min}$ denotes the minimum of the original values, $x_{max}$ denotes the maximum of the original values, max is the upper limit of the normalized range, min is the lower limit of the normalized range, [min, max] denotes the normalized interval, and $x^*$ is the result after standardization.
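The two preprocessing operations can be sketched as follows. The function names `aggregate` and `min_max_scale`, the record layout of (minute of day, vehicle count), and the choice to sum counts within each k-minute bucket are assumptions for illustration; the text does not specify how values within a bucket are combined.

```python
def aggregate(records, k):
    """Group (minute_of_day, count) records into k-minute buckets.

    Counts in the same bucket are summed (an assumption); the result maps
    each bucket's start minute to its total, ordered by time.
    """
    buckets = {}
    for minute, count in records:
        start = (minute // k) * k
        buckets[start] = buckets.get(start, 0) + count
    return dict(sorted(buckets.items()))


def min_max_scale(xs, lo=0.0, hi=1.0):
    """Min-max normalization of xs into the interval [lo, hi]."""
    x_min, x_max = min(xs), max(xs)
    if x_max == x_min:
        return [lo for _ in xs]      # constant series: avoid division by zero
    return [(x - x_min) / (x_max - x_min) * (hi - lo) + lo for x in xs]


records = [(0, 3), (2, 4), (5, 6), (9, 4), (12, 12)]
per_5_min = aggregate(records, k=5)              # {0: 7, 5: 10, 10: 12}
scaled = min_max_scale(list(per_5_min.values())) # approximately [0.0, 0.6, 1.0]
```

The minimum and maximum used for scaling would normally be kept so that completed values can be mapped back to the original units.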

In the fourth step, the road space information (the Softmax process) is handled as follows: let the hidden states of all sensors at the current time be $h = \langle h_1, h_2, h_3, \ldots, h_i, \ldots, h_t \rangle$, where $h_i$ is the hidden state of the i-th sensor at the current time; a weight is then calculated for each $h_i$ to obtain the new hidden state $h'_i$ of the current sensor.

After processing with Softmax, the weights sum to 1. Here l denotes the number of sensors, and $h_{ij}$ denotes the hidden state of the j-th sensor at time i.
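The weighting of sensor hidden states can be illustrated as follows. The weight formula itself is not reproduced in the text, so this sketch assumes a dot-product score between the current sensor's hidden state and each sensor's hidden state (including its own), followed by Softmax; the function names and sample vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable Softmax; the returned weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_state(hidden_states, i):
    """Return Softmax weights over all sensors and the new hidden state
    of sensor i as the weighted sum of the sensors' hidden states.

    The dot-product score is an assumption made for this sketch.
    """
    h_i = hidden_states[i]
    scores = [sum(a * b for a, b in zip(h_i, h_j)) for h_j in hidden_states]
    weights = softmax(scores)
    new_state = [
        sum(w * h_j[d] for w, h_j in zip(weights, hidden_states))
        for d in range(len(h_i))
    ]
    return weights, new_state


# Hypothetical hidden states of three sensors at the same time step.
hidden = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.5]]
weights, h_new = spatial_state(hidden, i=0)
```

Because the weights sum to 1, the new hidden state is a convex combination of the sensors' hidden states.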

In the sixth step, the mean square error MAE between the completed data obtained in each iteration and the true values of the traffic flow data is calculated, and the MAE is minimized using the Adam method.

where $x'_i$ denotes the true sensor value at the i-th time, and $x_i$ denotes the completed sensor value at the i-th time.
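The error formula is not reproduced in the text, and the text pairs the abbreviation MAE with the description "mean square error". As a hedged illustration, the sketch below computes both candidate metrics over the missing points; the function names and sample values are hypothetical, and the Adam optimization step is not shown.

```python
def mean_absolute_error(truth, completed):
    """Average of |x'_i - x_i| over the evaluated points."""
    return sum(abs(a - b) for a, b in zip(truth, completed)) / len(truth)


def mean_squared_error(truth, completed):
    """Average of (x'_i - x_i)^2 over the evaluated points."""
    return sum((a - b) ** 2 for a, b in zip(truth, completed)) / len(truth)


# Hypothetical true and completed flow values at three missing points.
truth = [100.0, 120.0, 90.0]
completed = [110.0, 115.0, 90.0]
mae = mean_absolute_error(truth, completed)   # 5.0
mse = mean_squared_error(truth, completed)    # 125 / 3
```

Either quantity can serve as the loss minimized during training; which one the reported values correspond to is not determined by the text.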

The beneficial effects of the invention are as follows. The invention differs from existing methods in three respects. First, it makes better use of the time-series character of the data: traditional methods usually consider only the influence of historical data on the data at the current time point, but when completing traffic flow data the information at subsequent time points also bears on the current time point. The invention considers the forward and reverse time series together, which greatly improves completion accuracy. Second, it takes into account the influence on traffic flow data of external features such as holidays and of the areas adjacent to the sensor, and adds them to the completion model, which greatly improves completion accuracy, including for unusual values. Finally, it also accounts for the attenuation of influence along the time dimension when data is missing, which further improves completion accuracy. The method greatly improves the completion accuracy of traffic flow data at low loss rates and also achieves good completion at higher loss rates.

Drawings

Fig. 1 is a diagram of a completion model structure according to the present invention.

Fig. 2 is a graph comparing the completion results with the true values at a low data loss rate of 20%.

Fig. 3 is a graph comparing the completion results with the true values at a high data loss rate of 50%.

Detailed description of the invention

The technical solution of the present invention will be further described with reference to the following specific embodiments and accompanying drawings.

First, preprocess the traffic flow data.

(1) Time granularity division: all traffic flow data is processed into traffic flow data per 5 minutes, at a time granularity of 5 minutes;

(2) Data standardization: the traffic flow data is normalized using the minimum and maximum values, with the formula as follows:

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} (max - min) + min$$

Second, randomly remove data points from the preprocessed data: a random-number method is used to label a certain proportion of the data (set according to experimental requirements) as missing points, and the values of these points are recorded as true values for verifying the final completion effect of the model.

At the same time, a time dimension influence attenuation matrix is established. Missing data can be consecutive; for example, a power failure may prevent a sensor from collecting data for several hours. As the gap lengthens, the influence of historical data on the data at the missing point becomes smaller and smaller, which affects completion accuracy, so the attenuation of influence along the time dimension needs to be recorded. The time dimension influence attenuation matrix is defined as follows:

where $n_t$ denotes the current time; the accompanying term is defined as follows:

Third, divide the preprocessed traffic flow data into a training set, a verification set and a test set in a ratio of 8:1:1. In each data set, the different modules use the following types of data:

data for the forward time series deep learning module:

data for the reverse time series deep learning module:

external feature data used in the external feature module: $F_n$;

periodic sequence data used in the periodic feature module:

where n denotes the current time, t denotes the step length of the time series, and p denotes the step length of the periodic series. S denotes the traffic flow data, and T denotes the reversed sequence of S in the time dimension. $s_i$ denotes the traffic flow data at time n. The remaining symbols denote, in order, the traffic flow data at the same time of day i days before time n, and the set of traffic flow data for the t time points before time n.
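The 8:1:1 division into training, verification and test sets can be sketched as below. Splitting in chronological order is an assumption made here because the data is a time series; the text states only the ratio. The function name `split_8_1_1` is hypothetical.

```python
def split_8_1_1(series):
    """Split a time-ordered series into training, verification and test
    parts in an 8:1:1 ratio, preserving chronological order."""
    n = len(series)
    train_end = n * 8 // 10
    val_end = n * 9 // 10
    return series[:train_end], series[train_end:val_end], series[val_end:]


train, val, test = split_8_1_1(list(range(20)))
# len(train), len(val), len(test) -> 16, 2, 2
```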

Fourth, construct the completion model. The completion model comprises a forward time series deep learning module, a reverse time series deep learning module, a periodic feature module and an external feature module. The structure and training mechanism of each module are as follows:

(1) Forward time series deep learning module: an LSTM model that combines a linear regression network with a multi-layer long short-term memory network. One layer of linear regression network adds information about the continuity of the current missing point in time, in order to handle long runs of missing data and thereby improve completion accuracy.

(2) Reverse time series deep learning module: the network structure is the same as that of the forward time series deep learning module; the difference is that the input of the forward time series deep learning module is reversed in the time dimension and used as the input of this module.

(3) Periodic feature module: a module formed of three fully connected layers. It extracts features from the periodic data to capture how the traffic flow of the same sensor in the same time period varies across the historical data, and outputs the extracted features. Implementation details: the periodic sequence data is input into the fully connected layers, the time-series features of the periodic data are extracted through the three fully connected layers, and the result is output.

(4) External feature module: a feature encoding layer. Implementation details: the external feature data is input into the feature encoding layer, and external features described in text form, such as weather and holidays, are categorized. For example, according to whether the day is a holiday (1 if it is, 0 if not), the data is converted into vector form, and the resulting vector is output to the next step.

To take information about the road space into account, a spatial feature learning module is added. All sensors on the road section are input into the model simultaneously; the hidden states of the other sensors at the time of the current sensor's missing point are then used as input, weights are computed through a Softmax network to obtain the output, and the output is fed into the forward and reverse time series modules.

Finally, the outputs of the modules are combined into a one-dimensional vector, and the final completion result is obtained through one fully connected layer.
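The feature encoding layer for text-described external features can be sketched as below. The holiday flag follows the 1/0 convention stated above; the weather category list, the one-hot scheme and the function name `encode_external` are assumptions for illustration.

```python
# Hypothetical set of weather categories; the text does not enumerate them.
WEATHER = ['sunny', 'cloudy', 'rain', 'snow']


def encode_external(is_holiday, weather):
    """Encode external features as a numeric vector:
    one holiday flag (1 for holiday, 0 otherwise) followed by a
    one-hot encoding of the weather category."""
    if weather not in WEATHER:
        raise ValueError("unknown weather category: " + weather)
    holiday_bit = [1.0 if is_holiday else 0.0]
    weather_one_hot = [1.0 if weather == w else 0.0 for w in WEATHER]
    return holiday_bit + weather_one_hot


vec = encode_external(True, 'rain')
# vec -> [1.0, 0.0, 0.0, 1.0, 0.0]
```

Numeric features such as air temperature could be appended after scaling; that choice is left open here.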

Fifth, use the training set data to pre-train the pre-training part of the time series deep learning model, optimizing the parameters of the time series deep learning model in advance so that overall training does not drive them to a local optimum.

Sixth, use the training set data and the verification set data (points with missing data are replaced by completed values, and data that is not missing is kept unchanged) to train the four modules established in the fourth step as a whole:

The preprocessed data is input into the corresponding modules, and all modules are trained together. After each training pass, the loss function value between the completed values and the true values of the traffic flow data is calculated, and the model parameters are trained toward their target values. The hyper-parameters of the model are tuned continually according to the model's performance on the training set and the verification set, improving completion accuracy while reducing overfitting. During training, the MAE (mean square error) between the completed data obtained in each iteration and the true values of the traffic flow data is calculated, and the MAE is minimized using the Adam method.

The input data comprises: forward time series data (traffic flow data for the preceding $t_1$ hours), reverse time series data (traffic flow data for the following $t_2$ hours), the time dimension influence attenuation matrix, periodic sequence data (traffic flow data at the same time of day for the preceding $t_3$ days), external feature data $F_n$ (external feature data on holidays, region, weather and air temperature at time n), and the true value of the traffic flow data (the traffic flow data at the current time).

The input data is: forward time series data, reverse time series data, periodic sequence data, external feature data, the true value of the traffic flow data, and the time dimension influence attenuation matrix.

Fig. 2 compares the completion results with the true values at a data loss rate of 20%; the mean square error MAE between the model's completion results and the true traffic flow values is 29.18. (The first 100 missing points are shown in the figure.)

Fig. 3 compares the completion results with the true values at a data loss rate of 50%; the mean square error MAE between the model's completion results and the true traffic flow values is 31.94. (The first 100 missing points are shown in the figure.)

Claims

1. A traffic loss data completion method based on a bidirectional cyclic neural network is characterized by comprising the following steps:

firstly, preprocessing the traffic flow data

The preprocessing comprises time granularity division and data standardization;

secondly, performing random data point loss processing on the preprocessed data to construct a data set with missing points, and then recording position information of the missing points to be used as verification values; meanwhile, constructing a time dimension influence attenuation matrix:

wherein $n_t$ denotes the current time; the accompanying term is defined as follows:

thirdly, dividing the traffic data after loss processing into a training set, a verification set and a test set; in each data set, the data used by the different models are of the following types:

data for the forward time series deep learning module:

data of the reverse time series deep learning module:

external feature data employed in the external feature module: $F_n$;

periodic sequence data employed in the periodic feature module:

wherein n represents the current time, t represents the step length of the time series, and p represents the step length of the periodic series; S represents the traffic flow data, and T represents the reversed sequence of S in the time dimension; $s_i$ denotes the traffic flow data at time n; the remaining symbols denote, in order, the traffic flow data at the same time of day i days before time n, and the set of traffic flow data at the same time of day within the previous p days, including the current day, at time n; $F_n$ denotes the external features at time n, including holidays, location area, weather and air temperature;

(1) a forward time series deep learning module: an LSTM model combining a linear regression network with a multi-layer long short-term memory network, in which one layer of linear regression network adds information about the continuity of the current missing point in time, to handle long runs of missing data and improve completion accuracy;

for an input value $x_t$ to the LSTM network, if the data point is not missing it is input directly; if the data point is missing, the hidden state of the previous time step is taken as the input of the current time step; after the input is processed, the deep learning network is trained, and the final output of the forward time series deep learning module is obtained through continued iterative updating;

(2) the reverse time series deep learning module: the network structure is consistent with the forward sequence deep learning module, and the difference is that the input of the forward time sequence deep learning module is reversely processed in the time dimension and is used as the input of the module;

(3) a periodic feature learning module: the system is a module consisting of three layers of fully-connected networks, and is used for acquiring the change rule of the traffic flow of the same sensor and the same time period in historical data by extracting the characteristics of periodic data and then outputting the extracted characteristics; implementation details: inputting the periodic sequence data into a full-connection layer, extracting the time sequence characteristics of the periodic data through three full-connection layers, and then outputting;

(4) an external feature module: the module consists of two parts: the first part processes holiday and weather characteristics and is a characteristic coding layer; implementation details: inputting external feature data into a feature coding layer, converting the data into a vector form, and then combining the obtained vector with the outputs of the three modules;

the second part processes spatial characteristics, all sensors on a road section are simultaneously input into the second part, then the implicit states of other sensors at the same time as the missing point of the current sensor are used as input, the output is obtained after the weight is calculated through a Softmax network, and the output is input into a forward time sequence deep learning module and a reverse time sequence deep learning module;

finally, combining the outputs of the four modules into a one-dimensional vector, and obtaining a final completion result through a layer of fully-connected network;

fifthly, pre-training the pre-training parts of the forward time sequence deep learning module and the reverse time sequence deep learning module by using training set data, optimizing the parameters of the time sequence deep learning module in advance, and avoiding optimizing the parameters to a local optimal point during integral training;

inputting the preprocessed data into corresponding modules respectively, and simultaneously carrying out overall training on all the modules; calculating loss function values of the supplement value and the true value of the traffic flow data after each training, and training the parameters of the model to target values; continuously debugging hyper-parameters of the model according to the effects of the model on a training set and a verification set, and improving the completion accuracy under the condition of reducing overfitting;

the input data comprises:

forward time series data: traffic flow data for the preceding $t_1$ hours

reverse time series data: traffic flow data for the following $t_2$ hours

periodic sequence data: traffic flow data at the same time of day for the preceding $t_3$ days

time dimension influence attenuation matrix:

external feature data: external feature data $F_n$ on holidays, region, weather and air temperature at time n;

true value of the traffic flow data: the traffic flow data at the current time

After one iteration, obtaining the traffic flow data after one completion operation; taking the data after the iteration as the input of the next iteration, wherein the previous missing points have completion values but the labels still represent the missing, and the target still completes the data of the missing points in the subsequent iteration process;

seventhly, completing traffic flow data by using the test set and utilizing the model trained in the sixth step;

the input data is: forward time series data, reverse time series data, periodic sequence data, the time dimension influence attenuation matrix, external feature data, and the true value of the traffic flow data.

2. The method for completing the traffic loss data based on the bidirectional recurrent neural network as claimed in claim 1, wherein in the first step, the preprocessing comprises:

3. The traffic loss data completion method based on the bidirectional recurrent neural network as claimed in claim 1 or 2, wherein in the fourth step, the specific process of processing the spatial features is: let the hidden states of all sensors at the current time be $h = \langle h_1, h_2, h_3, \ldots, h_i, \ldots, h_t \rangle$, where $h_i$ is the hidden state of the i-th sensor at the current time; a weight is then calculated for each $h_i$ to obtain the new hidden state $h'_i$ of the current sensor;

wherein l denotes the number of sensors, and $h_{ij}$ denotes the hidden state of the j-th sensor at time i.

4. The traffic loss data completion method based on the bidirectional cyclic neural network as claimed in claim 1 or 2, wherein in the sixth step, the mean square error MAE of the data obtained by completion and the truth value of the traffic flow data obtained by each iteration is calculated, and the MAE is minimized by using an Adam method;

5. The traffic loss data completion method based on the bidirectional cyclic neural network as claimed in claim 3, wherein in the sixth step, the mean square error MAE of the data obtained by completion and the truth value of the traffic flow data obtained by each iteration is calculated, and the MAE is minimized by using an Adam method;