WO2020010717A1

WO2020010717A1 - Short-term traffic flow prediction method based on spatio-temporal correlation

Info

Publication number: WO2020010717A1
Application number: PCT/CN2018/107987
Authority: WO
Inventors: 戚湧; 熊亭; 张伟斌; 高盼军
Original assignee: 南京理工大学
Priority date: 2018-07-13
Filing date: 2018-09-27
Publication date: 2020-01-16
Also published as: AU2018432145A1; CN108877223A; AU2018102176A4

Abstract

A short-term traffic flow prediction method based on spatio-temporal correlation. The method comprises the following steps: selecting a road section requiring traffic prediction and break points in the road section, acquiring short-term traffic flow historical data of all of the break points in the selected road section, determining a prediction time period of the short term traffic flow prediction, and verifying whether the historical traffic flow data of the prediction break points has periodicity; after using a normalisation method to normalise the traffic flow data, dividing the data set into a training data set and a testing data set; using a SARIMA model to perform predictive analysis on the testing data set to obtain an initial prediction result; using the prediction result as an input feature, entering same into a random forest model to obtain a final prediction result; comparing the testing data with the final prediction data and analysing errors. The present method breaks down flow data into periodic parts with evident trends and random fluctuation parts for analysis, increasing the precision of traffic flow data prediction.

Description

Short-term traffic flow prediction method based on spatio-temporal correlation

Technical field

The invention relates to the technical fields of machine learning methods, traffic flow prediction, and the like, and in particular, to a short-term traffic flow prediction method based on spatio-temporal correlation.

Background technique

With the accelerating modernization of society and the continuing rise in the level of urbanization, the number of vehicles has increased rapidly, and existing road networks struggle to meet the growing demand for traffic. Against this background, the concept of intelligent transportation systems (ITS) came into being. In ITS, real-time and accurate short-term traffic flow prediction plays a vital role. It not only affects the control and guidance of traffic flow, but is also the key to the system's transition from passive to active control.

As research on short-term traffic flow analysis and prediction has deepened, researchers have proposed many models from different analytical angles and for different application conditions. These models fall into three categories. The first category comprises prediction models based on mathematical statistics and calculus, such as reference 1 (traffic flow parameter prediction method based on fuzzy Kalman filter, publication number: CN102629418A). Models of this type rely on the internal statistical characteristics of the observed data, process the traffic flow data dynamically, and predict future traffic flow. However, most of them use only historical flow data and ignore other factors such as season, climate, and upstream and downstream flow, so they adapt poorly to the strong randomness of traffic flow and their prediction accuracy is limited. The second category comprises prediction models based on modern techniques such as machine learning, including support vector machines, neural networks, and models based on chaos theory, such as reference 2 (traffic parameter prediction method based on deep belief network, publication number: CN106295874A). Models of this type usually predict short-term traffic flow with machine learning or artificial intelligence, but they often ignore some inherent characteristics of the traffic flow data. The third category comprises combined prediction models, such as reference 3 (a traffic flow prediction method based on a combination of support vector machines and BP neural network, publication number: CN107705556A). As the name suggests, a combined model uses several models together. However, most combined models do not consider the characteristics of the traffic flow and simply combine models arbitrarily, so the prediction performance does not improve significantly while the complexity of the model increases.

Clearly, a single prediction model can hardly account for both the inherent characteristics of traffic flow data and the external effects caused by seasonal, climatic, or human factors. Existing methods therefore have difficulty reflecting the complex characteristics inherent in traffic flow data, and they cannot comprehensively consider the influence of external spatial correlation on the forecast, among other shortcomings.

Summary of the invention

The purpose of the present invention is to provide a short-term traffic flow prediction method based on spatio-temporal correlation that improves the ability to analyze traffic flow data and to mine its features, and thereby further improves the prediction accuracy of the model.

The technical solution to achieve the purpose of the present invention is: a short-term traffic flow prediction method based on spatio-temporal correlation, including the following steps:

Step 1: Select a road segment to be predicted for traffic flow and the breakpoints in the road segment, and obtain historical short-term traffic flow data of all breakpoints in the selected road segment;

Step 2: Determine a prediction period of the short-term traffic flow prediction based on the obtained short-term traffic flow historical data;

Step 3: Verify whether the historical traffic flow data of the predicted breakpoint is periodic based on the short-term traffic flow historical data of the breakpoint;

Step 4: Use the normalization method to perform normalization processing on the traffic flow data, and divide the normalized data set into a training data set and a test data set;

Step 5: Use the SARIMA model to perform a predictive analysis on the test data set to obtain an initial prediction result;

Step 6: Take the prediction result obtained by the SARIMA model as an input feature and bring it into the random forest model to obtain the final prediction result;

Step 7: Compare the test data set with the final prediction data and analyze the errors.

Further, the short-term traffic flow historical data of the breakpoint in step 1 refers to data collection date, time, traffic flow speed value and traffic flow value at the breakpoint.

Further, the prediction period described in step 2 is 5 minutes.

Further, verifying whether the historical traffic flow data of the prediction breakpoint is periodic in step 3 refers to periodic verification using an autocorrelation function, and the specific process is as follows:

For the sequence values $X_t, X_{t-1}, \ldots, X_{t-k}$ that constitute the time series, the autocorrelation coefficient $r_k$ measures the degree of correlation between observations separated by $k$ periods. It is calculated by the following formula:

$$r_k = \frac{\sum_{t=k+1}^{n} (X_t - \bar{X})(X_{t-k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}$$

where $n$ represents the length of the time series, $\bar{X}$ is the average of the time series data, and $X_{t-k}$ represents the sequence value that is $k$ periods away from $X_t$.
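
The periodicity check described above can be illustrated with a minimal sketch. It computes the sample autocorrelation coefficient at lag k and picks the lag with the largest coefficient inside a search range. The function names `autocorrelation` and `dominant_period`, the search range, and the synthetic sine-shaped series are assumptions made for illustration; they are not taken from the patent.

```python
from math import sin, pi


def autocorrelation(series, k):
    """Sample autocorrelation coefficient r_k of `series` at lag k."""
    n = len(series)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(series)")
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Numerator: co-deviation of values that are k periods apart.
    num = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    return num / denom


def dominant_period(series, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda k: autocorrelation(series, k))


# Synthetic, noiseless example: a signal that repeats every 24 samples.
flow = [100 + 50 * sin(2 * pi * t / 24) for t in range(240)]

# Short lags are skipped because neighbouring samples are trivially correlated.
period = dominant_period(flow, min_lag=12, max_lag=60)
print(period)
```

On real detector data the peak is less clean than in this synthetic case, so the search range should bracket the expected daily period.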

Further, the normalization method described in step 4 is as follows:

Calculate the minimum min and the maximum max in a sample of historical traffic flow data, and use the min-max normalization method to normalize the data so that the normalized traffic flow data is mapped to [0, 1]. In other words, the maximum value max and the minimum value min are obtained from the traffic flow data set $F = \{f_t \mid t = 1, 2, \ldots, T\}$, and each data item in the set is transformed as follows:

$$x' = \frac{x - \min}{\max - \min}$$

where $x'$ represents the traffic flow data after normalization processing, min represents the minimum value of the sample data, max represents the maximum value of the sample data, and $x$ represents the data to be normalized.

Further, dividing the normalized data set into a training data set and a test data set as described in step 4 specifically means: after normalization processing, 80% of the historical traffic flow data is used as the training set and the remaining 20% is used as the test set.
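
As a rough illustration of step 4, the sketch below applies min-max normalization and then splits the series chronologically into 80% training and 20% test data. The helper names (`min_max_normalize`, `denormalize`, `chronological_split`) and the sample values are hypothetical. Splitting by time order rather than at random is an assumption here, chosen because the data is a time series.

```python
def min_max_normalize(values):
    """Map values to [0, 1]; also return min and max for later inversion."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(x - lo) / (hi - lo) for x in values], lo, hi


def denormalize(x_norm, lo, hi):
    """Invert min-max normalization for a single value."""
    return x_norm * (hi - lo) + lo


def chronological_split(values, train_fraction=0.8):
    """Split a time-ordered list: the first part trains, the rest tests."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


# Hypothetical 5-minute vehicle counts.
flows = [12, 30, 45, 60, 52, 38, 20, 15, 33, 48]
normalized, lo, hi = min_max_normalize(flows)
train, test = chronological_split(normalized)
print(len(train), len(test))
```

Keeping `lo` and `hi` makes it possible to convert predictions back to vehicle counts before reporting errors.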

Further, in step 5, the SARIMA model is used to perform a predictive analysis on the test data set to obtain an initial prediction result, which specifically includes the following steps:

(5.1) Check whether the original traffic flow data is a stationary sequence: if the test shows that the traffic flow data is non-stationary, make it stationary by differencing; if the test shows that the traffic flow data is stationary, proceed directly to step (5.2);

(5.2) Determine the values of the four parameters p, q, P, Q of the SARIMA model according to the ACF and PACF of the stationary time series data and the AIC minimum criterion;

(5.3) During the prediction process, the data of the d days before the prediction time t is used as training data, dynamic prediction is performed in the form of a sliding window, and the model is refitted and its parameters adjusted every n predictions, finally obtaining the initial prediction result described in step 5.

Further, in step 6, the prediction result obtained by the SARIMA model is taken as an input feature and is brought into a random forest model to obtain the final prediction result, which specifically includes the following steps:

The initial prediction results obtained by the SARIMA model are used as an input feature reflecting the periodic pattern, and are fed into the random forest model together with the other input features. The parameters are tuned using the grid search method to finally obtain the predicted values.
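
The text does not list the "other input features", so the following is only a sketch of how a feature matrix for step 6 might be assembled. It assumes, purely for illustration, that the other features are lagged flows at the target breakpoint and lagged flows at an upstream breakpoint. The function name `build_feature_rows` and all sample values are hypothetical.

```python
def build_feature_rows(sarima_pred, target_flow, upstream_flow, n_lags=2):
    """Assemble one feature row per time step t >= n_lags.

    Row layout (an assumption, not specified in the source):
      [SARIMA prediction for t,
       target flow at t-1 ... t-n_lags,
       upstream flow at t-1 ... t-n_lags]
    The label of each row is the observed target flow at t.
    """
    if not (len(sarima_pred) == len(target_flow) == len(upstream_flow)):
        raise ValueError("all input series must have the same length")
    rows, labels = [], []
    for t in range(n_lags, len(target_flow)):
        lagged_target = [target_flow[t - j] for j in range(1, n_lags + 1)]
        lagged_upstream = [upstream_flow[t - j] for j in range(1, n_lags + 1)]
        rows.append([sarima_pred[t]] + lagged_target + lagged_upstream)
        labels.append(target_flow[t])
    return rows, labels


# Hypothetical normalized series of equal length.
sarima_pred = [0.50, 0.55, 0.60, 0.58, 0.52]
target_flow = [0.48, 0.57, 0.61, 0.55, 0.50]
upstream_flow = [0.40, 0.52, 0.66, 0.60, 0.49]

rows, labels = build_feature_rows(sarima_pred, target_flow, upstream_flow)
print(rows[0], labels[0])
```

Rows and labels in this shape could then be passed to a random forest regressor, for example scikit-learn's `RandomForestRegressor` tuned with `GridSearchCV`; that pairing is a suggestion, not something the source names.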

Further, comparing the test data set with the final prediction data and analyzing the error described in step 7, specifically includes the following steps:

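Error analysis is performed on the forecast data using the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The calculation formulas are as follows:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert u_i - \hat{u}_i \rvert}{u_i} \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (u_i - \hat{u}_i)^2}$$

where $n$ represents the total number of test data selected, $u_i$ is the actual traffic flow value in the i-th period, and $\hat{u}_i$ is the flow value obtained by the model for the i-th period.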
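
The two error measures can be written down directly. The sketch below uses the standard definitions of the mean absolute percentage error and the root mean square error; the function names and sample numbers are illustrative only, and the guard against zero actual values is an added assumption because MAPE is undefined there.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(u == 0 for u in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(u - p) / abs(u) for u, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((u - p) ** 2 for u, p in zip(actual, predicted))
    return sqrt(total / len(actual))


# Hypothetical actual and predicted vehicle counts.
actual = [100, 200, 400]
predicted = [110, 180, 400]
print(mape(actual, predicted), rmse(actual, predicted))
```

Errors should be computed on de-normalized values if they are to be read as vehicle counts.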

Compared with the prior art, the present invention has significant advantages: (1) it can deeply mine the characteristics of the periodic and non-linear parts of the traffic flow data; (2) it analyzes the traffic data from the perspective of the spatio-temporal correlation of the traffic flow and decomposes the data into a periodic part with an obvious trend and a random fluctuation part, which further improves the prediction accuracy of the model.

Brief description of the drawings

FIG. 1 is the main flowchart of the method of the present invention.

FIG. 2 is a structural diagram of the method of the present invention.

FIG. 3 and FIG. 4 are periodicity verification charts of the traffic flow data.

FIG. 5 shows the basic steps of SARIMA model prediction.

FIG. 6, FIG. 7, and FIG. 8 compare the prediction results of the method of the present invention with those of existing methods.

Detailed description

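The short-term traffic flow prediction method based on spatio-temporal correlation of the present invention includes the following steps:

Step 1: Select a road segment to be predicted for traffic flow and the breakpoints in the road segment, and obtain historical short-term traffic flow data of all breakpoints in the selected road segment.

The short-term traffic flow historical data of a breakpoint refers to the data collection date, the time, the traffic flow speed value, and the traffic flow value at the breakpoint.

Step 2: Determine a prediction period of the short-term traffic flow prediction based on the obtained short-term traffic flow historical data.

For example, the prediction period is 5 minutes.

Step 3: According to the short-term traffic flow historical data of the breakpoint, verify whether the historical traffic flow data of the predicted breakpoint is periodic. The specific process is as follows:

For the sequence values $X_t, X_{t-1}, \ldots, X_{t-k}$ that constitute the time series, the autocorrelation coefficient $r_k$ measures the degree of correlation between observations separated by $k$ periods. It is calculated by the following formula:

$$r_k = \frac{\sum_{t=k+1}^{n} (X_t - \bar{X})(X_{t-k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}$$

where $n$ represents the length of the time series, $\bar{X}$ is the average of the time series data, and $X_{t-k}$ represents the sequence value that is $k$ periods away from $X_t$.

Step 4: Use the normalization method to perform normalization processing on the traffic flow data, and divide the normalized data set into a training data set and a test data set.

The normalization method is as follows:

Calculate the minimum min and the maximum max in a sample of historical traffic flow data, and use the min-max normalization method to normalize the data so that the normalized traffic flow data is mapped to [0, 1]. In other words, the maximum value max and the minimum value min are obtained from the traffic flow data set $F = \{f_t \mid t = 1, 2, \ldots, T\}$, and each data item in the set is transformed as follows:

$$x' = \frac{x - \min}{\max - \min}$$

where $x'$ represents the traffic flow data after normalization processing, min represents the minimum value of the sample data, max represents the maximum value of the sample data, and $x$ represents the data to be normalized.

Dividing the normalized data set into a training data set and a test data set specifically means: after normalization processing, 80% of the historical traffic flow data is used as the training set and the remaining 20% is used as the test set.

Step 5: Use the SARIMA model to perform a predictive analysis on the test data set to obtain the initial prediction result, which specifically includes the following steps:

(5.1) Check whether the original traffic flow data is a stationary sequence: if the test shows that the traffic flow data is non-stationary, make it stationary by differencing; if the test shows that the traffic flow data is stationary, proceed directly to step (5.2);

(5.2) Determine the values of the four parameters p, q, P, Q of the SARIMA model according to the ACF and PACF of the stationary time series data and the AIC minimum criterion;

(5.3) During the prediction process, the data of the d days before the prediction time t is used as training data, dynamic prediction is performed in the form of a sliding window, and the model is refitted and its parameters adjusted every n predictions, finally obtaining the initial prediction result described in step 5.

Step 6: Take the prediction result obtained by the SARIMA model as an input feature and bring it into the random forest model to obtain the final prediction result, which specifically includes the following steps:

The initial prediction results obtained by the SARIMA model are used as an input feature reflecting the periodic pattern, and are fed into the random forest model together with the other input features. The parameters are tuned using the grid search method to finally obtain the predicted values.

Step 7: Compare the test data set with the final prediction data and analyze the error, which specifically includes the following steps:

Error analysis is performed on the forecast data using the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The calculation formulas are as follows:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert u_i - \hat{u}_i \rvert}{u_i} \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (u_i - \hat{u}_i)^2}$$

where $n$ represents the total number of test data selected, $u_i$ is the actual traffic flow value in the i-th period, and $\hat{u}_i$ is the flow value obtained by the model for the i-th period.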

In order to better understand the present invention, the content of the present invention will be further described below with reference to the accompanying drawings and specific embodiments.

Example 1

The main flowchart and the structure diagram of the short-term traffic flow prediction method based on spatio-temporal correlation in this embodiment are shown in Fig. 1 and Fig. 2. The method includes the following steps:

Step 1: Select the road segment to be predicted for traffic flow and the breakpoints in the road segment, and obtain the historical short-term traffic flow data of all breakpoints in the selected road segment;

Step 2: Determine the prediction period of the short-term traffic flow prediction based on the obtained short-term traffic flow historical data;

Step 3: Verify whether the historical traffic flow data of the predicted breakpoint is periodic based on the short-term traffic flow historical data of the breakpoint;

Step 4: Normalize the traffic flow data by using a normalization method, and divide the normalized data set into a training data set and a test data set;

Step 5: Use the SARIMA model to perform prediction analysis on the test data set to obtain the initial prediction result;

Step 6: Take the prediction result obtained by the SARIMA model as an input feature and bring it into the random forest model to obtain the final prediction result;

Step 7: Compare the test data set with the final prediction data and analyze the errors.

In the use case of this embodiment, the traffic flow data is collected through loop (coil) detectors, and the obtained traffic flow data is the number of vehicles passing a specific breakpoint within a given time interval. In this example, the time interval is 5 minutes. The historical observation data set is expressed as $F = \{f_t \mid t = 1, 2, \ldots, T\}$, where $f_t$ represents the traffic flow parameter of the specific breakpoint of the road network at time t, and the difference between time T and time T + 1 is the prediction time interval. The prediction time interval used in this example is 5 minutes.
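
To make the data definition concrete, the sketch below turns individual vehicle passage times into counts per 5-minute interval. The source does not describe the raw detector format, so the timestamp representation, the function name `count_per_interval`, and the sample date are assumptions.

```python
from datetime import datetime, timedelta


def count_per_interval(passage_times, start, end, minutes=5):
    """Count vehicle passages in consecutive bins of `minutes` from start to end.

    Bin i covers [start + i*step, start + (i+1)*step).
    Passages outside [start, end) are ignored.
    """
    step = timedelta(minutes=minutes)
    n_bins = (end - start) // step  # number of whole intervals
    counts = [0] * n_bins
    for ts in passage_times:
        index = (ts - start) // step
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical observation window from 06:00 to 24:00 on one day.
start = datetime(2018, 7, 2, 6, 0)
end = datetime(2018, 7, 3, 0, 0)
passages = [
    datetime(2018, 7, 2, 5, 59),      # before the window, ignored
    datetime(2018, 7, 2, 6, 1),
    datetime(2018, 7, 2, 6, 4, 59),
    datetime(2018, 7, 2, 6, 5),
    datetime(2018, 7, 2, 23, 59),
]
counts = count_per_interval(passages, start, end)
print(len(counts), counts[:2])
```

An 18-hour window at 5-minute resolution yields 216 values per day, which is the length of the resulting daily series.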

To analyze the periodic pattern of the traffic flow, it must first be verified that the data set is periodic. In this example, periodic verification is performed through an autocorrelation function. Taking the data from 6:00 to 24:00 of each day at a time interval of 5 minutes as experimental data, it is verified that the traffic flow data has a daily periodicity with a period of 216 samples (18 hours at 12 five-minute intervals per hour), which is consistent with the actual situation. The periodic verification charts are shown in Figures 3 and 4.

Next, calculate the minimum min and the maximum max in a sample of historical traffic flow data, and use the min-max normalization method to normalize the data so that the normalized traffic flow data is mapped to [0, 1]. That is, the maximum value max and the minimum value min are obtained from the traffic flow data set $F = \{f_t \mid t = 1, 2, \ldots, T\}$, and each data item $x$ in the set is transformed into $x'$ as follows:

$$x' = \frac{x - \min}{\max - \min}$$

In this example, data of 25 working days are used as experimental data, of which 20 days of traffic flow data are used as training data, and 5 days of traffic flow data are used as test data.

The SARIMA model is a model that can describe seasonal time series. It is a variant of the Autoregressive Integrated Moving Average (ARIMA) model [14].

Assume that a traffic flow sequence $\{X_t\}$ can be fitted by the SARIMA$(p, d, q)(P, D, Q)_S$ model, where the parameter $S$ represents the length of the seasonal period, the parameter $d$ represents the number of ordinary differences required to convert the sequence into a stationary sequence, and the parameter $D$ represents the order of the required seasonal difference. The parameters $p$ and $q$ are the non-seasonal autoregressive and moving-average orders, and $P$ and $Q$ are their seasonal counterparts. Let the stationary time series after differencing be $\{Y_t\}$, as shown in equation (2), where $B$ represents the backward shift operator, which acts on the traffic flow as shown in equation (3):

$$Y_t = (1 - B)^d (1 - B^S)^D X_t \quad (2)$$

$$B^j X_t = X_{t-j} \quad (3)$$

Then the SARIMA model can be expressed in the form of equation (4):

$$\phi(B)\,\Phi(B^S)\,(1 - B)^d (1 - B^S)^D X_t = c + \theta(B)\,\Theta(B^S)\,\varepsilon_t \quad (4)$$

By equation (2), the left-hand side equals $\phi(B)\,\Phi(B^S)\,Y_t$. The parameter $c$ represents a constant term; $\varepsilon_t$ represents the residual term of the model and satisfies $\varepsilon_t \sim N(0, \sigma^2)$; $B^S$ represents the seasonal backward shift operator; and the polynomials satisfy the following relationships:

$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p \quad (5)$$

$$\Phi(B^S) = 1 - \Phi_1 B^{S} - \Phi_2 B^{2S} - \cdots - \Phi_P B^{PS} \quad (6)$$

$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q \quad (7)$$

$$\Theta(B^S) = 1 - \Theta_1 B^{S} - \Theta_2 B^{2S} - \cdots - \Theta_Q B^{QS} \quad (8)$$
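
The differencing in equation (2) can be checked numerically. The sketch below applies the operator (1 - B)^d (1 - B^S)^D to a list of values using the backward shift of equation (3). The function names and the toy series (a linear trend plus a pattern repeating every 4 samples) are assumptions chosen so that the result is easy to verify by hand.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): return x_t - x_{t-lag} for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d, D, S):
    """Apply (1 - B)^d (1 - B^S)^D to `series`, as in equation (2)."""
    out = list(series)
    for _ in range(D):          # seasonal differences with period S
        out = difference(out, lag=S)
    for _ in range(d):          # ordinary first differences
        out = difference(out, lag=1)
    return out


# Toy series: linear trend 2*t plus a seasonal pattern of length 4.
pattern = [0, 5, 1, 3]
series = [2 * t + pattern[t % 4] for t in range(12)]

seasonal_only = sarima_difference(series, d=0, D=1, S=4)
stationary = sarima_difference(series, d=1, D=1, S=4)
print(seasonal_only, stationary)
```

The seasonal difference removes the repeating pattern and leaves the constant trend increment; the additional first difference removes that as well. Each difference shortens the series by its lag.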

The basic steps of SARIMA$(p, d, q)(P, D, Q)_S$ model prediction are shown in Figure 5. In this example, the original traffic flow data is first checked for stationarity. The test shows that the traffic flow data is non-stationary, so it is differenced to make it stationary, which gives d = 1, D = 1, and S = 156. Second, the values of p, q, P, and Q are determined from the ACF and PACF of the stationarized time series and the AIC minimum criterion. During prediction, the data of the three days before the prediction time t is used as training data, dynamic prediction is performed in the form of a sliding window, the model is refitted and its parameters adjusted every 12 predictions, and finally one week of traffic flow data in the test set is predicted.
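
The sliding-window procedure with periodic refitting can be sketched independently of the model being fitted. In the code below, a seasonal-naive forecaster with drift stands in for SARIMA so that the example stays self-contained; that substitution, the function names, and the toy sizes (a window of 12 samples instead of three days) are all assumptions.

```python
def rolling_forecast(series, first_t, window, refit_every, fit, predict_next):
    """One-step-ahead forecasts for t = first_t .. len(series) - 1.

    At each step the model sees only the `window` observations before t.
    The model is re-estimated every `refit_every` steps; in between, the
    last fitted model is reused on the updated window.
    """
    if first_t < window:
        raise ValueError("first_t must be at least the window length")
    preds, model, n_fits = [], None, 0
    for step, t in enumerate(range(first_t, len(series))):
        history = series[t - window:t]  # sliding window ending just before t
        if step % refit_every == 0:
            model = fit(history)
            n_fits += 1
        preds.append(predict_next(model, history))
    return preds, n_fits


def fit_seasonal_naive(history, season):
    """Stand-in model (not SARIMA): average season-over-season change."""
    diffs = [history[i] - history[i - season]
             for i in range(season, len(history))]
    return {"season": season, "drift": sum(diffs) / len(diffs)}


def predict_seasonal_naive(model, history):
    """Forecast = value one season ago plus the estimated drift."""
    return history[-model["season"]] + model["drift"]


SEASON = 4
pattern = [10, 30, 20, 40]
series = [pattern[t % SEASON] for t in range(40)]

preds, n_fits = rolling_forecast(
    series, first_t=12, window=12, refit_every=12,
    fit=lambda h: fit_seasonal_naive(h, SEASON),
    predict_next=predict_seasonal_naive,
)
print(n_fits, preds[:4])
```

With the settings of this example, `window` would cover three days of 5-minute samples and `refit_every` would be 12. A real SARIMA fit could be supplied through the `fit` and `predict_next` arguments.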

Random forest (RF) is a powerful tool for data mining and machine learning. It is an ensemble learning method that combines a large number of regression trees, that is, many weak models, into one strong model and then obtains the prediction result. The prediction process of RF can be explained intuitively by evaluating the importance of the predictors. The algorithm is robust to noise and outliers in the data, runs efficiently on large traffic data sets, and adapts well to high-dimensional data. In this example, the initial prediction result obtained by the SARIMA model is used as a feature reflecting the periodic pattern, combined with the other input features, and brought into the random forest model to obtain the final prediction result. Three time periods are selected: 7:00 to 20:00 (period 1), 8:00 to 10:00 (period 2), and 14:00 to 16:00 (period 3), and for each of them the prediction data is compared with the test data set and the errors are analyzed. The error is evaluated by two indicators: the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The calculation formulas are as follows:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert u_i - \hat{u}_i \rvert}{u_i} \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (u_i - \hat{u}_i)^2}$$

where $n$ is the number of test data, $u_i$ is the actual traffic flow value in the i-th period, and $\hat{u}_i$ is the flow value predicted by the model for the i-th period. The comparison between the prediction results of the method of the present invention and the prediction results of the existing methods is shown in Figures 6, 7, and 8.
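
The evaluation by time period could be organized as in the sketch below, which filters predictions by time of day and reports both error measures per period. The record layout `(hour_of_day, actual, predicted)`, the function name, and the sample values are hypothetical. Reading period 2 as 08:00 to 10:00 is an assumption about the intended window, and actual values are assumed to be positive.

```python
from math import sqrt

# Time-of-day windows in hours, [start, end). Period 2 is an assumed reading.
PERIODS = {
    "period 1": (7, 20),
    "period 2": (8, 10),
    "period 3": (14, 16),
}


def errors_by_period(records, periods=PERIODS):
    """Compute MAPE (percent) and RMSE for each time-of-day window.

    records: iterable of (hour_of_day, actual, predicted); the hour may be
    fractional, e.g. 14.25 for 14:15. Actual values must be positive.
    """
    records = list(records)
    report = {}
    for name, (start_h, end_h) in periods.items():
        pairs = [(u, p) for h, u, p in records if start_h <= h < end_h]
        if not pairs:
            report[name] = None  # no test data fell into this window
            continue
        n = len(pairs)
        mape = 100.0 * sum(abs(u - p) / u for u, p in pairs) / n
        rmse = sqrt(sum((u - p) ** 2 for u, p in pairs) / n)
        report[name] = {"n": n, "mape": mape, "rmse": rmse}
    return report


# Hypothetical test records.
records = [
    (6.5, 50, 55),
    (8.0, 100, 110),
    (9.5, 200, 190),
    (14.25, 150, 150),
    (19.0, 80, 100),
]
report = errors_by_period(records)
print(report["period 1"])
```

Because the windows overlap, a single record can contribute to more than one period.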

In summary, the method of the present invention deeply explores the randomness and uncertainty of traffic flow data, fully considers the spatio-temporal correlation in traffic flow data, and decomposes the flow data into a periodic part with a clear trend and a random fluctuation part for analysis, thereby improving the prediction accuracy of traffic flow data.

Claims

A short-term traffic flow prediction method based on spatio-temporal correlation, which is characterized by including the following steps:

Step 1: Select a road segment to be predicted for traffic flow and the breakpoints in the road segment, and obtain historical short-term traffic flow data of all breakpoints in the selected road segment;

Step 2: Determine a prediction period of the short-term traffic flow prediction based on the obtained short-term traffic flow historical data;

Step 3: Verify whether the historical traffic flow data of the predicted breakpoint is periodic based on the short-term traffic flow historical data of the breakpoint;

Step 4: Use the normalization method to perform normalization processing on the traffic flow data, and divide the normalized data set into a training data set and a test data set;

Step 5: Use the SARIMA model to perform a predictive analysis on the test data set to obtain an initial prediction result;

Step 6: Take the prediction result obtained by the SARIMA model as an input feature and bring it into the random forest model to obtain the final prediction result;

Step 7: Compare the test data set with the final prediction data and analyze the errors.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, characterized in that the short-term traffic flow historical data of the breakpoint in step 1 refers to the data collection date, the time, the traffic flow speed value, and the traffic flow value at the breakpoint.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, wherein the prediction period described in step 2 is 5 minutes.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, characterized in that, in step 3, verifying whether the historical traffic flow data of the prediction breakpoint has periodicity refers to performing the periodicity verification using the autocorrelation function, and the specific process is as follows:

For the sequence values $X_t, X_{t-1}, \ldots, X_{t-k}$ that constitute the time series, the autocorrelation coefficient $r_k$ measures the degree of correlation between observations separated by $k$ periods. It is calculated by the following formula:

$$r_k = \frac{\sum_{t=k+1}^{n} (X_t - \bar{X})(X_{t-k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}$$

where $n$ represents the length of the time series, $\bar{X}$ is the average of the time series data, and $X_{t-k}$ represents the sequence value that is $k$ periods away from $X_t$.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, wherein the normalization method described in step 4 is as follows:

Calculate the minimum min and the maximum max in a sample of historical traffic flow data, and use the min-max normalization method to normalize the data so that the normalized traffic flow data is mapped to [0, 1]. In other words, the maximum value max and the minimum value min are obtained from the traffic flow data set $F = \{f_t \mid t = 1, 2, \ldots, T\}$, and each data item in the set is transformed as follows:

$$x' = \frac{x - \min}{\max - \min}$$

where $x'$ represents the traffic flow data after normalization processing, min represents the minimum value of the sample data, max represents the maximum value of the sample data, and $x$ represents the data to be normalized.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, characterized in that, in step 4, the normalized data set is divided into a training data set and a test data set, specifically: after the normalization process, 80% of the historical traffic flow data is used as the training set, and the remaining 20% is used as the test set.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, characterized in that, in step 5, the SARIMA model is used to perform prediction analysis on the test data set to obtain an initial prediction result, which specifically includes the following steps:

(5.1) Check whether the original traffic flow data is a stationary sequence: if the test shows that the traffic flow data is non-stationary, make it stationary by differencing; if the test shows that the traffic flow data is stationary, proceed directly to step (5.2);

(5.2) Determine the values of the four parameters p, q, P, Q of the SARIMA model according to the ACF and PACF of the stationary time series data and the AIC minimum criterion;

(5.3) During the prediction process, the data of the d days before the prediction time t is used as training data, dynamic prediction is performed in the form of a sliding window, and the model is refitted and its parameters adjusted every n predictions, finally obtaining the initial prediction result described in step 5.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, characterized in that, in step 6, the prediction result obtained by the SARIMA model is used as an input feature and brought into a random forest model to obtain the final prediction result, which specifically includes the following steps:

The initial prediction results obtained by the SARIMA model are used as an input feature reflecting the periodic pattern, and are fed into the random forest model together with the other input features. The parameters are tuned using the grid search method to finally obtain the predicted values.
The short-term traffic flow prediction method based on spatio-temporal correlation according to claim 1, characterized in that, in step 7, the test data set is compared with the final prediction data, and the error is analyzed, which specifically includes the following steps:

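Error analysis is performed on the forecast data using the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The calculation formulas are as follows:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert u_i - \hat{u}_i \rvert}{u_i} \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (u_i - \hat{u}_i)^2}$$

where $n$ represents the total number of test data selected, $u_i$ is the actual traffic flow value in the i-th period, and $\hat{u}_i$ is the flow value obtained by the model for the i-th period.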