Short-term traffic flow prediction method based on spatio-temporal correlation
Technical field
The invention relates to the technical fields of machine learning and traffic flow prediction, and in particular to a short-term traffic flow prediction method based on spatio-temporal correlation.
Background art
As modernization accelerates and urbanization deepens, the number of vehicles has grown rapidly, and existing road networks struggle to meet the rising demand for traffic capacity. The concept of intelligent transportation systems (ITS) emerged in response in the late 20th century. Within ITS, real-time and accurate short-term traffic flow prediction plays a vital role: it not only affects traffic control and guidance, but is also the key to moving the system from passive response to active control.
As research on short-term traffic flow analysis and prediction has deepened, researchers have proposed many models from different analytical perspectives and for different application conditions. These models fall into three categories. The first category comprises prediction models based on mathematical statistics and calculus, such as Document 1 (Traffic flow parameter prediction method based on fuzzy Kalman filtering, publication number CN102629418A). Such models process traffic flow data dynamically by observing the statistical characteristics within the data and predict future traffic flow. However, most of them rely only on historical flow data and ignore other factors such as season, climate, and upstream and downstream flow, so they adapt poorly to the strong randomness of traffic flow and their accuracy is limited. The second category comprises prediction models built on modern techniques such as machine learning, including support vector machines, neural networks, and models based on chaos theory, such as Document 2 (Traffic flow parameter prediction method based on a deep belief network, publication number CN106295874A). Such models typically use machine learning or artificial intelligence methods to predict short-term traffic flow; their drawback is that they often overlook characteristics inherent to traffic flow data. The third category comprises combined prediction models, such as Document 3 (A traffic flow prediction method combining a support vector machine and a BP neural network, publication number CN107705556A). As the name suggests, a combined model uses several models together. However, most combined models do not consider the characteristics of traffic flow and simply combine models arbitrarily, so prediction performance does not improve significantly while model complexity increases. Clearly, a single prediction model can hardly account for both the inherent characteristics of traffic flow data and the external influences of season, climate, or human factors. Such models therefore fail to reflect the complex characteristics inherent in traffic flow data and cannot fully consider the influence of external spatial correlation on prediction.
Summary of the invention
The purpose of the invention is to provide a short-term traffic flow prediction method based on spatio-temporal correlation that improves the analysis of traffic flow data and the mining of its features, and thereby further improves the prediction accuracy of the model.
The technical solution that achieves this purpose is a short-term traffic flow prediction method based on spatio-temporal correlation, comprising the following steps:
Step 1: select the road segment for which traffic flow is to be predicted and the breakpoints on that segment, and obtain the historical short-term traffic flow data of all breakpoints on the selected segment;
Step 2: determine the prediction period of the short-term traffic flow prediction from the obtained historical short-term traffic flow data;
Step 3: verify, from the historical short-term traffic flow data of the breakpoints, whether the historical traffic flow data of the breakpoint to be predicted is periodic;
Step 4: normalize the traffic flow data, and divide the normalized data set into a training data set and a test data set;
Step 5: use the SARIMA model to perform prediction analysis on the test data set and obtain an initial prediction result;
Step 6: feed the prediction result of the SARIMA model, as one input feature, into a random forest model to obtain the final prediction result;
Step 7: compare the test data set with the final prediction data and analyze the error.
Further, the historical short-term traffic flow data of a breakpoint in step 1 comprises the data collection date, the time, and the traffic speed and traffic flow values at the breakpoint.
Further, the prediction period in step 2 is 5 minutes.
Further, verifying in step 3 whether the historical traffic flow data of the breakpoint to be predicted is periodic means verifying periodicity with the autocorrelation function, as follows:
For the sequence values $X_t, X_{t-1}, \ldots, X_{t-k}$ that make up the time series, the autocorrelation coefficient $r_k$ measures the degree of autocorrelation between sequence values; $r_k$ is the correlation between observations separated by $k$ periods and is calculated by the following formula:
$$r_k = \frac{\sum_{t=k+1}^{n}\left(X_t-\bar{X}\right)\left(X_{t-k}-\bar{X}\right)}{\sum_{t=1}^{n}\left(X_t-\bar{X}\right)^2}$$
where $n$ is the length of the time series, $\bar{X}$ is the mean of the time series data, and $X_{t-k}$ is the sequence value $k$ periods away from $X_t$.
Further, the normalization method in step 4 proceeds as follows:
The minimum value min and the maximum value max of a given sample of historical traffic flow data are computed, and the data are normalized with the min-max method so that the normalized traffic flow data fall within [0, 1]. That is, from the traffic flow data set $F=\{f_t \mid t=1,2,\ldots,T\}$ the maximum value max and the minimum value min of the set are obtained, and for each value in the set the following is computed:
$$x' = \frac{x-\min}{\max-\min}$$
where $x'$ is the normalized traffic flow value, min is the minimum of the sample data, max is the maximum of the sample data, and $x$ is the value to be normalized.
Further, dividing the normalized data set into a training data set and a test data set in step 4 means that, after normalization, 80% of the historical traffic flow data is used as the training set and 20% as the test set.
Further, using the SARIMA model in step 5 to perform prediction analysis on the test data set and obtain an initial prediction result specifically comprises the following steps:
(5.1) Test whether the original traffic flow data is a stationary series: if the test shows the data is non-stationary, make it stationary; if the test shows the data is stationary, proceed directly to step (5.2);
(5.2) Choose values for the four SARIMA parameters p, q, P, and Q according to the ACF and PACF of the stationary time series and the minimum-AIC criterion;
(5.3) During prediction, use the data from the d days before the prediction time t as training data and predict dynamically with a sliding window; the model is refitted and its parameters adjusted after every n executions, finally yielding the initial prediction result of step 5.
Further, feeding the prediction result of the SARIMA model, as one input feature, into the random forest model in step 6 to obtain the final prediction result is specifically as follows:
The initial prediction result of the SARIMA model is used as an input feature that reflects the periodic pattern and is fed into the random forest model together with the other input features; the parameters are tuned by grid search, and the predicted values are finally obtained.
Further, comparing the test data set with the final prediction data and analyzing the error in step 7 is specifically as follows:
The prediction error is analyzed with the mean absolute percentage error (MAPE) and the root mean square error (RMSE), calculated as follows:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|u_i-\hat{u}_i\right|}{u_i}\times 100\%$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(u_i-\hat{u}_i\right)^2}$$
where $n$ is the number of test data points selected, $u_i$ is the actual traffic flow in the $i$-th period, and $\hat{u}_i$ is the flow predicted by the model for the $i$-th period.
Compared with the prior art, the invention has the following significant advantages: (1) it mines the characteristics of both the periodic part and the nonlinear part of traffic flow data in depth; (2) it analyzes traffic flow from the perspective of spatio-temporal correlation and decomposes the flow data into a periodic part with a clear trend and a random fluctuation part, which further improves the prediction accuracy of the model.
Brief description of the drawings
FIG. 1 is the main flowchart of the method of the invention.
FIG. 2 is a structure diagram of the method of the invention.
FIG. 3 and FIG. 4 are periodicity verification plots of the traffic flow data.
FIG. 5 shows the basic steps of prediction with the SARIMA model.
FIG. 6, FIG. 7, and FIG. 8 compare the prediction results of the method of the invention with those of existing methods.
Detailed description of the embodiments
The short-term traffic flow prediction method based on spatio-temporal correlation of the invention comprises the following steps:
Step 1: select the road segment for which traffic flow is to be predicted and the breakpoints on that segment, and obtain the historical short-term traffic flow data of all breakpoints on the selected segment;
The historical short-term traffic flow data of a breakpoint comprises the data collection date, the time, and the traffic speed and traffic flow values at the breakpoint.
Step 2: determine the prediction period of the short-term traffic flow prediction from the obtained historical short-term traffic flow data;
For example, the prediction period is 5 minutes.
Step 3: verify, from the historical short-term traffic flow data of the breakpoints, whether the historical traffic flow data of the breakpoint to be predicted is periodic, as follows:
For the sequence values $X_t, X_{t-1}, \ldots, X_{t-k}$ that make up the time series, the autocorrelation coefficient $r_k$ measures the degree of autocorrelation between sequence values; $r_k$ is the correlation between observations separated by $k$ periods and is calculated by the following formula:
$$r_k = \frac{\sum_{t=k+1}^{n}\left(X_t-\bar{X}\right)\left(X_{t-k}-\bar{X}\right)}{\sum_{t=1}^{n}\left(X_t-\bar{X}\right)^2}$$
where $n$ is the length of the time series, $\bar{X}$ is the mean of the time series data, and $X_{t-k}$ is the sequence value $k$ periods away from $X_t$.
Step 4: normalize the traffic flow data, and divide the normalized data set into a training data set and a test data set;
The normalization method proceeds as follows:
The minimum value min and the maximum value max of a given sample of historical traffic flow data are computed, and the data are normalized with the min-max method so that the normalized traffic flow data fall within [0, 1]. That is, from the traffic flow data set $F=\{f_t \mid t=1,2,\ldots,T\}$ the maximum value max and the minimum value min of the set are obtained, and for each value in the set the following is computed:
$$x' = \frac{x-\min}{\max-\min}$$
where $x'$ is the normalized traffic flow value, min is the minimum of the sample data, max is the maximum of the sample data, and $x$ is the value to be normalized.
Dividing the normalized data set into a training data set and a test data set means that, after normalization, 80% of the historical traffic flow data is used as the training set and 20% as the test set.
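The normalization and split of step 4 can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the function names and the sample counts are made up, and splitting in time order (earliest 80% for training) is an assumption, since the text gives only the proportions.

```python
def min_max_normalize(flows):
    """Map each value to [0, 1] via x' = (x - min) / (max - min)."""
    lo, hi = min(flows), max(flows)
    if hi == lo:
        # Constant series: the formula is undefined, so return zeros (assumption).
        return [0.0 for _ in flows], lo, hi
    return [(x - lo) / (hi - lo) for x in flows], lo, hi


def denormalize(values, lo, hi):
    """Invert the scaling so predictions can be read as vehicle counts again."""
    return [v * (hi - lo) + lo for v in values]


def chronological_split(series, train_fraction=0.8):
    """Keep time order: the first 80% is training data, the rest is test data."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical 5-minute vehicle counts at one breakpoint.
flows = [12, 30, 48, 66, 84, 102, 84, 66, 48, 30]
scaled, lo, hi = min_max_normalize(flows)
train, test = chronological_split(scaled)
```

Keeping `lo` and `hi` makes it possible to map model outputs back to the original scale before computing errors.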
Step 5: use the SARIMA model to perform prediction analysis on the test data set and obtain an initial prediction result, which specifically comprises the following steps:
(5.1) Test whether the original traffic flow data is a stationary series: if the test shows the data is non-stationary, make it stationary; if the test shows the data is stationary, proceed directly to step (5.2);
(5.2) Choose values for the four SARIMA parameters p, q, P, and Q according to the ACF and PACF of the stationary time series and the minimum-AIC criterion;
(5.3) During prediction, use the data from the d days before the prediction time t as training data and predict dynamically with a sliding window; the model is refitted and its parameters adjusted after every n executions, finally yielding the initial prediction result of step 5.
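The sliding-window schedule of step (5.3) can be sketched as follows. To keep the example dependency-free, a per-phase seasonal mean stands in for the SARIMA fit; it is not SARIMA, and a real implementation would fit a SARIMA model at each refit and feed it new observations between refits. All function and parameter names here are hypothetical.

```python
def fit_seasonal_mean(series, start, end, season):
    """Stand-in for a SARIMA fit: mean of series[start:end] at each seasonal phase."""
    sums = [0.0] * season
    counts = [0] * season
    for idx in range(start, end):
        sums[idx % season] += series[idx]
        counts[idx % season] += 1
    return [s / c for s, c in zip(sums, counts)]


def rolling_forecast(series, first_t, season, d_days, refit_every):
    """Forecast series[first_t:] one step at a time.

    The training window is the d_days * season samples before the current
    time t, and the model is refitted after every `refit_every` forecasts.
    """
    window = d_days * season
    if first_t < window:
        raise ValueError("need d_days of history before the first forecast")
    forecasts, refits, profile = [], 0, None
    for step, t in enumerate(range(first_t, len(series))):
        if step % refit_every == 0:
            profile = fit_seasonal_mean(series, t - window, t, season)
            refits += 1
        forecasts.append(profile[t % season])
    return forecasts, refits


# Toy series: 5 "days" of 4 samples each, with an exactly repeating pattern.
season = 4
series = [10, 20, 30, 20] * 5
forecasts, refits = rolling_forecast(series, first_t=12, season=season,
                                     d_days=3, refit_every=2)
```

The embodiment described later uses d = 3 days and a refit after every 12 executions; the toy values above are smaller only to keep the example short.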
Step 6: feed the prediction result of the SARIMA model, as one input feature, into the random forest model to obtain the final prediction result, specifically as follows:
The initial prediction result of the SARIMA model is used as an input feature that reflects the periodic pattern and is fed into the random forest model together with the other input features; the parameters are tuned by grid search, and the predicted values are finally obtained.
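How the SARIMA output might enter the random forest as one feature can be sketched as follows. The text does not list the "other input features", so the lagged flows, the previous speed reading, and the time-of-day index used here are assumptions chosen for illustration, as are the function name and the sample values. The sketch only assembles the feature matrix; it does not train a model.

```python
def build_feature_rows(flow, speed, sarima_forecast, season, n_lags=2):
    """Return (X, y) with one feature row and one target per time step t >= n_lags.

    Row layout (all hypothetical except the first column):
      [SARIMA forecast for t, flow[t-1], ..., flow[t-n_lags], speed[t-1], t mod season]
    """
    rows, targets = [], []
    for t in range(n_lags, len(flow)):
        lagged_flow = [flow[t - k] for k in range(1, n_lags + 1)]
        row = [sarima_forecast[t]] + lagged_flow + [speed[t - 1], t % season]
        rows.append(row)
        targets.append(flow[t])
    return rows, targets


# Made-up values for one breakpoint at 5-minute resolution.
flow = [10, 12, 15, 14, 11]
speed = [60, 58, 55, 56, 59]
sarima_forecast = [10.5, 11.8, 14.6, 14.2, 11.3]
X, y = build_feature_rows(flow, speed, sarima_forecast, season=216)
```

The resulting `X` and `y` could then be passed to a random forest regressor, for example scikit-learn's `RandomForestRegressor` wrapped in `GridSearchCV` for the grid-based parameter tuning; that library choice is also an assumption, not something the text specifies.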
Step 7: compare the test data set with the final prediction data and analyze the error, specifically as follows:
The prediction error is analyzed with the mean absolute percentage error (MAPE) and the root mean square error (RMSE), calculated as follows:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|u_i-\hat{u}_i\right|}{u_i}\times 100\%$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(u_i-\hat{u}_i\right)^2}$$
where $n$ is the number of test data points selected, $u_i$ is the actual traffic flow in the $i$-th period, and $\hat{u}_i$ is the flow predicted by the model for the $i$-th period.
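The two error measures of step 7 can be computed as in the following minimal sketch. It assumes the standard definitions of MAPE and RMSE; the function names and sample values are illustrative, and rejecting zero actual flows is a design choice made here because MAPE divides by the actual value.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(u == 0 for u in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(u - p) / u for u, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the flow values."""
    total = sum((u - p) ** 2 for u, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


# Made-up actual and predicted vehicle counts for three 5-minute periods.
actual = [100, 200, 400]
predicted = [110, 190, 400]
mape_value = mape(actual, predicted)
rmse_value = rmse(actual, predicted)
```

MAPE is scale-free, which makes periods with different traffic volumes comparable, whereas RMSE penalizes large individual errors more heavily.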
For a better understanding of the invention, its content is further described below with reference to the drawings and a specific embodiment.
Embodiment 1
The main flowchart and the structure diagram of the short-term traffic flow prediction method based on spatio-temporal correlation in this embodiment are shown in FIG. 1 and FIG. 2. The method comprises the following steps:
Step 1: select the road segment for which traffic flow is to be predicted and the breakpoints on that segment, and obtain the historical short-term traffic flow data of all breakpoints on the selected segment;
Step 2: determine the prediction period of the short-term traffic flow prediction from the obtained historical short-term traffic flow data;
Step 3: verify, from the historical short-term traffic flow data of the breakpoints, whether the historical traffic flow data of the breakpoint to be predicted is periodic;
Step 4: normalize the traffic flow data, and divide the normalized data set into a training data set and a test data set;
Step 5: use the SARIMA model to perform prediction analysis on the test data set and obtain an initial prediction result;
Step 6: feed the prediction result of the SARIMA model, as one input feature, into a random forest model to obtain the final prediction result;
Step 7: compare the test data set with the final prediction data and analyze the error.
In this embodiment, the traffic flow data is collected by loop detectors, and each traffic flow value is the number of vehicles passing a specific breakpoint within a given time interval; here the interval is 5 minutes. The set of historical observations is written as $F=\{f_t \mid t=1,2,\ldots,T\}$, where $f_t$ is the traffic flow parameter of a specific breakpoint of the road network at time $t$. The difference between time $t$ and time $t+1$ is the prediction interval, which in this embodiment is 5 minutes.
To mine the periodic pattern of traffic flow, the data set must first be verified to be periodic; this embodiment does so with the autocorrelation function. Using data from 6:00 to 24:00 each day at 5-minute intervals as experimental data, verification shows that the traffic flow data has a daily period of 216 samples (18 hours × 12 samples per hour), which matches the actual situation. The periodicity verification plots are shown in FIG. 3 and FIG. 4.
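The periodicity check can be illustrated with a small sketch that evaluates the sample autocorrelation coefficient at selected lags. The series below is synthetic (a sine wave with a period of 216 samples), not the data of the embodiment, and the function and variable names are hypothetical.

```python
import math


def autocorrelation(series, k):
    """Sample autocorrelation r_k between observations k periods apart."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    # 0-based range(k, n) corresponds to t = k+1 .. n in 1-based notation.
    num = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    return num / denom


# 6:00 to 24:00 at 5-minute intervals gives 18 * 12 = 216 samples per day.
samples_per_day = (24 - 6) * 60 // 5

# Synthetic flow with an exact daily cycle, five days long (illustrative only).
series = [
    100 + 50 * math.sin(2 * math.pi * (t % samples_per_day) / samples_per_day)
    for t in range(5 * samples_per_day)
]

r_half_day = autocorrelation(series, samples_per_day // 2)
r_full_day = autocorrelation(series, samples_per_day)
```

For a series with a daily cycle, the coefficient at the full-day lag is strongly positive while the coefficient at the half-day lag is negative, which is the kind of pattern a periodicity plot would show as a peak at lag 216.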
Next, the minimum value min and the maximum value max of a given sample of historical traffic flow data are computed, and the data are normalized with the min-max method so that the normalized traffic flow data fall within [0, 1]. That is, from the traffic flow data set $F=\{f_t \mid t=1,2,\ldots,T\}$ the maximum value max and the minimum value min of the set are obtained, and for each value in the set the following is computed:
$$x' = \frac{x-\min}{\max-\min} \qquad (1)$$
where $x'$ is the normalized traffic flow value, min is the minimum of the sample data, max is the maximum of the sample data, and $x$ is the value to be normalized.
This embodiment uses data from 25 working days as experimental data: the traffic flow data of 20 days serves as training data and that of 5 days as test data.
The SARIMA model describes seasonal time series; it is a variant of the autoregressive integrated moving average (ARIMA) model [14].
Suppose a traffic flow series $\{X_t\}$ can be fitted by a SARIMA$(p,d,q)(P,D,Q)_S$ model, where the parameter $S$ is the length of the seasonal period, the parameter $d$ is the number of differences needed to make the series stationary, and the parameter $D$ is the order of seasonal differencing required. Let the stationary time series after differencing be $\{Y_t\}$, as shown in equation (2), where $B$ is the backshift operator, which acts on the traffic flow as shown in equation (3):
$$Y_t = (1-B)^d\,(1-B^S)^D X_t \qquad (2)$$
$$B^j X_t = X_{t-j} \qquad (3)$$
The SARIMA model can then be expressed in the form of equation (4):
$$\phi(B)\,\Phi(B^S)\,(1-B)^d\,(1-B^S)^D X_t = c + \theta(B)\,\Theta(B^S)\,\varepsilon_t \qquad (4)$$
which, by equation (2), is equivalent to $\phi(B)\,\Phi(B^S)\,Y_t = c + \theta(B)\,\Theta(B^S)\,\varepsilon_t$.
where the parameter $c$ is a constant term, $\varepsilon_t$ is the residual term of the model and satisfies $\varepsilon_t \sim N(0,\delta^2)$, and $B^S$ is the seasonal backshift operator. The polynomials satisfy the following relations:
$$\phi(B) = 1-\phi_1 B-\phi_2 B^2-\cdots-\phi_p B^p, \qquad (5)$$
$$\Phi(B^S) = 1-\Phi_1 B^{S}-\Phi_2 B^{2S}-\cdots-\Phi_P B^{PS}, \qquad (6)$$
$$\theta(B) = 1-\theta_1 B-\theta_2 B^2-\cdots-\theta_q B^q, \qquad (7)$$
$$\Theta(B^S) = 1-\Theta_1 B^{S}-\Theta_2 B^{2S}-\cdots-\Theta_Q B^{QS}. \qquad (8)$$
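The differencing in equation (2) can be made concrete with a short sketch that applies $(1-B^S)^D$ and then $(1-B)^d$ to a series. The function names and the toy series are illustrative assumptions; the season length of 4 is chosen only to keep the example small.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): y_t = x_t - x_{t-lag}; the first `lag` values are dropped."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d, D, S):
    """Compute Y_t = (1 - B)^d (1 - B^S)^D X_t as in equation (2)."""
    out = list(series)
    for _ in range(D):
        out = difference(out, S)   # seasonal differencing with period S
    for _ in range(d):
        out = difference(out, 1)   # ordinary first differencing
    return out


# Toy series: a linear trend plus a repeating pattern of period 4.
S = 4
pattern = [0, 5, 10, 5]
x = [t + pattern[t % S] for t in range(12)]
y = sarima_difference(x, d=1, D=1, S=S)
```

Seasonal differencing removes the repeating pattern and leaves a constant, and the first difference then removes that constant, so the differenced toy series contains only zeros. Each differencing pass shortens the series by its lag, here by 4 + 1 = 5 values.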
The basic steps of prediction with the SARIMA$(p,d,q)(P,D,Q)_S$ model are shown in FIG. 5. In this embodiment, the original traffic flow data is first tested for stationarity. The test shows that the data is non-stationary, so it is made stationary, which gives d = 1, D = 1, and S = 156. In the second step, the values of p, q, P, and Q are chosen according to the ACF and PACF of the resulting stationary time series and the minimum-AIC criterion. During prediction, the data from the three days before the prediction time t is used as training data, and prediction proceeds dynamically with a sliding window; the model is refitted and its parameters adjusted after every 12 executions, and one week of traffic flow data in the test set is finally predicted.
Random forest (RF) is a powerful tool for data mining and machine learning. It is an ensemble learning method that combines a large number of regression trees to produce a prediction, building a strong model from many weak ones. The RF prediction process can be interpreted intuitively by evaluating the importance of the predictors; the algorithm is robust to noise and outliers in the data, runs efficiently on large traffic data sets, and adapts well to high-dimensional data. In this embodiment, the initial prediction result of the SARIMA model is used as a feature reflecting the periodic pattern, combined with the other input features, and fed into the random forest model to obtain the final prediction result. Three periods are selected, 7:00 to 20:00 (period 1), 8:00 to 10:00 (period 2), and 14:00 to 16:00 (period 3), and the test data set is compared with the predicted data for error analysis. The error is evaluated with two indicators, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), calculated as follows:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|u_i-\hat{u}_i\right|}{u_i}\times 100\%$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(u_i-\hat{u}_i\right)^2}$$
where $n$ is the number of test data points selected, $u_i$ is the actual traffic flow in the $i$-th period, and $\hat{u}_i$ is the flow predicted by the model for the $i$-th period. Comparisons of the prediction results of the method of the invention with those of existing methods are shown in FIG. 6, FIG. 7, and FIG. 8.
In summary, the method of the invention mines the randomness and uncertainty of traffic flow data in depth, takes full account of the spatio-temporal correlation in the data, and decomposes the flow data into a periodic part with a clear trend and a random fluctuation part for analysis, thereby improving the prediction accuracy of traffic flow data.