AU2020102350A4

AU2020102350A4 - A Spark-Based Deep Learning Method for Data-Driven Traffic Flow Forecasting

Info

Publication number: AU2020102350A4
Application number: AU2020102350A
Authority: AU
Inventors: Xiaonan Gao; Huaqing Li; Dawen Xia
Original assignee: Guizhou Minzu University
Current assignee: Guizhou Minzu University
Priority date: 2020-09-21
Filing date: 2020-09-21
Publication date: 2020-10-29
Anticipated expiration: 2028-09-21

Abstract

Taking advantage of big data technology has become a new concept and practice to improve the capability of traffic management and control in data-driven intelligent transportation systems, and timely and accurate traffic flow forecasting (TFF) in particular is significant for mitigating traffic congestion. To solve the problems of calculation and storage in dealing with traffic big data using the traditional centralized models on a single machine, this invention presents a Spark-based Weighted Bidirectional Long Short-Term Memory (SW-BiLSTM) model to improve the robustness, accuracy, and timeliness of TFF in real time. Specifically, we utilize the resilient distributed dataset (RDD) to preprocess mobile trajectory big data (e.g., large-scale GPS trajectories of taxicabs) based on the Spark parallel distributed computing platform and then employ the Kalman filter (KF) approach to eliminate abnormal GPS points and achieve discrete smoothing of traffic flow data. Moreover, a distributed SW-BiLSTM model on Spark is proposed to enhance the accuracy and efficiency of real-time TFF, combined with the normal distribution to weigh the interaction between adjacent road segments and the time window for achieving the optimization of BiLSTM. Finally, the SW-BiLSTM model is implemented on a Spark parallel computing framework to improve the efficiency and scalability of TFF. The present invention has broad applications in big data analytics.

Description

1. Technical Field

[0001] The present invention relates to the field of big-data-driven intelligent transportation systems (ITSs).

2. Background

[0002] With rapid urbanization, increased vehicle ownership has made traffic congestion an increasingly severe problem. Traffic flow forecasting (TFF) has therefore become one of the essential research fields of intelligent transportation systems (ITS) and advanced traffic management systems (ATMS). The underlying issue in this field is how to utilize past (historical) and current traffic flow big data to forecast traffic flow accurately at a future time interval. Accurate and timely TFF is increasingly crucial for alleviating traffic congestion, improving the urban environment, and helping drivers make better travel decisions. However, real-time TFF is uncertain owing to the stochastic and nonlinear characteristics of traffic flow. Meanwhile, the traffic flow on the targeted road segment (TRS) is often affected by the adjacent road segments, which poses a challenge to TFF, especially in complex transportation networks.

[0003] In recent years, data-driven TFF has become a hot research topic in the field of ITS. The many cutting-edge models proposed by researchers to solve the TFF problem can be roughly divided into two groups: (1) parametric models, such as autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA), and Kalman filter (KF); and (2) nonparametric models, such as k-nearest neighbor (KNN), support vector regression (SVR), and neural networks (NNs). Although these methods have achieved good forecasting performance in processing small-scale sample data through a shallow structure model, they still have limitations in handling large-scale sample data, especially for complex and variable nonlinear traffic flow data. Recently, deep learning has been successfully utilized to deal with these problems.

[0004] As an emerging machine learning method, deep learning has achieved remarkable performance in image processing, voice recognition, medical diagnosis, and intelligent transportation. In comparison to conventional shallow learning architectures, the deep neural network can model deep complex nonlinear relationships using distributed and hierarchical feature representation. Therefore, deep learning methods have been increasingly prominent in TFF in recent years. The LSTM neural network can capture features over a comparatively long time span and is widely employed in TFF, but in the existing methods LSTM utilizes only past information rather than future information. The BiLSTM neural network, however, can process bi-directional data based on two different hidden layers and thus capture better information using both past and future data. Real-time TFF is a time series forecasting problem: traffic conditions are related not only to the state at the current time interval but also to the historical periods. Currently, when it comes to large-scale traffic flow data, there are still some issues in processing big traffic data, such as low accuracy and the inability to use past and future traffic flow information effectively, which significantly reduce the forecasting performance in accuracy and timeliness. BiLSTM is capable of taking full advantage of past and future information and of capturing the nonlinear characteristics of traffic flow, particularly for data covering a more extended period, to enhance the forecasting accuracy of TFF. With the above-mentioned advantages, the existing studies demonstrate that BiLSTM can produce a better performance than RNN, LSTM, ARIMA, SAE, SVM, CNN-LSTM, and DBN. However, only a few works have addressed real-time TFF with large-scale sample data on the Spark parallel distributed computing platform. This invention may be the first attempt to implement the optimization of BiLSTM based on the Spark framework for TFF. The optimized BiLSTM model implemented on Spark is combined with the normal distribution to weigh the influence of adjacent road segments, and the real-time traffic flow on the TRS can be accurately forecasted by means of the time window.

[0005] In this invention, to improve the accuracy and timeliness of TFF, we develop a distributed Spark-based optimization model on the weighted BiLSTM neural network (SW-BiLSTM). More specifically, the traditional BiLSTM model is optimized by combining it with a normal distribution and time window, and then the optimized BiLSTM model is implemented on a Spark parallel computing platform (SW-BiLSTM). Finally, SW-BiLSTM is employed to forecast real-time traffic flow with large-scale mobile trajectory data.

3. SW-BiLSTM model

[0006] In this invention, we put forward a distributed SW-BiLSTM model on Spark to improve the performance of traffic flow forecasting (TFF) in terms of accuracy, timeliness, and scalability, thereby addressing the real-time application problem in ITS.

3.1. Overview

[0007] In the SW-BiLSTM model, for improving the accuracy and timeliness of TFF with mobile trajectory big data (e.g., large-scale GPS trajectories of taxicabs), we employ the normal distribution and time window to optimize the traditional BiLSTM model and then implement the optimized BiLSTM model on Spark for the parallel forecasting and distributed computing of real-time traffic flow. As shown in Fig. 1, our method for real-time TFF based on SW-BiLSTM mainly consists of three steps. First, the transformations and actions of RDD are utilized for data preprocessing, which includes creating RDD for data reading, converting RDD for data computing, and starting RDD for data storage. Furthermore, a distributed SW-BiLSTM model is proposed to address the TFF problem in real time. Specifically, in SW-BiLSTM, the normal distribution is adopted to weigh the influence of adjacent road segments on the traffic flow on the targeted road segment (TRS), and a time window of size eight is used for TFF. Finally, the SW-BiLSTM model is implemented on the Spark parallel distributed computing platform by different RDD partitions.

3.2. Preprocessing

[0008] We preprocess traffic flow data to reduce the impact on the accuracy of TFF of traffic anomalies, uncertain traffic conditions, and signal transmission. The original raw data are processed based on the Spark framework, namely by performing data preprocessing with RDD (the core of Spark). RDD is a fault-tolerant, parallel data structure enabling users to store intermediate results in memory. It also controls the partitioning of the data set to achieve optimal data storage and processing and handles the data via a rich set of operators. Therefore, two types of operators of RDD (i.e., transformations and actions) are used for data preprocessing. Specifically, data are loaded into memory by transformations, then RDD is transformed, and finally data are stored through actions. As illustrated in Fig. 2, the data preprocessing based on the Spark parallel distributed computing platform mainly includes three steps: RDD creation, RDD conversion, and RDD starting.

* [0009] Step 1: RDD creation.

[0010] Mobile trajectory data (e.g., GPS trajectory data) are extracted and uploaded to HDFS, which serves as the external storage from which RDDs are created. The data stored in HDFS are read through the textFile method of the SparkContext object and then loaded into the memory of the cluster. In this way, many RDDs can be created.

* [0011] Step 2: RDD conversion.

[0012] Data processing and conversion on RDD are conducted by calling map, filter, flatMap, sortByKey, distinct, reduceByKey, and other operators of Spark, which mainly consists of three tasks.

[0013] Task 1: The vehicle information on the TRS at the current time interval (CTI) is extracted. We first convert the data on each node into < key1, value > pairs using the flatMap operator, and then employ the map operator to set key1 = time and vehicle ID and value = the number of the TRS. Moreover, the filter operator is utilized to filter out the GPS trajectory data that do not belong to the selected TRS in the RDD. Finally, the sortByKey operator is used to sort the RDD < key1, value > by key1, and the distinct operator is employed to remove duplicates of the same vehicle at the CTI in the RDD to obtain the vehicle information on the TRS.

[0014] Task 2: The data information extracted in Step 1 is read. We first convert the data on each node into < key2, value > pairs via the flatMap operator, and then the map operator is adopted to set key2 = time and area number and to add one to value. Finally, the reduceByKey operator is utilized to perform the reduce operation according to key2. The number of vehicles at the CTI is counted, and the total number of vehicles on the selected TRS in each time interval (i.e., the traffic flow) is obtained.

[0015] Task 3: The total number of vehicles on the TRS at the CTI t is incorporated into a one-dimensional array Xt, and these arrays compose a matrix X. We first convert the data distributed on each node into < key3, value > pairs by the flatMap operator, and then the map operator is employed to set key3 = time interval and value = the total number of vehicles on each TRS. Finally, the pairs are sorted by key3 through the sortByKey operator, and then the results are output.

* [0016] Step 3: RDD starting.

[0017] The data flow is transferred between different RDD partitions by the transformation operators; then an action operator is started, and the data are stored in HDFS by calling saveAsTextFile to achieve distributed storage of data.
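The key/value logic of Tasks 1 to 3 can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code: list comprehensions, `set`, `sorted`, and `Counter` stand in for the map, filter, distinct, sortByKey, and reduceByKey operators. The record layout `(time_interval, vehicle_id, segment_id)`, the sample records, and the function name `traffic_flow` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical record layout (not given in the source):
# (time_interval, vehicle_id, segment_id)
records = [
    (0, "taxi_a", "TRS"),
    (0, "taxi_a", "TRS"),    # duplicate GPS fix of the same vehicle
    (0, "taxi_b", "TRS"),
    (0, "taxi_c", "other"),  # not on the targeted road segment
    (1, "taxi_a", "TRS"),
]


def traffic_flow(records, target_segment):
    """Count distinct vehicles per time interval on one road segment."""
    # Task 1: key1 = (time, vehicle ID), value = segment; filter, dedupe, sort.
    pairs = [((t, vid), seg) for t, vid, seg in records]       # map
    on_target = [p for p in pairs if p[1] == target_segment]   # filter
    unique = sorted(set(on_target))                            # distinct + sortByKey
    # Task 2: key2 = (time, segment), value = 1; sum the ones per key.
    counts = Counter((t, seg) for (t, _vid), seg in unique)    # map + reduceByKey
    # Task 3: key3 = time interval, value = vehicle count; sort by key.
    return sorted((t, n) for (t, _seg), n in counts.items())   # sortByKey


flows = traffic_flow(records, "TRS")
```

Here `flows` holds one `(time interval, vehicle count)` pair per interval, which corresponds to the traffic flow series that is later arranged into the matrix X.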

3.3. Model

[0018] LSTM can address the long-term dependency learning problem, and thus has great potential to forecast traffic flow. The LSTM unit maintains a separate memory cell, which is the crucial element of the LSTM neural network structure. LSTM can only utilize past information rather than future information. BiLSTM, however, can use information from both the past and the future, and consists of two stacked unidirectional LSTMs. In this invention, we employ the past traffic flow with the positive state and the future traffic flow with the reverse state to forecast real-time traffic flow. Specifically, we first input the traffic flow into the forward LSTM network layer to get an output vector and then input the traffic flow in the opposite direction into the backward LSTM network layer to get another output vector. Finally, the final output is produced by combining the two output vectors.

[0019] In this invention, it is necessary to measure the weight of the influence of the adjacent road segments on the TRS because traffic flow is affected by the traffic flow on the adjacent road segments and does not exist independently. Parameters $\sigma$ and $\mu$ in the normal distribution determine the shape and position of the curve. Thus, when the parameter $\sigma$ is fixed, the closer the sample point is to the parameter $\mu$, the higher the weighted value is. That is why the normal distribution is used to calculate the weight of the targeted road segment and adjacent road segments. Moreover, to enhance the forecasting accuracy of TFF, this invention takes the variation of traffic flow between the adjacent road segment and the TRS into consideration, as shown in Formula (1).

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)(d_i - d_m) \tag{1}$$

where $\mu$ of the normal distribution represents the traffic flow on the TRS, $\sigma$ can be set to 0.6, $x$ denotes the discrete value of the traffic flow on each road segment, $d_i$ is the traffic flow on each road segment, and $d_m$ represents the traffic flow on the TRS.
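As a rough illustration of how the normal distribution favors road segments whose flow is close to that of the TRS, the sketch below evaluates only the Gaussian density term, with the mean set to the TRS flow and the standard deviation set to 0.6 as stated above. The function name `gaussian_weight`, the sample flow values, and the assumption that flows are normalized to a small numeric range are illustrative and not taken from the source; the variation term between adjacent segment and TRS is deliberately left out.

```python
import math


def gaussian_weight(x, mu, sigma=0.6):
    """Normal density at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Hypothetical normalized flows: the TRS and two adjacent road segments.
trs_flow = 0.50
adjacent_flows = [0.55, 1.40]
weights = [gaussian_weight(x, trs_flow) for x in adjacent_flows]
```

The segment whose flow (0.55) is nearer to the TRS flow receives the larger weight, which matches the statement that the weight grows as the sample point approaches the mean.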

[0020] With the traffic flow $V_t^i$ on the TRS $i$ at the CTI $t$, we extract the past (historical) traffic flows $V_{t-7}^i, V_{t-6}^i, V_{t-5}^i, V_{t-4}^i, V_{t-3}^i, V_{t-2}^i, V_{t-1}^i$. Then, a time window with a size of 8 is employed for TFF, as shown in Formula (2).

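$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_{n-8} \end{pmatrix} = \begin{pmatrix} V_1 & V_2 & V_3 & V_4 & V_5 & V_6 & V_7 & V_8 \\ V_2 & V_3 & V_4 & V_5 & V_6 & V_7 & V_8 & V_9 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ V_{n-8} & V_{n-7} & V_{n-6} & V_{n-5} & V_{n-4} & V_{n-3} & V_{n-2} & V_{n-1} \end{pmatrix} \tag{2}$$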

[0021] In this invention, according to the time series composed of time intervals t - 7, t - 6, t - 5, t - 4, t - 3, t - 2, t - 1, t, the weighted traffic flow is input into the BiLSTM model to produce forecasting results at the next time interval t + 1.
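The sliding time window can be sketched as follows. This is an illustrative helper rather than the patented implementation: each row holds eight consecutive flow values, and the value that follows the window serves as the forecasting target. The function name `make_windows` and the stand-in series are assumptions.

```python
def make_windows(flows, window=8):
    """Split a flow series into (window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(flows) - window):
        inputs.append(flows[start:start + window])   # eight consecutive values
        targets.append(flows[start + window])        # value at the next interval
    return inputs, targets


series = list(range(1, 13))   # stand-in for V1 ... V12
X, y = make_windows(series)
```

With twelve values the helper yields four rows, the last one running from the fourth to the eleventh value, so each row has a following value available as its target.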

[0022] Therefore, we train the BiLSTM model through the following formulas:

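$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \tag{3}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \tag{4}$$

where $W$ is a weight matrix and $b$ is a bias vector. The cell state in the hidden layer of LSTM is obtained by the following formulas:

$$c_t = f_t c_{t-1} + i_t\, g(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \tag{5}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \tag{6}$$

$$h_t = o_t\, h(c_t), \tag{7}$$

$$\overrightarrow{h_t} = \mathrm{LSTM}(h_{t-1}, x_t, c_{t-1}), \tag{8}$$

$$\overleftarrow{h_t} = \mathrm{LSTM}(h_{t+1}, x_t, c_{t+1}), \tag{9}$$

$$H_t = [\overrightarrow{h_t}, \overleftarrow{h_t}], \tag{10}$$

where $i_t$ represents the output of the input gate at the current time step, $f_t$ denotes the forget gate, $c_t$ represents the cell state, $o_t$ denotes the output gate, and $h_t$ is the hidden layer output. $\sigma(\cdot)$ represents the sigmoid activation function, and $W$ denotes the connection weight matrices. The hidden state $H_t$ of BiLSTM at time interval $t$ contains the forward hidden state $\overrightarrow{h_t}$ and the backward hidden state $\overleftarrow{h_t}$.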
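To make the gate equations concrete, the following is a minimal scalar sketch (hidden size 1, pure Python) of one LSTM step with cell-state terms in the gates, followed by a bidirectional pass that pairs forward and backward hidden states. It is not the Spark implementation. Several points are assumptions: the activation functions g and h are taken to be tanh, initial states are zero, and the parameter names and values are arbitrary placeholders.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input and state, following Formulas (3)-(7)."""
    i = sigmoid(p["wxi"] * x + p["whi"] * h_prev + p["wci"] * c_prev + p["bi"])
    f = sigmoid(p["wxf"] * x + p["whf"] * h_prev + p["wcf"] * c_prev + p["bf"])
    # g and h are assumed to be tanh; the source does not name them.
    c = f * c_prev + i * math.tanh(p["wxc"] * x + p["whc"] * h_prev + p["bc"])
    o = sigmoid(p["wxo"] * x + p["who"] * h_prev + p["wco"] * c + p["bo"])
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, p):
    """Run the cell over a sequence and return the hidden state at each step."""
    h, c, out = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        out.append(h)
    return out


def bilstm(xs, p_fwd, p_bwd):
    """Pair forward and backward hidden states per time step, as in Formula (10)."""
    fwd = run_lstm(xs, p_fwd)
    bwd = run_lstm(xs[::-1], p_bwd)[::-1]   # process reversed input, then realign
    return list(zip(fwd, bwd))


# Arbitrary illustrative parameters: every weight 0.5, every bias 0.0.
keys = ["wxi", "whi", "wci", "bi", "wxf", "whf", "wcf", "bf",
        "wxc", "whc", "bc", "wxo", "who", "wco", "bo"]
params = {k: (0.0 if k.startswith("b") else 0.5) for k in keys}
H = bilstm([0.1, 0.4, 0.3], params, params)
```

Each element of `H` is a pair of forward and backward hidden states for one time step; a real model would use vector-valued states and separately trained weights for the two directions.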

3.4. Implementation

[0023] For improving the timeliness and scalability of real-time TFF, we employ a Spark distributed parallel computing framework to implement the SW-BiLSTM model by reducing the computational cost and memory consumption. As illustrated in Fig. 3, SW-BiLSTM is divided into many RDD partitions based on the Spark framework, and different RDD partitions are executed via the following three steps in parallel.

[0024] Step 1: Initializing the RDD data set. The RDDs read the data sets of the TRS and the adjacent road segments in different RDD partitions to generate different < key, value > pairs. Next, the normal distribution in Formula (1) is applied to calculate the weight of the interaction between the adjacent road segments and the TRS in parallel, and then the weighted data sets are obtained.

[0025] Step 2: Aggregating the intermediate results. After the data pass through Step 1, aggregation is performed at each node. The results of the RDD partitions are sorted, aggregated, and cached, and intermediate results can be stored directly in memory to be read in the next step. Then, the distributed SW-BiLSTM model is established by the optimization of BiLSTM, which is combined with the weighted data sets obtained via Formula (1) and the time window that determines the input data set of the model through Formula (2).

[0026] Step 3: Producing the forecasting results. The determined data set is input into the model with Formulas (3)-(10) for model training by the SparkDl4jMultiLayer instance using the distributed SW-BiLSTM model on Spark for real-time TFF, and then the forecasting results are output.

4. Innovation

[0027] (1) To improve the accuracy and robustness of real-time TFF, a distributed weighted bidirectional LSTM neural network model (SW-BiLSTM) on Spark is proposed by taking advantage of a normal distribution for weighing the influence of adjacent road segments. Different from the traditional weighted method, our approach considers the influence degree of the TRS. Moreover, a time window method is incorporated into TFF.

[0028] (2) To enhance the timeliness and scalability of real-time TFF, the SW-BiLSTM model is implemented on the Spark parallel computing framework to conduct parallel forecasting and distributed training of real-time traffic flow. In the data preprocessing, we employ the RDD on Spark to process traffic flow data in data cleaning. In addition, we use the KF method to eliminate abnormal GPS points, thereby achieving discrete smoothing for large-scale traffic flow data.

[0029] (3) The real-time traffic flow of Sanlihe East Road in Beijing is forecasted successfully using our SW-BiLSTM model on Spark with the real-world GPS trajectories of taxicabs. In particular, the empirical results from extensive experiments demonstrate that the MAPE value of SW-BiLSTM is lower than that of each of the compared state-of-the-art models, namely ARIMA, LR, GNB, CNN, GRU, LSTM, and WND-LSTM.

5. Experimental evaluations

[0030] In this invention, we compare our SW-BiLSTM model with several state-of-the-art models to validate the performance of traffic flow forecasting, and then report the results and give the analyses in detail.

5.1. Experimental setup

[0031] This experiment adopts a fully distributed mode to build a distributed parallel computing platform based on the Spark framework, including a cluster of 1 Master node and 3 Slave nodes, and the hardware is a Lenovo Host i7 with an Intel i7-3550 CPU and 8.0 GB of ECC DDR3 memory. All experiments are conducted on Ubuntu 18.04 OS with Hadoop 3.1.1, Spark 2.4.3, IDEA 2018.2.2, and PyCharm 2019.2.5 using Java and Python.

[0032] Moreover, we select seven cutting-edge models as baselines in this experiment, i.e., ARIMA, LR, GNB, CNN, GRU, LSTM, and WND-LSTM, besides BiLSTM.

5.2. Experimental data

[0033] A real-world GPS trajectory data set, produced by 12,000 taxis in Beijing between Nov. 05 and Nov. 17, 2012, is employed in this case study. Furthermore, the trajectory data are divided into four groups (i.e., one day: Nov. 05; five days: Nov. 05 to Nov. 09; nine days: Nov. 05 to Nov. 13; thirteen days: Nov. 05 to Nov. 17) for performance evaluation. For the one-day group, 65% of the data of the day is chosen as the training set, and the remaining 35% is the test set. In the other groups, the data of any one day are used as the test set, and the rest of the data are utilized as the training set.

[0034] In addition, the Sanlihe East Road of Beijing in China is selected as the targeted road segment (TRS), including three subsegments, i.e., Fuchengmen Outer Street - Yuetan North Street, Yuetan North Street - Yuetan South Street, Yuetan South Street - Fuxingmen Outer Street.

[0035] The traffic flow on each road segment changes randomly, and noise in the traffic flow data in particular affects the accuracy of TFF. To obtain a smooth curve for training, we adopt KF to process the noisy data in the experiment.
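As an illustration of how a Kalman filter can damp noise in a flow series, the sketch below applies a one-dimensional random-walk filter. The source does not give the state model or the noise settings used in the experiments, so the constant-state assumption, the variances `process_var` and `measure_var`, the initial error, and the sample counts are all hypothetical.

```python
def kalman_smooth(measurements, process_var=1.0, measure_var=4.0):
    """Filter a 1-D series with a random-walk Kalman filter."""
    estimate = measurements[0]   # initial state: first observation
    error = 1.0                  # initial estimate variance (assumed)
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed constant, so only uncertainty grows.
        error += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error / (error + measure_var)
        estimate += gain * (z - estimate)
        error *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed


noisy = [20, 22, 60, 21, 23, 22]   # hypothetical counts with one outlier
smooth = kalman_smooth(noisy)
```

A larger `measure_var` relative to `process_var` makes the filter trust each new observation less, which flattens spikes such as the outlier in the sample series more strongly.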

5.3. Evaluation metrics

[0036] To validate the accuracy and robustness of SW-BiLSTM, MAPE, RMSE, MAE, and ME are taken as the measures of effectiveness (MOEs), which are defined as:

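$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n} \frac{|X_t - \hat{X}_t|}{X_t} \times 100\%, \tag{11}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} (X_t - \hat{X}_t)^2}, \tag{12}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |X_t - \hat{X}_t|, \tag{13}$$

$$\mathrm{ME} = \max_{t=1,\ldots,n} |X_t - \hat{X}_t|, \tag{14}$$

where $X_t$ denotes the real value of the traffic flow on the TRS at the CTI $t$, $\hat{X}_t$ represents the forecasting value of the traffic flow on the same road segment at the same time interval, and $n$ is the total number of traffic flow values handled during the time intervals provided.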
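The four MOEs can be computed with a few lines of plain Python. The helper below is a sketch for illustration; the function name `moes` and the sample values are assumptions, and it presumes that every real value is nonzero so that the percentage error is defined.

```python
import math


def moes(actual, forecast):
    """Return MAPE (%), RMSE, MAE, and ME for two equal-length series."""
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    n = len(errors)
    # MAPE divides by the real value, so actual values must be nonzero.
    mape = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(errors) / n
    me = max(errors)
    return {"MAPE": mape, "RMSE": rmse, "MAE": mae, "ME": me}


scores = moes([100, 200, 400], [110, 190, 400])
```

For the sample series the absolute errors are 10, 10, and 0, so MAPE averages relative errors of 10%, 5%, and 0%, while ME simply reports the largest absolute error.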

5.4. Parameter tuning

[0037] Here, we describe how the parameters of SW-BiLSTM are determined automatically in this empirical study.

[0038] In this case study, experimental data are divided into several groups by batch-size, and parameters are updated by groups. A set of data in batch-size determines the direction of gradient descent and reduces the randomness and calculation amount during gradient descent. When batch-size increases, a local optimum may occur. When batch-size decreases, the introduced randomness is more significant, and it is not easy to achieve convergence. Meanwhile, as the batch-size increases, the number of required epochs ascends as well; as batch-size decreases, the number of required epochs also descends. Therefore, we need to adjust batch-size and epochs to improve the evaluation metrics for forecasting, but there are no rules about adjusting the parameters in the existing approaches. Therefore, to obtain optimal parameter combinations for our model, we select the Grid Search (GS) method to train the model repeatedly.

[0039] In this invention, the batch-size and epochs are optimized using the GS method. That is, the search ranges are batch-size = range(0, 32, 2) and epochs = range(0, 500, ), respectively.

[0040] The parameters of batch-size and epochs are optimized in this invention. Parameters of the parameter grid are initialized through the GridSearchCV function, and the GS model is returned by grid.fit(). best_score_ provides the best score observed during the optimization process, and best_params_ describes the parameter combination that has achieved the best result. From the output (Best: 0.106227 using 'batch-size': 24, 'nb-epoch': 400), we can see that the best result is reached when the parameter combination is batch-size = 24 and epochs = 400. The experimental results are illustrated in Table 1.
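The idea behind the grid search can be sketched without any machine learning library: every combination of batch-size and epochs is scored, and the best one is kept. The code below is an illustrative stand-in, not the GridSearchCV call used in the experiments. The function `toy_score` is an artificial objective constructed to peak at the combination reported in the text; it is not experimental data, and a real run would train and validate the model inside the scoring function. The candidate ranges are also assumptions.

```python
from itertools import product


def grid_search(score_fn, batch_sizes, epoch_counts):
    """Evaluate every (batch size, epochs) pair and keep the best score."""
    best_score, best_params = float("-inf"), None
    for batch_size, epochs in product(batch_sizes, epoch_counts):
        score = score_fn(batch_size, epochs)
        if score > best_score:
            best_score = score
            best_params = {"batch_size": batch_size, "epochs": epochs}
    return best_score, best_params


def toy_score(batch_size, epochs):
    """Placeholder objective; a real run would train and validate the model."""
    return -((batch_size - 24) ** 2) - ((epochs - 400) / 50) ** 2


best_score, best_params = grid_search(
    toy_score, range(2, 32, 2), range(50, 500, 50)
)
```

Exhaustive search of this kind is simple and reproducible, but its cost grows with the product of the candidate list lengths, which is one reason to keep the grid coarse.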

Table 1: The parameter combinations of batch-size (rows) and epochs (columns) (%).

           0       50      100      150      200      250      300      350      400      450
 2    0.0733   0.3663   0.2198   0.0733   0.0733   0.2198   0.0733   0.0733   0.0733   0.2930
 4    0.1465   0.0000   0.0733   0.1465   0.2198   0.1465   0.1465   0.1465   0.1465   0.0733
 6    0.5128   0.0733   0.5128   0.5128   0.4396   0.5128   0.2930   0.0733   0.5128   0.5128
 8    0.0733   0.0733   0.0733   0.0733   0.0733   0.0733   0.0733   0.2198   0.0733   0.0733
10    0.2930   0.1465   0.2930   0.2930   0.2930   0.2930   0.2930   0.2930   0.2930   0.2930
12    0.0733   0.0733   0.0000   0.1465   0.0733   0.0000   0.0733   0.0000   0.0733   0.0000
14    0.0733   0.0733   0.0733   0.0733   0.0000   0.0733   0.0733   0.0733   0.0733   0.0733
16    0.0733   0.0733   0.0733   0.0733   0.0733   0.0733   0.0733   0.0733   0.0733   0.0733
18    0.0733   6.0073   0.0733   0.0733   0.0733   0.0733   1.1722   0.0733   0.0733   0.0733
20    0.6593  10.5495   1.0989   0.0733   1.0256   6.3004   5.9341   0.0733   1.1722   0.0733
22   10.4762   6.7399   6.1538   1.1722   6.8864   1.1722   6.7399   1.7582   0.0733  10.1832
24   10.3297   9.6703   1.0256  10.2564   9.9634   9.8168   9.9634  10.5495  10.6227   9.5238
26    9.9634  10.6227  10.6227  10.5495  10.0366  10.6227  10.6227  10.6227  10.6227  10.6227
28   10.5495  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227
30   10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227  10.6227

5.5. Experimental results

[0041] In this empirical study, we train a model composed of four parts on a limited data set. The first is an input layer with eight dimensions, the second is two LSTM hidden layers with sixteen dimensions, the third is a dropout layer with sixteen dimensions in which the dropout rate is 0.2, and the fourth is the output layer with one dimension. The model is trained with batch-size = 24 and epochs = 400.

[0042] To validate the performance of the SW-BiLSTM model, we compare it with the traditional BiLSTM model. We use KF to smooth the experimental data set after data preprocessing in the SW-BiLSTM and BiLSTM models, and then plot the results in Fig. 4. Moreover, we compare the MOE values (i.e., MAPE, MAE, RMSE, and ME) of SW-BiLSTM and BiLSTM. Next, the real value and the forecasting values produced through BiLSTM and SW-BiLSTM are shown in Figs. 5 and 6, respectively.

[0043] The MOEs of SW-BiLSTM are better than those of the traditional BiLSTM model on all four data sets. The MAPE value of SW-BiLSTM is 29.73% lower than that of BiLSTM on average. The proposed approach in SW-BiLSTM, which combines the weighted normal distribution and the time window, thus produces better forecasting performance and achieves higher accuracy than BiLSTM. As shown in Figs. 5 and 6, SW-BiLSTM fits the real traffic flow better than BiLSTM.

[0044] Furthermore, to further validate the performance of SW-BiLSTM, we compare it with ARIMA, LR, GNB, CNN, GRU, LSTM, and WND-LSTM. The MOE values of SW-BiLSTM on the 13-day data set are compared with those of the other models mentioned above.

[0045] Based on the results, we can conclude that the MAPE of SW-BiLSTM is much lower than that of other models in most cases. More specifically, the MAPE of SW-BiLSTM is 65.62%, 69.10%, 87.30%, 3.52%, 17.78%, 42.86%, and 1.23% lower than that of ARIMA, LR, GNB, CNN, GRU, LSTM, and WND-LSTM, respectively, and the accuracy improvement reaches 41.06% on average. Therefore, our SW-BiLSTM model can provide more accurate predictions than other cutting-edge models. The RMSE, MAE, and ME values of SW-BiLSTM are lower than those of the other models because the forecasting peak value is close to the real peak value, and there is time deviation between them. Moreover, the results demonstrate that the MAPE value of SW-BiLSTM is lower than that of WND-LSTM in most cases, which means that SW-BiLSTM obtains better forecasting performance owing to the use of the normal distribution and time window based on BiLSTM. The MAPE value of SW-BiLSTM is 1.23% lower than that of WND-LSTM, which indicates that BiLSTM specializes in mining traffic flow information from the forward and reverse directions. Based on the aforementioned analysis, SW-BiLSTM can provide more accurate forecasting than ARIMA, LR, GNB, CNN, GRU, LSTM, and WND-LSTM.
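The average improvement follows directly from the seven per-model reductions listed above, as the following arithmetic check shows:

```latex
\frac{65.62 + 69.10 + 87.30 + 3.52 + 17.78 + 42.86 + 1.23}{7}
  = \frac{287.41}{7} \approx 41.06\%
```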

[0046] From the experimental results mentioned above, it can be found that SW-BiLSTM improves the forecasting accuracy significantly, which is superior to other comparable models to address the TFF problem in real time.

6. Brief Description of The Drawings

[0047] Figure 1 is an overview of SW-BiLSTM.

[0048] Figure 2 is the process of data preprocessing on Spark.

[0049] Figure 3 is the implementation of SW-BiLSTM based on the Spark framework.

[0050] Figure 4 is the experimental data processed by KF for smoothing. (a) Before smoothing and (b) after smoothing with the data set on thirteen days.

[0051] Figure 5 is the forecasting results on the same data set with different models. (a) BiLSTM and (b) SW-BiLSTM.

[0052] Figure 6 is the forecasting results of BiLSTM and SW-BiLSTM with different data sets. (a) one day, (b) five days, (c) nine days, and (d) thirteen days.

[0053] Figure 7 is the MOEs of ARIMA, CNN, GNB, GRU, LR, LSTM, WND-LSTM, and SW-BiLSTM with different data sets. (a) one day, (b) five days, (c) nine days, and (d) thirteen days.

[0054] Figure 8 is the forecasting results on the same data set with different models. (a) ARIMA, (b) LR, (c) GNB, (d) CNN, (e) GRU, (f) LSTM, (g) WND-LSTM, and (h) SW-BiLSTM.

[0055] Figure 9 is the forecasting results produced by ARIMA, LR, GNB, CNN, GRU, LSTM, WND-LSTM, and SW-BiLSTM with different data sets. (a) one day, (b) five days, (c) nine days, and (d) thirteen days.

Claims

The claims defining the invention are as follows:

1. SW-BiLSTM model

In this invention, we put forward a distributed SW-BiLSTM model on Spark to improve the performance of traffic flow forecasting (TFF) in terms of accuracy, timeliness, and scalability, thereby addressing the real-time application problem in ITS.

1.1. Overview

In the SW-BiLSTM model, for improving the accuracy and timeliness of TFF with mobile trajectory big data (e.g., large-scale GPS trajectories of taxicabs), we employ the normal distribution and time window to optimize the traditional BiLSTM model and then implement the optimized BiLSTM model on Spark for the parallel forecasting and distributed computing of real-time traffic flow. As shown in Fig. 1, our method for real-time TFF based on SW-BiLSTM mainly consists of three steps. First, the transformations and actions of RDD are utilized for data preprocessing, which includes creating RDD for data reading, converting RDD for data computing, and starting RDD for data storage. Furthermore, a distributed SW-BiLSTM model is proposed to address the TFF problem in real time. Specifically, in SW-BiLSTM, the normal distribution is adopted to weigh the influence of adjacent road segments on the traffic flow on the targeted road segment (TRS), and a time window of size eight is used for TFF. Finally, the SW-BiLSTM model is implemented on the Spark parallel distributed computing platform by different RDD partitions.

1.2. Preprocessing

We preprocess traffic flow data to reduce the impact on the accuracy of TFF of traffic anomalies, uncertain traffic conditions, and signal transmission. The original raw data are processed based on the Spark framework, namely by performing data preprocessing with RDD (the core of Spark). RDD is a fault-tolerant, parallel data structure enabling users to store intermediate results in memory. It also controls the partitioning of the data set to achieve optimal data storage and processing and handles the data via a rich set of operators. Therefore, two types of operators of RDD (i.e., transformations and actions) are used for data preprocessing. Specifically, data are loaded into memory by transformations, then RDD is transformed, and finally data are stored through actions. As illustrated in Fig. 2, the data preprocessing based on the Spark parallel distributed computing platform mainly includes three steps: RDD creation, RDD conversion, and RDD starting.

* Step 1: RDD creation.

Mobile trajectory data (e.g., GPS trajectory data) are extracted and uploaded to HDFS, which serves as the external storage from which RDDs are created. The data stored in HDFS are read through the textFile method of the SparkContext object and then loaded into the memory of the cluster. In this way, many RDDs can be created.

* Step 2: RDD conversion.

Data processing and conversion on RDD are conducted by calling map, filter, flatMap, sortByKey, distinct, reduceByKey, and other operators of Spark, which mainly consists of three tasks.

Task 1: The vehicle information on the TRS at the current time interval (CTI) is extracted. We first convert the data on each node into < key1, value > pairs using the flatMap operator, and then employ the map operator to set key1 = time and vehicle ID and value = the number of the TRS. Moreover, the filter operator is utilized to filter out the GPS trajectory data that do not belong to the selected TRS in the RDD. Finally, the sortByKey operator is used to sort the RDD < key1, value > by key1, and the distinct operator is employed to remove duplicates of the same vehicle at the CTI in the RDD to obtain the vehicle information on the TRS.

Task 2: The data information extracted in Step 1 is read. We first convert the data on each node into < key2, value > pairs via the flatMap operator, and then the map operator is adopted to set key2 = time and area number and to add one to value. Finally, the reduceByKey operator is utilized to perform the reduce operation according to key2. The number of vehicles at the CTI is counted, and the total number of vehicles on the selected TRS in each time interval (i.e., the traffic flow) is obtained.

Task 3: The total number of vehicles on the TRS at the CTI t is incorporated into a one-dimensional array Xt, and these arrays compose a matrix X. We first convert the data distributed on each node into < key3, value > pairs by the flatMap operator, and then the map operator is employed to set key3 = time interval and value = the total number of vehicles on each TRS. Finally, the pairs are sorted by key3 through the sortByKey operator, and the results are output.

* Step 3: RDD starting.

The transformation operators pass the data flow through the different RDD partitions. An action operator is then started, and the data are stored in HDFS by calling saveAsTextFile, which achieves distributed storage of the data.
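The key/value logic of Tasks 1 to 3 can be sketched on a single machine in plain Python. This is only an illustration of the operator sequence, not Spark code: the record layout `(time_interval, vehicle_id, segment_id)`, the sample records, and all function names are assumptions made for this example, and each comment names the RDD operator that the line stands in for.

```python
from collections import Counter

# Hypothetical record layout: (time_interval, vehicle_id, segment_id).
records = [
    (0, "taxi_a", "S1"),
    (0, "taxi_a", "S1"),  # second GPS fix of the same vehicle in interval 0
    (0, "taxi_b", "S1"),
    (0, "taxi_c", "S2"),  # not on the target road segment
    (1, "taxi_a", "S1"),
    (1, "taxi_d", "S1"),
    (1, "taxi_e", "S1"),
]
TARGET_SEGMENT = "S1"  # stands in for the selected TRS


def vehicles_on_segment(records, segment):
    """Task 1: keep the target segment, key by (time, vehicle), de-duplicate, sort."""
    on_segment = (r for r in records if r[2] == segment)     # filter
    pairs = {((t, vid), seg) for t, vid, seg in on_segment}  # map + distinct
    return sorted(pairs)                                     # sortByKey


def flow_per_interval(pairs):
    """Task 2: count one per vehicle record and sum the counts per (time, segment)."""
    counts = Counter()
    for (t, _vid), seg in pairs:
        counts[(t, seg)] += 1                                # map to 1 + reduceByKey
    return counts


def flow_series(counts):
    """Task 3: order the per-interval totals by time interval."""
    return [n for (_t, _seg), n in sorted(counts.items())]   # sortByKey


pairs = vehicles_on_segment(records, TARGET_SEGMENT)
counts = flow_per_interval(pairs)
series = flow_series(counts)
print(series)
```

On a cluster, the same sequence would be expressed with the corresponding RDD operators, so that each partition is filtered and counted in parallel before the per-key totals are merged.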

3. Model

LSTM can address the long-term dependency learning problem, and thus has great potential to forecast traffic flow. The LSTM unit maintains a separate memory cell, which is the crucial element of the LSTM neural network structure. However, LSTM can only utilize past information rather than future information. BiLSTM, in contrast, can exploit information from both the past and the future, and consists of two stacked unidirectional LSTMs. In this invention, we employ the past traffic flow with the forward state and the future traffic flow with the backward state to forecast real-time traffic flow. Specifically, we first input the traffic flow into the forward LSTM network layer to get an output vector, and then input the traffic flow in the opposite direction into the backward LSTM network layer to get another output vector. Finally, the final output is produced by combining the two output vectors.

In this invention, it is necessary to measure the weight of the influence of the adjacent road segments on the TRS, because the traffic flow on the TRS is affected by the traffic flow on the adjacent road segments and does not exist independently. The parameters μ and σ of the normal distribution determine the position and the shape of its curve. Thus, when the parameter σ is fixed, the closer a sample point is to μ, the higher its weight is. That is why the normal distribution is used to calculate the weight of the targeted road segment and the adjacent road segments. Moreover, to enhance the forecasting accuracy of TFF, this invention takes the variation of traffic flow between the adjacent road segment and the TRS into consideration, as shown in Formula (1).

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)(d_i - d_m) \tag{1}$$

where μ of the normal distribution represents the traffic flow on the TRS, σ can be set to 0.6, $x$ denotes the discrete value of the traffic flow on each road segment, $d_i$ is the traffic flow on each road segment, and $d_m$ represents the traffic flow on each TRS. With the traffic flow $V_t^i$ on the TRS $i$ at the CTI $t$, we extract the past (historical) traffic flows $V_{t-7}^i, V_{t-6}^i, V_{t-5}^i, V_{t-4}^i, V_{t-3}^i, V_{t-2}^i, V_{t-1}^i$. Then, a time window with a size of 8 is employed for TFF, as shown in Formula (2).

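$$\mathbf{V} = \begin{bmatrix} \mathbf{V}_1 \\ \mathbf{V}_2 \\ \vdots \\ \mathbf{V}_{n-8} \end{bmatrix} = \begin{bmatrix} V_1 & V_2 & V_3 & V_4 & V_5 & V_6 & V_7 & V_8 \\ V_2 & V_3 & V_4 & V_5 & V_6 & V_7 & V_8 & V_9 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ V_{n-8} & V_{n-7} & V_{n-6} & V_{n-5} & V_{n-4} & V_{n-3} & V_{n-2} & V_{n-1} \end{bmatrix} \tag{2}$$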
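The two ingredients above, the normal-distribution weight and the size-8 time window, can be illustrated with a short plain-Python sketch. Several points are assumptions made for this example only: the function names are hypothetical, the flows are assumed to be scaled so that σ = 0.6 is a meaningful spread, the weight function evaluates only the normal-density factor of Formula (1) and leaves out the $(d_i - d_m)$ term, and each window is paired with the following value as its forecasting target.

```python
import math

SIGMA = 0.6  # value suggested in the text


def normal_weight(x, mu, sigma=SIGMA):
    """Normal density at x; it is largest when x equals mu."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def make_windows(flows, size=8):
    """Build the rows of Formula (2) and the value that follows each row.

    For n flows this gives n - size rows; the last row ends at the
    second-to-last flow, so every row has a next-interval target.
    """
    windows = [flows[k:k + size] for k in range(len(flows) - size)]
    targets = [flows[k + size] for k in range(len(flows) - size)]
    return windows, targets


# Hypothetical scaled flows: the TRS (mu) and three neighbouring segments.
mu = 0.5
for x in (0.5, 0.8, 1.2):
    print(x, round(normal_weight(x, mu), 4))

# Twelve intervals V_1..V_12 give 12 - 8 = 4 windows.
flows = list(range(1, 13))
windows, targets = make_windows(flows)
print(windows[0], "->", targets[0])
print(windows[-1], "->", targets[-1])
```

With σ fixed, a neighbouring segment whose flow is closer to μ receives a larger weight, which matches the reasoning given for choosing the normal distribution.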

In this invention, the weighted traffic flow over the time series composed of the time intervals t - 7, t - 6, t - 5, t - 4, t - 3, t - 2, t - 1, and t is input into the BiLSTM model to produce the forecasting result at the next time interval t + 1. Therefore, we train the BiLSTM model through the following formulas:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \tag{3}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \tag{4}$$

where W is a weight matrix and b is a bias vector. The cell state in the hidden layer of the LSTM is obtained by the following formulas:

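$$c_t = f_t c_{t-1} + i_t\, g(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \tag{5}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \tag{6}$$

$$h_t = o_t\, h(c_t), \tag{7}$$

$$\overrightarrow{h}_t = \mathrm{LSTM}(h_{t-1}, x_t, c_{t-1}), \tag{8}$$

$$\overleftarrow{h}_t = \mathrm{LSTM}(h_{t+1}, x_t, c_{t+1}), \tag{9}$$

$$H_t = [\overrightarrow{h}_t, \overleftarrow{h}_t], \tag{10}$$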
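Formulas (3) to (10) can be traced with a minimal plain-Python sketch that uses scalar inputs and scalar states. It is an illustration under stated assumptions rather than the model of this invention: σ is taken to be the logistic sigmoid, both g and h are taken to be tanh, the weights are arbitrary untrained constants, and the names `lstm_step`, `run_lstm`, `bilstm`, and `constant_params` are hypothetical.

```python
import math

PARAM_NAMES = [
    "W_xi", "W_hi", "W_ci", "b_i",
    "W_xf", "W_hf", "W_cf", "b_f",
    "W_xc", "W_hc", "b_c",
    "W_xo", "W_ho", "W_co", "b_o",
]


def constant_params(value):
    """Placeholder parameters: every weight and bias set to the same constant."""
    return {name: value for name in PARAM_NAMES}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with scalar input and state, following Formulas (3)-(7)."""
    i = sigmoid(p["W_xi"] * x + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])  # (3)
    f = sigmoid(p["W_xf"] * x + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])  # (4)
    c = f * c_prev + i * math.tanh(p["W_xc"] * x + p["W_hc"] * h_prev + p["b_c"])    # (5)
    o = sigmoid(p["W_xo"] * x + p["W_ho"] * h_prev + p["W_co"] * c + p["b_o"])       # (6)
    h = o * math.tanh(c)                                                             # (7)
    return h, c


def run_lstm(xs, p):
    """Run the cell over a sequence from zero initial state; return all h values."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(xs, p_forward, p_backward):
    """Pair forward and backward hidden states per time step, Formulas (8)-(10)."""
    forward = run_lstm(xs, p_forward)                # (8): past to future
    backward = run_lstm(xs[::-1], p_backward)[::-1]  # (9): future to past, re-aligned
    return list(zip(forward, backward))              # (10): H_t = [forward, backward]


# Hypothetical weighted and scaled flows for one size-8 time window.
window = [0.2, 0.3, 0.5, 0.6, 0.4, 0.3, 0.2, 0.1]
H = bilstm(window, constant_params(0.1), constant_params(-0.1))
print(H[-1])
```

In a trained model, a final output layer would map $H_t$ to the forecast for the interval t + 1; that layer and the training procedure are omitted here.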

4. Implementation

To improve the timeliness and scalability of real-time TFF, we employ a Spark distributed parallel computing framework to implement the SW-BiLSTM model, which reduces the computational cost and memory consumption. As illustrated in Fig. 3, SW-BiLSTM is divided into many RDD partitions based on the Spark framework, and the different RDD partitions are executed in parallel via the following three steps.

Step 1: Initializing the RDD data set. The RDDs read the data sets of the TRS and the adjacent road segments in different RDD partitions to generate different < key, value > pairs. Next, the normal distribution of Formula (1) is applied in parallel to calculate the weight of the interaction between the adjacent road segments and the TRS, and the weighted data sets are obtained.

Step 2: Aggregating the intermediate results. After the data have gone through Step 1, aggregation is performed at each node. The results of the RDD partitions are sorted, aggregated, and cached, and the intermediate results can be stored directly in memory so that they can be read in the next step. Then, the distributed SW-BiLSTM model is established by optimizing BiLSTM with the data sets weighted via Formula (1) and with the time window of Formula (2), which determines the data set that is input to the model.

Step 3: Producing the forecasting results. The determined data set is input into the model defined by Formulas (3)-(10), and the model is trained through a SparkDl4jMultiLayer instance, which runs the distributed SW-BiLSTM model on Spark for real-time TFF. The forecasting results are then output.