CN108399748B

CN108399748B - Road travel time prediction method based on random forest and clustering algorithm

Info

Publication number: CN108399748B
Application number: CN201810190151.6A
Authority: CN
Inventors: 宋万超; 周应华; 程爱华
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2018-03-08
Filing date: 2018-03-08
Publication date: 2020-12-22
Anticipated expiration: 2038-03-08
Also published as: CN108399748A

Abstract

The invention discloses a road travel time prediction method based on random forests and a clustering algorithm. Following the time-series patterns of historical traffic data, and combining road attributes, weather factors, holiday information, and the state of upstream and downstream traffic flows, the method uses a hybrid prediction model of a density clustering algorithm (DBSCAN) and Random Forests (RF) to accurately predict the travel time of each key road section in a given time period. The prediction result can be used to anticipate how the traffic state will develop and to prepare control schemes in advance for roads that may become congested; it can also be used for dynamic route guidance, planning an optimal trip for travelers and supporting intelligent travel. The method uses density clustering to improve the prediction precision of each tree in the random forest, thereby improving overall prediction accuracy.

Description

Road travel time prediction method based on random forest and clustering algorithm

Technical Field

The invention belongs to the field of road travel time prediction, and particularly relates to a DBSCAN-RF hybrid prediction model for optimizing a random forest prediction result by using a density clustering algorithm.

Background

Road travel time is one of the important indicators of traffic state. It serves as a basis for road congestion management and road network optimization, and is an important topic in intelligent transportation research. Accurate travel time prediction underpins modern traffic guidance systems and advanced traveler information systems; it can provide decision support for traffic management departments and plan optimal travel routes for travelers. In the past, time series prediction algorithms were commonly used for this task. Such algorithms are sensitive to missing data and consider only the time-series patterns of historical traffic data, ignoring other relevant factors. For a prediction scenario this complex, considering the time-series patterns alone is far from sufficient.

The method adopts a tree-based ensemble prediction algorithm, the random forest, and, in view of the shortcomings of the random forest algorithm, provides a hybrid prediction model that combines a density clustering algorithm with random forests to improve prediction accuracy.

Disclosure of Invention

The present invention is directed to solving the above problems of the prior art. A road travel time prediction method based on random forests and a clustering algorithm is provided, which gives the algorithm good convergence and improves prediction accuracy. The technical scheme of the invention is as follows:

A road travel time prediction method based on random forests and a clustering algorithm comprises the following steps:

1) extracting, according to the spatio-temporal characteristics of the road network, a set V of relevant features that influence the prediction result from historical traffic data, weather data, the road network structure, and the attributes of the road sections themselves, and constructing a sample data set D, where $V = \{v_1, v_2, v_3, \ldots, v_n\}$; the data structure of the sample data set D is $\{v_1, v_2, v_3, \ldots, v_n, y\}$, where $v_i$ is a feature, n is the feature dimension, and y is the prediction target;

2) performing data cleaning and missing-value imputation on the sample data set D from step 1);

3) measuring the importance of each feature in the feature set V with respect to the prediction target y, using the feature selection method commonly used with the random forest algorithm, to obtain the degree of correlation V' between each feature in V and the prediction target, where $V' = \{v_1', v_2', v_3', \ldots, v_n'\}$; the larger the value of $v_i'$, the more relevant feature $v_i$ is to the prediction target and the more important the feature, and the more important features are preferentially selected for constructing the random forest;

4) constructing a density clustering and random forest mixed prediction model by using the sample data set D;

5) inputting sample data X related to the predicted road section into the density clustering and random forest mixed prediction model constructed in the step 4) to obtain a final prediction result.

Further, the step 4) of constructing the density clustering and random forest mixed prediction model specifically comprises the steps of:

First, initializing the number m of regression trees T to be generated in the random forest and the depth d of the trees, where $T = \{t_1, t_2, t_3, \ldots, t_m\}$, $t_i$ denotes the i-th regression tree, and m is the number of regression trees generated;

Second, using the Bootstrap method, resampling with replacement from the sample data set D obtained in step 2) to generate m sample subsets $\{D_1, D_2, \ldots, D_m\}$, one for each regression tree to be built;

Third, creating an unpruned regression tree model for each of the m sample subsets, using the minimum residual variance as the basis for selecting the splitting attribute and splitting point; once a regression tree has been created, each of its leaf nodes holds part of the sample data;

Fourth, clustering the sample data in the leaf nodes of each regression tree from the third step with a density clustering algorithm, so that the sample data in a leaf node forms several arbitrarily shaped clusters $C = \{c_1, c_2, c_3, \ldots, c_k\}$, where k is the number of clusters, and calculating the cluster center of each cluster in C, $D_c = \{d_1, d_2, d_3, \ldots, d_k\}$, where $d_i$ is the cluster center of cluster $c_i$.
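The Bootstrap resampling in the second step can be sketched as follows. This is a minimal illustration only: the function name `bootstrap_subsets`, the toy rows, and the choice to make every subset the same size as D are assumptions, since the text does not state the subset size.

```python
import random


def bootstrap_subsets(dataset, m, seed=None):
    """Draw m bootstrap subsets by sampling rows with replacement.

    Assumption: each subset has as many rows as `dataset`.
    """
    rng = random.Random(seed)
    n = len(dataset)
    return [[dataset[rng.randrange(n)] for _ in range(n)] for _ in range(m)]


# Toy sample data set D: each row is (v1, v2, y).
D = [(1.0, 0.2, 10.0), (2.0, 0.1, 8.2), (3.0, 0.4, 7.8), (4.0, 0.3, 9.1)]
subsets = bootstrap_subsets(D, m=3, seed=0)  # one subset per regression tree
```

Because sampling is with replacement, a subset will usually contain some rows more than once and omit others, which is what makes the m trees differ from one another.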

Further, the step 5) of obtaining a final prediction result specifically includes the steps of:

First, passing the sample X down each regression tree in the random forest until it reaches a leaf node of that tree;

Second, calculating the distance between the sample X and the center of each cluster in the leaf node with the Euclidean distance formula, and returning the cluster closest to the prediction sample;

Third, calculating the mean of the target variable values of the samples in the most similar cluster returned in the second step, and using it as the prediction result of the regression tree;

Fourth, obtaining the prediction result of each regression tree and taking their average as the final prediction result of the random forest:

$$y_{rf} = \frac{1}{m} \sum_{i=1}^{m} y_{t_i}$$

where $y_{rf}$ is the final predicted value of the random forest and $y_{t_i}$ is the prediction result of the $t_i$-th regression tree.
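The four prediction steps can be sketched in a few lines, assuming the sample has already been routed to one leaf per tree. The data layout (each leaf as a list of dictionaries with hypothetical `center` and `targets` keys) and the function names are illustrative assumptions, and plain Euclidean distance is used here.

```python
import math


def leaf_prediction(x, clusters):
    """Mean target value of the cluster whose center is nearest to x."""
    nearest = min(clusters, key=lambda c: math.dist(x, c["center"]))
    return sum(nearest["targets"]) / len(nearest["targets"])


def forest_prediction(x, leaves):
    """Average of the per-tree predictions.

    `leaves` holds, for each tree, the clusters of the leaf node x fell into.
    """
    per_tree = [leaf_prediction(x, clusters) for clusters in leaves]
    return sum(per_tree) / len(per_tree)


# Two trees; in each, the leaf reached by x contains two clusters.
leaves = [
    [{"center": (0.0, 0.0), "targets": [10.0, 12.0]},
     {"center": (5.0, 5.0), "targets": [20.0]}],
    [{"center": (1.0, 1.0), "targets": [9.0]},
     {"center": (4.0, 4.0), "targets": [30.0, 34.0]}],
]
y_rf = forest_prediction((0.5, 0.5), leaves)  # (11.0 + 9.0) / 2
```

A plain random forest would average all targets in the leaf; here only the targets of the nearest cluster contribute to each tree's prediction.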

In the second step, taking the weights of the sample attributes into account, the distance between the sample X and each cluster center is calculated as a weighted Euclidean distance; a cluster with a smaller distance is considered more similar to the prediction sample.

Weighted Euclidean distance formula:

$$\mathrm{dist}(X, d) = \sqrt{\sum_{i=1}^{n} V'_i \,(X_i - d_i)^2}$$

where X is the prediction sample, d is the cluster center of a cluster (with $X_i$ and $d_i$ their i-th components), n is the sample dimension, and $V'_i$ is the feature correlation degree.
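A small sketch shows how the feature weights can change which cluster counts as nearest. It assumes the usual form of the weighted Euclidean distance, in which each squared coordinate difference is multiplied by the feature's correlation degree before summing; the function names and the numbers are illustrative only.

```python
import math


def weighted_euclidean(x, center, weights):
    """Weighted Euclidean distance (assumed form: weight times squared difference)."""
    return math.sqrt(sum(w * (xi - ci) ** 2 for xi, ci, w in zip(x, center, weights)))


def nearest_cluster(x, centers, weights):
    """Index of the cluster center closest to x under the weighted distance."""
    return min(range(len(centers)),
               key=lambda i: weighted_euclidean(x, centers[i], weights))


x = (1.0, 1.0)
centers = [(1.0, 4.0), (3.0, 1.0)]

# Equal weights: the second center is closer (distance 2 versus 3).
unweighted_choice = nearest_cluster(x, centers, (1.0, 1.0))
# The first feature matters far more: differences in it dominate the distance,
# so the first center (identical in feature 1) becomes the nearest.
weighted_choice = nearest_cluster(x, centers, (0.9, 0.1))
```

With equal weights this reduces to the ordinary Euclidean distance, so the weighting only alters the outcome when the features differ in importance.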

Further, constructing an unpruned regression tree model for each of the m sample subsets, using the minimum residual variance as the basis for selecting the splitting attribute and splitting point, specifically comprises the following steps. First, traverse each feature in the sample subset and sort the data under each feature in ascending order. Second, take each sorted feature value in turn as a candidate splitting node: data smaller than the splitting node enter the left branch and data greater than or equal to it enter the right branch, dividing the original sample subset into two data sets. Then, use the minimum residual variance formula to calculate the value S before the division, the value $S_L$ of the left branch and the value $S_R$ of the right branch after the division, and compute $S - (S_L + S_R)$, where S denotes the residual variance value; the feature value that maximizes this difference is selected as the splitting node, and the feature to which it belongs becomes the current splitting feature. The above division is performed recursively, checking the termination condition at each step: if the depth of the tree is greater than the initial value d, the tree stops growing; otherwise, the regression tree continues to grow.

Further, the minimum residual variance formula is:

$$S = \sum_{i=1}^{N} (y_i - \bar{y})^2$$

where N is the number of samples in each sample subset, y is the prediction target, and $\bar{y}$ is the mean of the prediction targets.
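The split search described above can be sketched as an exhaustive scan over features and candidate values. In this sketch S is taken to be the sum of squared deviations of the targets from their mean, which is an assumption about the exact form of the residual variance; the names `residual_sum` and `best_split` and the toy rows are illustrative.

```python
def residual_sum(ys):
    """Sum of squared deviations from the mean (assumed form of S); 0 if empty."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows):
    """Find the split that maximizes S - (S_L + S_R).

    rows: list of (features, y). Returns (feature_index, threshold, gain),
    or None when no value separates the rows into two non-empty sets.
    """
    s_total = residual_sum([y for _, y in rows])
    best = None
    for j in range(len(rows[0][0])):
        # Candidate split points: the sorted distinct values of feature j.
        for threshold in sorted({f[j] for f, _ in rows}):
            left = [y for f, y in rows if f[j] < threshold]    # smaller: left branch
            right = [y for f, y in rows if f[j] >= threshold]  # greater or equal: right
            if not left or not right:
                continue
            gain = s_total - (residual_sum(left) + residual_sum(right))
            if best is None or gain > best[2]:
                best = (j, threshold, gain)
    return best


rows = [((1.0, 5.0), 10.0), ((2.0, 3.0), 10.5),
        ((3.0, 4.0), 20.0), ((4.0, 6.0), 20.5)]
split = best_split(rows)  # feature 0 at threshold 3.0 separates the two target groups
```

Applying `best_split` recursively to the left and right row sets, and stopping once the depth limit d is exceeded, would grow the unpruned regression tree.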

The invention has the following advantages and beneficial effects:

The invention adopts a tree-based ensemble prediction algorithm, the random forest, and, in view of the shortcomings of the random forest algorithm, uses a hybrid prediction model that combines a density clustering algorithm with the random forest, so that the algorithm converges well and prediction accuracy is improved. The prediction model can accurately predict the travel time of each key road section in a given time period. The result can be used to anticipate how the traffic state will develop and to prepare control schemes in advance for roads that may become congested; it can also be used for dynamic route guidance, planning the best trip for travelers and supporting intelligent travel.

Drawings

FIG. 1 is a flow chart of the operation of a preferred embodiment of the present invention;

FIG. 2 shows the feature importance measured for some of the relevant factors;

FIG. 3 is a diagram of a model architecture combining density clustering and regression trees;

FIG. 4 is a travel time trend graph predicted for a particular road by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.

The technical scheme for solving the technical problems is as follows:

the specific steps of the embodiment provided by the invention comprise:

1. Obtain a historical traffic data set.

The data mainly comprise a travel time data set, a road section attribute data set, a road network topology data set, a weather data set, and the like. The specific information is as follows:

2. Extract the key features that influence the prediction result.

2.1 The present invention addresses time series prediction. When predicting the travel time for a future time period, the data from the few minutes before that period have a large influence on the result, so time-related features need to be constructed. Taking two minutes as one time period, if the period to be predicted is 2017/06/01 08:04-08:06, then time is the travel time of the current period, time1 is the travel time of 08:02-08:04, time2 is the travel time of 08:00-08:02, and so on, as shown in the following table. The feature vectors constructed here are time1, time2, …, time8.

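| Time period (2017/06/01) | time | time1 | time2 | time3 | ··· | time7 | time8 |
|---|---|---|---|---|---|---|---|
| 08:00-08:02 | 10 | null | null | null | ··· | null | null |
| 08:02-08:04 | 8.2 | 10 | null | null | ··· | null | null |
| 08:04-08:06 | 7.8 | 8.2 | 10 | null | ··· | null | null |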
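The lag features time1 to time8 can be derived from a sequence of per-period travel times as in the sketch below. The function name `build_lag_features` and the use of `None` for the table's null entries are assumptions made for illustration; the three input values are those shown in the table.

```python
def build_lag_features(travel_times, n_lags=8):
    """Build one row per time period: [time, time1, ..., time<n_lags>].

    time_k is the travel time k periods earlier, or None when the series
    does not reach that far back.
    """
    rows = []
    for t, current in enumerate(travel_times):
        lags = [travel_times[t - k] if t - k >= 0 else None
                for k in range(1, n_lags + 1)]
        rows.append([current] + lags)
    return rows


# Travel times for 08:00-08:02, 08:02-08:04 and 08:04-08:06.
series = [10, 8.2, 7.8]
rows = build_lag_features(series)
# rows[2] starts with 7.8 (time), 8.2 (time1), 10 (time2); the rest are None.
```

Rows with missing lags at the start of the series would then need to be handled by the data cleaning and missing-value step before training.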

2.2 Combining the road network topology, extract data on the upstream and downstream sections of the road section as features.

Calculate the degree of correlation between the feature vectors and the target vector. The feature selection method commonly used with the random forest algorithm is applied to the features extracted in step 2.1 to measure their importance, which serves as the measure of correlation between each feature vector and the target vector. The results are shown in FIG. 2.

3. Establish the prediction model.

The prediction model improves the random forest with a density clustering algorithm (DBSCAN). The structure combining a single regression tree with the DBSCAN model is shown in FIG. 3, and the specific method is as follows:

1) Create a regression tree from the training sample set. Training samples enter the left or right branch according to the splitting attribute and splitting point, and when the regression tree stops growing, each of its leaf nodes has accumulated part of the sample data. Taken together, the sample data in all leaf nodes equal the original training samples used to create the regression tree. When a leaf node accumulates a large amount of data, it cannot be ruled out that abnormal samples are present among them.

2) Cluster the sample data accumulated in the leaf nodes of the regression tree from step 1) with the density clustering algorithm (DBSCAN), so that the sample data in a leaf node forms several arbitrarily shaped clusters $C = \{c_1, c_2, c_3, \ldots, c_k\}$, where k is the number of clusters. Calculate the cluster center of each cluster in C, $D_c = \{d_1, d_2, d_3, \ldots, d_k\}$, where $d_i$ is the cluster center of cluster $c_i$.
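The clustering of one leaf node's samples can be illustrated with a compact, unoptimized DBSCAN. Several points here are assumptions rather than details given in the text: the neighborhood radius `eps` and the minimum neighbor count `min_pts` are arbitrary toy values, clustering is done on feature vectors only, and a cluster center is taken to be the mean of the cluster's points.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def cluster_centers(points, labels):
    """Mean feature vector of each cluster, keyed by label (noise is skipped)."""
    members = {}
    for p, label in zip(points, labels):
        if label != -1:
            members.setdefault(label, []).append(p)
    return {label: tuple(sum(col) / len(ps) for col in zip(*ps))
            for label, ps in members.items()}


# Feature vectors of the samples held by one leaf node (toy values).
leaf_samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
                (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (50.0, 50.0)]
labels = dbscan(leaf_samples, eps=1.5, min_pts=3)
centers = cluster_centers(leaf_samples, labels)
```

In this toy leaf the isolated point (50.0, 50.0) is labeled as noise and therefore contributes to no cluster center, which is how density clustering can keep abnormal samples out of a tree's prediction.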

3) When the regression tree created in step 1) is used for prediction, the prediction sample X starts from the root node of the regression tree, passes through the splitting attributes and splitting nodes, and finally falls into the corresponding leaf node, whose sample data have already been grouped into several clusters by step 2).
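Routing a prediction sample from the root to a leaf can be sketched as below. The nested-dictionary tree layout, its key names, and the hand-built toy tree are assumptions for illustration; the branching rule (smaller values go left, greater or equal go right) follows the split rule given in the text.

```python
def find_leaf(node, x):
    """Walk from the root to the leaf that sample x falls into."""
    while "samples" not in node:  # internal nodes carry a split, leaves carry samples
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node


# Hand-built toy tree; each leaf stores its training samples as (features, y).
tree = {
    "feature": 0, "threshold": 3.0,
    "left": {"samples": [((1.0, 5.0), 10.0), ((2.0, 3.0), 10.5)]},
    "right": {
        "feature": 1, "threshold": 5.0,
        "left": {"samples": [((3.0, 4.0), 20.0)]},
        "right": {"samples": [((4.0, 6.0), 20.5)]},
    },
}
leaf = find_leaf(tree, (3.5, 4.2))  # right at the root, then left
```

The samples stored in the returned leaf are the ones whose clusters are compared with X in the next step.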

When predicting a new sample X, the cluster in the leaf node that is most similar to X is determined. Taking the weights of the sample attributes into account, the distance between X and each cluster center is calculated as a weighted Euclidean distance; a cluster at a smaller distance is considered more similar to the prediction sample, and the cluster $c_i$ closest to the prediction sample is returned.

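Weighted Euclidean distance formula:

$$\mathrm{dist}(X, d) = \sqrt{\sum_{i=1}^{n} V'_i \,(X_i - d_i)^2}$$

where X is the prediction sample, d is the cluster center of a cluster (with $X_i$ and $d_i$ their i-th components), n is the sample dimension, and $V'_i$ is the calculated feature correlation degree.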

4) Take the most similar cluster $c_i$ returned in step 3), calculate the mean $V_{c_i}$ of the target variable values of the samples in $c_i$, and use $V_{c_i}$ as the final prediction result of the regression tree.

5) Obtain the prediction result of each tree in the random forest by the above method, and take the average of these results as the final prediction result of the random forest.

FIG. 4 shows the predicted travel time trend of a road section for the period from 5:00 to 20:00 on June 26, 2017.

The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims

1. A road travel time prediction method based on random forests and a clustering algorithm is characterized by comprising the following steps:

1) extracting, according to the spatio-temporal characteristics of the road network, a set V of relevant features that influence the prediction result from historical traffic data, weather data, the road network structure, and the attributes of the road sections themselves, and constructing a sample data set D, where $V = \{v_1, v_2, v_3, \ldots, v_n\}$; the data structure of the sample data set D is $\{v_1, v_2, v_3, \ldots, v_n, y\}$, where $v_i$ is a feature, n is the feature dimension, and y is the prediction target;

3) measuring the importance of each feature in the feature set V with respect to the prediction target y, using the feature selection method commonly used with the random forest algorithm, to obtain the degree of correlation V' between each feature in V and the prediction target, where $V' = \{v_1', v_2', v_3', \ldots, v_n'\}$; the larger the value of $v_i'$, the more relevant feature $v_i$ is to the prediction target and the more important the feature, and the more important features are preferentially selected for constructing the random forest;

5) inputting sample data X related to the predicted road section into the density clustering and random forest mixed prediction model constructed in the step 4) to obtain a final prediction result;

the step 4) of constructing the density clustering and random forest mixed prediction model specifically comprises the following steps:

First, initializing the number m of regression trees T to be generated in the random forest and the depth d of the trees, where $T = \{t_1, t_2, t_3, \ldots, t_m\}$, $t_i$ denotes the i-th regression tree, and m is the number of regression trees generated;

2. The method for predicting road travel time based on random forests and clustering algorithms according to claim 1, wherein the step 5) of obtaining a final prediction result specifically comprises the steps of:

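3. The method for predicting road travel time based on random forest and clustering algorithm as claimed in claim 2, wherein in the second step, taking the weights of the sample attributes into account, the distance between the sample X and each cluster center is calculated as a weighted Euclidean distance, a cluster with a smaller distance being considered more similar to the prediction sample;

weighted Euclidean distance formula:

$$\mathrm{dist}(X, d) = \sqrt{\sum_{i=1}^{n} V'_i \,(X_i - d_i)^2}$$

where X is the prediction sample, d is the cluster center of a cluster (with $X_i$ and $d_i$ their i-th components), n is the sample dimension, and $V'_i$ is the feature correlation degree.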

4. The method for predicting road travel time based on random forests and clustering algorithms according to claim 2, wherein constructing an unpruned regression tree model for each of the m sample subsets, using the minimum residual variance as the basis for selecting the splitting attribute and splitting point, specifically comprises the following steps: first, traversing each feature in the sample subset and sorting the data under each feature in ascending order; second, taking each sorted feature value in turn as a candidate splitting node, data smaller than the splitting node entering the left branch and data greater than or equal to it entering the right branch, so that the original sample subset is divided into two data sets; then, using the minimum residual variance formula to calculate the value S before the division, the value $S_L$ of the left branch and the value $S_R$ of the right branch after the division, and computing $S - (S_L + S_R)$, where S denotes the residual variance value, the feature value that maximizes this difference being selected as the splitting node and the feature to which it belongs being used as the current splitting feature; and performing the above division recursively, checking the termination condition at each step: if the depth of the tree is greater than the initial value d, the tree stops growing; otherwise, the regression tree continues to grow.

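5. The method for predicting road travel time based on random forest and clustering algorithm according to claim 4, wherein the minimum residual variance formula is as follows:

$$S = \sum_{i=1}^{N} (y_i - \bar{y})^2$$

where N is the number of samples in each sample subset, y is the prediction target, and $\bar{y}$ is the mean of the prediction targets.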