CN110390816A

CN110390816A - A state discrimination method based on multi-model fusion

Info

Publication number: CN110390816A
Application number: CN201910650794.9A
Authority: CN
Inventors: 张凤荔; 王瑞锦; 翟嘉伊; 刘崛雄; 周世杰; 张雪岩
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2019-10-29

Abstract

The present invention relates to a state discrimination method based on multi-model fusion. The method comprises the following steps: data preprocessing, in which the acquired traffic flow data is preprocessed; feature selection, in which a relevant feature subset is selected by removing irrelevant and redundant features so as to reduce the data dimensionality; multi-feature clustering, in which the traffic flow data is divided by analyzing its multi-dimensional features; and real-time classification, in which the traffic flow data is classified to judge the real-time traffic state. The method can discriminate the real-time state of a path in the current network topology and provides a theoretical basis and a technical route for path weight determination and subsequent path planning applications. Compared with the traditional single-feature threshold discrimination method, accuracy and effectiveness are improved; meanwhile, the feature selection method can remove some irrelevant features, which improves the discrimination precision.

Description

State discrimination method based on multi-model fusion

Technical Field

The invention relates to a traffic flow state discrimination method, in particular to a state discrimination method based on multi-model fusion.

Background

In recent years, rapid urban economic development has pushed urban traffic demand toward saturation. Traffic congestion has become the foremost problem in urban transportation, and urban roads are frequently in a queuing or even congested state, which seriously reduces travel efficiency and people's willingness to travel. As an important component of the intelligent transportation system, vehicle route guidance can efficiently provide travelers with services such as navigation, positioning and geographic information in real time, and guide them from origin to destination. The selected path planning strategy directly determines the quality of the driving path that route guidance provides to travelers. The path planning technology in a vehicle route guidance system must provide accurate path search results according to dynamic traffic demand, and must also recompute those results in real time as traffic information changes, so that the path planning result does not become invalid. Optimal path planning technology uses intelligent devices such as GPS (Global Positioning System) receivers and sensors to acquire the real-time running state of the road network, analyzes the reachability between the origin node and the target node, searches for reachable paths between them, sets an optimization rule such as lowest fuel consumption or congestion avoidance, selects different schemes according to that rule, and presents the screened results to the user for selection.

Vehicle path guidance presents the current optimal route plan to travelers by using optimal path finding technology according to the users' real-time requirements. Since vehicle path guidance must provide an optimal route based on the current running state of the road network, it places great demands on the timeliness of the routing algorithm. Traditional path planning algorithms based on graph theory must traverse too many nodes and store too much intermediate data, so they are difficult to apply to large-scale, complex network topologies.

From the perspective of economic development, traffic congestion caused by the over-saturation of urban traffic has become an unavoidable problem in urban construction. The utilization rate of the road network is difficult to improve effectively, and the traffic resource allocation mechanism is disordered. Judging the traffic flow state based on the real-time road network structure can effectively improve the utilization rate of the current road network, reduce traffic accidents, make decision management more scientific and intelligent, and improve resource allocation capability.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a state discrimination method based on multi-model fusion, which is used for discriminating the real-time state of a path in the current network topology and providing a theoretical basis and a technical route for path weight determination and subsequent path planning application.

The purpose of the invention is realized by the following technical scheme: a state discrimination method based on multi-model fusion comprises the following steps:

data preprocessing, namely preprocessing the acquired traffic flow data;

feature selection, namely selecting a relevant feature subset to reduce data dimensionality by removing irrelevant and redundant features;

multi-feature clustering, namely dividing the traffic flow data by analyzing multi-dimensional features;

real-time classification, namely classifying the traffic flow data to judge the real-time traffic state.

The data preprocessing steps are as follows:

judging whether the acquired traffic flow data has abnormal data or not, and processing the abnormal data;

and carrying out data standardization processing on the data.

The specific steps of judging whether the acquired traffic flow data has abnormal data and processing the abnormal data are as follows:

judging whether the fluctuation range of the data value range is in a reasonable range or not;

if the data value range exceeds a reasonable range, indicating that the data has obvious errors, and processing the error data;

if the data value range fluctuates within a reasonable range, the data is normal.

The specific steps for carrying out data standardization processing on the data are as follows:

traversing all feature vectors of traffic flow data to obtain a maximum value;

traversing all feature vectors of traffic flow data to obtain a minimum value;

and carrying out normalization processing on the feature vectors.

The specific steps of selecting the relevant feature subset and reducing the data dimension by removing irrelevant and redundant features are as follows:

calculating the correlation between different feature vectors and known classes in the training set;

determining different weights of different characteristics according to different correlations;

and deleting the characteristic that the weight value is smaller than the threshold value.

The multi-feature clustering is characterized in that the traffic flow data are divided by multi-dimensional feature analysis, and the method comprises the following specific steps:

the first step is as follows: initially, let S = 1, and use the K-Means clustering algorithm on the first m data items to calculate k level-S centroids.

The second step: repeat the first step until m level-S centroids are obtained.

The third step: apply the K-Means clustering algorithm to the m level-S centroids to obtain k level-(S+1) centroids.

The fourth step: repeat the third step until m level-(S+1) centroids are obtained, then let S = S + 1.

The fifth step: repeat the above steps, that is, each time m level-S centroids are obtained, cluster them with the K-Means algorithm to obtain k level-(S+1) centroids, until the final k centroids are obtained.
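The five steps above can be sketched as a small hierarchical batching routine. This is a minimal illustration and not the patented implementation: the function names `kmeans` and `stream_cluster`, the farthest-first initialization (chosen only to keep the sketch deterministic; the K-Means description in this document picks initial centroids randomly), and the unweighted treatment of leftover centroids at the end of the stream are all assumptions.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]

def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points: List[Point], k: int, n_iter: int = 20) -> List[Point]:
    """Plain Lloyd iterations; farthest-first initialization is an assumption."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda idx: sq_dist(p, centroids[idx]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def stream_cluster(stream: Iterable[Point], k: int, m: int) -> List[Point]:
    """Reduce every m level-S items to k level-(S+1) centroids, level by level."""
    if m <= k:
        raise ValueError("m must be larger than k so that each level shrinks")
    levels: List[List[Point]] = [[]]  # levels[0] buffers raw data, levels[s] level-s centroids
    for point in stream:
        levels[0].append(point)
        s = 0
        while len(levels[s]) >= m:
            if s + 1 == len(levels):
                levels.append([])
            levels[s + 1].extend(kmeans(levels[s], k))
            levels[s] = []
            s += 1
    # Assumption: whatever is still buffered is clustered once more, unweighted.
    leftovers = [p for level in levels for p in level]
    return kmeans(leftovers, k)
```

Because each level keeps at most m items in memory, the buffer size grows with the number of levels rather than with the length of the stream, which is the point of the batching scheme.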

The real-time classification for classifying the traffic flow data to judge the real-time traffic state comprises the following steps:

the first step is as follows: randomly sampling the sample set with replacement, selecting m random samples in total;

the second step is that: from the feature set obtained by feature selection, randomly selecting n features and establishing a CART decision tree model;

the third step: repeating the first step and the second step k times to generate k CART decision trees, wherein each decision tree has an independent decision criterion;

the fourth step: inputting the traffic flow data into each decision tree, and finally determining the category to which the traffic flow data belongs.
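The four classification steps above can be illustrated with a compact random forest sketch. It is a hedged example rather than the inventors' code: the function names, the depth limit, and the use of a majority vote to combine the k trees are assumptions (the text only states that each tree decides and the category is finally determined), and the CART trees here split on weighted Gini impurity.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_cart(rows, labels, features, depth=0, max_depth=5):
    """Grow a CART classification tree; a leaf is a label, a node is a tuple."""
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no usable split: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_cart([rows[i] for i in left], [labels[i] for i in left],
                       features, depth + 1, max_depth),
            build_cart([rows[i] for i in right], [labels[i] for i in right],
                       features, depth + 1, max_depth))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(rows, labels, k, m, n, seed=0):
    """Steps 1-3: k trees, each on m bootstrap samples and n random features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in range(m)]   # sampling with replacement
        feats = rng.sample(range(len(rows[0])), n)            # random feature subset
        forest.append(build_cart([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Step 4 (assumed combination rule): majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]
```

Labels are assumed to be strings (for example the traffic state names), so that leaves can be told apart from internal nodes, which are stored as tuples.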

The invention has the following advantages: a state discrimination method based on multi-model fusion can discriminate the real-time state of a path in the current network topology, and provides a theoretical basis and a technical route for path weight determination and subsequent path planning application; compared with the traditional single-feature threshold value discrimination method, the accuracy and the effectiveness are improved, meanwhile, the feature selection method can remove some irrelevant features, and the discrimination precision is improved.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow chart of error data determination;

FIG. 3 is a flow chart of the random forest algorithm;

FIG. 4 is a characteristic weight diagram of traffic flow data in an embodiment;

FIG. 5 is a graph comparing the accuracy of different models in the examples.

Detailed Description

The invention will be further described with reference to the accompanying drawings, but the scope of the invention is not limited to the following.

As shown in fig. 1, a method for discriminating states based on multi-model fusion includes the following steps:

data preprocessing, namely preprocessing the acquired traffic flow data;

Further, raw data cannot be completely correct after acquisition, transmission and storage; it inevitably contains defects such as inconsistent data types, missing data and redundant data. If the raw data is used directly without processing, low-quality data flows into the algorithm model and severely damages its learning process. Conversely, properly preprocessing the data can significantly improve the quality and reliability of the model's decisions.

Further, as data scale and data complexity rise exponentially, solving the problem often requires a very large algorithm structure, and the algorithm complexity and response time increase sharply. This is unacceptable, especially for streaming data, which arrives over time as an unbounded stream. Therefore, during model construction it is crucial to select effective data features (variables).

Furthermore, different features describe different aspects of the problem to be solved; on the surface they appear independent, but in reality they are deeply related. Multi-feature clustering analyzes the multi-dimensional features and uses the similarity principle to divide the instances into several clearly different subsets. In short, cluster analysis places the objects in a multi-dimensional space, distinguishes them according to the differences among them, and assigns objects with the same attributes to the same class and objects with different attributes to different classes, so as to achieve high cohesion within classes and low coupling between classes; that is, objects in the same class are highly similar and objects in different classes are highly different.

Furthermore, because of the particular nature of stream data, the real-time requirements on the model algorithm are extremely high, so it is important to establish a real-time classifier for the continuously generated stream data. The real-time classifier must respond quickly to data flowing into the model and complete the classification of the stream data within a limited time, avoiding large-scale data queuing and blocking caused by excessive computational complexity.

The data preprocessing steps are as follows:

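judging whether the acquired traffic flow data has abnormal data or not, and processing the abnormal data;

and carrying out data standardization processing on the data.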

As shown in fig. 2, the specific steps of determining whether there is abnormal data in the collected traffic flow data and processing the abnormal data are as follows:

Further, when the fluctuation of the data value range is between 50% and 150%, the data value range is considered to fluctuate within a reasonable range.

Further, the error data mainly includes two types: data errors and missing data. A data error is an unexpected value caused by a data format error during acquisition or storage, such as a negative number in traffic flow data. Missing data means that the device was interrupted during data acquisition so that some data is obviously absent, such as traffic density data that was not acquired.

Further, when the data has obvious errors, error or exception processing is needed. When only a few errors occur, they are negligible compared with the correct data and can be deleted directly: if the error data amounts to less than 5% of the correct data, the error data is deleted. If the error data amounts to more than 5% of the correct data, the error data needs to be corrected, and the invention fills it with adjacent data or with the arithmetic mean over a period of time.
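The delete-or-fill rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `is_error` and `clean_series` are hypothetical, a record counts as erroneous when it is missing (`None`) or outside a caller-supplied valid range, and filling uses the mean of the nearest valid neighbors on either side as one possible reading of "adjacent data".

```python
from typing import List, Optional

def is_error(value: Optional[float], low: float, high: float) -> bool:
    """A record is erroneous if it is missing or falls outside [low, high]."""
    return value is None or not (low <= value <= high)

def clean_series(values: List[Optional[float]], low: float, high: float,
                 max_error_ratio: float = 0.05) -> List[float]:
    """Delete erroneous records when they are rare, otherwise fill them in."""
    good = [v for v in values if not is_error(v, low, high)]
    n_bad = len(values) - len(good)
    if not good:
        raise ValueError("no valid records to work with")
    # Ratio of erroneous to correct records, compared with the 5% rule.
    if n_bad / len(good) < max_error_ratio:
        return good
    cleaned: List[float] = []
    for i, v in enumerate(values):
        if not is_error(v, low, high):
            cleaned.append(v)
            continue
        # Assumption: fill with the mean of the nearest valid neighbors.
        prev = next((values[j] for j in range(i - 1, -1, -1)
                     if not is_error(values[j], low, high)), None)
        nxt = next((values[j] for j in range(i + 1, len(values))
                    if not is_error(values[j], low, high)), None)
        neighbours = [x for x in (prev, nxt) if x is not None]
        cleaned.append(sum(neighbours) / len(neighbours))
    return cleaned
```

For an occupancy series, for instance, the valid range would be 0 to 1, so a reading of 1.4 or a missing reading is treated as erroneous.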

traversing all feature vectors of traffic flow data to obtain a maximum value Max;

traversing all feature vectors of traffic flow data to obtain a minimum Min;

and carrying out normalization processing on the feature vectors.

Further, the normalization formula is as follows:

$$x_{0-1} = \frac{x - Min}{Max - Min}$$

where $x_{0-1}$ is the normalized feature vector, x is the feature vector, Min is the minimum value of the feature vector, and Max is the maximum value of the feature vector.
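As a small illustration of this step, the sketch below applies min-max normalization to a single feature column, mapping each value x to (x - Min) / (Max - Min). The function name and the handling of a constant column (all values mapped to 0) are assumptions added for the example.

```python
from typing import List

def min_max_normalize(column: List[float]) -> List[float]:
    """Scale one feature column into [0, 1] using its own Min and Max."""
    lo, hi = min(column), max(column)   # the two traversals described in the text
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

Applying the function to each feature (volume, speed, occupancy and so on) separately keeps features with large raw ranges from dominating the distance calculations used later in clustering.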

Specifically, a sample S is randomly selected from the training set T; then the k nearest neighbor samples H_k are found in the sample set of the same class as S, and k nearest neighbor samples M_k are found in each sample set of a class different from S. The weight of each feature is updated according to the following formula.

W(A)＝W(A)-similarity_H(A)+difference_M(A)

Wherein,

M_j(C) represents the j-th nearest sample in class C, and diff(A, S, R) represents the difference between sample S and sample R in feature A, which is calculated as follows:

It can be seen that the second of the above formulas essentially calculates the sum of the distances, on a given feature, from the sample S to its nearest same-class samples H_k; the third formula calculates the sum of the distances, on that feature, from the sample S to its nearest different-class samples M_k. According to the update rule in the first formula, when the sum of the distances from S to the nearest same-class samples H_k on a feature is less than the sum of the distances to the nearest different-class samples M_k, the weight of the feature is increased, i.e., the feature helps to distinguish same-class samples from different-class samples. Conversely, when the same-class distance sum is greater than the different-class distance sum, the weight is reduced, i.e., the feature has a negative effect on distinguishing the classes.

Of course, the selection of the sample S is random, so the selection may be repeated n times and the average weight of each feature taken as its final weight. If the weight of a feature is greater than 0.5, the correlation between the feature and the problem to be solved is high; otherwise the correlation is low. In particular, if the weight of a feature is less than a threshold, the feature has almost no relationship with the problem to be solved and can be removed directly from the multi-dimensional feature vector group, thereby achieving feature selection.
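The weight update W(A) = W(A) - similarity_H(A) + difference_M(A) can be sketched in the style of the well-known Relief family of feature-weighting algorithms. Several details below are assumptions, since the exact formulas are not reproduced in this text: `diff` is taken to be the absolute difference of features already normalized to [0, 1], the neighbor sums are averaged over k and over the n repetitions, and the contributions of the different classes are added without prior weighting.

```python
import random
from typing import List, Sequence

def diff(a: int, s: Sequence[float], r: Sequence[float]) -> float:
    """Assumed diff(A, S, R): absolute difference on feature a, features in [0, 1]."""
    return abs(s[a] - r[a])

def distance(s: Sequence[float], r: Sequence[float]) -> float:
    """Euclidean distance used to find nearest neighbors."""
    return sum((x - y) ** 2 for x, y in zip(s, r)) ** 0.5

def relief_weights(X: List[Sequence[float]], y: List[int],
                   k: int = 2, n_iter: int = 30, seed: int = 0) -> List[float]:
    """Average feature weights over n_iter randomly chosen samples S."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        s, label = X[i], y[i]
        # H_k: the k nearest samples of the same class as S (excluding S itself).
        hits = sorted((X[j] for j in range(len(X)) if j != i and y[j] == label),
                      key=lambda r: distance(s, r))[:k]
        for a in range(len(w)):
            w[a] -= sum(diff(a, s, h) for h in hits) / (k * n_iter)
        # M_k: the k nearest samples in each class different from S.
        for c in set(y) - {label}:
            misses = sorted((X[j] for j in range(len(X)) if y[j] == c),
                            key=lambda r: distance(s, r))[:k]
            for a in range(len(w)):
                w[a] += sum(diff(a, s, m) for m in misses) / (k * n_iter)
    return w

def select_features(weights: List[float], threshold: float) -> List[int]:
    """Keep the indices of features whose weight is not below the threshold."""
    return [a for a, wa in enumerate(weights) if wa >= threshold]
```

A feature that is close among same-class neighbors and far among different-class neighbors accumulates a large positive weight, while a feature that varies as much within a class as between classes drifts toward zero or below and is dropped by `select_features`.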

Furthermore, the invention performs multi-feature cluster analysis based on the STREAM algorithm. The STREAM algorithm builds on the K-Means algorithm and introduces a sliding window mechanism to handle stream data clustering. Because the underlying framework of the STREAM algorithm is still the K-Means clustering algorithm, K-Means is briefly analyzed below.

On the basis of the K-Means algorithm, the STREAM algorithm is used to cluster the stream data features. Its underlying algorithm is K-Means, and a batch processing mechanism is added in the upper layer to address concept drift in stream data.

The K-Means algorithm divides different classes according to the distribution similarity of data points in the multi-dimensional feature space. Specifically, k objects are randomly acquired from the dataset and are considered as initial centroids of k clusters; and distributing the other objects to the nearest cluster according to the Euclidean distance between the other objects and the centroid of each cluster, recalculating the centroid of each cluster, and repeating the process iteratively until the distortion function is converged to obtain k fixed and invariable centroids. Specifically, the algorithm flow is as follows:

1. Randomly select k objects from the data set as the initial centroids $\mu_1, \mu_2, \ldots, \mu_k$ of the k clusters;

2. Calculate the Euclidean distance between each object and each cluster centroid, and reassign each object to the cluster at minimum distance, according to the following formula:

$$C^{(i)} = \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$$

where $C^{(i)}$ is the category to which the i-th data object belongs, $x^{(i)}$ is the i-th data object, and $\mu_j$ is the j-th cluster center.

3. Update the centroids $\mu_1, \mu_2, \ldots, \mu_k$ of the k clusters according to the following formula, in which each centroid becomes the mean of the objects currently assigned to it:

$$\mu_j = \frac{\sum_i 1\{C^{(i)} = j\}\, x^{(i)}}{\sum_i 1\{C^{(i)} = j\}}$$

4. Repeat steps 2-3 until the distortion function of the following formula converges, yielding k unchanged centroids:

$$J(C, \mu) = \sum_i \lVert x^{(i)} - \mu_{C^{(i)}} \rVert^2$$

where J(C, μ) is the distortion function and $\mu_{C^{(i)}}$ is the center of the cluster to which $x^{(i)}$ belongs after the clustering is completed.
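The four K-Means steps can be sketched as a short routine that also tracks the distortion function, taken here as the sum of squared distances from each object to the centroid of its assigned cluster. The function names, the seeded random generator and the convergence tolerance are assumptions introduced for the example.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]

def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_with_distortion(points: List[Point], k: int, seed: int = 0,
                           tol: float = 1e-9, max_iter: int = 100):
    """Return (centroids, labels, J) once the distortion J stops decreasing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # step 1: random initial centroids
    labels: List[int] = []
    prev_j = float("inf")
    j_value = prev_j
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        labels = [min(range(k), key=lambda idx: sq_dist(x, centroids[idx]))
                  for x in points]
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 4: evaluate the distortion and stop when it no longer improves.
        j_value = sum(sq_dist(x, centroids[lab]) for x, lab in zip(points, labels))
        if prev_j - j_value < tol:
            break
        prev_j = j_value
    return centroids, labels, j_value
```

Neither the assignment step nor the update step can increase the distortion, which is why monitoring J gives a usable stopping rule.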

As shown in fig. 3, the real-time classification is based on a decision tree theory, a random forest model is established, and the classification obtained by the multi-feature clustering module is used as a training set to classify the real-time traffic flow and judge the real-time traffic state.

Taking the traffic flow data of the expressway in Sichuan province as an example, for convenience of discussion, the original data is simply processed in advance, and 287 pieces of traffic flow data acquired by the expressway in Sichuan province are shown in table 1, wherein the acquisition period is 5 min. Volume in the table is a flow field; speed is a vehicle Speed field; density is a traffic Density field; occupancy is an Occupancy field; queue is a queuing time length field.

TABLE 1 traffic flow data for a certain highway in Sichuan province

It can be observed that the traffic flow data in the table above are not completely correct, wherein significant errors remain. Normally, the range of occupancy should be between 0 and 1. When the road is completely unblocked, the vehicle does not need to stay on the road, the occupancy is minimum and is 0 at the moment, when the road is seriously congested, the vehicle needs to stay on the road to wait, and the occupancy reaches the peak value and is 1 at the moment. However, when the occupancy of a plurality of data in the table exceeds 1, the data is classified as erroneous data. In addition, in some data, when vehicles exist on the road, the traffic density is reduced to 0, which is obviously not reasonable. Thus, correction of the error data is required.

A traffic parameter data model is established based on a traffic flow theory. The correction process for the erroneous data is accomplished using the following formula.

After error data correction, normalization processing needs to be performed on the features. The processed data are shown in Table 2, with a total of 274 correct data.

TABLE 2 traffic flow data after preprocessing

It can be found that through the data normalization processing, the value ranges of all traffic flow characteristics are distributed between 0 and 1, and the errors of the models caused by different expression forms among the characteristics are eliminated.

After the traffic flow data is preprocessed, the traffic flow characteristics are analyzed, the traffic characteristics beneficial to solving the traffic state are selected, and the characteristics which do not help or are redundant to solving the traffic state are removed, so that the precision and the reliability of a subsequent model are improved. Before feature selection, the invention contacts experts in the traffic field to manually judge the traffic states corresponding to a small part of traffic flow data, and the invention also uses the part of data as reference to select traffic features.

As can be seen from table 1, the characteristics included in the traffic flow data are mainly traffic Volume (Volume), travel Speed (Speed), traffic Density (Density), occupancy (occupancy), and Queue length (Queue).

As shown in fig. 4, in the feature selection process using the method provided by the present invention, considering that the algorithm may select a random sample S during the operation process, which may cause a certain difference in the result weight, the present invention adopts a method of averaging in multiple experiments, and performs 30 experiments in total, and summarizes the operation results each time to obtain the average value of each weight. In the figure, Q represents the flow rate, V represents the travel speed, P represents the traffic density, O represents the occupancy, and L represents the queuing time.

The average weight of each feature is shown in table 3;

TABLE 3 mean weight of features

In the table, Q represents flow, V represents travel speed, P represents traffic density, O represents occupancy, and L represents queuing time.

According to the feature selection algorithm, the weight values of the 3 features of the flow, the travel speed and the occupancy are all larger than 10%, and the weight values of the features of the traffic density and the queuing time are all smaller than 5%, so that the traffic state can be mapped by the 3 traffic flow features of the flow, the travel speed and the occupancy on the expressway. The traffic density and the queuing length have low correlation with the traffic state, and the two low correlation characteristics should be removed.

According to the traffic flow theory, the invention defines 4 traffic state grades which are respectively smooth, slow running, congestion and serious congestion.

The road traffic status results output by the real-time classifier are shown in table 4.

TABLE 4 traffic status discrimination

The data set was randomly divided into a test set and a training set at a ratio of 1:4, and a real-time classification model was established to judge the road traffic state. The invention uses the accuracy index to evaluate the algorithm.

Accuracy＝(TP+TN)/(P+N)

Where TP is the number of true positives, i.e., the number of instances (samples) that are actually positive and are classified as positive by the classifier; TN is the number of true negatives, i.e., the number of instances that are actually negative and are classified as negative by the classifier; and P + N is the total number of samples.
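As a brief illustration of the accuracy index, the sketch below computes Accuracy = (TP + TN) / (P + N) from counts and, equivalently, as the fraction of correctly classified samples when there are several traffic state classes. The function names are hypothetical, and the multi-class form is an assumption about how the index is applied to the four traffic states.

```python
from typing import Sequence

def accuracy_from_counts(tp: int, tn: int, p: int, n: int) -> float:
    """Binary case: correctly classified positives and negatives over all samples."""
    return (tp + tn) / (p + n)

def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """General case: fraction of samples whose predicted state equals the true state."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```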

As shown in fig. 5, a real-time classification model is established by respectively using the multi-model fusion algorithm, the traditional clustering algorithm and the single-feature threshold discrimination algorithm provided by the invention, the real-time traffic state of a certain expressway in sichuan province is discriminated, the accuracy of the model is calculated, and the quality of the model is evaluated according to the accuracy.

According to experimental results, the precision of the state discrimination algorithm provided by the invention can reach about 94%, the accuracy and the effectiveness are improved compared with the traditional single-feature threshold discrimination method, and meanwhile, the feature selection method provided by the invention can remove some irrelevant features and improve the discrimination precision.

The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that this invention is not limited to the disclosed forms, but is intended to cover other embodiments, as may be used in various other combinations, modifications, and environments and is capable of changes within the scope of the invention as set forth, either as indicated by the above teachings or as may be learned by the practice of the invention. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A state discrimination method based on multi-model fusion is characterized in that: the method comprises the following steps:

data preprocessing, namely preprocessing the acquired traffic flow data;

2. The method according to claim 1, wherein the method comprises: the data preprocessing steps are as follows:

and carrying out data standardization processing on the data.

3. The method according to claim 2, wherein the method comprises: the specific steps of judging whether the acquired traffic flow data has abnormal data and processing the abnormal data are as follows:

4. The method according to claim 2, wherein the method comprises: the specific steps for carrying out data standardization processing on the data are as follows:

traversing all feature vectors of traffic flow data to obtain a maximum value;

traversing all feature vectors of traffic flow data to obtain a minimum value;

and carrying out normalization processing on the feature vectors.

5. The method according to claim 1, wherein the method comprises: the specific steps of selecting the relevant feature subset and reducing the data dimension by removing irrelevant and redundant features are as follows:

6. The method according to claim 1, wherein the method comprises: the multi-feature clustering is characterized in that the traffic flow data are divided by multi-dimensional feature analysis, and the method comprises the following specific steps:

the first step is as follows: initially, let S = 1, and use the K-Means clustering algorithm on the first m data items to calculate k level-S centroids.

The second step: repeat the first step until m level-S centroids are obtained.

The third step: apply the K-Means clustering algorithm to the m level-S centroids to obtain k level-(S+1) centroids.

The fourth step: repeat the third step until m level-(S+1) centroids are obtained, then let S = S + 1.

10. The method according to claim 1, wherein the method comprises: the real-time classification for classifying the traffic flow data to judge the real-time traffic state comprises the following steps:

the first step is as follows: randomly sampling the sample set with replacement, selecting m random samples in total;

the second step is that: from the feature set obtained by feature selection, randomly selecting n features and establishing a CART decision tree model;

the third step: repeating the first step and the second step k times to generate k CART decision trees, wherein each decision tree has an independent decision criterion;