WO2020093701A1 - Vehicle accident risk prediction model based on adaboost-so in vanets - Google Patents

Vehicle accident risk prediction model based on AdaBoost-SO in VANETs

Info

Publication number
WO2020093701A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
adaboost
sample
data set
vanets
Prior art date
Application number
PCT/CN2019/092462
Other languages
French (fr)
Chinese (zh)
Inventor
赵海涛
丁仪
蔡舒祺
张晖
段佳秀
朱洪波
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学 filed Critical 南京邮电大学
Publication of WO2020093701A1 publication Critical patent/WO2020093701A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities

Definitions

  • the invention relates to the field of Internet of Vehicles technology, in particular to a VANETs vehicle accident risk prediction model based on AdaBoost-SO.
  • VANETs: vehicular ad hoc networks
  • ITS: intelligent transportation systems
  • The document “The traffic accident hotspot prediction: Based on the logistic regression method” uses statistics and logistic regression analysis of typical factors to study the relationships among traffic accidents, road type, vehicle type, driver status, weather, and date, and establishes an accident hotspot prediction model.
  • The document “Traffic big data analysis supporting vehicular network access recommendation” developed an intelligent network recommendation system supported by traffic big data analysis; it recommends that vehicles access the appropriate network through the analysis framework and enables individual vehicles to access the network automatically based on the access recommender.
  • the main purpose of the present invention is to solve the problems in the prior art.
  • the present invention provides a VANETs vehicle accident risk prediction model based on AdaBoost-SO.
  • A VANETs vehicle accident risk prediction model based on AdaBoost-SO; the steps for establishing the model include:
  • Step 1: fill in (impute) the research data set;
  • Step 2: balance the samples in the data set with the SMOTE algorithm, and encode the discrete features of each sample with One-Hot encoding;
  • The Synthetic Minority Oversampling Technique (SMOTE) algorithm is used to solve the problem that the number of samples in each category of the research data set is imbalanced;
  • Step 3: train the research data set with the trichotomy Adaboost-SO algorithm to obtain the system model;
  • The road safety data is randomly divided into training data and test data, and 6-fold cross-validation is performed; this method makes full use of all samples and requires 6 rounds of training and 6 rounds of testing; the research data set is then processed with the trichotomy AdaBoost algorithm;
  • Step 4: import real-time traffic data sets through VANETs to obtain the output of the prediction model;
  • Common implementations include filling missing entries with the mean of the available feature values, with special values, or with the average of similar samples, or simply ignoring samples with missing values.
  • the SMOTE algorithm implementation process is:
  • Step 2-1: for each sample x in the minority class, the Euclidean distance is used as the criterion to compute the distance to all other samples in the minority class, obtaining its k nearest neighbors;
  • Step 2-2: set the sampling rate N according to the sample imbalance ratio; for each minority-class sample x, with its k nearest neighbors selected, randomly choose several samples from these k neighbors;
  • Step 2-3 For each selected neighbor, use the original sample to construct a new sample according to the following formula;
  • In Step 3, the specific steps of the 6-fold cross-validation are as follows:
  • Step 3-1-1: divide the entire research data set S into 6 mutually disjoint subsets of equal size; assuming the number of training samples is m, each subset has m/6 training samples, and the corresponding subsets are {S1, S2, S3, S4, S5, S6};
  • Step 3-1-2 use one subset as the test set, and then use the other five subsets as the training set;
  • Step 3-1-3 train the model through the training data, use the test data to verify the accuracy of the model and repeat six times;
  • Step 3-1-4 Calculate the average value of 6 evaluation errors as the true classification accuracy of the model.
  • the trichotomy AdaBoost algorithm is used to process the research data set, and the specific implementation steps are as follows:
  • Step 3-2-2: the weights of the training data are initialized as:
  • χ is the data to be trained
  • The error rate of G_m(x) is calculated from the classification results on the training data
  • w_mi represents the weight of the i-th sample in the m-th iteration:
  • Since the weights are normalized at each step, the denominator does not need to be divided by the sum of the sample weights;
  • Step 3-2-4: an error-rate threshold is set for e_m in trichotomy AdaBoost, and a positive term x is added so that a_m ≥ 0 is guaranteed when e_m stays below the threshold; the coefficient of the classifier G_m(x) is calculated from the error rate e_m:
  • Step 3-2-5 construct a linear combination of basic classifiers to obtain the final classifier:
  • The linear combination f(x) implements the weighted voting of the M basic classifiers; the value of f(x) determines the category of the instance x and indicates the confidence of the classification; the trained weak classifiers are combined into a strong classifier to obtain the vehicle accident risk prediction model.
  • The beneficial effects of the present invention are: a system model with a maximum iteration value of 100 guarantees the maximum accuracy of accident prediction under ordinary road conditions, while a system model with a smaller maximum iteration value can improve timeliness in special circumstances; in prediction, the maximum performance of the system can thus be exploited.
  • FIG. 1 is a schematic flowchart of the method of the present invention.
  • Figure 2 shows the architecture of trichotomy Adaboost-SO model.
  • A VANETs vehicle accident risk prediction model based on AdaBoost-SO; the steps for establishing the model include:
  • Step 1: fill in (impute) the research data set.
  • Step 2: use the SMOTE algorithm to balance the samples in the data set, and encode the discrete features of each sample with One-Hot encoding.
  • The Synthetic Minority Oversampling Technique (SMOTE) algorithm is used to solve the problem that the number of samples in each category of the research data set is imbalanced.
  • The SMOTE algorithm is implemented as follows:
  • Step 2-1: for each sample x in the minority class, the Euclidean distance is used as the criterion to compute the distance to all other samples in the minority class, obtaining its k nearest neighbors.
  • Step 2-2: set the sampling rate N according to the sample imbalance ratio; for each minority-class sample x, with its k nearest neighbors selected, randomly choose several samples from these k neighbors.
  • Step 2-3: for each selected neighbor, construct a new sample from the original sample according to the following formula.
  • The One-Hot encoding method uses N-bit status registers to encode N states; each state has its own register bit, and only one bit is valid at any time.
  • Step 3: use the trichotomy Adaboost-SO algorithm to train the research data set to obtain the system model.
  • Step 3-1-1: divide the entire research data set S into 6 mutually disjoint subsets of equal size; assuming the number of training samples is m, each subset has m/6 training samples, and the corresponding subsets are {S1, S2, S3, S4, S5, S6}.
  • Step 3-1-2 use one subset as the test set, and then use the other five subsets as the training set.
  • Step 3-1-3 Train the model through the training data, use the test data to verify the accuracy of the model and repeat six times.
  • Step 3-1-4 Calculate the average value of 6 evaluation errors as the true classification accuracy of the model.
  • Step 3-2-2: the weights of the training data are initialized as:
  • χ is the data to be trained.
  • The error rate of G_m(x) is calculated from the classification results on the training data; w_mi represents the weight of the i-th sample in the m-th iteration:
  • Since the weights are normalized at each step, the denominator does not need to be divided by the sum of the sample weights.
  • Step 3-2-4: an error-rate threshold is set for e_m in trichotomy AdaBoost, and a positive term x is added so that a_m ≥ 0 is guaranteed when e_m stays below the threshold; the coefficient of the classifier G_m(x) is calculated from the error rate e_m:
  • The weights of the samples misclassified by the basic classifier G_m(x) keep growing, while the weights of correctly classified samples decrease. Therefore, the misclassified samples play a greater role in the next iteration.
  • Step 3-2-5: construct a linear combination of the basic classifiers to obtain the final classifier:
  • The linear combination f(x) implements the weighted voting of the M basic classifiers; the value of f(x) determines the category of the instance x and indicates the confidence of the classification; the trained weak classifiers are combined into a strong classifier to obtain the vehicle accident risk prediction model.
  • Step 4 Import real-time traffic data sets through VANETs to obtain the output of the prediction model.
  • C0 means that the probability of a car accident is low or only a minor collision accident occurs
  • C1 means that a more serious accidental injury may occur
  • C2 indicates that the probability of a car accident is high or an accident may occur.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Educational Administration (AREA)
  • Traffic Control Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A vehicle accident risk prediction model based on AdaBoost-SO in VANETs, which can provide a theoretical basis for an ITS and for driving safety assistance. The model is established by first filling in (imputing) a study data set, balancing the samples in the data set with the SMOTE algorithm and One-Hot encoding the discrete features of each sample, then training the study data set with a trichotomy Adaboost-SO algorithm to obtain a system model, and finally importing traffic data through VANETs to obtain a vehicle accident probability. AdaBoost-SO refers to trichotomy AdaBoost with SMOTE and One-Hot encoding, VANETs refers to Vehicular Ad Hoc Networks, ITS refers to an Intelligent Transportation System, and SMOTE refers to the Synthetic Minority Oversampling Technique.

Description

A VANETs vehicle accident risk prediction model based on AdaBoost-SO
Technical Field
The invention relates to the field of Internet of Vehicles technology, and in particular to a VANETs vehicle accident risk prediction model based on AdaBoost-SO.
Background Art
With the development of today's society and economy, urban residents demand more convenience and comfort when travelling; the number of cars has grown, the pressure on urban traffic has increased, and road safety problems have become more and more serious. Especially in large cities, traffic accidents cause congestion, and vehicle accidents pose an ever greater threat to personal safety, which makes traffic safety research highly significant. At the same time, vehicular ad hoc networks (VANETs), a key technology of intelligent transportation systems (ITS), are developing rapidly and have great potential to improve road safety and traffic efficiency. VANETs provide raw road safety information for effective road safety research and offer new ideas for predicting vehicle accident risk. Collecting large amounts of VANETs data from highly heterogeneous sources paves the way for the new era of VANETs big data.
With the development of big data and machine learning, using machine learning techniques to predict traffic accidents has become a new highlight. The document "The traffic accident hotspot prediction: Based on the logistic regression method" applies statistics and logistic regression analysis to typical factors to study the relationships among traffic accidents, road type, vehicle type, driver status, weather, and date, and finally establishes an accident hotspot prediction model. The documents "The five-factor model, conscientiousness, and driving accident involvement" and "Determining personality traits of racing game players using the open racing car simulator: toward believable virtual drivers" study the relationship between a driver's conscientiousness and driving accidents and show that highly conscientious people are less likely to be involved in traffic accidents. The document "Traffic big data analysis supporting vehicular network access recommendation" developed an intelligent network recommendation system supported by traffic big data analysis; it recommends that vehicles access the appropriate network through the analysis framework and enables individual vehicles to access the network automatically based on the access recommender.
However, all of these methods focus on analyzing the causes of traffic accidents from existing traffic data and fail to obtain an accident prediction model with universal application value. It is therefore necessary to design a vehicle accident risk prediction model that can use real-time traffic data and alert vehicles at any time, providing a theoretical basis for intelligent transportation systems and driving safety assistance.
Summary of the Invention
The main purpose of the present invention is to solve the problems in the prior art. To this end, the present invention provides a VANETs vehicle accident risk prediction model based on AdaBoost-SO.
A VANETs vehicle accident risk prediction model based on AdaBoost-SO, wherein the steps of establishing the model include:
Step 1: fill in (impute) the research data set;
Step 2: balance the samples in the data set with the SMOTE algorithm, and encode the discrete features of each sample with One-Hot encoding;
Specifically, the Synthetic Minority Oversampling Technique (SMOTE) algorithm is used to solve the problem that the number of samples in each category of the research data set is imbalanced;
After the initial research data set is preprocessed with the SMOTE algorithm, an experimental data set with a relatively balanced number of samples per category is obtained; next, the discrete features of each sample are One-Hot encoded; the One-Hot encoding method uses an N-bit status register to encode N states, each state has its own register bit, and only one bit is valid at any time;
Step 3: train the research data set with the trichotomy Adaboost-SO algorithm to obtain the system model;
Specifically, when constructing the experimental data set, the road safety data is first randomly divided into training data and test data and 6-fold cross-validation is performed; this method makes full use of all samples and requires 6 rounds of training and 6 rounds of testing; the research data set is then processed with the trichotomy AdaBoost algorithm;
Step 4: import real-time traffic data sets through VANETs to obtain the output of the prediction model;
Specifically, the output value is C = {C0, C1, C2}, which indicates whether the predicted object belongs to a high accident rate; C0 indicates that the probability of a car accident is low or that only a minor collision may occur, C1 means that a more serious accidental injury may occur, and C2 indicates that the probability of a car accident is high or that an accident may occur.
Further, in Step 1, specifically, before reconstructing the data, uncertain or incomplete road safety data is found and modified to improve the data set; common implementations include filling missing entries with the mean of the available feature values, with special values, or with the average of similar samples, or simply ignoring samples with missing values.
Further, in Step 2, the SMOTE algorithm is implemented as follows:
Step 2-1: for each sample x in the minority class, the Euclidean distance is used as the criterion to compute the distance to all other samples in the minority class, obtaining its k nearest neighbors;
Step 2-2: set the sampling rate N according to the sample imbalance ratio; for each minority-class sample x, with its k nearest neighbors selected, randomly choose several samples from these k neighbors;
Step 2-3: for each selected neighbor, construct a new sample from the original sample according to the following formula;
x_new = x + rand(0, 1) × (x̂ − x)
where x̂ is the selected neighbor and rand(0, 1) is a random number uniformly drawn from (0, 1).
Further, in Step 3, the specific steps of the 6-fold cross-validation are as follows:
Step 3-1-1: divide the entire research data set S into 6 mutually disjoint subsets of equal size; assuming the number of training samples is m, each subset has m/6 training samples, and the corresponding subsets are {S1, S2, S3, S4, S5, S6};
Step 3-1-2: take one subset as the test set and the other five subsets as the training set;
Step 3-1-3: train the model on the training data, verify the accuracy of the model on the test data, and repeat six times;
Step 3-1-4: compute the average of the 6 evaluation errors as the true classification accuracy of the model.
Further, in Step 3, the trichotomy AdaBoost algorithm is used to process the research data set; the specific steps are as follows:
Step 3-2-1: input the training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where xi is the feature vector of a sample and y ∈ {1, 2, 3}; the weak classifier used in the present invention is a decision tree;
Step 3-2-2: the weights of the training data are initialized as:
D_1 = (w_11, ..., w_1i, ..., w_1N), w_1i = 1/N, i = 1, 2, ..., N
Step 3-2-3: for the m-th iteration, m = 1, 2, ..., M, train on the training data set with weight distribution D_m to obtain a basic classifier:
G_m(x): χ → {1, 2, 3}
where χ is the data to be trained; the error rate of G_m(x) is computed from the classification results on the training data, with w_mi denoting the weight of the i-th sample in the m-th iteration:
e_m = P(G_m(x_i) ≠ y_i) = Σ_{i=1}^{N} w_mi · I(G_m(x_i) ≠ y_i)
that is, e_m is the sum of the weights of the samples misclassified by G_m(x):
e_m = Σ_{G_m(x_i) ≠ y_i} w_mi
Since the weights are normalized at each step, the denominator does not need to be divided by the sum of the sample weights;
Step 3-2-4: an error-rate threshold is set for e_m in trichotomy AdaBoost, and a positive term x is added so that a_m ≥ 0 is guaranteed whenever e_m stays below the threshold; the coefficient of the classifier G_m(x) is computed from the error rate e_m as:
a_m = (1/2) · ln((1 − e_m)/e_m) + x
The weight distribution of the training data set is updated according to the coefficient a_m:
D_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N})
w_{m+1,i} = (w_mi / Z_m) · exp(a_m · I(G_m(x_i) ≠ y_i)), i = 1, 2, ..., N
which can be simplified to:
w_{m+1,i} = w_mi / Z_m, if G_m(x_i) = y_i
w_{m+1,i} = (w_mi · e^{a_m}) / Z_m, if G_m(x_i) ≠ y_i
where Z_m is the normalization factor that makes D_{m+1} a probability distribution:
Z_m = Σ_{i=1}^{N} w_mi · exp(a_m · I(G_m(x_i) ≠ y_i))
After training, the weights of the samples misclassified by the basic classifier G_m(x) keep growing, while the weights of the correctly classified samples decrease; therefore, the misclassified samples play a greater role in the next iteration;
Step 3-2-5: construct a linear combination of the basic classifiers to obtain the final classifier:
f(x) = Σ_{m=1}^{M} a_m · G_m(x)
G(x) = arg max_{k ∈ {1, 2, 3}} Σ_{m=1}^{M} a_m · I(G_m(x) = k)
The linear combination f(x) implements the weighted voting of the M basic classifiers; the value of f(x) determines the category of the instance x and indicates the confidence of the classification; the trained weak classifiers are combined into a strong classifier to obtain the vehicle accident risk prediction model.
Compared with the prior art, the beneficial effects of the present invention are as follows: a system model with a maximum iteration value of 100 guarantees the maximum accuracy of accident prediction under ordinary road conditions, while in special circumstances a system model with a smaller maximum iteration value can improve timeliness; in prediction, the maximum performance of the system can thus be exploited.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the method of the present invention.
FIG. 2 shows the architecture of the trichotomy Adaboost-SO model.
Detailed Description
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings.
A VANETs vehicle accident risk prediction model based on AdaBoost-SO, wherein the steps of establishing the model include:
Step 1: fill in (impute) the research data set.
Specifically, before reconstructing the data, uncertain or incomplete road safety data is found and modified to improve the data set; common implementations include filling missing entries with the mean of the available feature values, with special values, or with the average of similar samples, or simply ignoring samples with missing values. A minimal illustration is sketched below.
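The following is a minimal sketch of the imputation options just described (feature mean, a special value, the average of similar samples, or simply discarding incomplete rows). The DataFrame name road_df and the columns "speed", "weather" and "accident_class" are hypothetical placeholders rather than fields defined by the patent.

```python
import pandas as pd

# Hypothetical road-safety records with missing entries.
road_df = pd.DataFrame({
    "speed":          [45.0, None, 60.0, 52.0],
    "weather":        ["rain", "clear", None, "clear"],
    "accident_class": [0, 1, 2, 0],
})

# Option 1: fill a numeric feature with the mean of its available values.
opt1 = road_df.assign(speed=road_df["speed"].fillna(road_df["speed"].mean()))

# Option 2: fill a categorical feature with a special value.
opt2 = road_df.assign(weather=road_df["weather"].fillna("unknown"))

# Option 3: fill with the average of similar samples, here the mean speed
# of the rows that share the same accident class.
class_mean = road_df.groupby("accident_class")["speed"].transform("mean")
opt3 = road_df.assign(speed=road_df["speed"].fillna(class_mean))

# Option 4: simply ignore (drop) the samples that contain missing values.
opt4 = road_df.dropna()
```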
Step 2: balance the samples in the data set with the SMOTE algorithm, and encode the discrete features of each sample with One-Hot encoding.
Specifically, the Synthetic Minority Oversampling Technique (SMOTE) algorithm is used to solve the problem that the number of samples in each category of the research data set is imbalanced. The SMOTE algorithm is implemented as follows:
Step 2-1: for each sample x in the minority class, the Euclidean distance is used as the criterion to compute the distance to all other samples in the minority class, obtaining its k nearest neighbors.
Step 2-2: set the sampling rate N according to the sample imbalance ratio. For each minority-class sample x, with its k nearest neighbors selected, randomly choose several samples from these k neighbors.
Step 2-3: for each selected neighbor, construct a new sample from the original sample according to the following formula.
x_new = x + rand(0, 1) × (x̂ − x)
where x̂ is the selected neighbor and rand(0, 1) is a random number uniformly drawn from (0, 1).
After the initial research data set is preprocessed with the SMOTE algorithm, an experimental data set with a relatively balanced number of samples per category is obtained. Next, the discrete features of each sample are One-Hot encoded.
The One-Hot encoding method uses an N-bit status register to encode N states; each state has its own register bit, and only one bit is valid at any time. A sketch of this preprocessing step follows.
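A minimal sketch of the Step 2 preprocessing is given below. It follows Steps 2-1 to 2-3 (k nearest neighbors by Euclidean distance, random neighbor selection, interpolation x_new = x + rand(0,1) × (x̂ − x)) and adds a simple One-Hot encoder; the array names, the choice k = 5 and the amount of oversampling are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(minority, n_new, k=5):
    """Generate n_new synthetic minority-class samples (Steps 2-1 to 2-3)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Step 2-1: Euclidean distance to every other minority sample.
        dist = np.linalg.norm(minority - x, axis=1)
        dist[i] = np.inf                      # exclude the sample itself
        neighbors = np.argsort(dist)[:k]      # its k nearest samples
        # Step 2-2: randomly pick one of the k neighbors.
        x_hat = minority[rng.choice(neighbors)]
        # Step 2-3: interpolate  x_new = x + rand(0,1) * (x_hat - x).
        synthetic.append(x + rng.random() * (x_hat - x))
    return np.array(synthetic)

def one_hot(values):
    """Encode a discrete feature with N states as N one-hot register bits."""
    states = np.unique(values)
    index = {s: j for j, s in enumerate(states)}
    out = np.zeros((len(values), len(states)), dtype=int)
    for row, v in enumerate(values):
        out[row, index[v]] = 1                # exactly one bit valid per sample
    return out

# Toy usage: oversample a small minority class and encode a discrete feature.
minority_X = rng.random((20, 4))              # 20 minority samples, 4 features
balanced_minority = np.vstack([minority_X, smote(minority_X, n_new=40)])
weather_bits = one_hot(np.array(["rain", "clear", "fog", "clear"]))
```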
Step 3: train the research data set with the trichotomy Adaboost-SO algorithm to obtain the system model.
Specifically, when constructing the experimental data set, the road safety data is first randomly divided into training data and test data and 6-fold cross-validation is performed; this method makes full use of all samples and requires 6 rounds of training and 6 rounds of testing. The specific steps of the 6-fold cross-validation are as follows (a sketch is given after the list):
Step 3-1-1: divide the entire research data set S into 6 mutually disjoint subsets of equal size; assuming the number of training samples is m, each subset has m/6 training samples, and the corresponding subsets are {S1, S2, S3, S4, S5, S6}.
Step 3-1-2: take one subset as the test set and the other five subsets as the training set.
Step 3-1-3: train the model on the training data, verify the accuracy of the model on the test data, and repeat six times.
Step 3-1-4: compute the average of the 6 evaluation errors as the true classification accuracy of the model.
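The 6-fold cross-validation of Steps 3-1-1 to 3-1-4 can be sketched as follows; X and y are assumed to be the preprocessed research data set, and the decision-tree weak learner is only a placeholder estimator.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def six_fold_accuracy(X, y, make_model, n_folds=6, seed=0):
    """Steps 3-1-1..3-1-4: split S into 6 disjoint subsets, train on five of
    them, test on the remaining one, repeat six times and average the scores."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # random division of the data
    folds = np.array_split(order, n_folds)       # roughly m/6 samples per subset
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return float(np.mean(scores))                # true classification accuracy

# Example call (X, y assumed to be numpy arrays from the preprocessing step):
# acc = six_fold_accuracy(X, y, lambda: DecisionTreeClassifier(max_depth=3))
```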
The research data set is then processed with the trichotomy AdaBoost algorithm; the specific steps are as follows:
Step 3-2-1: input the training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where xi is the feature vector of a sample and y ∈ {1, 2, 3}; the weak classifier used in the present invention is a decision tree.
Step 3-2-2: the weights of the training data are initialized as:
D_1 = (w_11, ..., w_1i, ..., w_1N), w_1i = 1/N, i = 1, 2, ..., N
Step 3-2-3: for the m-th iteration, m = 1, 2, ..., M, train on the training data set with weight distribution D_m to obtain a basic classifier:
G_m(x): χ → {1, 2, 3}
where χ is the data to be trained. The error rate of G_m(x) is computed from the classification results on the training data, with w_mi denoting the weight of the i-th sample in the m-th iteration:
e_m = P(G_m(x_i) ≠ y_i) = Σ_{i=1}^{N} w_mi · I(G_m(x_i) ≠ y_i)
that is, e_m is the sum of the weights of the samples misclassified by G_m(x):
e_m = Σ_{G_m(x_i) ≠ y_i} w_mi
Since the weights are normalized at each step, the denominator does not need to be divided by the sum of the sample weights.
Step 3-2-4: an error-rate threshold is set for e_m in trichotomy AdaBoost, and a positive term x is added so that a_m ≥ 0 is guaranteed whenever e_m stays below the threshold; the coefficient of the classifier G_m(x) is computed from the error rate e_m as:
a_m = (1/2) · ln((1 − e_m)/e_m) + x
The weight distribution of the training data set is updated according to the coefficient a_m:
D_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N})
w_{m+1,i} = (w_mi / Z_m) · exp(a_m · I(G_m(x_i) ≠ y_i)), i = 1, 2, ..., N
which can be simplified to:
w_{m+1,i} = w_mi / Z_m, if G_m(x_i) = y_i
w_{m+1,i} = (w_mi · e^{a_m}) / Z_m, if G_m(x_i) ≠ y_i
where Z_m is the normalization factor that makes D_{m+1} a probability distribution:
Z_m = Σ_{i=1}^{N} w_mi · exp(a_m · I(G_m(x_i) ≠ y_i))
After training, the weights of the samples misclassified by the basic classifier G_m(x) keep growing, while the weights of the correctly classified samples decrease; therefore, the misclassified samples play a greater role in the next iteration.
Step 3-2-5: construct a linear combination of the basic classifiers to obtain the final classifier:
f(x) = Σ_{m=1}^{M} a_m · G_m(x)
G(x) = arg max_{k ∈ {1, 2, 3}} Σ_{m=1}^{M} a_m · I(G_m(x) = k)
The linear combination f(x) implements the weighted voting of the M basic classifiers; the value of f(x) determines the category of the instance x and indicates the confidence of the classification; the trained weak classifiers are combined into a strong classifier to obtain the vehicle accident risk prediction model.
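The boosting loop of Steps 3-2-1 to 3-2-5 can be sketched as follows. Because the coefficient and weight-update expressions of the patent are given as images, this sketch substitutes the common SAMME-style multi-class weighting (a_m = ln((1 − e_m)/e_m) + ln(K − 1) with K = 3 classes, and misclassified-sample weights multiplied by e^{a_m}); it is an assumed approximation of the trichotomy AdaBoost step, not the exact formulas of the patent. Decision trees are used as the weak classifiers, as stated in Step 3-2-1, and the maximum iteration value defaults to 100 as mentioned in the beneficial effects.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_trichotomy_adaboost(X, y, M=100, classes=(1, 2, 3)):
    """Assumed SAMME-style sketch of Steps 3-2-1..3-2-5 (labels y in {1, 2, 3})."""
    N, K = len(X), len(classes)
    w = np.full(N, 1.0 / N)                       # Step 3-2-2: uniform weights
    learners, alphas = [], []
    for _ in range(M):                            # Step 3-2-3: M iterations
        g = DecisionTreeClassifier(max_depth=3)   # weak classifier G_m(x)
        g.fit(X, y, sample_weight=w)
        miss = g.predict(X) != y
        e_m = float(np.dot(w, miss))              # weighted error rate
        if e_m >= (K - 1) / K:                    # no better than random: stop
            break
        e_m = max(e_m, 1e-10)
        a_m = np.log((1.0 - e_m) / e_m) + np.log(K - 1)   # assumed coefficient
        w = w * np.exp(a_m * miss)                # enlarge misclassified weights
        w = w / w.sum()                           # Z_m normalization
        learners.append(g)
        alphas.append(a_m)
    return learners, alphas

def predict_trichotomy_adaboost(learners, alphas, X, classes=(1, 2, 3)):
    """Final strong classifier: weighted vote of the M basic classifiers."""
    votes = np.zeros((len(X), len(classes)))
    for g, a in zip(learners, alphas):
        pred = g.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += a * (pred == c)
    return np.array(classes)[np.argmax(votes, axis=1)]
```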
Step 4: import real-time traffic data sets through VANETs to obtain the output of the prediction model.
Specifically, the output value is C = {C0, C1, C2}, indicating whether the predicted object belongs to a high accident rate. C0 indicates that the probability of a car accident is low or that only a minor collision may occur, C1 means that a more serious accidental injury may occur, and C2 indicates that the probability of a car accident is high or that an accident may occur.
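For Step 4, the prediction of the trained model can be mapped to the three output levels C0, C1, C2 and turned into an in-vehicle alert. The sketch below reuses predict_trichotomy_adaboost from the previous example; the alert wording and the real-time feature array are illustrative assumptions only.

```python
# Map the predicted class label (1, 2, 3) to the output levels C0, C1, C2 of Step 4.
RISK_LEVELS = {
    1: ("C0", "low accident probability, at most a minor collision"),
    2: ("C1", "a more serious accidental injury may occur"),
    3: ("C2", "high accident probability, an accident may occur"),
}

def alert_from_prediction(learners, alphas, realtime_features):
    """Feed one real-time VANETs sample to the trained model and build an alert."""
    label = predict_trichotomy_adaboost(learners, alphas, realtime_features)[0]
    level, meaning = RISK_LEVELS[int(label)]
    return f"{level}: {meaning}"

# Example: realtime_features would be a (1, n_features) array built from
# real-time traffic data imported through VANETs.
```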
The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; any equivalent modification or change made by a person of ordinary skill in the art based on the disclosure of the present invention shall fall within the scope of protection recited in the claims.

Claims (5)

  1. A VANETs vehicle accident risk prediction model based on AdaBoost-SO, characterized in that the steps of establishing the model include:
    Step 1: fill in (impute) the research data set;
    Step 2: balance the samples in the data set with the SMOTE algorithm, and encode the discrete features of each sample with One-Hot encoding;
    Specifically, the Synthetic Minority Oversampling Technique (SMOTE) algorithm is used to solve the problem that the number of samples in each category of the research data set is imbalanced;
    After the initial research data set is preprocessed with the SMOTE algorithm, an experimental data set with a relatively balanced number of samples per category is obtained; next, the discrete features of each sample are One-Hot encoded; the One-Hot encoding method uses an N-bit status register to encode N states, each state has its own register bit, and only one bit is valid at any time;
    Step 3: train the research data set with the trichotomy Adaboost-SO algorithm to obtain the system model;
    Specifically, when constructing the experimental data set, the road safety data is first randomly divided into training data and test data and 6-fold cross-validation is performed; this method makes full use of all samples and requires 6 rounds of training and 6 rounds of testing; the research data set is then processed with the trichotomy AdaBoost algorithm;
    Step 4: import real-time traffic data sets through VANETs to obtain the output of the prediction model;
    Specifically, the output value is C = {C0, C1, C2}, which indicates whether the predicted object belongs to a high accident rate; C0 indicates that the probability of a car accident is low or that only a minor collision may occur, C1 means that a more serious accidental injury may occur, and C2 indicates that the probability of a car accident is high or that an accident may occur.
  2. The AdaBoost-SO-based VANETs vehicle accident risk prediction model according to claim 1, characterized in that in Step 1, specifically, before reconstructing the data, uncertain or incomplete road safety data is found and modified to improve the data set; common implementations include filling missing entries with the mean of the available feature values, with special values, or with the average of similar samples, or simply ignoring samples with missing values.
  3. The AdaBoost-SO-based VANETs vehicle accident risk prediction model according to claim 1, characterized in that in Step 2 the SMOTE algorithm is implemented as follows:
    Step 2-1: for each sample x in the minority class, the Euclidean distance is used as the criterion to compute the distance to all other samples in the minority class, obtaining its k nearest neighbors;
    Step 2-2: set the sampling rate N according to the sample imbalance ratio; for each minority-class sample x, with its k nearest neighbors selected, randomly choose several samples from these k neighbors;
    Step 2-3: for each selected neighbor, construct a new sample from the original sample according to the following formula;
    x_new = x + rand(0, 1) × (x̂ − x)
    where x̂ is the selected neighbor and rand(0, 1) is a random number uniformly drawn from (0, 1).
  4. The AdaBoost-SO-based VANETs vehicle accident risk prediction model according to claim 1, characterized in that in Step 3 the specific steps of the 6-fold cross-validation are as follows:
    Step 3-1-1: divide the entire research data set S into 6 mutually disjoint subsets of equal size; assuming the number of training samples is m, each subset has m/6 training samples, and the corresponding subsets are {S1, S2, S3, S4, S5, S6};
    Step 3-1-2: take one subset as the test set and the other five subsets as the training set;
    Step 3-1-3: train the model on the training data, verify the accuracy of the model on the test data, and repeat six times;
    Step 3-1-4: compute the average of the 6 evaluation errors as the true classification accuracy of the model.
  5. The AdaBoost-SO-based VANETs vehicle accident risk prediction model according to claim 1, characterized in that in Step 3 the trichotomy AdaBoost algorithm is used to process the research data set, with the following specific steps:
    Step 3-2-1: input the training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)}, where xi is the feature vector of a sample and y ∈ {1, 2, 3}; the weak classifier used in the present invention is a decision tree;
    Step 3-2-2: the weights of the training data are initialized as:
    D_1 = (w_11, ..., w_1i, ..., w_1N), w_1i = 1/N, i = 1, 2, ..., N
    Step 3-2-3: for the m-th iteration, m = 1, 2, ..., M, train on the training data set with weight distribution D_m to obtain a basic classifier:
    G_m(x): χ → {1, 2, 3}
    where χ is the data to be trained; the error rate of G_m(x) is computed from the classification results on the training data, with w_mi denoting the weight of the i-th sample in the m-th iteration:
    e_m = P(G_m(x_i) ≠ y_i) = Σ_{i=1}^{N} w_mi · I(G_m(x_i) ≠ y_i)
    that is, e_m = Σ_{G_m(x_i) ≠ y_i} w_mi;
    since the weights are normalized at each step, the denominator does not need to be divided by the sum of the sample weights;
    Step 3-2-4: an error-rate threshold is set for e_m in trichotomy AdaBoost, and a positive term x is added so that a_m ≥ 0 is guaranteed whenever e_m stays below the threshold; the coefficient of the classifier G_m(x) is computed from the error rate e_m as:
    a_m = (1/2) · ln((1 − e_m)/e_m) + x
    The weight distribution of the training data set is updated according to the coefficient a_m:
    D_{m+1} = (w_{m+1,1}, ..., w_{m+1,i}, ..., w_{m+1,N})
    w_{m+1,i} = (w_mi / Z_m) · exp(a_m · I(G_m(x_i) ≠ y_i)), i = 1, 2, ..., N
    which can be simplified to:
    w_{m+1,i} = w_mi / Z_m, if G_m(x_i) = y_i
    w_{m+1,i} = (w_mi · e^{a_m}) / Z_m, if G_m(x_i) ≠ y_i
    where Z_m is the normalization factor that makes D_{m+1} a probability distribution:
    Z_m = Σ_{i=1}^{N} w_mi · exp(a_m · I(G_m(x_i) ≠ y_i));
    after training, the weights of the samples misclassified by the basic classifier G_m(x) keep growing, while the weights of the correctly classified samples decrease; therefore, the misclassified samples play a greater role in the next iteration;
    Step 3-2-5: construct a linear combination of the basic classifiers to obtain the final classifier:
    f(x) = Σ_{m=1}^{M} a_m · G_m(x)
    G(x) = arg max_{k ∈ {1, 2, 3}} Σ_{m=1}^{M} a_m · I(G_m(x) = k)
    The linear combination f(x) implements the weighted voting of the M basic classifiers; the value of f(x) determines the category of the instance x and indicates the confidence of the classification; the trained weak classifiers are combined into a strong classifier to obtain the vehicle accident risk prediction model.
PCT/CN2019/092462 2018-11-07 2019-06-24 Vehicle accident risk prediction model based on adaboost-so in vanets WO2020093701A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811319617.4 2018-11-07
CN201811319617.4A CN109558969A (en) 2018-11-07 2018-11-07 A kind of VANETs car accident risk forecast model based on AdaBoost-SO

Publications (1)

Publication Number Publication Date
WO2020093701A1 true WO2020093701A1 (en) 2020-05-14

Family

ID=65865977

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2019/092463 WO2020093702A1 (en) 2018-11-07 2019-06-24 Deep q-network learning-based traffic light dynamic timing algorithm
PCT/CN2019/092462 WO2020093701A1 (en) 2018-11-07 2019-06-24 Vehicle accident risk prediction model based on adaboost-so in vanets

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/092463 WO2020093702A1 (en) 2018-11-07 2019-06-24 Deep q-network learning-based traffic light dynamic timing algorithm

Country Status (2)

Country Link
CN (1) CN109558969A (en)
WO (2) WO2020093702A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220276063A1 (en) * 2021-03-01 2022-09-01 Mitre Corporation Method and System for Dynamically Navigating Routes According to Safety-Related Risk Profiles

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558969A (en) * 2018-11-07 2019-04-02 南京邮电大学 A kind of VANETs car accident risk forecast model based on AdaBoost-SO
CN111126868B (en) * 2019-12-30 2023-07-04 中南大学 Road traffic accident occurrence risk determination method and system
CN111507504A (en) * 2020-03-18 2020-08-07 中国南方电网有限责任公司 Adaboost integrated learning power grid fault diagnosis system and method based on data resampling
CN111814836B (en) * 2020-06-12 2022-07-19 武汉理工大学 Vehicle driving behavior detection method and device based on class imbalance algorithm
CN111859291B (en) 2020-06-23 2022-02-25 北京百度网讯科技有限公司 Traffic accident recognition method, device, equipment and computer storage medium
CN111768041A (en) * 2020-07-02 2020-10-13 上海积成能源科技有限公司 System model for predicting short-term power load based on adaptive lifting algorithm
CN113326971A (en) * 2021-04-30 2021-08-31 东南大学 PCA (principal component analysis) and Adaboost-based tunnel traffic accident duration prediction method
CN113780641A (en) * 2021-08-31 2021-12-10 同济大学 Accident prediction method and device based on transfer learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984200A (en) * 2012-09-13 2013-03-20 大连理工大学 Method applicable for scene with multiple sparse and dense vehicular ad hoc networks (VANETs)
CN104064029A (en) * 2014-07-07 2014-09-24 哈尔滨工业大学 Dynamic V2V link delay predicting method in VANETs
US20170270413A1 (en) * 2016-03-15 2017-09-21 Nec Europe Ltd. Real-time filtering of digital data sources for traffic control centers
CN108596409A (en) * 2018-07-16 2018-09-28 江苏智通交通科技有限公司 The method for promoting traffic hazard personnel's accident risk prediction precision
CN108763865A (en) * 2018-05-21 2018-11-06 成都信息工程大学 A kind of integrated learning approach of prediction DNA protein binding sites
CN109558969A (en) * 2018-11-07 2019-04-02 南京邮电大学 A kind of VANETs car accident risk forecast model based on AdaBoost-SO

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103208195A (en) * 2013-04-08 2013-07-17 沈阳广信先锋交通高技术有限公司 Multi-agent traffic signal control system
CN104112366B (en) * 2014-07-25 2017-02-22 中国科学院自动化研究所 Method for traffic signal optimization based on latent semantic model
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement
CN108154681B (en) * 2016-12-06 2020-11-20 杭州海康威视数字技术股份有限公司 Method, device and system for predicting risk of traffic accident
CN109544913A (en) * 2018-11-07 2019-03-29 南京邮电大学 A kind of traffic lights dynamic timing algorithm based on depth Q e-learning
CN109697867B (en) * 2019-01-28 2020-10-20 深圳市欧德克科技有限公司 Deep learning-based traffic control method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984200A (en) * 2012-09-13 2013-03-20 大连理工大学 Method applicable for scene with multiple sparse and dense vehicular ad hoc networks (VANETs)
CN104064029A (en) * 2014-07-07 2014-09-24 哈尔滨工业大学 Dynamic V2V link delay predicting method in VANETs
US20170270413A1 (en) * 2016-03-15 2017-09-21 Nec Europe Ltd. Real-time filtering of digital data sources for traffic control centers
CN108763865A (en) * 2018-05-21 2018-11-06 成都信息工程大学 A kind of integrated learning approach of prediction DNA protein binding sites
CN108596409A (en) * 2018-07-16 2018-09-28 江苏智通交通科技有限公司 The method for promoting traffic hazard personnel's accident risk prediction precision
CN109558969A (en) * 2018-11-07 2019-04-02 南京邮电大学 A kind of VANETs car accident risk forecast model based on AdaBoost-SO

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220276063A1 (en) * 2021-03-01 2022-09-01 Mitre Corporation Method and System for Dynamically Navigating Routes According to Safety-Related Risk Profiles
US11725955B2 (en) * 2021-03-01 2023-08-15 Mitre Corporation Method and system for dynamically navigating routes according to safety-related risk profiles
US20240003697A1 (en) * 2021-03-01 2024-01-04 Mitre Corporation Method and System for Dynamically Navigating Routes According to Safety-Related Risk Profiles

Also Published As

Publication number Publication date
WO2020093702A1 (en) 2020-05-14
CN109558969A (en) 2019-04-02

Similar Documents

Publication Publication Date Title
WO2020093701A1 (en) Vehicle accident risk prediction model based on adaboost-so in vanets
CN109840660B (en) Vehicle characteristic data processing method and vehicle risk prediction model training method
WO2022083784A1 (en) Road detection method based on internet of vehicles
CN108364467B (en) Road condition information prediction method based on improved decision tree algorithm
Zhu et al. Design and experiment verification of a novel analysis framework for recognition of driver injury patterns: From a multi-class classification perspective
CN114202120A (en) Urban traffic travel time prediction method aiming at multi-source heterogeneous data
CN109887279B (en) Traffic jam prediction method and system
CN107180274A (en) A kind of charging electric vehicle facilities planning typical scene is chosen and optimization method
Guo et al. A novel energy consumption prediction model with combination of road information and driving style of BEVs
CN110304068A (en) Acquisition method, device, equipment and the storage medium of running car environmental information
CN101964061B (en) Binary kernel function support vector machine-based vehicle type recognition method
Zhang et al. MaaS in bike-sharing: smart phone GPS data based layout optimization and emission reduction potential analysis
Madushani et al. Evaluating expressway traffic crash severity by using logistic regression and explainable & supervised machine learning classifiers
CN114120280A (en) Traffic sign detection method based on small target feature enhancement
Liao et al. Taxi demand forecasting based on the temporal multimodal information fusion graph neural network
Chuanxia et al. Machine learning and IoTs for forecasting prediction of smart road traffic flow
CN107194505B (en) Method and system for predicting bus traffic based on urban big data
CN116663742A (en) Regional capacity prediction method based on multi-factor and model fusion
CN102880881A (en) Method for identifying car type on basis of binary support vector machines and genetic algorithm
CN110555425A (en) Video stream real-time pedestrian detection method
CN116110219A (en) Traffic accident prediction method
Liu et al. Learning to route via theory-guided residual network
Pan et al. Evaluating operational features of multilane turbo roundabouts with an entropy method
Kandacharam et al. Prediction of Accident and Accident Severity Based on Heterogeneous Data
Li et al. Early Intention Prediction of Lane-Changing Based on Dual Gaussian-Mixed Hidden Markov Models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19882910

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19882910

Country of ref document: EP

Kind code of ref document: A1