CN109145953B

CN109145953B - Adaboost algorithm-based traffic high-risk personnel identification method

Info

Publication number: CN109145953B
Application number: CN201810815618.1A
Authority: CN
Inventors: 吕伟韬; 刘林; 陈凝; 饶欢
Original assignee: Jiangsu Zhitong Traffic Technology Co ltd
Current assignee: Jiangsu Zhitong Traffic Technology Co ltd
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2021-09-07
Anticipated expiration: 2038-07-16
Also published as: CN109145953A

Abstract

The invention provides a traffic high-risk personnel identification method based on the Adaboost algorithm. Based on original traffic violation data and accident data, the Adaboost algorithm is used to train and correct a high-risk personnel identification model; personnel violation attribute information is then input into the model to identify and predict high-risk personnel. The method is of practical significance for improving the efficiency of traffic safety control work and for assisting the daily safety management work of traffic police.

Description

Adaboost algorithm-based traffic high-risk personnel identification method

Technical Field

The invention relates to a method for identifying high-risk traffic personnel based on Adaboost algorithm.

Background

In the field of road traffic safety, research has mostly focused on analyzing the association between traffic accidents and external factors such as the environment, road infrastructure and the running state of traffic flow (for example, Chinese patents CN201710400521.X, CN201580075213.3 and CN201611051192.4), or on analyzing accident patterns and characteristics from the perspective of the environment and traffic control measures. Internal factors such as the behavior habits of traffic participants (drivers of motor vehicles and non-motor vehicles, and pedestrians) have not yet been studied in depth, owing to the wide range of information dimensions and the limited means of perceiving such information. Nevertheless, the influence of human factors on traffic accidents is an essential part of traffic safety research and is of great practical significance for traffic safety management.

Research shows that traffic violations and traffic accidents are correlated. Because the traffic management industry has accumulated a large amount of violation data, this data can provide reliable support for mining the characteristics of traffic accidents.

The AdaBoost algorithm repeatedly trains the same type of weak classifier, assigns each classifier a weight based on its error rate, and takes the weighted sum of the classifiers' predictions as the output. AdaBoost provides a framework within which the sub-classifiers can be constructed by various methods, requires no feature screening, and is not prone to over-fitting. It performs well in data classification and could be used to mine valuable traffic safety information from traffic violation data, but such applications are currently lacking.
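The weighting scheme described above can be illustrated with a minimal sketch of discrete AdaBoost over one-feature threshold rules (decision stumps). It is an illustration only and not the implementation of the embodiment; the function names and the toy data are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) of the best threshold rule."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign

def adaboost_fit(X, y, n_rounds):
    """Discrete AdaBoost for labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # classifier weight from its error rate
        ensemble.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Sign of the weighted sum of the weak classifiers' votes."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): a single feature with the positive class in the middle,
# which no single threshold rule can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [-1, -1, 1, 1, -1, -1]
model = adaboost_fit(X, y, n_rounds=3)
preds = [adaboost_predict(model, row) for row in X]
```

On this toy set no single stump is correct everywhere, but the weighted vote of three stumps reproduces the labels, which is the effect the paragraph above attributes to accumulating weighted weak classifiers.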

The invention takes the behavior characteristic mining of the traffic participants as a core, extracts the illegal driving behavior attribute of accident-related personnel, identifies high-risk personnel and realizes data-driven active traffic safety prevention.

Disclosure of Invention

The invention aims to realize Adaboost-based data mining so that, among traffic participants with traffic violation records, persons at risk of being involved in traffic accidents are identified. This achieves traffic safety risk prediction and evaluation for individuals and provides a scientific index basis for decision support in source management, field inspection and other traffic safety management work.

Based on original traffic violation data and accident data, the Adaboost algorithm is used to train and correct the high-risk personnel classification model, and violation attribute information is input into the model to identify and predict high-risk personnel. The method is of practical significance for improving the efficiency of traffic safety control work and for assisting the daily safety management work of traffic police.

The technical solution of the invention is as follows:

A method for identifying high-risk traffic personnel based on the Adaboost algorithm comprises the following steps:

S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data;

S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified;

S3, sampling the general personnel data subset N in the illegal data set, combining it with the high-risk personnel data subset D, and splitting to obtain a training set and a test set;

S4, training a high-risk personnel identification model on the training set data based on the Adaboost algorithm, and determining model parameters; the model parameters comprise a learning rate, the number of weak classifiers, a maximum tree depth, a node minimum splitting value, a leaf node minimum sample number and a maximum feature number;

S5, evaluating the high-risk personnel identification model on the test set data, determining a classification probability critical threshold value, and correcting the model to obtain the final traffic high-risk personnel identification model;

S6, inputting the data of the subset to be identified from step S2 into the traffic high-risk personnel identification model obtained in step S5 to obtain the identification result of the high-risk personnel.

Further, the sampling method in step S3 is specifically as follows:

S31, randomly sampling the general personnel data subset N to obtain a compressed general personnel sample N';

S32, performing variable processing and screening on the sample data of the compressed general personnel sample N';

S33, splitting the collection G of the high-risk personnel data subset D and N' into a training set and a test set;

S34, performing SMOTE sampling on the training set: the over-sampling and under-sampling proportions of the high-risk personnel data subset and the general personnel data subset are determined, the final sample numbers are obtained, and the training set samples are obtained after processing.
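As an illustration of steps S31 and S33, the following sketch randomly compresses the general subset, pools it with the high-risk subset and splits the pool. The function name, the record identifiers and the sample sizes are hypothetical, and the split is a plain shuffle (not stratified), which is an assumption rather than something the text specifies.

```python
import random

def compress_and_split(high_risk, general, n_keep, test_fraction=0.1, seed=0):
    """Randomly compress N to N', pool it with D, and split into train/test."""
    rng = random.Random(seed)
    compressed = rng.sample(general, n_keep)              # S31: N -> N'
    # G = D plus N', each record paired with its label (1 = high-risk, 0 = general).
    pooled = [(row, 1) for row in high_risk] + [(row, 0) for row in compressed]
    rng.shuffle(pooled)
    n_test = round(len(pooled) * test_fraction)           # S33: e.g. a 9:1 split
    return pooled[n_test:], pooled[:n_test]

# Hypothetical record identifiers standing in for rows of the illegal data set.
D = [f"d{i}" for i in range(100)]
N = [f"n{i}" for i in range(2000)]
train, test = compress_and_split(D, N, n_keep=400)
```

Splitting before SMOTE, as the step order S33 then S34 implies, keeps synthetic samples out of the test set.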

Further, the sample data variable processing and screening method in step S32 specifically includes:

S321, setting a dependent variable target, whose value is determined by the sample data label and is either high-risk or general; taking the data fields of the illegal data set as independent variables;

S322, deleting constant independent variables and independent variables with extremely small variance; the judgment condition that the variance is extremely small is as follows:

wherein $\mathrm{freqcut}_X = x_f / x_l$, $x_f$ is the frequency of the most frequent sample value of variable X, $x_l$ is the frequency of the second most frequent sample value of variable X, and $T_f$ is the corresponding threshold, which is usually 19; $\mathrm{unisequential}_X = m_X / n_X$, where $m_X$ is the number of distinct sample values after deduplication, $n_X$ is the total number of samples, and $T_u$ is the test threshold of unisequential, which is usually 0.1;

S323, deleting independent variables whose collinearity with other independent variables exceeds a threshold, which is typically 0.75;

S324, checking the multicollinearity of the independent variables, and determining the data independent variables.
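The screening in S322 can be sketched as follows. The judgment formula itself is not reproduced in the text above, so the inequality directions used here (a variable is flagged when its frequency ratio exceeds T_f and its unique-value ratio falls below T_u) are an assumption; they follow the usual near-zero-variance convention. The function and column names are hypothetical.

```python
from collections import Counter

def freq_ratio(values):
    """x_f / x_l: count of the most frequent value over the second most frequent."""
    counts = [c for _, c in Counter(values).most_common(2)]
    if len(counts) < 2:
        return float("inf")        # a constant variable has no second value
    return counts[0] / counts[1]

def unique_ratio(values):
    """m_X / n_X: number of distinct values over the number of samples."""
    return len(set(values)) / len(values)

def is_near_zero_variance(values, t_f=19, t_u=0.1):
    # Assumed condition: dominated by one value AND few distinct values.
    return freq_ratio(values) > t_f and unique_ratio(values) < t_u

# Hypothetical columns of 100 samples each.
columns = {
    "mostly_zero": [0] * 97 + [1] * 3,
    "constant": [5] * 100,
    "balanced": [0, 1] * 50,
}
dropped = [name for name, vals in columns.items() if is_near_zero_variance(vals)]
```

Under this assumed rule, constant variables are caught by the same test because their frequency ratio is treated as infinite.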

Further, the method for assigning the corresponding data label value label based on the classification rule in step S2 specifically includes:

High-risk personnel: one category is traffic participants who have violation records and a serious traffic accident record in which they bear major or full responsibility; the other category is traffic participants who have violation records, only slight accident records, and no fewer than 2 accident records;

General personnel: traffic participants who have violation records but no accident records;

The data which satisfy none of the above conditions constitute the subset to be identified.
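The labeling rule above can be written as a small decision function. This is a sketch under stated assumptions: every person passed in is taken to already have at least one violation record (membership in the illegal data set), and the three count arguments are hypothetical field names.

```python
def assign_label(n_serious_liable, n_serious_other, n_minor):
    """Label one person from the illegal data set.

    n_serious_liable: serious accidents with major or full responsibility
    n_serious_other:  serious accidents with lesser or no responsibility
    n_minor:          slight accidents
    """
    if n_serious_liable >= 1:
        return "high-risk"                     # first high-risk category
    if n_serious_other == 0 and n_minor >= 2:
        return "high-risk"                     # only slight accidents, at least 2
    if n_serious_other == 0 and n_minor == 0:
        return "general"                       # violations but no accidents
    return "to-identify"                       # everything else goes to subset U

labels = [
    assign_label(1, 0, 0),
    assign_label(0, 0, 2),
    assign_label(0, 0, 0),
    assign_label(0, 0, 1),
    assign_label(0, 1, 0),
]
```

A person with a single slight accident, or with a serious accident but only minor responsibility, matches neither category and therefore falls into the subset to be identified.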

Further, the original traffic violation data and accident data in step S1 include the certificate information of the persons involved; the violation records are collected and classified to obtain the illegal data set; the illegal data set is the full sample of violation records, and its information comprises personnel certificate numbers, number of violations, violation types, penalty conditions, occurrence of accident-related illegal behaviors and violation time intervals.

Further, in step S1, the occurrence of accident-related illegal behaviors is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.

Further, in step S1, the violation time interval is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.

The invention has the beneficial effects that:

First, based on the correlation between traffic violations and traffic accidents, the invention provides a high-risk personnel identification method based on the AdaBoost algorithm and predicts the traffic safety risk of individuals from their traffic violation records. The method uses a safety risk labeling rule that is easy to implement and that can be flexibly adjusted in practice according to regional traffic regulations, the local safety level and the sensitivity required of the model.

Second, the AdaBoost algorithm is used to fit the high-risk personnel identification model. It makes good use of cascaded weak classifiers and, compared with common ensemble algorithms, offers a low generalization error rate and high precision; it meets the requirement of classifying personnel from violation data and ensures the identification accuracy of high-risk personnel.

Third, the large sample is compressed before SMOTE sampling, which alleviates to some extent the effect of the imbalanced data set on model accuracy.

Fourth, the original data is preprocessed with feature engineering methods, which improves the accuracy of the model.

Drawings

Fig. 1 is a schematic flow chart of a traffic high-risk person identification method based on the Adaboost algorithm in the embodiment of the present invention.

Fig. 2 is a schematic flow chart of sampling a general person data subset in the embodiment.

Fig. 3 is a schematic flowchart of a sample data variable processing and screening method in the embodiment.

Fig. 4 is an explanatory diagram of the data sets in the embodiment.

Fig. 5 is a diagram of the top 20 attribute variables by importance in the embodiment.

Fig. 6 is a schematic diagram of the test set ROC curve plotted in the embodiment.

Fig. 7 is a schematic diagram of the test set PR curve plotted in the embodiment.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Examples

A traffic high-risk personnel identification method based on an Adaboost algorithm extracts the safety behavior characteristic attributes of traffic participants from traffic violation records and trains a model to realize high-risk personnel identification and safety risk prediction; as shown in fig. 1, the specific process flow is as follows:

s1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic illegal data and accident data.

In an embodiment, the original traffic violation data and accident data in step S1 include the certificate information of the persons involved; preprocessing operations such as collection and classification are carried out on the original violation records to obtain the illegal data set; the illegal data set is the full sample of personnel violation records, and its information comprises personnel certificate numbers, number of violations, violation types, penalty conditions, occurrence of accident-related illegal behaviors and violation time intervals.

The occurrence of accident-related illegal behaviors in step S1 is obtained through correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.

The violation time interval in step S1 is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.

S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified.

The method for assigning the corresponding data label value label based on the classification rule in S2 is as follows. High-risk personnel are: (1) persons who have violation records and a serious traffic accident record in which they bear major or full responsibility; or (2) persons who have violation records, only slight accident records, and no fewer than 2 accident records. General personnel are persons with violation records but no accident records. The data which satisfy none of the above conditions constitute the subset to be identified.

S3, sampling a general personnel data subset N in the illegal data set, combining with a high-risk personnel data subset D, and splitting to obtain a training set and a test set.

And S31, randomly sampling the general staff data subset to obtain a compressed general staff sample N'.

S32, performing variable processing and screening on the sample data of the compressed general personnel data subset N'; the processing steps are as shown in fig. 3, and specifically include:

S33, splitting a high-risk personnel data subset D and a collection G of N' into a training set and a testing set; in the examples, the split ratio was 9: 1.

S4, training a high-risk personnel identification model on the training set data based on the Adaboost algorithm, and determining model parameters; the model parameters comprise a learning rate, the number of weak classifiers, a maximum tree depth, a node minimum splitting value, a leaf node minimum sample number and a maximum feature number; in an embodiment, the Adaboost algorithm is implemented in Python by calling the AdaBoostClassifier function, with DecisionTreeClassifier as the base classifier, from the sklearn machine learning library.

And S5, carrying out high-risk personnel identification model evaluation by using the test set data, determining a classification probability critical threshold value, and correcting the model to obtain a final traffic high-risk personnel identification model.

Specific examples

Step 1, obtaining 2-year traffic violation records and accident records in an area through docking with a database.

The present embodiment takes motor vehicle drivers as the analysis target. Traffic accidents involving death or serious injury, and hit-and-run accidents, are treated as serious accidents; all other accidents are treated as slight accidents. The original accident records are classified accordingly, and the accident type and personnel certificate information are taken as the attribute features of the serious accident data set and the slight accident data set, giving the sample data of the two data sets.

Further, the original violation data are preprocessed, and the violation information of each person is collected and counted, including the cumulative number of violations, violation types, cumulative deduction score, average deduction score (points per violation), single maximum deduction score, cumulative fine amount and average fine amount (yuan per violation).
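The per-person statistics listed above amount to a group-and-aggregate over the raw violation records. A minimal sketch follows; the record layout (person identifier, points deducted, fine in yuan) and all names are hypothetical.

```python
from collections import defaultdict

def aggregate_violations(records):
    """records: iterable of (person_id, points_deducted, fine_yuan) tuples."""
    grouped = defaultdict(list)
    for pid, points, fine in records:
        grouped[pid].append((points, fine))

    summary = {}
    for pid, rows in grouped.items():
        n = len(rows)
        total_points = sum(p for p, _ in rows)
        total_fine = sum(f for _, f in rows)
        summary[pid] = {
            "violation_count": n,
            "total_points": total_points,
            "avg_points": total_points / n,      # points per violation
            "max_points": max(p for p, _ in rows),
            "total_fine": total_fine,
            "avg_fine": total_fine / n,          # yuan per violation
        }
    return summary

# Hypothetical raw records for two persons.
records = [("A", 3, 200), ("A", 6, 200), ("B", 0, 50)]
stats = aggregate_violations(records)
```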

Correspondence analysis is applied to the traffic accident data and the original violation data for dimensionality reduction; violation types are classified according to the association between violations and accident types, and the five types with the highest association are extracted as the data attributes of the accident-risk illegal behavior field, as shown in Table 1.

TABLE 1. Classification of accident-related violation types

According to the traffic flow operation of the road network of the area where the embodiment is located and the characteristics of the occurrence rule of the traffic violation event, aggregating the time, dividing the analysis time period, and converting the continuous variable into the nominal variable; in another embodiment, the time interval division is performed by other statistical means such as clustering.
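The conversion of a continuous time into a nominal period can be sketched as a lookup over fixed intervals. Apart from the 19:00-22:00 period, which is named later in this embodiment, the interval boundaries below are hypothetical placeholders and are not taken from the source.

```python
from datetime import datetime

# (start hour, end hour, label); boundaries other than 19-22 are assumed.
PERIODS = [
    (0, 7, "00:00-07:00"),
    (7, 9, "07:00-09:00"),
    (9, 17, "09:00-17:00"),
    (17, 19, "17:00-19:00"),
    (19, 22, "19:00-22:00"),
    (22, 24, "22:00-24:00"),
]

def to_period(timestamp):
    """Map a violation timestamp to the label of the period containing it."""
    hour = timestamp.hour + timestamp.minute / 60
    for start, end, label in PERIODS:
        if start <= hour < end:
            return label
    raise ValueError("time of day outside the defined periods")

evening = to_period(datetime(2018, 7, 16, 20, 15))
morning = to_period(datetime(2018, 7, 16, 8, 0))
```

Each label then serves as one category of the nominal variable; a clustering-based division, as mentioned for the other embodiment, would only change how the boundaries in the table are obtained.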

Personal characteristic data, namely age, gender, and province and city codes, are extracted from the certificate number; the illegal data set is then generated from the information extracted in each of the above steps, as shown in Table 2.
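A sketch of this extraction is given below under the assumption that the certificate number is an 18-digit resident identity number, in which digits 1-2 give the province code, digits 1-4 the city code, digits 7-14 the date of birth, and the 17th digit is odd for male and even for female. The source does not state the certificate format; the sample number is made up and its check digit is not validated.

```python
from datetime import date

def parse_id_number(id_number, on_date):
    """Extract province code, city code, age and gender from an 18-digit ID."""
    birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    # Subtract one year if the birthday has not yet occurred on the reference date.
    had_birthday = (on_date.month, on_date.day) >= (birth.month, birth.day)
    age = on_date.year - birth.year - (0 if had_birthday else 1)
    gender = "male" if int(id_number[16]) % 2 == 1 else "female"
    return {
        "province_code": id_number[:2],
        "city_code": id_number[:4],
        "age": age,
        "gender": gender,
    }

# Synthetic number for illustration only.
info = parse_id_number("320102199001011234", on_date=date(2018, 7, 16))
```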

TABLE 2. partial data of illegal data set

Step 2, classifying the full sample I of the illegal data set into two categories: high-risk drivers and general drivers. Referring to fig. 4, the first condition for a high-risk driver is a person who has violation records and a serious traffic accident record in which the person bears major or full responsibility; data meeting this condition form data set D1. The second condition is a person who has violation records, only slight accident records, and no fewer than 2 accident records; data meeting this condition form data set D2. The high-risk driver data set is D = D1 + D2. The data of persons with violation records but no accident records form the general driver data set N.

Accordingly, a high-risk or general data label value label is assigned to the data in the illegal data set that satisfy the rules. The data to which the classification rules cannot be applied form the subset to be identified, U = I - N - D = U1 + U2, where U1 and U2 are the two subsets of the subset to be identified.

Step 3, sampling a general driver data subset N in the illegal data set, combining with a high-risk driver data subset D, and splitting to obtain a training set and a test set; the specific sampling method comprises the following steps:

step 31, randomly sampling the general driver data subset to obtain a compressed general driver sample N', wherein the sampling rate is generally 2.5% -25%, and 4000 pieces of data are extracted from 84383 pieces of data in the embodiment.

Step 32, carrying out variable processing and screening on the sample data of the sampled data subset of the general drivers; the method comprises the following specific steps:

S321, setting a dependent variable target, whose value is determined by the sample data label and is either high-risk or general; taking the data fields of the illegal data set as independent variables; the province code field and the city code field are set as dummy variables, which increases the number of independent variables to 93;

S322, deleting constant independent variables and independent variables with extremely small variance, wherein $\mathrm{freqcut}_X = x_f / x_l$, $x_f$ is the frequency of the most frequent sample value of variable X, $x_l$ is the frequency of the second most frequent sample value of variable X, and $T_f$ is the corresponding threshold, with a value of 19; $\mathrm{unisequential}_X = m_X / n_X$, where $m_X$ is the number of distinct sample values after deduplication, $n_X$ is the total number of samples, and $T_u$ is the test threshold of unisequential, with a value of 0.1; in this embodiment, this step deletes several independent variables, including the cumulative number of violations, type2, type3, type5 and the 19:00-22:00 period;

S323, deleting independent variables whose collinearity with other independent variables exceeds a threshold, which is typically 0.75; in this embodiment, this step deletes three independent variables: the cumulative deduction score, the average deduction score and other illegal behaviors;

S324, checking that no multicollinearity exists among the remaining independent variables, and determining the data independent variables.
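The collinearity screening of S323 can be sketched with pairwise Pearson correlation and the 0.75 threshold. The greedy keep-first rule used here is a simplification chosen for illustration; the source does not say which variable of a correlated pair is removed. Column names and values are hypothetical, and constant columns are assumed to have been removed already in S322.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_collinear(columns, threshold=0.75):
    """Keep a column only if its |r| with every already kept column is within the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns: the deduction score tracks the fine amount closely.
columns = {
    "total_fine": [100, 200, 300, 400, 500],
    "total_points": [1, 2, 3, 4, 6],
    "age": [40, 22, 35, 51, 28],
}
kept = drop_collinear(columns)
```

Because the outcome of a greedy filter depends on column order, a production version would need an explicit tie-breaking rule.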

S33, splitting the collection G of the high-risk driver data subset D and the compressed general driver sample N' into a training set and a test set; in this embodiment, the sample size ratio of the training set to the test set is 9:1.

S34, performing SMOTE sampling on the training set: the over-sampling and under-sampling proportions of the high-risk driver data subset and the general driver data subset are determined, the final sample numbers are obtained, and the training set samples are obtained after processing. In this embodiment, the over-sampled high-risk driver data subset contains 2 times its original number of samples, and the under-sampled general driver data subset contains 2 times the original number of high-risk driver samples.
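The rebalancing in S34 can be sketched in plain Python as follows. SMOTE creates each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The reading of the proportions used here (the minority class ends at 2 times its original size, the majority class is sampled down to 2 times the original minority size) is one interpretation of the text; the neighbour count k, the function names and the data are hypothetical.

```python
import random

def smote(minority, n_new_per_sample=1, k=3, seed=0):
    """Generate synthetic minority samples by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        # k nearest minority neighbours by squared Euclidean distance
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        for _ in range(n_new_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()                 # position along the segment x -> nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

def rebalance(minority, majority, seed=0):
    """Over-sample the minority to 2x and under-sample the majority to 2x minority."""
    rng = random.Random(seed)
    new_minority = minority + smote(minority, n_new_per_sample=1, seed=seed)
    new_majority = rng.sample(majority, 2 * len(minority))
    return new_minority, new_majority

# Hypothetical two-feature samples.
high_risk = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
general = [[float(i), float(i % 5)] for i in range(20)]
new_high_risk, new_general = rebalance(high_risk, general)
```

Under this reading, the two classes end up with equal sample counts in the training set.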

Step 4, training the model on the training set data using the Adaboost algorithm with 5-fold cross validation. The model parameters are: learning rate learning_rate_value = 0.1; number of weak classifiers n_estimators_value = 500; maximum tree depth max_depth_value = 2; node minimum split value min_samples_split_value = 2; leaf node minimum sample number min_samples_leaf_value = 2; maximum feature number max_features_value = 5. The number mtry of attributes selected at the nodes of the model is 47, that is, 47 characteristic variables such as age, average fine amount, accumulated fine amount and gender are selected from the 93 attribute variables; the top 20 attribute variables by importance are shown in fig. 5.
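The parameter values above map onto scikit-learn as sketched below: the tree-related values belong to DecisionTreeClassifier and the boosting values to AdaBoostClassifier. The grouping into two dictionaries, the function name build_model and the random_state argument are choices made for this sketch, not details from the embodiment.

```python
# Hyperparameters reported in the embodiment, grouped by the scikit-learn
# estimator that accepts them.
TREE_PARAMS = {
    "max_depth": 2,
    "min_samples_split": 2,
    "min_samples_leaf": 2,
    "max_features": 5,
}
BOOST_PARAMS = {
    "n_estimators": 500,
    "learning_rate": 0.1,
}

def build_model(random_state=0):
    """Return an unfitted AdaBoost ensemble of shallow decision trees."""
    # Imported lazily so the parameter tables can be inspected without scikit-learn.
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(random_state=random_state, **TREE_PARAMS)
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn releases (base_estimator vs estimator).
    return AdaBoostClassifier(tree, random_state=random_state, **BOOST_PARAMS)
```

Assuming placeholder arrays X_train and y_train, the 5-fold cross validation could then be run with `sklearn.model_selection.cross_val_score(build_model(), X_train, y_train, cv=5)`.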

Step 5, performing model evaluation using the test set data, determining the classification probability critical threshold value, and correcting the model.

Specifically, the test set data is first input into the model trained in step 4 to obtain the predicted class Fit_class of each test sample and its probability Fit_probs; next, the ROC curve (fig. 6) and the PR curve (fig. 7) of the test set are drawn, and the accuracy and the recall rate are determined; finally, the classification probability threshold is determined according to the recall rate. In this embodiment, the model accuracy is 0.8, the recall rate is 0.403, and the corresponding probability threshold separating high-risk personnel from general personnel is 0.736.
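One way to turn predicted probabilities into a threshold is sketched below. The text states only that the threshold is determined according to the recall rate, so the selection rule used here (the highest threshold whose recall still meets a target) is an assumption, and the probabilities, labels and function names are made up for illustration.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall when samples with prob >= threshold are flagged high-risk."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0   # nothing flagged: no false alarms
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(probs, labels, min_recall):
    """Highest candidate threshold whose recall is at least min_recall."""
    for thr in sorted(set(probs), reverse=True):
        _, recall = precision_recall_at(probs, labels, thr)
        if recall >= min_recall:
            return thr
    return min(probs)

# Hypothetical test-set probabilities of the high-risk class and true labels.
probs = [0.95, 0.80, 0.74, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
threshold = pick_threshold(probs, labels, min_recall=0.5)
precision, recall = precision_recall_at(probs, labels, threshold)
```

With a fixed threshold such as the 0.736 reported in this embodiment, classification reduces to the comparison `prob >= 0.736` on the high-risk probability.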

Step 6, inputting the data of the subset U to be identified obtained in step 2 into the high-risk personnel identification model trained in the preceding steps and predicting the target value with the model; part of the identification results are shown in Table 3.

TABLE 3 identification results of high-risk personnel using the method of the invention

Claims

1. A traffic high-risk personnel identification method based on an Adaboost algorithm is characterized by comprising the following steps: judging the traffic accident risk according to the illegal attribute of the road traffic participant, comprising the following steps,

s2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified; the method for assigning the corresponding data label value label based on the classification rule in step S2 specifically includes:

the data which do not meet the judgment condition form a subset to be identified;

S3, sampling the general personnel data subset N in the illegal data set, combining it with the high-risk personnel data subset D, and splitting to obtain a training set and a test set; specifically,

s32, performing variable processing and screening on the sample data of the compressed general personnel data subset N'; the method specifically comprises the following steps:

S322, deleting constant independent variables and independent variables with extremely small variance; the judgment condition that the variance is extremely small is as follows:

wherein $\mathrm{freqcut}_X = x_f / x_l$, $x_f$ is the frequency of the most frequent sample value of variable X, $x_l$ is the frequency of the second most frequent sample value of variable X, and $T_f$ is the corresponding threshold; $\mathrm{unisequential}_X = m_X / n_X$, $m_X$ is the number of distinct sample values after deduplication, $n_X$ is the total number of samples, and $T_u$ is the test threshold of unisequential;

s323, deleting the independent variable which is more than a threshold value in collinearity with other independent variables;

s324, checking the multiple collinearity of the independent variables, and determining the independent variables of the data;

s34, SMOTE sampling is carried out on the training set, the sample expansion and contraction proportion of the high-risk personnel data subset and the general personnel data subset is determined, the final sample number is obtained, and the training set sample is obtained after processing;

2. The method for identifying high-risk traffic personnel based on Adaboost algorithm as claimed in claim 1, wherein: the original traffic violation data and accident data in step S1 include the certificate information of the relevant person; collecting and classifying illegal records to obtain an illegal data set; the illegal data set records full sample data for the illegal, and the information of the illegal data set comprises personnel certificate numbers, illegal times, illegal types, punishment conditions, accident-related illegal behavior occurrence conditions and illegal occurrence time intervals.

3. The method for identifying high-risk traffic personnel based on Adaboost algorithm as claimed in claim 2, characterized in that: in step S1, the occurrence of accident-related illegal behaviors is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.

4. The method for identifying high-risk traffic personnel based on Adaboost algorithm as claimed in claim 2, characterized in that: in step S1, the violation time interval is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.