WO2020108219A1 - Traffic safety risk based group division and difference analysis method and system

Info

Publication number: WO2020108219A1
Application number: PCT/CN2019/114373
Authority: WO
Inventors: 刘林; 吕伟韬; 陈凝; 饶欢
Original assignee: 江苏智通交通科技有限公司
Priority date: 2018-11-30
Filing date: 2019-10-30
Publication date: 2020-06-04
Also published as: CN109598931A; CN109598931B

Abstract

A traffic safety risk based group division and difference analysis method and system. Taking drivers and motor vehicles as objects, the method calibrates the safety risk of each object using an ensemble learning algorithm, divides the objects into groups on the basis of that risk, and identifies significant difference indicators through a statistical approach. The ensemble learning algorithm mines the features of traffic participants from their traffic behavior and calibrates their safety risk degree; to reduce the data granularity used in analysis and judgment, the participants are divided by risk degree into several target groups with different safety levels. To overcome the lack of interpretability of the ensemble learning algorithm during safety risk calibration, Fisher's exact test is used to identify significant difference indicators among the groups, thereby accurately describing the features of the groups at each risk level and providing data support for active traffic safety management.

Description

Group division and difference analysis method and system based on traffic safety risk

Technical field

The invention relates to a method and system for group division and difference analysis based on traffic safety risks.

Background art

Machine learning methods can predict the probability that a traffic participant will be involved in a traffic accident, which makes it possible to calibrate a clear safety risk index for every driver and vehicle that has traffic violation and accident records. In current practice, however, the application scenarios in which traffic safety management takes the individual as its object are relatively limited.

Under such conditions, reducing the data granularity and identifying key safety features from a group perspective provides more practical guidance for active safety governance. There is therefore an urgent need for a method and system for group division and difference analysis based on traffic safety risk.

Summary of the invention

The purpose of the present invention is to provide a method and system for group division and difference analysis based on traffic safety risk, which uses statistical methods to compensate for the ensemble learning algorithm's lack of interpretability in the risk calibration process, mines the differing characteristics of accident causes and accident results among groups with different risk levels, and solves the problem in the prior art that application scenarios for traffic safety management targeting individuals are relatively limited.

The technical solution of the present invention is:

A group division and difference analysis method based on traffic safety risk takes drivers and motor vehicles as objects, calibrates the safety risk of each object through an ensemble learning algorithm, divides the objects into groups on this basis, and identifies significant difference indicators through statistical methods. The method includes the following steps:

S1. Determine the target objects among traffic participants, including drivers and motor vehicles; based on the target object information, obtain historical records of traffic violations and traffic accidents as sample data;

S2. Construct a risk prediction model for the target object based on the ensemble learning algorithm; input the sample data into the model, and the model outputs the risk degree of the target object, where the risk degree is the label classification probability of the sample data after model processing;

S3. According to the sample data obtained in step S1, determine the second-level attribute dimensions of the target object and divide them into a set of second-level attributes for accident causes and a set of second-level attributes for accident results; decompose each second-level attribute to determine its corresponding third-level attribute factors;

S4. Combine the results of steps S2 and S3, establish a group division data table, and determine the group to which each sample belongs;

S5. Taking the group as the object and the second-level attribute as the statistical dimension, count the third-level attribute data within each group; combine the counts of all groups to generate an R*C contingency table for each second-level attribute variable, where R is the number of groups and C is the number of third-level attribute factors corresponding to the second-level attribute; apply Fisher's exact test with the null hypothesis H0: there is no significant difference in the value of the attribute variable between different groups, and the alternative hypothesis H1: there is a significant difference in the value of the attribute variable between different groups; use Monte Carlo simulation to obtain an approximate solution p_value of the Fisher's exact test p-value; determine the Fisher's exact test result according to p_value, and use the variables with significant differences as the group safety feature attributes.

Further, in step S2, the construction of the risk prediction model specifically includes data label definition and data set division, screening of model feature variables based on an embedded method, data set balancing, model training based on cross-validation, and model performance evaluation based on the ROC curve (receiver operating characteristic curve) and the area under the curve (AUC) to select the model with the best fit; the risk degree output by the model is the label classification probability of the data.

Further, in step S4, the fields of the group division data table include object information, time, third-level attribute factors, risk degree, and belonging group; the value of the belonging-group field is determined by which group's risk degree threshold range contains the risk degree of the object.

Further, in step S5, the Fisher's exact test result is determined according to p_value, and variables with significant differences are used as group safety feature attributes. Specifically, if the approximate solution p_value is less than the set value, the null hypothesis H0 is rejected and H1 is accepted, meaning the variable differs significantly between groups; otherwise H0 is not rejected.
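The counting described in step S5 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name build_contingency_table, the group labels, and the factor labels are hypothetical, and the sample rows are made up only to show the shape of the resulting R*C table.

```python
from collections import Counter


def build_contingency_table(records, groups, factors):
    """Count samples per (group, third-level factor) for one second-level attribute.

    records: iterable of (group, factor) pairs, one pair per sample.
    Returns an R x C list of lists, with rows ordered as `groups`
    and columns ordered as `factors`.
    """
    counts = Counter(records)
    # A Counter returns 0 for combinations that never occur.
    return [[counts[(g, f)] for f in factors] for g in groups]


# Hypothetical samples: (belonging group, factor of one second-level attribute).
records = [
    ("group_1", "full"), ("group_1", "none"), ("group_1", "none"),
    ("group_2", "full"), ("group_2", "partial"),
    ("group_3", "full"), ("group_3", "full"),
]
groups = ["group_1", "group_2", "group_3"]   # R = 3 rows
factors = ["full", "partial", "none"]        # C = 3 columns
table = build_contingency_table(records, groups, factors)
```

Each row of the result is one group and each column one third-level attribute factor, which is the layout a row-by-column independence test expects.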

A traffic safety risk-based group division and difference analysis system, implementing any of the above traffic safety risk-based group division and difference analysis methods, includes a data docking module, a risk prediction module, an attribute factor analysis module, and a group feature recognition module.

Data docking module: extracts traffic accident records and traffic violation records from the database;

Risk prediction module: accesses the historical traffic violation data and traffic accident data from the data docking module as samples for model construction; defines data labels and divides the sample data set; screens model feature variables; balances the data set; trains the model using cross-validation and selects the model with the best fit according to the ROC curve and AUC value, completing the construction of the risk prediction model; according to user instructions, extracts the historical traffic violation records of the specified target object from the data docking module, processes them through the model, and outputs the predicted risk degree of the target object; generates a risk degree table;

Attribute factor analysis module: accesses the sample data from the data docking module and determines the second-level attributes according to the original sample data fields; determines the third-level attribute factors corresponding to each second-level attribute according to the specific values of the sample data field, where, if the second-level attribute is discrete data, the third-level attribute factors are the corresponding range of data values, and if the second-level attribute is continuous data, the third-level attribute factors are determined through discretization; generates a second-level attribute table and a third-level attribute table;

Group feature recognition module: obtains the risk degree table from the risk prediction module, and the second-level attribute table and third-level attribute table from the attribute factor analysis module; generates the group division data table according to the configured risk degree threshold intervals; uses Fisher's exact test with Monte Carlo simulation to determine the p-value of each second-level attribute and writes it into the second-level attribute table; selects the second-level attributes whose p-value is less than the set value as the difference features between groups, and generates a group feature table.

Further, the system also includes a visualization module: obtains the group division data table and the group feature table from the group feature recognition module, counts the samples of each group by the third-level attributes corresponding to the difference features, and generates a difference feature table for each group; calls the visualization engine and uses thematic charts to display the difference features of each group and the third-level attribute sample statistics.

The beneficial effects of the present invention are:

1. Based on the ensemble learning algorithm, the method and system mine the features of traffic participants' traffic behavior and calibrate their safety risk degree; to reduce the data granularity used in analysis and judgment, they divide the participants by risk degree into several target groups with different safety levels. At the same time, to overcome the lack of interpretability of the ensemble learning algorithm in the safety risk calibration process, Fisher's exact test is used to identify significant difference indicators between groups, so as to accurately describe the characteristics of the group at each risk level and provide data support for active traffic safety governance.

2. After ensemble-learning-based prediction of the traffic safety risk of drivers, vehicles, and other traffic participants, groups with different safety levels are divided, and Fisher's exact test on R*C contingency tables is used to identify the difference attribute factors. Because the numbers of rows and columns of the R*C contingency table are both greater than 2 during the test, the error in calculating the exact solution is significant and the calculation takes a long time; the present invention therefore uses Monte Carlo simulation to obtain an approximate solution of the p-value, which effectively reduces the time cost of the algorithm.

3. The present invention rates the traffic safety risk of traffic participants such as drivers and vehicles, takes the groups formed by traffic participants of the same level as objects, and explores the differences between groups, which solves the problem that application scenarios for traffic safety management targeting individuals are relatively limited.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a method for group division and difference analysis based on traffic safety risks according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of a group division and difference analysis system based on traffic safety risks in an embodiment.

Detailed description

The preferred embodiments of the present invention will be described in detail below with reference to the drawings.

Examples

A group division and difference analysis method based on traffic safety risk takes drivers and motor vehicles as objects, calibrates the safety risk of each object through an ensemble learning algorithm, divides the objects into groups on this basis, and identifies significant difference indicators through statistical methods. As shown in FIG. 1, the specific steps are:

S1. Determine the target objects among traffic participants, including drivers and motor vehicles; based on the target object information, obtain historical records of traffic violations and traffic accidents as sample data.

In the embodiment, the target object information of a driver is the ID number, and that of a motor vehicle is the combination of license plate type and license plate number; the time range of the historical records usually exceeds one year to ensure a sufficient sample size.

S2. Construct a risk prediction model for the target object based on the ensemble learning algorithm; input the sample data into the model, and the model outputs the risk degree of the target object. The risk degree is the label classification probability of the sample data after model processing.

The construction of the risk prediction model includes data label definition and data set division, screening of model feature variables based on an embedded method, data set balancing, model training based on cross-validation, and model performance evaluation based on the ROC curve (receiver operating characteristic curve) and the area under the curve (AUC) to select the model with the best fit; the risk degree output by the model is the label classification probability of the data.

In the embodiment, the risk prediction model is constructed by combining an improved sampling method with the random forest (RF) algorithm; the model has a coverage rate of 0.06 and an accuracy of 0.889.
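The AUC-based model selection mentioned above can be illustrated with a short standard-library Python sketch. It computes AUC as the fraction of positive/negative pairs in which the positive sample receives the higher predicted probability (ties count as one half), then picks the candidate with the largest AUC. The function names auc and select_best, the candidate names, and the toy labels and scores are all hypothetical; the patent does not specify how its AUC values were computed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 0/1 ground-truth labels; scores: predicted probability of label 1.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ordered pair
    return wins / (len(pos) * len(neg))


def select_best(labels, candidate_scores):
    """Return the name of the candidate model with the highest AUC."""
    return max(candidate_scores, key=lambda name: auc(labels, candidate_scores[name]))


# Toy validation data (hypothetical).
labels = [0, 0, 1, 1]
candidates = {
    "model_a": [0.1, 0.4, 0.35, 0.8],
    "model_b": [0.2, 0.3, 0.6, 0.9],
}
best = select_best(labels, candidates)
```

The pairwise form is quadratic in the sample count, so it suits a small validation fold; a rank-based computation would be the usual choice for large data sets.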

S3. According to the sample data obtained in step S1, determine the second-level attribute dimensions of the target object and divide them into a set of second-level attributes for accident causes and a set of second-level attributes for accident results; decompose each second-level attribute to determine its corresponding third-level attribute factors.

In the embodiment, with the driver as the target object, the second-level attributes for accident causes include gender, age, nationality, household registration (hukou) type, personnel type, years of driving experience, identified cause of the accident, blood alcohol content, seat belt or helmet use, and so on; with the vehicle as the target object, the second-level attributes for accident causes include vehicle type, mode of transportation, nature of vehicle use, mileage, legal status, insurance, whether the vehicle is overloaded, lighting status, load, and so on; the second-level attributes for accident results include accident form, accident level, direct property loss, accident liability, and so on.

S4. Combine the results of steps S2 and S3 and establish a group division data table to determine the group to which each sample belongs. The fields of the group division data table include object information, time, third-level attribute factors, risk degree, and belonging group; the value of the belonging-group field is determined by which group's risk degree threshold range contains the risk degree of the object.

In the embodiment, three groups are set: general, risk, and high-risk. The risk degree threshold interval is [0, 0.15] for the general group, (0.15, 0.8) for the risk group, and [0.8, 1.0] for the high-risk group.
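The threshold lookup of this embodiment can be written as a small Python function. The interval bounds [0, 0.15], (0.15, 0.8), and [0.8, 1.0] come from the text above; the function name assign_group and the label "high-risk" for the top interval are assumptions made for illustration.

```python
def assign_group(risk_degree):
    """Map a risk degree in [0, 1] to its belonging group.

    Intervals follow the embodiment: [0, 0.15], (0.15, 0.8), [0.8, 1.0].
    The label "high-risk" for the last interval is an assumed name.
    """
    if not 0.0 <= risk_degree <= 1.0:
        raise ValueError("risk degree must lie in [0, 1]")
    if risk_degree <= 0.15:
        return "general"
    if risk_degree < 0.8:
        return "risk"
    return "high-risk"


# Fill the belonging-group field for a few hypothetical risk degrees.
belonging = [assign_group(r) for r in (0.05, 0.15, 0.42, 0.8, 0.97)]
```

Note that 0.15 falls in the general group and 0.8 in the top group, matching the closed and open ends of the stated intervals.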

S5. Taking the group as the object and the second-level attribute as the statistical dimension, count the third-level attribute data within each group; combine the counts of all groups to generate an R*C contingency table for each second-level attribute variable, where R is the number of groups and C is the number of third-level attribute factors corresponding to the second-level attribute; apply Fisher's exact test with the null hypothesis H0: there is no significant difference in the value of the attribute variable between different groups, and the alternative hypothesis H1: there is a significant difference in the value of the attribute variable between different groups. Because the number of third-level attribute factors usually exceeds 2 and the numbers of rows and columns of the contingency table generally differ, Monte Carlo simulation is used to obtain an approximate solution p_value of the Fisher's exact test p-value. The test result is determined according to p_value, and variables with significant differences are used as group safety feature attributes.

In the embodiment, the script that tests the difference of attribute variable values between groups is written in the R language. It calls the fisher.test function in the stats package with the parameter simulate.p.value set to TRUE and the number of Monte Carlo replicates B set to 10^5. If p_value is less than 0.05, the null hypothesis H0 is rejected and the difference between groups is considered significant; otherwise H0 is not rejected.
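For readers without R, the idea behind a simulated Fisher's exact test on an R*C table can be sketched in standard-library Python. The sketch draws random tables with the same row and column totals by shuffling the column labels of the individual samples, and estimates the p-value as the share of simulated tables that are no more probable than the observed one, using the estimator (1 + hits) / (B + 1). This is an illustrative approximation of the approach, not the embodiment's script: the function names, the default B, the seed, and the example table are assumptions.

```python
import math
import random


def table_log_weight(table):
    """Log of a quantity proportional to the table's probability under fixed margins.

    With row and column totals fixed, P(table) is proportional to 1 / prod(n_ij!).
    """
    return -sum(math.lgamma(n + 1) for row in table for n in row)


def fisher_mc_p_value(table, B=2000, seed=0):
    """Monte Carlo estimate of the Fisher's exact test p-value for an R x C table."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # Expand the table into one (row label, column label) pair per sample.
    row_labels, col_labels = [], []
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            row_labels.extend([i] * n)
            col_labels.extend([j] * n)
    observed = table_log_weight(table)
    tolerance = 1e-7  # guards against floating-point noise when comparing weights
    hits = 0
    for _ in range(B):
        rng.shuffle(col_labels)  # random table with the same margins
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_labels, col_labels):
            sim[i][j] += 1
        if table_log_weight(sim) <= observed + tolerance:
            hits += 1
    return (1 + hits) / (B + 1)


# Hypothetical 3 x 3 table: rows are groups, columns are third-level factors.
example_p = fisher_mc_p_value([[12, 5, 3], [6, 9, 5], [2, 6, 12]], B=1000, seed=42)
```

A larger B gives a finer-grained estimate at proportionally higher cost, which is the trade-off the embodiment addresses by choosing a large replicate count.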

In one embodiment, with the person as the target object, Fisher's exact test is performed on the second-level attributes for accident results. The R*C contingency tables for accident level, accident form, accident liability, and direct property loss are as follows:

Table 1. Accident level R*C contingency table

Table 2. Accident form R*C contingency table

Table 3. Accident liability R*C contingency table

Table 4. Direct property loss R*C contingency table

The Fisher's exact test p-values obtained by Monte Carlo simulation are:

Accident result second-level attribute	Accident level	Accident form	Accident liability	Direct property loss
p_value	1.0	0.58790	0.03533	0.01469

Accident liability and direct property loss differ significantly between groups (their p_value is below 0.05), so these two variables are used as the group safety feature attribute variables.
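The selection of significant attributes from the reported p-values can be reproduced with a few lines of Python. The p-values are those listed in the embodiment and 0.05 is its stated threshold; the function name significant_attributes and the dictionary layout are assumptions for illustration.

```python
# p-values reported in the embodiment for the accident-result attributes.
p_values = {
    "accident level": 1.0,
    "accident form": 0.58790,
    "accident liability": 0.03533,
    "direct property loss": 0.01469,
}


def significant_attributes(p_values, set_value=0.05):
    """Return the attributes whose p-value is below the set value."""
    return [name for name, p in p_values.items() if p < set_value]


group_features = significant_attributes(p_values)
```

Here group_features contains accident liability and direct property loss, in the order they appear in the table.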

This method of group division and difference analysis based on traffic safety risk rates the traffic safety risk of traffic participants such as drivers and vehicles, takes the groups formed by traffic participants of the same level as objects, and explores the differences between groups, which solves the problem that application scenarios for traffic safety management targeting individuals are relatively limited.

A traffic participant group division and feature analysis system, as shown in FIG. 2, includes a data docking module, a risk prediction module, an attribute factor analysis module, a group feature recognition module, and a visualization module.

The data docking module extracts traffic accident records and traffic violation records from the database.

The risk prediction module accesses the historical traffic violation data and traffic accident data from the data docking module as samples for model construction; defines data labels and divides the sample data set; screens model feature variables; balances the data set; trains the model using cross-validation and selects the best-fitting model according to the ROC curve and AUC value, completing the construction of the risk prediction model. According to user instructions, the module extracts the historical traffic violation records of the specified target object from the data docking module, processes them through the model, outputs the predicted risk degree of the target object, and generates a risk degree table.

The attribute factor analysis module accesses the sample data from the data docking module and determines the second-level attributes according to the original sample data fields. It determines the third-level attribute factors corresponding to each second-level attribute according to the specific values of the sample data field: if the second-level attribute is discrete data, the third-level attribute factors are the corresponding range of data values; if the second-level attribute is continuous data, the third-level attribute factors are determined through discretization. The module generates a second-level attribute table and a third-level attribute table.
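The discretization of a continuous second-level attribute into third-level attribute factors can be sketched with a binning helper. The patent does not state which cut points are used, so the edges and labels below (for years of driving experience) are hypothetical, as is the function name discretize.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a continuous value to a third-level factor label.

    edges: sorted cut points; bin k covers [edges[k-1], edges[k]),
    with the first bin open below and the last bin open above.
    labels: one label per bin, so len(labels) == len(edges) + 1.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Hypothetical cut points for years of driving experience.
edges = [3, 10]
labels = ["under 3 years", "3 to under 10 years", "10 years or more"]
factors = [discretize(v, edges, labels) for v in (1.5, 3, 9.9, 10, 25)]
```

A discrete second-level attribute needs no such step, since its distinct values already serve as the third-level factors.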

The group feature recognition module obtains the risk degree table from the risk prediction module, and the second-level attribute table and third-level attribute table from the attribute factor analysis module; generates the group division data table according to the configured risk degree threshold intervals; uses Fisher's exact test with Monte Carlo simulation to determine the p-value of each second-level attribute and writes it into the second-level attribute table; selects the second-level attributes whose p-value is less than the set value as the difference features between groups, and generates a group feature table. The set value is preferably 0.05.

The visualization module obtains the group division data table and the group feature table from the group feature recognition module, counts the samples of each group by the third-level attributes corresponding to the difference features, and generates a difference feature table for each group; it then calls the visualization engine and uses thematic charts to display the difference features of each group and the third-level attribute sample statistics. The thematic charts include word clouds, histograms, pie charts, doughnut charts, numeric charts, and other graphic forms that show proportions and comparisons.

Based on the ensemble learning algorithm, this group division and difference analysis method and system based on traffic safety risk mine the features of traffic participants' traffic behavior and calibrate their safety risk degree; to reduce the data granularity used in analysis and judgment, they divide the participants by risk degree into several target groups with different safety levels. At the same time, to overcome the lack of interpretability of the ensemble learning algorithm in the safety risk calibration process, Fisher's exact test is used to identify significant difference indicators between groups, so as to accurately describe the characteristics of the group at each risk level and provide data support for active traffic safety governance.

After ensemble-learning-based prediction of the traffic safety risk of drivers, vehicles, and other traffic participants, groups with different safety levels are divided, and Fisher's exact test on R*C contingency tables is used to identify the difference attribute factors. Because the numbers of rows and columns of the R*C contingency table are both greater than 2 during the test, the error in calculating the exact solution is significant and the calculation takes a long time; the method and system of the embodiment therefore use Monte Carlo simulation to obtain an approximate solution of the p-value, which effectively reduces the time cost of the algorithm.

Claims

1. A group division and difference analysis method based on traffic safety risk, characterized by taking drivers and motor vehicles as objects, calibrating the safety risk of each object through an ensemble learning algorithm, dividing the objects into groups on this basis, and identifying significant difference indicators through statistical methods; the method includes the following steps:

S1. Determine the target objects among traffic participants, including drivers and motor vehicles; based on the target object information, obtain historical records of traffic violations and traffic accidents as sample data;

S2. Construct a risk prediction model for the target object based on the ensemble learning algorithm; input the sample data into the model, and the model outputs the risk degree of the target object, where the risk degree is the label classification probability of the sample data after model processing;

S3. According to the sample data obtained in step S1, determine the second-level attribute dimensions of the target object and divide them into a set of second-level attributes for accident causes and a set of second-level attributes for accident results; decompose each second-level attribute to determine its corresponding third-level attribute factors;

S4. Combine the results of steps S2 and S3, establish a group division data table, and determine the group to which each sample belongs;

S5. Taking the group as the object and the second-level attribute as the statistical dimension, count the third-level attribute data within each group; combine the counts of all groups to generate an R*C contingency table for each second-level attribute variable, where R is the number of groups and C is the number of third-level attribute factors corresponding to the second-level attribute; apply Fisher's exact test with the null hypothesis H0: there is no significant difference in the value of the attribute variable between different groups, and the alternative hypothesis H1: there is a significant difference in the value of the attribute variable between different groups; use Monte Carlo simulation to obtain an approximate solution p_value of the Fisher's exact test p-value; determine the Fisher's exact test result according to p_value, and use the variables with significant differences as the group safety feature attributes.
2. The method for group division and difference analysis based on traffic safety risk according to claim 1, characterized in that in step S2, the construction of the risk prediction model specifically includes data label definition and data set division, screening of model feature variables based on an embedded method, data set balancing, model training based on cross-validation, and model performance evaluation based on the ROC curve (receiver operating characteristic curve) and the area under the curve (AUC) to select the model with the best fit; the risk degree output by the model is the label classification probability of the data.
3. The method for group division and difference analysis based on traffic safety risk according to claim 1, characterized in that in step S4, the fields of the group division data table include object information, time, third-level attribute factors, risk degree, and belonging group; the value of the belonging-group field is determined by which group's risk degree threshold range contains the risk degree of the object.
4. The method for group division and difference analysis based on traffic safety risk according to any one of claims 1 to 3, characterized in that in step S5, the Fisher's exact test result is determined according to p_value, and variables with significant differences are used as group safety feature attributes; specifically, if the approximate solution p_value is less than the set value, the null hypothesis H0 is rejected and H1 is accepted; otherwise H0 is not rejected.
5. A traffic safety risk-based group division and difference analysis system for implementing the traffic safety risk-based group division and difference analysis method according to any one of claims 1 to 4, characterized in that it includes a data docking module, a risk prediction module, an attribute factor analysis module, and a group feature recognition module,

Data docking module: extracts traffic accident records and traffic violation records from the database;

Risk prediction module: accesses the historical traffic violation data and traffic accident data from the data docking module as samples for model construction; defines data labels and divides the sample data set; screens model feature variables; balances the data set; trains the model using cross-validation and selects the model with the best fit according to the ROC curve and AUC value, completing the construction of the risk prediction model; according to user instructions, extracts the historical traffic violation records of the specified target object from the data docking module, processes them through the model, and outputs the predicted risk degree of the target object; generates a risk degree table;

Attribute factor analysis module: accesses the sample data from the data docking module and determines the second-level attributes according to the original sample data fields; determines the third-level attribute factors corresponding to each second-level attribute according to the specific values of the sample data field, where, if the second-level attribute is discrete data, the third-level attribute factors are the corresponding range of data values, and if the second-level attribute is continuous data, the third-level attribute factors are determined through discretization; generates a second-level attribute table and a third-level attribute table;

Group feature recognition module: obtains the risk degree table from the risk prediction module, and the second-level attribute table and third-level attribute table from the attribute factor analysis module; generates the group division data table according to the configured risk degree threshold intervals; uses Fisher's exact test with Monte Carlo simulation to determine the p-value of each second-level attribute and writes it into the second-level attribute table; selects the second-level attributes whose p-value is less than the set value as the difference features between groups, and generates a group feature table.
6. The group division and difference analysis system based on traffic safety risk according to claim 5, characterized in that it further includes a visualization module, the visualization module: obtains the group division data table and the group feature table from the group feature recognition module, counts the samples of each group by the third-level attributes corresponding to the difference features, and generates a difference feature table for each group; calls the visualization engine and uses thematic charts to display the difference features of each group and the third-level attribute sample statistics.