CN113808392B

CN113808392B - Method for optimizing traffic accident data under multi-source data structure

Info

Publication number: CN113808392B
Application number: CN202110975201.3A
Authority: CN
Inventors: 郭延永; 刘攀; 丁红亮; 马景峰; 李清韵
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-08-24
Filing date: 2021-08-24
Publication date: 2022-04-01
Anticipated expiration: 2041-08-24
Also published as: CN113808392A

Abstract

The invention discloses a method for optimizing traffic accident data under a multi-source data structure, comprising the following steps: (1) collecting multi-source traffic data; (2) constructing a generation model that conforms to the distribution form of the multi-source data; (3) balancing the traffic accident data structure; and (4) verifying and evaluating the optimized data. The method first collects and summarizes multi-source traffic accident data and determines the distribution form of each traffic data type, then constructs an accident data generation model based on those distribution forms, and finally verifies and evaluates the optimized data set using a road safety analysis model. The method can greatly reduce the influence of an unbalanced traffic accident data structure on the safety analysis model and yield accurate and reliable traffic safety evaluation results.

Description

Method for optimizing traffic accident data under multi-source data structure

Technical Field

The invention relates to a method for optimizing traffic accident data under a multi-source data structure, and belongs to the technical field of traffic data structures.

Background

In recent years, the construction of road safety accident analysis models has become a research hotspot in the field of traffic safety. The performance of such models, however, depends to a great extent on the validity of the traffic accident data structure. Because traffic accidents, and serious accidents in particular, are low-probability events, accident data structures are often unbalanced: the number of accident samples is far smaller than the number of zero-accident samples (the excess-zeros phenomenon). At present, in both scientific research and patent applications, most studies rely on traditional statistical approaches, such as the zero-inflated Poisson regression model and bootstrap resampling. With the development of advanced data mining technologies, oversampling and undersampling techniques have begun to be used to balance data structures, for example the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GANs).

However, these methods usually assign a common likelihood function to all variables when generating a new data set and ignore the heterogeneity among different variables, which degrades the fit of the model and the identification of safety factors. Therefore, to ensure that data generation is valid and that accurate and reliable safety evaluation results are obtained, a likelihood function conforming to the distribution form of each variable needs to be constructed separately for the different variables when generating a new data set, so that the accident data structure is balanced.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: to provide a method for optimizing traffic accident data under a multi-source data structure that can greatly reduce the influence of an unbalanced traffic accident data structure on the safety analysis model and yield accurate and reliable traffic safety evaluation results.

The invention adopts the following technical scheme for solving the technical problems:

A method for optimizing traffic accident data under a multi-source data structure comprises the following steps:

step 1, collecting multi-source traffic data, namely acquiring multi-source traffic safety influence factor data;

step 2, constructing a generation model that conforms to the distribution form of the multi-source traffic data, namely constructing a distribution form function for each influence factor acquired in step 1;

step 3, performing proliferation optimization processing on the multi-source traffic data acquired in step 1 based on the generation model constructed in step 2, so that the ratio of the number of accident samples to the number of zero-accident samples in the processed multi-source traffic data is 1:4.

As a further scheme of the invention, the method for optimizing the traffic accident data further comprises a step 4 of constructing a traffic safety analysis model and verifying the proliferation optimization result according to the fitting indexes of the model.

As a preferred scheme of the present invention, the multi-source traffic safety influence factors in step 1 include: the total annual number of traffic accidents on the road section N, the road section length L, the average daily traffic volume of the road section Q, the average speed of the road section V, the traffic node density of the road section S, the road grade A, the road width W, the number of lanes K, and the presence or absence of a bus lane B.

As a preferred scheme of the invention, the specific process of the step 2 is as follows:

dividing the multi-source traffic safety influence factors into counting variables, real-value variables, classification variables and ordered variables;

the counting variable comprises the total annual number of traffic accidents on the road section N, and the distribution form function of this variable is constructed according to formula (1):

$$ p(N = G) = \frac{\lambda^{G} e^{-\lambda}}{G!} \tag{1} $$

wherein p(N = G) represents the probability that G accidents occur on the road section, λ represents the average number of accidents per unit time or unit area, and G is a natural number;

the real-valued variables comprise the road section length L, the average daily traffic volume of the road section Q, the traffic node density of the road section S and the road width W, and the distribution form function of the real-valued variables is constructed according to formula (2):

$$ p(Z = J) = \mathcal{N}\left(J \mid \mu(I), \sigma(I)^{2}\right), \quad J \text{ a continuous value} \tag{2} $$

wherein Z represents a real-valued variable, p(Z = J) represents the probability of the real-valued variable taking the value J, $\mathcal{N}(\cdot)$ represents the normal distribution function, μ(I) and σ(I)² are respectively the mean and variance of the Gaussian distribution, and I represents the actual observed value of the real-valued variable;

the classification variables comprise the road grade A, the number of lanes K and the presence or absence of a bus lane B, and the distribution form function of the classification variables is constructed according to formula (3):

$$ p(H = C) = \frac{\exp\left(\pi_C(F)\right)}{\sum_{q=1}^{U} \exp\left(\pi_q(F)\right)} \tag{3} $$

wherein H represents a classification variable, p(H = C) represents the probability of the classification variable taking the value C, π_C(F) and π_q(F) represent parameters of the multinomial Logit model, F represents the actual observed value of the classification variable, and U is a natural number;

the ordered variables comprise the average speed of the road section V, and the distribution form function of the average speed of the road section is constructed according to formulas (4) and (5):

$$ p(V = R) = p(V \le R) - p(V \le R - 1) \tag{4} $$

$$ p(V \le R) = \frac{\exp\left(\omega_R(E) - \psi_V(E)\right)}{1 + \exp\left(\omega_R(E) - \psi_V(E)\right)} \tag{5} $$

wherein p(V = R) represents the probability that the average speed takes the value R, p(V ≤ R) represents the probability that the average speed is less than or equal to R, p(V ≤ R-1) represents the probability that the average speed is less than or equal to R-1, R is a natural number, ω_R(E) represents the segment threshold corresponding to the average speed value R, ψ_V(E) is a model parameter, and E represents the actual observed value of the ordered variable.

As a preferred embodiment of the present invention, the traffic safety analysis model is represented by the following formulas (6) and (7):

$$ \ln(N) = \theta + \theta_1 L + \theta_2 Q + \theta_3 V + \theta_4 S + \theta_5 A + \theta_6 W + \theta_7 K + \theta_8 B \tag{6} $$

$$ \mathrm{AIC} = -2\ln(Y) + 2T, \qquad \mathrm{BIC} = \ln(n)\,T - 2\ln(Y) \tag{7} $$

wherein N represents the total annual number of traffic accidents on the road section, L represents the road section length, Q represents the average daily traffic volume of the road section, V represents the average speed of the road section, S represents the traffic node density of the road section, A represents the road grade, W represents the road width, K represents the number of lanes, and B represents the presence or absence of a bus lane; θ, θ₁, θ₂, θ₃, θ₄, θ₅, θ₆, θ₇ and θ₈ are the coefficients of the traffic safety analysis model; AIC represents the Akaike information criterion, BIC represents the Bayesian information criterion, Y represents the maximum likelihood value, T represents the number of influencing factors, and n is the number of observation samples.
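As a rough illustration of formula (7), the following Python sketch computes both criteria from a log-likelihood value. The function names and the numeric inputs are hypothetical and are not taken from the invention; `log_likelihood` stands for ln(Y).

```python
import math


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: -2 ln(Y) + 2T."""
    return -2.0 * log_likelihood + 2.0 * num_params


def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian information criterion: ln(n) T - 2 ln(Y)."""
    return math.log(num_samples) * num_params - 2.0 * log_likelihood


# Hypothetical fit: ln(Y) = -100 with T = 9 and n = 100 observation samples.
example_aic = aic(-100.0, 9)
example_bic = bic(-100.0, 9, 100)
```

For a fixed data set, smaller values of either criterion indicate a better trade-off between fit and model size.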

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

1. The invention provides a method for optimizing traffic accident data under a multi-source data structure. The method determines the distribution form of each traffic data type, constructs an accident data generation model based on those distribution forms, and verifies and evaluates the optimized data set using a road safety analysis model, thereby greatly reducing the influence of an unbalanced traffic accident data structure on the safety analysis model and making the traffic safety evaluation results more accurate and reliable.

2. The invention constructs the likelihood functions which accord with respective distribution aiming at different variable data, thereby ensuring the effectiveness of data generation and ensuring the acquisition of accurate and reliable safety evaluation results.

Drawings

FIG. 1 is a flow chart of a method of optimizing traffic accident data in a multi-source data structure in accordance with the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As shown in fig. 1, the method for optimizing traffic accident data under a multi-source data structure provided by the present invention includes the following steps:

Step 1, collecting multi-source traffic data: the following multi-source traffic safety influence factors are obtained through field surveys and inquiries with the relevant traffic departments: the total annual number of traffic accidents on the road section N, the road section length L, the average daily traffic volume of the road section Q, the average speed of the road section V, the traffic node density of the road section S, the road grade A, the road width W, the number of lanes K, and the presence or absence of a bus lane B;

Step 2, constructing a generation model that conforms to the distribution form of the multi-source data: a suitable distribution form function is constructed for each factor in step 1, specifically as follows:

The counting variable (total annual number of traffic accidents N) is as shown in formula (1):

$$ p(N = G) = \frac{\lambda^{G} e^{-\lambda}}{G!} \tag{1} $$

where p(N = G) represents the probability that G accidents occur on the road section, and λ represents the average number of accidents per unit time or unit area.
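Assuming formula (1) is the standard Poisson probability mass function, which matches the definition of λ given here, a minimal Python sketch might look as follows. The function name and the value λ = 0.5 are illustrative assumptions, not values from the invention.

```python
import math


def poisson_pmf(g: int, lam: float) -> float:
    """Probability of exactly g accidents when the mean accident count is lam."""
    if g < 0:
        return 0.0
    return lam ** g * math.exp(-lam) / math.factorial(g)


# Hypothetical mean of 0.5 accidents per road section per year.
lam = 0.5
accident_count_probs = [poisson_pmf(g, lam) for g in range(4)]
```

With a small λ most of the probability mass sits at G = 0, which is the excess-zeros situation the method sets out to balance.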

The real-valued variables (road section length L, road width W, traffic node density of the road section S, and average daily traffic volume of the road section Q) are as shown in formula (2):

$$ p(Z = J) = \mathcal{N}\left(J \mid \mu(I), \sigma(I)^{2}\right), \quad J \text{ a continuous value} \tag{2} $$

where Z represents a real-valued variable in the present invention, p(Z = J) represents the probability that the variable takes the value J, $\mathcal{N}(\cdot)$ represents the normal distribution function, μ(I) and σ(I)² are respectively the mean and variance of the Gaussian distribution, and I represents the actual observed values of the variable.
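The sketch below illustrates one possible reading of formula (2) in Python: the mean and variance are estimated from the observed values I, and new values are then drawn from the fitted normal distribution. Estimating μ(I) and σ(I)² as the sample mean and population variance is an assumption, and the road widths are hypothetical.

```python
import math
import random
import statistics


def gaussian_pdf(j: float, mu: float, sigma2: float) -> float:
    """Normal density at j for mean mu and variance sigma2."""
    return math.exp(-((j - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Hypothetical observed road widths W in metres.
observed_widths = [7.0, 7.5, 10.5, 14.0, 7.0]
mu = statistics.fmean(observed_widths)          # assumed estimate of mu(I)
sigma2 = statistics.pvariance(observed_widths)  # assumed estimate of sigma(I)^2

# Draw one synthetic width from the fitted distribution.
rng = random.Random(0)
synthetic_width = rng.gauss(mu, math.sqrt(sigma2))
```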

The classification variables (road grade A, presence or absence of a bus lane B, number of lanes K) are as shown in formula (3):

$$ p(H = C) = \frac{\exp\left(\pi_C(F)\right)}{\sum_{q=1}^{U} \exp\left(\pi_q(F)\right)} \tag{3} $$

where H represents a classification variable in the present invention, p(H = C) represents the probability that the variable takes the value C, π_C(F) and π_q(F) represent parameters of the multinomial Logit model, F represents the actual observed value of each variable, and U is a natural number.
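Assuming formula (3) takes the usual multinomial Logit form, in which each category probability is the exponential of its parameter divided by the sum over all U categories, a short Python sketch is given below. The function name and the utility values are hypothetical.

```python
import math


def multinomial_logit_probs(utilities: list) -> list:
    """Category probabilities exp(pi_C) / sum_q exp(pi_q)."""
    shift = max(utilities)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(u - shift) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Logit parameters for three road grades.
road_grade_probs = multinomial_logit_probs([0.2, 1.0, -0.5])
```

Subtracting the largest utility before exponentiating leaves the probabilities unchanged, because the common factor cancels between numerator and denominator.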

The ordered variable (average speed of the road section V) is as shown in formulas (4) and (5):

$$ p(V = R) = p(V \le R) - p(V \le R - 1) \tag{4} $$

$$ p(V \le R) = \frac{\exp\left(\omega_R(E) - \psi_V(E)\right)}{1 + \exp\left(\omega_R(E) - \psi_V(E)\right)} \tag{5} $$

where p(V = R) represents the probability that the average speed takes the value R, p(V ≤ R) represents the probability that the average speed is less than or equal to R, p(V ≤ R-1) represents the probability that the average speed is less than or equal to R-1, and R is a natural number. ω_R(E) represents the segment threshold corresponding to the average speed value R; for example, if R is 0.8, the segment in which R lies is [0, 1] and the threshold is 1. ψ_V(E) is a model parameter, and E represents the actual observed value of the ordered variable.
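The differencing in formula (4) can be sketched in Python as follows. The cumulative probability is assumed here to follow a standard ordered-Logit link, that is, the logistic function of the threshold minus the model parameter; this link, the function names, the thresholds and the parameter value are all assumptions made for illustration.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def ordered_probs(thresholds: list, psi: float) -> list:
    """Probabilities of each ordered category via p(V=R) = p(V<=R) - p(V<=R-1).

    thresholds must be increasing; the last category absorbs the remaining mass.
    """
    cumulative = [logistic(w - psi) for w in thresholds] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


# Hypothetical thresholds separating four speed bands, and a hypothetical psi.
speed_band_probs = ordered_probs([-1.0, 0.5, 2.0], psi=0.3)
```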

Step 3, balancing the traffic accident data structure: the accident samples proliferated for each variable in step 2 are combined with the original observation data to balance the traffic accident data structure; the recommended ratio of accident samples (N ≠ 0) to zero-accident samples (N = 0) is 1:4.

Step 4, verifying and evaluating the optimized data: a traffic safety analysis model is constructed to verify the balanced traffic accident data structure, and the data optimization result is evaluated according to the fitting indexes (AIC and BIC) of the model, as shown in formulas (6) and (7):
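A minimal Python sketch of the proliferation idea is given below: each variable of a synthetic accident sample is drawn from a distribution of its own type instead of one shared distribution. Only four of the nine factors are shown for brevity. All distribution parameters, the helper names, the kilometre unit, the lower bound on length and the sample count of 13 are hypothetical, and the distributions are not conditioned on observed values as they are in the invention.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw a Poisson count with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_category(probs: list, rng: random.Random) -> int:
    """Draw a category index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_accident_sample(rng: random.Random) -> dict:
    """Generate one synthetic accident sample (N != 0) from hypothetical parameters."""
    n = 0
    while n == 0:  # reject zero counts so the result is an accident sample
        n = sample_poisson(0.8, rng)
    return {
        "N": n,                                      # counting variable
        "L": max(0.1, rng.gauss(1.2, 0.3)),          # real-valued, assumed km
        "A": sample_category([0.2, 0.5, 0.3], rng),  # classification variable
        "V": sample_category([0.1, 0.6, 0.3], rng),  # ordered speed band
    }


rng = random.Random(42)
synthetic_samples = [generate_accident_sample(rng) for _ in range(13)]
```

The synthetic samples would then be appended to the original observations before the safety analysis model is fitted.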

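$$ \ln(N) = \theta + \theta_1 L + \theta_2 Q + \theta_3 V + \theta_4 S + \theta_5 A + \theta_6 W + \theta_7 K + \theta_8 B \tag{6} $$

$$ \mathrm{AIC} = -2\ln(Y) + 2T, \qquad \mathrm{BIC} = \ln(n)\,T - 2\ln(Y) \tag{7} $$

where Y represents the maximum likelihood value, T represents the number of parameters (9 in the present invention), and n is the number of observation samples.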
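To make formula (6) concrete, the Python sketch below inverts the log-linear relation to obtain an expected accident count and evaluates ln(Y) for a set of observed counts. The source does not name the likelihood; a Poisson likelihood is assumed here, and the coefficient and factor values are hypothetical.

```python
import math


def expected_accidents(theta0: float, thetas: list, factors: list) -> float:
    """Invert formula (6): N = exp(theta + sum_k theta_k * x_k)."""
    return math.exp(theta0 + sum(t * x for t, x in zip(thetas, factors)))


def poisson_log_likelihood(counts: list, means: list) -> float:
    """ln(Y) under an assumed Poisson likelihood for the observed counts."""
    return sum(n * math.log(m) - m - math.lgamma(n + 1) for n, m in zip(counts, means))


# Hypothetical coefficients theta_1..theta_8 and one road section (L, Q, V, S, A, W, K, B).
thetas = [0.3, 0.0001, 0.01, 0.05, 0.1, 0.02, 0.05, 0.2]
section = [1.2, 8000, 40, 3.0, 2, 14.0, 4, 1]
mean_count = expected_accidents(-3.0, thetas, section)
log_y = poisson_log_likelihood([1], [mean_count])
```

The resulting ln(Y) is the quantity that enters formula (7).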

The present invention will be described with reference to specific examples.

1) Multi-source traffic data acquisition: multi-source data are collected through field surveys and surveys of the relevant departments. Assume that samples n₁ to n₁₀ are accident samples and samples n₁₁ to n₁₀₀ are zero-accident samples; the ratio of accident samples to zero-accident samples in the original data is then 1:9, so the data structure is unbalanced, as shown in Table 1.

TABLE 1 statistical table for sample data collection

2) Proliferation of accident sample data: according to the literature, a ratio of accident samples to zero-accident samples of 1:4 ensures the validity of the safety analysis model and the interpretability of the variables. The accident samples are therefore proliferated using the generation model of step 2 of the invention, which conforms to the distribution form of each variable, until the ratio of non-zero-accident samples to zero-accident samples in the analysis data is 1:4.

3) Constructing the safety analysis models: a safety analysis model is constructed from the original traffic accident data and another from the traffic accident data after proliferation optimization, as follows:
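For the assumed data of this example (10 accident samples and 90 zero-accident samples), the number of synthetic accident samples needed to reach the 1:4 target can be worked out as in the following Python sketch. The function name is hypothetical, and rounding the target up to a whole sample is an assumption, since 90 / 4 is not an integer.

```python
import math


def synthetic_accident_samples_needed(n_accident: int, n_zero: int,
                                      zero_per_accident: int = 4) -> int:
    """Number of accident samples to generate so that accident:zero is about 1:4."""
    target_accident = math.ceil(n_zero / zero_per_accident)
    return max(0, target_accident - n_accident)


# Example data: 10 accident samples and 90 zero-accident samples (ratio 1:9).
needed = synthetic_accident_samples_needed(10, 90)
# The target is ceil(90 / 4) = 23 accident samples, so 13 must be generated.
```

After adding 13 synthetic accident samples the data set holds 23 accident samples and 90 zero-accident samples, which is as close to 1:4 as whole samples allow without removing zero-accident samples.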

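Safety analysis model based on the original traffic accident data (ratio 1:9):

$$ \ln(N_{oi}) = \theta_{oi} + \theta_{o1} L_{oi} + \theta_{o2} Q_{oi} + \theta_{o3} V_{oi} + \theta_{o4} S_{oi} + \theta_{o5} A_{oi} + \theta_{o6} W_{oi} + \theta_{o7} K_{oi} + \theta_{o8} B_{oi} $$

$$ \mathrm{AIC}_o = -2\ln(Y_o) + 2T_o, \qquad \mathrm{BIC}_o = \ln(n_o)\,T_o - 2\ln(Y_o) $$

Safety analysis model based on the traffic accident data after proliferation optimization (ratio 1:4):

$$ \ln(N_{ai}) = \theta_{ai} + \theta_{a1} L_{ai} + \theta_{a2} Q_{ai} + \theta_{a3} V_{ai} + \theta_{a4} S_{ai} + \theta_{a5} A_{ai} + \theta_{a6} W_{ai} + \theta_{a7} K_{ai} + \theta_{a8} B_{ai} $$

$$ \mathrm{AIC}_a = -2\ln(Y_a) + 2T_a, \qquad \mathrm{BIC}_a = \ln(n_a)\,T_a - 2\ln(Y_a) $$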

4) Verification and evaluation of the optimized data: because this case is based on assumed data, the outcome is stated conditionally. If $\mathrm{AIC}_a < \mathrm{AIC}_o$ and $\mathrm{BIC}_a < \mathrm{BIC}_o$, the safety analysis model based on the traffic accident data after proliferation optimization fits better than the safety analysis model based on the original traffic accident data, since smaller AIC and BIC values indicate a better fit; otherwise, it fits worse than the model based on the original traffic accident data.

The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims

1. A method for optimizing traffic accident data under a multi-source data structure is characterized by comprising the following steps:

step 1, collecting multi-source traffic data, namely acquiring multi-source traffic safety influence factor data; the multi-source traffic safety influence factors comprise: the total annual number of traffic accidents on the road section N, the road section length L, the average daily traffic volume of the road section Q, the average speed of the road section V, the traffic node density of the road section S, the road grade A, the road width W, the number of lanes K, and the presence or absence of a bus lane B;

step 2, constructing a generation model which accords with the form distribution of the multi-source traffic data, namely constructing a distribution form function for each influence factor acquired in the step 1; the specific process is as follows:

$$ p(V = R) = p(V \le R) - p(V \le R - 1) \tag{4} $$

$$ p(V \le R) = \frac{\exp\left(\omega_R(E) - \psi_V(E)\right)}{1 + \exp\left(\omega_R(E) - \psi_V(E)\right)} \tag{5} $$

wherein p(V = R) represents the probability that the average speed takes the value R, p(V ≤ R) represents the probability that the average speed is less than or equal to R, p(V ≤ R-1) represents the probability that the average speed is less than or equal to R-1, R is a natural number, ω_R(E) represents the segment threshold corresponding to the average speed value R, ψ_V(E) is a model parameter, and E represents the actual observed value of the ordered variable;

step 3, performing proliferation optimization processing on the multi-source traffic data acquired in step 1 based on the generation model constructed in step 2, so that the ratio of the number of accident samples to the number of zero-accident samples in the processed multi-source traffic data is 1:4.

2. The method for optimizing traffic accident data under the multi-source data structure of claim 1, wherein the method for optimizing traffic accident data further comprises a step 4 of constructing a traffic safety analysis model and verifying the proliferation optimization result according to the fitting indexes of the model.

3. The method for optimizing traffic accident data under the multi-source data structure of claim 2, wherein the traffic safety analysis model is as shown in formulas (6) and (7):

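$$ \ln(N) = \theta + \theta_1 L + \theta_2 Q + \theta_3 V + \theta_4 S + \theta_5 A + \theta_6 W + \theta_7 K + \theta_8 B \tag{6} $$

$$ \mathrm{AIC} = -2\ln(Y) + 2T, \qquad \mathrm{BIC} = \ln(n)\,T - 2\ln(Y) \tag{7} $$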