CN103927874A

CN103927874A - Automatic incident detection method based on under-sampling and used for unbalanced data set

Info

Publication number: CN103927874A
Application number: CN201410177414.1A
Authority: CN
Inventors: 陈淑燕; 李苗华; 王炜
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2014-04-29
Filing date: 2014-04-29
Publication date: 2014-07-16

Abstract

The invention discloses an automatic incident detection method based on under-sampling and used for an unbalanced data set. The automatic incident detection method comprises the steps of (1) using a maximum and minimum normalization method to carry out normalization processing on actually-measured traffic flow data, carrying out under-sampling processing on a majority class in a training set on the basis of a neighborhood cleaning rule to obtain a new training set which is relatively balanced, (2) selecting a radial basis function as a kernel function of a support vector machine, using an improved grid search algorithm to optimize a penalty factor C and a kernel parameter g of the support vector machine, and (3) training the support vector machine through the training set which is relatively balanced so as to obtain an automatic incident detection model used for the unbalanced data set. According to the automatic incident detection method based on under-sampling and used for the unbalanced data set, the problem that an existing traffic incident detection algorithm is not applicable to unbalanced traffic data in reality is solved, detection performance of the traffic incident detection algorithm is remarkably improved, the average detection time is shortened, and the requirement of traffic incident detection for real-time performance is met.

Description

Automatic traffic incident detection method based on under-sampling for unbalanced data sets

Technical field

The invention belongs to the technical field of intelligent traffic management and control, and relates to an automatic traffic incident detection method based on under-sampling for unbalanced data sets.

Background art

Traffic incidents not only cause congestion and delay but also easily lead to secondary accidents. Detecting traffic incidents accurately and quickly, and carrying out rescue and handling in time, can effectively reduce the congestion and delay they produce and prevent secondary accidents. Automatic incident detection (AID) is an important component of modern traffic monitoring systems and the basis of advanced traffic control systems and traveler information systems. It is of great significance for markedly reducing the delay, congestion, and accidents caused by traffic incidents, and for improving traffic safety and the level of service.

In recent years, research on AID algorithms has concentrated on applying new techniques such as neural networks, fuzzy theory, wavelet analysis, and support vector machines. Compared with traditional incident detection algorithms, these algorithms improve detection performance to some extent. In the real world, however, normal traffic operation is far more common than the incident state, so traffic incident detection is in fact an unbalanced classification problem, and earlier AID algorithms rarely took this into account. Most of the algorithms above classify on the assumption of a balanced data set; when used for traffic incident detection they often produce a higher false alarm rate, a lower detection rate, and a longer mean time to detect, and the detection results are unsatisfactory.

Support vector machines (SVM) have been used for traffic incident detection, but they show an obvious bias when handling unbalanced classification problems, which hinders learning of the minority class samples. To overcome this defect, the present invention combines the neighborhood cleaning rule with a support vector machine and proposes an automatic traffic incident detection method based on under-sampling for unbalanced data sets. The majority class in the training set is first under-sampled with a method based on the neighborhood cleaning rule to reduce the imbalance; the relatively balanced training set is then used to train a support vector machine, which serves as the classifier for automatic traffic incident detection.

Summary of the invention

Technical problem: the invention provides an automatic traffic incident detection method based on under-sampling for unbalanced data sets, which reduces the imbalance in the number of samples between classes in the training set and can adapt to the unbalanced traffic data found in the real world.

Technical solution: the automatic traffic incident detection method based on under-sampling for unbalanced data sets according to the invention comprises the following steps:

1) normalize the measured traffic flow data with the min-max normalization method to obtain the original training set and the test set;

2) based on the neighborhood cleaning rule, under-sample the majority class in the original training set obtained in step 1) to reduce the imbalance of the training set and obtain a new, relatively balanced training set;

3) based on the original training set obtained in step 1), with a radial basis function as the kernel function of the support vector machine, optimize the penalty factor C and the kernel parameter g of the support vector machine with an improved grid search algorithm to obtain the optimal values of C and g;

4) using the optimal values of the penalty factor C and the kernel parameter g obtained in step 3), train the support vector machine on the new, relatively balanced training set obtained in step 2) to obtain an automatic traffic incident detection model for unbalanced data sets;

5) apply the trained automatic traffic incident detection model for unbalanced data sets to the test set obtained in step 1) to detect traffic incidents automatically, and determine from the model output whether a traffic incident has occurred.

In a preferred embodiment of the method, the measured traffic flow data in step 1) comprise three types of data, namely speed, occupancy, and flow, measured by the detectors upstream and downstream of the detection section in each sampling period.

In a preferred embodiment of the method, the min-max normalization method in step 1) processes the measured traffic flow data according to the following formula:

\bar{x}_{ij} = \frac{x_{ij} - x_{\min j}}{x_{\max j} - x_{\min j}},

where \bar{x}_{ij} is the value of the raw datum x_{ij} after normalization; x_{ij} is the j-th attribute value of the i-th sample; x_{\max j} and x_{\min j} are the maximum and minimum values of attribute j; and j = 1, 2, ..., 6 correspond to the six attributes upstream speed, downstream speed, upstream occupancy, downstream occupancy, upstream flow, and downstream flow, respectively.
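The min-max normalization described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `min_max_normalize`, the sample values, and the handling of a constant attribute (mapped to 0.0) are not part of the original text.

```python
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every attribute (column) of `rows` to [0, 1] with min-max normalization."""
    if not rows:
        return []
    n_attrs = len(rows[0])
    # Per-attribute minimum and maximum over all samples.
    col_min = [min(r[j] for r in rows) for j in range(n_attrs)]
    col_max = [max(r[j] for r in rows) for j in range(n_attrs)]
    out = []
    for r in rows:
        scaled = []
        for j in range(n_attrs):
            span = col_max[j] - col_min[j]
            # Assumption: a constant attribute (span == 0) maps to 0.0
            # so that the division by zero is avoided.
            scaled.append((r[j] - col_min[j]) / span if span else 0.0)
        out.append(scaled)
    return out


# Hypothetical records, in the attribute order used above:
# upstream speed, downstream speed, upstream occupancy,
# downstream occupancy, upstream flow, downstream flow.
samples = [
    [60.0, 55.0, 10.0, 12.0, 20.0, 18.0],
    [30.0, 55.0, 40.0, 12.0, 10.0, 18.0],
    [45.0, 65.0, 25.0, 8.0, 15.0, 22.0],
]
normalized = min_max_normalize(samples)
```

Each column of `normalized` now lies in [0, 1], so attributes with different units (speed, occupancy, flow) contribute on a comparable scale to any distance computed later.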

In a preferred embodiment of the method, step 2) proceeds as follows: for each sample x_i in the training set, find its three nearest neighbors and compare the class of x_i with the classes of those three neighbors. If x_i belongs to the majority class and two or three of its three neighbors are minority class samples, remove x_i from the training set; otherwise leave x_i unchanged and move on to the next sample in the training set. If x_i belongs to the minority class and two or three of its three neighbors are majority class samples, remove the majority class samples among these three neighbors from the training set; otherwise leave x_i unchanged and move on to the next sample in the training set. Here i is the index of the sample in the training set, i = 1, 2, ..., n, and n is the total number of samples in the training set.

In a preferred embodiment of the method, step 3) proceeds as follows:

First, let the penalty factor C and the kernel parameter g vary over the ranges C = [2^{-10}, 2^{10}] and g = [2^{-10}, 2^{10}] with a step of 1.0 in the exponent, and use cross validation to find the C and g that give the highest classification accuracy, thereby determining the optimal value ranges of C and g;

Then, let C and g vary over the ranges C = [2^{-10}, 2^{0}] and g = [2^{0}, 2^{10}] with a step of 0.5 in the exponent, and use cross validation to find the best values of C and g within the optimal value ranges.

In a preferred embodiment of the method, in step 5), an output of -1 from the automatic traffic incident detection model for unbalanced data sets indicates that traffic in the detection zone is currently operating normally; any other output indicates that a traffic incident has occurred.

Based on the neighborhood cleaning rule, the method proposes a new under-sampling method that under-samples the majority class in the training set and reduces the imbalance in the number of samples between classes; a support vector machine trained on this basis then serves as the classifier for automatic traffic incident detection. The method can be used for real-time automatic detection of traffic incidents and remedies the inability of existing traffic incident detection algorithms to cope with unbalanced real-world traffic data.

Beneficial effects: compared with the prior art, the invention has the following advantages:

Most traffic incident detection algorithms in common use today classify on the assumption of a balanced data set and are not suited to the unbalanced traffic data found in the real world. Some algorithms over-sample the minority class in the training set to add minority class samples and reduce the imbalance, but the added samples may make the minority class information redundant and lead to overfitting. Other algorithms address the imbalance by randomly under-sampling the majority class in the training set, but removing majority class samples at random is blind and limited, and takes no account of noise samples and boundary samples. Because incident data are rare compared with normal traffic data in practice, the method proposes a new under-sampling method based on the neighborhood cleaning rule that under-samples the majority class in the training set and reduces its imbalance, overcoming the inability of existing traffic incident detection algorithms to handle unbalanced real-world traffic data. The proposed under-sampling method removes majority class samples by finding nearest neighbors, which increases the probability of retaining safe interior samples, improves the quality of data cleaning, and reduces the effect of noise samples on minority class classification performance. A support vector machine is used as the classifier, and its penalty factor C and kernel parameter g are optimized with an improved grid search algorithm, which improves the classification performance of the model. By combining an under-sampling method based on the neighborhood cleaning rule with a support vector machine for automatic traffic incident detection, the invention raises the incident detection rate, shortens the mean time to detect, meets the real-time requirement of traffic incident detection, and is easy to implement.

Brief description of the drawings

Fig. 1 is a flow chart of the automatic traffic incident detection method based on under-sampling for unbalanced data sets according to the invention;

Fig. 2 is a schematic diagram of the collection of the measured traffic flow data used by the invention;

Fig. 3(a) and Fig. 3(b) are plots of classification accuracy as a function of C and g when optimizing the penalty factor C and the kernel parameter g of the support vector machine; the search ranges and step sizes of C and g differ between Fig. 3(a) and Fig. 3(b).

Detailed description of the embodiments

The technical solution is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; after reading the invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims of this application.

The flow chart of the automatic traffic incident detection method based on under-sampling for unbalanced data sets proposed by the invention is shown in Fig. 1. The method mainly comprises the following steps:

1) Normalize the measured traffic flow data with the min-max normalization method to obtain the original training set and the test set.

Both training and testing of the proposed detection algorithm use measured I-880 traffic flow data; a schematic diagram of the data collection is shown in Fig. 2. The I-880 data consist of traffic flow data collected by 35 detector stations (18 on the northbound roadway and 17 on the southbound roadway) during two collection periods (February 16 to March 19, 1993 and September 27 to October 29, 1993). The traffic flow data comprise three types of data, namely flow, occupancy, and speed, collected at 30 s intervals.

Normal traffic flow data from 4 randomly chosen pairs of adjacent detectors on February 16 (2 pairs in each of the northbound and southbound directions; 5272 records in total) and 8 traffic incidents (762 records in total) are taken as the original training set. Normal traffic flow data from 8 randomly chosen pairs of detectors (2 pairs in each direction on February 17 and 2 pairs in each direction on February 18; 10320 records in total) and 35 incidents (3167 records in total) are taken as the test set.

The test set and the training set are merged into a whole and normalized with the min-max normalization method, as follows:

\bar{x}_{ij} = \frac{x_{ij} - x_{\min j}}{x_{\max j} - x_{\min j}} \qquad (1)

where \bar{x}_{ij} is the value of the raw datum x_{ij} after normalization; x_{ij} is the j-th attribute value of the i-th sample; x_{\max j} and x_{\min j} are the maximum and minimum values of attribute j; and j = 1, 2, ..., 6 index the six attributes, namely upstream and downstream speed, occupancy, and flow.

2) Based on the neighborhood cleaning rule, under-sample the majority class in the training set to reduce its imbalance and obtain a new, relatively balanced training set.

To retain as many safe interior samples of the training set as possible and to remove only majority class samples, the invention borrows the idea of the neighborhood cleaning rule and proposes a new under-sampling method: for each sample x_i in the training set, find its three nearest neighbors and compare the class of x_i with the classes of those three neighbors. If x_i belongs to the majority class and two or three of its three neighbors are minority class samples, remove x_i from the training set; otherwise leave x_i unchanged and move on to the next sample in the training set. If x_i belongs to the minority class and two or three of its three neighbors are majority class samples, remove the majority class samples among these three neighbors from the training set; otherwise leave x_i unchanged and move on to the next sample in the training set. Here i is the index of the sample in the training set, i = 1, 2, ..., n, and n is the total number of samples in the training set.

This under-sampling method uses the nearest-neighbor idea to remove majority class samples from the training set. Nearest neighbors are found using the pairwise Euclidean distance between samples, computed as follows:

d(x_{a}, x_{b}) = \left( \sum_{j=1}^{n} (x_{aj} - x_{bj})^{2} \right)^{\frac{1}{2}} \qquad (2)

where d(x_a, x_b) is the Euclidean distance between two samples; x_{aj} is the j-th attribute value of the a-th sample; x_{bj} is the j-th attribute value of the b-th sample; both are values after normalization. In this formula n denotes the number of attributes over which j runs.
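The under-sampling rule and the Euclidean distance above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that labels follow the state flag convention used later in the text (-1 for the normal, majority class and 1 for the incident, minority class), that neighbors are always searched in the original training set, and that all removals are applied together at the end; the text does not specify these points. The names `euclidean` and `ncr_undersample` and the sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two normalized samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ncr_undersample(
    X: Sequence[Sequence[float]], y: Sequence[int], majority: int = -1, k: int = 3
) -> Tuple[List[Sequence[float]], List[int]]:
    """Remove majority-class samples with the 3-nearest-neighbor rule."""
    n = len(X)
    to_remove = set()
    for i in range(n):
        # Indices of the k nearest neighbors of sample i (excluding i itself).
        neighbors = sorted(
            (j for j in range(n) if j != i), key=lambda j: euclidean(X[i], X[j])
        )[:k]
        if y[i] == majority:
            # Majority sample with at least two minority neighbors: drop it.
            if sum(1 for j in neighbors if y[j] != majority) >= 2:
                to_remove.add(i)
        else:
            # Minority sample with at least two majority neighbors:
            # drop those majority neighbors, keep the minority sample.
            majority_neighbors = [j for j in neighbors if y[j] == majority]
            if len(majority_neighbors) >= 2:
                to_remove.update(majority_neighbors)
    keep = [i for i in range(n) if i not in to_remove]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical 2-D data: a normal cluster near (0, 0), an incident cluster
# near (1, 1), and one normal sample lying inside the incident cluster.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [1.0, 1.0], [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
y = [-1, -1, -1, -1, 1, 1, 1, -1]
X_new, y_new = ncr_undersample(X, y)
```

In this toy data only the normal sample at (0.95, 0.95) is removed, because all three of its nearest neighbors are incident samples; no minority sample is ever deleted. The brute-force neighbor search is quadratic in the number of samples, which is acceptable for a sketch but would normally be replaced by a spatial index for larger training sets.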

After the original training set is under-sampled, the proportion of incident samples in the training set rises from 12.63% to 33.63%. The traffic data samples used in this embodiment are described in Table 1:

Table 1 Description of the traffic data samples used in the embodiment of the invention
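The class proportions quoted above can be checked against the record counts given for the original training set (5272 normal records and 762 incident records). The short Python sketch below does this arithmetic. The number of normal records remaining after under-sampling is not quoted in the text; the value computed here is only what the reported 33.63% share would imply, assuming that all 762 incident records are kept.

```python
# Counts reported for the original training set.
normal_records = 5272
incident_records = 762

before = incident_records / (normal_records + incident_records)
print(f"incident share before under-sampling: {before:.2%}")

# Assumption: only majority (normal) records are removed, so all 762
# incident records remain. The reported 33.63% share then implies the
# number of normal records kept (an inferred figure, not a quoted one).
reported_after = 0.3363
implied_normal_after = round(incident_records / reported_after) - incident_records
after = incident_records / (implied_normal_after + incident_records)
print(f"implied normal records kept: {implied_normal_after}, share: {after:.2%}")
```

The first figure reproduces the 12.63% stated in the text, which confirms that the percentage is taken over all training records rather than over normal records only.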

3) Based on the original training set obtained in step 1), with a radial basis function as the kernel function of the support vector machine, optimize the penalty factor C and the kernel parameter g of the support vector machine with an improved grid search algorithm to obtain the optimal values of C and g.

Traffic flow data are high-dimensional and nonlinear; the raw data must be mapped by the kernel technique into a high-dimensional feature space, where a linear classification problem is solved. The radial basis function (RBF) is the most widely used SVM kernel function and performs more stably than other kernel types, so the invention selects the radial basis function as the SVM kernel.
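For reference, a radial basis function kernel can be written in a few lines of Python. The text does not give the kernel formula, so this sketch assumes the common parameterization K(x, z) = exp(-g * ||x - z||^2), in which g plays the role usually called gamma; the function name `rbf_kernel` is hypothetical.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], g: float) -> float:
    """RBF kernel K(x, z) = exp(-g * ||x - z||^2) (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * sq_dist)


# Identical samples have similarity 1; similarity decays with distance,
# and a larger g makes the decay faster.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4], g=16.0)
near_small_g = rbf_kernel([0.0, 0.0], [0.5, 0.0], g=1.0)
near_large_g = rbf_kernel([0.0, 0.0], [0.5, 0.0], g=16.0)
```

Under this parameterization a large g gives a narrow kernel that reacts only to very close samples, which is one reason g is tuned jointly with the penalty factor C.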

The penalty factor C and the kernel parameter g of the support vector machine are optimized with the improved grid search algorithm as follows:

First, let the penalty factor C and the kernel parameter g vary over a given range (C = [2^{-10}, 2^{10}], g = [2^{-10}, 2^{10}]) with a step of 1.0 in the exponent, and use cross validation to find the C and g that give the highest classification accuracy, thereby determining the optimal value ranges of C and g. As shown in Fig. 3(a), the highest classification accuracy, 100%, yields an accuracy contour line that corresponds to a series of combinations of C and g. Because a higher penalty factor C can cause overfitting, the combination with the smaller C is selected as the optimum. From Fig. 3(a), the optimal value ranges of C and g lie within C = [2^{-6}, 2^{0}], g = [2^{0}, 2^{6}];

Then, starting from the optimal value ranges already determined (C = [2^{-6}, 2^{0}], g = [2^{0}, 2^{6}]), the ranges are suitably widened to make sure the best values of C and g lie inside the search range, giving the search range C = [2^{-10}, 2^{0}], g = [2^{0}, 2^{10}]. Over this new search range the search step is reduced to 0.5, and cross validation is used to find the best values of C and g. As shown in Fig. 3(b), the highest classification accuracy is 100%, and it again yields an accuracy contour line corresponding to a series of different combinations of C and g. Although a very high penalty factor C can raise the cross-validation accuracy, a larger C tends to cause overfitting, so the combination with the smallest C is selected as the optimum.

The C and g giving the highest classification accuracy are found by cross validation as follows: the original training set obtained in step 1) is randomly divided into two groups, one used as the training set and the other as the test set. The classifier is trained on the training set and the model is then verified on the test set; the resulting classification accuracy is recorded as the performance index of the classifier.
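The validation procedure just described (one random split into two groups, train on one, score on the other) can be sketched in Python as below. This is an illustration under assumptions: the split ratio (one half each) and the random seed are not given in the text, and `nearest_mean_trainer` is a hypothetical stand-in for SVM training so that the example stays self-contained.

```python
import random
from typing import Callable, List, Sequence

Sample = Sequence[float]
Predictor = Callable[[Sample], int]
Trainer = Callable[[List[Sample], List[int]], Predictor]


def holdout_accuracy(
    X: Sequence[Sample], y: Sequence[int], train: Trainer,
    train_fraction: float = 0.5, seed: int = 0,
) -> float:
    """Randomly split (X, y) into two groups, train on one, score on the other."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    train_idx, test_idx = idx[:cut], idx[cut:]
    if not train_idx or not test_idx:
        raise ValueError("both groups must contain at least one sample")
    model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
    hits = sum(1 for i in test_idx if model(X[i]) == y[i])
    return hits / len(test_idx)


def nearest_mean_trainer(X: List[Sample], y: List[int]) -> Predictor:
    """Hypothetical stand-in for SVM training: classify by nearest class mean."""
    means = {}
    for label in set(y):
        pts = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(pts) for col in zip(*pts)]

    def predict(x: Sample) -> int:
        return min(
            means,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, means[lab])),
        )

    return predict


# Hypothetical one-attribute data: normal samples near 0, incidents near 1.
X = [[0.01 * k] for k in range(10)] + [[1.0 - 0.01 * k] for k in range(10)]
y = [-1] * 10 + [1] * 10
acc = holdout_accuracy(X, y, nearest_mean_trainer)
```

In a parameter search, `holdout_accuracy` would be called once per (C, g) combination with a trainer that fixes those two values, and the returned accuracy would be the score compared across the grid.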

Plots of classification accuracy as a function of C and g during the search for the optimal penalty factor C and kernel parameter g are shown in Fig. 3(a) and Fig. 3(b). In both figures the horizontal axis is the base-2 logarithm of the penalty factor C and the vertical axis is the base-2 logarithm of the kernel parameter g; the curves are classification accuracy contour lines, and the numbers on the curves are the corresponding classification accuracies.

Fig. 3(a) and Fig. 3(b) differ in the search ranges and step sizes of the penalty factor C and the kernel parameter g. In Fig. 3(a) the ranges are C = [2^{-10}, 2^{10}], g = [2^{-10}, 2^{10}] and the search step is 1.0; in Fig. 3(b) the ranges are C = [2^{-10}, 2^{0}], g = [2^{0}, 2^{10}] and the search step is 0.5.

The optimal values of the penalty factor C and the kernel parameter g obtained by this optimization are C = 0.022097 and g = 16.
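The two-pass search can be sketched in Python as follows. The sketch assumes the step sizes apply to the base-2 exponent, which agrees with the reported optimum: C = 0.022097 is approximately 2^{-5.5} and g = 16 = 2^{4}, both on the 0.5-step grid. The function names are hypothetical, and `toy_score` is an invented accuracy surface used only to make the example runnable; it is not the accuracy surface of Fig. 3.

```python
import math
from typing import Callable, List, Tuple


def exponent_grid(lo: float, hi: float, step: float) -> List[float]:
    """Exponents lo, lo + step, ..., hi (inclusive)."""
    count = int(round((hi - lo) / step)) + 1
    return [lo + k * step for k in range(count)]


def grid_search(
    score: Callable[[float, float], float],
    c_range: Tuple[float, float], g_range: Tuple[float, float], step: float,
) -> Tuple[float, float, float]:
    """Return (C, g, accuracy) with the highest accuracy; ties go to the smallest C."""
    best = None
    for c_exp in exponent_grid(c_range[0], c_range[1], step):
        for g_exp in exponent_grid(g_range[0], g_range[1], step):
            acc = score(2.0 ** c_exp, 2.0 ** g_exp)
            # C is visited in increasing order, so the strict '>' keeps the
            # smallest-C combination among equally accurate ones.
            if best is None or acc > best[2]:
                best = (2.0 ** c_exp, 2.0 ** g_exp, acc)
    return best


def toy_score(C: float, g: float) -> float:
    """Hypothetical accuracy surface peaking at log2(C) = -5.5, log2(g) = 4."""
    return 1.0 - 0.01 * (abs(math.log2(C) + 5.5) + abs(math.log2(g) - 4.0))


# Pass 1: coarse search over the full range with step 1.0.
coarse = grid_search(toy_score, (-10, 10), (-10, 10), 1.0)
# Pass 2: finer search over the narrowed, then widened, range with step 0.5.
fine = grid_search(toy_score, (-10, 0), (0, 10), 0.5)
```

With a real classifier, `score` would train an SVM with the given C and g and return its validation accuracy. The tie-breaking toward the smallest C mirrors the selection rule stated in the text, where several combinations reach the same highest accuracy.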

4) Using the optimal values of the penalty factor C and the kernel parameter g obtained in step 3), train the support vector machine on the new, relatively balanced training set obtained in step 2) to obtain an automatic traffic incident detection model for unbalanced data sets.

The SVM is trained on the relatively balanced training set obtained above. Its input is a 6-dimensional vector containing the six attributes, namely speed, occupancy, and flow measured by the detectors upstream and downstream of the detection section at time t. Its output is a state flag: 1 denotes the traffic incident state and -1 denotes the normal traffic state.

5) Apply the trained automatic traffic incident detection model for unbalanced data sets to the test set obtained in step 1) to detect traffic incidents automatically, and determine from the model output whether a traffic incident has occurred. If the output of the support vector machine for unbalanced data sets is -1, traffic in the detection zone is currently operating normally; otherwise a traffic incident has occurred.

To demonstrate the effectiveness of the under-sampling method based on the neighborhood cleaning rule for automatic traffic incident detection, a comparative experiment was designed: support vector machines were trained on the original training set and on the new, relatively balanced training set obtained by under-sampling, and the detection performance of the two incident detectors was compared on the same test set.

The following four evaluation indexes are used: detection rate DR (Detection Rate), false alarm rate FAR (False Alarm Rate), mean time to detect MTTD (Mean Time To Detect), and classification rate CR (Classification Rate). The indexes are computed as follows:

MTTD = \frac{1}{n} \sum_{i=1}^{n} [TI(i) - AI(i)] \qquad (5)

where TI(i) is the time at which the AID algorithm detects incident i; AI(i) is the actual time of occurrence of incident i; and n is the number of incidents detected by the AID algorithm.
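The MTTD formula can be illustrated with a few lines of Python. The sketch assumes the difference is taken as detection time minus actual occurrence time, so that MTTD is non-negative, and that both times are expressed in the same unit; the function name and the three example incidents are hypothetical.

```python
from typing import Sequence


def mean_time_to_detect(detected: Sequence[float], occurred: Sequence[float]) -> float:
    """MTTD = (1/n) * sum(detection time - occurrence time) over n detected incidents."""
    if len(detected) != len(occurred) or not detected:
        raise ValueError("need one detection time per detected incident")
    return sum(ti - ai for ti, ai in zip(detected, occurred)) / len(detected)


# Hypothetical times in minutes since midnight for three detected incidents.
detection_times = [481.0, 600.5, 735.0]
occurrence_times = [480.0, 600.0, 733.5]
mttd = mean_time_to_detect(detection_times, occurrence_times)
```

Only detected incidents enter the average, so MTTD should be read together with the detection rate: a detector that misses slow-to-detect incidents can show a short MTTD while still having a low DR.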

Using the same test set, the incident detector trained on the original training set and the incident detector trained on the relatively balanced training set were each tested; the results are shown in Table 2.

Table 2 Detection results of the different incident detectors

Training set	DR/%	FAR/%	MTTD/min	CR/%
Original training set	85.71	0.48	3.30	92.29
After under-sampling	91.43	1.74	0.52	95.15

As Table 2 shows, the SVM classifier trained on the relatively balanced training set obtained with the under-sampling method based on the neighborhood cleaning rule classifies the same test set better: the detection rate DR rises from 85.71% to 91.43%, the mean time to detect MTTD falls from 3.30 min to 0.52 min, and the classification rate CR rises from 92.29% to 95.15%. The false alarm rate FAR, however, rises from 0.48% to 1.74%. This may be because under-sampling removes some majority class samples from the training set and part of the useful information is lost, so that when majority class samples are classified some normal fluctuations of the traffic parameters are misjudged as traffic incidents; the classification accuracy on majority class samples falls and the false alarm rate rises.

The invention proposes an automatic traffic incident detection method based on under-sampling for unbalanced data sets. Rebuilding the training set with the under-sampling method based on the neighborhood cleaning rule reduces the imbalance in the number of samples between classes in the training set; the three indexes DR, MTTD, and CR all improve, at the cost of a higher FAR, and the inability of existing AID algorithms to handle unbalanced real-world traffic data is overcome. The detection rate increases while the detection time is greatly shortened, which meets the real-time requirement of traffic incident detection, so the method can be applied to automatic detection of road traffic incidents.

Claims

1. An automatic traffic incident detection method based on under-sampling for unbalanced data sets, characterized in that the method comprises the following steps:

2) based on the neighborhood cleaning rule, under-sampling the majority class in the original training set obtained in said step 1) to reduce the imbalance of the training set and obtain a new, relatively balanced training set;

4) using the optimal values of the penalty factor C and the kernel parameter g of the support vector machine obtained in said step 3), training the support vector machine on the new, relatively balanced training set obtained in said step 2) to obtain an automatic traffic incident detection model for unbalanced data sets;

5) applying the trained automatic traffic incident detection model for unbalanced data sets to the test set obtained in said step 1) to detect traffic incidents automatically, and determining from the model output whether a traffic incident has occurred.

2. The automatic traffic incident detection method based on under-sampling for unbalanced data sets according to claim 1, characterized in that the measured traffic flow data in said step 1) comprise three types of data, namely speed, occupancy, and flow, measured by the detectors upstream and downstream of the detection section in each sampling period.

3. The automatic traffic incident detection method based on under-sampling for unbalanced data sets according to claim 1, characterized in that the min-max normalization method in said step 1) processes the measured traffic flow data according to the following formula:

\bar{x}_{ij} = \frac{x_{ij} - x_{\min j}}{x_{\max j} - x_{\min j}},

4. The automatic traffic incident detection method based on under-sampling for unbalanced data sets according to claim 1, characterized in that said step 2) proceeds as follows: for each sample x_i in the training set, find its three nearest neighbors and compare the class of x_i with the classes of those three neighbors; if x_i belongs to the majority class and two or three of its three neighbors are minority class samples, remove x_i from the training set, otherwise leave x_i unchanged and move on to the next sample in the training set; if x_i belongs to the minority class and two or three of its three neighbors are majority class samples, remove the majority class samples among these three neighbors from the training set, otherwise leave x_i unchanged and move on to the next sample in the training set; where i is the index of the sample in the training set, i = 1, 2, ..., n, and n is the total number of samples in the training set.

5. The automatic traffic incident detection method based on under-sampling for unbalanced data sets according to claim 1, 2, 3 or 4, characterized in that said step 3) proceeds as follows:

then, letting C and g vary over the ranges C = [2^{-10}, 2^{0}] and g = [2^{0}, 2^{10}] with a step of 0.5, and using cross validation to find the best values of the penalty factor C and the kernel parameter g within said optimal value ranges of C and g.

6. The automatic traffic incident detection method based on under-sampling for unbalanced data sets according to claim 1, 2, 3 or 4, characterized in that, in said step 5), an output of -1 from the automatic traffic incident detection model for unbalanced data sets indicates that traffic in the detection zone is currently operating normally; any other output indicates that a traffic incident has occurred.