CN112487961A - Traffic accident detection method, storage medium and equipment - Google Patents


Info

Publication number
CN112487961A
CN112487961A
Authority
CN
China
Prior art keywords
accident
traffic
accident detection
video
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011364426.7A
Other languages
Chinese (zh)
Inventor
白鑫贝
杨泽华
王耀威
徐勇
郑伟诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202011364426.7A
Publication of CN112487961A
Legal status: Pending


Classifications

    • G06V20/54 Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V2201/07 Target detection
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06F18/259 Fusion by voting
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a traffic accident detection method, a storage medium and a device. The method comprises the following steps: preprocessing screened traffic data to construct a traffic accident data set; regarding each original video sample in the traffic accident data set as a packet, and performing space-time domain segmentation on each packet to obtain a plurality of instances corresponding to each packet; constructing an accident detection model according to the weak-label attribute of the traffic accident data set, and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model; and performing end-to-end traffic accident detection on a test video with the trained accident detection model. The method provided by the invention can effectively improve the traffic accident detection rate and reduce the false alarm rate.

Description

Traffic accident detection method, storage medium and equipment
Technical Field
The present invention relates to the field of traffic accident detection, and in particular, to a traffic accident detection method, a storage medium, and a device.
Background
In recent years, the number of motor vehicles owned by residents has grown rapidly, urban traffic environments have become increasingly complex, and the incidence of traffic accidents remains high. Traffic accidents not only cause traffic jams and losses of personal and public property, but also endanger life and bring serious physical and psychological trauma. With the development and application of video surveillance technology, more and more road monitoring cameras are now installed in many cities to monitor road traffic states and parameters, strengthening the monitoring and management of urban traffic so that traffic accidents can be handled in time.
The traditional way of troubleshooting accidents in traffic surveillance is manual monitoring, in which staff watch surveillance video around the clock to spot accidents or anomalies. Because traffic accidents occur randomly and unpredictably, this approach not only consumes a large amount of manpower but is also affected by uncontrollable human factors such as visual resolution and fatigue, so its reliability may be low, and it cannot meet the efficiency and cost requirements of large-scale traffic monitoring network applications.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
In view of the above defects of the prior art, the technical problem to be solved by the invention is to provide a traffic accident detection method, a storage medium and a device, given that the traditional manual mode of traffic-monitoring accident troubleshooting has poor reliability and cannot meet the efficiency and cost requirements of large-scale traffic monitoring network applications.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a traffic accident detection method, comprising the steps of:
preprocessing the screened traffic data to construct a traffic accident data set;
regarding each original video sample in the traffic accident data set as a packet, and performing space-time domain segmentation on each packet to obtain a plurality of instances corresponding to each packet;
constructing an accident detection model according to the weak-label attribute of the traffic accident data set, and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model;
and performing end-to-end traffic accident detection on the test video according to the trained accident detection model.
In the traffic accident detection method, the step of preprocessing the screened traffic data to construct a traffic accident data set comprises:
acquiring UCF Crimes traffic data and CADP traffic data;
screening normal videos under all traffic scenes from the UCF Crimes traffic data to form a basic positive sample set;
screening accident videos under all traffic scenes from the UCF Crimes traffic data, and combining the accident videos with CADP traffic data to form a basic negative sample set;
and performing data cleaning on the basic positive sample set and the basic negative sample set to obtain a corresponding positive sample set and negative sample set, and dividing them in a preset proportion to obtain a training set and a test set.
In the traffic accident detection method, the step of dividing the positive sample set and the negative sample set in a preset proportion to obtain a training set and a test set comprises:
cutting overlong normal videos in the positive sample set of the training set so that the training set contains equal numbers of normal videos and accident videos, wherein the normal videos in the training set carry normal video-level labels and the accident videos in the training set carry accident video-level labels;
and cutting overlong normal videos in the positive sample set of the test set so that the ratio of normal videos to accident videos in the test set is 3-10:1, wherein the normal videos in the test set carry normal video-level labels, and the accident videos in the test set carry accident video-level labels together with the start and end frames of the accident.
In the traffic accident detection method, the step of performing space-time domain segmentation on the original video samples in the traffic accident data set to obtain a plurality of instances comprises:
regarding each video segment in the traffic accident data set as a packet, where a normal video is a negative packet, denoted B_n, and an accident video is a positive packet, denoted B_a;
dividing each packet uniformly in the time domain into several temporally contiguous, non-overlapping video segments, whose number is denoted N_T;
and further dividing or sampling each of these video segments in the spatial domain to obtain a plurality of instances, whose number is denoted N_S.
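The uniform temporal division of a packet can be sketched in a few lines of Python; operating on frame indices and the default segment count n_t = 32 are illustrative assumptions, not values fixed by the text.

```python
def temporal_segments(num_frames: int, n_t: int = 32):
    """Divide one video ("packet") uniformly in the time domain into N_T
    contiguous, non-overlapping frame ranges, one per future instance row.
    Segments differ in length by at most one frame when num_frames is not
    divisible by n_t."""
    bounds = [round(i * num_frames / n_t) for i in range(n_t + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_t)]
```

Each returned (start, end) range would then be divided or sampled in the spatial domain to yield the N_S instances per segment.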
In the traffic accident detection method, the step of constructing an accident detection model according to the weak-label attribute of the traffic accident data set and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model comprises:
performing a multi-modal feature extraction operation on each instance in a packet with a 3D feature extractor, capturing the temporal information and motion features related to the accident process in the video;
treating traffic accident detection as a regression problem when designing the accident detection model, the accident detection model being a deep neural network composed of several fully-connected layers;
constructing the loss function of the accident detection model by combining multiple-instance learning with a ranking concept, based on the assumption that a traffic accident is short-lived and temporally continuous; the expression is as follows:

L(B_a, B_n) = max(0, 1 - max_{i,j} f(x_{i,j}^a) + max_{i,j} f(x_{i,j}^n)) + λ_1 Σ_{i=1}^{N_T-1} (max_j f(x_{i,j}^a) - max_j f(x_{i+1,j}^a))^2 + λ_2 Σ_{i=1}^{N_T} max_j f(x_{i,j}^a)

where x_{i,j}^a and x_{i,j}^n respectively denote the feature vectors extracted from the i-th time segment and j-th picture region of the accident video and the normal video, f(x_{i,j}^a) and f(x_{i,j}^n) denote the corresponding accident detection model prediction scores, λ_1 and λ_2 are hyperparameters, and W denotes the network weight parameters;
the full loss function then simplifies to L = L(B_a, B_n) + ||W||^2, where L(B_a, B_n), the sum of the three terms above, depends directly on the input sample data;
during training, a batch of samples is selected in each iteration; assume the batch size is N_bs, with normal videos and accident videos each accounting for half, randomly drawn from the positive and negative sample sets of the training set respectively; by the definition of the loss function, each pair of one normal video and one accident video yields one loss value, so the loss over a training batch can be expressed as:

L = (2 / N_bs) Σ_{i=1}^{N_bs/2} L_i(B_a, B_n)
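The ranking-based multiple-instance loss described above can be sketched in plain Python on per-time-segment prediction scores (the spatial max over j is assumed to have been taken already). The λ defaults and function names are illustrative assumptions, and the ||W||^2 term is left to the optimizer's weight decay.

```python
def mil_ranking_loss(scores_acc, scores_norm, lam1=8e-5, lam2=8e-5):
    """Hedged sketch of the pairwise loss L(Ba, Bn).

    scores_acc / scores_norm: per-time-segment prediction scores for one
    accident (positive-packet) video and one normal (negative-packet) video.
    """
    # Ranking hinge: the highest-scoring accident segment should outrank
    # the highest-scoring normal segment by a margin of 1.
    hinge = max(0.0, 1.0 - max(scores_acc) + max(scores_norm))
    # Temporal smoothness: an accident is continuous, so scores of
    # adjacent accident-video segments should change gradually.
    smooth = sum((a - b) ** 2 for a, b in zip(scores_acc, scores_acc[1:]))
    # Sparsity: an accident is short-lived, so few segments score high.
    sparse = sum(scores_acc)
    return hinge + lam1 * smooth + lam2 * sparse


def batch_loss(pair_losses):
    """Mean of the per-pair losses over the N_bs/2 video pairs of a batch."""
    return sum(pair_losses) / len(pair_losses)
```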
the traffic accident detection method comprises the steps of constructing an accident detection model according to the weak label attribute of the traffic accident data set, training the accident detection model based on a plurality of instances corresponding to each packet, and obtaining a trained accident detection model, and further comprises the following steps:
in the course of training the accident detection model, sorting the values L_i(B_a, B_n) (i = 1, 2, …, N_bs/2) in descending order; assuming the hard-sample ratio is P_h, selecting the ⌈P_h · N_bs/2⌉ largest values of L_i(B_a, B_n) to calculate the final loss value;
denoting the selected loss values as L_{s,i}(B_a, B_n), i = 1, 2, …, ⌈P_h · N_bs/2⌉, the final loss value is calculated as:

L = (1 / ⌈P_h · N_bs/2⌉) Σ_{i=1}^{⌈P_h · N_bs/2⌉} L_{s,i}(B_a, B_n)
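The online hard-sample selection step, sorting the per-pair losses and averaging only the hardest fraction, can be sketched as follows. Taking the ceiling when P_h · N_bs/2 is fractional, and the function name, are assumptions.

```python
import math

def hard_sample_loss(pair_losses, p_h=0.5):
    """Sort the per-pair losses L_i(Ba, Bn) in descending order, keep the
    hardest ceil(p_h * len) pairs, and average only those, so training
    emphasises the samples that are hardest to detect and recognise."""
    n_keep = max(1, math.ceil(p_h * len(pair_losses)))
    hardest = sorted(pair_losses, reverse=True)[:n_keep]
    return sum(hardest) / n_keep
```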
the traffic accident detection method comprises the steps of constructing an accident detection model according to the weak label attribute of the traffic accident data set, training the accident detection model based on a plurality of instances corresponding to each packet, and obtaining a trained accident detection model, and further comprises the following steps:
performing iterative training on the accident detection model with a deep-learning optimizer algorithm until the loss function converges;
extracting a first feature directly from the RGB image sequence using a C3D pre-training model;
extracting a second feature directly from the RGB image sequence using the Kinetics pre-training model of I3D;
extracting a third feature from the optical-flow images, which are computed from the original frames, using the ImageNet pre-training model of I3D;
and training the accident detection model separately with the first feature, the second feature and the third feature to obtain three corresponding trained accident detection models.
In the traffic accident detection method, the step of performing end-to-end traffic accident detection on a test video according to the trained accident detection model comprises:
dividing each test video in the test set into a plurality of instances, denoted I_{i,j}, where i = 1, 2, …, N_T is the temporal index of the instance and j = 1, 2, …, N_S is the spatial index of the instance;
for each instance, computing the optical-flow map of its image frames and then extracting multi-modal feature information, including: extracting features from the RGB image sequence with the C3D model, denoted f_{i,j}^{C3D}; extracting features from the RGB image sequence with the I3D Kinetics pre-training model, denoted f_{i,j}^{I3D_RGB}; and extracting features from the optical-flow images with the I3D ImageNet pre-training model, denoted f_{i,j}^{I3D_FLOW};
inputting the three kinds of features into the corresponding trained accident detection models to predict anomaly scores, denoted S_{i,j}^{C3D}, S_{i,j}^{I3D_RGB} and S_{i,j}^{I3D_FLOW} respectively;
for each type of feature, fusing the prediction scores of the multiple instances obtained by spatial division and enhancement within the same time segment, for example by averaging over the spatial instances:

S_i^{C3D} = (1/N_S) Σ_{j=1}^{N_S} S_{i,j}^{C3D},  S_i^{I3D_RGB} = (1/N_S) Σ_{j=1}^{N_S} S_{i,j}^{I3D_RGB},  S_i^{I3D_FLOW} = (1/N_S) Σ_{j=1}^{N_S} S_{i,j}^{I3D_FLOW}

the score vector of the input video in the time domain is then obtained as:

S^{C3D} = [S_1^{C3D}, …, S_{N_T}^{C3D}], and likewise for S^{I3D_RGB} and S^{I3D_FLOW};
performing Min-Max normalization on S^{I3D_FLOW}, with the formula:

S̃^{I3D_FLOW} = (S^{I3D_FLOW} - min(S^{I3D_FLOW})) / (max(S^{I3D_FLOW}) - min(S^{I3D_FLOW}))
and in each time segment, fusing the scores of the multi-modal features by soft voting, for example with equal weights:

S_i^{final} = (S_i^{C3D} + S_i^{I3D_RGB} + S̃_i^{I3D_FLOW}) / 3

where S^{final} = [S_1^{final}, …, S_{N_T}^{final}] is the sequence of final anomaly scores of the input video over the time segments; when the anomaly score exceeds a certain threshold S_Th, it is determined that a traffic accident occurred during that time segment.
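The inference-time fusion just described, Min-Max normalising the optical-flow scores, soft-voting across the three models, and thresholding, can be sketched as follows. Equal voting weights and the function name are assumptions; the text does not fix the weights.

```python
def fuse_scores(s_c3d, s_i3d_rgb, s_i3d_flow, s_th=0.5):
    """Hedged sketch: Min-Max normalise the optical-flow score vector,
    average the three per-segment model scores (soft voting with equal
    weights, an assumption), and flag every time segment whose fused
    anomaly score exceeds the threshold S_Th."""
    lo, hi = min(s_i3d_flow), max(s_i3d_flow)
    span = (hi - lo) or 1.0  # guard against a constant score vector
    flow_norm = [(s - lo) / span for s in s_i3d_flow]
    fused = [(a + b + c) / 3.0 for a, b, c in zip(s_c3d, s_i3d_rgb, flow_norm)]
    return [score > s_th for score in fused], fused
```

For instance, fuse_scores([0.9, 0.1], [0.8, 0.2], [10.0, 0.0]) flags only the first time segment as containing an accident.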
A storage medium, wherein the computer-readable storage medium stores one or more programs executable by one or more processors to implement the steps of the traffic accident detection method of the present invention.
A traffic accident detection apparatus, comprising a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the steps of the traffic accident detection method of the present invention.
Advantageous effects: compared with the prior art, the traffic accident detection method provided by the invention has the following advantages. Based on existing traffic data sets that are large-scale, systematic and widely used, the data is screened, cleaned and recombined to construct a new traffic accident data set, and interference factors such as scene changes and non-surveillance footage are removed, which helps the algorithm learn genuine accident feature information. An accident detection model is established with a multiple-instance learning method; the original video is segmented in both the temporal and spatial dimensions and data enhancement is applied to obtain a plurality of instances, so that finer-grained and more accurate features are available for learning. An online hard-sample learning mechanism is added during training to emphasize the samples that are harder to detect and recognize, improving the algorithm's ability to distinguish normal videos from accident videos. Several 3D feature extractors are used to extract multi-modal visual feature information from the RGB image domain and the optical-flow domain, and the multi-model, multi-modal feature prediction scores are then fused, which effectively improves traffic accident detection performance, raising the detection rate and reducing the false-alarm rate. When a traffic accident occurs, the method provided by the invention can detect it automatically, quickly and accurately from traffic surveillance video, so that an alarm signal is issued in time after the accident, the accident response plan is started, rescue and traffic-relief work is carried out better, and loss of life and property is reduced.
Drawings
Fig. 1 is a flow chart of a traffic accident detection method according to a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of space-time domain partitioning of each packet according to the present invention.
FIG. 3 is a flow chart of end-to-end traffic accident detection of a test video according to the trained accident detection model of the present invention.
Fig. 4 is a schematic diagram of a traffic accident detecting apparatus of the present invention.
Detailed Description
The present invention provides a traffic accident detection method, a storage medium and a device, and in order to make the objects, technical solutions and effects of the present invention clearer and clearer, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The invention will be further explained by the description of the embodiments with reference to the drawings.
The traditional way of troubleshooting accidents in traffic surveillance is manual monitoring, in which staff watch surveillance video around the clock to spot accidents or anomalies. Because traffic accidents occur randomly and unpredictably, this approach not only consumes a large amount of manpower but is also affected by uncontrollable human factors such as visual resolution and fatigue, so its reliability may be low, and it cannot meet the efficiency and cost requirements of large-scale traffic monitoring network applications.
Therefore, how to detect a traffic accident automatically, quickly and accurately from traffic surveillance video when it occurs, so that an alarm signal is issued in time, the accident response plan is started, rescue and traffic-relief work is carried out better, and loss of life and property is reduced, is a key problem to be solved urgently in the field of traffic monitoring. With the development of computer vision and artificial intelligence technology, intelligent traffic systems have gradually emerged; intelligent video analysis technology, historical data and the like are applied to building and fitting accident detection models, providing an intelligent solution for traffic accident detection. One class of methods uses supervised learning and includes non-end-to-end and end-to-end processing. A non-end-to-end method chains together a series of steps such as background subtraction, detection and tracking, extracts manually designed features, and detects accidents according to a predefined rule set. An end-to-end method generally builds a deep learning model to extract finer semantic features and finally regresses an anomaly score; the data used by such methods are generally short videos lasting a few seconds, or accurate (for example, frame-level) label information is provided for the accident period in the video to complete model training, giving the method both accident recognition and localization capability. However, such methods usually need a large amount of training data and consume a great deal of manpower for data preprocessing such as video annotation and cutting, and at present there is no large-scale data set available for traffic accident detection.
Another class of methods uses unsupervised learning, based on reconstruction with an auto-encoder. An encoder produces a feature representation of an image frame, a decoder reconstructs the frame, the reconstruction error is computed, and the frame is judged abnormal when the reconstruction error exceeds a threshold. Such methods generally model the normal behavior pattern: the training set contains only normal videos and no annotation is needed. However, they are sensitive to changes in the video picture; sudden illumination changes or rapid movement of a large foreground object in a test video can cause large reconstruction errors, and since it is difficult for a training set to cover all normal behaviors, some atypical normal behaviors may also be misjudged as abnormal accidents.
In fact, an original traffic accident video often also contains periods of normal traffic, so such a video naturally constitutes weak-label data, that is, data whose label information is known only at coarse granularity, and weakly supervised learning is a very effective way to fully mine and utilize such data.
Based on this, the present invention provides a traffic accident detection method, as shown in fig. 1, which includes the steps of:
s10, preprocessing the screened traffic data to construct a traffic accident data set;
s20, regarding each original video sample in the traffic accident data set as a packet, and performing space-time domain segmentation on each packet to obtain a plurality of instances corresponding to each packet;
s30, constructing an accident detection model according to the weak-label attribute of the traffic accident data set, and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model;
and S40, performing end-to-end traffic accident detection on the test video according to the trained accident detection model.
This embodiment provides a traffic accident detection method based on weakly supervised learning, in which the accident detection model is designed in a targeted manner, combining the current state of the available data with an in-depth analysis and characterization of traffic accidents. Specifically, the embodiment screens, cleans and recombines existing large-scale, systematic and widely used traffic data sets to construct a new traffic accident data set, removing interference factors such as scene changes and non-surveillance footage, which helps the algorithm learn genuine accident feature information; the original video is segmented in the temporal and spatial dimensions and data enhancement is applied to obtain a plurality of instances, so that finer-grained and more accurate features are available for learning; an accident detection model is established with a multiple-instance learning method and an online hard-sample learning mechanism is added, emphasizing the samples that are harder to detect and recognize and improving the algorithm's ability to distinguish normal videos from accident videos; and several 3D feature extractors are used to extract multi-modal visual feature information from the RGB image domain and the optical-flow domain, after which the multi-model, multi-modal feature prediction scores are fused, which effectively improves traffic accident detection performance, that is, raises the traffic accident detection rate and reduces the false-alarm rate.
In some embodiments, traffic accidents are scoped and characterized, video data is filtered and processed, and a traffic accident data set is constructed. Specifically, since academia has not published an authoritative traffic accident data set for research in traffic accident detection, this embodiment first needs to construct one. The traffic accident referred to in this embodiment is a special traffic abnormal event that occurs instantly, is violent and is visually salient; it is relatively easy to identify from appearance characteristics such as the motion pattern or appearance change of the accident subject, for example a vehicle suddenly stopping, a sudden change of driving direction, or deformation of the vehicle body, and it is decoupled from the scene. Traffic or surveillance videos in the real world usually come from cameras deployed in different places and differ greatly in scene, angle, illumination and other picture factors, and because traffic accidents are sporadic and infrequent, it is difficult to obtain normal videos and accident videos of the same scene. Meanwhile, given the feature saliency and scene decoupling of traffic accidents, it is possible for an algorithm to learn deep semantic information that is independent of the scene, ensuring a certain cross-scene generalization capability; therefore, this embodiment uses videos from different domains to construct the traffic accident data set.
In this embodiment, the traffic accident data set is used to train an accident detection model built on deep learning theory. To ensure that the model can learn genuinely valuable accident feature information, to enhance its scene adaptability, and to guarantee the quality and completeness of the data set as far as possible, the construction criteria of the traffic accident data set include: first, a sufficiently large sample size; second, sufficiently rich scenes, illumination conditions and accident types covered by the samples; and third, sufficiently high sample quality, including video clarity and the proportion of the picture occupied by the accident region, with interference factors that might affect accident feature learning removed as far as possible.
Based on the above criteria, this embodiment constructs the traffic accident data set by combining the UCF Crimes traffic data and the CADP traffic data, which are currently public, large-scale and widely used. The UCF Crimes traffic data is an abnormal-behavior detection data set containing 1900 video segments with rich scenes, including surveillance and traffic videos; it covers 13 types of abnormal events, of which traffic accidents are one. The CADP traffic data is a traffic accident data set containing 1416 traffic accident videos of various types in multiple scenes.
In some embodiments, normal videos in all traffic scenes are screened out of the UCF Crimes traffic data to form a basic positive sample set, and accident videos in all traffic scenes are screened out of the UCF Crimes traffic data and combined with the CADP traffic data to form a basic negative sample set. In this embodiment, since many videos in the UCF Crimes traffic data have quality problems such as camera movement, switching, zooming, fast-forwarding and non-surveillance footage, data cleaning is performed on the basic positive and negative sample sets obtained in the previous step, deleting the parts of the videos with scene changes, zooming, non-surveillance footage or picture-quality problems to obtain a cleaned positive sample set and negative sample set; the cleaned sets are then divided in a predetermined proportion to obtain a training set and a test set, from which the traffic accident data set is constructed.
In some embodiments, longer normal videos are cut into several short segments, for example of 1 minute duration, and the video-level label is retained for each segment as the sample actually used. In this embodiment, the numbers of normal and accident videos in the training set of the constructed traffic accident data set are essentially equal, at 1500 segments each. The ratio of normal to accident videos in the test set is about 5:1, at 500 and 100 segments respectively; this avoids a large amount of fine-grained annotation work and, to some extent, reflects the fact that normal traffic in the real world is far more common than accidents. Each video in the training set carries only a video-level label of normal or abnormal; that is, the normal videos in the training set carry a normal video-level label, and the accident videos carry an accident video-level label. In addition to the video-level label, each video segment in the test set is annotated with the start frame and end frame of the traffic accident, used for frame-level performance evaluation during testing; that is, the normal videos in the test set carry the normal video-level label, and the accident videos carry the accident video-level label together with the start and end frames of the accident.
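For illustration, cutting a long normal video into 1-minute clips while retaining the video-level label amounts to computing consecutive frame ranges. The helper below is a hypothetical sketch (not the patent's code); the handling of a final fragment shorter than one minute is an assumption:

```python
def one_minute_segments(total_frames, fps):
    """Frame ranges [start, end) for consecutive 1-minute clips of a video.

    A trailing fragment shorter than one minute is kept as its own clip
    (an illustrative choice; the embodiment does not specify this detail).
    """
    per_clip = int(60 * fps)
    return [(s, min(s + per_clip, total_frames))
            for s in range(0, total_frames, per_clip)]

# A 3-minute video at 25 fps yields three 1500-frame clips.
segs = one_minute_segments(total_frames=4500, fps=25)
```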
In some embodiments, the original video samples in the traffic accident data set are partitioned in the space-time domain to obtain a plurality of instances. Specifically, as shown in fig. 2, each video in the traffic accident data set is regarded as a bag: a normal video is a negative bag, denoted B_n, and a video containing a traffic accident is a positive bag, denoted B_a. Each bag is divided uniformly in the time domain into a number of temporally consecutive but non-overlapping video segments (also called instances), whose number is denoted N_T. Each temporal segment is then partitioned spatially: with the four corners of the video frame as base points and the frame center as a fifth reference, five video segments of size w × h (width × height) are cropped, and these crops may overlap. Data enhancement is then applied by mirroring the five instances, which counteracts the effect of an unbalanced picture distribution.
After each temporally divided video segment undergoes spatial partitioning and data enhancement, 10 new instances are finally obtained; the number of spatial instances is denoted N_S. It should be noted that, within the space-time partitioning idea of the present invention, other spatial partitioning or data enhancement methods may be adopted as appropriate, yielding a correspondingly different number of instances. After the original video has been partitioned in the space-time domain, an instance from a positive bag is denoted I_a^{i,j} and an instance from a negative bag is denoted I_n^{i,j}, where i = 1, 2, …, N_T is the temporal index of the instance and j = 1, 2, …, N_S its spatial index.
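The spatial partition described above (five w × h crops anchored at the four corners and the center, each paired with its mirror image, giving N_S = 10 instances per temporal segment) can be sketched as follows; the crop-descriptor representation is an illustrative assumption:

```python
def spatial_instances(frame_w, frame_h, w, h):
    """Return 10 (x, y, w, h, mirrored) crop descriptors for one video segment.

    Five crops are anchored at the four corners and the center of the
    frame; mirroring each one doubles the count to N_S = 10. Crops may
    overlap, as the embodiment allows.
    """
    cx, cy = (frame_w - w) // 2, (frame_h - h) // 2
    anchors = [
        (0, 0),                      # top-left corner
        (frame_w - w, 0),            # top-right corner
        (0, frame_h - h),            # bottom-left corner
        (frame_w - w, frame_h - h),  # bottom-right corner
        (cx, cy),                    # center
    ]
    return [(x, y, w, h, m) for (x, y) in anchors for m in (False, True)]

crops = spatial_instances(frame_w=320, frame_h=240, w=224, h=160)
```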
In some embodiments, an accident detection regression model based on multi-instance learning is built, and the loss function is designed by combining the ranking concept with the short-duration and continuity assumptions about traffic accidents. Specifically, a 3D feature extractor performs multi-modal feature extraction on each instance in the bag, the better to capture the timing information, motion features and other cues related to the accident process in the video; the feature extractor includes, but is not limited to, C3D, Inception I3D, ResNet I3D and the like. The invention takes the C3D and Inception I3D networks as examples, and the extracted feature vectors serve as the input data of the accident detection model. Traffic accident detection is treated as a regression problem when designing the accident detection model, which is a deep neural network composed of several fully connected layers. The network structure adopted in this embodiment is: input layer, fully connected layer 1, dropout 1, fully connected layer 2, dropout 2, fully connected layer 3. The activation function of fully connected layer 1 is ReLU and that of fully connected layer 3 is sigmoid; the three fully connected layers have 512, 32 and 1 neurons in turn. The output of fully connected layer 3 is the output of the network and represents the anomaly score of an instance, in the range 0 to 1, a higher score indicating a higher probability of an accident. The number of neurons in the input layer equals the dimensionality of the feature vector extracted in the previous step: 4096 input nodes for C3D features and 1024 input nodes for I3D features.
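As a minimal sketch of the regression network described above (4096-dim input for C3D features, fully connected layers of 512, 32 and 1 neurons, ReLU on the first layer and sigmoid on the last), here is an inference-time forward pass in NumPy. The weight initialization is an illustrative assumption, and dropout is omitted because it is inactive at inference:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AccidentScorer:
    """FC1(512, ReLU) -> dropout1 -> FC2(32) -> dropout2 -> FC3(1, sigmoid).

    Dropout is only active during training; at inference it is the
    identity, so this forward pass omits it.
    """

    def __init__(self, in_dim=4096, rng=None):
        rng = rng or np.random.default_rng(0)
        # He-style initialization: an illustrative choice, not from the patent.
        self.W1 = rng.normal(0, np.sqrt(2.0 / in_dim), (in_dim, 512))
        self.W2 = rng.normal(0, np.sqrt(2.0 / 512), (512, 32))
        self.W3 = rng.normal(0, np.sqrt(2.0 / 32), (32, 1))
        self.b1, self.b2, self.b3 = np.zeros(512), np.zeros(32), np.zeros(1)

    def score(self, feats):
        """feats: (batch, in_dim) instance features -> (batch,) scores in (0, 1)."""
        h = relu(feats @ self.W1 + self.b1)
        h = h @ self.W2 + self.b2  # no activation is stated for FC2
        return sigmoid(h @ self.W3 + self.b3).ravel()

scores = AccidentScorer().score(np.random.default_rng(1).normal(size=(4, 4096)))
```

For I3D features the input dimensionality would be 1024 instead of 4096.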
In some embodiments, a loss function for the accident detection model is constructed by combining multi-instance learning with a ranking concept, based on the short-duration and continuity assumptions about traffic accidents; its expression is:

L = max(0, 1 − max_{i,j} f(F_a^{i,j}) + max_{i,j} f(F_n^{i,j})) + λ_1 · Σ_{i=1}^{N_T−1} (max_j f(F_a^{i,j}) − max_j f(F_a^{i+1,j}))^2 + λ_2 · Σ_{i=1}^{N_T} max_j f(F_a^{i,j}) + ||W||^2

In the formula, F_a^{i,j} and F_n^{i,j} respectively denote the feature vectors extracted from the i-th time segment and the j-th video picture region of the accident video bag and the normal video bag, f(F_a^{i,j}) and f(F_n^{i,j}) denote the corresponding accident detection model prediction scores, λ_1 and λ_2 are hyper-parameters, and W denotes the network weight parameters. The first term of the loss function is a max-margin loss: it minimizes the highest score in a normal video while maximizing the highest score in an accident video, so that accident time segments are detected as far as possible while the false alarm rate is reduced; taking the maximum over the scores of the instances of different time segments implies a ranking process. The second and third terms are constraint terms added on the basis of the temporal continuity of the anomaly score and the short-duration assumption on the occurrence of an accident, and the fourth term is a regularization term. The loss function can be further abbreviated as:
L = L(B_a, B_n) + ||W||^2
In the formula, L(B_a, B_n) is the sum of the first three terms of the loss function and depends directly on the input sample data. During training, a batch of samples is selected in each iteration; assume the batch size is N_bs, with normal videos and accident videos each accounting for half, drawn at random from the positive and negative sample sets of the training set respectively. By the definition of the loss function, a loss value is computed from one normal video paired with one accident video, so the loss over a training batch can be expressed as:

L = (2 / N_bs) · Σ_{i=1}^{N_bs/2} L_i(B_a, B_n) + ||W||^2
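The ranked multi-instance loss for one accident/normal bag pair can be sketched in NumPy as below. Using the spatial maximum inside the temporal smoothness and sparsity terms, and the specific λ values, are assumptions (the patent's equation image does not fix these details); the ||W||^2 regularization term is left to the optimizer's weight decay:

```python
import numpy as np

def mil_ranking_loss(scores_a, scores_n, lam1=8e-5, lam2=8e-5):
    """Loss for one (accident bag, normal bag) pair.

    scores_a, scores_n: (N_T, N_S) arrays of instance anomaly scores for
    the positive (accident) and negative (normal) bags. lam1/lam2 weight
    the temporal-smoothness and sparsity constraints (illustrative values).
    """
    # Max-margin ranking term: the highest accident score should exceed
    # the highest normal score by a margin of 1.
    hinge = max(0.0, 1.0 - scores_a.max() + scores_n.max())
    # Per-time-segment score of the accident bag (max over spatial instances).
    seg = scores_a.max(axis=1)
    smooth = np.sum(np.diff(seg) ** 2)  # temporal continuity of the anomaly score
    sparse = np.sum(seg)                # short-duration assumption: few high scores
    return hinge + lam1 * smooth + lam2 * sparse

loss = mil_ranking_loss(np.array([[0.9, 0.8], [0.2, 0.1]]),
                        np.array([[0.1, 0.0], [0.2, 0.1]]))
```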
in some embodiments, an online difficult sample learning mechanism is added in the model training process, so that the samples with high identification difficulty are emphasized to be learned, and the discrimination capability of the algorithm on normality and accidents is improved. Specifically, L isi(Ba,Bn)(i=1,2,K,Nbs/2) in descending order, assuming a difficult sample ratio of PhBefore selection
Figure BDA0002805011610000127
L with the largest valuei(Ba,Bn) To calculate a final loss value;
the loss value of the selected strain was designated as Ls,i(Ba,Bn),i=1,2,K,
Figure BDA0002805011610000128
The final loss value is calculated by the formula:
Figure BDA0002805011610000131
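The hard-sample selection step described above can be sketched as follows; averaging over the selected losses (rather than summing) is an assumption, and the regularization term is again left to the optimizer:

```python
import math
import numpy as np

def hard_sample_loss(pair_losses, p_h=0.5):
    """Online hard-sample mining over the per-pair losses of one batch.

    pair_losses: the values L_i(B_a, B_n) for the N_bs/2 pairs in the batch.
    p_h: hard-sample ratio; the ceil(p_h * len) largest losses are kept.
    """
    k = math.ceil(p_h * len(pair_losses))
    # Sort descending and keep the k hardest (largest-loss) pairs.
    hardest = np.sort(np.asarray(pair_losses, dtype=float))[::-1][:k]
    return hardest.mean()

final = hard_sample_loss([0.1, 0.9, 0.4, 0.6], p_h=0.5)
```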
in some embodiments, the accident detection model is trained using multi-modal features extracted using multiple 3D feature extractors on the RGB image domain and the optical flow domain of the example, respectively, resulting in multiple accident detection models. Specifically, multi-modal feature vectors are extracted from video segments (examples) obtained by original video sample space-time domain segmentation and are used for training an accident detection model; the deep learning optimizer algorithm during the training of the accident detection model can adopt a random gradient descent method, a momentum optimization method, an Adam algorithm, an AdaDelta algorithm and the like. The training data used by the accident detection model is a multi-modal feature vector extracted from a video clip (example) obtained by partitioning an original video sample in a space-time domain, and the invention uses three kinds of feature data, wherein the feature data is extracted directly from an RGB image sequence by adopting a C3D pre-training model, the feature data is extracted directly from the RGB image sequence by adopting an I3D Kinetics pre-training model, and the feature data is used for training respectively on an optical flow image obtained by calculation according to the original image by adopting an I3D ImageNet pre-training model to obtain three accident detection models.
In some embodiments, as shown in fig. 3, end-to-end traffic accident detection on a test video comprises the following steps: divide the test video into a plurality of instances and extract multi-modal feature information for all of them; for the features of each modality, predict anomaly scores with the corresponding accident detection model, fuse the anomaly scores of all instances obtained by spatial partitioning and data enhancement within the same time segment, and normalize the optical flow prediction scores over all time segments of the video; and for each time segment, fuse the multi-modal feature prediction scores by averaging to obtain a final score, which is compared with a preset threshold to decide whether a traffic accident has occurred.
Specifically, a piece of video is taken as the test input and divided into a plurality of instances, denoted I_{i,j}, according to the space-time partitioning and data enhancement method described above, where i = 1, 2, …, N_T is the temporal index of the instance and j = 1, 2, …, N_S its spatial index;
for each instance, an optical flow map of the image frames is computed, and then multi-modal feature information is extracted, comprising: features extracted from the RGB image sequence with the C3D model, denoted F_C3D^{i,j}; features extracted from the RGB image sequence with the I3D Kinetics pre-training model, denoted F_I3D_RGB^{i,j}; and features extracted from the optical flow images with the I3D ImageNet pre-training model, denoted F_I3D_FLOW^{i,j}.
The three kinds of features obtained in the previous step are fed into the corresponding trained accident detection models to predict anomaly scores, denoted in turn:

S_C3D^{i,j}, S_I3D_RGB^{i,j}, S_I3D_FLOW^{i,j}
For each kind of feature, the prediction scores of the instances obtained by spatial partitioning and enhancement within the same time segment are fused, as follows:

S_C3D^i = (1/N_S) · Σ_{j=1}^{N_S} S_C3D^{i,j}
S_I3D_RGB^i = (1/N_S) · Σ_{j=1}^{N_S} S_I3D_RGB^{i,j}
S_I3D_FLOW^i = (1/N_S) · Σ_{j=1}^{N_S} S_I3D_FLOW^{i,j}

The score vectors of the input video in the time domain are thus obtained:

S_C3D = [S_C3D^1, …, S_C3D^{N_T}], S_I3D_RGB = [S_I3D_RGB^1, …, S_I3D_RGB^{N_T}], S_I3D_FLOW = [S_I3D_FLOW^1, …, S_I3D_FLOW^{N_T}]
To avoid the problems that the distribution of the I3D optical flow feature prediction scores is too concentrated and that their score baseline is inconsistent with that of the RGB features, Min-Max normalization is applied to S_I3D_FLOW, using the formula:

S'_I3D_FLOW^i = (S_I3D_FLOW^i − min(S_I3D_FLOW)) / (max(S_I3D_FLOW) − min(S_I3D_FLOW))
In each time segment (one instance of the temporal partitioning), the scores of the multi-modal features are fused together by soft voting, using the formula:

S^i = (S_C3D^i + S_I3D_RGB^i + S'_I3D_FLOW^i) / 3

where S = [S^1, …, S^{N_T}] is a sequence representing the final anomaly score of the input video in each time segment; when the anomaly score exceeds a certain threshold S_Th, a traffic accident is considered to have occurred in that time segment. The videos in the test set can be tested in batch according to this procedure, and performance indexes of the algorithm, such as the AUC, detection rate and false alarm rate, can be evaluated. In a real-time processing mode, the anomaly score is predicted by executing the same flow.
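The fusion pipeline above (spatial averaging per modality, Min-Max normalization of the optical flow scores, soft voting, thresholding) can be sketched in NumPy; the equal-weight three-way average follows the embodiment, while the threshold value here is an illustrative assumption:

```python
import numpy as np

def fuse_and_detect(s_c3d, s_i3d_rgb, s_i3d_flow, threshold=0.5):
    """Fuse per-instance anomaly scores into per-time-segment decisions.

    Each input is an (N_T, N_S) array of instance scores for one modality.
    Returns (final_scores, accident_flags), both of length N_T.
    """
    # 1. Spatial fusion: average over the N_S spatial instances per segment.
    c3d = s_c3d.mean(axis=1)
    rgb = s_i3d_rgb.mean(axis=1)
    flow = s_i3d_flow.mean(axis=1)
    # 2. Min-Max normalize the optical-flow scores over the time axis so
    #    their baseline matches the RGB-based scores.
    flow = (flow - flow.min()) / (flow.max() - flow.min() + 1e-12)
    # 3. Soft voting: equal-weight average of the three modalities.
    final = (c3d + rgb + flow) / 3.0
    # 4. Threshold each time segment to decide whether an accident occurred.
    return final, final > threshold

scores, flags = fuse_and_detect(
    np.array([[0.10, 0.20], [0.80, 0.90]]),
    np.array([[0.20, 0.20], [0.70, 0.90]]),
    np.array([[0.40, 0.40], [0.55, 0.65]]),
)
```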
In this embodiment, a section of video is regarded as a bag: an accident video is a positive bag and a normal video a negative bag, and each bag is divided into a plurality of instances (video clips) along the two dimensions of time and space. For each instance, feature information is extracted from the RGB images with the C3D and I3D networks and from the optical flow images with the I3D network, and the three kinds of feature information are fed into accident detection networks in the form of multi-layer perceptrons to obtain anomaly scores. Then, for each kind of feature, the scores of the instances obtained by spatially partitioning the same time segment are averaged to obtain the anomaly score of that segment; the anomaly scores of the successive time segments (corresponding to the temporal instances) form a vector, and max-min normalization is applied to the I3D optical flow anomaly score vector. Finally, in each time segment, the scores of the three kinds of features are fused together by soft voting, and when the anomaly score exceeds a threshold a traffic accident is considered to have occurred in that segment. Because the features differ in the information they represent and in dimensionality, the C3D RGB features, the I3D RGB features and the I3D optical flow features are used during training to train three independent accident detection networks, yielding three accident detection models. The accident detection model is trained by multi-instance learning, with video-level labels (whether an accident occurs in a video) as the supervision information; in addition, an online hard-sample learning mechanism is added to improve the ability to discriminate traffic accidents as far as possible.
In some embodiments, a storage medium is also provided, wherein the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of the traffic accident detection method of the present invention.
In some embodiments, there is also provided a traffic accident detection apparatus, as shown in fig. 4, comprising at least one processor (processor) 20; a display screen 21; and a memory (memory)22, and may further include a communication Interface (Communications Interface)23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area and a data storage area; the program storage area may store the operating system and the application program required by at least one function, while the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include high-speed random access memory and may also include non-volatile memory, for example any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk; it may also be a transitory storage medium.
In addition, the specific processes loaded and executed by the storage medium and by the instruction processors in the terminal device are described in detail in the method above and are not repeated here.
In conclusion, the invention screens, cleans and recombines the data to construct a new traffic accident data set, eliminating data interference factors such as scene changes and non-surveillance footage, which helps the algorithm learn real accident feature information. An accident detection model is established by a multi-instance learning method: the original video is partitioned along the two dimensions of time and space and data enhancement is applied, obtaining a plurality of instances so that finer-grained and more accurate features can be learned. An online hard-sample learning mechanism is added during training, focusing learning on the samples that are harder to detect and identify, thereby improving the algorithm's ability to distinguish normal videos from accident videos. Several 3D feature extractors (such as C3D and Inception I3D) extract multi-modal visual feature information from the RGB image domain and the optical flow domain respectively, and the multi-model, multi-modal feature prediction scores are then fused, which effectively improves the performance of the algorithm, namely raising the accident detection rate and reducing the false alarm rate.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of traffic accident detection, comprising the steps of:
preprocessing the screened traffic data to construct a traffic accident data set;
regarding each original video sample in the traffic accident data set as a packet, and performing space-time domain segmentation on each packet to obtain a plurality of instances corresponding to each packet;
constructing an accident detection model according to the weak label attribute of the traffic accident data set, and training the accident detection model based on a plurality of examples corresponding to each packet to obtain a post-training accident detection model;
and performing end-to-end traffic accident detection on the test video according to the trained accident detection model.
2. The traffic accident detection method of claim 1, wherein the step of preprocessing the filtered traffic data to construct a traffic accident data set comprises:
acquiring UCF Crimes traffic data and CADP traffic data;
screening normal videos under all traffic scenes from the UCF Crimes traffic data to form a basic positive sample set;
screening accident videos under all traffic scenes from the UCF Crimes traffic data, and combining the accident videos with CADP traffic data to form a basic negative sample set;
and carrying out data cleaning on the basic positive sample set and the basic negative sample set to obtain corresponding positive sample sets and negative sample sets, and dividing the positive sample sets and the negative sample sets according to a preset proportion to obtain training sets and test sets.
3. The traffic accident detection method of claim 2, wherein the dividing the positive sample set and the negative sample set according to the predetermined ratio to obtain a training set and a testing set further comprises:
cutting over-long normal videos in the positive sample set of the training set so that the numbers of normal videos and accident videos in the training set are equal, wherein the normal videos in the training set comprise normal video-level labels, and the accident videos in the training set comprise accident video-level labels;
and cutting over-long normal videos in the positive sample set of the test set so that the ratio of the number of normal videos to accident videos in the test set is 3-10:1, wherein the normal videos in the test set comprise normal video-level labels, and the accident videos in the test set comprise accident video-level labels together with the start frames and end frames of the accidents.
4. The traffic accident detection method of claim 1, wherein the step of performing space-time domain segmentation on the original video samples in the traffic accident data set to obtain a plurality of instances comprises:
regarding each video segment in the traffic accident data set as a bag, a normal video segment being a negative bag, denoted B_n, and an accident video being a positive bag, denoted B_a;
dividing each bag uniformly in the time domain into several temporally consecutive, non-overlapping video segments, the number of which is denoted N_T;
and further dividing or sampling each temporally divided video segment in the spatial domain to obtain a plurality of instances, the number of which is denoted N_S.
5. The traffic accident detection method of claim 4, wherein the step of constructing an accident detection model according to the weak label attributes of the traffic accident dataset and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model comprises:
performing multi-mode feature extraction operation on each instance in the packet by adopting a 3D feature extractor, and capturing time sequence information and motion features related to an accident occurrence process in a video;
considering traffic accident detection as a regression problem to design an accident detection model, wherein the accident detection model is a deep neural network formed by a plurality of fully-connected layers;
constructing a loss function of the accident detection model by combining multi-instance learning with a ranking concept, based on short-duration and continuity assumptions about traffic accidents, with the expression:

L = max(0, 1 − max_{i,j} f(F_a^{i,j}) + max_{i,j} f(F_n^{i,j})) + λ_1 · Σ_{i=1}^{N_T−1} (max_j f(F_a^{i,j}) − max_j f(F_a^{i+1,j}))^2 + λ_2 · Σ_{i=1}^{N_T} max_j f(F_a^{i,j}) + ||W||^2

where F_a^{i,j} and F_n^{i,j} respectively denote the feature vectors extracted from the i-th time segment and the j-th video picture region of the accident video and the normal video, f(F_a^{i,j}) and f(F_n^{i,j}) denote the corresponding accident detection model prediction scores, λ_1 and λ_2 are hyper-parameters, and W denotes the network weight parameters;
the loss function is further simplified to: L = L(B_a, B_n) + ||W||^2, where L(B_a, B_n) is the sum of the first three terms of the loss function and depends directly on the input sample data;
during training, a batch of samples is selected in each iteration; assuming the batch size is N_bs, with normal videos and accident videos each accounting for half, drawn at random from the positive and negative sample sets of the training set respectively, a loss value is computed from one normal video paired with one accident video according to the definition of the loss function, so that the loss over a training batch can be expressed as:

L = (2 / N_bs) · Σ_{i=1}^{N_bs/2} L_i(B_a, B_n) + ||W||^2
6. the traffic accident detection method of claim 5, wherein the step of constructing an accident detection model according to the weak label attributes of the traffic accident dataset and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model further comprises:
in the course of training the accident detection model, sorting the pair losses L_i(B_a, B_n), i = 1, 2, …, N_bs/2, in descending order; assuming the hard-sample ratio is P_h, selecting the ⌈P_h · N_bs/2⌉ values L_i(B_a, B_n) with the largest magnitudes to calculate the final loss value;
denoting the selected losses by L_s,i(B_a, B_n), i = 1, 2, …, ⌈P_h · N_bs/2⌉, the final loss value is calculated by the formula:

L = (1 / ⌈P_h · N_bs/2⌉) · Σ_{i=1}^{⌈P_h · N_bs/2⌉} L_s,i(B_a, B_n) + ||W||^2
7. the traffic accident detection method of claim 6, wherein the step of constructing an accident detection model according to the weak label attributes of the traffic accident dataset and training the accident detection model based on the plurality of instances corresponding to each packet to obtain a trained accident detection model further comprises:
performing iterative training on the accident detection model by adopting a deep learning optimizer algorithm until the loss function is converged;
extracting a first feature directly on the RGB image sequence by using a C3D pre-training model;
extracting a second feature directly on the RGB image sequence by using a Kinetics pre-training model of I3D;
extracting a third feature on the optical flow image by adopting an ImageNet pre-training model of I3D according to the calculation of the original image;
and respectively training the accident detection models by using the first characteristic, the second characteristic and the third characteristic to obtain three corresponding post-training accident detection models.
8. The traffic accident detection method of claim 7, wherein the step of performing end-to-end traffic accident detection on the test video according to the trained accident detection model comprises:
dividing each test video in the test set into a plurality of instances, denoted I_{i,j}, where i = 1, 2, …, N_T is the temporal index of the instance and j = 1, 2, …, N_S its spatial index;
for each instance, computing an optical flow map of the image frames and then extracting multi-modal feature information, comprising: features extracted from the RGB image sequence with the C3D model, denoted F_C3D^{i,j}; features extracted from the RGB image sequence with the I3D Kinetics pre-training model, denoted F_I3D_RGB^{i,j}; and features extracted from the optical flow images with the I3D ImageNet pre-training model, denoted F_I3D_FLOW^{i,j};
inputting the three kinds of features into the corresponding post-training accident detection models respectively and predicting anomaly scores, denoted in turn:

S_C3D^{i,j}, S_I3D_RGB^{i,j}, S_I3D_FLOW^{i,j};
for each kind of feature, fusing the prediction scores of the instances obtained by spatial partitioning and enhancement within the same time segment, as follows:

S_C3D^i = (1/N_S) · Σ_{j=1}^{N_S} S_C3D^{i,j}
S_I3D_RGB^i = (1/N_S) · Σ_{j=1}^{N_S} S_I3D_RGB^{i,j}
S_I3D_FLOW^i = (1/N_S) · Σ_{j=1}^{N_S} S_I3D_FLOW^{i,j}

obtaining the score vectors of the input video in the time domain:

S_C3D = [S_C3D^1, …, S_C3D^{N_T}], S_I3D_RGB = [S_I3D_RGB^1, …, S_I3D_RGB^{N_T}], S_I3D_FLOW = [S_I3D_FLOW^1, …, S_I3D_FLOW^{N_T}];
performing Min-Max normalization on S_I3D_FLOW, with the formula:

S'_I3D_FLOW^i = (S_I3D_FLOW^i − min(S_I3D_FLOW)) / (max(S_I3D_FLOW) − min(S_I3D_FLOW));
and in each time segment, fusing the scores of the multi-modal features by soft voting, with the formula:

S^i = (S_C3D^i + S_I3D_RGB^i + S'_I3D_FLOW^i) / 3

where S = [S^1, …, S^{N_T}] is a sequence representing the final anomaly score of the input video in each time segment; when the anomaly score exceeds a certain threshold S_Th, judging that a traffic accident has occurred in that time segment.
9. A computer-readable storage medium, wherein the storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the steps in the traffic accident detection method according to any one of claims 1-8.
10. A traffic accident detection apparatus, comprising a processor adapted to implement instructions; and a storage medium adapted to store a plurality of instructions adapted to be loaded by the processor and to perform the steps of the traffic accident detection method according to any one of claims 1 to 8.
CN202011364426.7A 2020-11-27 2020-11-27 Traffic accident detection method, storage medium and equipment Pending CN112487961A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011364426.7A CN112487961A (en) 2020-11-27 2020-11-27 Traffic accident detection method, storage medium and equipment


Publications (1)

Publication Number Publication Date
CN112487961A true CN112487961A (en) 2021-03-12

Family

ID=74936646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011364426.7A Pending CN112487961A (en) 2020-11-27 2020-11-27 Traffic accident detection method, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112487961A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808392A (en) * 2021-08-24 2021-12-17 东南大学 Method for optimizing traffic accident data under multi-source data structure
CN113808392B (en) * 2021-08-24 2022-04-01 东南大学 Method for optimizing traffic accident data under multi-source data structure
CN115985105A (en) * 2023-03-17 2023-04-18 成都茂智科技有限公司 Highway construction maintenance safety management system and method
CN115985105B (en) * 2023-03-17 2023-09-05 成都茂智科技有限公司 Highway construction maintenance safety management system and method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination