CN114202733A - Video-based traffic fault detection method and device - Google Patents


Info

Publication number
CN114202733A
Authority
CN
China
Prior art keywords
traffic
vehicle
detection
target
accident
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210148570.XA
Other languages
Chinese (zh)
Inventor
陈维强
王雯雯
冯远宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense TransTech Co Ltd
Original Assignee
Hisense TransTech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense TransTech Co Ltd filed Critical Hisense TransTech Co Ltd
Priority to CN202210148570.XA priority Critical patent/CN114202733A/en
Publication of CN114202733A publication Critical patent/CN114202733A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of intelligent traffic and provides a video-based traffic fault detection method and device. The vehicle detection model uses dilated convolution and mixed pooling, improving the accuracy of target detection, and each traffic parameter in the traffic parameter set accurately describes the characteristics of a traffic accident, so traffic accidents can be judged automatically from the traffic parameter sets and reported in time, facilitating the maintenance of traffic order; the whole accident detection process requires no manual participation, saving labor cost.

Description

Video-based traffic fault detection method and device
Technical Field
The application relates to the technical field of intelligent traffic, and in particular to a video-based traffic fault detection method and device.
Background
With the development of road traffic, road construction has become increasingly complete, effectively reducing the time required for travel and transportation and facilitating interconnection between cities.
Vehicles on highways travel at speeds as high as 80-120 km/h. Once a traffic accident happens, vehicles and public property (such as guard rails and traffic signs) are damaged, traffic jams often result, and subsequent vehicles may even collide in a chain, endangering the lives of drivers and passengers.
At present, traffic accident detection in the high-speed driving state relies mainly on manual inspection or reports from the public, so accidents may not be discovered in time; as a result, accident handling is inefficient, linkage response is slow, accident prompts are untimely, and the probability of secondary traffic accidents increases.
Disclosure of Invention
The embodiment of the application provides a traffic accident detection method and device based on videos, and the method and device are used for improving the accuracy and the real-time performance of traffic accident detection.
In one aspect, an embodiment of the present application provides a traffic accident detection method based on a video, including:
acquiring a traffic video stream in a detection period;
for each video frame, detecting a preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, wherein a loss function of the vehicle detection model comprises a balance coefficient among positive training samples in a positive training sample set used by the vehicle detection model, and the balance coefficient is used for increasing a loss value of a first type of positive samples in the positive training sample set and reducing a loss value of a second type of positive samples in the positive training sample set;
acquiring a traffic parameter set of each target vehicle positioned on the same route in the detection period, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of a traffic accident;
and weighting each traffic parameter in the traffic parameter set according to a preset weight, and determining whether a traffic accident occurs according to a weighted result.
In another aspect, an embodiment of the present application provides a detection device, which includes a processor, a memory, and a communication interface, where the communication interface, the memory, and the processor are connected through a bus:
the memory stores a computer program, and the processor performs the following operations according to the computer program:
acquiring a traffic video stream in a detection period through the communication interface;
for each video frame, detecting a preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, wherein a loss function of the vehicle detection model comprises a balance coefficient among positive training samples in a positive training sample set used by the vehicle detection model, and the balance coefficient is used for increasing a loss value of a first type of positive samples in the positive training sample set and reducing a loss value of a second type of positive samples in the positive training sample set;
acquiring a traffic parameter set of each target vehicle positioned on the same route in the detection period, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of a traffic accident;
and weighting each traffic parameter in the traffic parameter set according to a preset weight, and determining whether a traffic accident occurs according to a weighted result.
Optionally, the processor detects a preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, and the specific operations are as follows:
respectively extracting deep semantic features and shallow position scale features in a preset detection area of the video frame through a plurality of residual error units of the vehicle detection model, wherein at least one residual error unit comprises a cavity convolution kernel;
performing mixed pooling on the deep semantic features to obtain mixed deep semantic features;
upsampling the mixed deep semantic features;
determining the target probability of each target vehicle in the preset detection area according to the up-sampled mixed deep semantic features and the at least one shallow position scale feature;
and obtaining at least one target vehicle according to the determined target probabilities.
Optionally, the processor determines, according to the upsampled mixed deep semantic feature and the at least one shallow position scale feature, a probability that the preset detection region includes at least one target vehicle, and specifically:
performing dimensionality reduction on the up-sampled mixed deep semantic features, and determining a first probability that at least one target vehicle is contained in a preset detection area according to the dimensionality-reduced mixed deep semantic features;
determining a second probability that at least one target vehicle is contained in the preset detection area according to the up-sampled mixed deep semantic features and the first shallow position scale features;
the first shallow position scale feature is subjected to up-sampling, and a third probability that at least one target vehicle is contained in a preset detection area is determined by combining the mixed deep semantic feature and the second shallow position scale feature after the up-sampling;
and determining the target probability of at least one target vehicle in the preset detection area according to the first probability, the second probability and the third probability.
Optionally, the traffic parameter set includes at least one of the number of stationary target vehicles in each video frame, a stationary time difference between stationary target vehicles, a degree of coincidence between target vehicles in each video frame, and a traveling speed of the target vehicles.
Optionally, after determining that the traffic accident occurs, the processor further performs the following operations:
and reporting at least one item of accident information in the position information, the vehicle information, the accident time and the accident type corresponding to the traffic accident.
Optionally, when multiple traffic accidents occur on different routes in the detection period, the processor further performs the following operations:
determining an accident grade of each traffic accident;
and reporting accident information in order according to the accident grade.
Optionally, the vehicle detection model is obtained by training in the following manner:
acquiring a training sample set;
inputting a plurality of training samples in the training sample set and a pre-labeled real label corresponding to each training sample into an initial vehicle detection model, and obtaining the vehicle detection model with a detection loss value within a preset range through multiple rounds of iterative training; wherein, for each round of training, the following operations are performed:
extracting at least one shallow layer feature vector and deep layer feature vector of each training sample through an initial vehicle detection model;
obtaining a prediction label of a corresponding training sample according to at least one shallow layer feature vector and the deep layer feature vector of each training sample;
determining a detection loss value according to the real label and the prediction label corresponding to each training sample;
and adjusting the parameters of the initial vehicle detection model according to the detection loss value.
Optionally, the loss function for calculating the detection loss value is:
L_det = -Σ_i α_i · β_i · [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
wherein L_det represents the detection loss value, α_i represents the balance coefficient between positive and negative training samples, β_i represents the balance coefficient between positive training samples containing different types of vehicles, y_i represents the true label of each training sample, and ŷ_i represents the predicted label of each training sample.
In another aspect, the present application provides a computer-readable storage medium storing computer-executable instructions for causing a computer to execute a video-based traffic accident detection method provided in an embodiment of the present application.
The beneficial effects of the embodiment of the application are as follows:
In the embodiment of the application, for each video frame in the traffic video stream acquired in the detection period, a trained vehicle detection model detects the preset detection area of the video frame to obtain at least one target vehicle, which reduces the interference of complex backgrounds and improves the accuracy of target detection. When the vehicle detection model is trained, the differing numbers of different types of vehicles would cause an imbalance among the positive training samples, so a balance coefficient is added to the loss function of the model; the coefficient increases the loss value of first-type positive samples (vehicle types with fewer samples) and reduces the loss value of second-type positive samples (vehicle types with more samples), which speeds up model convergence and effectively improves the accuracy of target detection. Furthermore, after at least one target vehicle is detected, a traffic parameter set is obtained for each target vehicle on the same route in the detection period, and each traffic parameter in the set describes the characteristics of a traffic accident, so the weighted result of the traffic parameters can be used to detect traffic accidents accurately and in real time.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart illustrating a method for training a vehicle detection model provided by an embodiment of the present application;
FIG. 2 is a block diagram illustrating a feature extraction module of a vehicle detection model provided by an embodiment of the application;
FIG. 3 illustrates an overall framework diagram of a vehicle detection model provided by an embodiment of the present application;
FIG. 4 illustrates a spatial pyramid pooling schematic provided by embodiments of the present application;
FIG. 5 illustrates a stripe pooling scheme provided by embodiments of the present application;
FIG. 6 illustrates a flow chart of a video-based traffic accident detection method provided by an embodiment of the present application;
FIG. 7 illustrates a vehicle detection flow chart provided by an embodiment of the present application;
FIG. 8 is a diagram illustrating the effects of traffic accident detection provided by embodiments of the present application;
fig. 9 is a diagram schematically illustrating a structure of a detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Traffic accidents involving vehicles in a high-speed driving state cannot be detected in time by manual inspection or public reports alone. As road construction becomes increasingly complete, cameras installed by traffic management departments are now common along roads, which has promoted the development of real-time, video-based traffic accident detection.
At present, when video is analyzed to detect traffic accidents, the accident types are various (a vehicle hitting a roadside guardrail, a rear-end collision between two vehicles, a vehicle fire, and so on), accident definitions lack unified descriptive features, and the features of some slight accidents are not obvious, so accident early warning is mostly issued after a driver's dwell time outside the vehicle exceeds a time threshold. However, because vehicle speeds are high in the high-speed driving state, a violent collision may leave the driver unable to get out of the vehicle, in which case no dwell time can be obtained. Current detection algorithms therefore cannot accurately detect traffic accidents in the high-speed driving state, and accident vehicles cannot be tracked accurately nor traffic order maintained.
In view of this, embodiments of the present application provide a video-based traffic accident detection method and device. A traffic video stream collected in real time by acquisition devices on the road is obtained according to a set detection period; for each video frame, traffic accidents of vehicles in a high-speed driving state (collisions, rear-end collisions, and the like) are detected accurately through target detection, target tracking, and traffic-accident logic judgment, and are reported to the traffic system in time, so that the traffic management department can handle them efficiently and maintain stable traffic operation. The method detects traffic accidents automatically without manual participation, saving labor cost.
In the embodiment of the present application, in order to perform target detection on each acquired video frame, a set of training samples for training a vehicle detection model needs to be collected in advance.
At present, highways are generally equipped with acquisition devices, including but not limited to intersection cameras, electronic-police (enforcement) cameras, and checkpoint (bayonet) cameras, so the set of training samples can be obtained through the acquisition devices on the highway. Because the installation positions and angles of the acquisition devices differ, the traffic pictures they capture may differ, and the devices need to be configured in advance to capture pictures that meet the detection requirements.
Taking an electronic-police camera as an example: when traffic managers install it, they usually adjust the installation height and shooting angle of the camera so that it captures the picture of vehicles in motion. Therefore, after the detection device in the embodiment of the application is connected to an electronic-police camera, it acquires traffic video in real time; when the acquired video meets the preset vehicle detection requirement (for example, the road lies in the central area of the video frame), no further adjustment is needed, and when it does not, the camera is adjusted again until the requirement is met. Further, after traffic video meeting the preset vehicle detection requirement is obtained, one reference image is randomly selected from the video frames, the detection area for traffic accidents is manually marked on the selected reference image, and the position information of the marked detection area frame is recorded.
Similarly, after the detection device is connected with an intersection camera or a checkpoint camera, the detection area is likewise marked on the collected traffic video.
In the embodiment of the application, because the parameters such as the installation position, the angle, the resolution ratio and the like of the same acquisition device are unchanged, only one reference image needs to be selected for marking the detection area, and the marking efficiency is improved; moreover, through the labeled detection area, the interference of a complex background can be eliminated, and the accuracy of target detection is improved.
It should be noted that, due to the difference in the resolution, the angle, and the like of the acquisition devices, the position and the size of the labeled detection area may be different for the traffic videos acquired by different acquisition devices. The detection areas corresponding to the respective acquisition devices are shown in table 1.
TABLE 1 detection regions corresponding to respective acquisition devices
[Table 1 is provided as an image; it lists, for each acquisition device i, the detection region frame as origin (xi, yi), width wi, and height hi.]
Where, (xi, yi) represents the origin of the detection region frame, wi represents the width of the detection region frame, and hi represents the height of the detection region frame.
A detection area is marked in advance for the traffic video collected by each acquisition device, and the traffic videos collected by the acquisition devices are then used as the training sample set to train the vehicle detection model. The specific training process is shown in fig. 1:
s101: a set of training samples is obtained.
In an embodiment of the application, the training sample set is from traffic videos collected by each collection device on the road. Because the detection areas of the traffic videos acquired by the acquisition devices are labeled in advance, the traffic videos acquired by the acquisition devices are cut based on the detection area corresponding to each acquisition device, and a training sample set is obtained.
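As an illustrative sketch (assuming frames are NumPy arrays; the function and variable names are placeholders, not from the patent), cropping a frame to its labeled detection region is a simple array slice:

```python
import numpy as np

def crop_detection_region(frame: np.ndarray, region: tuple) -> np.ndarray:
    # region = (x, y, w, h): the detection-area frame labeled once on the
    # reference image of the acquisition device that produced this frame.
    x, y, w, h = region
    return frame[y:y + h, x:x + w]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # one decoded video frame
sample = crop_detection_region(frame, (400, 200, 800, 600))
print(sample.shape)                                  # (600, 800, 3)
```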
The training sample set comprises a positive training sample set and a negative training sample set: the positive set consists of images containing vehicles, the negative set of images containing no vehicles. Because the usage of different vehicle types differs greatly, positive samples of widely used vehicles (such as cars) are easier to collect, while positive samples of rarely used vehicles (such as mixer trucks) are harder to collect, so the numbers of positive samples containing different vehicle types may differ. Therefore, in the embodiment of the present application, the positive training samples are divided into first-type positive samples, whose number is smaller than a first sample threshold, and second-type positive samples, whose number is larger than a second sample threshold. The two thresholds may be equal or different and can be set according to actual requirements.
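A minimal sketch of the first-type/second-type split (the threshold values are illustrative assumptions; the patent only says the two thresholds may be equal or different):

```python
FIRST_SAMPLE_THRESHOLD = 500     # assumed value
SECOND_SAMPLE_THRESHOLD = 5000   # assumed value

def split_positive_samples(counts: dict) -> tuple:
    # counts maps vehicle type -> number of positive samples collected.
    first_type = {v for v, n in counts.items() if n < FIRST_SAMPLE_THRESHOLD}
    second_type = {v for v, n in counts.items() if n > SECOND_SAMPLE_THRESHOLD}
    return first_type, second_type

counts = {"car": 80000, "truck": 12000, "mixer_truck": 150}
# first type: {'mixer_truck'}; second type: {'car', 'truck'}
print(split_positive_samples(counts))
```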
After a training sample set is obtained, each training sample is labeled with a sample label. For example, positive training samples are labeled 1 and negative training samples are labeled 0.
S102: inputting a plurality of training samples in the training sample set and a pre-labeled real label corresponding to each training sample into an initial vehicle detection model, and obtaining the vehicle detection model with the detection loss value within a preset range through multi-round iterative training.
In S102, for each round of training, the following operations are performed:
s1021: at least one shallow feature vector and deep feature vector of each training sample are extracted through an initial vehicle detection model.
In S1021, a plurality of training samples in the training sample set are input to the initial vehicle detection model, and then a feature vector of each training sample is extracted by the initial vehicle detection model. In specific implementation, considering that input training samples have the characteristic of multi-scale targets, and aiming at each training sample, extracting a deep layer feature vector and a shallow layer feature vector of the training sample through an initial vehicle detection model, wherein the deep layer feature vector contains semantic information of a target object in the training sample, and the shallow layer feature vector contains position information and multi-scale information of the target object in the training sample.
Generally, road traffic scenes are complex and the resolutions of the acquisition devices differ, so target objects in the training samples are often affected by environmental factors such as large lighting changes, uneven occlusion, and large scale changes. Although the shallow feature vectors extracted by the initial vehicle detection model integrate more position and multi-scale information of the target object, the receptive field used to extract them is reduced. The reduction in receptive field is more pronounced when the resolution of the training samples is large (e.g., 1920 × 1080 pixels).
In view of the above, in the embodiment of the application, when the vehicle detection model is trained, a dilated (atrous) convolution kernel is used in at least one residual unit of the model for extracting the shallow and deep feature vectors, enlarging the receptive field for feature extraction while increasing the model's adaptability to high-resolution training samples and multi-scale target objects.
Optionally, in the embodiment of the present application, dilated convolution kernels are used in the last two residual units of the vehicle detection model.
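As a minimal sketch (assuming PyTorch; the channel count is arbitrary), a 5 × 5 kernel with dilation rate 2, the configuration described later for the last two residual units, covers a 9 × 9 receptive field at the parameter cost of a 5 × 5 kernel:

```python
import torch
import torch.nn as nn

# padding=4 preserves the spatial size for kernel_size=5, dilation=2
dilated = nn.Conv2d(256, 256, kernel_size=5, dilation=2, padding=4)

x = torch.randn(1, 256, 38, 38)   # e.g. a mid-level feature map
print(dilated(x).shape)           # torch.Size([1, 256, 38, 38])
```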
S1022: and obtaining the prediction label of the corresponding training sample according to at least one shallow layer feature vector and deep layer feature vector of each training sample.
In S1022, for each training sample, spatial pyramid pooling is performed on the deep feature vectors using a feature-dimension augmentation and fusion method, reducing convolution operations in the vehicle detection model and preventing deformation of target objects in the training samples.
In specific implementation, the training sample to be detected is first processed with selective search to obtain a plurality of candidate regions; a convolution operation is then performed on the whole training sample to obtain a feature map; each candidate region is mapped onto the feature map to obtain its feature vector, and the feature vector of each candidate region is input to the spatial pyramid pooling layer to obtain a fixed-length feature vector; finally, the fixed-length feature vectors are input to the fully connected layer to obtain the pooled deep feature vectors.
Because the max pooling used by the spatial pyramid pooling layer easily loses local information, the detection accuracy of the vehicle detection model is affected. The embodiment of the application therefore adds a position-sensitive strip pooling structure to the spatial pyramid pooling layer, realizing fine-grained segmentation and pooling of the feature map and compensating for the shortcomings of max pooling.
After max pooling and strip pooling, the deep feature vector in the embodiment of the application not only contains more semantic information but also preserves certain local position information, so the deep feature vectors extracted by the initial vehicle detection model contain rich position and semantic information, facilitating the subsequent detection of multiple target objects.
Further, in S1022, the pooled deep feature vectors are up-sampled and fused with at least one shallow feature vector; a prediction sub-label of each training sample is predicted based on each fused feature vector and the pooled deep feature vectors, and the final prediction label of each training sample is then determined from its prediction sub-labels.
S1023: and determining a detection loss value according to the real label and the predicted label corresponding to each training sample.
In the embodiment of the application, the loss function used by the initial vehicle detection model to calculate the detection loss value considers not only the imbalance between the numbers of positive and negative training samples in the training sample set, but also the imbalance between positive training samples containing different types of vehicles. Specifically, the loss function for calculating the detection loss value is:
L_det = -Σ_i α_i · β_i · [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]    (Equation 1)
wherein L_det represents the detection loss value; α_i represents the balance coefficient between the positive and negative training sample sets, used to optimize the weights of the differing numbers of positive and negative samples and guarantee the recall rate of positive samples; β_i represents the balance coefficient between positive training samples containing different types of vehicles, used to increase the loss value of first-type positive samples, reduce the loss value of second-type positive samples, and prevent the loss contribution of the scarce first-type positives from being swamped by the larger numbers of negative samples and second-type positive samples; y_i represents the true label of each training sample; and ŷ_i represents the predicted label of each training sample.
In the embodiment of the application, adding the balance coefficient β between positive training samples to the loss function of the model effectively solves the problem of unbalanced numbers among the training samples compared with the original loss function, improving the detection accuracy of the model; moreover, with the added balance coefficient, the model converges quickly for the small number of first-type positive training samples during training, improving the detection efficiency of the model.
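A minimal sketch of such a balanced loss (assuming PyTorch; the coefficient values and the sample-type encoding are illustrative assumptions, not values given by the patent):

```python
import torch

def balanced_detection_loss(y_pred, y_true, sample_type,
                            alpha=0.75, beta_rare=2.0, beta_common=0.5):
    # y_pred, y_true: per-sample predicted/true labels in [0, 1].
    # sample_type: 0 = negative, 1 = first-type (scarce) positive,
    #              2 = second-type (abundant) positive.
    eps = 1e-7
    y_pred = y_pred.clamp(eps, 1 - eps)
    bce = -(y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
    # alpha balances positive against negative samples
    alpha_i = torch.where(y_true > 0.5,
                          torch.full_like(bce, alpha),
                          torch.full_like(bce, 1 - alpha))
    # beta raises the loss of scarce positives and lowers that of abundant ones
    beta_i = torch.ones_like(bce)
    beta_i[sample_type == 1] = beta_rare
    beta_i[sample_type == 2] = beta_common
    return (alpha_i * beta_i * bce).sum()

y_pred = torch.tensor([0.9, 0.2, 0.7])
y_true = torch.tensor([1.0, 0.0, 1.0])
sample_type = torch.tensor([1, 0, 2])
print(balanced_detection_loss(y_pred, y_true, sample_type))
```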
S1024: and adjusting the parameters of the initial vehicle detection model according to the detection loss value.
In S1024, parameters of the initial vehicle detection model are adjusted by using the detection loss value obtained in each training round until the detection loss value is within a preset range, so as to obtain a trained vehicle detection model.
The vehicle detection model of the application brings three benefits. First, using dilated convolution kernels in at least one residual unit enlarges the receptive field for feature extraction without losing resolution; the size of the receptive field can be controlled through the dilation rate of the kernel, and multi-scale and position information of target objects is obtained at high resolution, effectively solving problems such as multi-scale change and occlusion change caused by the same target object appearing at different distances in complex traffic scenes. Second, strip pooling is added to the spatial pyramid pooling layer of the model, realizing fine-grained segmentation of deep feature vectors and compensating for the local information lost by max pooling. Third, a balance coefficient between positive training samples is added to the loss function of the model, effectively solving the problem of unbalanced numbers of positive training samples and improving the detection accuracy and efficiency of the vehicle detection model.
The network structure of the vehicle detection model in the embodiment of the present application is described below based on the CSPDarknet53 network structure.
Referring to fig. 2, the feature extraction module of the vehicle detection model provided in the embodiment of the present application includes 5 residual units (Resblock_body); the number of residual blocks differs between units, and a Cross Stage Partial (CSP) structure is added to each residual unit, reducing the parameters of the vehicle detection model and making it easier to train.
Considering that the resolutions of the acquisition devices differ and that target objects in traffic scenes are affected by environmental factors (large lighting changes, uneven occlusion, large scale changes), the shallow feature vector contains the position information of the target object but the receptive field used to extract it is reduced, an effect that is especially pronounced for high-resolution images. Therefore, in the last two residual units of the feature extraction module, a dilated convolution kernel with a dilation rate of 2 and a size of 5 × 5 is used, which enlarges the receptive field for feature extraction relative to the original CSPDarknet53 structure while preserving the accuracy of locating small target objects in complex traffic scenes.
The complete structure of the vehicle detection model is shown in fig. 3; it mainly comprises a feature extraction module, a pooling module, and a prediction module. Taking a training sample of 608 × 608 pixels as an example, convolution operations in the successive residual units extract three feature maps of sizes 76 × 76, 38 × 38, and 19 × 19. The 76 × 76 and 38 × 38 feature maps serve as shallow feature vectors, and the 19 × 19 feature map serves as the deep feature vector.
Spatial Pyramid Pooling (SPP) with feature-dimension augmentation fusion is performed on the obtained deep feature vectors. Assuming the deep feature vectors are convolved into a feature map with dimensions 13 × 13 × 256, selective search over the feature map yields a plurality of candidate regions during SPP pooling. For each candidate region, the SPP pooling layer divides the region into three subgraphs of 4 × 4, 2 × 2, and 1 × 1, performs max pooling on each subgraph, and concatenates the pooled feature vectors to obtain a feature vector of (16 + 4 + 1) × 256, as shown in fig. 4, where d represents the dimension or number of channels.
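A minimal sketch of this fixed-length pooling (assuming PyTorch; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def spp_fixed_length(region_feat: torch.Tensor) -> torch.Tensor:
    # Max-pool one candidate region's (C, H, W) feature map onto 4x4, 2x2 and
    # 1x1 grids and concatenate, giving a fixed (16 + 4 + 1) * C vector
    # regardless of the region's spatial size.
    pooled = [F.adaptive_max_pool2d(region_feat, size).flatten()
              for size in (4, 2, 1)]
    return torch.cat(pooled)

feat = torch.randn(256, 13, 13)           # a 13 x 13 x 256 candidate-region map
print(spp_fixed_length(feat).numel())     # (16 + 4 + 1) * 256 = 5376
```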
In the pooling module of the vehicle detection model, a position-sensitive strip pooling structure is added to the spatial pyramid pooling; as shown in fig. 5, horizontal and vertical strip pooling operations capture short- and long-range dependencies between different positions, realizing fine-grained segmentation of deep semantic information and improving the accuracy of target object detection.
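A sketch of a position-sensitive strip-pooling module in this spirit (assuming PyTorch; the patent does not spell out the layer layout, so the strip convolutions and sigmoid gate below follow the common strip-pooling formulation and are an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    # Pool full-height and full-width strips, convolve along each strip,
    # broadcast back to H x W, and gate the input, so every position mixes
    # long-range row and column context.
    def __init__(self, channels: int):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        rows = self.conv_h(F.adaptive_avg_pool2d(x, (h, 1))).expand(n, c, h, w)
        cols = self.conv_w(F.adaptive_avg_pool2d(x, (1, w))).expand(n, c, h, w)
        return x * torch.sigmoid(self.fuse(rows + cols))

sp = StripPooling(256)
print(sp(torch.randn(1, 256, 19, 19)).shape)   # torch.Size([1, 256, 19, 19])
```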
At the prediction module of the vehicle detection model, the SPP-pooled and strip-pooled deep feature vectors are fused, the dimensionality of the fused feature vector is reduced, and a first probability that a target object exists is predicted from the dimension-reduced fused vector; the fused feature vector is up-sampled and combined with one shallow feature vector to predict a second probability; and a third probability of the presence of the target object is predicted using the other shallow feature vector. The first, second, and third probabilities are then weighted, and the weighted probability predicts whether the training sample contains a target object.
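The final weighting step reduces, in a toy numeric sketch, to a weighted sum (the per-scale probabilities and the weights are made-up values; the patent does not fix the weights):

```python
# Hypothetical per-scale probabilities for one candidate vehicle
p1, p2, p3 = 0.85, 0.78, 0.91   # outputs of the three prediction heads
w1, w2, w3 = 0.4, 0.3, 0.3      # assumed fusion weights

target_prob = w1 * p1 + w2 * p2 + w3 * p3
print(round(target_prob, 3))    # 0.847
```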
Based on the trained vehicle detection model, a video-based traffic accident detection method process is executed, referring to fig. 6, and the process is executed by a detection device and mainly includes the following steps:
s601: and acquiring the traffic video stream in the detection period.
In S601, the detection device acquires a traffic video stream from each connected acquisition device according to a preset detection period. The size of the detection period can be set according to actual requirements, for example, the detection period is 1 hour.
S602: and aiming at each video frame, detecting a preset detection area of the video frame by adopting a trained vehicle detection model to obtain at least one target vehicle.
The structure of the vehicle detection model is referred to the foregoing embodiments, and will not be repeated here.
As can be seen from the foregoing embodiment, the traffic video acquired by each acquisition device has been labeled with a detection area in advance, and when S602 is executed, the detection device determines, for each acquired video frame, a preset detection area of the video frame by determining the acquisition device to which the video frame belongs, and detects at least one target vehicle in the preset detection area by using a trained vehicle detection model. The specific detection process is shown in fig. 7:
s6021: and respectively extracting deep semantic features and shallow position scale features in a preset detection area of the video frame through a plurality of residual error units of the vehicle detection model.
The structure of the residual units is shown in fig. 2-3. For one residual unit that extracts the shallow position scale feature and one that extracts the deep semantic feature, a dilated convolution kernel with a dilation rate of 2 and a size of 5 × 5 is used, which enlarges the receptive field of feature extraction; meanwhile, for high-resolution video frames, the extracted shallow position scale features retain the position information of the target vehicle, can adapt to changes in target vehicle scale, and improve the accuracy of target detection.
S6022: and performing mixed pooling on the deep semantic features to obtain mixed deep semantic features.
In an optional embodiment, when performing S6022, referring to fig. 3, the deep semantic features are SPP-pooled in a max-pooling manner and strip-pooled, and the two pooled results are fused to obtain the mixed deep semantic features.
S6023: and upsampling the mixed deep semantic features.
By up-sampling the mixed deep semantic features, the size of the feature map is increased, and the follow-up pixel-by-pixel prediction is facilitated.
S6024: and determining the probability of containing at least one target vehicle in the preset detection area according to the up-sampled mixed deep semantic features and the at least one shallow position scale feature.
Taking two shallow position scale features as an example, when S6024 is executed, performing dimensionality reduction on the up-sampled mixed deep semantic features, and determining a first probability that at least one target vehicle is included in a preset detection area according to the dimensionality-reduced mixed deep semantic features; determining a second probability that at least one target vehicle is contained in the preset detection area according to the up-sampled mixed deep semantic features and the first shallow position scale features; the first shallow position scale feature is subjected to up-sampling, and a third probability that at least one target vehicle is contained in the preset detection area is determined by combining the mixed deep semantic feature and the second shallow position scale feature after the up-sampling; further, the target probability of each target vehicle in the preset detection area is determined according to the first probability, the second probability and the third probability.
S6025: and obtaining at least one target vehicle according to the determined target probabilities.
In general, more than one vehicle travels at high speed on a highway, and the same video frame may contain multiple vehicles. Therefore, when the vehicle detection model performs detection, several target probabilities may all exceed the detection threshold; in that case, every target vehicle whose target probability exceeds the threshold is framed by a detection box.
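As a sketch (the threshold value is an assumption; the patent does not fix it):

```python
DETECTION_THRESHOLD = 0.5   # assumed value

def select_target_vehicles(probs, boxes, threshold=DETECTION_THRESHOLD):
    # Keep every detection whose target probability clears the threshold,
    # so a single frame can yield several target vehicles.
    return [box for prob, box in zip(probs, boxes) if prob > threshold]

probs = [0.91, 0.34, 0.78]
boxes = [(120, 340, 260, 520), (600, 100, 640, 180), (700, 300, 880, 500)]
print(select_target_vehicles(probs, boxes))   # first and third boxes kept
```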
S603: and acquiring a traffic parameter set of each target vehicle positioned on the same route in the detection period, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of the traffic accident.
In the embodiment of the application, tracking is performed on each detected target vehicle, and a traffic parameter set of each target vehicle located on the same route in a detection period is obtained, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of a traffic accident. Specifically, the traffic parameter set includes at least one of the number of stationary target vehicles in each video frame, a stationary time difference between the stationary target vehicles, a degree of coincidence between the target vehicles in each video frame, and a traveling speed of the target vehicles.
The acquisition mode of each traffic parameter in the traffic parameter set is as follows:
Traveling speed of the target vehicle: for each target vehicle, obtain the tail displacement (in pixels) between two successive video frames, and calculate the traveling speed from the time difference between the two frames (see the code sketch after this list).
Stationary time difference between stationary target vehicles: track the motion of the detected target vehicles and obtain the time difference between the moments at which two target vehicles become stationary.
Number of stationary target vehicles in each video frame: track the motion of the detected target vehicles; when a target vehicle remains still in the preset detection area across different video frames acquired by the same acquisition device, it is judged to be a stationary vehicle, and the number of all stationary vehicles in each video frame is counted.
Coincidence ratio between target vehicles in each video frame: for each video frame, determine, for every pair of intersecting target vehicles, the ratio of the occluded (overlapping) area to the sum of the two vehicles' areas, then average these ratios over all intersecting pairs in the frame to obtain the coincidence ratio between target vehicles.
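A pure-Python sketch of two of these parameters, the traveling speed and the coincidence ratio (boxes are taken as (x1, y1, x2, y2) corners; the speed is returned in pixels per second because the pixel-to-metre calibration is left open by the patent):

```python
def travel_speed(tail_prev, tail_curr, dt):
    # Tail displacement (pixels) between two frames dt seconds apart.
    dx = tail_curr[0] - tail_prev[0]
    dy = tail_curr[1] - tail_prev[1]
    return (dx * dx + dy * dy) ** 0.5 / dt

def coincidence_ratio(boxes):
    # Average intersection / (area_a + area_b) over all intersecting pairs.
    def pair_ratio(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        areas = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1])
        return ix * iy / areas
    ratios = [pair_ratio(a, b)
              for i, a in enumerate(boxes) for b in boxes[i + 1:]]
    ratios = [r for r in ratios if r > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

print(travel_speed((100, 400), (100, 430), 0.04))           # 750.0 px/s
print(coincidence_ratio([(0, 0, 10, 10), (5, 5, 15, 15)]))  # 0.125
```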
Generally, a vehicle collision accident is likely to have occurred when two target vehicles in the same video frame change from a higher traveling speed to a standstill within a short time and the standstill time exceeds a time threshold.
S604: and weighting each traffic parameter in the traffic parameter set according to a preset weight, and determining whether a traffic accident occurs according to the weighted probability.
And S604, weighting each traffic parameter in the traffic parameter set of each target vehicle according to the preset weight of each traffic parameter in the traffic parameter set, and determining whether a traffic accident occurs according to the weighted result.
For example, when the weighted result is greater than or equal to a preset accident threshold, it is determined that a traffic accident has occurred; when the weighted result is smaller than the threshold, it is determined that no traffic accident has occurred.
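A minimal sketch of the weighted decision (the parameter names, weights, and accident threshold are illustrative assumptions; the patent leaves the preset values to the deployment):

```python
# Illustrative weights; the patent does not specify values.
WEIGHTS = {
    "stationary_count": 0.3,   # number of stationary target vehicles
    "stop_time_diff": 0.2,     # stationary time difference between vehicles
    "coincidence": 0.3,        # coincidence ratio between vehicles
    "speed": 0.2,              # traveling speed (normalized drop)
}
ACCIDENT_THRESHOLD = 0.6       # assumed preset accident threshold

def detect_accident(params: dict) -> bool:
    # params holds the traffic parameters normalized to [0, 1].
    score = sum(WEIGHTS[name] * params.get(name, 0.0) for name in WEIGHTS)
    return score >= ACCIDENT_THRESHOLD

print(detect_accident({"stationary_count": 1.0, "stop_time_diff": 0.8,
                       "coincidence": 0.6, "speed": 0.9}))   # True (score 0.82)
```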
The video-based traffic accident detection method provided by the embodiment of the application automatically detects traffic accidents from a traffic video stream acquired in real time: it performs target detection with the trained vehicle detection model, tracks the detected target vehicles to obtain the corresponding traffic parameter sets, and then uses the obtained traffic parameter sets for the logical judgment of traffic accidents. Because the vehicle detection model uses dilated convolution kernels and mixed pooling, it enlarges the receptive field of feature extraction without losing resolution, retains the position and multi-scale information of target objects, realizes fine-grained segmentation of deep semantic features, and preserves local information of the target vehicles, thereby improving the accuracy of target detection. In addition, each traffic parameter in the traffic parameter set accurately characterizes the traffic accidents of target vehicles in different forms in the high-speed driving state, so traffic accidents can be identified accurately from each traffic parameter set; the whole detection process requires no manual participation, saving labor cost.
Referring to fig. 8, an effect diagram of the traffic accident detection provided in the embodiment of the present application is shown, where 801 is a preset detection area, and 802 are target vehicles detected by using the vehicle detection model provided in the embodiment of the present application, and the occurrence of a traffic accident can be accurately determined by obtaining traffic parameter sets of the two vehicles.
In some embodiments, after a traffic accident is determined to have occurred, the detection device reports at least one item of accident information (position information, vehicle information, accident time, accident type) corresponding to the traffic accident to the traffic system in real time, so that the traffic management department can dispatch personnel to handle the accident promptly, maintain traffic order, improve traffic operation efficiency, and avoid secondary traffic accidents.
In some embodiments, because the detection device receives traffic video streams collected by acquisition devices at different positions, one detection period may involve multiple driving routes. When multiple traffic accidents occur on different routes within the detection period, the detection device determines the accident level of each traffic accident and reports the accident information in order of accident level, so that the traffic management department can handle serious accidents first and protect the lives and health of drivers and passengers. The higher the accident level, the more serious the traffic accident.
For example, suppose that a traffic accident 1 occurs on a route 1 in a detection period, two vehicles collide with each other in the traffic accident 1, no personnel is damaged, and the accident level is level 1; a traffic accident 2 occurs on the route 2, multiple vehicles collide with each other in a chain in the traffic accident 2, people are injured, and the accident grade is 4. At this time, the detection device reports accident information corresponding to the traffic accident 2 preferentially.
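In sketch form, the ordered reporting is a sort by accident level (the reporting call is a stand-in print; the records mirror the example above):

```python
accidents = [
    {"route": 1, "level": 1, "info": "two vehicles collided, no injuries"},
    {"route": 2, "level": 4, "info": "multi-vehicle chain collision, injuries"},
]

# Higher level = more serious; the most serious accident is reported first.
for accident in sorted(accidents, key=lambda a: a["level"], reverse=True):
    print("reporting:", accident)
```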
The video-based traffic accident detection method of the application can be applied to an intelligent traffic system, realizing automatic detection and real-time reporting of traffic accidents so that the traffic management department can dispatch personnel to handle them. For vehicle collision accidents in the high-speed driving state, the method detects the accident quickly and accurately within 30 s, improves accident-handling efficiency by 20%, and effectively reduces the probability of secondary accidents. Compared with manual reporting, it solves the problems of poor real-time accident reporting, highway congestion, and secondary accidents, thereby improving traffic operation efficiency.
Based on the same technical concept, embodiments of the present application provide a detection device, which can execute the video-based traffic accident detection method provided in the foregoing embodiments and achieve the same technical effects, and thus, details are not repeated herein.
Referring to fig. 9, the detection apparatus includes a processor 901, a memory 902, and a communication interface 903, where the communication interface 903, the memory 902 and the processor 901 are connected by a bus 904, the memory 902 stores computer program instructions, and the processor 901 performs the following operations according to the computer program instructions stored in the memory 902:
acquiring a traffic video stream in a detection period through a communication interface 903;
aiming at each video frame, detecting a preset detection area of the video frame by adopting a trained vehicle detection model to obtain at least one target vehicle, wherein a loss function of the vehicle detection model comprises a balance coefficient among positive training samples in a positive training sample set used by the vehicle detection model, and the balance coefficient is used for increasing the loss value of a first type of positive samples in the positive training sample set and reducing the loss value of a second type of positive samples in the positive training sample set;
acquiring a traffic parameter set of each target vehicle positioned on the same route in a detection period, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of a traffic accident;
and weighting each traffic parameter in the traffic parameter set according to a preset weight, and determining whether a traffic accident occurs according to a weighted result.
Optionally, the processor 901 detects a preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, and the specific operations are as follows:
respectively extracting deep semantic features and shallow position scale features in a preset detection area of a video frame through a plurality of residual error units of a vehicle detection model, wherein at least one residual error unit comprises a cavity convolution kernel;
performing mixed pooling on the deep semantic features to obtain mixed deep semantic features;
upsampling the mixed deep semantic features;
determining the target probability of each target vehicle in a preset detection area according to the up-sampled mixed deep semantic features and at least one shallow position scale feature;
and obtaining at least one target vehicle according to the determined target probabilities.
Optionally, the processor 901 determines, according to the upsampled mixed deep semantic features and the at least one shallow position scale feature, a probability that the preset detection region includes at least one target vehicle, and specifically:
performing dimensionality reduction on the up-sampled mixed deep semantic features, and determining a first probability that at least one target vehicle is contained in a preset detection area according to the dimensionality-reduced mixed deep semantic features;
determining a second probability that at least one target vehicle is contained in the preset detection area according to the up-sampled mixed deep semantic features and the first shallow position scale features;
the first shallow position scale feature is subjected to up-sampling, and a third probability that at least one target vehicle is contained in a preset detection area is determined by combining the mixed deep semantic feature and the second shallow position scale feature after the up-sampling;
and determining the target probability of at least one target vehicle in the preset detection area according to the first probability, the second probability and the third probability.
Optionally, the traffic parameter set includes at least one of the number of stationary target vehicles in each video frame, a stationary time difference between the stationary target vehicles, a degree of coincidence between the target vehicles in each video frame, and a traveling speed of the target vehicle.
Optionally, when it is determined that a traffic accident occurs, the processor 901 further performs the following operations:
and reporting at least one item of accident information in position information, vehicle information, accident time and accident type corresponding to the traffic accident.
Optionally, when multiple traffic accidents occur on different routes in the detection period, the processor 901 further performs the following operations:
determining an accident grade of each traffic accident;
and reporting accident information in order according to the accident grade.
Optionally, the vehicle detection model is obtained by training in the following manner:
acquiring a training sample set;
inputting a plurality of training samples in the training sample set and a pre-labeled real label corresponding to each training sample into an initial vehicle detection model, and obtaining the vehicle detection model with the detection loss value within a preset range through multi-round iterative training; wherein, for each round of training, the following operations are performed:
extracting at least one shallow layer feature vector and deep layer feature vector of each training sample through an initial vehicle detection model;
obtaining a prediction label of a corresponding training sample according to at least one shallow layer feature vector and deep layer feature vector of each training sample;
determining a detection loss value according to the real label and the prediction label corresponding to each training sample;
and adjusting the parameters of the initial vehicle detection model according to the detection loss value.
Optionally, the loss function for calculating the detection loss value is:
L_det = -Σ_i α_i · β_i · [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
wherein L_det represents the detection loss value, α_i represents the balance coefficient between positive and negative training samples, β_i represents the balance coefficient between positive training samples containing different types of vehicles, y_i represents the true label of each training sample, and ŷ_i represents the predicted label of each training sample.
It should be noted that fig. 9 is only necessary hardware for implementing the video-based traffic accident detection method provided by the embodiment of the present application, and optionally, the detection device further includes conventional hardware such as a display, a video processor, and the like.
The processor referred to in the embodiments of the present application may be a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. A processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a DSP and a microprocessor, or the like. Wherein the memory may be integrated in the processor or may be provided separately from the processor.
Embodiments of the present application also provide a computer-readable storage medium for storing instructions that, when executed, may implement the methods of the foregoing embodiments.
The embodiments of the present application also provide a computer program product storing a computer program, where the computer program, when executed, performs the methods of the foregoing embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A video-based traffic accident detection method is characterized by comprising the following steps:
acquiring a traffic video stream in a detection period;
for each video frame, detecting a preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, wherein a loss function of the vehicle detection model comprises a balance coefficient among positive training samples in a positive training sample set used by the vehicle detection model, and the balance coefficient is used for increasing a loss value of a first type of positive samples in the positive training sample set and reducing a loss value of a second type of positive samples in the positive training sample set;
acquiring a traffic parameter set of each target vehicle positioned on the same route in the detection period, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of a traffic accident;
and weighting each traffic parameter in the traffic parameter set according to a preset weight, and determining whether a traffic accident occurs according to a weighted result.
2. The method of claim 1, wherein the detecting the preset detection area of the video frame by using the trained vehicle detection model to obtain at least one target vehicle comprises:
respectively extracting deep semantic features and shallow position scale features in the preset detection area of the video frame through a plurality of residual units of the vehicle detection model, wherein at least one residual unit comprises a hole convolution kernel;
performing mixed pooling on the deep semantic features to obtain mixed deep semantic features;
upsampling the mixed deep semantic features;
determining the target probability of each target vehicle in the preset detection area according to the up-sampled mixed deep semantic features and the at least one shallow position scale feature;
and obtaining at least one target vehicle according to the determined target probabilities.
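As an illustrative sketch of these building blocks, assuming that the hole convolution corresponds to a dilated convolution and that mixed pooling is a weighted blend of max and average pooling (common readings; the exact design is not specified in this claim):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HoleResidualUnit(nn.Module):
        """Residual unit whose second convolution uses a hole (dilated)
        kernel, enlarging the receptive field at unchanged resolution."""
        def __init__(self, channels: int, dilation: int = 2):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3,
                                   padding=dilation, dilation=dilation)

        def forward(self, x):
            return F.relu(x + self.conv2(F.relu(self.conv1(x))))

    def mixed_pool(x, alpha: float = 0.5):
        """Mixed pooling: a weighted blend of max pooling and average
        pooling (the blend weight alpha is an assumption)."""
        return alpha * F.max_pool2d(x, 2) + (1 - alpha) * F.avg_pool2d(x, 2)

    # Deep semantic features are mix-pooled, upsampled, and fused with a
    # shallow position-scale feature map of higher resolution.
    deep = torch.randn(1, 64, 16, 16)     # deep semantic features
    shallow = torch.randn(1, 64, 32, 32)  # shallow position-scale features
    mixed = mixed_pool(HoleResidualUnit(64)(deep))        # -> 1x64x8x8
    up = F.interpolate(mixed, size=shallow.shape[-2:])    # -> 1x64x32x32
    fused = torch.cat([up, shallow], dim=1)               # -> 1x128x32x32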
3. The method of claim 2, wherein the determining the target probability of each target vehicle in the preset detection area according to the up-sampled mixed deep semantic features and the at least one shallow position scale feature comprises:
performing dimensionality reduction on the up-sampled mixed deep semantic features, and determining a first probability that at least one target vehicle is contained in a preset detection area according to the dimensionality-reduced mixed deep semantic features;
determining a second probability that at least one target vehicle is contained in the preset detection area according to the up-sampled mixed deep semantic features and the first shallow position scale features;
upsampling the first shallow position scale feature, and determining a third probability that at least one target vehicle is contained in the preset detection area by combining the upsampled feature with the mixed deep semantic features and the second shallow position scale feature;
and determining the target probability of at least one target vehicle in the preset detection area according to the first probability, the second probability and the third probability.
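Claim 3 leaves the rule for combining the three probabilities open; one simple possibility, shown only as an assumption, is a weighted average:

    def target_probability(p1: float, p2: float, p3: float,
                           weights=(1/3, 1/3, 1/3)) -> float:
        """Fuse the probabilities obtained at the three feature scales
        into a single target probability (weights are assumptions)."""
        return sum(w * p for w, p in zip(weights, (p1, p2, p3)))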
4. The method of any one of claims 1-3, wherein the set of traffic parameters includes at least one of a number of stationary target vehicles in each video frame, a stationary time difference between stationary target vehicles, a degree of overlap between target vehicles in each video frame, and a travel speed of a target vehicle.
5. The method of claim 1, wherein when it is determined that a traffic accident has occurred, the method further comprises:
and reporting at least one item of accident information in the position information, the vehicle information, the accident time and the accident type corresponding to the traffic accident.
6. The method of claim 1, wherein when there are multiple traffic accidents on different routes within the detection period, the method further comprises:
determining an accident grade of each traffic accident;
and reporting accident information in order according to the accident grade.
7. The method of any of claims 1-3 or 5-6, wherein the vehicle detection model is trained by:
acquiring a training sample set;
inputting a plurality of training samples in the training sample set and the pre-labeled true label corresponding to each training sample into an initial vehicle detection model, and obtaining the vehicle detection model whose detection loss value falls within a preset range through multiple rounds of iterative training; wherein, for each round of training, the following operations are performed:
extracting at least one shallow feature vector and a deep feature vector of each training sample through the initial vehicle detection model;
obtaining a prediction label of the corresponding training sample according to the at least one shallow feature vector and the deep feature vector of that training sample;
determining a detection loss value according to the true label and the prediction label corresponding to each training sample;
and adjusting the parameters of the initial vehicle detection model according to the detection loss value.
8. The method of claim 7, wherein the loss function for calculating the detection loss value is a formula (rendered as an image in the original filing) in which L represents the detection loss value, α represents the balance coefficient between positive and negative training samples, β represents the balance coefficient between positive training samples containing different types of vehicles, y represents the true label of each training sample, and ŷ represents the prediction label of each training sample.
9. A detection device, comprising a processor, a memory, and a communication interface, wherein the communication interface, the memory, and the processor are connected by a bus:
the memory stores a computer program, and the processor performs the following operations according to the computer program:
acquiring a traffic video stream in a detection period through the communication interface;
for each video frame, detecting a preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, wherein a loss function of the vehicle detection model comprises a balance coefficient among positive training samples in a positive training sample set used by the vehicle detection model, and the balance coefficient is used for increasing a loss value of a first type of positive samples in the positive training sample set and reducing a loss value of a second type of positive samples in the positive training sample set;
acquiring a traffic parameter set of each target vehicle positioned on the same route in the detection period, wherein each traffic parameter in the traffic parameter set is used for describing the characteristics of a traffic accident;
and weighting each traffic parameter in the traffic parameter set according to a preset weight, and determining whether a traffic accident occurs according to a weighted result.
10. The detection apparatus according to claim 9, wherein the processor detects the preset detection area of the video frame by using a trained vehicle detection model to obtain at least one target vehicle, and is specifically configured to:
respectively extracting deep semantic features and shallow position scale features in the preset detection area of the video frame through a plurality of residual units of the vehicle detection model, wherein at least one residual unit comprises a hole convolution kernel;
performing mixed pooling on the deep semantic features to obtain mixed deep semantic features;
upsampling the mixed deep semantic features;
determining the target probability of each target vehicle in the preset detection area according to the up-sampled mixed deep semantic features and the at least one shallow position scale feature;
and obtaining at least one target vehicle according to the determined target probabilities.
CN202210148570.XA 2022-02-18 2022-02-18 Video-based traffic fault detection method and device Pending CN114202733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210148570.XA CN114202733A (en) 2022-02-18 2022-02-18 Video-based traffic fault detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210148570.XA CN114202733A (en) 2022-02-18 2022-02-18 Video-based traffic fault detection method and device

Publications (1)

Publication Number Publication Date
CN114202733A true CN114202733A (en) 2022-03-18

Family

ID=80645664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210148570.XA Pending CN114202733A (en) 2022-02-18 2022-02-18 Video-based traffic fault detection method and device

Country Status (1)

Country Link
CN (1) CN114202733A (en)


Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073851A (en) * 2011-01-13 2011-05-25 北京科技大学 Method and system for automatically identifying urban traffic accident
CN108416307A (en) * 2018-03-13 2018-08-17 北京理工大学 A kind of Aerial Images road surface crack detection method, device and equipment
CN108694829A (en) * 2018-03-27 2018-10-23 西安科技大学 Magnitude of traffic flow identification monitoring network based on unmanned aerial vehicle group mobile platform and method
CN108985217A (en) * 2018-07-10 2018-12-11 常州大学 A kind of traffic sign recognition method and system based on deep space network
CN109376572A (en) * 2018-08-09 2019-02-22 同济大学 Real-time vehicle detection and trace tracking method in traffic video based on deep learning
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN112446476A (en) * 2019-09-04 2021-03-05 华为技术有限公司 Neural network model compression method, device, storage medium and chip
CN111008562A (en) * 2019-10-31 2020-04-14 北京城建设计发展集团股份有限公司 Human-vehicle target detection method with feature map depth fusion
CN111368525A (en) * 2020-03-09 2020-07-03 深圳市腾讯计算机系统有限公司 Information searching method, device, equipment and storage medium
CN112242058A (en) * 2020-05-29 2021-01-19 北京新能源汽车技术创新中心有限公司 Target abnormity detection method and device based on traffic monitoring video and storage medium
CN112241974A (en) * 2020-05-29 2021-01-19 北京新能源汽车技术创新中心有限公司 Traffic accident detection method, processing method, system and storage medium
CN112084901A (en) * 2020-08-26 2020-12-15 长沙理工大学 GCAM-based high-resolution SAR image airport runway area automatic detection method and system
CN112001923A (en) * 2020-11-02 2020-11-27 中国人民解放军国防科技大学 Retina image segmentation method and device
CN112861700A (en) * 2021-02-03 2021-05-28 西安仁义智机电科技有限公司 DeepLabv3+ based lane line network identification model establishment and vehicle speed detection method
CN113033604A (en) * 2021-02-03 2021-06-25 淮阴工学院 Vehicle detection method, system and storage medium based on SF-YOLOv4 network model
CN113052025A (en) * 2021-03-12 2021-06-29 咪咕文化科技有限公司 Training method of image fusion model, image fusion method and electronic equipment
CN113158835A (en) * 2021-03-31 2021-07-23 华南理工大学 Traffic accident intelligent detection method based on deep learning
CN113657299A (en) * 2021-08-20 2021-11-16 青岛海信网络科技股份有限公司 Traffic accident determination method and electronic equipment
CN113763425A (en) * 2021-08-30 2021-12-07 青岛海信网络科技股份有限公司 Road area calibration method and electronic equipment
CN113762209A (en) * 2021-09-22 2021-12-07 重庆邮电大学 Multi-scale parallel feature fusion road sign detection method based on YOLO

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Weiwu (ed.): "Handbook of Digitized Highway Application Technology" (《数字化公路应用技术手册》), 30 April 2009 *
Dong Hongyi: "Deep Learning: PyTorch Object Detection in Action" (《深度学习之PyTorch物体检测实战》), 31 January 2020 *

Similar Documents

Publication Publication Date Title
Tian et al. An automatic car accident detection method based on cooperative vehicle infrastructure systems
JP6180482B2 (en) Methods, systems, products, and computer programs for multi-queue object detection and analysis
CN112750150B (en) Vehicle flow statistical method based on vehicle detection and multi-target tracking
CN109977782B (en) Cross-store operation behavior detection method based on target position information reasoning
CN104537841B (en) Unlicensed vehicle violation detection method and detection system thereof
US20130093895A1 (en) System for collision prediction and traffic violation detection
CN112163543A (en) Method and system for detecting illegal lane occupation of vehicle
CN110838230B (en) Mobile video monitoring method, monitoring center and system
CN113011331B (en) Method and device for detecting whether motor vehicle gives way to pedestrians, electronic equipment and medium
CN113052159A (en) Image identification method, device, equipment and computer storage medium
CN111079621A (en) Method and device for detecting object, electronic equipment and storage medium
CN114694060B (en) Road casting detection method, electronic equipment and storage medium
CN116153086A (en) Multi-path traffic accident and congestion detection method and system based on deep learning
CN111695545A (en) Single-lane reverse driving detection method based on multi-target tracking
Suttiponpisarn et al. Detection of wrong direction vehicles on two-way traffic
CN113221791A (en) Vehicle parking violation detection method and device, electronic equipment and storage medium
CN110659534B (en) Shared bicycle detection method and device
CN114078319A (en) Method and device for detecting potential hazard site of traffic accident
CN114202733A (en) Video-based traffic fault detection method and device
CN116363865A (en) Traffic jam assessment method and device, electronic equipment and storage medium
CN115294774A (en) Non-motor vehicle road illegal parking detection method and device based on deep learning
CN114387310A (en) Urban trunk road traffic flow statistical method based on deep learning
CN113989731A (en) Information detection method, computing device and storage medium
Song et al. Method of Vehicle Behavior Analysis for Real-Time Video Streaming Based on Mobilenet-YOLOV4 and ERFNET
Perkasa et al. Video-based system development for automatic traffic monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220318)