CN114677618A - Accident detection method and device, electronic equipment and storage medium

Info

Publication number
CN114677618A
CN114677618A (application number CN202210194552.5A)
Authority
CN
China
Prior art keywords
network
global
target
local
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210194552.5A
Other languages
Chinese (zh)
Inventor
贾若然
谭昶
朱兴海
郑爱华
刘江
冯祥
韩辉
张友国
姜殿洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Anhui University
Iflytek Information Technology Co Ltd
Original Assignee
iFlytek Co Ltd
Anhui University
Iflytek Information Technology Co Ltd
Application filed by iFlytek Co Ltd, Anhui University, Iflytek Information Technology Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202210194552.5A
Publication of CN114677618A
Legal status: Pending

Classifications

    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods


Abstract

The invention provides an accident detection method and apparatus, an electronic device, and a storage medium. The method comprises: determining an image frame sequence of a video to be detected; performing three-dimensional feature extraction on the image frame sequence based on a global extraction network to obtain global features of the video to be detected; determining local features of the video to be detected based on a local extraction network, applying the detection targets and target positions of each frame image in the image frame sequence; and determining an accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network. Because the method, the apparatus, the electronic device, and the storage medium combine the global and local features of the video to be detected for accident detection, accident detection can be completed accurately and reliably both when target detection fails or targets are lost due to drastic changes in the target, and when scene changes are not obvious, so traffic accidents can be monitored promptly, supporting timely accident investigation.

Description

Accident detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an accident detection method and apparatus, an electronic device, and a storage medium.
Background
With the development and application of video surveillance technology, large numbers of road monitoring cameras have been installed, providing the conditions for monitoring road traffic states and handling traffic accidents promptly.
The traditional way of investigating traffic accidents is manual monitoring, in which personnel watch surveillance video around the clock. This consumes substantial manpower and is affected by uncontrollable human factors such as visual resolving ability and fatigue, so its reliability is low.
With the wide application of deep learning to computer vision tasks, video-based traffic accident detection methods have emerged. Most current schemes acquire appearance and motion information of targets from the video, extract features from that information, and detect accidents from the features. However, extracting features directly from a target's appearance and motion information can lose part of the information and degrade detection accuracy, and both the feature extraction and the accident detection depend on the accuracy of target detection and tracking, which also affects the reliability of accident detection.
Disclosure of Invention
The invention provides an accident detection method and apparatus, an electronic device, and a storage medium, to solve the problems of low detection accuracy and low reliability in the prior art when accident detection is performed based on target-related information in a video.
The invention provides an accident detection method, which comprises the following steps:
determining an image frame sequence of a video to be detected;
performing three-dimensional feature extraction on the image frame sequence based on a global extraction network to obtain global features of the video to be detected;
determining local features of the video to be detected based on a local extraction network, applying the detection targets and target positions of each frame image in the image frame sequence;
and determining the accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
According to the accident detection method provided by the invention, performing three-dimensional feature extraction on the image frame sequence based on the global extraction network to obtain the global features of the video to be detected comprises the following steps:
performing multilayer three-dimensional convolution on the image frame sequence based on a multilayer three-dimensional convolutional network in the global extraction network to obtain a first convolution feature and a second convolution feature, wherein the first convolution feature is obtained by convolution before the second convolution feature;
and based on an attention network in the global extraction network, applying the first convolution feature to determine an attention weight of the second convolution feature, and weighting the second convolution feature with the attention weight to obtain the global features.
According to the accident detection method provided by the invention, the first convolution feature and the second convolution feature are the convolution features output by the penultimate layer and the last layer of the multilayer three-dimensional convolution, respectively;
the applying the first convolution feature and determining an attention weight of the second convolution feature includes:
performing a single-layer three-dimensional convolution on the first convolution feature to obtain a third convolution feature with the same dimensions as the second convolution feature;
and determining the attention weight based on the third convolution feature.
According to the accident detection method provided by the invention, the determining, based on the local extraction network, of the local features of the video to be detected by applying the detection target and target position of each frame image in the image frame sequence comprises the following steps:
determining target features and target positions of the detection targets in each frame image of the image frame sequence based on a target detection network in the local extraction network;
and performing spatio-temporal information extraction on the target feature graph of each frame image based on a spatio-temporal extraction network in the local extraction network to obtain the local features, wherein the target feature graph is determined based on the target features and target positions of the detection targets in the corresponding image.
According to the accident detection method provided by the present invention, the performing spatio-temporal information extraction on the target feature graph of each frame image based on the spatio-temporal extraction network in the local extraction network to obtain the local features comprises:
performing spatial information extraction on the target feature graph of each frame image based on a graph convolutional network in the spatio-temporal extraction network to obtain a target spatial relation of each frame image;
and performing time-sequence feature extraction on the target spatial relation of each frame image based on a time-sequence extraction network in the spatio-temporal extraction network to obtain the local features of the video to be detected.
According to the accident detection method provided by the invention, the determining of the accident detection result of the video to be detected by applying the global features and the local features based on the fusion classification network comprises the following steps:
fusing the global features and the local features based on the fusion classification network, performing context extraction on the fused features to obtain context features, performing accident classification by applying the context features, and determining the accident detection result.
According to the accident detection method provided by the invention, the global extraction network, the local extraction network and the fusion classification network are determined based on the following steps:
Constructing an initial detection network based on the initial global extraction network, the initial local extraction network and the initial fusion classification network;
training the initial detection network based on a first sample video carrying an accident label, and determining the global extraction network, the local extraction network and the fusion classification network based on the trained initial detection network.
According to the accident detection method provided by the invention, the initial global extraction network is obtained through joint training with a global classification network based on a second sample video carrying an accident label, and the global classification network is used for performing accident detection based on global features.
The present invention also provides an accident detection apparatus, comprising:
the sequence determining unit is used for determining an image frame sequence of a video to be detected;
the global extraction unit is used for extracting three-dimensional features of the image frame sequence based on a global extraction network to obtain global features of the video to be detected;
the local extraction unit is used for determining the local characteristics of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence based on a local extraction network;
and the fusion classification unit is used for determining the accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
The invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of any of the above-mentioned accident detection methods.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the accident detection method as described in any one of the above.
The accident detection method and apparatus, the electronic device, and the storage medium perform accident detection by combining the global and local features of the video to be detected, so accident detection can be completed accurately and reliably whether target detection fails or targets are lost due to drastic changes in the target, or scene changes are not obvious; traffic accidents can thus be monitored promptly, supporting timely accident investigation.
Drawings
In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below clearly show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of an accident detection method provided by the present invention;
FIG. 2 is a schematic flow chart illustrating step 120 of the accident detection method provided by the present invention;
FIG. 3 is a schematic structural diagram of the global extraction network provided by the present invention;
FIG. 4 is a schematic flow chart illustrating step 130 of the accident detection method provided by the present invention;
FIG. 5 is a schematic diagram of a local extraction network provided by the present invention;
FIG. 6 is a schematic diagram of a first stage of training provided by the present invention;
FIG. 7 is a schematic flow chart of an accident detection method provided by the present invention;
FIG. 8 is a schematic structural diagram of an accident detection apparatus provided in the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present invention.
With the wide application of deep learning in computer vision tasks, video-based traffic accident detection methods have emerged.
At present, such schemes mostly acquire two kinds of information from the video, the appearance and the motion of targets, extract features from that information, and detect accidents from the features. Features extracted from a target's appearance information are used to judge whether the normal state of vehicles and pedestrians has changed significantly, for example whether a vehicle has overturned or a pedestrian has fallen; features extracted from a target's motion information are used to determine whether vehicle and pedestrian trajectories intersect, or whether abrupt changes in speed or angular velocity occur.
Extracting features directly from a target's appearance and motion information pays excessive attention to local information and neglects the understanding of the overall traffic scene, so part of the information is lost, non-accident scenes such as traffic congestion cannot be identified well, and accident detection accuracy suffers. In addition, methods that capture appearance and motion jointly in the features must model temporal correlations of targets through target detection and tracking networks; they attend only to local target information, depend excessively on the accuracy of target detection and tracking, and ignore the global information of the video, which also affects the reliability of accident detection.
In view of the above problems, an embodiment of the present invention provides an accident detection method, and fig. 1 is a schematic flow chart of the accident detection method provided by the present invention, as shown in fig. 1, the method includes:
step 110, determining a sequence of image frames of a video to be detected.
Specifically, the video to be detected is a video on which accident detection needs to be performed. It may be a video shot and stored in advance, or a video stream acquired in real time; the embodiments of the present invention do not specifically limit this. The image frame sequence is obtained by sampling the video to be detected: it contains multiple frame images, each taken from the video to be detected, arranged in their temporal order in the video.
It should be noted that the image frame sequence is usually sampled uniformly based on the total number of frames of the video to be detected, so that the time intervals between sampled frames are equal. Alternatively, the frames of the video to be detected may be combined directly in temporal order to form image frame sequences, for example with every 16 frames as one sequence, as in the sketch below.
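As an illustration, a minimal sketch (in Python, assuming OpenCV for decoding; the 16-frame chunk size follows the example above and is not a fixed requirement) of grouping decoded frames into image frame sequences:

```python
import cv2

def frame_sequences(video_path, seq_len=16):
    """Yield consecutive groups of seq_len frames as image frame sequences."""
    cap = cv2.VideoCapture(video_path)
    sequence = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        sequence.append(frame)
        if len(sequence) == seq_len:
            yield sequence   # one image frame sequence, in temporal order
            sequence = []
    cap.release()
```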
Step 120, performing three-dimensional feature extraction on the image frame sequence based on a global extraction network to obtain global features of the video to be detected.
Here, the global extraction network is a neural network trained in advance to perform global feature extraction on the image frame sequence. For example, the image frame sequence may be input into the global extraction network, which performs three-dimensional feature extraction on it; the extracted three-dimensional features serve as the global features of the video to be detected.
Three-dimensional feature extraction on the image frame sequence means extracting features from the sequence as a whole along three dimensions: the two dimensions of each image frame plus the temporal dimension spanning the frames of the sequence. It may be performed with a 3D-CNN (three-dimensional Convolutional Neural Network). Performing three-dimensional feature extraction through the global extraction network ensures that the resulting global features cover both the information in each frame image and the temporal relations between frames, reflecting the traffic scene of the video to be detected as a whole, as the sketch below illustrates.
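For illustration, a minimal PyTorch sketch (shapes and channel counts are assumptions, not taken from the patent) showing how a single 3D convolution already operates jointly over the two spatial dimensions and the temporal dimension:

```python
import torch
import torch.nn as nn

# A sequence of 16 RGB frames of size 112x112, in layout
# (batch, channels, time, height, width).
frames = torch.randn(1, 3, 16, 112, 112)

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)
features = conv3d(frames)  # convolves jointly over time and space
print(features.shape)      # torch.Size([1, 64, 16, 112, 112])
```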
Step 130, based on the local extraction network, applying the detection target and the target position of each frame of image in the image frame sequence to determine the local feature of the video to be detected.
Here, the local extraction network is a pre-trained neural network for extracting local target features from the image frame sequence. For example, the image frame sequence may be input into the local extraction network, which performs target detection on each frame image to determine the detection targets it contains, such as vehicles and pedestrians, together with their target positions, and then determines the local features of the video to be detected from the positional relations and position changes of the detection targets across frames. Alternatively, target detection may first be performed on each frame image, and the detection targets and target positions of each frame then input into the local extraction network for local feature extraction.
The local features thus extracted reflect the spatial and temporal information of each detection target in the video to be detected, characterizing the video at the level of local targets such as vehicles and pedestrians.
It should be noted that step 120 and step 130 may be executed synchronously, and step 120 may also be executed before or after step 130, which is not specifically limited in this embodiment of the present invention.
Step 140, determining an accident detection result of the video to be detected by applying the global features and the local features based on the fusion classification network.
Here, the fusion classification network is a neural network trained in advance to perform accident detection by combining the global features and the local features. For example, the global and local features obtained in steps 120 and 130 are input into the fusion classification network, which fuses them and then performs accident classification to obtain the accident detection result. Alternatively, the fusion classification network may classify the global features and the local features separately, obtaining a classification result based on each, and integrate the two results into the accident detection result. The accident detection result reflects whether a traffic accident occurs in the video to be detected; if one does, it may further include the type or severity of the accident, or the start frame and end frame of its occurrence.
In this process, accident detection of the video to be detected combines global and local features. Applying global features can fully mine the characteristics of the traffic scene, but small changes in the scene may be difficult to capture and identify; applying local features directly and specifically describes the detection targets involved in a traffic accident, but depends on the progress of target detection and tracking, so missed or lost targets can invalidate the detection. Combining the two compensates for the shortcomings of a single type of feature at the accident detection level and helps improve the reliability and accuracy of accident detection.
The method provided by the embodiment of the invention performs accident detection by combining the global and local features of the video to be detected, so accident detection can be completed accurately and reliably whether target detection fails or targets are lost due to drastic changes in the target, or scene changes are not obvious; traffic accidents can thus be monitored promptly, supporting timely accident investigation.
Based on the previous embodiment, fig. 2 is a schematic flowchart of step 120 in the accident detection method provided by the present invention, and as shown in fig. 2, step 120 includes:
Step 121, performing multilayer three-dimensional convolution on the image frame sequence based on a multilayer three-dimensional convolution network in the global extraction network to obtain a first convolution feature and a second convolution feature, wherein the first convolution feature is obtained by convolution before the second convolution feature;
Step 122, based on the attention network in the global extraction network, applying the first convolution feature to determine the attention weight of the second convolution feature, and weighting the second convolution feature with the attention weight to obtain the global features.
In particular, the global extraction network may include a multi-layer three-dimensional convolutional network and an attention network, wherein the multi-layer three-dimensional convolutional network may be embodied as a plurality of cascaded three-dimensional convolutional layers, an output of a previous three-dimensional convolutional layer being an input of a next three-dimensional convolutional layer.
In step 121, three-dimensional convolution features can be extracted layer by layer from the image frame sequence as a whole through the multilayer three-dimensional convolutional network, yielding the first convolution feature and the second convolution feature. Because the cascaded three-dimensional convolution layers operate layer by layer, their outputs are ordered, and the first convolution feature is obtained by convolution before the second: the network first convolves the image frame sequence to produce the first convolution feature, then convolves further on it to produce the second convolution feature. For example, the multilayer three-dimensional convolutional network may contain 5 cascaded three-dimensional convolution layers, with the first convolution feature output by the third layer and the second convolution feature output by the fifth layer, as in the sketch below.
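A minimal sketch of such a cascade (the layer count matches the 5-layer example; channel widths are assumptions), returning the intermediate first convolution feature alongside the second:

```python
import torch
import torch.nn as nn

class Multi3DConv(nn.Module):
    def __init__(self, chans=(3, 16, 32, 64, 96, 128)):
        super().__init__()
        # 5 cascaded three-dimensional convolution layers
        self.layers = nn.ModuleList(
            nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)
            for c_in, c_out in zip(chans[:-1], chans[1:]))

    def forward(self, x, first_idx=3, second_idx=5):
        feats = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            feats.append(x)
        # e.g. first convolution feature from the 3rd layer,
        # second convolution feature from the 5th layer
        return feats[first_idx - 1], feats[second_idx - 1]

f_first, f_second = Multi3DConv()(torch.randn(1, 3, 16, 56, 56))
```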
Considering that the convolution features obtained from the multilayer three-dimensional convolutional network alone are coarse, may contain much useless background information, and may therefore interfere with accident detection, in step 122 the embodiment of the present invention applies the attention network in the global extraction network: by determining attention weights and then weighting, it raises the prominence of traffic-accident-related information within the information of the whole image frame sequence, yielding global features that reflect the global scene of the image frame sequence while highlighting information relevant to traffic accidents.
By introducing an attention mechanism into global feature extraction, the method provided by the embodiment of the invention lets the global features highlight traffic-accident-related information while still reflecting the global scene of the image frame sequence, which helps filter out the useless information carried in features obtained by direct convolution and improves the reliability of accident detection.
According to any one of the above embodiments, the first convolution feature and the second convolution feature are the convolution features output by the penultimate layer and the last layer of the multilayer three-dimensional convolution, respectively.
For example, the multilayer three-dimensional convolutional network may include 5 cascaded three-dimensional convolutional layers, the first convolutional feature may be a result of a fourth layer convolution, and the second convolutional feature may be a result of a fifth layer convolution. For another example, the multi-layer three-dimensional convolutional network may include 6 cascaded three-dimensional convolutional layers, the first convolutional characteristic may be obtained by a fifth layer convolution, and the second convolutional characteristic may be obtained by a sixth layer convolution.
In step 122, the applying the first convolution feature and determining the attention weight of the second convolution feature includes:
performing a single-layer three-dimensional convolution on the first convolution feature to obtain a third convolution feature with the same dimensions as the second convolution feature;
and determining the attention weight based on the third convolution feature.
Specifically, since the first convolution feature is obtained by convolution before the second convolution feature, its feature dimensions are larger than those of the second convolution feature. A single-layer three-dimensional convolution is therefore applied to the first convolution feature to normalize its feature dimensions to those of the second convolution feature, yielding the third convolution feature.
On this basis, the attention weight may be obtained by applying a 1 × 1 convolution to the third convolution feature, or by applying a self-attention transformation to it.
For example, fig. 3 is a schematic structural diagram of the global extraction network provided by the present invention. As shown in fig. 3, the image frame sequence comprises the images from time t to time t + T and passes through the multilayer three-dimensional convolutional network (3D-CNN) to obtain the first convolution feature and the second convolution feature; in fig. 3 the first convolution feature is output by the fourth three-dimensional convolution layer and denoted f4, and the second convolution feature is output by the fifth three-dimensional convolution layer and denoted f5. The first convolution feature f4 undergoes a spatio-temporal attention transformation to obtain the attention weight W; the second convolution feature f5 is then weighted by W and added to the original f5 to serve as the final global feature.
The spatio-temporal attention transformation applies a three-dimensional convolution to the first convolution feature f4 to obtain a feature with the same dimensions as the second convolution feature f5, followed by a dimension-reducing 1 × 1 convolution to obtain the attention weight W.
In the above process, the global feature may be denoted f5', given by the following formula:
f5' = f5 + W * f5
where * denotes the element-wise (Hadamard) product.
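A minimal PyTorch sketch of this weighting (channel counts and the sigmoid normalization of W are assumptions the patent does not spell out):

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, c4=96, c5=128):
        super().__init__()
        # single-layer 3D convolution aligning f4 with f5's dimensions
        self.align = nn.Conv3d(c4, c5, kernel_size=3, padding=1)
        # 1x1 convolution reducing dimensionality to one attention map
        self.reduce = nn.Conv3d(c5, 1, kernel_size=1)

    def forward(self, f4, f5):
        f_third = torch.relu(self.align(f4))     # third convolution feature
        w = torch.sigmoid(self.reduce(f_third))  # attention weight W
        return f5 + w * f5                       # f5' = f5 + W * f5

f4 = torch.randn(1, 96, 16, 56, 56)
f5 = torch.randn(1, 128, 16, 56, 56)
global_feature = SpatioTemporalAttention()(f4, f5)
```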
Based on any of the above embodiments, fig. 4 is a schematic flow chart of step 130 in the accident detection method provided by the present invention, and as shown in fig. 4, step 130 includes:
step 131, determining target characteristics and target positions of detection targets in each frame image of the image frame sequence based on a target detection network in the local extraction network;
Step 132, performing spatio-temporal information extraction on the target feature graph of each frame image based on a spatio-temporal extraction network in the local extraction network to obtain the local features, wherein the target feature graph is determined based on the target features and target positions of the detection targets in the corresponding image.
In particular, the local extraction network may include a target detection network and a spatiotemporal extraction network.
The target detection network detects and localizes targets in an input image. It may be implemented with a single-stage target detection method or with a two-stage method such as Faster R-CNN; the embodiments of the present invention do not specifically limit this. In step 131, the image frame sequence may be input into the target detection network, which detects and localizes the targets contained in each frame image, such as vehicles and pedestrians, yielding the target features and target positions of the detection targets in each frame image. For any detection target in any frame image, its target position is the coordinates of the target's minimum bounding box in the image, and its target feature may be the part of the image features corresponding to the target region delimited by the target position, where the image features are those extracted during target detection; for example, when Faster R-CNN is applied, the image feature may be the low-dimensional feature extracted by its fully connected (FC) layer.
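For illustration, per-frame detection with torchvision's Faster R-CNN (a stand-in for the patent's unspecified detector; the model variant and the 0.5 confidence threshold are assumptions):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 360, 640)     # one RGB frame, values in [0, 1]
with torch.no_grad():
    det = model([frame])[0]         # dict with 'boxes', 'labels', 'scores'

keep = det["scores"] > 0.5          # assumed confidence threshold
boxes = det["boxes"][keep]          # target positions: minimum bounding boxes
```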
Considering that the occurrence of a traffic accident may be reflected in only one target in the image, and that there is usually another target interacting with the target involved in the accident, after the detection targets contained in each frame image are obtained, the target feature graph of each frame image can be constructed from the target features and target positions of the detection targets. In this target feature graph, the target feature of each detection target is a node; the distances between detection targets are computed from their target positions, and the weights of the edges connecting the corresponding nodes are determined from those distances. The resulting target feature graph G may be expressed as:
G = (V, E)
where V represents the target features of the detection targets extracted from the image, and E represents the distances between the detection targets.
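A minimal sketch of constructing one frame's target feature graph (the inverse-distance edge weighting is an assumption; the patent only states that edge weights are determined from inter-target distances):

```python
import torch

def build_target_graph(features, boxes):
    """features: (n, d) target features; boxes: (n, 4) as (x1, y1, x2, y2)."""
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=1)
    dist = torch.cdist(centers, centers)  # pairwise distances between targets
    adj = 1.0 / (1.0 + dist)              # closer targets -> larger edge weight
    return features, adj                  # V and E of the graph G = (V, E)

V, E = build_target_graph(torch.randn(4, 256), torch.rand(4, 4) * 300)
```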
The spatio-temporal extraction network extracts features from the input target feature graphs along two dimensions, temporal information and spatial information, so that the resulting local features reflect both the spatial relations of the detection targets in the video to be detected and the changes of the detection targets in each frame image over time.
Based on any of the above embodiments, step 132 includes:
performing spatial information extraction on the target feature graph of each frame image based on a graph convolutional network in the spatio-temporal extraction network to obtain the target spatial relation of each frame image;
and performing time-sequence feature extraction on the target spatial relation of each frame image based on a time-sequence extraction network in the spatio-temporal extraction network to obtain the local features of the video to be detected.
Specifically, the spatio-temporal extraction network may include a graph convolutional network and a time-sequence extraction network, where the graph convolutional network aggregates the spatial information of the detection targets and the time-sequence extraction network aggregates their temporal information.
The graph convolutional network extracts features from an input graph, so the target feature graph of each frame image can be input into it. The graph convolutional network processes the target features at each node of the target feature graph and the distances on the edges between nodes, thereby aggregating the spatial information among the detection targets in the traffic scene of the video to be detected and producing the target spatial relation of each frame image.
The time-sequence extraction network aggregates the temporal relations among the features input at successive steps. Specifically, the target spatial relations of the frame images are input frame by frame; the network memorizes the features extracted from the previous frame's target spatial relation and applies them to the feature extraction of the target spatial relation input at the current step. Its output for the target spatial relation of the last frame image therefore constitutes local features that tie together, in temporal order, the target spatial relations of all images in the image frame sequence.
Here, the time-sequence extraction network may be a Long Short-Term Memory (LSTM) network, a Recurrent Neural Network (RNN), or the like; the embodiments of the present invention do not specifically limit this.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of the local extraction network provided by the present invention. As shown in fig. 5, the local extraction network includes a target detection network, a graph convolutional network, and a recurrent neural network, where the recurrent neural network performs the time-sequence extraction. After target detection on each frame image of the image frame sequence is completed by the target detection network, the target feature graph of each frame image is constructed from the target features and target positions of the detection targets it contains, and features are extracted from the target feature graph by a Graph Convolutional Network (GCN) to obtain the target spatial relation of each frame image. On this basis, the target spatial relations of the images are input into the recurrent neural network frame by frame, so that the input of the recurrent neural network's hidden layer comprises not only the output of the input layer but also the output of the hidden layer at the previous step, yielding local features that contain the temporally correlated target spatial relations of all images in the image frame sequence, as sketched below.
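A minimal sketch of the spatio-temporal extraction (a single linear graph-convolution step with mean pooling per frame, followed by an LSTM; all sizes and the pooling choice are assumptions):

```python
import torch
import torch.nn as nn

class SpatioTemporalExtractor(nn.Module):
    def __init__(self, d_in=256, d_hid=128):
        super().__init__()
        self.gcn = nn.Linear(d_in, d_hid)   # shared node transform
        self.lstm = nn.LSTM(d_hid, d_hid, batch_first=True)

    def forward(self, frames):
        # frames: list of (V, A) per frame; V: (n, d_in) node features,
        # A: (n, n) edge weights from the target feature graph
        relations = []
        for V, A in frames:
            deg = A.sum(dim=1, keepdim=True).clamp(min=1e-6)
            H = torch.relu(self.gcn((A / deg) @ V))  # aggregate neighbors
            relations.append(H.mean(dim=0))          # target spatial relation
        seq = torch.stack(relations).unsqueeze(0)    # (1, T, d_hid)
        out, _ = self.lstm(seq)                      # temporal aggregation
        return out[:, -1]                            # local features

frames = [(torch.randn(4, 256), torch.rand(4, 4)) for _ in range(16)]
local_features = SpatioTemporalExtractor()(frames)
```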
Based on any of the above embodiments, step 140 includes:
fusing the global features and the local features based on the fusion classification network, performing context extraction on the fused features to obtain context features, performing accident classification by applying the context features, and determining the accident detection result.
Specifically, after the global features and the local features are respectively obtained, the fusion classification network may fuse them and classify for accident detection based on the fused features. The fusion of global and local features may be realized by concatenation, weighted addition, or the like. Considering that traffic accidents are highly contextual, once the fused features are obtained, their temporal information can be modeled, i.e., context extraction performed, through an RNN, an LSTM, or the like, yielding context features that reflect the temporal order of the video to be detected. Classification is then performed based on the context features, producing the accident detection result that indicates whether a traffic accident exists in the video to be detected, and accident detection is thereby completed.
Context feature extraction can be realized with an RNN, an LSTM, or the like. Considering that the LSTM adds gating over past states on top of the RNN, so that the states with more influence on the present are selected, it alleviates the vanishing-gradient problem that RNNs exhibit in long-term modeling and is better suited to capturing long-term dependencies; an LSTM may therefore preferably be set in the fusion classification network to extract the context features, as sketched below.
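A minimal sketch of the fusion classification network (concatenation fusion, an LSTM for context extraction, and a sigmoid classifier; the per-step feature streams and all sizes are assumptions):

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, d_global=128, d_local=128, d_ctx=128):
        super().__init__()
        self.lstm = nn.LSTM(d_global + d_local, d_ctx, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_ctx, 1), nn.Sigmoid())

    def forward(self, g, l):
        # g, l: (batch, T, d) global and local feature streams
        fused = torch.cat([g, l], dim=-1)  # fusion by concatenation
        ctx, _ = self.lstm(fused)          # context features over time
        return self.head(ctx[:, -1])       # probability of an accident

p = FusionClassifier()(torch.randn(2, 8, 128), torch.randn(2, 8, 128))
```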
Based on any of the above embodiments, the global extraction network, the local extraction network, and the converged classification network are determined based on the following steps:
constructing an initial detection network based on the initial global extraction network, the initial local extraction network and the initial fusion classification network;
training the initial detection network based on a first sample video carrying an accident label, and determining the global extraction network, the local extraction network and the fusion classification network based on the trained initial detection network.
Specifically, the initial global extraction network, the initial local extraction network, and the initial fusion classification network are the initializations of the global extraction network, the local extraction network, and the fusion classification network, respectively; connecting the outputs of the initial global extraction network and the initial local extraction network to the input of the initial fusion classification network forms the initial detection network. The network parameters of the three initial networks may be obtained by initialization or by pre-training; the embodiments of the present invention do not specifically limit this.
After the initial detection network is determined, the first sample videos, collected in advance and annotated with accident labels indicating whether an accident occurs, can be applied to train the initial detection network, realizing its supervised training. The trained initial detection network then contains the global extraction network, the local extraction network, and the fusion classification network applied in accident detection.
Further, regarding the accident labels of the first sample videos: normal first sample videos may be left unlabeled, while a first sample video containing an accident may be labeled with the start frame of the accident and the end frame after the accident as its accident label, where the end frame may be the point at which all vehicles have stopped or the video ends.
Based on any of the above embodiments, in the process of training the initial detection network as a whole, the image frame sequence of a first sample video is used as the input of the initial detection network, yielding the predicted detection result output by the initial detection network for that first sample video; the predicted detection result may include the probability, predicted by the initial detection network, that the first sample video contains an accident. After the predicted detection results are obtained, the parameters of the initial detection network may be updated iteratively based on the following loss function Loss1:
Loss1 = -(1/N1) * Σ_{i=1}^{N1} [ y_{1i} · log p(y_{1i}) + (1 - y_{1i}) · log(1 - p(y_{1i})) ]
where N1 is the number of first sample videos, y_{1i} is the accident label of the i-th first sample video (0 for normal, 1 for accident), and p(y_{1i}) is the predicted detection result for the i-th first sample video, a value between 0 and 1.
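In code, one such update step could look as follows (a sketch; `detector` stands in for the assembled initial detection network and is not an interface fixed by the patent):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()   # matches the cross-entropy form of Loss1

def train_step(detector, optimizer, sequences, labels):
    # sequences: (N, C, T, H, W) image frame sequences of first sample videos
    # labels: (N, 1) accident labels, 0 for normal, 1 for accident
    probs = detector(sequences)      # predicted p(y_1i), each in (0, 1)
    loss = criterion(probs, labels)  # Loss1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```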
Based on any of the above embodiments, the initial global extraction network is obtained through joint training with a global classification network based on second sample videos carrying accident labels, and the global classification network is used to perform accident detection based on global features.
Specifically, in order to improve the training efficiency and the training effect of the initial detection network, the initial global extraction network may be trained in advance before the initial detection network is constructed based on the initial global extraction network.
Here, training of the initial global extraction network is realized in combination with an initial classification network connected in series after it: after the initial global extraction network performs global feature extraction on an input second sample video, the initial classification network performs accident classification based on the global features extracted by the initial global extraction network and outputs a prediction result based on the global features. This prediction result may include the probability, jointly predicted by the initial global extraction network and the global classification network, that the second sample video contains an accident, and the loss function Loss2 of the joint training of the two networks may be expressed as:
Loss2 = -(1/N2) * Σ_{i=1}^{N2} [ y_{2i} · log p(y_{2i}) + (1 - y_{2i}) · log(1 - p(y_{2i})) ]
where N2 is the number of second sample videos, y_{2i} is the accident label of the i-th second sample video (0 for normal, 1 for accident), and p(y_{2i}) is the global-feature-based prediction result for the i-th second sample video, a value between 0 and 1.
It should be noted that the second sample video and the first sample video in the foregoing embodiment may be sample videos of the same batch or sample videos of different batches, and this is not specifically limited in this embodiment of the present invention.
Based on any of the above embodiments, the global extraction network, the local extraction network, and the fusion classification network can be obtained based on two-stage training:
In the first stage, an initial global extraction network is constructed and trained jointly with a global classification network based on second sample videos carrying accident labels. Fig. 6 is a schematic structural diagram of the first-stage training provided by the present invention; as shown in fig. 6, the output of the initial global extraction network is the input of the global classification network, which may be a fully connected feed-forward network (FFN) or another type of classification network.
In the second stage, the trained initial global extraction network is combined with an initial local extraction network and an initial fusion classification network to construct the initial detection network, which is trained on first sample videos carrying accident labels, yielding the trained global extraction network, local extraction network, and fusion classification network, as sketched below.
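A sketch of this two-stage schedule (the optimizer, epoch counts, and network interfaces are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def two_stage_training(global_net, global_classifier, local_net, fusion_net,
                       stage1_loader, stage2_loader, epochs=(10, 20)):
    # Stage 1: joint training of the initial global extraction network
    # and the global classification network on second sample videos.
    opt1 = torch.optim.Adam(list(global_net.parameters()) +
                            list(global_classifier.parameters()))
    for _ in range(epochs[0]):
        for seqs, labels in stage1_loader:
            loss = F.binary_cross_entropy(
                global_classifier(global_net(seqs)), labels)  # Loss2
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # Stage 2: assemble the initial detection network and train it
    # end to end on first sample videos.
    opt2 = torch.optim.Adam(list(global_net.parameters()) +
                            list(local_net.parameters()) +
                            list(fusion_net.parameters()))
    for _ in range(epochs[1]):
        for seqs, labels in stage2_loader:
            probs = fusion_net(global_net(seqs), local_net(seqs))
            loss = F.binary_cross_entropy(probs, labels)      # Loss1
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```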
Based on any of the above embodiments, fig. 7 is a schematic flow diagram of the accident detection method provided by the present invention. As shown in fig. 7, in an actual scene, a surveillance camera installed on site acquires the monitoring video to serve as the video to be detected, and the image frame sequence of the video is transmitted to an accident detection network comprising the global extraction network, the local extraction network, and the fusion classification network.
In the global extraction network, the multilayer three-dimensional convolutional network produces the first and second convolution features; the attention network then applies the first convolution feature to determine the attention weight of the second convolution feature, and weights the second convolution feature with the attention weight to obtain the global features.
The target detection network in the local extraction network determines the target features and target positions of the detection targets in each frame image of the image frame sequence, from which the target feature graph of each frame image is determined; the spatio-temporal extraction network in the local extraction network then performs spatio-temporal information extraction on the target feature graphs to obtain the local features.
Then the fusion classification network fuses the global features and local features output by the global extraction network and the local extraction network respectively, performs context extraction on the fused features to obtain context features, applies the context features for accident classification, and determines and outputs the accident detection result. The accident detection result here may be the probability that an accident occurs, in which case an accident is determined and reported to the relevant department when the probability exceeds a preset threshold, for example 0.7 or 0.8; or it may directly indicate whether an accident occurs, in which case the accident is reported to the relevant department directly when one is detected.
The method provided by the embodiment of the invention extracts features from the image frame sequence along two dimensions, global and local. In the global extraction network, combining multilayer three-dimensional convolution with spatio-temporal attention lets the network attend as much as possible to vehicles and the information around them. In the local extraction network, the image frame sequence first passes through the target detection network, which extracts and aggregates the target information of the current frame; a target feature graph is then constructed in which each node represents a target feature contained in the frame, the spatial relations between local targets are aggregated through the graph convolution operation, and the spatio-temporal relations of the targets are obtained on this basis, yielding the local features. The features extracted along the two dimensions are then fused by the fusion classification network, and context feature extraction produces a temporal feature that aggregates the global context and the local target relations, i.e., the context features, from which the accident occurrence probability, i.e., the accident detection result, is finally output. Compared with the 3D-convolution traffic accident detection methods of the related art, which have difficulty identifying traffic accidents with small scene changes, the method of the embodiment of the invention has better generality; and unlike common detection-and-tracking methods, it does not depend excessively on detection and tracking accuracy, whereas those methods cannot judge that a traffic accident has occurred if no target is detected, and in some accident scenes the target's appearance and speed change so drastically that the target cannot be tracked.
In addition, performing accident detection on segmented image frame sequences, for example with every 16 or every 20 frames as one image frame sequence, fully mines the characteristics of accident scenes and allows the occurrence of traffic accidents to be detected in real time, giving stronger real-time performance and generality than directly outputting a video-level accident probability.
Based on any of the above embodiments, fig. 8 is a schematic structural diagram of an accident detection apparatus provided by the present invention, and as shown in fig. 8, the apparatus includes:
a sequence determining unit 810, configured to determine a sequence of image frames of a video to be detected;
a global extraction unit 820, configured to perform three-dimensional feature extraction on the image frame sequence based on a global extraction network to obtain a global feature of the to-be-detected video;
a local extraction unit 830, configured to determine a local feature of the video to be detected by applying a detection target and a target position of each frame of image in the image frame sequence based on a local extraction network;
and the fusion classification unit 840 is configured to determine an accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
The device provided by the embodiment of the invention performs accident detection by combining the global and local features of the video to be detected, so accident detection can be completed accurately and reliably whether target detection fails or targets are lost due to drastic changes in the target, or scene changes are not obvious; traffic accidents can thus be monitored promptly, supporting timely accident investigation.
Based on any of the embodiments above, the global extraction unit is configured to:
performing multilayer three-dimensional convolution on the image frame sequence based on a multilayer three-dimensional convolution network in the global extraction network to obtain a first convolution feature and a second convolution feature, wherein the first convolution feature is obtained by convolution before the second convolution feature;
and based on an attention network in the global extraction network, applying the first convolution feature to determine an attention weight of the second convolution feature, and weighting the second convolution feature with the attention weight to obtain the global features.
Based on any of the above embodiments, the first convolution feature and the second convolution feature are the convolution features output by the penultimate layer and the last layer of the multilayer three-dimensional convolution, respectively;
the global extraction unit is specifically configured to:
performing a single-layer three-dimensional convolution on the first convolution feature to obtain a third convolution feature with the same dimensions as the second convolution feature;
and determining the attention weight based on the third convolution feature.
Based on any of the above embodiments, the local extraction unit is configured to:
determining target characteristics and target positions of detection targets in each frame of images of the image frame sequence based on a target detection network in the local extraction network;
and performing spatio-temporal information extraction on the target feature graph of each frame image based on a spatio-temporal extraction network in the local extraction network to obtain the local features, wherein the target feature graph is determined based on the target features and target positions of the detection targets in the corresponding image.
Based on any of the embodiments, the local extraction unit is specifically configured to:
performing spatial information extraction on the target feature map of each frame of image based on a graph convolution network in the spatio-temporal extraction network to obtain a target spatial relationship of each frame of image;
and performing time sequence feature extraction on the target spatial relationship of each frame of image based on a time sequence extraction network in the spatio-temporal extraction network to obtain the local features of the video to be detected.
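The sketch below illustrates one plausible reading of this spatio-temporal extraction: a single graph convolution relates the detected targets within each frame, the per-frame results are pooled, and a GRU extracts the time sequence features. The adjacency matrices, the pooling, and the layer sizes are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    class SpatioTemporalExtractor(nn.Module):
        def __init__(self, feat_dim=256, hidden_dim=256):
            super().__init__()
            self.gc_weight = nn.Linear(feat_dim, feat_dim)  # graph convolution weight
            self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)

        def graph_conv(self, x, adj):
            # x: (N, feat_dim) target features of one frame; adj: (N, N)
            return torch.relu(adj @ self.gc_weight(x))      # target spatial relationship

        def forward(self, frame_feats, frame_adjs):
            # frame_feats: list of (N_t, feat_dim); frame_adjs: list of (N_t, N_t)
            pooled = [self.graph_conv(x, a).mean(dim=0)
                      for x, a in zip(frame_feats, frame_adjs)]
            seq = torch.stack(pooled).unsqueeze(0)          # (1, T, feat_dim)
            _, h_n = self.rnn(seq)                          # time sequence extraction
            return h_n[-1, 0]                               # local features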
Based on any of the above embodiments, the fusion classification unit is configured to:
fusing the global features and the local features based on the fusion classification network, extracting context based on the fused features to obtain context features, classifying accidents by applying the context features, and determining the accident detection result.
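A minimal sketch of such a fusion classification network follows, with concatenation as the fusion, a small fully-connected stage standing in for context extraction, and a two-class head; all of these concrete details are assumptions.

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        def __init__(self, g_dim=512, l_dim=256, ctx_dim=256):
            super().__init__()
            self.context = nn.Sequential(             # context extraction (assumed MLP)
                nn.Linear(g_dim + l_dim, ctx_dim), nn.ReLU())
            self.classifier = nn.Linear(ctx_dim, 2)   # accident / no accident

        def forward(self, g_feat, l_feat):
            fused = torch.cat([g_feat, l_feat], dim=-1)   # feature fusion
            ctx = self.context(fused)                     # context features
            return self.classifier(ctx)                   # accident detection result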
Based on any of the above embodiments, the apparatus further comprises a training unit configured to:
constructing an initial detection network based on an initial global extraction network, an initial local extraction network and an initial fusion classification network;
training the initial detection network based on a first sample video carrying an accident label, and determining the global extraction network, the local extraction network and the fusion classification network based on the trained initial detection network.
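In code, that end-to-end training could look like the following sketch; the data loader yielding (frames, accident_label) pairs from the first sample videos is an assumption about data preparation, not part of the disclosure.

    import torch
    import torch.nn as nn

    def train_initial_detection_network(detector, loader, epochs=10, lr=1e-4):
        opt = torch.optim.Adam(detector.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for frames, label in loader:
                logits = detector(frames)      # forward through all three sub-networks
                loss = loss_fn(logits, label)  # supervised by the accident label
                opt.zero_grad()
                loss.backward()                # gradients reach every sub-network
                opt.step()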
Based on any of the above embodiments, the initial global extraction network is obtained by training it jointly with a global classification network based on a second sample video carrying an accident label, where the global classification network is used to perform accident detection based on global features.
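One way to realise this pre-training, assuming the initial global extraction network pools its output to a fixed-length vector (512-d here), is to attach a disposable global classification head, train the pair on the second sample videos, and keep only the extraction network afterwards:

    import torch.nn as nn

    def build_global_pretraining_model(global_net, feat_dim=512, num_classes=2):
        # the Linear head stands in for the global classification network; it is
        # discarded once pre-training finishes (an assumed construction)
        return nn.Sequential(global_net, nn.Linear(feat_dim, num_classes))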
Fig. 9 illustrates a physical structure diagram of an electronic device, and as shown in fig. 9, the electronic device may include: a processor (processor)910, a communication Interface (Communications Interface)920, a memory (memory)930, and a communication bus 940, wherein the processor 910, the communication Interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform an accident detection method comprising:
determining an image frame sequence of a video to be detected;
based on a global extraction network, performing three-dimensional feature extraction on the image frame sequence to obtain global features of the video to be detected;
based on a local extraction network, determining local features of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence;
and determining the accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
In addition, the logic instructions in the memory 930 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the accident detection method provided above, the method comprising:
determining an image frame sequence of a video to be detected;
based on a global extraction network, performing three-dimensional feature extraction on the image frame sequence to obtain global features of the video to be detected;
based on a local extraction network, determining local features of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence;
and determining the accident detection result of the video to be detected by applying the global features and the local features based on the fusion classification network.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the accident detection method provided above, the method comprising:
determining an image frame sequence of a video to be detected;
based on a global extraction network, performing three-dimensional feature extraction on the image frame sequence to obtain global features of the video to be detected;
based on a local extraction network, determining local features of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence;
and determining the accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
The above-described embodiments of the apparatus are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. A method of accident detection, comprising:
determining an image frame sequence of a video to be detected;
based on a global extraction network, performing three-dimensional feature extraction on the image frame sequence to obtain global features of the video to be detected;
based on a local extraction network, determining local features of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence;
and determining the accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
2. The accident detection method of claim 1, wherein the three-dimensional feature extraction of the image frame sequence based on a global extraction network to obtain the global features of the video to be detected comprises:
performing multilayer three-dimensional convolution on the image frame sequence based on a multilayer three-dimensional convolution network in the global extraction network to obtain a first convolution feature and a second convolution feature, wherein the first convolution feature is output by an earlier convolution layer than the second convolution feature;
and based on the attention network in the global extraction network, applying the first convolution feature, determining the attention weight of the second convolution feature, applying the attention weight, and weighting the second convolution feature to obtain the global feature.
3. The accident detection method of claim 2, wherein the first convolution feature and the second convolution feature are the convolution features output by the penultimate layer and the last layer, respectively, of the multi-layer three-dimensional convolution;
said applying said first convolution feature to determine an attention weight of said second convolution feature comprises:
performing single-layer three-dimensional convolution on the first convolution feature to obtain a third convolution feature with the same dimensionality as the second convolution feature;
determining the attention weight based on the third convolution feature.
4. The accident detection method according to claim 1, wherein the determining the local features of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence based on the local extraction network comprises:
determining target features and target positions of detection targets in each frame of image of the image frame sequence based on a target detection network in the local extraction network;
and performing spatio-temporal information extraction on the target feature map of each frame of image based on a spatio-temporal extraction network in the local extraction network to obtain the local features, wherein the target feature map is determined based on the target features and the target positions of the detection targets in the corresponding image.
5. The accident detection method of claim 4, wherein the performing spatio-temporal information extraction on the target feature map of each frame image based on a spatio-temporal extraction network in the local extraction network to obtain the local features comprises:
performing spatial information extraction on the target feature map of each frame of image based on a graph convolution network in the spatio-temporal extraction network to obtain a target spatial relationship of each frame of image;
and performing time sequence feature extraction on the target spatial relationship of each frame of image based on a time sequence extraction network in the spatio-temporal extraction network to obtain the local features of the video to be detected.
6. The accident detection method of claim 1, wherein the determining the accident detection result of the video to be detected by applying the global features and the local features based on the fusion classification network comprises:
fusing the global features and the local features based on the fusion classification network, extracting context based on the fused features to obtain context features, classifying accidents by applying the context features, and determining the accident detection result.
7. The accident detection method according to any one of claims 1 to 6, wherein the global extraction network, the local extraction network and the fusion classification network are determined based on:
constructing an initial detection network based on an initial global extraction network, an initial local extraction network and an initial fusion classification network;
training the initial detection network based on a first sample video carrying an accident label, and determining the global extraction network, the local extraction network and the fusion classification network based on the trained initial detection network.
8. The accident detection method of claim 7, wherein the initial global extraction network is obtained by training it jointly with a global classification network based on a second sample video carrying an accident label, and the global classification network is used for accident detection based on global features.
9. An accident detection apparatus, comprising:
the sequence determining unit is used for determining an image frame sequence of a video to be detected;
the global extraction unit is used for extracting three-dimensional features of the image frame sequence based on a global extraction network to obtain global features of the video to be detected;
the local extraction unit is used for determining local features of the video to be detected by applying the detection target and the target position of each frame of image in the image frame sequence based on a local extraction network;
and the fusion classification unit is used for determining the accident detection result of the video to be detected by applying the global features and the local features based on a fusion classification network.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the accident detection method according to any of claims 1 to 8 are implemented when the processor executes the program.
11. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the accident detection method according to any one of claims 1 to 8.
CN202210194552.5A 2022-03-01 2022-03-01 Accident detection method and device, electronic equipment and storage medium Pending CN114677618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194552.5A CN114677618A (en) 2022-03-01 2022-03-01 Accident detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114677618A true CN114677618A (en) 2022-06-28

Family

ID=82071735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210194552.5A Pending CN114677618A (en) 2022-03-01 2022-03-01 Accident detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114677618A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861891B (en) * 2022-12-16 2023-09-29 北京多维视通技术有限公司 Video target detection method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination