CN114550032A - Video smoke detection method of end-to-end three-dimensional convolution target detection network - Google Patents

Video smoke detection method of end-to-end three-dimensional convolution target detection network

Info

Publication number
CN114550032A
CN114550032A
Authority
CN
China
Prior art keywords
smoke
video
video frame
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210109359.7A
Other languages
Chinese (zh)
Inventor
张启兴
霍一诺
张永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210109359.7A
Publication of CN114550032A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Fire-Detection Mechanisms (AREA)

Abstract

The invention relates to a video smoke detection method and system based on an end-to-end three-dimensional convolution target detection network. The method comprises the following steps: S1: acquiring video frames from a plurality of smoke videos, grouping the video frames into video frame sequences, performing data enhancement on the video frame sequences, and constructing an enhanced data set; S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network comprising a three-dimensional convolution layer, a cross-stage local residual network module, a pyramid pooling module, a path aggregation network module and a tensor decoder, and outputting the smoke recognition result and location. The method can effectively extract both the static and the dynamic characteristics of smoke; combining the two kinds of characteristics improves the reliability of the video smoke detection algorithm, so that smoke in a video frame can be accurately identified and located.

Description

Video smoke detection method of end-to-end three-dimensional convolution target detection network
Technical Field
The invention relates to the field of video fire detection and deep learning, in particular to a video smoke detection method and a video smoke detection system based on an end-to-end three-dimensional convolution target detection network.
Background
Current research on video smoke detection with deep learning methods falls mainly into four categories. (1) Each video frame is detected independently. This achieves real-time detection thanks to its high detection speed, but it completely ignores the temporal information contained in consecutive video frames, so serious missed alarms and false alarms are inevitable. (2) A traditional motion detection algorithm extracts motion regions, and a DCNN then classifies those regions. Only shallow temporal information is used: false alarms caused by some static objects can be eliminated, but the method is powerless against interference from moving objects or missed detections of them. (3) Each video frame is first detected independently, and when a suspected target is found, a temporal network makes the final judgment. Although effective deep dynamic features are extracted, the temporal network merely verifies the results of the target detection network: it can eliminate false alarms but offers no strategy for missed alarms. Moreover, such algorithms are not end-to-end and therefore tend to run slowly. (4) A temporal network is used to build a classifier for video segments, which fully extracts the motion and static features contained in the video; however, these features are used only for classification, and the smoke target is not located. How to effectively extract the static and dynamic characteristics of smoke and improve the reliability of smoke detection has therefore become an urgent problem.
Disclosure of Invention
In order to solve the technical problem, the invention provides a video smoke detection method and a video smoke detection system based on an end-to-end three-dimensional convolution target detection network.
The technical solution of the invention is as follows: a video smoke detection method based on an end-to-end three-dimensional convolution target detection network comprises the following steps:
step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set;
step S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
Compared with the prior art, the invention has the following advantages:
the invention discloses a video smoke detection method based on an end-to-end three-dimensional convolution target detection network. The method can effectively extract both the static and the dynamic characteristics of smoke; combining the two improves the reliability of the video smoke detection algorithm, so that smoke in a video frame can be accurately identified and located. The method can be applied to the field of video fire detection, has high application value, and provides a new approach to the high false-alarm rate that plagues existing video fire detection.
Drawings
Fig. 1 is a flowchart of a video smoke detection method based on an end-to-end three-dimensional convolution target detection network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure for detecting a target based on end-to-end three-dimensional convolution according to an embodiment of the present invention;
FIG. 3 is a block diagram of a cross-phase local residual network according to an embodiment of the present invention;
fig. 4 is a block diagram of a video smoke detection system based on an end-to-end three-dimensional convolution target detection network in the embodiment of the present invention.
Detailed Description
The invention provides a video smoke detection method based on an end-to-end three-dimensional convolution target detection network, which can effectively extract static and dynamic characteristics of smoke, and the combination of the dynamic characteristics and the static characteristics can effectively improve the reliability of a video smoke detection algorithm, so that the smoke in a video frame can be accurately identified and positioned.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
Example one
As shown in fig. 1, a video smoke detection method based on an end-to-end three-dimensional convolution target detection network provided by an embodiment of the present invention includes the following steps:
step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set;
step S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
In one embodiment, step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set, wherein the method specifically comprises the following steps:
step S11: acquiring a plurality of smoke videos, extracting images at a fixed frame interval, grouping every 100 extracted images into a video frame sequence, and labeling each image that contains smoke;
the embodiment of the invention obtains 44 sections of videos meeting requirements from a public fire smoke video image database, wherein 32 sections of videos have smoke for making positive samples, and 12 sections of videos have no smoke for making negative samples. In addition, 28 video segments are shot as supplements, wherein 21 video segments have smoke, and 7 video segments are background and interference objects such as pedestrians. The videos comprise three scenes, namely an indoor scene, an outdoor short-distance scene and an outdoor long-distance scene.
Images are extracted from the videos at a fixed rate of 3 frames per second, and every 100 pictures form a video frame sequence, finally yielding 147 video frame sequences with 14700 pictures in total: 115 video frame sequences of positive samples and 32 of negative samples. Each positive sample has a corresponding label file, produced with the labelImg software. In the embodiment of the invention, 132 video sequences (13200 pictures) are randomly selected as the training set, and the remaining 15 sequences serve as the validation set.
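The dataset bookkeeping described above (147 sequences of 100 pictures each, with a random 132/15 train/validation split) can be sketched as follows; the function name and the fixed seed are illustrative, not part of the patent.

```python
import random

FRAMES_PER_SEQUENCE = 100  # each video frame sequence holds 100 extracted images

def split_sequences(num_sequences=147, num_train=132, seed=0):
    """Randomly pick training sequences; the rest form the validation set."""
    rng = random.Random(seed)
    indices = list(range(num_sequences))
    rng.shuffle(indices)
    return sorted(indices[:num_train]), sorted(indices[num_train:])

train_ids, val_ids = split_sequences()
print(len(train_ids) * FRAMES_PER_SEQUENCE)  # 13200 training pictures
```

With the stated numbers this reproduces the 13200-picture training set and 15-sequence validation set.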
Step S12: generating two random numbers a and b, with 0 < a < 132 and 0 < b < 89, and letting i = a × 100 + b; extracting 12 consecutive images starting from the i-th picture, i.e. the b-th picture of the a-th video frame sequence in the data set, to obtain a new video frame sequence;
because 12 pictures need to be continuously read to input into the network when network training is carried out subsequently, in order to ensure that 12 pictures read each time are from the same video sequence, the embodiment of the invention designs a reading rule of the video sequence. When the training data is read, firstly, two random integers a and b are generated, wherein the value range of a is between 0 and 132, the value range of b is between 0 and 89, a variable i is equal to a multiplied by 100+ b, then, reading is carried out from the ith picture, and 12 pictures are sequentially read. And after the pictures are read, reading the mark file corresponding to the last picture as a label for the training to calculate the loss value. The same rule is also adopted when the verification data is read, and the value range of a is changed to 0 to 15.
Step S13: selecting 4 new video frame sequences, performing an enhancement transform on the images in each of the 4 sequences, and then shrinking and splicing the images of the 4 sequences in the same manner in turn to obtain an enhanced video frame sequence, thereby constructing the enhanced data set; the enhancement transform includes: flipping, cropping, size transformation, translation transformation and color-gamut warping.
Data enhancement methods commonly used for object detection include image flipping, cropping, size transformation, translation transformation, rotation transformation and color-gamut warping. Existing data enhancement algorithms feed each picture into the neural network after an independent random transformation; however, when a training sample is a video sequence, every picture in the sequence must undergo the same transformation, so existing data enhancement algorithms cannot be used directly in the invention.
The embodiment of the invention therefore designs a data enhancement algorithm for smoke video frame sequences. The algorithm includes image flipping, cropping, size transformation, translation transformation and color-gamut warping; because smoke motion has a certain directionality, rotation transformation is not included. In addition, because the smoke detection network of the embodiment takes several consecutive video frame sequences as input, the load on the computing device is high and batch training is difficult, which leads to long training periods and poor model robustness. To solve this problem, the data enhancement algorithm designed by the invention processes four video frame sequences at a time: after the four sequences are transformed separately, the pictures in the four sequences are shrunk and spliced in the same manner in turn, finally forming a new sequence. In this way each training iteration computes on the data of four pictures, enriching the backgrounds of the detected objects. Moreover, the spliced images contain shrunken smoke targets, which improves the sensitivity of the detection model to small targets and makes it better suited to early fire detection.
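A minimal NumPy sketch of the four-sequence splicing idea: one transform decision is drawn per sequence and applied to every frame of that sequence, after which the frames are shrunk and tiled 2 × 2. Here a horizontal flip stands in for the patent's full set of transforms, and a crude stride-2 subsample stands in for proper resizing; both simplifications are assumptions for illustration.

```python
import numpy as np

def mosaic_sequences(seqs, rng=np.random.default_rng(0)):
    """Splice 4 video frame sequences of shape (T, H, W, 3) into one sequence.
    Each sequence gets ONE randomly chosen flip applied to every one of its
    frames, keeping the transform consistent within a sequence."""
    flips = [bool(rng.integers(0, 2)) for _ in seqs]  # one decision per sequence
    out = []
    num_frames = seqs[0].shape[0]
    for t in range(num_frames):
        tiles = []
        for s, flip in zip(seqs, flips):
            frame = s[t, :, ::-1] if flip else s[t]   # same flip for all frames
            tiles.append(frame[::2, ::2])             # crude 2x downscale
        top = np.concatenate(tiles[:2], axis=1)
        bottom = np.concatenate(tiles[2:], axis=1)
        out.append(np.concatenate([top, bottom], axis=0))
    return np.stack(out)

seqs = [np.zeros((12, 416, 416, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic_sequences(seqs).shape)  # (12, 416, 416, 3)
```

The spliced sequence keeps the original 416 × 416 resolution while each smoke target shrinks by half, matching the small-target motivation above.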
In one embodiment, as shown in fig. 2, the step S2 is: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; outputting a smoke recognition result and positioning, specifically comprising:
step S21: the three-dimensional convolution stage comprises 4 three-dimensional convolutional layers. The input is 12 × 416 × 416 × 3, where 12 is the 12 images of an enhanced video frame sequence, 416 × 416 is the image size and 3 is the number of RGB channels; the height and width are halved as the data pass through the three-dimensional convolutional layers, and the output is 1 × 52 × 52 × 128;
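The patent states only the aggregate shapes (input 12 × 416 × 416 × 3, output 1 × 52 × 52 × 128), not the per-layer strides. The shape calculator below checks one assumed schedule that reproduces those numbers — spatial halving in three of the four layers and temporal reduction factors 2, 2, 3, 1 — purely as a consistency sketch; the actual kernel and stride configuration is not disclosed.

```python
def stack_shapes(in_shape, layers):
    """Track (T, H, W, C) through a stack of 3-D convolutions, assuming each
    layer divides T and H/W by the given factors ('same' padding, no remainder)."""
    t, h, w, _ = in_shape
    shapes = [in_shape]
    for t_div, s_div, c in layers:
        t, h, w = t // t_div, h // s_div, w // s_div
        shapes.append((t, h, w, c))
    return shapes

# One assumed (t_div, s_div, channels) schedule matching 12x416x416x3 -> 1x52x52x128:
layers = [(2, 2, 16), (2, 2, 32), (3, 2, 64), (1, 1, 128)]
print(stack_shapes((12, 416, 416, 3), layers)[-1])  # (1, 52, 52, 128)
```

Note that collapsing the temporal axis from 12 to 1 is what fuses the dynamic information of the sequence into a single 2-D feature map for the later modules.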
step S22: the cross-stage local residual network module is formed by connecting 11 two-dimensional depthwise separable convolutions in series; from the output of the three-dimensional convolution stage, features at the three scales 52 × 52, 26 × 26 and 13 × 13 are selected for output and decoding;
as shown in fig. 3, the cross-phase local residual network module structure diagram introduces a residual structure and a cross-phase local network structure, the residual structure is used to solve the problem that the gradient appearing when the network is too deep disappears, and the cross-phase local network structure can enhance the learning ability of the network. In order to enable the three-dimensional convolution smoke detection network to effectively detect smoke targets with different sizes, the three scales of 52 × 52, 26 × 26 and 13 × 13 are selected for output in the embodiment of the invention.
Step S23: the pyramid pooling module performs enhanced regional feature processing on each of the 3 outputs of the cross-stage local residual network module;
step S24: performing feature fusion between scales on the output of the pyramid pooling module through a path aggregation network module to obtain 3 feature tensors with the sizes of 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18;
the embodiment of the invention adopts a path aggregation network to perform feature fusion among all scales, performs up-sampling on small-scale feature tensors to sequentially perform fusion with medium-scale and large-scale feature tensors, and then performs down-sampling on large-scale features to sequentially perform fusion with the medium-scale and small-scale feature tensors, so as to shorten an information path and enhance a feature pyramid by using an accurate positioning signal existing in the large-scale feature tensors.
Step S25: and respectively inputting the 3 characteristic tensors into corresponding tensor decoders for decoding, and finally outputting a smoke identification result and positioning.
The path aggregation network module yields three feature tensors with sizes 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18, which are input into the corresponding yolo_head for tensor decoding to obtain the smoke recognition result and location. Taking the 52 × 52 × 18 tensor as an example, yolo_head works as follows: the original picture is divided into a 52 × 52 grid, each cell predicts 3 potential targets, and each target corresponds to 6 parameters, namely 4 bounding-box parameters, 1 confidence value and 1 class probability value.
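A sketch of the decoding for the 52 × 52 × 18 case: the 18 channels are reshaped into 3 anchors × 6 parameters, confidence and class probability pass through a sigmoid, and the box-centre offsets are added to the grid-cell coordinates. This follows the general YOLO decoding scheme; the patent does not disclose the exact anchor handling, so the width/height decoding is left as raw parameters here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_head(feat, num_anchors=3, num_params=6):
    """Decode one yolo_head tensor (S, S, 18) into per-cell predictions:
    18 = 3 anchors x (4 box parameters + 1 confidence + 1 class probability)."""
    s = feat.shape[0]
    feat = feat.reshape(s, s, num_anchors, num_params)
    box = feat[..., :4]              # tx, ty, tw, th (raw box parameters)
    conf = sigmoid(feat[..., 4])     # objectness / smoke confidence
    cls_prob = sigmoid(feat[..., 5]) # single-class (smoke) probability
    # Centre offsets are relative to each grid cell; normalise to [0, 1].
    gy, gx = np.meshgrid(np.arange(s), np.arange(s), indexing="ij")
    cx = (sigmoid(box[..., 0]) + gx[..., None]) / s   # normalised centre x
    cy = (sigmoid(box[..., 1]) + gy[..., None]) / s   # normalised centre y
    return cx, cy, conf, cls_prob

cx, cy, conf, cls_prob = decode_head(np.zeros((52, 52, 18)))
print(conf.shape)  # (52, 52, 3)
```

The 26 × 26 and 13 × 13 heads decode identically; only the grid size S changes.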
The invention discloses a video smoke detection method based on an end-to-end three-dimensional convolution target detection network. The method can effectively extract both the static and the dynamic characteristics of smoke; combining the two improves the reliability of the video smoke detection algorithm, so that smoke in a video frame can be accurately identified and located. The method can be applied to the field of video fire detection, has high application value, and provides a new approach to the high false-alarm rate that plagues existing video fire detection.
Example two
As shown in fig. 4, an embodiment of the present invention provides a video smoke detection system based on an end-to-end three-dimensional convolution target detection network, including the following modules:
the data set building module 31 is used for acquiring video frames from a plurality of smoke videos, grouping the video frames, building a video frame sequence, performing data enhancement on the video frame sequence, and building an enhanced data set;
a detection network construction and training module 32 for inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (4)

1. A video smoke detection method based on an end-to-end three-dimensional convolution target detection network is characterized by comprising the following steps:
step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set;
step S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
2. The video smoke detection method based on the end-to-end three-dimensional convolution target detection network of claim 1, wherein the step S1 is: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set, wherein the method specifically comprises the following steps:
step S11: acquiring a plurality of smoke videos, extracting images at a fixed frame interval, grouping every 100 extracted images into a video frame sequence, and labeling each image that contains smoke;
step S12: generating two random numbers a and b, with 0 < a < 132 and 0 < b < 89, and letting i = a × 100 + b; extracting 12 consecutive images starting from the i-th picture, i.e. the b-th picture of the a-th video frame sequence in the enhanced data set, to obtain a new video frame sequence;
step S13: selecting 4 new video frame sequences, performing an enhancement transform on the images in each of the new video frame sequences, and then shrinking and splicing the images of the 4 sequences in the same manner in turn to obtain an enhanced video frame sequence, thereby constructing the enhanced data set; the enhancement transform includes: flipping, cropping, size transformation, translation transformation and color-gamut warping.
3. The video smoke detection method based on the end-to-end three-dimensional convolution target detection network of claim 1, wherein the step S2 is: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; outputting a smoke recognition result and positioning, specifically comprising:
step S21: the three-dimensional convolution stage comprises 4 three-dimensional convolutional layers. The input is 12 × 416 × 416 × 3, where 12 is the 12 images of the enhanced video frame sequence, 416 × 416 is the image size and 3 is the number of RGB channels; the height and width are halved as the data pass through the three-dimensional convolutional layers, and the output is 1 × 52 × 52 × 128;
step S22: the cross-stage local residual network module is formed by connecting 11 two-dimensional depthwise separable convolutions in series; from the output of the three-dimensional convolution stage, features at the three scales 52 × 52, 26 × 26 and 13 × 13 are selected for output and decoding;
step S23: the pyramid pooling module respectively performs enhanced regional feature processing corresponding to 3 outputs of the cross-stage local residual error network module;
step S24: performing feature fusion between scales on the output of the pyramid pooling module through the path aggregation network module to obtain 3 feature tensors with the sizes of 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18;
step S25: and respectively inputting the 3 characteristic tensors into corresponding tensor decoders for decoding, and finally outputting a smoke identification result and positioning.
4. A video smoke detection system based on an end-to-end three-dimensional convolution target detection network is characterized by comprising the following modules:
the data set building module is used for acquiring video frames from a plurality of smoke videos, grouping the video frames, building a video frame sequence, performing data enhancement on the video frame sequence and building an enhanced data set;
a detection network construction and training module for inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
CN202210109359.7A 2022-01-28 2022-01-28 Video smoke detection method of end-to-end three-dimensional convolution target detection network Pending CN114550032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210109359.7A CN114550032A (en) 2022-01-28 2022-01-28 Video smoke detection method of end-to-end three-dimensional convolution target detection network


Publications (1)

Publication Number Publication Date
CN114550032A true CN114550032A (en) 2022-05-27

Family

ID=81674475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109359.7A Pending CN114550032A (en) 2022-01-28 2022-01-28 Video smoke detection method of end-to-end three-dimensional convolution target detection network

Country Status (1)

Country Link
CN (1) CN114550032A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580365A (en) * 2023-05-30 2023-08-11 苏州大学 Millimeter wave radar and vision fused target vehicle detection method and system

Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
CN112183240B (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
Lyu et al. Small object recognition algorithm of grain pests based on SSD feature fusion
CN111145222A (en) Fire detection method combining smoke movement trend and textural features
CN112801037A (en) Face tampering detection method based on continuous inter-frame difference
CN112257659A (en) Detection tracking method, apparatus and medium
CN114066937B (en) Multi-target tracking method for large-scale remote sensing image
CN116597438A (en) Improved fruit identification method and system based on Yolov5
CN111860457A (en) Fighting behavior recognition early warning method and recognition early warning system thereof
CN114550032A (en) Video smoke detection method of end-to-end three-dimensional convolution target detection network
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN117475353A (en) Video-based abnormal smoke identification method and system
CN113570615A (en) Image processing method based on deep learning, electronic equipment and storage medium
CN112329550A (en) Weak supervision learning-based disaster-stricken building rapid positioning evaluation method and device
Wang et al. YOLOv5-light: efficient convolutional neural networks for flame detection
CN115661858A (en) 2D human body posture estimation method based on coupling of local features and global characterization
CN116912670A (en) Deep sea fish identification method based on improved YOLO model
CN115512263A (en) Dynamic visual monitoring method and device for falling object
CN115393788A (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
Prabakaran et al. Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs)
CN113378598A (en) Dynamic bar code detection method based on deep learning
CN113888604A (en) Target tracking method based on depth optical flow
Sun et al. Light-YOLOv3: License plate detection in multi-vehicle scenario
CN112380970A (en) Video target detection method based on local area search
CN113283279B (en) Multi-target tracking method and device in video based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination