CN114550032A - Video smoke detection method of end-to-end three-dimensional convolution target detection network - Google Patents

Video smoke detection method of end-to-end three-dimensional convolution target detection network

Info

Publication number
CN114550032A
CN114550032A
Authority
CN
China
Prior art keywords
smoke
video
video frame
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210109359.7A
Other languages
Chinese (zh)
Inventor
张启兴
霍一诺
张永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210109359.7A
Publication of CN114550032A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Fire-Detection Mechanisms (AREA)

Abstract

The invention relates to a video smoke detection method and system based on an end-to-end three-dimensional convolution target detection network. The method comprises the following steps: S1: acquiring video frames from a plurality of smoke videos, grouping the video frames into video frame sequences, performing data enhancement on the video frame sequences, and constructing an enhanced data set; S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network comprising a three-dimensional convolution layer, a cross-stage local residual network module, a pyramid pooling module, a path aggregation network module and a tensor decoder, and outputting the smoke recognition result and location. The method can effectively extract both the static and the dynamic characteristics of smoke; combining the two kinds of characteristics improves the reliability of the video smoke detection algorithm, so that smoke in a video frame can be accurately identified and located.

Description

Video smoke detection method of end-to-end three-dimensional convolution target detection network
Technical Field
The invention relates to the field of video fire detection and deep learning, in particular to a video smoke detection method and a video smoke detection system based on an end-to-end three-dimensional convolution target detection network.
Background
Current research on video smoke detection with deep learning methods falls mainly into four categories. (1) Each video frame is detected independently. This achieves real-time detection thanks to its high detection speed, but it completely ignores the temporal information contained in consecutive video frames, so serious missed alarms and false alarms are inevitable. (2) A traditional motion detection algorithm extracts motion regions, and a DCNN then classifies those regions. Only shallow temporal information is used: false alarms caused by some static objects can be eliminated, but the method is powerless against interference from moving objects or missed detections of them. (3) Each video frame is first detected independently, and when a suspected target is found, a temporal network makes the final judgment. Although effective deep dynamic features are extracted, the temporal network merely verifies the results of the target detection network: it can eliminate false alarms but offers no strategy for missed alarms. Moreover, such algorithms are not end-to-end and therefore tend to run slowly. (4) A temporal network is used to build a classifier for video segments, which fully extracts the motion and static features contained in the video; however, these features are used only for classification, and the smoke target is not located. How to effectively extract the static and dynamic characteristics of smoke and improve the reliability of smoke detection has therefore become an urgent problem.
Disclosure of Invention
In order to solve the technical problem, the invention provides a video smoke detection method and a video smoke detection system based on an end-to-end three-dimensional convolution target detection network.
The technical solution of the invention is as follows: a video smoke detection method based on an end-to-end three-dimensional convolution target detection network comprises the following steps:
step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set;
step S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
Compared with the prior art, the invention has the following advantages:
the invention discloses a video smoke detection method based on an end-to-end three-dimensional convolution target detection network. The method can effectively extract both the static and the dynamic characteristics of smoke; combining the two improves the reliability of the video smoke detection algorithm, so that smoke in a video frame can be accurately identified and located. The method can be applied to the field of video fire detection, has high application value, and provides a new approach to the high false-alarm rate that plagues existing video fire detection.
Drawings
Fig. 1 is a flowchart of a video smoke detection method based on an end-to-end three-dimensional convolution target detection network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure for detecting a target based on end-to-end three-dimensional convolution according to an embodiment of the present invention;
FIG. 3 is a block diagram of a cross-phase local residual network according to an embodiment of the present invention;
fig. 4 is a block diagram of a video smoke detection system based on an end-to-end three-dimensional convolution target detection network in the embodiment of the present invention.
Detailed Description
The invention provides a video smoke detection method based on an end-to-end three-dimensional convolution target detection network, which can effectively extract static and dynamic characteristics of smoke, and the combination of the dynamic characteristics and the static characteristics can effectively improve the reliability of a video smoke detection algorithm, so that the smoke in a video frame can be accurately identified and positioned.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
Example one
As shown in fig. 1, a video smoke detection method based on an end-to-end three-dimensional convolution target detection network provided by an embodiment of the present invention includes the following steps:
step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set;
step S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
In one embodiment, step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set, wherein the method specifically comprises the following steps:
step S11: acquiring a plurality of smoke videos, extracting images at a fixed frame interval, grouping every 100 extracted images into a video frame sequence, and labeling each image that contains smoke;
the embodiment of the invention obtains 44 sections of videos meeting requirements from a public fire smoke video image database, wherein 32 sections of videos have smoke for making positive samples, and 12 sections of videos have no smoke for making negative samples. In addition, 28 video segments are shot as supplements, wherein 21 video segments have smoke, and 7 video segments are background and interference objects such as pedestrians. The videos comprise three scenes, namely an indoor scene, an outdoor short-distance scene and an outdoor long-distance scene.
Images are extracted from the videos at a fixed rate of 3 frames per second, and every 100 pictures form a video frame sequence, finally yielding 147 video frame sequences with 14700 pictures in total: 115 video frame sequences of positive samples and 32 of negative samples. Each positive sample has a corresponding label file, produced with the labelImg software. In the embodiment of the invention, 132 video sequences (13200 pictures) are randomly selected as the training set, and the remaining 15 sequences serve as the validation set.
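The dataset bookkeeping described above (147 sequences of 100 pictures each, with a random 132/15 train/validation split) can be sketched as follows; the function name and the fixed seed are illustrative, not part of the patent.

```python
import random

FRAMES_PER_SEQUENCE = 100  # each video frame sequence holds 100 extracted images

def split_sequences(num_sequences=147, num_train=132, seed=0):
    """Randomly pick training sequences; the rest form the validation set."""
    rng = random.Random(seed)
    indices = list(range(num_sequences))
    rng.shuffle(indices)
    return sorted(indices[:num_train]), sorted(indices[num_train:])

train_ids, val_ids = split_sequences()
print(len(train_ids) * FRAMES_PER_SEQUENCE)  # 13200 training pictures
```

With the stated numbers this reproduces the 13200-picture training set and 15-sequence validation set.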
Step S12: generating two random numbers a and b, with 0 < a < 132 and 0 < b < 89, and letting i = a × 100 + b; extracting 12 consecutive images starting from the i-th picture, i.e. the b-th picture of the a-th video frame sequence in the data set, to obtain a new video frame sequence;
because 12 pictures need to be continuously read to input into the network when network training is carried out subsequently, in order to ensure that 12 pictures read each time are from the same video sequence, the embodiment of the invention designs a reading rule of the video sequence. When the training data is read, firstly, two random integers a and b are generated, wherein the value range of a is between 0 and 132, the value range of b is between 0 and 89, a variable i is equal to a multiplied by 100+ b, then, reading is carried out from the ith picture, and 12 pictures are sequentially read. And after the pictures are read, reading the mark file corresponding to the last picture as a label for the training to calculate the loss value. The same rule is also adopted when the verification data is read, and the value range of a is changed to 0 to 15.
Step S13: selecting 4 new video frame sequences, performing an enhancement transform on the images in each of the 4 sequences, and then shrinking and splicing the images of the 4 sequences in the same manner in turn to obtain an enhanced video frame sequence, thereby constructing the enhanced data set; the enhancement transform includes: flipping, cropping, size transformation, translation transformation and color-gamut warping.
Data enhancement methods commonly used for object detection include image flipping, cropping, size transformation, translation transformation, rotation transformation and color-gamut warping. Existing data enhancement algorithms feed each picture into the neural network after an independent random transformation; however, when a training sample is a video sequence, every picture in the sequence must undergo the same transformation, so existing data enhancement algorithms cannot be used directly in the invention.
The embodiment of the invention therefore designs a data enhancement algorithm for smoke video frame sequences. The algorithm includes image flipping, cropping, size transformation, translation transformation and color-gamut warping; because smoke motion has a certain directionality, rotation transformation is not included. In addition, because the smoke detection network of the embodiment takes several consecutive video frame sequences as input, the load on the computing device is high and batch training is difficult, which leads to long training periods and poor model robustness. To solve this problem, the data enhancement algorithm designed by the invention processes four video frame sequences at a time: after the four sequences are transformed separately, the pictures in the four sequences are shrunk and spliced in the same manner in turn, finally forming a new sequence. In this way each training iteration computes on the data of four pictures, enriching the backgrounds of the detected objects. Moreover, the spliced images contain shrunken smoke targets, which improves the sensitivity of the detection model to small targets and makes it better suited to early fire detection.
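A minimal NumPy sketch of the four-sequence splicing idea: one transform decision is drawn per sequence and applied to every frame of that sequence, after which the frames are shrunk and tiled 2 × 2. Here a horizontal flip stands in for the patent's full set of transforms, and a crude stride-2 subsample stands in for proper resizing; both simplifications are assumptions for illustration.

```python
import numpy as np

def mosaic_sequences(seqs, rng=np.random.default_rng(0)):
    """Splice 4 video frame sequences of shape (T, H, W, 3) into one sequence.
    Each sequence gets ONE randomly chosen flip applied to every one of its
    frames, keeping the transform consistent within a sequence."""
    flips = [bool(rng.integers(0, 2)) for _ in seqs]  # one decision per sequence
    out = []
    num_frames = seqs[0].shape[0]
    for t in range(num_frames):
        tiles = []
        for s, flip in zip(seqs, flips):
            frame = s[t, :, ::-1] if flip else s[t]   # same flip for all frames
            tiles.append(frame[::2, ::2])             # crude 2x downscale
        top = np.concatenate(tiles[:2], axis=1)
        bottom = np.concatenate(tiles[2:], axis=1)
        out.append(np.concatenate([top, bottom], axis=0))
    return np.stack(out)

seqs = [np.zeros((12, 416, 416, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic_sequences(seqs).shape)  # (12, 416, 416, 3)
```

The spliced sequence keeps the original 416 × 416 resolution while each smoke target shrinks by half, matching the small-target motivation above.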
In one embodiment, as shown in fig. 2, the step S2 is: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; outputting a smoke recognition result and positioning, specifically comprising:
step S21: the three-dimensional convolution stage comprises 4 three-dimensional convolutional layers. The input is 12 × 416 × 416 × 3, where 12 is the 12 images of an enhanced video frame sequence, 416 × 416 is the image size and 3 is the number of RGB channels; the height and width are halved as the data pass through the three-dimensional convolutional layers, and the output is 1 × 52 × 52 × 128;
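The patent states only the aggregate shapes (input 12 × 416 × 416 × 3, output 1 × 52 × 52 × 128), not the per-layer strides. The shape calculator below checks one assumed schedule that reproduces those numbers — spatial halving in three of the four layers and temporal reduction factors 2, 2, 3, 1 — purely as a consistency sketch; the actual kernel and stride configuration is not disclosed.

```python
def stack_shapes(in_shape, layers):
    """Track (T, H, W, C) through a stack of 3-D convolutions, assuming each
    layer divides T and H/W by the given factors ('same' padding, no remainder)."""
    t, h, w, _ = in_shape
    shapes = [in_shape]
    for t_div, s_div, c in layers:
        t, h, w = t // t_div, h // s_div, w // s_div
        shapes.append((t, h, w, c))
    return shapes

# One assumed (t_div, s_div, channels) schedule matching 12x416x416x3 -> 1x52x52x128:
layers = [(2, 2, 16), (2, 2, 32), (3, 2, 64), (1, 1, 128)]
print(stack_shapes((12, 416, 416, 3), layers)[-1])  # (1, 52, 52, 128)
```

Note that collapsing the temporal axis from 12 to 1 is what fuses the dynamic information of the sequence into a single 2-D feature map for the later modules.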
step S22: the cross-stage local residual network module is formed by connecting 11 two-dimensional depthwise separable convolutions in series; from the output of the three-dimensional convolution stage, features at the three scales 52 × 52, 26 × 26 and 13 × 13 are selected for output and decoding;
as shown in fig. 3, the cross-phase local residual network module structure diagram introduces a residual structure and a cross-phase local network structure, the residual structure is used to solve the problem that the gradient appearing when the network is too deep disappears, and the cross-phase local network structure can enhance the learning ability of the network. In order to enable the three-dimensional convolution smoke detection network to effectively detect smoke targets with different sizes, the three scales of 52 × 52, 26 × 26 and 13 × 13 are selected for output in the embodiment of the invention.
Step S23: the pyramid pooling module performs enhanced regional feature processing on each of the 3 outputs of the cross-stage local residual network module;
step S24: performing feature fusion between scales on the output of the pyramid pooling module through a path aggregation network module to obtain 3 feature tensors with the sizes of 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18;
the embodiment of the invention adopts a path aggregation network to perform feature fusion among all scales, performs up-sampling on small-scale feature tensors to sequentially perform fusion with medium-scale and large-scale feature tensors, and then performs down-sampling on large-scale features to sequentially perform fusion with the medium-scale and small-scale feature tensors, so as to shorten an information path and enhance a feature pyramid by using an accurate positioning signal existing in the large-scale feature tensors.
Step S25: and respectively inputting the 3 characteristic tensors into corresponding tensor decoders for decoding, and finally outputting a smoke identification result and positioning.
The path aggregation network module yields three feature tensors with sizes 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18, which are input into the corresponding yolo_head for tensor decoding to obtain the smoke recognition result and location. Taking the 52 × 52 × 18 tensor as an example, yolo_head works as follows: the original picture is divided into a 52 × 52 grid, each cell predicts 3 potential targets, and each target corresponds to 6 parameters, namely 4 bounding-box parameters, 1 confidence value and 1 class probability value.
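A sketch of the decoding for the 52 × 52 × 18 case: the 18 channels are reshaped into 3 anchors × 6 parameters, confidence and class probability pass through a sigmoid, and the box-centre offsets are added to the grid-cell coordinates. This follows the general YOLO decoding scheme; the patent does not disclose the exact anchor handling, so the width/height decoding is left as raw parameters here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_head(feat, num_anchors=3, num_params=6):
    """Decode one yolo_head tensor (S, S, 18) into per-cell predictions:
    18 = 3 anchors x (4 box parameters + 1 confidence + 1 class probability)."""
    s = feat.shape[0]
    feat = feat.reshape(s, s, num_anchors, num_params)
    box = feat[..., :4]              # tx, ty, tw, th (raw box parameters)
    conf = sigmoid(feat[..., 4])     # objectness / smoke confidence
    cls_prob = sigmoid(feat[..., 5]) # single-class (smoke) probability
    # Centre offsets are relative to each grid cell; normalise to [0, 1].
    gy, gx = np.meshgrid(np.arange(s), np.arange(s), indexing="ij")
    cx = (sigmoid(box[..., 0]) + gx[..., None]) / s   # normalised centre x
    cy = (sigmoid(box[..., 1]) + gy[..., None]) / s   # normalised centre y
    return cx, cy, conf, cls_prob

cx, cy, conf, cls_prob = decode_head(np.zeros((52, 52, 18)))
print(conf.shape)  # (52, 52, 3)
```

The 26 × 26 and 13 × 13 heads decode identically; only the grid size S changes.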
The invention discloses a video smoke detection method based on an end-to-end three-dimensional convolution target detection network. The method can effectively extract both the static and the dynamic characteristics of smoke; combining the two improves the reliability of the video smoke detection algorithm, so that smoke in a video frame can be accurately identified and located. The method can be applied to the field of video fire detection, has high application value, and provides a new approach to the high false-alarm rate that plagues existing video fire detection.
Example two
As shown in fig. 4, an embodiment of the present invention provides a video smoke detection system based on an end-to-end three-dimensional convolution target detection network, including the following modules:
the data set building module 31 is used for acquiring video frames from a plurality of smoke videos, grouping the video frames, building a video frame sequence, performing data enhancement on the video frame sequence, and building an enhanced data set;
a detection network construction and training module 32 for inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims (4)

1. A video smoke detection method based on an end-to-end three-dimensional convolution target detection network is characterized by comprising the following steps:
step S1: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set;
step S2: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
2. The video smoke detection method based on the end-to-end three-dimensional convolution target detection network of claim 1, wherein the step S1 is: acquiring video frames from a plurality of smoke videos, grouping the video frames, constructing a video frame sequence, performing data enhancement on the video frame sequence, and constructing an enhanced data set, wherein the method specifically comprises the following steps:
step S11: acquiring a plurality of smoke videos, extracting images at a fixed frame interval, grouping every 100 extracted images into a video frame sequence, and labeling each image that contains smoke;
step S12: generating two random numbers a and b, with 0 < a < 132 and 0 < b < 89, and letting i = a × 100 + b; extracting 12 consecutive images starting from the i-th picture, i.e. the b-th picture of the a-th video frame sequence in the enhanced data set, to obtain a new video frame sequence;
step S13: selecting 4 new video frame sequences, performing an enhancement transform on the images in each of the new video frame sequences, and then shrinking and splicing the images of the 4 sequences in the same manner in turn to obtain an enhanced video frame sequence, thereby constructing the enhanced data set; the enhancement transform includes: flipping, cropping, size transformation, translation transformation and color-gamut warping.
3. The video smoke detection method based on the end-to-end three-dimensional convolution target detection network of claim 1, wherein the step S2 is: inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; outputting a smoke recognition result and positioning, specifically comprising:
step S21: the three-dimensional convolution stage comprises 4 three-dimensional convolutional layers. The input is 12 × 416 × 416 × 3, where 12 is the 12 images of the enhanced video frame sequence, 416 × 416 is the image size and 3 is the number of RGB channels; the height and width are halved as the data pass through the three-dimensional convolutional layers, and the output is 1 × 52 × 52 × 128;
step S22: the cross-stage local residual network module is formed by connecting 11 two-dimensional depthwise separable convolutions in series; from the output of the three-dimensional convolution stage, features at the three scales 52 × 52, 26 × 26 and 13 × 13 are selected for output and decoding;
step S23: the pyramid pooling module respectively performs enhanced regional feature processing corresponding to 3 outputs of the cross-stage local residual error network module;
step S24: performing feature fusion between scales on the output of the pyramid pooling module through the path aggregation network module to obtain 3 feature tensors with the sizes of 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18;
step S25: and respectively inputting the 3 characteristic tensors into corresponding tensor decoders for decoding, and finally outputting a smoke identification result and positioning.
4. A video smoke detection system based on an end-to-end three-dimensional convolution target detection network is characterized by comprising the following modules:
the data set building module is used for acquiring video frames from a plurality of smoke videos, grouping the video frames, building a video frame sequence, performing data enhancement on the video frame sequence and building an enhanced data set;
a detection network construction and training module for inputting the enhanced data set into a three-dimensional convolutional smoke detection network, the smoke detection network comprising: the system comprises a three-dimensional convolution layer, a cross-stage local residual error network module, a pyramid pooling module, a path aggregation network module and a tensor decoder; and outputting a smoke recognition result and positioning.
CN202210109359.7A 2022-01-28 2022-01-28 Video smoke detection method of end-to-end three-dimensional convolution target detection network Pending CN114550032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210109359.7A CN114550032A (en) 2022-01-28 2022-01-28 Video smoke detection method of end-to-end three-dimensional convolution target detection network


Publications (1)

Publication Number Publication Date
CN114550032A true CN114550032A (en) 2022-05-27

Family

ID=81674475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109359.7A Pending CN114550032A (en) 2022-01-28 2022-01-28 Video smoke detection method of end-to-end three-dimensional convolution target detection network

Country Status (1)

Country Link
CN (1) CN114550032A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580365A (en) * 2023-05-30 2023-08-11 苏州大学 Millimeter wave radar and vision fused target vehicle detection method and system

Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
CN112183240B (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
Lyu et al. Small object recognition algorithm of grain pests based on SSD feature fusion
CN111145222A (en) Fire detection method combining smoke movement trend and textural features
CN112801037A (en) Face tampering detection method based on continuous inter-frame difference
CN112257659A (en) Detection tracking method, apparatus and medium
CN114066937B (en) Multi-target tracking method for large-scale remote sensing image
CN116597438A (en) Improved fruit identification method and system based on Yolov5
CN111860457A (en) Fighting behavior recognition early warning method and recognition early warning system thereof
CN114550032A (en) Video smoke detection method of end-to-end three-dimensional convolution target detection network
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN117475353A (en) Video-based abnormal smoke identification method and system
CN113570615A (en) Image processing method based on deep learning, electronic equipment and storage medium
CN112329550A (en) Weak supervision learning-based disaster-stricken building rapid positioning evaluation method and device
Wang et al. YOLOv5-light: efficient convolutional neural networks for flame detection
CN115661858A (en) 2D human body posture estimation method based on coupling of local features and global characterization
CN116912670A (en) Deep sea fish identification method based on improved YOLO model
CN115512263A (en) Dynamic visual monitoring method and device for falling object
CN115393788A (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
Prabakaran et al. Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs)
CN113378598A (en) Dynamic bar code detection method based on deep learning
CN113888604A (en) Target tracking method based on depth optical flow
Sun et al. Light-YOLOv3: License plate detection in multi-vehicle scenario
CN112380970A (en) Video target detection method based on local area search
CN113283279B (en) Multi-target tracking method and device in video based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination