CN114580541A - Fire disaster video smoke identification method based on time-space domain double channels - Google Patents

Fire disaster video smoke identification method based on time-space domain double channels

Info

Publication number
CN114580541A
CN114580541A (application CN202210215812.2A)
Authority
CN
China
Prior art keywords: smoke, feature, static, network, dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210215812.2A
Other languages
Chinese (zh)
Inventor
郑远攀
王振宇
许博阳
牛依青
高宇飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou University of Light Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University of Light Industry
Priority to CN202210215812.2A
Publication of CN114580541A
Legal status: Pending

Classifications

    • G06F18/253 Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/045 Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/084 Neural network learning methods; backpropagation, e.g. using gradient descent
    • Y02T10/40 Climate change mitigation technologies related to transportation; engine management systems


Abstract

The invention provides a fire video smoke identification method based on a time-space domain dual channel, comprising the following steps: collecting and producing a smoke data set containing cloud and fog interference; building a static feature extraction network and a dynamic feature extraction network and connecting them through fusion to construct a video smoke identification network model; training the model on the smoke data set to obtain an optimized video smoke identification network model; and using the optimized model to process smoke video collected in real time, in which the static feature extraction network extracts static image features in the spatial domain, the dynamic feature extraction network extracts dynamic video features in the time domain, the two are fused into smoke features, and the smoke features are classified to judge whether smoke is present. The invention achieves higher accuracy and recall with a lower false-alarm rate, and can effectively identify smoke and raise early warnings in real time.

Description

Fire disaster video smoke identification method based on time-space domain double channels
Technical Field
The invention relates to the technical field of smoke identification, in particular to a fire disaster video smoke identification method based on time-space domain double channels.
Background
Existing smoke detectors based on traditional physical sensors are widely used because of their low cost and simple installation: mounting one in a fixed position is usually enough to meet standard smoke-detection requirements. Such detectors nevertheless have clear limitations. On the one hand, they must be installed near the fire source, and the alarm triggers only once the smoke concentration or air temperature is high enough, which greatly degrades the real-time performance of detection. On the other hand, the detector is in constant contact with dust and smoke, so the sensor fails easily in harsh environments and loses its detection capability. Physical-sensor smoke detection therefore struggles to meet the needs of current industrial processes and smoke safety early warning, and a higher-performance, more accurate smoke detection method is urgently needed.
Smoke detection algorithms based on traditional image processing overcome the shortcomings of physical-sensor methods. First, a video surveillance device captures smoke video of the scene; next, hand-designed algorithms extract smoke features such as color, texture, and shape from the video frames; finally, a smoke classifier is trained to identify smoke and issue early warnings. Aphana et al. convert the smoke image from RGB to HSV color space and threshold the saturation and value channels to segment smoke regions. Peng et al. extract the minimum bounding rectangle of the motion region in the image as the shape feature of smoke. However, such methods spend considerable time and effort on feature extraction and feature selection, increasing algorithmic complexity, and because smoke is highly variable in appearance, they suffer from high false-alarm rates.
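The HSV thresholding idea attributed to Aphana et al. can be sketched in a few lines; the specific saturation and value thresholds below are illustrative assumptions, not values from the cited work:

```python
import colorsys

def segment_smoke_hsv(pixels, s_max=0.25, v_min=0.6):
    """Flag pixels whose HSV saturation/value fall in a smoke-like band.

    `pixels` is a list of (r, g, b) tuples with components in [0, 1].
    Smoke tends to be desaturated (low S) and fairly bright (high V);
    the thresholds s_max and v_min here are assumed for illustration.
    Returns a binary mask as a list of 0/1 values.
    """
    mask = []
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r, g, b)
        mask.append(1 if (s <= s_max and v >= v_min) else 0)
    return mask
```

A light-gray pixel such as (0.8, 0.8, 0.8) passes both thresholds, while a saturated red pixel does not.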
Smoke detection in the early stage of a fire plays an important role in fire early warning. In recent years, video smoke recognition algorithms based on convolutional neural networks (CNNs) have been widely proposed and applied across industry. Because of the variable appearance of smoke, however, these algorithms still face problems such as low accuracy, poor robustness, and insufficient data sets.
Compared with hand-designed feature extraction, a convolutional neural network automatically learns both basic and more complex pixel-level information in smoke images, reducing the complexity of feature extraction and coping better with the variability of smoke. Gu et al. use a deep dual-channel neural network (DCNN) to fuse basic and detailed smoke information for classification, achieving a good recognition rate. Zhang et al. extract generalized smoke features via transfer learning, accelerating both training convergence and recognition. Cao et al. propose a feature-foreground model to build an enhanced feature-foreground network for smoke-source prediction and detection, improving the network's ability to learn smoke. He et al. combine an attention mechanism with feature-level and decision-level fusion modules to recognize small smoke targets. Using only static smoke features, however, it is difficult to achieve breakthroughs in accuracy and false-alarm rate. Research shows that the diffusion behavior of smoke is a key cue for smoke identification and can markedly improve model accuracy, so effectively extracting motion features from smoke becomes the central problem. Existing motion-feature extraction methods fall into two categories: methods based on traditional image processing, and methods using 3D CNNs (3D convolutional neural networks). Traditional image-processing methods compute a binary image of moving objects as the motion feature, using inter-frame differencing, background subtraction, optical flow, or the ViBe (Visual Background Extractor) algorithm.
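Of these motion extractors, inter-frame differencing is the simplest; a minimal sketch (the threshold value is an illustrative assumption):

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, thresh=25):
    """Binary motion mask via the inter-frame difference method.

    Pixels whose absolute grayscale change between consecutive frames
    exceeds `thresh` are marked as moving (1); all others are static (0).
    Frames are 2D uint8 arrays; the cast to int16 avoids wrap-around
    when subtracting unsigned values.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > thresh).astype(np.uint8)
```

As the surrounding text notes, any non-smoke moving object also survives this test, which is why such masks pick up interference.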
Liu et al. introduced the concept of visual change images to describe the diffusion behavior of smoke and proposed a two-stage smoke detection algorithm combining a deep normalization network with a support vector machine (SVM); it reduces the false alarms caused by clouds and fog, but the visual change images must still be computed with hand-designed, complex algorithms. Gao et al. used the MSER (Maximally Stable Extremal Regions) algorithm to remedy the blank-region problem that the ViBe algorithm produces at short range, generating more complete candidate smoke regions that represent smoke motion; this improves identification accuracy and lowers the miss rate, but raises the algorithm's cost twenty-fold on the same hardware. Sheng et al. extract statistical images across the time, frequency, and time-frequency domains as static and motion features of smoke and feed them into a deep belief network, improving both the speed and accuracy of smoke detection in complex environments. These algorithms have complex pipelines and are sensitive to image quality and cluttered scenes: when a non-smoke moving object is present, the extracted motion features often include interference from it, lowering recognition accuracy. Methods based on 3D CNNs, by contrast, learn motion features from video automatically, need no prior-knowledge-driven filtering of motion interference, and are simple to implement and robust to the environment. 3D CNNs therefore have inherent advantages for recognizing smoke with its diffusion characteristics.
However, few existing methods adopt 3D CNNs, and their accuracy remains to be improved.
Although many researchers continue to optimize network models and improve the accuracy of smoke recognition algorithms, CNN-based smoke recognition still has shortcomings. On the one hand, many existing algorithms are simple applications of CNNs to smoke-image recognition and ignore the motion characteristics of smoke; even the few algorithms that consider smoke diffusion still extract motion features with traditional image processing, which is complex and discards much of the high-frequency detail in smoke images. On the other hand, most existing methods are built on aging deep-learning models whose smoke-detection accuracy can no longer be pushed further, while newer models with better detection performance have not been widely applied. Meanwhile, training data sets cover only a single scene, so the resulting models generalize poorly and lack robustness.
Disclosure of Invention
To address the low accuracy and poor robustness of existing smoke identification methods in the face of smoke's variable appearance, the invention provides a fire video smoke identification method based on a time-space domain dual channel: dynamic and static smoke features are extracted, respectively, by an improved 3D convolutional neural network and by residual attention blocks (RABs), and are then fused, so that smoke can be effectively identified and early warnings raised in real time.
To achieve this purpose, the technical scheme of the invention is realized as follows. A fire video smoke identification method based on a time-space domain dual channel comprises the following steps:
Step one: collect and produce a smoke data set containing cloud- and fog-interference images and videos;
Step two: build a static feature extraction network and a dynamic feature extraction network, connect them through a fusion component, and construct a video smoke identification network model;
Step three: train the video smoke identification network model constructed in step two with the smoke data set from step one to obtain an optimized model;
Step four: process smoke video collected in real time with the optimized model: the static feature extraction network extracts static features in the image's spatial domain, the dynamic feature extraction network extracts dynamic features in the time domain, the two are fused into smoke features, and the smoke features are classified to judge whether smoke is present; if smoke is present, an alarm is raised.
The smoke data set in step one comprises smoke images and videos from forest, field, indoor, playground, construction-site, urban, and highway scenes, with a variety of positive and negative samples in both image and video form.
The video smoke identification network model comprises a static feature extraction network and a dynamic feature extraction network connected in parallel; both are connected to a fusion component, and the fusion component is connected to a fully connected unit. The fusion component adopts an adaptive fusion method, using the learning capacity of the neural network to reassign weights to the fused features.
The fusion component comprises a feature fusion unit that combines the static features extracted by the static feature extraction network with the dynamic features extracted by the dynamic feature extraction network: 3D global average pooling produces a 1×(n+k) feature vector I, which is reshaped into a feature matrix; convolving the feature matrix yields a weight matrix, which is reshaped back into a 1×(n+k) weight vector II. The weight vector II is multiplied with the corresponding static and dynamic features to obtain the fused features, which are fed into the fully connected layer of the fully connected unit.
The fusion rule of the video smoke identification network model is

F̂ = ω_st ⊗ F_st + ω_dy ⊗ F_dy

where the weight parameters ω_st and ω_dy are learned autonomously through backpropagation, F_st and F_dy denote the static and dynamic features being fused, and F̂ is the fused feature.
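Read this way, the adaptive fusion amounts to per-channel reweighting of the concatenated static and dynamic features. A minimal sketch, with all shapes assumed for illustration and the learned weight vector passed in directly rather than produced by pooling and convolution:

```python
import numpy as np

def adaptive_fuse(static_feats, dynamic_feats, weights):
    """Reweight and concatenate static (n-channel) and dynamic (k-channel)
    feature maps with a learned 1x(n+k) weight vector.

    Shapes (assumed for illustration): static_feats (n, H, W),
    dynamic_feats (k, H, W), weights (n + k,).  In the model described
    above, `weights` would be produced by 3D global average pooling
    followed by convolution; here it is supplied directly.
    """
    fused = np.concatenate([static_feats, dynamic_feats], axis=0)  # (n+k, H, W)
    return fused * weights[:, None, None]  # per-channel reweighting
```

Because the weights are learned, the model itself decides how much each static or dynamic channel contributes, which is the point of the adaptive scheme.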
The static feature extraction network is built from residual attention blocks: twelve residual attention blocks are connected in sequence, with one pooling layer after every two blocks. The residual attention blocks all use 3×3 convolution kernels with stride 1; the pooling layers all use 2×2 max pooling with stride 2. The static feature extraction network uses the ReLU non-saturating nonlinear activation function.
The residual attention block comprises a channel attention unit, a spatial attention unit, and a residual structure. The input feature map X passes through a convolution in the trunk branch to the channel attention unit; the channel attention unit's output is passed to the spatial attention unit, producing the reweighted feature map X̂. Adding X̂ to the input feature map X carried by the skip branch gives the output feature map F_st(X).
The dynamic feature extraction network is a weighted 3D convolutional neural network comprising a feature extraction module and an attention module. The feature extraction module comprises at least two feature extraction units connected in sequence, each containing a 3D convolution layer followed by a pooling layer. The feature maps output by the feature extraction module are convolved to obtain, respectively, feature maps F = {f_1, f_2, ..., f_k} and attention maps A = {a_1, a_2, ..., a_k}. Each attention map a_i is used as a weight and multiplied, in order, with the corresponding feature map f_i to obtain the dynamic features f̂_i = a_i ⊗ f_i.
The feature extraction module of the dynamic feature extraction network comprises five feature extraction units; the 3D convolution layers all use 3×3×3 kernels with stride 1×1×1 and padding 1, and the pooling layers use 2×2×2 3D max pooling.
Compared with the prior art, the invention has the following beneficial effects: a static feature extraction network built from RABs enhances the static smoke features of the spatial domain in the image; for the diffusion characteristic of smoke, a weighted 3D convolutional neural network (W-3D CNN) is proposed to extract the dynamic smoke features in the time domain; and an adaptive feature fusion method fuses the static and dynamic smoke features for final identification. In the experimental stage, the proposed recognition model was evaluated on a custom data set; the results show an accuracy of 98.87% and a false-alarm rate of 1.06%, providing a practical, feasible, and more accurate method for smoke recognition.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a schematic structural diagram of a static feature extraction network according to the present invention.
FIG. 3 is a diagram of the residual attention block of FIG. 2 according to the present invention.
Fig. 4 is a schematic structural diagram of a dynamic feature extraction network according to the present invention.
Fig. 5 is a schematic structural diagram of a video smoke recognition network model according to the present invention.
FIG. 6 shows sample images from the experiments, where (a) is a missed-detection sample and (b) is a false-detection sample.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without creative effort based on these embodiments fall within the scope of the present invention.
As shown in FIG. 1, a fire video smoke identification method based on a time-space domain dual channel comprises the following steps.
the method comprises the following steps: and collecting and making a smoke data set containing cloud and fog interference images.
To enhance the robustness of the smoke recognition network model, smoke data sets were collected and produced from multiple scenes, including forests, fields, indoor settings, playgrounds, construction sites, cities, and roads, for training and testing the network. The images and videos in the data set contain cloud and fog interference, and the data set includes image and video data for a variety of positive and negative samples, alleviating the shortage of smoke data sets. A positive sample is an image with smoke; a negative sample is an image without smoke.
Step two: build a static feature extraction network and a dynamic feature extraction network and connect them through fusion to construct the video smoke identification network model.
The video smoke identification network model is a dual-channel network over static and dynamic smoke features that can detect smoke even under smoke-like interference. The static feature extraction network is based on the residual attention block, which improves the network's learning capacity and avoids treating the high-frequency and low-frequency information in smoke images equally, a practice that keeps a network from reaching its full performance.
The static feature extraction network extracts deeper static smoke features and improves identification accuracy. Color, texture, edges, and similar characteristics in smoke images are the main features for effective smoke identification. To extract more salient static smoke features, the network is built from RABs: as shown in FIG. 2, twelve residual attention blocks are connected in sequence, with one max pooling layer after every two blocks, and the last block is followed in turn by fully connected layers fc1 and fc2.
The static feature extraction network is thus a stack of RAB and pooling layers, with fully connected layers at the end to collect and classify the smoke features. On the one hand, the RAB's channel attention searches the smoke feature map for stronger channel responses and strengthens their expressiveness; on the other hand, its spatial attention enhances the more informative content of the smoke features and suppresses the uninformative content, focusing the network on smoke objects. The pooling layers shrink the smoke feature maps, reducing dimensionality and removing redundant information. Reassigning weights to the original feature map yields a feature map with attention, which is how spatial attention is added. Pooling also lets the network keep extracting effective features after rigid transformations of the image: if the input undergoes a rigid transformation such as flipping, rotation, or scaling, the model can still recognize the smoke. In other words, simple transformations of the input do not change the model's result.
The input to the network is a 256×256-pixel RGB image. The RABs use 3×3 convolution kernels; with stride 1, an RAB layer does not change the input size. The pooling layers use 2×2 max pooling. Each time the network passes a stride-2 pooling layer, the image size halves, so the number of convolution kernels in the RABs after that pooling layer doubles: shrinking the feature map loses information, and doubling the number of feature maps compensates for that loss. During backpropagation the gradient attenuates at every layer until it almost vanishes, making training converge ever more slowly; to counter vanishing gradients and speed up convergence, the static feature extraction network uses the ReLU non-saturating nonlinear activation.
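The halving/doubling rule above can be traced with simple arithmetic; the starting channel count of 64 is an illustrative assumption, since the description does not state it:

```python
def static_net_shapes(in_size=256, in_channels=64, n_pools=6):
    """Trace (spatial size, kernel count) through the static branch.

    Twelve RABs with a stride-2, 2x2 max pool after every two give six
    pooling stages: each pool halves the spatial size, and the number of
    RAB convolution kernels after it doubles.  The initial channel count
    (64) is assumed for illustration.
    """
    shapes = [(in_size, in_channels)]
    size, ch = in_size, in_channels
    for _ in range(n_pools):
        size //= 2
        ch *= 2
        shapes.append((size, ch))
    return shapes
```

Starting from a 256×256 input, the spatial size drops to 4×4 after all six pooling stages.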
For high-resolution images the depth of a CNN is crucial, but very deep networks are hard to train. Low-resolution images and features contain abundant low-frequency information, yet treating all channels equally hinders the network's representational capacity. The residual attention block (RAB) combines channel attention, spatial attention, and a residual structure: it finds stronger static representations in the smoke feature map, enhances informative content, suppresses uninformative content, and thereby improves the accuracy of smoke identification.
The residual attention block (RAB) extracts deeper pixel features of smoke. It is built by stacking attention modules with skip connections, as shown in FIG. 3, so that the model focuses on the smoke in the image and ignores unimportant regions; conv denotes the convolution operation. The input feature X is convolved and passed through the spatial attention layer to obtain an intermediate feature map, which then passes through the channel attention layer to obtain the reweighted feature map X̂; adding the input feature X carried by the residual structure to X̂ gives the output feature F_st(X). Each RAB thus has two parts: a skip branch (mask branch) and a trunk branch. The trunk branch concentrates on learning the high-frequency information in the image, extracting refined features through the successive attention modules; the skip branch implements the residual connection, accelerating network training and letting rich low-frequency information pass directly through the skip connections.
The residual attention block is computed as

F_st(X) = X̂ + X     (1)

where X̂ denotes the feature map after weight reassignment (weights are reassigned by multiplying the feature map by a weight matrix), and X denotes the output feature of the previous network layer, which is passed onward through both the trunk branch and the skip branch. F_st(X) denotes the output feature of the residual attention block.
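Equation (1) reduces to an elementwise reweighting plus a skip connection; a minimal sketch, with the channel- and spatial-attention products collapsed into a single precomputed weight map:

```python
import numpy as np

def residual_attention_block(x, attention_weights):
    """Compute F_st(X) = X_hat + X, where X_hat is the input feature map
    rescaled by attention weights.  `attention_weights` stands in for the
    product of channel and spatial attention, which in the real block
    would be computed from X itself rather than passed in.
    """
    x_hat = x * attention_weights  # redistribute weights over the feature map
    return x_hat + x               # skip connection carries X through unchanged
```

Even when the attention weights are near zero, the skip connection guarantees the original (low-frequency) signal survives, which is the stated purpose of the residual structure.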
On the one hand, the color, texture, and edge features of smoke resist interference poorly and are easily confused with clouds, shadows, and fog, reducing identification accuracy; on the other hand, the shape and color of smoke change continuously from generation to diffusion, and an ordinary neural network cannot perceive these temporal variations. A 3D convolutional neural network (3D CNN) takes a continuous video sequence as input and slides its convolutions over the three dimensions of frame width, frame height, and frame count, so it can effectively extract information about moving objects in video and be applied flexibly to complex tasks. Based on an analysis of false-detection and missed-detection samples, a smoke dynamic feature extraction network based on a 3D CNN is proposed: it uses the motion characteristics of smoke to better discriminate smoke from similar interfering targets such as clouds and fog, overcoming the high complexity and weak interference resistance of motion-feature extraction with traditional image processing.
To further improve the accuracy and robustness of smoke detection, a 3D convolutional network extracts the motion features of smoke during diffusion, distinguishing smoke from smoke-like objects; meanwhile, an attention mechanism in the network learns corresponding feature weights for smoke at different stages, increasing the generalization ability of smoke identification so that smoke can be recognized at any stage and a fire discovered as early as possible. The structure of the dynamic feature extraction network is shown in FIG. 4: it comprises a feature extraction module and an attention module, the feature extraction module containing five feature extraction units in sequence, each with a 3D convolution layer and a pooling layer. The output of the feature extraction module is convolved to produce, respectively, the feature maps and the input to the attention module; the attention module turns its input into attention maps by convolution, the attention maps are multiplied in order with the feature maps to obtain the feature mapping, and the feature mapping is finally processed by a fully connected layer to generate the dynamic features.
The dynamic feature extraction network consists of a feature extraction module and an attention module, and the end of the network classifies the collected dynamic smoke features through a full connection layer. A video sequence of 25 × 3 × 256 × 256 is taken as the network input; the feature extraction module uses 3 × 3 × 3 3D convolution kernels with stride 1 × 1 × 1 and padding 1, and each pooling layer is a 2 × 2 × 2 3D max pooling. As in the static network, whenever a pooling layer reduces the input size to half, the number of convolution kernels doubles.
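The shape arithmetic of the five conv-pool units can be traced without any deep-learning framework. In the sketch below the starting kernel count of 64 is an assumption (the text only states that the kernel count doubles when pooling halves the input), and the temporal size is clamped at one frame once the clip becomes shorter than the pooling window:

```python
def dynamic_net_shapes(in_shape=(3, 25, 256, 256), base_channels=64, units=5):
    """Trace (channels, frames, height, width) through five
    conv3d(3x3x3, stride 1, padding 1) + maxpool3d(2x2x2) units.

    The 3x3x3 convolution with padding 1 preserves every dimension, and
    each 2x2x2 pool halves frames, height and width (floor division).
    `base_channels` and the clamp of the temporal size at 1 are
    assumptions made for this sketch.
    """
    c, t, h, w = in_shape
    shapes = []
    out_c = base_channels
    for _ in range(units):
        # conv: c -> out_c channels, sizes unchanged thanks to padding 1
        # pool: halve every axis, but keep at least one frame
        t, h, w = max(1, t // 2), h // 2, w // 2
        shapes.append((out_c, t, h, w))
        out_c *= 2  # kernel count doubles as the map shrinks
    return shapes

for s in dynamic_net_shapes():
    print(s)
# under these assumptions the trace ends at (1024, 1, 8, 8)
```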
The dynamic feature extraction network extracts the feature maps $F = \{f_1, f_2, \ldots, f_k\}$ and the attention maps $A = \{a_1, a_2, \ldots, a_k\}$, and multiplies each feature map $f_i$ by its corresponding attention map $a_i$ to generate the final feature maps $\hat{F} = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_k\}$, as in equation (2):

$$\hat{f}_i = f_i \otimes a_i, \quad i = 1, 2, \ldots, k \tag{2}$$

The intent is twofold: on one hand each weight map should represent the different degrees of importance of each channel of the 3D feature map, and on the other hand it should enhance the target region of interest in each 3D feature map.
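The attention weighting of equation (2) amounts to an element-wise product between each feature map and its attention map. A minimal NumPy illustration (shapes chosen arbitrarily for the example):

```python
import numpy as np

def apply_attention(features, attention):
    """Per-map attention weighting as in equation (2): each 3D feature
    map f_k is multiplied element-wise by its attention map a_k."""
    assert features.shape == attention.shape  # both (k, T, H, W)
    return features * attention

k, T, H, W = 4, 2, 8, 8
F = np.random.rand(k, T, H, W)
A = np.random.rand(k, T, H, W)   # in practice produced by the attention module
F_hat = apply_attention(F, A)
print(F_hat.shape)  # (4, 2, 8, 8)
```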
The dynamic feature extraction network adds 3D attention connections to the 3D convolution, yielding a weighted 3D convolutional neural network (W-3D CNN) that extracts the smoke motion information between adjacent frames of the video image more fully and efficiently. By exploiting this motion information, the interference of cloud, fog, shadow and other smoke-like factors can be eliminated more effectively and the false alarm rate reduced.
Common feature fusion modes include concat fusion, add fusion and max fusion. Add fusion and max fusion require the feature maps to have the same scale, which greatly limits the flexibility of the feature maps a neural network can output, while concat fusion simply splices the feature maps and ignores the correlation between them; as a result, feature fusion performs poorly in some scenarios. To solve these problems, the present invention provides an adaptive fusion method that uses the learning ability of the neural network to redistribute weights over the feature maps, so that the more important feature maps are determined by model learning. This reduces manual intervention, removes the requirement that fused feature maps share the same resolution, and improves the flexibility of the network. First, the full connection layers of the static feature extraction network and the dynamic feature extraction network are removed; the extracted features are then input into a fusion component for feature fusion. The overall structure of the network is shown in fig. 5.
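The constraints of the three common fusion modes can be seen in a few lines of NumPy: add and max fusion need identical shapes, while concat only stacks along the channel axis (toy shapes, illustration only):

```python
import numpy as np

def add_fuse(a, b):
    return a + b             # requires identical shapes

def max_fuse(a, b):
    return np.maximum(a, b)  # requires identical shapes

def concat_fuse(a, b):
    # stacks along the channel axis; channel counts may differ,
    # but no relation between the maps is modeled
    return np.concatenate([a, b], axis=0)

fa = np.random.rand(8, 4, 4)   # 8 channels of 4x4 features
fb = np.random.rand(8, 4, 4)
print(add_fuse(fa, fb).shape)     # (8, 4, 4)
print(max_fuse(fa, fb).shape)     # (8, 4, 4)
print(concat_fuse(fa, fb).shape)  # (16, 4, 4)
```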
The fusion component applies 3D global average pooling to the n sets of static feature maps and the k sets of dynamic features respectively, obtaining a 1 × (k + n) real matrix. To reduce the number of parameters in the network, a convolutional layer is used to obtain the corresponding weight matrix: reshape converts the 1 × (k + n) real matrix into a feature matrix, and unreshape converts the convolved feature matrix back into a 1 × (k + n) weight matrix. Finally the fused features are input into the terminal full connection layer for smoke identification. The fusion process is

$$\hat{F} = W_{st} \otimes F_{st} \oplus W_{dy} \otimes F_{dy}$$

wherein the parameters $W_{st}$ and $W_{dy}$ are obtained by the network through back-propagation autonomous learning, $F_{st}$ and $F_{dy}$ respectively denote the static features and dynamic features to be fused, and $\hat{F}$ is the fused feature.
The adaptive feature fusion method fuses the static features of the smoke image and the dynamic features of the smoke video through automatic selection by the network, generating the features used for smoke identification with greater flexibility.
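The fusion pipeline, 3D global average pooling to a 1 × (k + n) descriptor, a learned mapping to weights, and per-map rescaling, can be sketched as follows. The softmax stand-in for the patent's reshape-conv-unreshape weight mapping is an assumption made purely for illustration; note that the fused maps need not share a resolution:

```python
import numpy as np

def adaptive_fuse(static_maps, dynamic_maps):
    """Sketch of the adaptive fusion component.

    1. 3D global average pooling turns the n static and k dynamic
       maps into a single 1 x (k + n) descriptor.
    2. In the patent, a small convolution (reshape -> conv -> unreshape)
       turns this descriptor into a 1 x (k + n) weight matrix; here a
       softmax over the descriptor stands in for that learned mapping
       (an assumption, purely for illustration).
    3. Each map is rescaled by its weight; the fused set then feeds
       the terminal fully connected layer.
    """
    maps = list(static_maps) + list(dynamic_maps)   # n + k maps, any sizes
    gap = np.array([m.mean() for m in maps])        # 1 x (k+n) descriptor
    w = np.exp(gap) / np.exp(gap).sum()             # stand-in weight vector
    return [wi * m for wi, m in zip(w, maps)], w

static = [np.random.rand(4, 16, 16) for _ in range(3)]  # n = 3 static maps
dynamic = [np.random.rand(2, 8, 8) for _ in range(2)]   # k = 2 dynamic maps
fused, w = adaptive_fuse(static, dynamic)
print(len(fused), w.shape)  # 5 (5,)
```

The static and dynamic maps above deliberately have different resolutions, which is the flexibility the adaptive method claims over add and max fusion.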
Step three: and (4) training the video smoke recognition network model constructed in the step two through the smoke data set in the step one to obtain the optimized video smoke recognition network model.
The specific training method is gradient descent, which realizes the updating of the training parameters.
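The gradient-descent update itself is simple to state. The toy loop below trains a logistic smoke/no-smoke classifier over synthetic "fused feature" vectors; it stands in for the full network training, and every name and number here is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, b, X, y, lr=0.1):
    """One gradient-descent update of a logistic classifier:
    parameters move against the gradient of the cross-entropy loss."""
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # dL/dw
    grad_b = np.mean(p - y)           # dL/db
    return w - lr * grad_w, b - lr * grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))       # 64 stand-in fused feature vectors
y = (X[:, 0] > 0).astype(float)    # toy smoke / no-smoke labels
w, b = np.zeros(8), 0.0
for _ in range(200):
    w, b = train_step(w, b, X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```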
Step four: processing images in the smoke video acquired in real time by using the optimized video smoke identification network model, extracting static features on a spatial domain by using a static feature extraction network, extracting dynamic features on a time domain by using a dynamic feature extraction network, fusing the static features and the dynamic features to generate smoke features, identifying the smoke features, judging whether smoke exists or not, and giving an alarm if smoke exists.
The adaptive feature fusion method fuses the static features of the smoke image and the dynamic features of the smoke video to generate the features used for smoke identification.
Since the fused feature map is a multi-dimensional matrix and the input of the fully connected layer is a 1-dimensional vector, common practice is to stretch the multi-dimensional matrix into a vector as the input of the fully connected layer. The fully connected layer outputs a vector [y1, y2] of two elements, where y1 and y2 are decimals between 0 and 1: y1 represents the probability that smoke is present and y2 the probability that no smoke is present. If y1 is greater than y2 (e.g. [0.8, 0.2]), the recognition result is smoke; if y2 is greater than y1 (e.g. [0.3, 0.7]), the recognition result is no smoke. There are thus two possible results: smoke or no smoke.
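The flatten, fully connected layer, and [y1, y2] decision described above can be sketched directly. The weights here are random and untrained, so the printed label is arbitrary; the point is only the shape flow and the decision rule:

```python
import numpy as np

def classify(fused_map, W, b):
    """Stretch the multi-dimensional fused feature map into a vector,
    apply the fully connected layer, and read off [y1, y2]:
    y1 = probability of smoke, y2 = probability of no smoke."""
    x = fused_map.reshape(-1)        # flatten to a 1-D vector
    logits = W @ x + b               # fully connected layer
    e = np.exp(logits - logits.max())
    y = e / e.sum()                  # softmax -> probabilities
    return y, ("smoke" if y[0] > y[1] else "no smoke")

rng = np.random.default_rng(1)
fmap = rng.normal(size=(4, 2, 8, 8))   # hypothetical fused feature map
W = rng.normal(size=(2, fmap.size))    # untrained weights, demo only
y, label = classify(fmap, W, np.zeros(2))
print(y, label)
```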
To verify the validity of the RAB model proposed by the present invention, the static feature extraction network was trained and tested on data set set1 (the image portion of the data set; the set2 used later is the video portion). For further comparison, nine mainstream and advanced network models were selected for comparison experiments against the static feature extraction network: AlexNet, ZF-Net, GoogLeNet, VGG16, ResNet, DenseNet, SE-Net, SqueezeNet and MobileNet. These 10 models were trained and tested on data set set1, and the specific experimental results are shown in table 1. As can be seen from table 1, the RAB model achieves excellent performance on all evaluation indexes.
TABLE 1 comparison of different smoke image recognition algorithms
The experimental data show that the static feature extraction network with the RAB module achieves the highest accuracy of 96.61%, the highest recall rate of 95.69%, and the lowest false alarm rate of 4.45%. Compared with the second-ranked SE-Net structure, the accuracy is 1.46% higher, the recall rate 0.84% higher, and the false alarm rate 2.0% lower, indicating that the RAB module is more sensitive to smoke images and that the system is more reliable when raising a smoke alarm.
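For reference, the evaluation indexes reported in Tables 1 and 2 are conventionally computed from confusion-matrix counts as below. These definitions are the standard ones and are assumed here, since the patent does not spell them out; the counts in the example are made up:

```python
def metrics(tp, fp, tn, fn):
    """Standard index definitions assumed for the reported results:
    accuracy         = (TP + TN) / all samples
    recall           = TP / (TP + FN)   (smoke samples correctly flagged)
    false alarm rate = FP / (FP + TN)   (smoke-free samples mis-flagged)
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    far = fp / (fp + tn)
    return acc, recall, far

# made-up confusion counts, illustration only
acc, rec, far = metrics(tp=95, fp=5, tn=95, fn=5)
print(f"acc={acc:.2%} recall={rec:.2%} FAR={far:.2%}")
```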
Through multiple rounds of training and testing of the RAB module, analysis of the false-detection and missed-detection samples in each experimental result found that 68%-75% of them are interference samples of cloud, fog and pure background, as shown in fig. 6. Therefore, to further reduce the false alarm rate and improve the accuracy and robustness of smoke identification, the dynamic features of smoke are extracted with the W-3D CNN, smoke identification is performed with the fused features, and the effectiveness of the whole method is verified by the experimental data.
To verify the overall performance of the proposed video smoke recognition network, data set set2 is used for an overall comparison experiment between the optimized, fused smoke recognition network model of the invention and VGG16, ResNet, DenseNet, SE-Net, and two better-performing methods from the literature: method [1] ("Real-time Video-based Smoke Detection with High Accuracy and Efficiency") and method [2] ("Visual Smoke Detection Based on Ensemble Deep CNNs"). The experimental results are shown in Table 2.
Table 2 comprehensive comparison of the present invention with mainstream network
As can be seen from the data in Table 2, VGG16 has the lowest accuracy (ACC) and recall and the highest false alarm rate (FAR); ResNet and SE-Net perform better than VGG16 but worse than the other methods, which again indicates that generic deep-learning recognition models cannot perform well in complex tasks such as video smoke recognition. Methods [1] and [2] extract the motion characteristics of smoke with traditional hand-crafted algorithms; although these can effectively extract the low-frequency information in smoke, part of the high-frequency information is ignored, so they do not show optimal performance. The proposed network based on 3D convolution and residual attention not only extracts the static information with stronger representation capability in smoke, but also, through the improved 3D CNN, simultaneously extracts the high-frequency and low-frequency information in the smoke motion characteristics and redistributes their weights, effectively reducing the interference of smoke-like targets. Its accuracy of 98.73% and recall rate of 98.24% are the highest, and its false alarm rate is reduced to the lowest, 1.06%. Compared with the performance of the static feature extraction network alone in Table 2, adding the W-3D CNN improves the accuracy by 2.12% and the recall rate by 2.55%, and lowers the false alarm rate by 3.39%. Multiple experiments also show that the detection rate of the invention averages 48 frames per second, meeting the 25-30 frames per second required for real-time detection of ordinary surveillance video.
In conclusion, the proposed method clearly leads the other methods on all three indexes and can effectively identify smoke and raise early warnings in real time. A large number of experiments and comparisons show that the proposed method achieves higher detection accuracy (98.73%) and recall rate (98.24%) and a lower false alarm rate (1.06%) than existing methods.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A fire video smoke identification method based on time-space domain double channels is characterized by comprising the following steps:
the method comprises the following steps: collecting and making a smoke data set containing cloud and fog interference images and videos;
step two: building a static characteristic extraction network and a dynamic characteristic extraction network, carrying out fusion connection on the static characteristic extraction network and the dynamic characteristic extraction network, and constructing a video smoke identification network model;
step three: training the video smoke recognition network model constructed in the step two by using the smoke data set in the step one to obtain an optimized video smoke recognition network model;
step four: processing the smoke video acquired in real time by using the optimized video smoke identification network model, extracting static characteristics on an image space domain by using a static characteristic extraction network, extracting dynamic characteristics on a time domain by using a dynamic characteristic extraction network, fusing the static characteristics and the dynamic characteristics to generate smoke characteristics, identifying the smoke characteristics, judging whether smoke exists or not, and giving an alarm if the smoke exists.
2. The fire video smoke identification method based on the time-space domain two-channel according to claim 1, wherein the smoke data set in step one comprises smoke images and videos from forest, field, indoor, playground, construction-site, city and road scenes, and both the images and the videos comprise a plurality of positive samples and negative samples.
3. The fire video smoke identification method based on the time-space domain two-channel according to claim 1 or 2, wherein the video smoke identification network model comprises a static feature extraction network and a dynamic feature extraction network which are connected in parallel, the static feature extraction network and the dynamic feature extraction network are both connected with a fusion component, and the fusion component is connected with a full-connection unit; the fusion component adopts a self-adaptive fusion method and utilizes the learning capability of the neural network to redistribute the weight for the fusion characteristics.
4. The fire video smoke identification method based on the time-space domain two-channel according to claim 3, wherein the fusion component comprises a feature fusion unit, and the feature fusion unit combines the static features extracted by the static feature extraction network and the dynamic features extracted by the dynamic feature extraction network as follows: a 1 × (n + k) feature vector I is obtained by 3D global average pooling and converted by recombination into a feature matrix I; the feature matrix I is convolved to obtain a weight matrix II; the weight matrix II is converted into a 1 × (n + k) weight vector II; the weight vector II is multiplied with the corresponding static features and dynamic features to obtain the fused features, and the fused features are input into the full connection layer of the full connection unit.
5. The fire video smoke identification method based on the time-space domain two-channel according to claim 4, wherein the fusion method of the video smoke identification network model is

$$\hat{F} = W_{st} \otimes F_{st} \oplus W_{dy} \otimes F_{dy}$$

wherein the parameters $W_{st}$ and $W_{dy}$ are obtained by back-propagation autonomous learning, $F_{st}$ and $F_{dy}$ respectively denote the static features and dynamic features to be fused, and $\hat{F}$ is the fused feature.
6. The fire video smoke identification method based on the time-space domain two-channel according to any one of claims 3 to 5, wherein the static feature extraction network is built from residual attention modules: 12 residual attention modules are connected in sequence, with 1 pooling layer after every 2 residual attention modules; the residual attention blocks all adopt convolution kernels of size 3 × 3 with stride 1; the pooling layers are all 2 × 2 max pooling with stride 2; and the static feature extraction network adopts the ReLU non-saturating activation function.
7. The fire video smoke identification method based on the time-space domain two-channel according to claim 6, wherein the residual attention block comprises a channel attention unit, a spatial attention unit and a residual structure; the input feature map $X$ is transmitted through a convolution operation in the trunk branch to the channel attention unit; the feature map $X'$ obtained from the channel attention unit is transmitted to the spatial attention unit, which produces the feature map $X''$; and the reweighted feature map $X''$ is added to the input feature map $X$ carried by the skip branch to obtain the output feature map $\hat{X} = X'' + X$.
8. The fire video smoke identification method based on the time-space domain two-channel according to any one of claims 3-5 and 7, wherein the dynamic feature extraction network is a weighted 3D convolutional neural network comprising a feature extraction module and an attention module; the feature extraction module comprises at least two sequentially connected feature extraction units, each comprising a 3D convolution layer and a pooling layer connected in sequence; the feature maps processed by the feature extraction module undergo a convolution operation to obtain, respectively, the feature maps $F = \{f_1, f_2, \ldots, f_k\}$ and the attention maps $A = \{a_1, a_2, \ldots, a_k\}$; and each attention map $a_k$ in $A$ is used as a weight and multiplied in sequence with the corresponding feature map $f_k$ in $F$ to obtain the dynamic features $\hat{F} = \{\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_k\}$, with $\hat{f}_k = f_k \otimes a_k$.
9. The fire video smoke identification method based on the time-space domain two-channel according to claim 8, wherein the feature extraction module of the dynamic feature extraction network comprises 5 feature extraction units; the 3D convolution layers in the feature extraction units all adopt 3D convolution kernels of size 3 × 3 × 3 with stride 1 × 1 × 1 and padding 1; and pooling layer II is a 2 × 2 × 2 3D max pooling.
CN202210215812.2A 2022-03-07 2022-03-07 Fire disaster video smoke identification method based on time-space domain double channels Pending CN114580541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210215812.2A CN114580541A (en) 2022-03-07 2022-03-07 Fire disaster video smoke identification method based on time-space domain double channels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210215812.2A CN114580541A (en) 2022-03-07 2022-03-07 Fire disaster video smoke identification method based on time-space domain double channels

Publications (1)

Publication Number Publication Date
CN114580541A true CN114580541A (en) 2022-06-03

Family

ID=81772655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210215812.2A Pending CN114580541A (en) 2022-03-07 2022-03-07 Fire disaster video smoke identification method based on time-space domain double channels

Country Status (1)

Country Link
CN (1) CN114580541A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973110A (en) * 2022-07-27 2022-08-30 四川九通智路科技有限公司 On-line monitoring method and system for highway weather
CN114973110B (en) * 2022-07-27 2022-11-01 四川九通智路科技有限公司 On-line monitoring method and system for highway weather
CN116740630A (en) * 2023-05-12 2023-09-12 广州铁路投资建设集团有限公司 Construction site fire disaster identification method and device based on depth model fusion frame
CN116895050A (en) * 2023-09-11 2023-10-17 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN116895050B (en) * 2023-09-11 2023-12-08 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN117173854A (en) * 2023-09-13 2023-12-05 西安博深安全科技股份有限公司 Coal mine open fire early warning method and system based on deep learning
CN117173854B (en) * 2023-09-13 2024-04-05 西安博深安全科技股份有限公司 Coal mine open fire early warning method and system based on deep learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination