CN115171047A - Fire image detection method based on lightweight long-short distance attention transformer network - Google Patents

Fire image detection method based on lightweight long-short distance attention transformer network

Info

Publication number
CN115171047A
CN115171047A (application CN202210852895.6A)
Authority
CN
China
Prior art keywords
flame
network
feature
attention
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210852895.6A
Other languages
Chinese (zh)
Inventor
赵亚琴
赵文轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN202210852895.6A priority Critical patent/CN115171047A/en
Publication of CN115171047A publication Critical patent/CN115171047A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Classification, e.g. of video objects
    • G06V 10/77: Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Recognition or understanding using neural networks
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

A fire image detection method based on a lightweight long-short distance attention transformer network first collects flame pictures and then detects them with a flame detection network, comprising the following steps: 1) Processing the flame picture to be detected with a designed lightweight feature extraction backbone network and outputting extracted multi-scale flame features at three different resolutions; 2) Constructing a BiFPN-based feature fusion network to perform feature fusion on the multi-scale flame features and outputting fusion features combining the three resolution layers; 3) The classification layer of the network performs classification prediction on the fusion features and judges whether flame exists and where it is located in the image. In the invention, the lightweight backbone network structure improves detection speed; the transformer with the long-short distance attention mechanism improves detection precision; and the BiFPN feature fusion mechanism improves the ability to detect small target flames in fire images, raising recognition accuracy for early fires and for fire images shot from a long distance.

Description

Fire image detection method based on lightweight long-short distance attention transformer network
Technical Field
The technical scheme belongs to the field of image processing, and particularly relates to a fire detection method that uses image processing and a flame detection neural network.
Background
Among various disasters, fire is one of the major threats to public safety and social development. In environments with many combustibles, such as residential buildings, industrial warehouses and forests, an early fire that is not discovered in time usually causes significant losses. In recent years, fire detection has been a focus of research, and reliable fire detection systems are of great significance for protecting people's lives and property and for protecting ecosystems.
Early sensors based on temperature and smoke respond slowly and are strongly affected by environmental variables. Traditional fire detection methods based on computer vision rely on features such as color and texture defined manually by researchers; such hand-crafted features usually suit only specific scenes and videos, generalize poorly, and give low flame detection accuracy. In recent years, convolutional neural networks (CNNs) have been widely used in the field of flame detection. However, existing flame detection networks directly reuse general object detection networks: they detect small target flames with low accuracy, and suffer from high model complexity, low detection speed and difficult training, which makes them hard to apply to fire scenes in real time.
Disclosure of Invention
The invention provides a fire image detection method based on a lightweight long-short distance attention transformer network.
The network provided by the invention combines a lightweight, efficient convolution-plus-transformer module structure with multi-scale feature learning, a long-short distance attention mechanism and a bidirectional feature fusion strategy, forming a new fire detection framework.
The backbone network provided by the invention extracts multi-scale flame features; the introduced transformer adds global attention to the network and improves its ability to distinguish flames from the background; and the feature fusion network better predicts flames of different sizes, especially small early-stage flames.
the method comprises the following specific steps:
step 1, extracting a flame picture to be detected by using a designed lightweight feature extraction backbone network, and outputting extracted multi-scale flame features with three different resolutions;
step 1.1, using a standard convolution module to carry out feature pre-extraction on flame features in an input flame picture, and establishing an initialized feature tensor for network learning;
step 1.2, performing depth feature extraction on the initialized feature tensor by using four groups of depth separable convolution modules, and reducing the size of a flame feature map;
step 1.3, introducing a transformer module into a depth separable convolution module, and constructing a lightweight transformer module based on a long-short distance attention mechanism, wherein the transformer module not only extracts the global features of the flame image but also ensures that feature extraction is not affected by the shooting distance;
step 1.4, embedding three groups of sequentially connected lightweight transformer modules with the long-short distance attention mechanism into the flame feature extraction backbone network, and extracting local and global features of the flame image;
step 1.5, each group of lightweight transformer modules with the long-short distance attention mechanism halves the size of the input flame feature map and outputs the flame feature map of that resolution layer to the feature fusion module;
the flame feature maps of the current picture are processed through step 1.4 and step 1.5 to obtain three flame feature maps with different resolutions, which are sent to step 2 for processing;
step 2, constructing a BiFPN-based feature fusion network to perform feature fusion processing on the received multi-scale flame features, and outputting fusion features of three different resolution layers;
step 2.1, in the feature fusion network, sequentially carrying out deconvolution up-sampling and pooling down-sampling on the three flame feature maps with different resolutions obtained in step 1, first from low resolution to high and then from high to low, and cascading inputs and outputs to further fuse the flame feature information of the different convolution layers;
step 2.2, repeating step 2.1 twice to finally obtain the three feature maps of the flame to be detected;
step 3, classifying and locating the flames on the three resolution layers with a prediction network, sorting the scores of the output results, and finally obtaining the final prediction result through a non-maximum suppression layer;
the invention has the beneficial effects that:
(1) The invention provides a lightweight transformer neural network based on a long-short distance attention mechanism and applied to fire flame detection. The main framework of the network draws on Lite Transformer and MobileViT and optimizes the way the convolution layers and the transformer layers are fused. Specifically, before computing the self-attention module, the invention splits the feature map along the channel dimension into a convolution branch and a transformer branch, which are then merged together. This design allows the model to avoid the high complexity and difficult fitting caused by using only a transformer on a fire data set;
(2) The invention introduces a long-short distance attention block in the backbone network to make up for the performance loss that reducing the number of parameters may cause. Because of its limited receptive field, a CNN module can only obtain short-distance attention, which is unfavorable for extracting fire characteristics, whereas the Transformer can attend to distant objects regardless of distance. By introducing long-short distance attention blocks into the backbone network, the local and global characteristics of the fire video captured by the network are not affected by the shooting distance, which improves the accuracy of fire detection;
(3) To improve the utilization of multi-scale fire characteristics, the invention uses BiFPN to fuse the features of different resolution layers and uniformly scales the resolution, depth and width of all backbones, which helps the network handle large-scale and small-scale fires simultaneously and distinguish fire objects from the background;
(4) In extracting the long-short distance attention, the convolution layer is not nested inside the self-attention module of the Transformer block; instead, the convolution branch and the Transformer branch run in parallel to extract local and global information respectively. Specifically, before computing the self-attention module, the feature map is split along the channel dimension into a convolution branch and a transformer branch, which are then merged together. This design allows the model to avoid the higher complexity and difficult fitting that result from using only a transformer on a fire data set.
Drawings
FIG. 1 is a flow chart of a method of flame detection according to the present invention;
FIG. 2 is a diagram of the flame detection network of the present invention;
FIG. 3 is a block diagram of a depth separable convolution module of the present invention;
FIG. 4 (a) is a diagram showing a conventional structure of a Transformer;
FIG. 4 (b) is a diagram of the lightweight transformer module with the long-short distance attention mechanism according to the present invention;
FIG. 5 is a diagram of the transformer module according to the present invention;
fig. 6 is a picture of the detection result of the fire detection method of the present invention for a picture of a fire that is difficult to recognize.
Detailed Description
Aiming at the low detection precision, low speed and high computation cost caused by directly reusing existing target detection networks in machine-vision fire detection methods, the invention provides a lightweight transformer neural network based on a long-short distance attention mechanism for real-time fire detection. The method adopts a lightweight backbone network to extract the multi-scale characteristics of the flame at low computation cost; a transformer mechanism is introduced into the convolution layers, and a long-short distance attention block is constructed to extract distance-independent global attention features and help the network distinguish flames from the background; the feature fusion module processes the multi-scale features extracted by the backbone and improves the detection of fires of different scales, especially small early-stage fires.
Compared with existing deep-learning flame detection networks, the invention designs a lightweight backbone network, which improves detection speed; adopts a transformer mechanism with long-short distance attention, which improves detection precision; and uses the feature fusion mechanism BiFPN, which improves the ability to detect small early-stage flames.
The present disclosure will be further described with reference to the accompanying drawings and embodiments.
The method adopts a lightweight transformer flame detection network with long-short distance attention to detect the frame images of the surveillance video of a fire scene.
As shown in FIG. 1, the flame detection network extracts multi-scale flame features based on the long-short distance attention mechanism, fuses the multi-scale flame features, and predicts the flame position. The architecture of the overall network of the present invention is shown in FIG. 2. Specifically, the method comprises the following steps:
step 1, extracting a flame picture to be detected by using a designed lightweight feature extraction backbone network, and outputting extracted multi-scale flame features with three different resolutions;
step 1.1, a standard convolution module is used to pre-extract features from the input flame picture and establish an initialized feature tensor for network learning. The standard convolution module comprises a convolution layer, a batch normalization layer and a SiLU activation layer;
step 1.2, four groups of depth separable convolution modules perform depth feature extraction on the initialized feature tensor and reduce the size of the flame feature map. These depth separable convolution modules share the structure shown in FIG. 3 and differ only in the convolution step size of the convolution layer. As shown in FIG. 3, each depth separable convolution module comprises a point-by-point convolution layer, a depth convolution layer and a channel restoration layer. The point-by-point convolution layer consists of a 1×1 convolution layer, a batch normalization layer and a SiLU layer; the depth convolution layer consists of a 3×3 grouped convolution layer, a batch normalization layer and a SiLU layer; and the channel restoration layer consists of a 1×1 convolution layer and a batch normalization layer. The four depth separable convolution modules comprise two modules with a depth convolution step size of 1 that keep the feature map size unchanged and two modules with a depth convolution step size of 2 that halve the feature map size;
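The sketch below shows one way the depth separable convolution module described above could be written in PyTorch; the expansion ratio of the point-by-point layer and the class name are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class DepthSeparableConv(nn.Module):
    """Point-by-point 1x1 conv -> depth-wise 3x3 grouped conv -> 1x1 channel restoration."""
    def __init__(self, in_ch, out_ch, stride=1, expand=2):
        super().__init__()
        mid_ch = in_ch * expand  # expansion ratio is an assumption
        # point-by-point convolution layer: 1x1 conv + batch norm + SiLU
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU())
        # depth convolution layer: 3x3 grouped conv + batch norm + SiLU
        # (stride 2 halves the feature map, stride 1 keeps its size)
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU())
        # channel restoration layer: 1x1 conv + batch norm
        self.restore = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.restore(self.depthwise(self.pointwise(x)))

# e.g. the stride-2 module of Table 1 that maps a 128x128x32 feature map to 64x64x64
block = DepthSeparableConv(32, 64, stride=2)
print(block(torch.randn(1, 32, 128, 128)).shape)  # torch.Size([1, 64, 64, 64])
```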
step 1.3, a transformer module is introduced into the depth separable convolution module to construct a lightweight transformer module based on the long-short distance attention mechanism, as shown in FIG. 4 (b). The transformer module extracts the global features of the flame image, and feature extraction is not affected by the shooting distance;
(1) The input flame feature map f first undergoes local feature expression through a 3×3 convolution layer and a 1×1 convolution layer; a channel separation operation is then performed on the processed features, and the flame features are divided equally along the channel dimension into two parts f_c and f_t;
(2) A depth separable convolution module processes f_c to extract local feature information; each pixel of the output feature map only obtains the feature information within its receptive field, i.e. local attention. The local receptive field size rf_m obtainable by each pixel is calculated by equation (1):

rf_m = rf_{m-1} + (k_m − 1) × ∏_{n=1}^{m−1} s_n    (1)

where k_m denotes the size of the convolution kernel of the m-th layer and s_n denotes the step size of the n-th layer;
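As a quick check of equation (1), the small sketch below computes the receptive field layer by layer, assuming the usual convention that the receptive field of the raw input is rf_0 = 1.

```python
def receptive_field(kernels, strides):
    """kernels[m-1] and strides[m-1] are k_m and s_m of the m-th convolution layer."""
    rf, jump = 1, 1           # jump tracks prod_{n=1}^{m-1} s_n
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # equation (1)
        jump *= s
    return rf

# two stacked 3x3 convolutions with strides 2 and 1 give a 7-pixel local receptive field
print(receptive_field([3, 3], [2, 1]))  # 7
```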
(3) A transformer module processes f_t to extract global attention information; as shown in FIG. 5, each pixel of the output feature map obtains information from all pixels of the input feature map. The steps are as follows:
1) The input flame feature f_t is first divided into N non-overlapping p×p tiles, where N = HW/p², H and W being the height and width of f_t;
2) Word vector embedding is performed on all the tiles to convert each tile into a word vector of size p²×1, position codes are added, and all the word vectors are spliced into a word vector sequence;
The position coding used in the transformer module adopts sine and cosine coding: odd-numbered word embedding vector elements use cos codes and even-numbered elements use sin codes. The coding is calculated by formula (2):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (2)

where pos is the position of the word vector, d_model is the length of the word vector, and i indexes the elements of the word vector;
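A minimal sketch of formula (2) is given below; the NumPy implementation and the matrix layout are illustrative choices.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sine/cosine position codes: sin for even vector elements, cos for odd ones."""
    pe = np.zeros((num_positions, d_model))
    pos = np.arange(num_positions)[:, None]          # position of each word vector
    i = np.arange(0, d_model, 2)[None, :]            # even element indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                      # even elements
    pe[:, 1::2] = np.cos(angle)                      # odd elements
    return pe

pe = positional_encoding(num_positions=256, d_model=96)  # one code per p x p tile
```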
since the attention relationship captured by the transducer is independent of the relative position relationship of the blocks (word vectors before embedding) of each original flame picture, i.e. the order of the word vectors, and the information of the visual image is closely related to the position relationship of each block, encoding the original positions of all the word vectors of the transducer can help the transducer to consider the order of the vectors when learning global attention.
3) Each word vector is converted through three weight matrices W_Q, W_K, W_V into the query vector q, key vector k and value vector v required for computing attention; the attention a is calculated by equation (3):

a = softmax(q·kᵀ / √d_k)·v    (3)

where d_k is the dimension of the key vector;
4) The output attention sequence is inversely coded to obtain a global attention map with the same size as the original input feature;
(4) The output feature maps of the two parallel branches are cascaded, and channel restoration is performed by one 1×1 convolution layer and one 3×3 convolution layer. The module is expressed by formula (4):

f′ = conv(concat(CNN(f_c), transformer(f_t)))    (4)
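The following simplified PyTorch sketch puts steps (1)–(4) together: the input is split along the channel dimension, f_c passes through a convolution branch (short-distance attention) and f_t through a self-attention branch over p×p tiles (long-distance attention), and the two outputs are cascaded and restored as in formula (4). The patch size, the single-head nn.MultiheadAttention and the omission of a feed-forward sub-layer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LongShortDistanceAttention(nn.Module):
    def __init__(self, channels, patch=2):
        super().__init__()
        half = channels // 2
        self.patch = patch
        # convolution branch: local feature information within the receptive field
        self.conv_branch = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half))
        # transformer branch: global attention over the word vectors of the p x p tiles
        self.attn = nn.MultiheadAttention(half * patch * patch, num_heads=1, batch_first=True)
        # channel restoration after cascading the two branches (formula (4))
        self.restore = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False))

    def forward(self, f):
        f_c, f_t = torch.chunk(f, 2, dim=1)                 # channel separation
        local = self.conv_branch(f_c)                       # short-distance attention
        b, c, h, w = f_t.shape
        p = self.patch
        # split f_t into non-overlapping p x p tiles and flatten each tile into a word vector
        t = f_t.unfold(2, p, p).unfold(3, p, p)             # (b, c, h/p, w/p, p, p)
        t = t.reshape(b, c, (h // p) * (w // p), p * p).permute(0, 2, 1, 3)
        t = t.reshape(b, -1, c * p * p)                     # (b, N, c*p*p)
        g, _ = self.attn(t, t, t)                           # q = k = v: global attention
        # inverse coding: fold the attention output back to the feature-map shape
        g = g.reshape(b, h // p, w // p, c, p, p).permute(0, 3, 1, 4, 2, 5)
        g = g.reshape(b, c, h, w)
        return self.restore(torch.cat([local, g], dim=1))   # cascade + channel restoration

x = torch.randn(1, 96, 32, 32)                  # size of attention transformer layer 1 in Table 1
print(LongShortDistanceAttention(96)(x).shape)  # torch.Size([1, 96, 32, 32])
```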
the overall backbone network structure parameters of this embodiment are shown in table 1:
TABLE 1 backbone network architecture parameters
Input size | Network layer | Step size | Output size
256×256×3 | Standard convolution module | 2 | 128×128×16
128×128×16 | Depth separable convolution module 1 | 1 | 128×128×32
128×128×32 | Depth separable convolution module 2 | 2 | 64×64×64
64×64×64 | Depth separable convolution module 3 | 1 | 64×64×64
64×64×64 | Depth separable convolution module 4 | 1 | 64×64×96
64×64×96 | Depth separable convolution module 4 | 2 | 32×32×96
32×32×96 | Attention Transformer layer 1 | 1 | 32×32×96
32×32×96 | Depth separable convolution module 4 | 2 | 16×16×128
16×16×128 | Attention Transformer layer 2 | 1 | 16×16×128
16×16×128 | Depth separable convolution module 4 | 2 | 8×8×640
8×8×640 | Attention Transformer layer 3 | 1 | 8×8×640
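For illustration only, the sketch below stacks the stages of Table 1 in PyTorch, reusing the DepthSeparableConv and LongShortDistanceAttention sketches given earlier in this description; the hyper-parameters inside those modules remain assumptions, and the three returned maps are the multi-scale flame features passed to the feature fusion network.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Stage sequence following Table 1; outputs the three multi-scale flame feature maps."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                            # standard convolution module
            nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16), nn.SiLU())
        self.stage1 = nn.Sequential(
            DepthSeparableConv(16, 32, stride=1),             # module 1
            DepthSeparableConv(32, 64, stride=2),             # module 2
            DepthSeparableConv(64, 64, stride=1),             # module 3
            DepthSeparableConv(64, 96, stride=1),             # module 4, stride 1
            DepthSeparableConv(96, 96, stride=2))             # module 4, stride 2
        self.attn1 = LongShortDistanceAttention(96)           # attention transformer layer 1
        self.down2 = DepthSeparableConv(96, 128, stride=2)
        self.attn2 = LongShortDistanceAttention(128)          # attention transformer layer 2
        self.down3 = DepthSeparableConv(128, 640, stride=2)
        self.attn3 = LongShortDistanceAttention(640)          # attention transformer layer 3

    def forward(self, x):
        p0 = self.attn1(self.stage1(self.stem(x)))            # 32 x 32 x 96
        p1 = self.attn2(self.down2(p0))                       # 16 x 16 x 128
        p2 = self.attn3(self.down3(p1))                       # 8 x 8 x 640
        return p0, p1, p2

p0, p1, p2 = Backbone()(torch.randn(1, 3, 256, 256))
```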
Step 1.3, introducing a transfromer module into a depth separable convolution module, and constructing a lightweight transformer module based on a long-short distance attention mechanism, wherein the introduction of the transformer module is not only used for extracting the global features of the flame image, but also ensures that the feature extraction is not influenced by the shooting distance;
step 1.4, three groups of sequentially connected lightweight transformer modules with the long-short distance attention mechanism are embedded into the flame feature extraction backbone network, and local and global features of the flame image are extracted;
step 1.5, each group of lightweight transformer modules with the long-short distance attention mechanism halves the size of the input flame feature map and outputs the flame feature map of that resolution layer to the feature fusion module;
the flame feature map of the current picture is processed through step 1.4 and step 1.5 to obtain three flame feature maps with different resolutions, which are sent to step 2 for processing;
step 2, constructing a BiFPN-based feature fusion network to perform feature fusion processing on the received multi-scale flame features, and outputting fusion features of three different resolution layers;
step 2.1, in the feature fusion network, the three flame feature maps of different resolutions obtained in step 1 are sequentially subjected to deconvolution up-sampling and pooling down-sampling, first from low resolution to high and then from high to low, and inputs and outputs are cascaded to fuse the flame feature information of the different convolution layers.
Denote the three flame feature maps output by the backbone network, from high resolution to low, as p_0, p_1 and p_2. First, p_2 is up-sampled to the size of p_1 and cascaded with it to give p_1'; then p_1' is up-sampled to the size of p_0 and cascaded with it to give p_0'; then p_0' is down-sampled to the size of p_1 and cascaded with p_1' to give a new p_1'; the new p_1' is down-sampled to the size of p_2 and cascaded with p_2 to give p_2''; p_2'' is up-sampled once and cascaded with the new p_1' and the original input p_1 to give p_1''; and p_1'' is up-sampled once and cascaded with p_0' to give p_0''. The final outputs p_0'', p_1'' and p_2'' serve as the input of the next BiFPN layer.
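A minimal sketch of one such fusion layer is given below, using nearest-neighbour interpolation for up-sampling and adaptive max pooling for down-sampling in place of the deconvolution and pooling layers of the patent; the per-level convolutions that would normally realign channel counts after each cascade are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def bifpn_layer(p0, p1, p2):
    """p0, p1, p2: flame feature maps from high resolution to low."""
    up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode="nearest")
    down = lambda x, ref: F.adaptive_max_pool2d(x, ref.shape[-2:])
    cat = lambda *xs: torch.cat(xs, dim=1)

    p1_td = cat(up(p2, p1), p1)               # p2 up-sampled to p1 size, cascaded -> p1'
    p0_td = cat(up(p1_td, p0), p0)            # p1' up-sampled to p0 size, cascaded -> p0'
    p1_td = cat(down(p0_td, p1), p1_td)       # p0' down-sampled, cascaded with p1' -> new p1'
    p2_out = cat(down(p1_td, p2), p2)         # new p1' down-sampled, cascaded with p2
    p1_out = cat(up(p2_out, p1), p1_td, p1)   # cascaded with the new p1' and the original p1
    p0_out = cat(up(p1_out, p0), p0_td)       # cascaded with p0'
    return p0_out, p1_out, p2_out             # input of the next BiFPN layer
```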
Step 2.2, repeating the step 2.1 twice to finally obtain three characteristic diagrams of the flame to be detected;
step 3, the flames are classified and located on the three resolution layers by a prediction network, the scores of the output results are sorted, and the final prediction result is obtained through a non-maximum suppression layer. According to the result of network learning, suspected flame areas are activated, and the prediction network processes the three output flame feature maps of the network
(p_0'', p_1'' and p_2''), carrying out flame position information regression and confidence prediction;
step 3.1, the prediction head of each feature map comprises a depth convolution and a convolution regression, and finally five prediction values are output corresponding to each suspected flame area and respectively represent four position coordinates and a confidence coefficient of the predicted flame;
step 3.2, scoring and sorting the results predicted by all the prediction heads, taking out a frame of which the score of each suspected flame judgment area is greater than a set threshold value, and determining that flames exist at the frame;
step 3.3, boxes with a high degree of overlap are identified as the same flame, non-maximum suppression is applied, and the flame area with the highest confidence is finally obtained as the final prediction result.
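A hedged sketch of this post-processing is shown below; it assumes each prediction head has already produced rows of four box coordinates plus a confidence for every suspected flame area, and it uses torchvision's non-maximum suppression. The thresholds are illustrative.

```python
import torch
from torchvision.ops import nms

def decode_predictions(preds, score_thresh=0.5, iou_thresh=0.45):
    """preds: tensor of shape (N, 5) holding (x1, y1, x2, y2, confidence) per suspected area."""
    boxes, scores = preds[:, :4], preds[:, 4]
    keep = scores > score_thresh                     # step 3.2: keep areas above the threshold
    boxes, scores = boxes[keep], scores[keep]
    order = torch.argsort(scores, descending=True)   # score sorting
    boxes, scores = boxes[order], scores[order]
    kept = nms(boxes, scores, iou_thresh)            # step 3.3: merge highly overlapping boxes
    return boxes[kept], scores[kept]                 # highest-confidence flame areas
```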
To demonstrate the advantages of the proposed method, the proposed flame detection method is compared with common flame detection networks on a constructed video data set. The comparison covers prediction accuracy, recall, model complexity and running speed. The experimental results are shown in the following table:
TABLE 2 Comparison of experimental results (the table is reproduced as an image in the original publication)
Note: accuracy is computed as the average precision at an intersection-over-union (IoU) threshold of 0.5; recall is computed as the average of the 10 recall values obtained at IoU thresholds from 0.5 to 0.95 in steps of 0.05.
As can be seen from the comparison of the experimental results in Table 2, the flame detection method provided by the invention has clear advantages in flame detection accuracy and false alarm rate. Although the SSDLite model is simpler, its accuracy is considerably lower; the proposed model has lower complexity than the other flame detection methods and therefore runs faster.
Fig. 6 shows the detection result of the fire detection method proposed by the present invention for a fire picture that is difficult to identify. In fig. 6, the first column shows the result of a plurality of flame regions in one picture, the second column shows the result of a small flame region in a picture of a fire, the third column shows the result of a picture with no obvious flame, and the fourth column shows the result of an interfering object such as light in the picture.
To summarize:
the invention relates to a fire detection method based on an economical and efficient long-and-short-distance attention network real-time fire detection technology. The invention introduces a Transformer which is good for extracting global information in a backbone, and combines the Transformer with CNN to provide a long and short distance attention block (LSB). Compared with a common lightweight backbone network, the backbone network of the present invention achieves the best feature extraction performance on image datasets. Fire detection networks based on this backbone network exhibit significant performance on test data sets. Meanwhile, the invention also introduces a characteristic fusion module BiFPN into the network, thereby improving the detection precision of different-scale fires, especially small-scale fires.
Experiments show that the detection method has obvious advantages in the aspects of accuracy and speed of detecting fires of various scales.

Claims (6)

1. A fire image detection method based on a lightweight long-short distance attention transformer network, comprising the following steps: firstly, collecting a flame picture; then, detecting with a flame detection network; characterized in that detecting the flame picture with the flame detection network comprises the following steps:
1) Processing and inputting a flame picture to be detected by using a designed lightweight feature extraction backbone network, and outputting extracted multi-scale flame features with three different resolutions;
2) Constructing a BiFPN-based feature fusion network to perform feature fusion processing on the multi-scale flame features obtained in the step 1), and outputting fusion features fused with three different resolution layers;
3) The classification layer of the flame detection network performs classification prediction on the fusion characteristics obtained in the step 2), and judges the existence of flame and the position of the flame in the image;
in the step 1):
1.1 Using a standard convolution module to carry out feature pre-extraction on the flame features in the input flame picture, and establishing an initialization feature tensor for network learning;
1.2 Using four sets of depth-separable convolution modules to perform depth feature extraction on the initialized feature tensor and reduce the size of the flame feature map, wherein the four depth-separable convolution modules are divided into two depth-separable convolution modules with depth convolution step size of 1 and feature map size unchanged and two depth-separable convolution modules with depth convolution step size of 2 and feature map size halved;
1.3 ) Introducing a transformer module into the depth separable convolution module to construct a lightweight transformer module based on the long-short distance attention mechanism; the transformer module is used for extracting the global features of the flame image without being influenced by the shooting distance;
1.4 ) Three groups of sequentially connected lightweight transformer modules with the long-short distance attention mechanism are embedded into a flame feature extraction backbone network, and local and global features of a flame image are extracted;
1.5 ) Each group of lightweight transformer modules with the long-short distance attention mechanism halves the size of the input flame feature map and outputs the flame feature map of that resolution layer to the feature fusion module;
processing the flame characteristic diagram of the current picture through the step 1.4) and the step 1.5) to obtain three flame characteristic diagrams with different resolutions, and sending the three flame characteristic diagrams with different resolutions into the step 2) for processing;
in the step 2):
2.1 In a feature fusion network, sequentially carrying out deconvolution up-sampling and pooling down-sampling on the three flame feature maps with different resolutions obtained in the step 1.4) from low to high and then from high to low, and cascading input and output to further fuse different convolution layer flame feature information;
2.2 Repeating the step 2.1) twice, and finally obtaining the characteristic diagrams of the three flames to be detected.
2. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1, wherein in step 1.3), a transformer mechanism is used to add global attention to the depth separable convolution module, so as to construct the lightweight transformer module based on the long-short distance attention mechanism, the steps comprising:
1.3.1 ) The input flame feature map f is subjected to local feature expression through a 3×3 convolution layer and a 1×1 convolution layer; a channel separation operation is then carried out on the processed features, and the flame features are divided equally along the channel dimension into two flame feature maps of the same size, f_c and f_t;
1.3.2 Parallel processing:
a. a depth separable convolution module processes f_c to extract local feature information; each pixel of the output feature map only obtains the feature information within its receptive field, i.e. local attention; the local receptive field size rf_m obtainable by each pixel is calculated by formula (1):

rf_m = rf_{m-1} + (k_m − 1) × ∏_{n=1}^{m−1} s_n    (1)

where k_m denotes the size of the convolution kernel of the m-th layer and s_n denotes the step size of the n-th layer;
b. a Transformer module processes f_t to extract global attention information, so that each pixel of the output feature map obtains information from all pixels of the input feature map, by the following steps:
first, the input flame feature f_t is divided into N non-overlapping p×p tiles, where N = HW/p², H and W being the height and width of f_t;
then, word vector embedding is performed on all the tiles to convert each tile into a p²×1 word vector, position codes are added, and all the word vectors are spliced into a word vector sequence;
then, each word vector is converted through three weight matrices W_Q, W_K, W_V into the query vector q, key vector k and value vector v required for computing attention; the attention a is calculated as:

a = softmax(q·kᵀ / √d_k)·v

where d_k is the dimension of the key vector;
finally, the output attention sequence is inversely coded to obtain a global attention map with the same size as the original input feature;
1.3.3 ) The output feature maps of branches a and b are cascaded, and channel restoration is performed by a 1×1 convolution layer and a 3×3 convolution layer.
3. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1, wherein in step 2), the feature fusion network is a three-layer series structure instead of the five-layer structure of the original BiFPN network;
in step 2.1), the three flame feature maps output in step 1), denoted from high resolution to low as p_0, p_1 and p_2, are fused as follows: first, p_2 is up-sampled to the size of p_1 and cascaded with it to give p_1'; then p_1' is up-sampled to the size of p_0 and cascaded with it to give p_0'; then p_0' is down-sampled to the size of p_1 and cascaded with p_1' to give a new p_1'; the new p_1' is down-sampled to the size of p_2 and cascaded with p_2 to give p_2''; p_2'' is up-sampled once and cascaded with the new p_1' and the original input p_1 to give p_1''; and p_1'' is up-sampled once and cascaded with p_0' to give p_0''; the final outputs p_0'', p_1'' and p_2'' serve as the input of the next BiFPN layer.
4. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1, wherein in step 3), the flames are classified and located on the three resolution layers by a prediction network, the scores of the output results are sorted, and the final prediction result is finally obtained through a non-maximum suppression layer.
5. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1 or 4, wherein in step 3), based on the result of network learning, suspected flame areas are activated, and the prediction network performs flame position information regression and confidence prediction on the three flame feature maps output by the network, the steps comprising:
3.1 The prediction head of each feature map comprises a depth convolution and a convolution regression, and finally five prediction values are output corresponding to each suspected flame area and respectively represent four position coordinates and a confidence coefficient of the predicted flame;
3.2 Score sorting is carried out on the results predicted by all the prediction heads, a frame with the score of each suspected flame judgment area being larger than a set threshold value is taken out, and the flame is determined to exist at the position;
3.3 The frames with high overlapping degree are identified as the same flame area, non-maximum value suppression is carried out, and finally the flame area with the highest confidence degree is obtained as a final prediction result.
6. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 2, wherein in step 1.3.1), f_c and f_t are respectively input to the depth separable convolution module and the Transformer module operating in parallel, and the output features are finally spliced back into a feature f′ of the original size:
f′ = conv(concat(CNN(f_c), transformer(f_t)))
wherein concat(·) is the feature cascade operation;
the flame feature map thus captures the global attention that the depth separable convolution layer lacks.
CN202210852895.6A 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network Pending CN115171047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210852895.6A CN115171047A (en) 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210852895.6A CN115171047A (en) 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network

Publications (1)

Publication Number Publication Date
CN115171047A true CN115171047A (en) 2022-10-11

Family

ID=83494185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210852895.6A Pending CN115171047A (en) 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network

Country Status (1)

Country Link
CN (1) CN115171047A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453053A (en) * 2023-04-18 2023-07-18 南京恩博科技有限公司 Fire detection method, fire detection device, computer equipment and storage medium
CN116453053B (en) * 2023-04-18 2023-09-26 南京恩博科技有限公司 Fire detection method, fire detection device, computer equipment and storage medium
CN116895050A (en) * 2023-09-11 2023-10-17 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN116895050B (en) * 2023-09-11 2023-12-08 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN117351354A (en) * 2023-10-18 2024-01-05 耕宇牧星(北京)空间科技有限公司 Lightweight remote sensing image target detection method based on improved MobileViT
CN117351354B (en) * 2023-10-18 2024-04-16 耕宇牧星(北京)空间科技有限公司 Lightweight remote sensing image target detection method based on improved MobileViT
CN117218606A (en) * 2023-11-09 2023-12-12 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment
CN117218606B (en) * 2023-11-09 2024-02-02 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination