CN115171047A - Fire image detection method based on lightweight long-short distance attention transformer network - Google Patents

Fire image detection method based on lightweight long-short distance attention transformer network

Info

Publication number
CN115171047A
CN115171047A (application CN202210852895.6A)
Authority
CN
China
Prior art keywords
flame
network
feature
attention
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210852895.6A
Other languages
Chinese (zh)
Inventor
赵亚琴
赵文轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN202210852895.6A priority Critical patent/CN115171047A/en
Publication of CN115171047A publication Critical patent/CN115171047A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Classification, e.g. of video objects
    • G06V 10/77: Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Recognition or understanding using neural networks
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

A fire image detection method based on a lightweight long-short distance attention transformer network first collects flame pictures and then detects them with a flame detection network, comprising the following steps: 1) Processing the flame picture to be detected with a designed lightweight feature extraction backbone network and outputting extracted multi-scale flame features at three different resolutions; 2) Constructing a BiFPN-based feature fusion network to perform feature fusion on the multi-scale flame features and outputting fusion features combining the three resolution layers; 3) The classification layer of the network performs classification prediction on the fusion features and judges whether flame exists and where it is located in the image. In the invention, the lightweight backbone network structure improves detection speed; the transformer with the long-short distance attention mechanism improves detection precision; and the BiFPN feature fusion mechanism improves the ability to detect small target flames in fire images, raising recognition accuracy for early fires and for fire images shot from a long distance.

Description

Fire image detection method based on lightweight long-short distance attention transformer network
Technical Field
The technical scheme belongs to the field of image processing, and particularly relates to a fire detection method that uses image processing and a flame detection neural network.
Background
Among various disasters, fire is one of the major threats to public safety and social development. In environments with many combustibles, such as residential buildings, industrial warehouses and forests, an early fire that is not discovered in time usually causes significant losses. In recent years, fire detection has been a focus of research, and reliable fire detection systems are of great significance for protecting people's lives and property and for protecting ecosystems.
Early sensors based on temperature and smoke respond slowly and are strongly affected by environmental variables. Traditional fire detection methods based on computer vision rely on features such as color and texture defined manually by researchers; such hand-crafted features usually suit only specific scenes and videos, generalize poorly, and give low flame detection accuracy. In recent years, convolutional neural networks (CNNs) have been widely used in the field of flame detection. However, existing flame detection networks directly reuse general object detection networks: they detect small target flames with low accuracy, and suffer from high model complexity, low detection speed and difficult training, which makes them hard to apply to fire scenes in real time.
Disclosure of Invention
The invention provides a fire image detection method based on a lightweight long-short distance attention transformer network.
The network provided by the invention combines a lightweight, efficient convolution-plus-transformer module structure with multi-scale feature learning, a long-short distance attention mechanism and a bidirectional feature fusion strategy, forming a new fire detection framework.
The backbone network provided by the invention extracts multi-scale flame features; the introduced transformer adds global attention to the network and improves its ability to distinguish flames from the background; and the feature fusion network better predicts flames of different sizes, especially small early-stage flames.
the method comprises the following specific steps:
step 1, extracting a flame picture to be detected by using a designed lightweight feature extraction backbone network, and outputting extracted multi-scale flame features with three different resolutions;
step 1.1, using a standard convolution module to carry out feature pre-extraction on flame features in an input flame picture, and establishing an initialized feature tensor for network learning;
step 1.2, performing depth feature extraction on the initialized feature tensor by using four groups of depth separable convolution modules, and reducing the size of a flame feature map;
step 1.3, introducing a transformer module into a depth separable convolution module, and constructing a lightweight transformer module based on a long-short distance attention mechanism, wherein the transformer module not only extracts the global features of the flame image but also ensures that feature extraction is not affected by the shooting distance;
step 1.4, embedding three groups of sequentially connected lightweight transformer modules with the long-short distance attention mechanism into the flame feature extraction backbone network, and extracting local and global features of the flame image;
step 1.5, each group of lightweight transformer modules with the long-short distance attention mechanism halves the size of the input flame feature map and outputs the flame feature map of that resolution layer to the feature fusion module;
the flame feature maps of the current picture are processed through step 1.4 and step 1.5 to obtain three flame feature maps with different resolutions, which are sent to step 2 for processing;
step 2, constructing a BiFPN-based feature fusion network to perform feature fusion processing on the received multi-scale flame features, and outputting fusion features of three different resolution layers;
step 2.1, in the feature fusion network, sequentially carrying out deconvolution up-sampling and pooling down-sampling on the three flame feature maps with different resolutions obtained in step 1, first from low resolution to high and then from high to low, and cascading inputs and outputs to further fuse the flame feature information of the different convolution layers;
step 2.2, repeating step 2.1 twice to finally obtain the three feature maps of the flame to be detected;
step 3, classifying and locating the flames on the three resolution layers with a prediction network, sorting the scores of the output results, and finally obtaining the final prediction result through a non-maximum suppression layer;
the invention has the beneficial effects that:
(1) The invention provides a lightweight transformer neural network based on a long-short distance attention mechanism and applied to fire flame detection. The main framework of the network draws on Lite Transformer and MobileViT and optimizes the way the convolution layers and the transformer layers are fused. Specifically, before computing the self-attention module, the invention splits the feature map along the channel dimension into a convolution branch and a transformer branch, which are then merged together. This design allows the model to avoid the high complexity and difficult fitting caused by using only a transformer on a fire data set;
(2) The invention introduces a long-short distance attention block in the backbone network to make up for the performance loss that reducing the number of parameters may cause. Because of its limited receptive field, a CNN module can only obtain short-distance attention, which is unfavorable for extracting fire characteristics, whereas the Transformer can attend to distant objects regardless of distance. By introducing long-short distance attention blocks into the backbone network, the local and global characteristics of the fire video captured by the network are not affected by the shooting distance, which improves the accuracy of fire detection;
(3) To improve the utilization of multi-scale fire characteristics, the invention uses BiFPN to fuse the features of different resolution layers and uniformly scales the resolution, depth and width of all backbones, which helps the network handle large-scale and small-scale fires simultaneously and distinguish fire objects from the background;
(4) In extracting the long-short distance attention, the convolution layer is not nested inside the self-attention module of the Transformer block; instead, the convolution branch and the Transformer branch run in parallel to extract local and global information respectively. Specifically, before computing the self-attention module, the feature map is split along the channel dimension into a convolution branch and a transformer branch, which are then merged together. This design allows the model to avoid the higher complexity and difficult fitting that result from using only a transformer on a fire data set.
Drawings
FIG. 1 is a flow chart of a method of flame detection according to the present invention;
FIG. 2 is a diagram of the flame detection network of the present invention;
FIG. 3 is a block diagram of a depth separable convolution module of the present invention;
FIG. 4 (a) is a diagram showing a conventional structure of a Transformer;
FIG. 4 (b) is a diagram of the lightweight transformer module with the long-short distance attention mechanism according to the present invention;
FIG. 5 is a diagram of the transformer module according to the present invention;
fig. 6 is a picture of the detection result of the fire detection method of the present invention for a picture of a fire that is difficult to recognize.
Detailed Description
Aiming at the low detection precision, low speed and high computation cost caused by directly reusing existing target detection networks in machine-vision fire detection methods, the invention provides a lightweight transformer neural network based on a long-short distance attention mechanism for real-time fire detection. The method adopts a lightweight backbone network to extract the multi-scale characteristics of the flame at low computation cost; a transformer mechanism is introduced into the convolution layers, and a long-short distance attention block is constructed to extract distance-independent global attention features and help the network distinguish flames from the background; the feature fusion module processes the multi-scale features extracted by the backbone and improves the detection of fires of different scales, especially small early-stage fires.
Compared with existing deep-learning flame detection networks, the invention designs a lightweight backbone network, which improves detection speed; adopts a transformer mechanism with long-short distance attention, which improves detection precision; and uses the feature fusion mechanism BiFPN, which improves the ability to detect small early-stage flames.
The present disclosure will be further described with reference to the accompanying drawings and embodiments.
The method adopts a lightweight transformer flame detection network with long-short distance attention to detect the frame images of the surveillance video of a fire scene.
As shown in FIG. 1, the flame detection network extracts multi-scale flame features based on the long-short distance attention mechanism, fuses the multi-scale flame features, and predicts the flame position. The architecture of the overall network of the present invention is shown in FIG. 2. Specifically, the method comprises the following steps:
step 1, extracting a flame picture to be detected by using a designed lightweight feature extraction backbone network, and outputting extracted multi-scale flame features with three different resolutions;
step 1.1, a standard convolution module is used to pre-extract features from the input flame picture and establish an initialized feature tensor for network learning. The standard convolution module comprises a convolution layer, a batch normalization layer and a SiLU activation layer;
step 1.2, four groups of depth separable convolution modules perform depth feature extraction on the initialized feature tensor and reduce the size of the flame feature map. These depth separable convolution modules share the structure shown in FIG. 3 and differ only in the convolution step size of the convolution layer. As shown in FIG. 3, each depth separable convolution module comprises a point-by-point convolution layer, a depth convolution layer and a channel restoration layer. The point-by-point convolution layer consists of a 1×1 convolution layer, a batch normalization layer and a SiLU layer; the depth convolution layer consists of a 3×3 grouped convolution layer, a batch normalization layer and a SiLU layer; and the channel restoration layer consists of a 1×1 convolution layer and a batch normalization layer. The four depth separable convolution modules comprise two modules with a depth convolution step size of 1 that keep the feature map size unchanged and two modules with a depth convolution step size of 2 that halve the feature map size;
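The sketch below shows one way the depth separable convolution module described above could be written in PyTorch; the expansion ratio of the point-by-point layer and the class name are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class DepthSeparableConv(nn.Module):
    """Point-by-point 1x1 conv -> depth-wise 3x3 grouped conv -> 1x1 channel restoration."""
    def __init__(self, in_ch, out_ch, stride=1, expand=2):
        super().__init__()
        mid_ch = in_ch * expand  # expansion ratio is an assumption
        # point-by-point convolution layer: 1x1 conv + batch norm + SiLU
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU())
        # depth convolution layer: 3x3 grouped conv + batch norm + SiLU
        # (stride 2 halves the feature map, stride 1 keeps its size)
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU())
        # channel restoration layer: 1x1 conv + batch norm
        self.restore = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.restore(self.depthwise(self.pointwise(x)))

# e.g. the stride-2 module of Table 1 that maps a 128x128x32 feature map to 64x64x64
block = DepthSeparableConv(32, 64, stride=2)
print(block(torch.randn(1, 32, 128, 128)).shape)  # torch.Size([1, 64, 64, 64])
```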
step 1.3, a transformer module is introduced into the depth separable convolution module to construct a lightweight transformer module based on the long-short distance attention mechanism, as shown in FIG. 4 (b). The transformer module extracts the global features of the flame image, and feature extraction is not affected by the shooting distance;
(1) The input flame feature map f first undergoes local feature expression through a 3×3 convolution layer and a 1×1 convolution layer; a channel separation operation is then performed on the processed features, and the flame features are divided equally along the channel dimension into two parts f_c and f_t;
(2) A depth separable convolution module processes f_c to extract local feature information; each pixel of the output feature map only obtains the feature information within its receptive field, i.e. local attention. The local receptive field size rf_m obtainable by each pixel is calculated by equation (1):

rf_m = rf_{m-1} + (k_m − 1) × ∏_{n=1}^{m−1} s_n    (1)

where k_m denotes the size of the convolution kernel of the m-th layer and s_n denotes the step size of the n-th layer;
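As a quick check of equation (1), the small sketch below computes the receptive field layer by layer, assuming the usual convention that the receptive field of the raw input is rf_0 = 1.

```python
def receptive_field(kernels, strides):
    """kernels[m-1] and strides[m-1] are k_m and s_m of the m-th convolution layer."""
    rf, jump = 1, 1           # jump tracks prod_{n=1}^{m-1} s_n
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump  # equation (1)
        jump *= s
    return rf

# two stacked 3x3 convolutions with strides 2 and 1 give a 7-pixel local receptive field
print(receptive_field([3, 3], [2, 1]))  # 7
```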
(3) A transformer module processes f_t to extract global attention information; as shown in FIG. 5, each pixel of the output feature map obtains information from all pixels of the input feature map. The steps are as follows:
1) The input flame feature f_t is first divided into N non-overlapping p×p tiles, where N = HW/p², H and W being the height and width of f_t;
2) Word vector embedding is performed on all the tiles to convert each tile into a word vector of size p²×1, position codes are added, and all the word vectors are spliced into a word vector sequence;
The position coding used in the transformer module adopts sine and cosine coding: odd-numbered word embedding vector elements use cos codes and even-numbered elements use sin codes. The coding is calculated by formula (2):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (2)

where pos is the position of the word vector, d_model is the length of the word vector, and i indexes the elements of the word vector;
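A minimal sketch of formula (2) is given below; the NumPy implementation and the matrix layout are illustrative choices.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sine/cosine position codes: sin for even vector elements, cos for odd ones."""
    pe = np.zeros((num_positions, d_model))
    pos = np.arange(num_positions)[:, None]          # position of each word vector
    i = np.arange(0, d_model, 2)[None, :]            # even element indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                      # even elements
    pe[:, 1::2] = np.cos(angle)                      # odd elements
    return pe

pe = positional_encoding(num_positions=256, d_model=96)  # one code per p x p tile
```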
since the attention relationship captured by the transducer is independent of the relative position relationship of the blocks (word vectors before embedding) of each original flame picture, i.e. the order of the word vectors, and the information of the visual image is closely related to the position relationship of each block, encoding the original positions of all the word vectors of the transducer can help the transducer to consider the order of the vectors when learning global attention.
3) Each word vector is converted through three weight matrices W_Q, W_K, W_V into the query vector q, key vector k and value vector v required for computing attention; the attention a is calculated by equation (3):

a = softmax(q·kᵀ / √d_k)·v    (3)

where d_k is the dimension of the key vector;
4) The output attention sequence is inversely coded to obtain a global attention map with the same size as the original input feature;
(4) The output feature maps of the two parallel branches are cascaded, and channel restoration is performed by one 1×1 convolution layer and one 3×3 convolution layer. The module is expressed by formula (4):

f′ = conv(concat(CNN(f_c), transformer(f_t)))    (4)
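The following simplified PyTorch sketch puts steps (1)–(4) together: the input is split along the channel dimension, f_c passes through a convolution branch (short-distance attention) and f_t through a self-attention branch over p×p tiles (long-distance attention), and the two outputs are cascaded and restored as in formula (4). The patch size, the single-head nn.MultiheadAttention and the omission of a feed-forward sub-layer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LongShortDistanceAttention(nn.Module):
    def __init__(self, channels, patch=2):
        super().__init__()
        half = channels // 2
        self.patch = patch
        # convolution branch: local feature information within the receptive field
        self.conv_branch = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half))
        # transformer branch: global attention over the word vectors of the p x p tiles
        self.attn = nn.MultiheadAttention(half * patch * patch, num_heads=1, batch_first=True)
        # channel restoration after cascading the two branches (formula (4))
        self.restore = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False))

    def forward(self, f):
        f_c, f_t = torch.chunk(f, 2, dim=1)                 # channel separation
        local = self.conv_branch(f_c)                       # short-distance attention
        b, c, h, w = f_t.shape
        p = self.patch
        # split f_t into non-overlapping p x p tiles and flatten each tile into a word vector
        t = f_t.unfold(2, p, p).unfold(3, p, p)             # (b, c, h/p, w/p, p, p)
        t = t.reshape(b, c, (h // p) * (w // p), p * p).permute(0, 2, 1, 3)
        t = t.reshape(b, -1, c * p * p)                     # (b, N, c*p*p)
        g, _ = self.attn(t, t, t)                           # q = k = v: global attention
        # inverse coding: fold the attention output back to the feature-map shape
        g = g.reshape(b, h // p, w // p, c, p, p).permute(0, 3, 1, 4, 2, 5)
        g = g.reshape(b, c, h, w)
        return self.restore(torch.cat([local, g], dim=1))   # cascade + channel restoration

x = torch.randn(1, 96, 32, 32)                  # size of attention transformer layer 1 in Table 1
print(LongShortDistanceAttention(96)(x).shape)  # torch.Size([1, 96, 32, 32])
```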
the overall backbone network structure parameters of this embodiment are shown in table 1:
TABLE 1 backbone network architecture parameters
Input size | Network layer | Step size | Output size
256×256×3 | Standard convolution module | 2 | 128×128×16
128×128×16 | Depth separable convolution module 1 | 1 | 128×128×32
128×128×32 | Depth separable convolution module 2 | 2 | 64×64×64
64×64×64 | Depth separable convolution module 3 | 1 | 64×64×64
64×64×64 | Depth separable convolution module 4 | 1 | 64×64×96
64×64×96 | Depth separable convolution module 4 | 2 | 32×32×96
32×32×96 | Attention Transformer layer 1 | 1 | 32×32×96
32×32×96 | Depth separable convolution module 4 | 2 | 16×16×128
16×16×128 | Attention Transformer layer 2 | 1 | 16×16×128
16×16×128 | Depth separable convolution module 4 | 2 | 8×8×640
8×8×640 | Attention Transformer layer 3 | 1 | 8×8×640
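For illustration only, the sketch below stacks the stages of Table 1 in PyTorch, reusing the DepthSeparableConv and LongShortDistanceAttention sketches given earlier in this description; the hyper-parameters inside those modules remain assumptions, and the three returned maps are the multi-scale flame features passed to the feature fusion network.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Stage sequence following Table 1; outputs the three multi-scale flame feature maps."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                            # standard convolution module
            nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16), nn.SiLU())
        self.stage1 = nn.Sequential(
            DepthSeparableConv(16, 32, stride=1),             # module 1
            DepthSeparableConv(32, 64, stride=2),             # module 2
            DepthSeparableConv(64, 64, stride=1),             # module 3
            DepthSeparableConv(64, 96, stride=1),             # module 4, stride 1
            DepthSeparableConv(96, 96, stride=2))             # module 4, stride 2
        self.attn1 = LongShortDistanceAttention(96)           # attention transformer layer 1
        self.down2 = DepthSeparableConv(96, 128, stride=2)
        self.attn2 = LongShortDistanceAttention(128)          # attention transformer layer 2
        self.down3 = DepthSeparableConv(128, 640, stride=2)
        self.attn3 = LongShortDistanceAttention(640)          # attention transformer layer 3

    def forward(self, x):
        p0 = self.attn1(self.stage1(self.stem(x)))            # 32 x 32 x 96
        p1 = self.attn2(self.down2(p0))                       # 16 x 16 x 128
        p2 = self.attn3(self.down3(p1))                       # 8 x 8 x 640
        return p0, p1, p2

p0, p1, p2 = Backbone()(torch.randn(1, 3, 256, 256))
```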
Step 1.3, introducing a transfromer module into a depth separable convolution module, and constructing a lightweight transformer module based on a long-short distance attention mechanism, wherein the introduction of the transformer module is not only used for extracting the global features of the flame image, but also ensures that the feature extraction is not influenced by the shooting distance;
step 1.4, three groups of sequentially connected lightweight transformer modules with the long-short distance attention mechanism are embedded into the flame feature extraction backbone network, and local and global features of the flame image are extracted;
step 1.5, each group of lightweight transformer modules with the long-short distance attention mechanism halves the size of the input flame feature map and outputs the flame feature map of that resolution layer to the feature fusion module;
the flame feature map of the current picture is processed through step 1.4 and step 1.5 to obtain three flame feature maps with different resolutions, which are sent to step 2 for processing;
step 2, constructing a BiFPN-based feature fusion network to perform feature fusion processing on the received multi-scale flame features, and outputting fusion features of three different resolution layers;
step 2.1, in the feature fusion network, the three flame feature maps of different resolutions obtained in step 1 are sequentially subjected to deconvolution up-sampling and pooling down-sampling, first from low resolution to high and then from high to low, and inputs and outputs are cascaded to fuse the flame feature information of the different convolution layers.
Denote the three flame feature maps output by the backbone network, from high resolution to low, as p_0, p_1 and p_2. First, p_2 is up-sampled to the size of p_1 and cascaded with it to give p_1'; then p_1' is up-sampled to the size of p_0 and cascaded with it to give p_0'; then p_0' is down-sampled to the size of p_1 and cascaded with p_1' to give a new p_1'; the new p_1' is down-sampled to the size of p_2 and cascaded with p_2 to give p_2''; p_2'' is up-sampled once and cascaded with the new p_1' and the original input p_1 to give p_1''; and p_1'' is up-sampled once and cascaded with p_0' to give p_0''. The final outputs p_0'', p_1'' and p_2'' serve as the input of the next BiFPN layer.
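A minimal sketch of one such fusion layer is given below, using nearest-neighbour interpolation for up-sampling and adaptive max pooling for down-sampling in place of the deconvolution and pooling layers of the patent; the per-level convolutions that would normally realign channel counts after each cascade are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def bifpn_layer(p0, p1, p2):
    """p0, p1, p2: flame feature maps from high resolution to low."""
    up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode="nearest")
    down = lambda x, ref: F.adaptive_max_pool2d(x, ref.shape[-2:])
    cat = lambda *xs: torch.cat(xs, dim=1)

    p1_td = cat(up(p2, p1), p1)               # p2 up-sampled to p1 size, cascaded -> p1'
    p0_td = cat(up(p1_td, p0), p0)            # p1' up-sampled to p0 size, cascaded -> p0'
    p1_td = cat(down(p0_td, p1), p1_td)       # p0' down-sampled, cascaded with p1' -> new p1'
    p2_out = cat(down(p1_td, p2), p2)         # new p1' down-sampled, cascaded with p2
    p1_out = cat(up(p2_out, p1), p1_td, p1)   # cascaded with the new p1' and the original p1
    p0_out = cat(up(p1_out, p0), p0_td)       # cascaded with p0'
    return p0_out, p1_out, p2_out             # input of the next BiFPN layer
```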
Step 2.2, repeating the step 2.1 twice to finally obtain three characteristic diagrams of the flame to be detected;
step 3, the flames are classified and located on the three resolution layers by a prediction network, the scores of the output results are sorted, and the final prediction result is obtained through a non-maximum suppression layer. According to the result of network learning, suspected flame areas are activated, and the prediction network processes the three output flame feature maps of the network
(p_0'', p_1'' and p_2''), carrying out flame position information regression and confidence prediction;
step 3.1, the prediction head of each feature map comprises a depth convolution and a convolution regression, and finally five prediction values are output corresponding to each suspected flame area and respectively represent four position coordinates and a confidence coefficient of the predicted flame;
step 3.2, scoring and sorting the results predicted by all the prediction heads, taking out a frame of which the score of each suspected flame judgment area is greater than a set threshold value, and determining that flames exist at the frame;
step 3.3, boxes with a high degree of overlap are identified as the same flame, non-maximum suppression is applied, and the flame area with the highest confidence is finally obtained as the final prediction result.
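A hedged sketch of this post-processing is shown below; it assumes each prediction head has already produced rows of four box coordinates plus a confidence for every suspected flame area, and it uses torchvision's non-maximum suppression. The thresholds are illustrative.

```python
import torch
from torchvision.ops import nms

def decode_predictions(preds, score_thresh=0.5, iou_thresh=0.45):
    """preds: tensor of shape (N, 5) holding (x1, y1, x2, y2, confidence) per suspected area."""
    boxes, scores = preds[:, :4], preds[:, 4]
    keep = scores > score_thresh                     # step 3.2: keep areas above the threshold
    boxes, scores = boxes[keep], scores[keep]
    order = torch.argsort(scores, descending=True)   # score sorting
    boxes, scores = boxes[order], scores[order]
    kept = nms(boxes, scores, iou_thresh)            # step 3.3: merge highly overlapping boxes
    return boxes[kept], scores[kept]                 # highest-confidence flame areas
```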
To demonstrate the advantages of the proposed method, the proposed flame detection method is compared with common flame detection networks on a constructed video data set. The comparison covers prediction accuracy, recall, model complexity and running speed. The experimental results are shown in the following table:
TABLE 2 Comparison of experimental results (the table is reproduced as an image in the original publication)
Note: accuracy is computed as the average precision at an intersection-over-union (IoU) threshold of 0.5; recall is computed as the average of the 10 recall values obtained at IoU thresholds from 0.5 to 0.95 in steps of 0.05.
As can be seen from the comparison of the experimental results in Table 2, the flame detection method provided by the invention has clear advantages in flame detection accuracy and false alarm rate. Although the SSDLite model is simpler, its accuracy is considerably lower; the proposed model has lower complexity than the other flame detection methods and therefore runs faster.
Fig. 6 shows the detection result of the fire detection method proposed by the present invention for a fire picture that is difficult to identify. In fig. 6, the first column shows the result of a plurality of flame regions in one picture, the second column shows the result of a small flame region in a picture of a fire, the third column shows the result of a picture with no obvious flame, and the fourth column shows the result of an interfering object such as light in the picture.
To summarize:
the invention relates to a fire detection method based on an economical and efficient long-and-short-distance attention network real-time fire detection technology. The invention introduces a Transformer which is good for extracting global information in a backbone, and combines the Transformer with CNN to provide a long and short distance attention block (LSB). Compared with a common lightweight backbone network, the backbone network of the present invention achieves the best feature extraction performance on image datasets. Fire detection networks based on this backbone network exhibit significant performance on test data sets. Meanwhile, the invention also introduces a characteristic fusion module BiFPN into the network, thereby improving the detection precision of different-scale fires, especially small-scale fires.
Experiments show that the detection method has obvious advantages in the aspects of accuracy and speed of detecting fires of various scales.

Claims (6)

1. A fire image detection method based on a lightweight long-short distance attention transformer network, comprising the following steps: firstly, collecting a flame picture; then, detecting with a flame detection network; characterized in that detecting the flame picture with the flame detection network comprises the following steps:
1) Processing and inputting a flame picture to be detected by using a designed lightweight feature extraction backbone network, and outputting extracted multi-scale flame features with three different resolutions;
2) Constructing a BiFPN-based feature fusion network to perform feature fusion processing on the multi-scale flame features obtained in the step 1), and outputting fusion features fused with three different resolution layers;
3) The classification layer of the flame detection network performs classification prediction on the fusion characteristics obtained in the step 2), and judges the existence of flame and the position of the flame in the image;
in the step 1):
1.1 Using a standard convolution module to carry out feature pre-extraction on the flame features in the input flame picture, and establishing an initialization feature tensor for network learning;
1.2 Using four sets of depth-separable convolution modules to perform depth feature extraction on the initialized feature tensor and reduce the size of the flame feature map, wherein the four depth-separable convolution modules are divided into two depth-separable convolution modules with depth convolution step size of 1 and feature map size unchanged and two depth-separable convolution modules with depth convolution step size of 2 and feature map size halved;
1.3 ) Introducing a transformer module into the depth separable convolution module to construct a lightweight transformer module based on the long-short distance attention mechanism; the transformer module is used for extracting the global features of the flame image without being influenced by the shooting distance;
1.4 ) Three groups of sequentially connected lightweight transformer modules with the long-short distance attention mechanism are embedded into a flame feature extraction backbone network, and local and global features of a flame image are extracted;
1.5 ) Each group of lightweight transformer modules with the long-short distance attention mechanism halves the size of the input flame feature map and outputs the flame feature map of that resolution layer to the feature fusion module;
processing the flame characteristic diagram of the current picture through the step 1.4) and the step 1.5) to obtain three flame characteristic diagrams with different resolutions, and sending the three flame characteristic diagrams with different resolutions into the step 2) for processing;
in the step 2):
2.1 In a feature fusion network, sequentially carrying out deconvolution up-sampling and pooling down-sampling on the three flame feature maps with different resolutions obtained in the step 1.4) from low to high and then from high to low, and cascading input and output to further fuse different convolution layer flame feature information;
2.2 Repeating the step 2.1) twice, and finally obtaining the characteristic diagrams of the three flames to be detected.
2. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1, wherein in step 1.3), a transformer mechanism is used to add global attention to the depth separable convolution module, so as to construct the lightweight transformer module based on the long-short distance attention mechanism, the steps comprising:
1.3.1 ) The input flame feature map f is subjected to local feature expression through a 3×3 convolution layer and a 1×1 convolution layer; a channel separation operation is then carried out on the processed features, and the flame features are divided equally along the channel dimension into two flame feature maps of the same size, f_c and f_t;
1.3.2 Parallel processing:
a. a depth separable convolution module processes f_c to extract local feature information; each pixel of the output feature map only obtains the feature information within its receptive field, i.e. local attention; the local receptive field size rf_m obtainable by each pixel is calculated by formula (1):

rf_m = rf_{m-1} + (k_m − 1) × ∏_{n=1}^{m−1} s_n    (1)

where k_m denotes the size of the convolution kernel of the m-th layer and s_n denotes the step size of the n-th layer;
b. a Transformer module processes f_t to extract global attention information, so that each pixel of the output feature map obtains information from all pixels of the input feature map, by the following steps:
first, the input flame feature f_t is divided into N non-overlapping p×p tiles, where N = HW/p², H and W being the height and width of f_t;
then, word vector embedding is performed on all the tiles to convert each tile into a p²×1 word vector, position codes are added, and all the word vectors are spliced into a word vector sequence;
then, each word vector is converted through three weight matrices W_Q, W_K, W_V into the query vector q, key vector k and value vector v required for computing attention; the attention a is calculated as:

a = softmax(q·kᵀ / √d_k)·v

where d_k is the dimension of the key vector;
finally, the output attention sequence is inversely coded to obtain a global attention map with the same size as the original input feature;
1.3.3 ) The output feature maps of branches a and b are cascaded, and channel restoration is performed by a 1×1 convolution layer and a 3×3 convolution layer.
3. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1, wherein in step 2), the feature fusion network is a three-layer series structure instead of the five-layer structure of the original BiFPN network;
in step 2.1), the three flame feature maps output in step 1), denoted from high resolution to low as p_0, p_1 and p_2, are fused as follows: first, p_2 is up-sampled to the size of p_1 and cascaded with it to give p_1'; then p_1' is up-sampled to the size of p_0 and cascaded with it to give p_0'; then p_0' is down-sampled to the size of p_1 and cascaded with p_1' to give a new p_1'; the new p_1' is down-sampled to the size of p_2 and cascaded with p_2 to give p_2''; p_2'' is up-sampled once and cascaded with the new p_1' and the original input p_1 to give p_1''; and p_1'' is up-sampled once and cascaded with p_0' to give p_0''; the final outputs p_0'', p_1'' and p_2'' serve as the input of the next BiFPN layer.
4. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1, wherein in step 3), the flames are classified and located on the three resolution layers by a prediction network, the scores of the output results are sorted, and the final prediction result is finally obtained through a non-maximum suppression layer.
5. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 1 or 4, wherein in step 3), based on the result of network learning, suspected flame areas are activated, and the prediction network performs flame position information regression and confidence prediction on the three flame feature maps output by the network, the steps comprising:
3.1 The prediction head of each feature map comprises a depth convolution and a convolution regression, and finally five prediction values are output corresponding to each suspected flame area and respectively represent four position coordinates and a confidence coefficient of the predicted flame;
3.2 Score sorting is carried out on the results predicted by all the prediction heads, a frame with the score of each suspected flame judgment area being larger than a set threshold value is taken out, and the flame is determined to exist at the position;
3.3 The frames with high overlapping degree are identified as the same flame area, non-maximum value suppression is carried out, and finally the flame area with the highest confidence degree is obtained as a final prediction result.
6. The fire image detection method based on the lightweight long-short distance attention transformer network as claimed in claim 2, wherein in step 1.3.1), f_c and f_t are respectively input to the depth separable convolution module and the Transformer module operating in parallel, and the output features are finally spliced back into a feature f′ of the original size:
f′ = conv(concat(CNN(f_c), transformer(f_t)))
wherein concat(·) is the feature cascade operation;
the flame feature map thus captures the global attention that the depth separable convolution layer lacks.
CN202210852895.6A 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network Pending CN115171047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210852895.6A CN115171047A (en) 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210852895.6A CN115171047A (en) 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network

Publications (1)

Publication Number Publication Date
CN115171047A true CN115171047A (en) 2022-10-11

Family

ID=83494185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210852895.6A Pending CN115171047A (en) 2022-07-20 2022-07-20 Fire image detection method based on lightweight long-short distance attention transformer network

Country Status (1)

Country Link
CN (1) CN115171047A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453053A (en) * 2023-04-18 2023-07-18 南京恩博科技有限公司 Fire detection method, fire detection device, computer equipment and storage medium
CN116453053B (en) * 2023-04-18 2023-09-26 南京恩博科技有限公司 Fire detection method, fire detection device, computer equipment and storage medium
CN116895050A (en) * 2023-09-11 2023-10-17 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN116895050B (en) * 2023-09-11 2023-12-08 四川高速公路建设开发集团有限公司 Tunnel fire disaster identification method and device
CN117351354A (en) * 2023-10-18 2024-01-05 耕宇牧星(北京)空间科技有限公司 Lightweight remote sensing image target detection method based on improved MobileViT
CN117351354B (en) * 2023-10-18 2024-04-16 耕宇牧星(北京)空间科技有限公司 Lightweight remote sensing image target detection method based on improved MobileViT
CN117218606A (en) * 2023-11-09 2023-12-12 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment
CN117218606B (en) * 2023-11-09 2024-02-02 四川泓宝润业工程技术有限公司 Escape door detection method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination