CN114399723B - Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation

Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation

Info

Publication number
CN114399723B
CN114399723B
Authority
CN
China
Prior art keywords: feature, level, representing, low, features
Prior art date
Legal status
Active
Application number
CN202111320633.7A
Other languages
Chinese (zh)
Other versions
CN114399723A (en)
Inventor
陶焕杰 (Tao Huanjie)
胡震武 (Hu Zhenwu)
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202111320633.7A
Publication of CN114399723A
Application granted
Publication of CN114399723B

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation

Abstract

The invention relates to a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which comprises the following steps: (1) inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different layers; (2) inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interaction features; (3) inputting the non-interactive features of different levels into a high-level feature interaction module to obtain high-level interaction features; (4) inputting the non-interactive features, the low-level interaction features and the high-level interaction features into a feature fusion module and classifying to obtain the forest smoke and fire recognition result. The invention can further reduce the missed alarm rate and the false alarm rate of forest smoke and fire recognition.

Description

Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation
Technical Field
The invention belongs to the technical field of pattern recognition, deep learning and image video processing, and relates to a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation.
Background
Forest fires can have devastating consequences for ecosystems and human life. The most important precursor signal of a forest fire is smoke, so vision-based smoke and fire identification methods are critical for detecting fires early and reducing fire risk. However, achieving low missed detection rates and low false alarm rates in complex real scenes remains very challenging, owing to the wide variety of smoke appearances and the continual interference of smoke-like objects.
To date, researchers have proposed many vision-based smoke recognition/detection methods that improve recognition accuracy with considerable success. These methods fall broadly into two categories: traditional methods and deep learning methods. Traditional methods generally extract handcrafted features and classify them to obtain a recognition result. Handcrafted features include color, gradient, shape, texture, motion, frequency-domain, mathematical-model and sparse-representation-based features, among others. Traditional methods cannot efficiently extract semantic information, because handcrafted features are a complex function of local geometry, structure and context. In addition, smoke exhibits large intra-class variation in color, shape, texture and so on. It is therefore difficult to reach low missed detection and low false alarm rates by designing a single discriminative handcrafted smoke feature.
Deep learning methods typically first extract deep features using variants of convolutional neural networks (CNNs) and then classify the extracted features to obtain a recognition result. Deep features can describe the semantics of smoke and thus tend to outperform handcrafted features. Existing deep models can be further divided into category-level and pixel-level supervised models, according to the supervision information used. Category-level supervised deep learning models typically use binary smoke/no-smoke labels and bounding boxes of smoke regions as training labels to guide model training, and most existing smoke identification methods adopt this kind of supervision. However, simple category supervision is local in its semantics, i.e. it relates only to the small area where the smoke target is located. Moreover, smoke is typically translucent, so the extracted features contain background information or are affected by imaging conditions such as lighting, weather and viewing angle, which degrades recognition accuracy.
To address these issues, some researchers have explored pixel-level supervised deep learning models, providing pixel-level labels to guide model training; for example, smoke recognition can be cast as a smoke segmentation task by giving a smoke/no-smoke binary label for every pixel of a sample image. However, due to the diversity of smoke and the constant interference of smoke-like objects in real complex scenes, pixel-level supervised smoke recognition methods still produce a large number of missed detections and false alarms.
Technically and in engineering terms, existing models do not really solve the core problems of forest smoke and fire identification. From a thorough analysis of existing smoke recognition models and patents, we consider that existing models still suffer from the following three problems. 1) Most forest pan-tilt cameras monitor a wide area within a range of 3-8 km, which leads to large variations in the scale, viewpoint and distortion of smoke in the surveillance video. Existing smoke identification methods typically apply data augmentation to training samples or use different receptive fields to obtain multi-scale features. However, data augmentation may not generalize to new tasks with unknown geometric transformations, and the benefit of multiple receptive fields is limited by the fixed geometry of CNN modules. 2) Related studies have demonstrated that convolutional layers at different depths attend to different levels of information: activations from shallow layers focus on low-level texture details (e.g., colors, lines, edges and shapes), while activations from deep layers focus on high-level semantic content (e.g., objects and concepts). Because features of different layers are complementary, exploiting inter-layer feature fusion and interaction is important for improving recognition capability. Existing smoke identification methods typically adopt a unidirectional low-to-high feature transmission strategy; such a strategy, however, allows the low-level features that characterize local details to be completely suppressed by high-response high-level features during feature interaction. 3) Pixel-level supervised smoke recognition models typically use image-form labels; image segmentation is essentially a pixel-level classification problem. Existing methods usually train with pixel-wise losses (e.g., cross-entropy loss, mean squared error loss), so the label of each pixel is predicted independently. These losses ignore the correlation of pixels within the label image, which may lose correlation information between pixels.
To solve the above problems, this patent proposes a forest smoke and fire identification method based on enhanced deformable convolution and label correlation, which designs an enhanced deformable convolution, realizes bidirectional feature interaction, and exploits label correlation information to achieve lower missed and false alarm rates.
Disclosure of Invention
Technical problem to be solved
To avoid the defects of the prior art, the invention provides a forest smoke and fire identification method based on enhanced deformable convolution and label correlation, which can further reduce the missed alarm rate and false alarm rate of smoke detection.
Technical solution
A forest smoke and fire recognition method based on enhanced deformable convolution and label correlation is characterized by comprising the following steps:
step 1: inputting the forest monitoring video image into a feature extraction module FEM to obtain non-interactive features of different layers:
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
where DB_j (j = 4, 5, 6) denotes the feature decoding block with index j, EB_i (i = 1, 2, 3) denotes the feature encoding block with index i, F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a convolutional neural network with a shared output feature dimension of 8192, and V_2 denotes the feature vector output by the feature extraction module; the feature extraction module consists of several feature encoding blocks and feature decoding blocks, and F_6 and V_2 are the non-interactive features learned from the feature extraction module;
the feature extraction module FEM comprises a plurality of feature encoding blocks EB and a plurality of feature decoding blocks DB which are connected in series;
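For illustration, a minimal PyTorch-style sketch of this serial encoder/decoder chain follows; the EB/DB internals, the channel widths and the 8192-dimensional CNN head are placeholder assumptions, not the patented block designs.

```python
import torch
import torch.nn as nn

class FEM(nn.Module):
    def __init__(self, in_ch=8):
        super().__init__()
        chs = [in_ch, 32, 64, 128, 64, 32, 16]       # assumed channel widths
        self.ebs = nn.ModuleList(                    # stand-ins for EB_1..EB_3
            [nn.Conv2d(chs[i], chs[i + 1], 3, padding=1) for i in range(3)])
        self.dbs = nn.ModuleList(                    # stand-ins for DB_4..DB_6
            [nn.Conv2d(chs[i], chs[i + 1], 3, padding=1) for i in range(3, 6)])
        self.cnn = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.LazyLinear(8192))  # V_2 head, dim 8192

    def forward(self, f0):
        feats, f = [], f0
        for eb in self.ebs:                 # F_i = EB_i(F_{i-1}), i = 1..3
            f = torch.relu(eb(f))
            feats.append(f)
        v2 = self.cnn(feats[-1])            # V_2 = CNN(F_3)
        for db in self.dbs:                 # F_j = DB_j(F_{j-1}), j = 4..6
            f = torch.relu(db(f))
            feats.append(f)
        return feats, v2                    # [F_1..F_6] and V_2

fem = FEM()
feats, v2 = fem(torch.rand(1, 8, 64, 64))   # 8 stacked frames as channels
print(len(feats), v2.shape)                 # 6 torch.Size([1, 8192])
```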
step 2: inputting the non-interactive feature maps F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the low-level feature interaction module to obtain low-level interaction features (a code sketch of this wiring follows this step):
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
where LIE_i (i = 1, 2) denotes the low-level interactive coding block with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoding block with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive coding or decoding block with index i, and L_6 and V_1 are the low-level interaction features learned from the low-level feature interaction module;
the low-level feature interaction module LFDM comprises a plurality of low-level interactive coding blocks LIE and a plurality of low-level interactive decoding blocks LID connected in series; a channel enhancement design using a three-dimensional convolutional neural network CNN is added after the input layer of each LIE and LID, with the convolution kernel size designed as 1×1×3, and an enhanced deformable convolution EDC is arranged in the last step of the low-level feature interaction module;
each low-level interactive coding block LIE is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
each low-level interactive decoding block LID is connected with the output of the encoding block EB of the feature extraction module FEM in the step 1;
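The deep-to-shallow wiring above can be sketched as follows; the LIE/LID internals are stand-in convolutions, the channel widths follow the FEM sketch above, and the interp_like helper (an assumption, not in the patent) merely resizes skip features so the concatenations are well-defined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as TF

def interp_like(x, ref):                    # resize a skip feature to match
    return TF.interpolate(x, size=ref.shape[-2:], mode='bilinear',
                          align_corners=False)

class LFDM(nn.Module):
    def __init__(self, chs=(32, 64, 128, 64, 32, 16)):  # channels of F_1..F_6
        super().__init__()
        ch = chs[5]
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # Downconv2d
        self.lie = nn.ModuleList(                               # LIE_1, LIE_2
            [nn.Conv2d(ch + chs[4 - i], ch, 3, padding=1) for i in range(2)])
        self.lid = nn.ModuleList(                               # LID_3..LID_5
            [nn.Conv2d(ch + chs[3 - j], ch, 3, padding=1) for j in range(3)])
        self.cnn = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.LazyLinear(8192))           # V_1 head

    def forward(self, F):                             # F = [F_1, ..., F_6]
        L = self.down(F[5])                           # L_1 = Downconv2d(F_6)
        for i, lie in enumerate(self.lie, 1):         # L_{i+1} = LIE_i(L_i, F_{6-i})
            L = torch.relu(lie(torch.cat([L, interp_like(F[5 - i], L)], 1)))
        v1 = self.cnn(L)                              # V_1 = CNN(L_3)
        for j, lid in enumerate(self.lid, 3):         # L_{j+1} = LID_j(L_j, F_{7-j})
            L = torch.relu(lid(torch.cat([L, interp_like(F[6 - j], L)], 1)))
        return L, v1                                  # L_6 and V_1
```

With the FEM sketch above, `LFDM()(feats)` consumes its feature list directly.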
step 3: inputting the non-interactive features F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the high-level feature interaction module to obtain high-level interaction features (a code sketch follows this step):
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
where HIE_i (i = 1, 2) denotes the high-level interactive coding block with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoding block with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive coding or decoding block with index i, and H_6 and V_3 are the high-level interaction features learned from the high-level feature interaction module;
the high-level feature interaction module HFDM comprises a plurality of high-level interaction coding blocks HIE and a plurality of high-level interaction decoding blocks HID which are connected in series, an enhanced deformable convolution EDC is arranged in the last step of the high-level feature interaction module, and each high-level interaction coding block HIE is connected with the output of the coding block EB of the feature extraction module FEM in the step 1;
each high-level interactive decoding block HID is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
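A mirrored, shallow-to-deep sketch of this module follows, reusing the imports and the interp_like helper from the LFDM sketch above; the HIE/HID internals are again stand-ins (the patent places the EDC in the last step, omitted here).

```python
class HFDM(nn.Module):
    def __init__(self, chs=(32, 64, 128, 64, 32, 16), ch=16):
        super().__init__()
        self.head = nn.Conv2d(chs[0], ch, 3, padding=1)   # H_1 = Conv2d(F_1)
        self.hie = nn.ModuleList(                         # HIE_1, HIE_2
            [nn.Conv2d(ch + chs[i + 1], ch, 3, padding=1) for i in range(2)])
        self.hid = nn.ModuleList(                         # HID_3..HID_5
            [nn.Conv2d(ch + chs[j + 3], ch, 3, padding=1) for j in range(3)])
        self.cnn = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.LazyLinear(8192))     # V_3 head

    def forward(self, F):                         # F = [F_1, ..., F_6]
        H = torch.relu(self.head(F[0]))
        for i, hie in enumerate(self.hie, 1):     # H_{i+1} = HIE_i(H_i, F_{i+1})
            H = torch.relu(hie(torch.cat([H, interp_like(F[i], H)], 1)))
        v3 = self.cnn(H)                          # V_3 = CNN(H_3)
        for j, hid in enumerate(self.hid, 3):     # H_{j+1} = HID_j(H_j, F_{j+1})
            H = torch.relu(hid(torch.cat([H, interp_like(F[j], H)], 1)))
        return H, v3                              # H_6 and V_3
```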
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result;
(41) using the extracted features (L_6, F_6, H_6), predict the background image sequence and smoke density image sequence corresponding to the input image sequence:
where B̂ and D̂ respectively denote the predicted background image sequence and smoke density image sequence, Conv(·) denotes a convolution operation, and ReLU(·) denotes the ReLU function;
(42) inputting the predicted sequences into the C3D model to extract the spatio-temporal feature vector V_4, i.e.
where FC(·) denotes a fully connected layer and C3D(·) denotes the C3D model;
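As a hedged illustration of step (42), a small C3D-style stand-in network is shown below; the layer sizes, the single-channel input layout and the 8192-dimensional FC output are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

c3d = nn.Sequential(                      # stand-in for the C3D backbone
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten())
fc = nn.Linear(32, 8192)                  # FC(.) producing V_4

d_hat = torch.rand(1, 1, 8, 64, 64)       # (batch, channel, frames, H, W)
v4 = fc(c3d(d_hat))                       # V_4 = FC(C3D(D_hat))
print(v4.shape)                           # torch.Size([1, 8192])
```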
(43) since low-level features may introduce noise and high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified, i.e.
where concat denotes the concatenation operation on feature maps and ĉ denotes the predicted result of smoke classification.
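Step (43) reduces to a concatenate-then-classify operation, sketched below; the two-way smoke/no-smoke head and the softmax are assumptions, since the claim does not fix the classifier.

```python
import torch
import torch.nn as nn

v1, v2, v3, v4 = (torch.rand(1, 8192) for _ in range(4))   # V_1..V_4
classifier = nn.Linear(4 * 8192, 2)                        # assumed 2-way head
c_hat = torch.softmax(classifier(torch.cat([v1, v2, v3, v4], dim=1)), dim=1)
print(c_hat)   # predicted smoke / no-smoke probabilities
```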
The step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: the enhanced deformable convolution EDC is computed as follows: for the output feature map y, at position p_0 we have
y(p_0) = Σ_{p_n ∈ R} w_r(p_n)·x(p_0 + p_n) + Σ_{p_n ∈ R} w_d(p_n)·x(p_0 + p_n + Δp_n)
where x denotes the input feature map, p_n enumerates the positions in the set R, w_r and w_d denote the weight parameters to be learned in the standard convolution and the deformable convolution respectively, and Δp_n denotes the offset, which is typically fractional;
bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), namely:
x(p) = Σ_q G(p, q)·x(q)
where q enumerates the integer spatial positions of the feature map x, p is a fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
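The bilinear kernel can be made concrete with a small NumPy sketch: the value at a fractional position p is a weighted sum over the four surrounding integer positions q, exactly as in x(p) = Σ_q G(p, q)·x(q). The helper name is illustrative only.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample feature map x at the fractional position (py, px)."""
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for qy in (y0, y0 + 1):                # the four neighbouring integers q
        for qx in (x0, x0 + 1):
            if 0 <= qy < x.shape[0] and 0 <= qx < x.shape[1]:
                g = max(0, 1 - abs(py - qy)) * max(0, 1 - abs(px - qx))
                val += g * x[qy, qx]       # G(p, q) * x(q)
    return val

feat = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(feat, 1.3, 2.6))     # sample at fractional (1.3, 2.6)
```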
step 1.2: let EB_i (i = 1, 2, 3) denote the feature encoding block with index i and DB_j (j = 4, 5, 6) denote the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames of images; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
where F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a convolutional neural network with a shared output feature dimension of 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of several feature encoding blocks EB and feature decoding blocks DB, and F_6 and V_2 are the non-interactive features learned from the feature extraction module.
During model training in step 4, the total loss function L_total is expressed as:
L_total = λ_1·L_pce + λ_2·L_adv + λ_3·L_cls
where L_pce, L_adv and L_cls respectively denote the pixel cross-entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 denote the weight coefficients of the corresponding losses;
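The weighted combination itself is trivial to express in code; the λ defaults below are arbitrary placeholders, since the patent leaves the weights as tunable coefficients.

```python
import torch

def total_loss(loss_pce, loss_adv, loss_cls, lam=(1.0, 0.1, 1.0)):
    # L_total = lam1*L_pce + lam2*L_adv + lam3*L_cls (assumed lambda values)
    return lam[0] * loss_pce + lam[1] * loss_adv + lam[2] * loss_cls

print(total_loss(torch.tensor(0.7), torch.tensor(2.3), torch.tensor(0.4)))
```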
The pixel cross-entropy loss L_pce is computed over all pixels of the predicted sequences, where X^k_{(i,j)} denotes the pixel value at position (i, j) of the k-th frame of the input image sequence X, L, W and H respectively denote the number of frames, frame width and frame height, and B and D respectively denote the true background image sequence and smoke density image sequence;
The adversarial loss L_adv aims to compensate for the pixel correlation and distribution information in the label image that previous methods ignore: it adds correlation information between pixels, captures the distribution of the whole image, and penalizes mismatches in label statistics; it is computed as a generative-adversarial objective,
where E[·] denotes the empirical estimate of the probability expectation, x_1 and x_2 respectively denote the true smoke density image and true background image input to models D_1 and D_2, P_data denotes the probability distribution the data obeys, DDis and BDis denote the discriminator models in the generative adversarial setting, written D_1 and D_2 respectively, and log(·) denotes the logarithmic function;
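A hedged sketch of this adversarial term with the two discriminators D_1 (DDis, smoke density) and D_2 (BDis, background) follows; the standard non-saturating GAN objective via binary cross-entropy is assumed here, as is a final sigmoid in each discriminator, so the patent's exact formulation may differ.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()   # discriminators are assumed to end in a sigmoid

def adversarial_losses(d1, d2, real_d, fake_d, real_b, fake_b):
    """d1 = DDis (density), d2 = BDis (background); returns (D loss, G loss)."""
    def real_fake(d, real, fake):
        r, f = d(real), d(fake.detach())       # detach: don't update G here
        return bce(r, torch.ones_like(r)) + bce(f, torch.zeros_like(f))
    d_loss = real_fake(d1, real_d, fake_d) + real_fake(d2, real_b, fake_b)
    g1, g2 = d1(fake_d), d2(fake_b)            # generator wants "real" verdicts
    g_loss = bce(g1, torch.ones_like(g1)) + bce(g2, torch.ones_like(g2))
    return d_loss, g_loss
```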
The classification loss L_cls measures the discrepancy between the true and predicted categories,
where c and ĉ respectively denote the true and predicted categories of the image sequence.
Advantageous effects
The invention provides a forest smoke and fire identification method based on enhanced deformable convolution and label correlation, which comprises the following steps: (1) inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different layers; (2) inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interaction features; (3) inputting the non-interactive features of different levels into a high-level feature interaction module to obtain high-level interaction features; (4) inputting the non-interactive features, the low-level interaction features and the high-level interaction features into a feature fusion module and classifying to obtain the forest smoke and fire recognition result. The invention can further reduce the missed alarm rate and the false alarm rate of forest smoke and fire identification.
The beneficial effects of the invention are as follows: (1) to make the feature representation robust to scale, viewpoint, deformation and so on, an enhanced deformable convolution module is designed, which realizes feature complementation by breaking the fixed geometric structure of the convolution module and considering both the most representative features and locally weak features; (2) to prevent the low-level features that describe local details from being completely suppressed by high-response high-level features during feature fusion and interaction, a multi-directional feature interaction module is designed to acquire complementary interaction information across feature levels; (3) to exploit the correlation and distribution information of pixels in the label image, a discrimination loss term based on generative adversarial learning is designed to measure the distribution similarity between the network-predicted image and the real image and to eliminate distribution inconsistency. By combining these three advantages, the invention further improves the detection rate and reduces the false alarm rate.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is an example of smoke density images predicted by the present invention.
Fig. 3 shows the overall network architecture of the proposed model, comprising four parts in total, a Feature Extraction Module (FEM), a low-level feature interaction module (LFDM), a high-level feature interaction module (HFDM) and a Feature Fusion Module (FFM). The feature extraction module is obtained by connecting a plurality of feature Encoding Blocks (EB) and a plurality of feature Decoding Blocks (DB) in series. The low-level feature interaction module includes a plurality of low-level interaction encoding blocks (LIE) and low-level interaction decoding blocks (LID). The high-level feature interaction module includes a plurality of high-level interaction encoding blocks (HIEs) and high-level interaction decoding blocks (HIDs). The feature fusion module comprises multiple levels of feature fusion.
Fig. 4 shows a specific network structure of the feature encoding block EB (fig. 4 a) and the feature decoding block DB (fig. 4 b).
Fig. 5 shows a specific structural design of the low-level interactive coding block LIE and the low-level interactive decoding block LID. Specifically, a channel enhancement design using three-dimensional convolution is added after the input layer, the convolution kernel size is designed to be 1×1×3, and an Enhanced Deformable Convolution (EDC) is also added in the last step of the low-level feature interaction decoding module.
Fig. 6 shows a specific network structure that enhances the deformable convolution EDC.
Fig. 7 shows a specific network structure of a high-level interactive encoding block HIE (fig. 7 a) and a high-level interactive decoding block HID (fig. 7 b), to which an Enhanced Deformable Convolution (EDC) is further added in the last step of the high-level feature interactive decoding module.
Detailed Description
The invention will now be further described with reference to the examples and figures:
the invention provides a forest smoke and fire identification method based on enhanced deformable convolution and label correlation, which is shown in a flow chart in fig. 1 and specifically comprises the following steps:
step 1: inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different layers;
step 2: inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features;
step 3: inputting non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features;
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result.
The step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: parallel paths are added to take internal information into account, realizing the enhanced deformable convolution; this prevents locally weak smoke features from being ignored or suppressed, thereby extracting complementary smoke features. For the output feature map y, at position p_0 we have
y(p_0) = Σ_{p_n ∈ R} w_r(p_n)·x(p_0 + p_n) + Σ_{p_n ∈ R} w_d(p_n)·x(p_0 + p_n + Δp_n)
where x denotes the input feature map and p_n enumerates the positions in the set R; taking a 3×3 convolution kernel as an example, R is defined as {(-1, -1), (-1, 0), …, (1, 1)}; w_r and w_d denote the weight parameters to be learned in the standard convolution and the deformable convolution respectively, and Δp_n denotes the offset, which is typically fractional. Bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), i.e.
x(p) = Σ_q G(p, q)·x(q)
where q enumerates the integer spatial positions of the feature map x, p is a fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
step 1.2: in the feature encoding block, average pooling and maximum pooling are performed in parallel to extract complementary information, a three-dimensional convolution is added before the maximum pooling operation to encode spatio-temporal information, and enhanced deformable convolutions are designed at the beginning and the end to extract features that are discriminative with respect to the scale, viewpoint, deformation and so on of smoke; in the feature decoding block, the extracted features of different levels are aggregated before the upsampling operation, and the enhanced deformable convolution is used to adapt to various geometric transformations and promote the extraction of essential smoke features;
step 1.3: let EB_i (i = 1, 2, 3) denote the feature encoding block with index i and DB_j (j = 4, 5, 6) denote the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames of images; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking (illustrated in the sketch after this step) and input into the feature extraction module to extract non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
where F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a convolutional neural network with a shared output feature dimension of 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of several feature encoding blocks and feature decoding blocks, and F_6 and V_2 are the non-interactive features learned from the feature extraction module.
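The channel stacking of step 1.3 can be illustrated in a few lines; the frame count and resolution are arbitrary, and grayscale frames are assumed for simplicity.

```python
import torch

L_frames = 8
frames = [torch.rand(1, 256, 256) for _ in range(L_frames)]   # I_1 .. I_L
f0 = torch.cat(frames, dim=0).unsqueeze(0)   # stack frames along channels
print(f0.shape)                              # torch.Size([1, 8, 256, 256])
```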
The step 2 of inputting the non-interactive features of different levels into the low-level feature interaction module to obtain the low-level interactive features comprises the following steps:
step 2.1: the low-level feature interaction module consists of several low-level interactive encoders and low-level interactive decoders; features interact from the deep layers to the shallow layers of the network, deep high-response features are gradually transferred downward, and local details (texture and spatial information) are highlighted with high-level semantic information;
step 2.2: the non-interactive features F_i (i = 1, 2, …, 6) of different layers are input into the low-level feature interaction module, i.e.
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
where LIE_i (i = 1, 2) denotes the low-level interactive encoder with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoder with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoder or decoder with index i, and L_6 and V_1 are the low-level interaction features learned from the low-level feature interaction module.
The step 3 of inputting the non-interactive features of different levels into the high-level feature interaction module to obtain the high-level interactive features comprises the following steps:
step 3.1: the high-level feature interaction module consists of several high-level interactive encoders and high-level interactive decoders; multi-layer feature fusion and interaction proceed from shallow to deep, shallow low-response features are gradually accumulated so that deep high-response features do not completely suppress them, and the low-level information of local details is effectively retained; the channel enhancement design and the enhanced deformable convolution are adopted, and several skip connections are added to the aggregated features to accelerate model convergence;
step 3.2: the non-interactive features F_i (i = 1, 2, …, 6) of different layers are input into the high-level feature interaction module, i.e.
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
where HIE_i (i = 1, 2) denotes the high-level interactive encoder with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoder with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoder or decoder with index i, and H_6 and V_3 are the high-level interaction features learned from the high-level feature interaction module.
The step 4 of inputting the non-interactive features, the low-level interactive features and the high-level interactive features into the feature fusion module and classifying to obtain the forest smoke and fire recognition result comprises the following steps:
step 4.1: the extracted features (L_6, F_6, H_6) are used to predict the background image sequence and smoke density image sequence corresponding to the input image sequence, i.e.
where B̂ and D̂ respectively denote the predicted background image sequence and smoke density image sequence, Conv(·) denotes a convolution operation, and ReLU(·) denotes the ReLU function;
step 4.2: the predicted sequences are input into the C3D model to extract the spatio-temporal feature vector V_4, i.e.
where FC(·) denotes a fully connected layer and C3D(·) denotes the C3D model;
step 4.3: since low-level features may introduce noise and high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified, i.e.
where concat denotes the concatenation operation on feature maps and ĉ denotes the predicted result of smoke classification;
step 4.4: during model training, the total loss function L_total can be expressed as
L_total = λ_1·L_pce + λ_2·L_adv + λ_3·L_cls
where L_pce, L_adv and L_cls respectively denote the pixel cross-entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 denote the weight coefficients of the corresponding losses;
the pixel cross-entropy loss L_pce is computed over all pixels of the predicted sequences, where X^k_{(i,j)} denotes the pixel value at position (i, j) of the k-th frame of the input image sequence X, L, W and H respectively denote the number of frames, frame width and frame height, and B and D respectively denote the true background image sequence and smoke density image sequence;
the adversarial loss L_adv aims to compensate for the pixel correlation and distribution information in the label image that previous methods ignore: it adds correlation information between pixels, captures the distribution of the whole image, and penalizes mismatches in label statistics; here E[·] denotes the empirical estimate of the probability expectation, x_1 and x_2 respectively denote the true smoke density image and true background image input to models D_1 and D_2, P_data denotes the probability distribution the data obeys, DDis and BDis denote the discriminator models in the generative adversarial setting, written D_1 and D_2 respectively, and log(·) denotes the logarithmic function;
the classification loss L_cls measures the discrepancy between the true and predicted categories, where c and ĉ respectively denote the true and predicted categories of the image sequence.
Fig. 2 shows an example of smoke density image prediction using the present invention: the first row shows a sequence of four smoke images, and the second row shows the corresponding predicted smoke density images, in which brighter (whiter) pixels indicate higher smoke density in the smoke images.
Fig. 3 shows the overall network architecture of the proposed model, comprising four parts in total, a Feature Extraction Module (FEM), a low-level feature interaction module (LFDM), a high-level feature interaction module (HFDM) and a Feature Fusion Module (FFM). Wherein the feature extraction module comprises a plurality of Encoding Blocks (EB) and Decoding Blocks (DB). The low-level feature interaction module includes a plurality of low-Level Interaction Encoders (LIEs) and low-Level Interaction Decoders (LIDs). The high-level feature interaction module includes a plurality of high-level interaction encoders (HIEs) and high-level interaction decoders (HIDs). The feature fusion module comprises multi-stage feature fusion.
Fig. 4 shows a specific network structure of EB and DB.
Fig. 5 shows a specific structural design of LIE and LID. Specifically, a channel enhancement design using three-dimensional convolution is added after the input layer, the convolution kernel size is designed to be 1×1×3, and an Enhanced Deformable Convolution (EDC) is also added in the last step of the low-level feature interaction decoding module.
Fig. 6 shows a specific network structure of EDC.
Fig. 7 shows a specific network structure of the HIE and HID.

Claims (3)

1. A forest smoke and fire recognition method based on enhanced deformable convolution and label correlation is characterized by comprising the following steps:
step 1: inputting the forest monitoring video image into a feature extraction module FEM to obtain non-interactive features of different layers:
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
where DB_j (j = 4, 5, 6) denotes the feature decoding block with index j, EB_i (i = 1, 2, 3) denotes the feature encoding block with index i, F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a convolutional neural network with a shared output feature dimension of 8192, and V_2 denotes the feature vector output by the feature extraction module; the feature extraction module consists of several feature encoding blocks and feature decoding blocks, and F_6 and V_2 are the non-interactive features learned from the feature extraction module;
the feature extraction module FEM comprises a plurality of feature encoding blocks EB and a plurality of feature decoding blocks DB which are connected in series;
step 2: inputting the non-interactive feature maps F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the low-level feature interaction module to obtain low-level interaction features:
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
where LIE_i (i = 1, 2) denotes the low-level interactive coding block with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoding block with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive coding or decoding block with index i, and L_6 and V_1 are the low-level interaction features learned from the low-level feature interaction module;
the low-level feature interaction module LFDM comprises a plurality of low-level interactive coding blocks LIE and a plurality of low-level interactive decoding blocks LID connected in series; a channel enhancement design using a three-dimensional convolutional neural network CNN is added after the input layer of each LIE and LID, with the convolution kernel size designed as 1×1×3, and an enhanced deformable convolution EDC is arranged in the last step of the low-level feature interaction module;
each low-level interactive coding block LIE is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
each low-level interactive decoding block LID is connected with the output of the encoding block EB of the feature extraction module FEM in the step 1;
step 3: inputting the non-interactive features F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the high-level feature interaction module to obtain high-level interaction features:
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
where HIE_i (i = 1, 2) denotes the high-level interactive coding block with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoding block with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive coding or decoding block with index i, and H_6 and V_3 are the high-level interaction features learned from the high-level feature interaction module;
the high-level feature interaction module HFDM comprises a plurality of high-level interaction coding blocks HIE and a plurality of high-level interaction decoding blocks HID which are connected in series, an enhanced deformable convolution EDC is arranged in the last step of the high-level feature interaction module, and each high-level interaction coding block HIE is connected with the output of the coding block EB of the feature extraction module FEM in the step 1;
each high-level interactive decoding block HID is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result;
(41) using the extracted features (L_6, F_6, H_6), predict the background image sequence and smoke density image sequence corresponding to the input image sequence:
where B̂ and D̂ respectively denote the predicted background image sequence and smoke density image sequence, Conv(·) denotes a convolution operation, and ReLU(·) denotes the ReLU function;
(42) inputting the predicted sequences into the C3D model to extract the spatio-temporal feature vector V_4, i.e.
where FC(·) denotes a fully connected layer and C3D(·) denotes the C3D model;
(43) since low-level features may introduce noise and high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified, i.e.
where concat denotes the concatenation operation on feature maps and ĉ denotes the predicted result of smoke classification.
2. The forest smoke and fire recognition method based on enhanced deformable convolution and label correlation according to claim 1, wherein the step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features of different levels comprises the following steps:
step 1.1: the enhanced deformable convolution EDC is computed as follows: for the output feature map y, at position p_0 we have
y(p_0) = Σ_{p_n ∈ R} w_r(p_n)·x(p_0 + p_n) + Σ_{p_n ∈ R} w_d(p_n)·x(p_0 + p_n + Δp_n)
where x denotes the input feature map, p_n enumerates the positions in the set R, w_r and w_d denote the weight parameters to be learned in the standard convolution and the deformable convolution respectively, and Δp_n denotes the offset, which is typically fractional;
bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), namely:
x(p) = Σ_q G(p, q)·x(q)
where q enumerates the integer spatial positions of the feature map x, p is a fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
step 1.2: let EB_i (i = 1, 2, 3) denote the feature encoding block with index i and DB_j (j = 4, 5, 6) denote the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames of images; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
where F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a convolutional neural network with a shared output feature dimension of 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of several feature encoding blocks EB and feature decoding blocks DB, and F_6 and V_2 are the non-interactive features learned from the feature extraction module.
3. The forest smoke and fire recognition method based on enhanced deformable convolution and label correlation according to claim 1, wherein during model training in step 4, the total loss function L_total is expressed as:
L_total = λ_1·L_pce + λ_2·L_adv + λ_3·L_cls
where L_pce, L_adv and L_cls respectively denote the pixel cross-entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 denote the weight coefficients of the corresponding losses;
the pixel cross-entropy loss L_pce is computed over all pixels of the predicted sequences, where X^k_{(i,j)} denotes the pixel value at position (i, j) of the k-th frame of the input image sequence X, L, W and H respectively denote the number of frames, frame width and frame height, and B and D respectively denote the true background image sequence and smoke density image sequence;
the adversarial loss L_adv aims to compensate for the pixel correlation and distribution information in the label image that previous methods ignore: it adds correlation information between pixels, captures the distribution of the whole image, and penalizes mismatches in label statistics; here E[·] denotes the empirical estimate of the probability expectation, x_1 and x_2 respectively denote the true smoke density image and true background image input to models D_1 and D_2, P_data denotes the probability distribution the data obeys, DDis and BDis denote the discriminator models in the generative adversarial setting, written D_1 and D_2 respectively, and log(·) denotes the logarithmic function;
the classification loss L_cls measures the discrepancy between the true and predicted categories, where c and ĉ respectively denote the true and predicted categories of the image sequence.
CN202111320633.7A 2021-11-09 2021-11-09 Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation Active CN114399723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111320633.7A CN114399723B (en) 2021-11-09 2021-11-09 Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111320633.7A CN114399723B (en) 2021-11-09 2021-11-09 Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation

Publications (2)

Publication Number Publication Date
CN114399723A CN114399723A (en) 2022-04-26
CN114399723B 2024-03-05

Family

ID=81225797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111320633.7A Active CN114399723B (en) 2021-11-09 2021-11-09 Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation

Country Status (1)

Country Link
CN (1) CN114399723B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494616B2 (en) * 2019-05-09 2022-11-08 Shenzhen Malong Technologies Co., Ltd. Decoupling category-wise independence and relevance with self-attention for multi-label image classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409256A (en) * 2018-10-10 2019-03-01 东南大学 A kind of forest rocket detection method based on 3D convolutional neural networks
CN110490043A (en) * 2019-06-10 2019-11-22 东南大学 A kind of forest rocket detection method based on region division and feature extraction
CN113486697A (en) * 2021-04-16 2021-10-08 成都思晗科技股份有限公司 Forest smoke and fire monitoring method based on space-based multi-modal image fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Jiaxin; Chang Qing; Liu Xin. Deep belief convolutional network for forest fire smoke recognition. Modern Electronics Technique, 2020, No. 13. *

Also Published As

Publication number Publication date
CN114399723A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
Frizzi et al. Convolutional neural network for video fire and smoke detection
Andrearczyk et al. Convolutional neural network on three orthogonal planes for dynamic texture classification
Gong et al. Recognition of group activities using dynamic probabilistic networks
CN111832516B (en) Video behavior recognition method based on unsupervised video representation learning
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
Sekma et al. Human action recognition based on multi-layer fisher vector encoding method
CN113449660A (en) Abnormal event detection method of space-time variation self-coding network based on self-attention enhancement
CN113158983A (en) Airport scene activity behavior recognition method based on infrared video sequence image
Li et al. Real-time video-based smoke detection with high accuracy and efficiency
CN111914731A (en) Multi-mode LSTM video motion prediction method based on self-attention mechanism
CN115294563A (en) 3D point cloud analysis method and device based on Transformer and capable of enhancing local semantic learning ability
CN115527271A (en) Elevator car passenger abnormal behavior detection system and method
Tao et al. Smoke vehicle detection based on robust codebook model and robust volume local binary count patterns
Hu et al. Deep learning for distinguishing computer generated images and natural images: A survey
CN114399723B (en) Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation
CN111680618B (en) Dynamic gesture recognition method based on video data characteristics, storage medium and device
CN110765982A (en) Video smoke detection method based on change accumulation graph and cascaded depth network
US11941884B2 (en) Multi-source panoptic feature pyramid network
Nguyen et al. A comprehensive taxonomy of dynamic texture representation
CN115439930A (en) Multi-feature fusion gait recognition method based on space-time dimension screening
CN113283393B (en) Deepfake video detection method based on image group and two-stream network
CN113887443A (en) Industrial smoke emission identification method based on attribute perception attention convergence
CN106355566A (en) Smoke and flame detection method applied to fixed camera dynamic video sequence
CN106530300A (en) Flame identification algorithm of low-rank analysis
CN117197727B (en) Global space-time feature learning-based behavior detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant