CN114399723B - Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation - Google Patents
Info
- Publication number
- CN114399723B (application number CN202111320633.7A)
- Authority
- CN
- China
- Prior art keywords
- feature
- level
- representing
- low
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4007—Interpolation-based scaling, e.g. bilinear interpolation
Abstract
The invention relates to a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which comprises the following steps: (1) inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different levels; (2) inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features; (3) inputting the non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features; (4) inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying them to obtain the forest smoke and fire recognition result. The invention further reduces the missed-detection rate and false-alarm rate of forest smoke and fire recognition.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, deep learning and image video processing, and relates to a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation.
Background
Forest fires can have devastating consequences for ecosystems and human life. The most important precursor signal of a forest fire is smoke, so vision-based smoke and fire recognition methods are critical for detecting fires early and reducing fire risk. However, achieving low missed-detection and false-alarm rates for smoke in complex real scenes remains very challenging, owing to the wide variety of smoke appearances and the continual interference of smoke-like objects.
To date, researchers have proposed many vision-based smoke recognition/detection methods and achieved great success in improving recognition accuracy. These methods can be broadly divided into two categories: traditional methods and deep learning methods. Traditional methods generally extract handcrafted features and classify them to obtain a recognition result. Handcrafted features include color, gradient, shape, texture, motion and frequency-domain features, mathematical models, sparse-representation-based features, and the like. Traditional methods cannot efficiently extract semantic information, because handcrafted features are complex functions of local geometry, structure and context. In addition, smoke exhibits large intra-class variation in color, shape, texture, etc. It is therefore difficult to reach low missed-detection and false-alarm rates for smoke by designing a single discriminative handcrafted smoke feature.
Deep learning methods typically first extract deep features using variants of convolutional neural networks (CNNs) and then classify the extracted features to obtain a recognition result. Deep features can describe the semantic characteristics of smoke and thus tend to outperform handcrafted features. Existing deep models can be further divided into category-level and pixel-level supervised models according to the supervision information they use. Category-level supervised deep learning models typically use binary smoke/no-smoke labels and bounding boxes of smoke regions as training-sample labels to guide model training, and most existing smoke recognition methods adopt this kind of supervision. However, simple category supervision has only local semantic scope, i.e. it relates only to the small area where the smoke target is located. In addition, smoke is typically translucent, so the extracted features contain background information or are affected by imaging conditions such as lighting, weather and viewing angle, which interferes with recognition accuracy.
To address the above issues, some researchers have explored pixel-level supervised deep learning models, providing pixel-level labels to guide model training, e.g. treating smoke recognition as a smoke segmentation task by assigning smoke/no-smoke binary labels to every pixel of a sample image. However, due to the diversity of smoke and the constant interference of smoke-like objects in real complex scenes, pixel-level supervised smoke recognition methods still suffer from a large number of missed detections and false alarms.
Existing models have not yet solved the core technical and engineering problems of forest smoke and fire recognition. After thoroughly analysing existing smoke recognition models and patents, we consider that they still suffer from the following three problems. 1) Most forest pan-tilt cameras monitor a wide area with a range of 3-8 km, which causes large variations in the scale, viewpoint and distortion of smoke in the surveillance video. Existing smoke recognition methods typically apply data augmentation to the training samples or use different receptive fields to obtain multi-scale features. However, data augmentation may not generalize to new tasks with unknown geometric transformations, and the benefit of multiple receptive fields may be limited because deep networks built from CNN modules have a fixed geometric structure. 2) Related studies have shown that convolutional layers at different depths attend to different levels of information: activations from shallow layers focus on low-level texture details (e.g. colors, lines, edges and shapes), while activations from deep layers focus on high-level semantic content (e.g. objects and concepts). Because features at different levels are complementary, exploiting inter-layer feature fusion and interaction to improve recognition capability is of great significance. Existing smoke recognition methods typically adopt a unidirectional, low-to-high feature transmission strategy; such a strategy, however, may cause the low-level features that characterise local details to be completely suppressed by high-response high-level features during feature interaction. 3) Existing pixel-level supervised smoke recognition models typically use image-form labels; image segmentation, for example, is essentially a pixel-level classification problem. Existing methods typically train with pixel-level losses (e.g. cross-entropy loss, mean-squared-error loss), in which the label of each pixel is predicted independently. These losses ignore the correlation of pixels within the label image, which may lose correlation information between pixels.
To solve these problems, this patent provides a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which designs an enhanced deformable convolution to realise bidirectional feature interaction and exploits label correlation information to achieve lower missed-detection and false-alarm rates.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which can further reduce the missed-detection rate and false-alarm rate of smoke recognition.
Technical proposal
A forest smoke and fire recognition method based on enhanced deformable convolution and label correlation is characterized by comprising the following steps:
step 1: inputting the forest monitoring video image into a feature extraction module FEM to obtain non-interactive features of different layers:
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein EB_i (i = 1, 2, 3) denotes the feature encoding block with index i, DB_j (j = 4, 5, 6) denotes the feature decoding block with index j, F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a shared convolutional neural network with output feature dimension 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of a plurality of feature encoding blocks and feature decoding blocks, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module;
the feature extraction module FEM comprises a plurality of feature encoding blocks EB and a plurality of feature decoding blocks DB which are connected in series;
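The index flow of the feature extraction module above can be illustrated with a minimal sketch. This is not the claimed network: the encoding blocks EB_i are stood in for by 2×2 average pooling, the decoding blocks DB_j by nearest-neighbour upsampling, and the CNN by simple flattening, purely to show how F_1, …, F_6 and V_2 are produced.

```python
import numpy as np

def encode_block(f):
    # Hypothetical stand-in for EB_i: 2x2 average pooling halves the spatial size.
    h, w = f.shape[0] // 2, f.shape[1] // 2
    return f[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3))

def decode_block(f):
    # Hypothetical stand-in for DB_j: nearest-neighbour 2x upsampling.
    return f.repeat(2, axis=0).repeat(2, axis=1)

def feature_extraction(f0):
    feats, f = [], f0
    for _ in range(3):              # F_i = EB_i(F_{i-1}), i = 1, 2, 3
        f = encode_block(f)
        feats.append(f)
    v2 = feats[-1].reshape(-1)      # V_2 = CNN(F_3): flattening as a stand-in
    for _ in range(3):              # F_j = DB_j(F_{j-1}), j = 4, 5, 6
        f = decode_block(f)
        feats.append(f)
    return feats, v2                # feats = [F_1, ..., F_6]
```

With a 32×32 input, the sketch yields F_3 of size 4×4 and restores F_6 to 32×32, mirroring the encoder-decoder shape of the FEM.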
step 2: inputting the non-interactive feature maps F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the low-level feature interaction module to obtain low-level interactive features:
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
wherein LIE_i (i = 1, 2) denotes the low-level interactive encoding block with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoding block with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoding or decoding block with index i, and L_6 and V_1 represent the low-level interactive features learned from the low-level feature interaction module;
the low-level feature interaction module LFDM comprises a plurality of low-level interactive encoding blocks LIE and a plurality of low-level interactive decoding blocks LID connected in series; a channel enhancement design using a three-dimensional convolutional neural network CNN is added after the input layer of each LIE and LID, with the convolution kernel size designed as 1 × 1 × 3, and an enhanced deformable convolution EDC is arranged in the last step of the low-level feature interaction module;
each low-level interactive coding block LIE is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
each low-level interactive decoding block LID is connected with the output of the encoding block EB of the feature extraction module FEM in the step 1;
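The deep-to-shallow index flow of step 2 can be sketched as follows. The `fuse` argument is a hypothetical stand-in for Downconv2d, LIE_i and LID_j alike, and the CNN producing V_1 is replaced by the identity; only the indices L_1, …, L_6 and which F_i each block consumes are taken from the equations above.

```python
def low_level_interaction(F, fuse):
    """Index flow of the low-level feature interaction module (deep to shallow).

    F    : list [F_1, ..., F_6] of non-interactive features (any fuse-able type)
    fuse : hypothetical two-input stand-in for Downconv2d / LIE_i / LID_j
    """
    L = [fuse(F[5], F[5])]             # L_1 = Downconv2d(F_6), stand-in
    for i in (1, 2):                   # L_{i+1} = LIE_i(L_i, F_{6-i})
        L.append(fuse(L[-1], F[5 - i]))
    v1 = L[2]                          # V_1 = CNN(L_3); identity stand-in
    for j in (3, 4, 5):                # L_{j+1} = LID_j(L_j, F_{7-j})
        L.append(fuse(L[-1], F[6 - j]))
    return L, v1                       # L = [L_1, ..., L_6]
```

Running it with scalar "features" and averaging as `fuse` shows the deep feature F_6 entering first and progressively mixing with shallower F_i.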
step 3: inputting the non-interactive features F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the high-level feature interaction module to obtain high-level interactive features:
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
wherein HIE_i (i = 1, 2) denotes the high-level interactive encoding block with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoding block with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoding or decoding block with index i, and H_6 and V_3 represent the high-level interactive features learned from the high-level feature interaction module;
the high-level feature interaction module HFDM comprises a plurality of high-level interaction coding blocks HIE and a plurality of high-level interaction decoding blocks HID which are connected in series, an enhanced deformable convolution EDC is arranged in the last step of the high-level feature interaction module, and each high-level interaction coding block HIE is connected with the output of the coding block EB of the feature extraction module FEM in the step 1;
each high-level interactive decoding block HID is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
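The shallow-to-deep index flow of step 3 mirrors the low-level module with reversed traversal order. As before, `fuse` is a hypothetical stand-in for Conv2d, HIE_i and HID_j, and the CNN producing V_3 is replaced by the identity; only the index pattern comes from the equations above.

```python
def high_level_interaction(F, fuse):
    """Index flow of the high-level feature interaction module (shallow to deep).

    F    : list [F_1, ..., F_6] of non-interactive features
    fuse : hypothetical two-input stand-in for Conv2d / HIE_i / HID_j
    """
    H = [fuse(F[0], F[0])]             # H_1 = Conv2d(F_1), stand-in
    for i in (1, 2):                   # H_{i+1} = HIE_i(H_i, F_{i+1})
        H.append(fuse(H[-1], F[i]))
    v3 = H[2]                          # V_3 = CNN(H_3); identity stand-in
    for j in (3, 4, 5):                # H_{j+1} = HID_j(H_j, F_{j+1})
        H.append(fuse(H[-1], F[j]))
    return H, v3                       # H = [H_1, ..., H_6]
```

With scalar features and averaging as `fuse`, the shallow feature F_1 enters first and is gradually accumulated with deeper F_i, the opposite traversal to the low-level module.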
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result;
(41) using the extracted features (L_6, F_6, H_6) to predict the background image sequence and the smoke density image sequence corresponding to the input image sequence,
wherein B̂ and D̂ respectively denote the predicted background image sequence and the predicted smoke density image sequence, Conv(·) denotes a convolution operation, and ReLU(·) denotes the ReLU activation function;
(42) inputting the predicted sequences B̂ and D̂ into the C3D model to extract the spatio-temporal feature vector V_4,
wherein FC(·) denotes a fully connected layer and C3D(·) denotes the C3D model;
(43) since the low-level features may introduce noise and the high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified,
wherein concat denotes the concatenation operation on feature maps and ĉ denotes the prediction result of the smoke classification.
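The fusion-and-classification step can be sketched as concatenation followed by a fully connected layer and softmax. The weight matrix `W` and bias `b` are illustrative placeholders for the learned FC layer, not the patented parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(v1, v2, v3, v4, W, b):
    # hat_c = softmax(FC(concat(V_1, V_2, V_3, V_4))):
    # concat is the series operation on the feature vectors,
    # W @ v + b is a stand-in for the fully connected layer.
    v = np.concatenate([v1, v2, v3, v4])
    return softmax(W @ v + b)
```

With zero weights the two-class output is uniform, which is a convenient sanity check that the concatenated dimension matches the FC layer.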
The step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: the calculation method of the enhanced deformable convolution EDC is as follows: for the output feature map y at position p_0,
y(p_0) = Σ_{p_n ∈ R} [ w_r(p_n) · x(p_0 + p_n) + w_d(p_n) · x(p_0 + p_n + Δp_n) ]
wherein x denotes the input feature map, p_n enumerates the positions in the sampling grid R, w_r and w_d denote the weight parameters to be learned in the standard convolution path and the deformable convolution path respectively, and Δp_n denotes the offset, which is typically fractional;
bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), i.e.
x(p) = Σ_q G(q, p) · x(q)
wherein q enumerates the integer spatial positions of the feature map x, p is the fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
step 1.2: denote by EB_i (i = 1, 2, 3) the feature encoding block with index i and by DB_j (j = 4, 5, 6) the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract the non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a shared convolutional neural network with output feature dimension 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of a plurality of feature encoding blocks EB and feature decoding blocks DB, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module.
In the model training of step 4, the total loss function L_total is expressed as the weighted sum
L_total = λ_1 · L_pce + λ_2 · L_adv + λ_3 · L_cls
wherein L_pce, L_adv and L_cls respectively denote the pixel cross-entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 respectively denote the weight coefficients of the corresponding losses;
the pixel cross-entropy loss L_pce is computed per pixel over the predicted and true image sequences, wherein X_k(i, j) denotes the pixel value at position (i, j) of the k-th frame of the input image sequence X, L, W and H respectively denote the number of frames, the frame width and the frame height, and B and D respectively denote the true background image sequence and the true smoke density image sequence;
the adversarial loss L_adv aims to recover the pixel correlation and distribution information of the label images that existing methods ignore: it adds correlation information among pixels, captures the distribution of the whole image, and penalizes mismatches in the label statistics; wherein E denotes an empirical estimate of the probability expectation, x_1 and x_2 respectively denote the true smoke density image and the true background image input to the models D_1 and D_2, P_data denotes the probability distribution the data obeys, DDis and BDis denote the discriminator models of the generative adversarial framework, denoted D_1 and D_2 respectively, and log(·) denotes the logarithm function;
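For one of the two discriminators (DDis on smoke density images or BDis on background images), the standard generative-adversarial objective can be sketched as below. This is the generic GAN discriminator objective, not the patent's exact formula, which is not reproduced on this page.

```python
import numpy as np

def discriminator_objective(d_real, d_fake, eps=1e-7):
    # E[log D(x)] + E[log(1 - D(x_hat))] for one discriminator:
    # the discriminator maximises this value, while the generator
    # (the prediction network) minimises the second term.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

At the equilibrium point where the discriminator outputs 0.5 everywhere, the objective equals -2·log 2, the classical GAN saddle-point value.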
the classification loss L_cls is computed from c and ĉ, which respectively denote the true and predicted categories of the image sequence.
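The three loss terms and their weighted combination can be sketched as follows. The per-pixel binary cross-entropy and the classification cross-entropy are standard formulations assumed here, and the λ values in the example are placeholders, not the patent's tuned coefficients.

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-7):
    # Per-pixel binary cross entropy, averaged over all L*W*H pixel positions.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def classification_loss(probs, true_class, eps=1e-7):
    # Cross entropy between the true category c and the predicted distribution.
    return float(-np.log(max(probs[true_class], eps)))

def total_loss(l_pce, l_adv, l_cls, lambdas=(1.0, 0.1, 1.0)):
    # L_total = lambda_1 * L_pce + lambda_2 * L_adv + lambda_3 * L_cls
    l1, l2, l3 = lambdas
    return l1 * l_pce + l2 * l_adv + l3 * l_cls
```

The weighted-sum form lets the adversarial term be down-weighted relative to the dense pixel supervision, a common practice when mixing GAN and reconstruction losses.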
Advantageous effects
The invention provides a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which comprises the following steps: (1) inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different levels; (2) inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features; (3) inputting the non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features; (4) inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying them to obtain the forest smoke and fire recognition result. The invention further reduces the missed-detection rate and false-alarm rate of forest smoke and fire recognition.
The beneficial effects of the invention are as follows: (1) to make the feature representation robust to scale, viewpoint, deformation, etc., an enhanced deformable convolution module is devised, which achieves feature complementarity by breaking the fixed geometric structure of the convolution module and considering both the most representative features and locally weak features; (2) to prevent the low-level features that describe local details from being completely suppressed by high-response high-level features during feature fusion and interaction, a multi-directional feature interaction module is devised to acquire complementary interaction information from features of different levels; (3) to exploit the correlation and distribution information of pixels in the label image, a discriminative loss term based on generative adversarial learning is devised to measure the distribution similarity between the network-predicted image and the real image and to eliminate distribution inconsistency. By combining these three advantages, the invention further improves the detection rate and reduces the false-alarm rate.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is an example of smoke density images predicted by the present invention.
Fig. 3 shows the overall network architecture of the proposed model, comprising four parts in total, a Feature Extraction Module (FEM), a low-level feature interaction module (LFDM), a high-level feature interaction module (HFDM) and a Feature Fusion Module (FFM). The feature extraction module is obtained by connecting a plurality of feature Encoding Blocks (EB) and a plurality of feature Decoding Blocks (DB) in series. The low-level feature interaction module includes a plurality of low-level interaction encoding blocks (LIE) and low-level interaction decoding blocks (LID). The high-level feature interaction module includes a plurality of high-level interaction encoding blocks (HIEs) and high-level interaction decoding blocks (HIDs). The feature fusion module comprises multiple levels of feature fusion.
Fig. 4 shows a specific network structure of the feature encoding block EB (fig. 4 a) and the feature decoding block DB (fig. 4 b).
Fig. 5 shows a specific structural design of the low-level interactive coding block LIE and the low-level interactive decoding block LID. Specifically, a channel enhancement design using three-dimensional convolution is added after the input layer, the convolution kernel size is designed to be 1×1×3, and an Enhanced Deformable Convolution (EDC) is also added in the last step of the low-level feature interaction decoding module.
Fig. 6 shows a specific network structure that enhances the deformable convolution EDC.
Fig. 7 shows a specific network structure of a high-level interactive encoding block HIE (fig. 7 a) and a high-level interactive decoding block HID (fig. 7 b), to which an Enhanced Deformable Convolution (EDC) is further added in the last step of the high-level feature interactive decoding module.
Detailed Description
The invention will now be further described with reference to the examples and figures:
the invention provides a forest smoke and fire identification method based on enhanced deformable convolution and label correlation, which is shown in a flow chart in fig. 1 and specifically comprises the following steps:
step 1: inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different layers;
step 2: inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features;
step 3: inputting non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features;
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result.
The step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: a parallel path is added to take internal information into account, realising the enhanced deformable convolution; this prevents locally weak smoke features from being ignored or suppressed and thereby extracts complementary smoke features. For the output feature map y at position p_0,
y(p_0) = Σ_{p_n ∈ R} [ w_r(p_n) · x(p_0 + p_n) + w_d(p_n) · x(p_0 + p_n + Δp_n) ]
wherein x denotes the input feature map and p_n enumerates the positions in the sampling grid R; taking a 3 × 3 local region as an example, R is defined as {(-1, -1), (-1, 0), …, (1, 1)} for a 3 × 3 convolution kernel; w_r and w_d denote the weight parameters to be learned in the standard convolution and the deformable convolution respectively, and Δp_n denotes the offset, which is typically fractional; bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), i.e.
x(p) = Σ_q G(q, p) · x(q)
wherein q enumerates the integer spatial positions of the feature map x, p is the fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
step 1.2: in the feature encoding block, average pooling and max pooling are performed in parallel to extract complementary information, and a three-dimensional convolution is added before the max pooling operation to encode spatio-temporal information; enhanced deformable convolutions are placed at the beginning and the end respectively to extract features that are discriminative with respect to the scale, viewpoint, deformation, etc. of smoke; in the feature decoding block, the extracted features of different levels are aggregated before the upsampling operation, and the enhanced deformable convolution is used to adapt to various geometric transformations and promote the extraction of the essential features of smoke;
step 1.3: denote by EB_i (i = 1, 2, 3) the feature encoding block with index i and by DB_j (j = 4, 5, 6) the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract the non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a shared convolutional neural network with output feature dimension 8192, V_2 denotes the feature vector output by the feature extraction module, which consists of a plurality of feature encoding and decoding blocks, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module.
The step 2 of inputting the non-interactive features of different levels into the low-level feature interaction module to obtain the low-level interactive features comprises the following steps:
step 2.1: the low-level feature interaction module consists of a plurality of low-level interactive encoders and low-level interactive decoders; features interact from the deep layers to the shallow layers of the network, so that deep high-response features are gradually transferred downwards and the local details (texture and spatial information) are highlighted together with the high-level semantic information;
step 2.2: the non-interactive features F_i (i = 1, 2, …, 6) of different layers are input into the low-level feature interaction module, i.e.
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
wherein LIE_i (i = 1, 2) denotes the low-level interactive encoder with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoder with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoder or decoder with index i, and L_6 and V_1 represent the low-level interactive features learned from the low-level feature interaction module.
The step 3 of inputting the non-interactive features of different levels into the high-level feature interaction module to obtain the high-level interactive features comprises the following steps:
step 3.1: the high-level feature interaction module consists of a plurality of high-level interactive encoders and high-level interactive decoders; the multi-layer feature fusion interaction proceeds from shallow to deep, gradually accumulating shallow low-response features so that they are not completely suppressed by deep high-response features, thereby effectively preserving the low-level information of local details; a channel enhancement design and the enhanced deformable convolution are adopted, and several skip connections are added to the aggregation to accelerate model convergence;
step 3.2: the non-interactive features F_i (i = 1, 2, …, 6) of different layers are input into the high-level feature interaction module, i.e.
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
wherein HIE_i (i = 1, 2) denotes the high-level interactive encoder with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoder with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoder or decoder with index i, and H_6 and V_3 represent the high-level interactive features learned from the high-level feature interaction module.
The step 4 of inputting the non-interactive features, the low-level interactive features and the high-level interactive features into the feature fusion module and classifying to obtain the forest smoke and fire recognition result comprises the following steps:
step 4.1: using the extracted features (L 6 ,F 6 ,H 6 ) Predicting a sequence of background images and a sequence of smoke density images corresponding to a sequence of input images, i.e
Wherein,and->Respectively representing a predicted background image sequence and a smoke density image sequence, conv (x) representing a convolution operation, and ReLU (x) representing a ReLU function;
step 4.2: will beInput into C3D model to extract space-time characteristic vector V 4 I.e.
Wherein FC (x) represents a fully connected layer and C3D (x) represents a C3D model;
step 4.3: since the bottom layer features may bring noise, the upper layer features may suppress the bottom layer beneficial information, then feature vectors (V 1 ,V 2 ,V 3 ,V 4 ) Sorting, i.e.
Wherein concat represents the series operation of the feature map,a prediction result representing the smoke classification;
step 4.4: in model training, the total loss function L_total can be expressed as

L_total = λ_1 · L_pce + λ_2 · L_adv + λ_3 · L_cls

wherein L_pce, L_adv and L_cls respectively represent the pixel cross entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 represent the weight coefficients of the corresponding losses;
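The weighted combination in step 4.4 is straightforward; the sketch below uses illustrative placeholder lambda values, since the patent does not state its tuned coefficients here:

```python
def total_loss(l_pixel, l_adv, l_cls, lambdas=(1.0, 0.1, 1.0)):
    """Weighted sum of the three training losses.

    lambdas are hypothetical weight coefficients (lambda_1, lambda_2,
    lambda_3), not the patent's tuned values.
    """
    l1, l2, l3 = lambdas
    return l1 * l_pixel + l2 * l_adv + l3 * l_cls
```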
the pixel cross entropy loss L_pce is computed over every pixel of the predicted sequences,
wherein X_k(i, j) represents the pixel value at position (i, j) of the k-th frame of the input image sequence X; L, W and H represent the number of frames, the frame width and the frame height, respectively; and B and D represent the ground-truth background image sequence and smoke density image sequence, respectively;
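The patent's exact per-pixel formula appears only as an image in the source and is not reproduced above. A common realization of a pixel cross entropy over an image sequence, shown purely as an illustration, averages a binary cross entropy over all L × W × H pixels:

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-7):
    # pred, target: arrays of shape (L, W, H) with values in [0, 1]
    # (L frames, W x H pixels per frame, matching the text's definitions);
    # eps-clipping avoids log(0)
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))
```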
the adversarial loss L_adv aims to compensate for the correlation and distribution information of the pixels in the label images, which prior methods ignore: it increases the correlation information among pixels, captures the distribution of the whole image, and penalizes mismatches in the label statistics;
wherein E refers to an empirical estimate of the probability expectation, x_1 and x_2 respectively represent the true smoke density image and true background image input to models D_1 and D_2, P_data represents the probability distribution obeyed by the data, DDis and BDis respectively represent the discriminator models in the generative adversarial training, denoted D_1 and D_2, and log(x) represents the logarithmic function;
the classification loss L_cls is computed from the true and predicted categories,
wherein c and ĉ represent the true and predicted categories of the image sequence, respectively.
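The classification loss formula is also lost to an image in the source; a standard softmax cross entropy between class scores and the true category c, shown as an illustrative stand-in, looks like this:

```python
import numpy as np

def classification_loss(logits, true_class):
    # softmax cross entropy for one image sequence:
    # -log p(true_class), with the usual max-shift for numerical stability
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[true_class])
```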
Fig. 2 shows an example of smoke density image prediction using the present invention: the first row shows a sequence of four smoke images and the second row shows the corresponding predicted smoke density images, in which brighter (whiter) pixels correspond to higher smoke density in the smoke images.
Fig. 3 shows the overall network architecture of the proposed model, which comprises four parts: a Feature Extraction Module (FEM), a Low-level Feature Interaction Module (LFDM), a High-level Feature Interaction Module (HFDM) and a Feature Fusion Module (FFM). The feature extraction module comprises a plurality of Encoding Blocks (EB) and Decoding Blocks (DB). The low-level feature interaction module comprises a plurality of Low-level Interaction Encoders (LIE) and Low-level Interaction Decoders (LID). The high-level feature interaction module comprises a plurality of High-level Interaction Encoders (HIE) and High-level Interaction Decoders (HID). The feature fusion module performs multi-stage feature fusion.
Fig. 4 shows a specific network structure of EB and DB.
Fig. 5 shows a specific structural design of LIE and LID. Specifically, a channel enhancement design using three-dimensional convolution is added after the input layer, the convolution kernel size is designed to be 1×1×3, and an Enhanced Deformable Convolution (EDC) is also added in the last step of the low-level feature interaction decoding module.
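One plausible reading of the 1 × 1 × 3 channel enhancement is a three-dimensional convolution that mixes each spatial position with its two temporal neighbours. The NumPy sketch below illustrates this under the assumption of zero padding (which the patent does not specify); the kernel weights are placeholders for the learned parameters:

```python
import numpy as np

def channel_enhance_1x1x3(x, w):
    # x: (T, H, W) stack of feature maps; w: length-3 kernel applied along
    # the first (temporal/channel) axis only -- i.e. a 1x1x3 convolution.
    # Zero padding keeps the output length equal to T.
    xp = np.pad(x, ((1, 1), (0, 0), (0, 0)))
    return w[0] * xp[:-2] + w[1] * xp[1:-1] + w[2] * xp[2:]
```

With the identity kernel [0, 1, 0] the operation leaves the input unchanged, which is a convenient sanity check.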
Fig. 6 shows a specific network structure of EDC.
Fig. 7 shows a specific network structure of the HIE and HID.
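The bilinear sampling step used inside the EDC of Fig. 6 (and detailed in claim 2, step 1.1) evaluates x(p) = Σ_q G(p, q)·x(q) at a fractional position p. A minimal NumPy illustration, not the patent's implementation, exploiting that G factorises into two one-dimensional triangular kernels so only the (up to) four integer neighbours of p contribute:

```python
import numpy as np

def bilinear_sample(x, p):
    # x: 2-D feature map; p = (py, px): fractional sampling position.
    # Out-of-bounds neighbours contribute zero, as in the usual
    # deformable-convolution formulation.
    h, w = x.shape
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1.0 - abs(py - qy)) * max(0.0, 1.0 - abs(px - qx))
                val += g * x[qy, qx]
    return val
```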
Claims (3)
1. A forest smoke and fire recognition method based on enhanced deformable convolution and label correlation is characterized by comprising the following steps:
step 1: inputting the forest monitoring video image into a feature extraction module FEM to obtain non-interactive features of different layers:
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein DB_j (j = 4, 5, 6) represents the feature decoding block with index j, EB_i (i = 1, 2, 3) represents the feature encoding block with index i, F_i (i = 1, 2, …, 6) represents the output feature map of the feature encoding block or feature decoding block with index i, i.e. the non-interactive features of different layers, CNN represents a convolutional neural network with a shared output feature dimension of 8192, V_2 represents the feature vector output by the feature extraction module, which consists of a plurality of feature encoding blocks and feature decoding blocks, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module;
the feature extraction module FEM comprises a plurality of feature encoding blocks EB and a plurality of feature decoding blocks DB which are connected in series;
step 2: and (2) carrying out the step (1) on the non-interactive feature graphs F of different layers i (i=1, 2, …, 6) input to the low-level feature interaction module to obtain low-level interaction features:
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{6-j}), j = 3, 4, 5
wherein LIE_i (i = 1, 2) denotes the low-level interactive encoding block with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoding block with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoding block or low-level interactive decoding block with index i, and L_6 and V_1 represent the low-level interaction features learned from the low-level feature interaction module;
the low-level feature interaction module LFDM comprises a plurality of low-level interactive encoding blocks LIE and a plurality of low-level interactive decoding blocks LID connected in series; a channel enhancement design using a three-dimensional convolutional neural network CNN is added after the input layer of each LIE and LID, with the convolution kernel size designed as 1 × 1 × 3, and an enhanced deformable convolution EDC is arranged in the last step of the low-level feature interaction module;
each low-level interactive coding block LIE is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
each low-level interactive decoding block LID is connected with the output of the encoding block EB of the feature extraction module FEM in the step 1;
step 3: and (3) carrying out the step (1) on the non-interactive features F with different layers i (i=1, 2, …, 6) input high-level feature interaction module, resulting in high-level interaction features:
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
wherein HIE_i (i = 1, 2) denotes the high-level interactive encoding block with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoding block with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoding block or high-level interactive decoding block with index i, and H_6 and V_3 represent the high-level interaction features learned from the high-level feature interaction module;
the high-level feature interaction module HFDM comprises a plurality of high-level interaction coding blocks HIE and a plurality of high-level interaction decoding blocks HID which are connected in series, an enhanced deformable convolution EDC is arranged in the last step of the high-level feature interaction module, and each high-level interaction coding block HIE is connected with the output of the coding block EB of the feature extraction module FEM in the step 1;
each high-level interactive decoding block HID is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result;
(41) using the extracted features (L_6, F_6, H_6), predict the background image sequence and the smoke density image sequence corresponding to the input image sequence,
wherein B̂ and D̂ respectively represent the predicted background image sequence and smoke density image sequence, Conv(x) represents a convolution operation, and ReLU(x) represents the ReLU activation function;
(42) the predicted B̂ and D̂ are input into the C3D model to extract the spatio-temporal feature vector V_4,
wherein FC(x) represents a fully connected layer and C3D(x) represents the C3D model;
(43) since the low-level features may introduce noise and the high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified,
wherein concat represents the concatenation operation on feature maps and ĉ represents the prediction result of the smoke classification.
2. The enhanced deformable convolution and tag correlation based forest fire identification method according to claim 1, wherein: the step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: the enhanced deformable convolution EDC is calculated as follows: for the output feature map y, at each position p_0:

y(p_0) = Σ_{p_n ∈ R} [ w_r(p_n) · x(p_0 + p_n) + w_d(p_n) · x(p_0 + p_n + Δp_n) ]

wherein x represents the input feature map, p_n enumerates the positions in the regular sampling grid R, w_r and w_d represent the weight parameters to be learned in the standard convolution and the deformable convolution, respectively, and Δp_n represents the offset, which is typically fractional;
bilinear interpolation is used to calculate x(p_0 + p_n + Δp_n), i.e.:

x(p) = Σ_q G(p, q) · x(q)

wherein q enumerates all integral spatial positions of the feature map x, p is a fractional position, and G(·, ·) represents the bilinear interpolation kernel;
step 1.2: let EB_i (i = 1, 2, 3) denote the feature encoding block with index i and DB_j (j = 4, 5, 6) denote the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video containing L frames of images; the image sequence is converted into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract the non-interactive features of different levels, i.e.
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein F_i (i = 1, 2, …, 6) represents the output feature map of the feature encoding block or feature decoding block with index i, i.e. the non-interactive features of different layers, CNN represents a convolutional neural network with a shared output feature dimension of 8192, V_2 represents the feature vector output by the feature extraction module, which consists of a plurality of feature encoding blocks EB and feature decoding blocks DB, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module.
3. The enhanced deformable convolution and tag correlation based forest fire identification method according to claim 1, wherein: in the model training of step 4, the total loss function L_total is expressed as:

L_total = λ_1 · L_pce + λ_2 · L_adv + λ_3 · L_cls

wherein L_pce, L_adv and L_cls respectively represent the pixel cross entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 represent the weight coefficients of the corresponding losses;
the pixel cross entropy loss L_pce is computed over every pixel of the predicted sequences,
wherein X_k(i, j) represents the pixel value at position (i, j) of the k-th frame of the input image sequence X; L, W and H represent the number of frames, the frame width and the frame height, respectively; and B and D represent the ground-truth background image sequence and smoke density image sequence, respectively;
the adversarial loss L_adv aims to compensate for the correlation and distribution information of the pixels in the label images, which prior methods ignore: it increases the correlation information among pixels, captures the distribution of the whole image, and penalizes mismatches in the label statistics;
wherein E refers to an empirical estimate of the probability expectation, x_1 and x_2 respectively represent the true smoke density image and true background image input to models D_1 and D_2, P_data represents the probability distribution obeyed by the data, DDis and BDis respectively represent the discriminator models in the generative adversarial training, denoted D_1 and D_2, and log(x) represents the logarithmic function;
the classification loss L_cls is computed from the true and predicted categories,
wherein c and ĉ represent the true and predicted categories of the image sequence, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111320633.7A CN114399723B (en) | 2021-11-09 | 2021-11-09 | Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114399723A CN114399723A (en) | 2022-04-26 |
CN114399723B true CN114399723B (en) | 2024-03-05 |