CN114399723B - Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation - Google Patents
Info
- Publication number
- CN114399723B (application number CN202111320633.7A)
- Authority
- CN
- China
- Prior art keywords
- feature
- level
- representing
- low
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4007—Interpolation-based scaling, e.g. bilinear interpolation
Abstract
The invention relates to a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which comprises the following steps: (1) inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different levels; (2) inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features; (3) inputting the non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features; (4) inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying them to obtain the forest smoke and fire recognition result. The invention further reduces the missed-detection rate and false-alarm rate of forest smoke and fire recognition.
Description
Technical Field
The invention belongs to the technical field of pattern recognition, deep learning and image video processing, and relates to a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation.
Background
Forest fires can have devastating consequences for ecosystems and human life. The most important precursor signal of a forest fire is smoke, so vision-based smoke and fire recognition methods are critical for detecting fires early and reducing fire risk. However, achieving low missed-detection and false-alarm rates for smoke in complex real scenes remains very challenging, owing to the wide variety of smoke appearances and the continual interference of smoke-like objects.
To date, researchers have proposed many vision-based smoke recognition/detection methods and achieved great success in improving recognition accuracy. These methods can be broadly divided into two categories: traditional methods and deep learning methods. Traditional methods generally extract handcrafted features and classify them to obtain a recognition result. Handcrafted features include color, gradient, shape, texture, motion and frequency-domain features, mathematical models, sparse-representation-based features, and the like. Traditional methods cannot efficiently extract semantic information, because handcrafted features are complex functions of local geometry, structure and context. In addition, smoke exhibits large intra-class variation in color, shape, texture, etc. It is therefore difficult to reach low missed-detection and false-alarm rates for smoke by designing a single discriminative handcrafted smoke feature.
Deep learning methods typically first extract deep features using variants of convolutional neural networks (CNNs) and then classify the extracted features to obtain a recognition result. Deep features can describe the semantic characteristics of smoke and thus tend to outperform handcrafted features. Existing deep models can be further divided into category-level and pixel-level supervised models according to the supervision information they use. Category-level supervised deep learning models typically use binary smoke/no-smoke labels and bounding boxes of smoke regions as training-sample labels to guide model training, and most existing smoke recognition methods adopt this kind of supervision. However, simple category supervision has only local semantic scope, i.e. it relates only to the small area where the smoke target is located. In addition, smoke is typically translucent, so the extracted features contain background information or are affected by imaging conditions such as lighting, weather and viewing angle, which interferes with recognition accuracy.
To address the above issues, some researchers have explored pixel-level supervised deep learning models, providing pixel-level labels to guide model training, e.g. treating smoke recognition as a smoke segmentation task by assigning smoke/no-smoke binary labels to every pixel of a sample image. However, due to the diversity of smoke and the constant interference of smoke-like objects in real complex scenes, pixel-level supervised smoke recognition methods still suffer from a large number of missed detections and false alarms.
Existing models have not yet solved the core technical and engineering problems of forest smoke and fire recognition. After thoroughly analysing existing smoke recognition models and patents, we consider that they still suffer from the following three problems. 1) Most forest pan-tilt cameras monitor a wide area with a range of 3-8 km, which causes large variations in the scale, viewpoint and distortion of smoke in the surveillance video. Existing smoke recognition methods typically apply data augmentation to the training samples or use different receptive fields to obtain multi-scale features. However, data augmentation may not generalize to new tasks with unknown geometric transformations, and the benefit of multiple receptive fields may be limited because deep networks built from CNN modules have a fixed geometric structure. 2) Related studies have shown that convolutional layers at different depths attend to different levels of information: activations from shallow layers focus on low-level texture details (e.g. colors, lines, edges and shapes), while activations from deep layers focus on high-level semantic content (e.g. objects and concepts). Because features at different levels are complementary, exploiting inter-layer feature fusion and interaction to improve recognition capability is of great significance. Existing smoke recognition methods typically adopt a unidirectional, low-to-high feature transmission strategy; such a strategy, however, may cause the low-level features that characterise local details to be completely suppressed by high-response high-level features during feature interaction. 3) Existing pixel-level supervised smoke recognition models typically use image-form labels; image segmentation, for example, is essentially a pixel-level classification problem. Existing methods typically train with pixel-level losses (e.g. cross-entropy loss, mean-squared-error loss), in which the label of each pixel is predicted independently. These losses ignore the correlation of pixels within the label image, which may lose correlation information between pixels.
To solve these problems, this patent provides a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which designs an enhanced deformable convolution to realise bidirectional feature interaction and exploits label correlation information to achieve lower missed-detection and false-alarm rates.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the prior art, the invention provides a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which can further reduce the missed-detection rate and false-alarm rate of smoke recognition.
Technical proposal
A forest smoke and fire recognition method based on enhanced deformable convolution and label correlation is characterized by comprising the following steps:
step 1: inputting the forest monitoring video image into a feature extraction module FEM to obtain non-interactive features of different layers:
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein EB_i (i = 1, 2, 3) denotes the feature encoding block with index i, DB_j (j = 4, 5, 6) denotes the feature decoding block with index j, F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a shared convolutional neural network with output feature dimension 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of a plurality of feature encoding blocks and feature decoding blocks, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module;
the feature extraction module FEM comprises a plurality of feature encoding blocks EB and a plurality of feature decoding blocks DB which are connected in series;
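The index flow of the feature extraction module above can be illustrated with a minimal sketch. This is not the claimed network: the encoding blocks EB_i are stood in for by 2×2 average pooling, the decoding blocks DB_j by nearest-neighbour upsampling, and the CNN by simple flattening, purely to show how F_1, …, F_6 and V_2 are produced.

```python
import numpy as np

def encode_block(f):
    # Hypothetical stand-in for EB_i: 2x2 average pooling halves the spatial size.
    h, w = f.shape[0] // 2, f.shape[1] // 2
    return f[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3))

def decode_block(f):
    # Hypothetical stand-in for DB_j: nearest-neighbour 2x upsampling.
    return f.repeat(2, axis=0).repeat(2, axis=1)

def feature_extraction(f0):
    feats, f = [], f0
    for _ in range(3):              # F_i = EB_i(F_{i-1}), i = 1, 2, 3
        f = encode_block(f)
        feats.append(f)
    v2 = feats[-1].reshape(-1)      # V_2 = CNN(F_3): flattening as a stand-in
    for _ in range(3):              # F_j = DB_j(F_{j-1}), j = 4, 5, 6
        f = decode_block(f)
        feats.append(f)
    return feats, v2                # feats = [F_1, ..., F_6]
```

With a 32×32 input, the sketch yields F_3 of size 4×4 and restores F_6 to 32×32, mirroring the encoder-decoder shape of the FEM.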
step 2: inputting the non-interactive feature maps F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the low-level feature interaction module to obtain low-level interactive features:
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
wherein LIE_i (i = 1, 2) denotes the low-level interactive encoding block with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoding block with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoding or decoding block with index i, and L_6 and V_1 represent the low-level interactive features learned from the low-level feature interaction module;
the low-level feature interaction module LFDM comprises a plurality of low-level interactive encoding blocks LIE and a plurality of low-level interactive decoding blocks LID connected in series; a channel enhancement design using a three-dimensional convolutional neural network CNN is added after the input layer of each LIE and LID, with the convolution kernel size designed as 1 × 1 × 3, and an enhanced deformable convolution EDC is arranged in the last step of the low-level feature interaction module;
each low-level interactive coding block LIE is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
each low-level interactive decoding block LID is connected with the output of the encoding block EB of the feature extraction module FEM in the step 1;
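The deep-to-shallow index flow of step 2 can be sketched as follows. The `fuse` argument is a hypothetical stand-in for Downconv2d, LIE_i and LID_j alike, and the CNN producing V_1 is replaced by the identity; only the indices L_1, …, L_6 and which F_i each block consumes are taken from the equations above.

```python
def low_level_interaction(F, fuse):
    """Index flow of the low-level feature interaction module (deep to shallow).

    F    : list [F_1, ..., F_6] of non-interactive features (any fuse-able type)
    fuse : hypothetical two-input stand-in for Downconv2d / LIE_i / LID_j
    """
    L = [fuse(F[5], F[5])]             # L_1 = Downconv2d(F_6), stand-in
    for i in (1, 2):                   # L_{i+1} = LIE_i(L_i, F_{6-i})
        L.append(fuse(L[-1], F[5 - i]))
    v1 = L[2]                          # V_1 = CNN(L_3); identity stand-in
    for j in (3, 4, 5):                # L_{j+1} = LID_j(L_j, F_{7-j})
        L.append(fuse(L[-1], F[6 - j]))
    return L, v1                       # L = [L_1, ..., L_6]
```

Running it with scalar "features" and averaging as `fuse` shows the deep feature F_6 entering first and progressively mixing with shallower F_i.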
step 3: inputting the non-interactive features F_i (i = 1, 2, …, 6) of different layers obtained in step 1 into the high-level feature interaction module to obtain high-level interactive features:
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
wherein HIE_i (i = 1, 2) denotes the high-level interactive encoding block with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoding block with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoding or decoding block with index i, and H_6 and V_3 represent the high-level interactive features learned from the high-level feature interaction module;
the high-level feature interaction module HFDM comprises a plurality of high-level interaction coding blocks HIE and a plurality of high-level interaction decoding blocks HID which are connected in series, an enhanced deformable convolution EDC is arranged in the last step of the high-level feature interaction module, and each high-level interaction coding block HIE is connected with the output of the coding block EB of the feature extraction module FEM in the step 1;
each high-level interactive decoding block HID is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
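The shallow-to-deep index flow of step 3 mirrors the low-level module with reversed traversal order. As before, `fuse` is a hypothetical stand-in for Conv2d, HIE_i and HID_j, and the CNN producing V_3 is replaced by the identity; only the index pattern comes from the equations above.

```python
def high_level_interaction(F, fuse):
    """Index flow of the high-level feature interaction module (shallow to deep).

    F    : list [F_1, ..., F_6] of non-interactive features
    fuse : hypothetical two-input stand-in for Conv2d / HIE_i / HID_j
    """
    H = [fuse(F[0], F[0])]             # H_1 = Conv2d(F_1), stand-in
    for i in (1, 2):                   # H_{i+1} = HIE_i(H_i, F_{i+1})
        H.append(fuse(H[-1], F[i]))
    v3 = H[2]                          # V_3 = CNN(H_3); identity stand-in
    for j in (3, 4, 5):                # H_{j+1} = HID_j(H_j, F_{j+1})
        H.append(fuse(H[-1], F[j]))
    return H, v3                       # H = [H_1, ..., H_6]
```

With scalar features and averaging as `fuse`, the shallow feature F_1 enters first and is gradually accumulated with deeper F_i, the opposite traversal to the low-level module.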
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result;
(41) using the extracted features (L_6, F_6, H_6) to predict the background image sequence and the smoke density image sequence corresponding to the input image sequence,
wherein B̂ and D̂ respectively denote the predicted background image sequence and the predicted smoke density image sequence, Conv(·) denotes a convolution operation, and ReLU(·) denotes the ReLU activation function;
(42) inputting the predicted sequences B̂ and D̂ into the C3D model to extract the spatio-temporal feature vector V_4,
wherein FC(·) denotes a fully connected layer and C3D(·) denotes the C3D model;
(43) since the low-level features may introduce noise and the high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified,
wherein concat denotes the concatenation operation on feature maps and ĉ denotes the prediction result of the smoke classification.
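The fusion-and-classification step can be sketched as concatenation followed by a fully connected layer and softmax. The weight matrix `W` and bias `b` are illustrative placeholders for the learned FC layer, not the patented parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(v1, v2, v3, v4, W, b):
    # hat_c = softmax(FC(concat(V_1, V_2, V_3, V_4))):
    # concat is the series operation on the feature vectors,
    # W @ v + b is a stand-in for the fully connected layer.
    v = np.concatenate([v1, v2, v3, v4])
    return softmax(W @ v + b)
```

With zero weights the two-class output is uniform, which is a convenient sanity check that the concatenated dimension matches the FC layer.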
The step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: the calculation method of the enhanced deformable convolution EDC is as follows: for the output feature map y at position p_0,
y(p_0) = Σ_{p_n ∈ R} [ w_r(p_n) · x(p_0 + p_n) + w_d(p_n) · x(p_0 + p_n + Δp_n) ]
wherein x denotes the input feature map, p_n enumerates the positions in the sampling grid R, w_r and w_d denote the weight parameters to be learned in the standard convolution path and the deformable convolution path respectively, and Δp_n denotes the offset, which is typically fractional;
bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), i.e.
x(p) = Σ_q G(q, p) · x(q)
wherein q enumerates the integer spatial positions of the feature map x, p is the fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
step 1.2: denote by EB_i (i = 1, 2, 3) the feature encoding block with index i and by DB_j (j = 4, 5, 6) the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract the non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a shared convolutional neural network with output feature dimension 8192, V_2 denotes the feature vector output by the feature extraction module, the feature extraction module consists of a plurality of feature encoding blocks EB and feature decoding blocks DB, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module.
In the model training of step 4, the total loss function L_total is expressed as the weighted sum
L_total = λ_1 · L_pce + λ_2 · L_adv + λ_3 · L_cls
wherein L_pce, L_adv and L_cls respectively denote the pixel cross-entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 respectively denote the weight coefficients of the corresponding losses;
the pixel cross-entropy loss L_pce is computed per pixel over the predicted and true image sequences, wherein X_k(i, j) denotes the pixel value at position (i, j) of the k-th frame of the input image sequence X, L, W and H respectively denote the number of frames, the frame width and the frame height, and B and D respectively denote the true background image sequence and the true smoke density image sequence;
the adversarial loss L_adv aims to recover the pixel correlation and distribution information of the label images that existing methods ignore: it adds correlation information among pixels, captures the distribution of the whole image, and penalizes mismatches in the label statistics; wherein E denotes an empirical estimate of the probability expectation, x_1 and x_2 respectively denote the true smoke density image and the true background image input to the models D_1 and D_2, P_data denotes the probability distribution the data obeys, DDis and BDis denote the discriminator models of the generative adversarial framework, denoted D_1 and D_2 respectively, and log(·) denotes the logarithm function;
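For one of the two discriminators (DDis on smoke density images or BDis on background images), the standard generative-adversarial objective can be sketched as below. This is the generic GAN discriminator objective, not the patent's exact formula, which is not reproduced on this page.

```python
import numpy as np

def discriminator_objective(d_real, d_fake, eps=1e-7):
    # E[log D(x)] + E[log(1 - D(x_hat))] for one discriminator:
    # the discriminator maximises this value, while the generator
    # (the prediction network) minimises the second term.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

At the equilibrium point where the discriminator outputs 0.5 everywhere, the objective equals -2·log 2, the classical GAN saddle-point value.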
the classification loss L_cls is computed from c and ĉ, which respectively denote the true and predicted categories of the image sequence.
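The three loss terms and their weighted combination can be sketched as follows. The per-pixel binary cross-entropy and the classification cross-entropy are standard formulations assumed here, and the λ values in the example are placeholders, not the patent's tuned coefficients.

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-7):
    # Per-pixel binary cross entropy, averaged over all L*W*H pixel positions.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def classification_loss(probs, true_class, eps=1e-7):
    # Cross entropy between the true category c and the predicted distribution.
    return float(-np.log(max(probs[true_class], eps)))

def total_loss(l_pce, l_adv, l_cls, lambdas=(1.0, 0.1, 1.0)):
    # L_total = lambda_1 * L_pce + lambda_2 * L_adv + lambda_3 * L_cls
    l1, l2, l3 = lambdas
    return l1 * l_pce + l2 * l_adv + l3 * l_cls
```

The weighted-sum form lets the adversarial term be down-weighted relative to the dense pixel supervision, a common practice when mixing GAN and reconstruction losses.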
Advantageous effects
The invention provides a forest smoke and fire recognition method based on enhanced deformable convolution and label correlation, which comprises the following steps: (1) inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different levels; (2) inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features; (3) inputting the non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features; (4) inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying them to obtain the forest smoke and fire recognition result. The invention further reduces the missed-detection rate and false-alarm rate of forest smoke and fire recognition.
The beneficial effects of the invention are as follows: (1) to make the feature representation robust to scale, viewpoint, deformation, etc., an enhanced deformable convolution module is devised, which achieves feature complementarity by breaking the fixed geometric structure of the convolution module and considering both the most representative features and locally weak features; (2) to prevent the low-level features that describe local details from being completely suppressed by high-response high-level features during feature fusion and interaction, a multi-directional feature interaction module is devised to acquire complementary interaction information from features of different levels; (3) to exploit the correlation and distribution information of pixels in the label image, a discriminative loss term based on generative adversarial learning is devised to measure the distribution similarity between the network-predicted image and the real image and to eliminate distribution inconsistency. By combining these three advantages, the invention further improves the detection rate and reduces the false-alarm rate.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is an example of smoke density images predicted by the present invention.
Fig. 3 shows the overall network architecture of the proposed model, comprising four parts in total, a Feature Extraction Module (FEM), a low-level feature interaction module (LFDM), a high-level feature interaction module (HFDM) and a Feature Fusion Module (FFM). The feature extraction module is obtained by connecting a plurality of feature Encoding Blocks (EB) and a plurality of feature Decoding Blocks (DB) in series. The low-level feature interaction module includes a plurality of low-level interaction encoding blocks (LIE) and low-level interaction decoding blocks (LID). The high-level feature interaction module includes a plurality of high-level interaction encoding blocks (HIEs) and high-level interaction decoding blocks (HIDs). The feature fusion module comprises multiple levels of feature fusion.
Fig. 4 shows a specific network structure of the feature encoding block EB (fig. 4 a) and the feature decoding block DB (fig. 4 b).
Fig. 5 shows a specific structural design of the low-level interactive coding block LIE and the low-level interactive decoding block LID. Specifically, a channel enhancement design using three-dimensional convolution is added after the input layer, the convolution kernel size is designed to be 1×1×3, and an Enhanced Deformable Convolution (EDC) is also added in the last step of the low-level feature interaction decoding module.
Fig. 6 shows a specific network structure that enhances the deformable convolution EDC.
Fig. 7 shows a specific network structure of a high-level interactive encoding block HIE (fig. 7 a) and a high-level interactive decoding block HID (fig. 7 b), to which an Enhanced Deformable Convolution (EDC) is further added in the last step of the high-level feature interactive decoding module.
Detailed Description
The invention will now be further described with reference to the examples and figures:
the invention provides a forest smoke and fire identification method based on enhanced deformable convolution and label correlation, which is shown in a flow chart in fig. 1 and specifically comprises the following steps:
step 1: inputting the forest monitoring video into a feature extraction module to obtain non-interactive features of different layers;
step 2: inputting the non-interactive features of different levels into a low-level feature interaction module to obtain low-level interactive features;
step 3: inputting non-interactive features of different levels into a high-level feature interaction module to obtain high-level interactive features;
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result.
The step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: a parallel path is added to take internal information into account, realising the enhanced deformable convolution; this prevents locally weak smoke features from being ignored or suppressed and thereby extracts complementary smoke features. For the output feature map y at position p_0,
y(p_0) = Σ_{p_n ∈ R} [ w_r(p_n) · x(p_0 + p_n) + w_d(p_n) · x(p_0 + p_n + Δp_n) ]
wherein x denotes the input feature map and p_n enumerates the positions in the sampling grid R; taking a 3 × 3 local region as an example, R is defined as {(-1, -1), (-1, 0), …, (1, 1)} for a 3 × 3 convolution kernel; w_r and w_d denote the weight parameters to be learned in the standard convolution and the deformable convolution respectively, and Δp_n denotes the offset, which is typically fractional; bilinear interpolation is used to compute x(p_0 + p_n + Δp_n), i.e.
x(p) = Σ_q G(q, p) · x(q)
wherein q enumerates the integer spatial positions of the feature map x, p is the fractional position, and G(·, ·) denotes the bilinear interpolation kernel;
step 1.2: in the feature encoding block, average pooling and max pooling are performed in parallel to extract complementary information, and a three-dimensional convolution is added before the max pooling operation to encode spatio-temporal information; enhanced deformable convolutions are placed at the beginning and the end respectively to extract features that are discriminative with respect to the scale, viewpoint, deformation, etc. of smoke; in the feature decoding block, the extracted features of different levels are aggregated before the upsampling operation, and the enhanced deformable convolution is used to adapt to various geometric transformations and promote the extraction of the essential features of smoke;
step 1.3: denote by EB_i (i = 1, 2, 3) the feature encoding block with index i and by DB_j (j = 4, 5, 6) the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video comprising L frames; the image sequence is transformed into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract the non-interactive features of different levels, namely
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein F_i (i = 1, 2, …, 6) denotes the output feature map of the feature encoding or decoding block with index i, i.e. the non-interactive features of different layers, CNN denotes a shared convolutional neural network with output feature dimension 8192, V_2 denotes the feature vector output by the feature extraction module, which consists of a plurality of feature encoding and decoding blocks, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module.
The step 2 of inputting the non-interactive features of different levels into the low-level feature interaction module to obtain the low-level interactive features comprises the following steps:
step 2.1: the low-level feature interaction module consists of a plurality of low-level interactive encoders and low-level interactive decoders; features interact from the deep layers to the shallow layers of the network, so that deep high-response features are gradually transferred downwards and the local details (texture and spatial information) are highlighted together with the high-level semantic information;
step 2.2: the non-interactive features F_i (i = 1, 2, …, 6) of different layers are input into the low-level feature interaction module, i.e.
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{7-j}), j = 3, 4, 5
wherein LIE_i (i = 1, 2) denotes the low-level interactive encoder with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoder with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoder or decoder with index i, and L_6 and V_1 represent the low-level interactive features learned from the low-level feature interaction module.
The step 3 of inputting the non-interactive features of different levels into the high-level feature interaction module to obtain the high-level interactive features comprises the following steps:
step 3.1: the high-level feature interaction module consists of a plurality of high-level interactive encoders and high-level interactive decoders; the multi-layer feature fusion interaction proceeds from shallow to deep, gradually accumulating shallow low-response features so that they are not completely suppressed by deep high-response features, thereby effectively preserving the low-level information of local details; a channel enhancement design and the enhanced deformable convolution are adopted, and several skip connections are added to the aggregation to accelerate model convergence;
step 3.2: the non-interactive features F_i (i = 1, 2, …, 6) of different layers are input into the high-level feature interaction module, i.e.
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
wherein HIE_i (i = 1, 2) denotes the high-level interactive encoder with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoder with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoder or decoder with index i, and H_6 and V_3 represent the high-level interactive features learned from the high-level feature interaction module.
The step 4 of inputting the non-interactive features, the low-level interactive features and the high-level interactive features into the feature fusion module and classifying to obtain the forest smoke and fire recognition result comprises the following steps:
step 4.1: using the extracted features (L 6 ,F 6 ,H 6 ) Predicting a sequence of background images and a sequence of smoke density images corresponding to a sequence of input images, i.e
Wherein,and->Respectively representing a predicted background image sequence and a smoke density image sequence, conv (x) representing a convolution operation, and ReLU (x) representing a ReLU function;
step 4.2: will beInput into C3D model to extract space-time characteristic vector V 4 I.e.
Wherein FC (x) represents a fully connected layer and C3D (x) represents a C3D model;
step 4.3: since the bottom layer features may bring noise, the upper layer features may suppress the bottom layer beneficial information, then feature vectors (V 1 ,V 2 ,V 3 ,V 4 ) Sorting, i.e.
Wherein concat represents the series operation of the feature map,a prediction result representing the smoke classification;
step 4.4: in model training, the total loss function L_total can be expressed as

L_total = λ_1 · L_pce + λ_2 · L_adv + λ_3 · L_cls

wherein L_pce, L_adv and L_cls respectively represent the pixel cross entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 represent the weight coefficients of the corresponding losses;
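The weighted combination in step 4.4 is straightforward; the sketch below uses illustrative placeholder lambda values, since the patent does not state its tuned coefficients here:

```python
def total_loss(l_pixel, l_adv, l_cls, lambdas=(1.0, 0.1, 1.0)):
    """Weighted sum of the three training losses.

    lambdas are hypothetical weight coefficients (lambda_1, lambda_2,
    lambda_3), not the patent's tuned values.
    """
    l1, l2, l3 = lambdas
    return l1 * l_pixel + l2 * l_adv + l3 * l_cls
```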
the pixel cross entropy loss L_pce is computed over every pixel of the predicted sequences,
wherein X_k(i, j) represents the pixel value at position (i, j) of the k-th frame of the input image sequence X; L, W and H represent the number of frames, the frame width and the frame height, respectively; and B and D represent the ground-truth background image sequence and smoke density image sequence, respectively;
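The patent's exact per-pixel formula appears only as an image in the source and is not reproduced above. A common realization of a pixel cross entropy over an image sequence, shown purely as an illustration, averages a binary cross entropy over all L × W × H pixels:

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-7):
    # pred, target: arrays of shape (L, W, H) with values in [0, 1]
    # (L frames, W x H pixels per frame, matching the text's definitions);
    # eps-clipping avoids log(0)
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))
```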
the adversarial loss L_adv aims to compensate for the correlation and distribution information of the pixels in the label images, which prior methods ignore: it increases the correlation information among pixels, captures the distribution of the whole image, and penalizes mismatches in the label statistics;
wherein E refers to an empirical estimate of the probability expectation, x_1 and x_2 respectively represent the true smoke density image and true background image input to models D_1 and D_2, P_data represents the probability distribution obeyed by the data, DDis and BDis respectively represent the discriminator models in the generative adversarial training, denoted D_1 and D_2, and log(x) represents the logarithmic function;
the classification loss L_cls is computed from the true and predicted categories,
wherein c and ĉ represent the true and predicted categories of the image sequence, respectively.
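The classification loss formula is also lost to an image in the source; a standard softmax cross entropy between class scores and the true category c, shown as an illustrative stand-in, looks like this:

```python
import numpy as np

def classification_loss(logits, true_class):
    # softmax cross entropy for one image sequence:
    # -log p(true_class), with the usual max-shift for numerical stability
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[true_class])
```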
Fig. 2 shows an example of smoke density image prediction using the present invention: the first row shows a sequence of four smoke images and the second row shows the corresponding predicted smoke density images, in which brighter (whiter) pixels correspond to higher smoke density in the smoke images.
Fig. 3 shows the overall network architecture of the proposed model, which comprises four parts: a Feature Extraction Module (FEM), a Low-level Feature Interaction Module (LFDM), a High-level Feature Interaction Module (HFDM) and a Feature Fusion Module (FFM). The feature extraction module comprises a plurality of Encoding Blocks (EB) and Decoding Blocks (DB). The low-level feature interaction module comprises a plurality of Low-level Interaction Encoders (LIE) and Low-level Interaction Decoders (LID). The high-level feature interaction module comprises a plurality of High-level Interaction Encoders (HIE) and High-level Interaction Decoders (HID). The feature fusion module performs multi-stage feature fusion.
Fig. 4 shows a specific network structure of EB and DB.
Fig. 5 shows a specific structural design of LIE and LID. Specifically, a channel enhancement design using three-dimensional convolution is added after the input layer, the convolution kernel size is designed to be 1×1×3, and an Enhanced Deformable Convolution (EDC) is also added in the last step of the low-level feature interaction decoding module.
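One plausible reading of the 1 × 1 × 3 channel enhancement is a three-dimensional convolution that mixes each spatial position with its two temporal neighbours. The NumPy sketch below illustrates this under the assumption of zero padding (which the patent does not specify); the kernel weights are placeholders for the learned parameters:

```python
import numpy as np

def channel_enhance_1x1x3(x, w):
    # x: (T, H, W) stack of feature maps; w: length-3 kernel applied along
    # the first (temporal/channel) axis only -- i.e. a 1x1x3 convolution.
    # Zero padding keeps the output length equal to T.
    xp = np.pad(x, ((1, 1), (0, 0), (0, 0)))
    return w[0] * xp[:-2] + w[1] * xp[1:-1] + w[2] * xp[2:]
```

With the identity kernel [0, 1, 0] the operation leaves the input unchanged, which is a convenient sanity check.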
Fig. 6 shows a specific network structure of EDC.
Fig. 7 shows a specific network structure of the HIE and HID.
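The bilinear sampling step used inside the EDC of Fig. 6 (and detailed in claim 2, step 1.1) evaluates x(p) = Σ_q G(p, q)·x(q) at a fractional position p. A minimal NumPy illustration, not the patent's implementation, exploiting that G factorises into two one-dimensional triangular kernels so only the (up to) four integer neighbours of p contribute:

```python
import numpy as np

def bilinear_sample(x, p):
    # x: 2-D feature map; p = (py, px): fractional sampling position.
    # Out-of-bounds neighbours contribute zero, as in the usual
    # deformable-convolution formulation.
    h, w = x.shape
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1.0 - abs(py - qy)) * max(0.0, 1.0 - abs(px - qx))
                val += g * x[qy, qx]
    return val
```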
Claims (3)
1. A forest smoke and fire recognition method based on enhanced deformable convolution and label correlation is characterized by comprising the following steps:
step 1: inputting the forest monitoring video image into a feature extraction module FEM to obtain non-interactive features of different layers:
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein DB_j (j = 4, 5, 6) represents the feature decoding block with index j, EB_i (i = 1, 2, 3) represents the feature encoding block with index i, F_i (i = 1, 2, …, 6) represents the output feature map of the feature encoding block or feature decoding block with index i, i.e. the non-interactive features of different layers, CNN represents a convolutional neural network with a shared output feature dimension of 8192, V_2 represents the feature vector output by the feature extraction module, which consists of a plurality of feature encoding blocks and feature decoding blocks, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module;
the feature extraction module FEM comprises a plurality of feature encoding blocks EB and a plurality of feature decoding blocks DB which are connected in series;
step 2: and (2) carrying out the step (1) on the non-interactive feature graphs F of different layers i (i=1, 2, …, 6) input to the low-level feature interaction module to obtain low-level interaction features:
L_1 = Downconv2d(F_6)
L_{i+1} = LIE_i(L_i, F_{6-i}), i = 1, 2
V_1 = CNN(L_3)
L_{j+1} = LID_j(L_j, F_{6-j}), j = 3, 4, 5
wherein LIE_i (i = 1, 2) denotes the low-level interactive encoding block with index i, LID_j (j = 3, 4, 5) denotes the low-level interactive decoding block with index j, L_i (i = 1, 2, …, 6) denotes the output feature map of the low-level interactive encoding block or low-level interactive decoding block with index i, and L_6 and V_1 represent the low-level interaction features learned from the low-level feature interaction module;
the low-level feature interaction module LFDM comprises a plurality of low-level interactive encoding blocks LIE and a plurality of low-level interactive decoding blocks LID connected in series; a channel enhancement design using a three-dimensional convolutional neural network CNN is added after the input layer of each LIE and LID, with the convolution kernel size designed as 1 × 1 × 3, and an enhanced deformable convolution EDC is arranged in the last step of the low-level feature interaction module;
each low-level interactive coding block LIE is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
each low-level interactive decoding block LID is connected with the output of the encoding block EB of the feature extraction module FEM in the step 1;
step 3: and (3) carrying out the step (1) on the non-interactive features F with different layers i (i=1, 2, …, 6) input high-level feature interaction module, resulting in high-level interaction features:
H_1 = Conv2d(F_1)
H_{i+1} = HIE_i(H_i, F_{i+1}), i = 1, 2
V_3 = CNN(H_3)
H_{j+1} = HID_j(H_j, F_{j+1}), j = 3, 4, 5
wherein HIE_i (i = 1, 2) denotes the high-level interactive encoding block with index i, HID_j (j = 3, 4, 5) denotes the high-level interactive decoding block with index j, H_i (i = 1, 2, …, 6) denotes the output feature map of the high-level interactive encoding block or high-level interactive decoding block with index i, and H_6 and V_3 represent the high-level interaction features learned from the high-level feature interaction module;
the high-level feature interaction module HFDM comprises a plurality of high-level interaction coding blocks HIE and a plurality of high-level interaction decoding blocks HID which are connected in series, an enhanced deformable convolution EDC is arranged in the last step of the high-level feature interaction module, and each high-level interaction coding block HIE is connected with the output of the coding block EB of the feature extraction module FEM in the step 1;
each high-level interactive decoding block HID is connected with the output of the decoding block DB of the feature extraction module FEM in the step 1;
step 4: inputting the non-interactive features, the low-level interactive features and the high-level interactive features into a feature fusion module and classifying to obtain a forest smoke and fire recognition result;
(41) using the extracted features (L_6, F_6, H_6), predict the background image sequence and the smoke density image sequence corresponding to the input image sequence,
wherein B̂ and D̂ respectively represent the predicted background image sequence and smoke density image sequence, Conv(x) represents a convolution operation, and ReLU(x) represents the ReLU activation function;
(42) the predicted B̂ and D̂ are input into the C3D model to extract the spatio-temporal feature vector V_4,
wherein FC(x) represents a fully connected layer and C3D(x) represents the C3D model;
(43) since the low-level features may introduce noise and the high-level features may suppress beneficial low-level information, the feature vectors (V_1, V_2, V_3, V_4) are concatenated and classified,
wherein concat represents the concatenation operation on feature maps and ĉ represents the prediction result of the smoke classification.
2. The enhanced deformable convolution and tag correlation based forest fire identification method according to claim 1, wherein: the step 1 of inputting the forest monitoring video into the feature extraction module to obtain the non-interactive features with different levels comprises the following steps:
step 1.1: the enhanced deformable convolution EDC is calculated as follows: for the output feature map y, at each position p_0:

y(p_0) = Σ_{p_n ∈ R} [ w_r(p_n) · x(p_0 + p_n) + w_d(p_n) · x(p_0 + p_n + Δp_n) ]

wherein x represents the input feature map, p_n enumerates the positions in the regular sampling grid R, w_r and w_d represent the weight parameters to be learned in the standard convolution and the deformable convolution, respectively, and Δp_n represents the offset, which is typically fractional;
bilinear interpolation is used to calculate x(p_0 + p_n + Δp_n), i.e.:

x(p) = Σ_q G(p, q) · x(q)

wherein q enumerates all integral spatial positions of the feature map x, p is a fractional position, and G(·, ·) represents the bilinear interpolation kernel;
step 1.2: let EB_i (i = 1, 2, 3) denote the feature encoding block with index i and DB_j (j = 4, 5, 6) denote the feature decoding block with index j, and let I = {I_1, I_2, …, I_L} denote a forest monitoring video containing L frames of images; the image sequence is converted into a three-dimensional tensor F_0 by channel stacking and input into the feature extraction module to extract the non-interactive features of different levels, i.e.
F_i = EB_i(F_{i-1}), i = 1, 2, 3
V_2 = CNN(F_3)
F_j = DB_j(F_{j-1}), j = 4, 5, 6
wherein F_i (i = 1, 2, …, 6) represents the output feature map of the feature encoding block or feature decoding block with index i, i.e. the non-interactive features of different layers, CNN represents a convolutional neural network with a shared output feature dimension of 8192, V_2 represents the feature vector output by the feature extraction module, which consists of a plurality of feature encoding blocks EB and feature decoding blocks DB, and F_6 and V_2 represent the non-interactive features learned from the feature extraction module.
3. The enhanced deformable convolution and tag correlation based forest fire identification method according to claim 1, wherein: in the model training of step 4, the total loss function L_total is expressed as:

L_total = λ_1 · L_pce + λ_2 · L_adv + λ_3 · L_cls

wherein L_pce, L_adv and L_cls respectively represent the pixel cross entropy loss, the adversarial loss and the classification loss, and λ_1, λ_2 and λ_3 represent the weight coefficients of the corresponding losses;
the pixel cross entropy loss L_pce is computed over every pixel of the predicted sequences,
wherein X_k(i, j) represents the pixel value at position (i, j) of the k-th frame of the input image sequence X; L, W and H represent the number of frames, the frame width and the frame height, respectively; and B and D represent the ground-truth background image sequence and smoke density image sequence, respectively;
the adversarial loss L_adv aims to compensate for the correlation and distribution information of the pixels in the label images, which prior methods ignore: it increases the correlation information among pixels, captures the distribution of the whole image, and penalizes mismatches in the label statistics;
wherein E refers to an empirical estimate of the probability expectation, x_1 and x_2 respectively represent the true smoke density image and true background image input to models D_1 and D_2, P_data represents the probability distribution obeyed by the data, DDis and BDis respectively represent the discriminator models in the generative adversarial training, denoted D_1 and D_2, and log(x) represents the logarithmic function;
the classification loss L_cls is computed from the true and predicted categories,
wherein c and ĉ represent the true and predicted categories of the image sequence, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111320633.7A CN114399723B (en) | 2021-11-09 | 2021-11-09 | Forest smoke and fire recognition method based on enhanced deformable convolution and label correlation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114399723A CN114399723A (en) | 2022-04-26 |
CN114399723B true CN114399723B (en) | 2024-03-05 |