CN114998801A - Forest fire smoke video detection method based on contrastive self-supervised learning network

Forest fire smoke video detection method based on contrastive self-supervised learning network

Info

Publication number
CN114998801A
Authority
CN
China
Prior art keywords
video
forest fire
network
smoke
fire smoke
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210645586.1A
Other languages
Chinese (zh)
Inventor
Zhang Junguo (张军国)
Li Tingting (李婷婷)
Hu Chunhe (胡春鹤)
Tian Ye (田野)
Zhang Changchun (张长春)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Forestry University
Original Assignee
Beijing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Forestry University filed Critical Beijing Forestry University
Priority to CN202210645586.1A priority Critical patent/CN114998801A/en
Publication of CN114998801A publication Critical patent/CN114998801A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06T 7/215 Motion-based segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/254 Analysis of motion involving subtraction of images
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20224 Image subtraction
    • G06T 2207/30204 Marker
    • Y02A 40/28 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture specially adapted for farming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a forest fire smoke video detection method based on a contrastive self-supervised learning network, comprising the following steps: acquiring a forest fire smoke video, and establishing a smoke video dataset from the forest fire smoke video; performing feature learning on the smoke video dataset with a pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames; and detecting moving objects according to the attention feature maps of the continuous video frames, and acquiring a detection result. Based on contrastive self-supervised learning, the method builds an unsupervised self-distillation network with a cross dual-channel network as its backbone, extracts smoke video features against complex environmental backgrounds, learns semantic information, obtains attention feature maps of continuous video frames, and determines the moving targets within the attention regions, thereby achieving high-precision recognition and localization of forest fire smoke in video. The method improves smoke detection efficiency and performance and is suited to detecting small-target smoke videos of early-stage, long-range forest fires against complex backgrounds.

Description

Forest fire smoke video detection method based on contrastive self-supervised learning network
Technical Field
The invention relates to the technical field of forest fire smoke detection, and in particular to a forest fire smoke video detection method based on a contrastive self-supervised learning network.
Background
Forest fires are sudden, highly destructive natural disasters that are difficult to contain and respond to; timely detection of forest fire smoke plays an important role in rescue operations and in reducing the losses caused by fire.
In the prior art, convolutional neural networks are widely applied to forest fire smoke video detection tasks. Although a convolutional neural network can acquire discriminative local smoke feature representations, it cannot capture the global information of a smoke image. Thanks to its self-attention mechanism and multi-layer perceptron structure, the visual Transformer can represent global information such as complex spatial transformations and long-range feature dependencies; however, because of these same two structures, the visual Transformer usually ignores the local features of an image, so that the difference between image background and foreground is not obvious.
To address the limited data volume of small-target smoke in early forest fires, researchers have proposed few-shot learning methods, but many of these methods do not consider interference from clouds, fog, haze and other objects whose color, contour, texture and other characteristics resemble those of smoke, so their false alarm rate is slightly higher than that of supervised learning. In fact, the static characteristics of clouds, fog and haze are highly similar to those of forest fire smoke, and the human eye can hardly distinguish smoke, especially small-target smoke, directly and accurately from a small number of labeled images; from video data, however, the human eye can easily and accurately identify and locate small-target smoke, and thus distinguish smoke from similar objects such as clouds, fog and haze. Therefore, introducing the dynamic characteristics of smoke is of great significance for few-shot detection of early forest fire smoke.
Most existing forest fire smoke video detection methods rely on large amounts of labeled smoke data and are not suited to the limited data volume of small-target smoke in early forest fires. In addition, existing methods rarely consider the information redundancy among consecutive video frames when processing smoke video data, so their computational complexity is excessive, which affects the efficiency and performance of smoke detection.
Disclosure of Invention
To address the defects of the prior art, the invention provides a forest fire smoke video detection method based on a contrastive self-supervised learning network, comprising the following steps:
acquiring a forest fire smoke video, and establishing a smoke video dataset from the forest fire smoke video;
performing feature learning on the smoke video dataset with a pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames;
and detecting moving objects according to the attention feature maps of the continuous video frames, and acquiring a detection result.
Further, before the forest fire smoke video is obtained, the contrastive self-supervised learning network is constructed in advance; the contrastive self-supervised learning network comprises a data input module, an unsupervised self-distillation network, a feature extraction backbone network, a motion region extraction module and an output module;
the feature extraction backbone network is built into the unsupervised self-distillation network.
Further, the feature extraction backbone network is a cross dual-channel network based on a convolutional neural network and a visual Transformer.
Further, performing feature learning on the smoke video dataset with the pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames specifically includes:
inputting the smoke video dataset into the data input module for data enhancement to generate positive and negative video sample images, wherein the positive and negative video sample images comprise global views and local views of the positive and negative video samples;
extracting the features of the positive and negative video sample images with the feature extraction backbone network;
and performing student-network and teacher-network contrastive learning on the positive and negative video sample image features with the unsupervised self-distillation network, and generating semantic information and pseudo labels on the positive and negative video sample images to form the attention feature maps of continuous video frames.
Further, the method also includes: optimizing the parameters of the student network with stochastic gradient descent, and optimizing and correcting the parameters of the teacher network with a momentum centering operation.
Further, detecting moving objects according to the attention feature maps of the continuous video frames and acquiring a detection result specifically includes:
inputting the attention feature maps of the continuous video frames into the motion region extraction module;
acquiring three consecutive frames of video data from the attention feature maps of the continuous video frames with the motion region extraction module, and separately calculating the frame difference of each pair of consecutive frames;
performing an AND operation on the frame difference results of the two pairs of consecutive frames in the motion region extraction module, and labeling and segmenting the motion regions of the attention feature maps according to the AND result, to generate the forest fire smoke detection result and the forest fire smoke localization result of the forest fire smoke video;
and outputting the forest fire smoke detection result and the forest fire smoke localization result with the output module.
Further, after the detection result is obtained, the method includes a result evaluation step, specifically:
evaluating the forest fire smoke detection result of the forest fire smoke video with the average accuracy rate, the average true positive rate and the average true negative rate;
and evaluating the forest fire smoke localization result of a single-frame forest fire smoke video with the intersection over union, and the forest fire smoke localization result of continuous-frame forest fire smoke video with the mean intersection over union.
Further, the method includes training and optimizing the contrastive self-supervised learning network, specifically:
training the feature extraction backbone network on the large labeled ImageNet-1k dataset with an AdamW optimizer;
and training the unsupervised self-distillation network on the unlabeled ImageNet-1k dataset with an AdamW optimizer.
The invention has the following beneficial effects. Based on contrastive self-supervised learning, an unsupervised self-distillation network is built to detect forest fire smoke in video. With a cross dual-channel network as its backbone, the unsupervised self-distillation network extracts the local and global features of smoke video against complex environmental backgrounds, learns the semantic information and pseudo labels of continuous video frames, and obtains attention feature maps of the continuous video frames while determining the moving targets of the attention regions within them. This achieves high-precision recognition and localization of forest fire smoke in video, improves smoke detection efficiency and performance, and suits the method to detecting small-target smoke videos of early-stage, long-range forest fires against complex backgrounds.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
Fig. 1 is a flowchart of a forest fire smoke video detection method based on a contrast self-supervised learning network according to an embodiment of the present invention;
fig. 2 is a detection result of the forest fire smoke video detection method based on the contrast self-supervised learning network according to the embodiment of the present invention on a public forest fire smoke video data set;
fig. 3 is a detection result of the forest fire smoke video detection method based on the contrast self-supervised learning network according to the embodiment of the present invention on the established forest fire smoke video data set.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only used as examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the present invention belongs.
As shown in fig. 1, the forest fire smoke video detection method based on the contrastive self-supervised learning network includes the following steps:
s1: acquiring a forest fire smoke video, and establishing a smoke video data set according to the forest fire smoke video;
specifically, a forest fire smoke video is obtained, the forest fire smoke video is from a lookout tower visible light camera, the shooting distance is long, the shooting time is the initial stage of fire occurrence, and therefore the shot smoke target is small, and the forest fire smoke video has the advantages of being uneven in illumination, complex in background information, multiple in interferent and the like. In order to introduce the dynamic characteristics of the forest fire smoke video, a smoke video data set is constructed on the basis of a static forest fire smoke less sample image data set according to the forest fire smoke video, so that the performance of a smoke detection model is improved by adding time sequence information. Further, the smoke video data set comprises 87 visible light forest fire small target smoke videos and 53 non-smoke videos, and because the forest fire smoke video sequence contains context information, the smoke size in continuous frames can change continuously with time, so that the small target smoke defined in the embodiment is smoke with the minimum detectable size of 20 × 20 pixels in the video sequence.
S2: performing feature learning on the smoke video data set by adopting a pre-constructed contrast self-supervision learning network to obtain an attention feature map of continuous video frames;
specifically, before a forest fire smoke video is obtained, a contrast self-supervision learning network is constructed in advance, and a network model of the contrast self-supervision learning network comprises a data input module, an unsupervised self-distillation network, a feature extraction backbone network, a motion region extraction module and an output module. The data input module is used for inputting video data and performing data enhancement on the video data to obtain a positive and negative sample global view and a local view of a video; the feature extraction backbone network is built in the unsupervised self-distillation learning network and is used for extracting features of a positive and negative sample global view and a local view; the unsupervised self-distillation network is used for carrying out comparison learning on the characteristics of the positive and negative sample global view and the local view through a student network and a teacher network so as to obtain an attention characteristic diagram of continuous video frames; the motion region extraction module is used for detecting a moving object according to the attention feature map of the continuous video frames; an output module: and is used for outputting the detection result of the moving object.
Further, the feature extraction backbone network is a cross dual-channel network based on a convolutional neural network and a visual Transformer, specifically comprising a stem network, a convolutional neural network branch, a visual Transformer branch, feature coupling units and classifiers. The stem network comprises a 7 × 7 convolution with stride 2 followed by a 3 × 3 max pooling with stride 2, extracting initial local features such as contours and textures. The feature coupling unit fuses the intermediate features of the convolutional neural network and the visual Transformer; because the two branch networks share the same stem, the feature coupling unit is applied from the first stage of the cross dual-channel network, and after feature fusion across four stages, the features of the convolutional neural network and the visual Transformer are aggregated separately and fed to different classifiers.
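For illustration, the stem described above can be sketched in PyTorch as follows; the channel width and the use of batch normalization are assumptions for clarity, not details fixed by the disclosure.

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Shared stem of the cross dual-channel backbone: a 7x7 stride-2
    convolution followed by 3x3 stride-2 max pooling, extracting initial
    local features such as contours and textures."""
    def __init__(self, out_channels: int = 64):  # channel width assumed
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels, kernel_size=7, stride=2,
                              padding=3, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.act(self.bn(self.conv(x))))

# A 224x224 input yields a 56x56 feature map that feeds both the CNN
# branch and the visual Transformer branch.
stem = Stem()
feat = stem(torch.randn(1, 3, 224, 224))  # -> (1, 64, 56, 56)
```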
After the contrastive self-supervised learning network is constructed, forest fire smoke video detection begins, and feature learning is performed on the constructed smoke video dataset with the constructed contrastive self-supervised learning network. The smoke video dataset is input into the data input module for data enhancement, generating positive and negative video sample images that comprise global views and local views of the positive and negative video samples. The feature extraction backbone network extracts the features of the positive and negative video sample images. With contrastive self-supervised learning as its main learning mode, the unsupervised self-distillation network performs student-network and teacher-network contrastive learning on these features and generates semantic information and pseudo labels on the positive and negative video sample images; knowledge distillation is introduced to reduce the parameters of the unsupervised self-distillation network and increase processing speed.
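The data enhancement described above corresponds to a multi-crop scheme. A hedged sketch is given below; the crop sizes, scales and jitter strengths are illustrative assumptions, not values from the disclosure.

```python
from torchvision import transforms

# Two high-resolution global views and several low-resolution local views
# are generated per input frame (parameters assumed for illustration).
global_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
local_aug = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(frame, n_local=6):
    """Generate the view set V: two global views plus n_local local views."""
    return [global_aug(frame) for _ in range(2)] + \
           [local_aug(frame) for _ in range(n_local)]
```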
Let the student network parameters be θ_s and the teacher network parameters be θ_t. Given an input sample image x, a softmax function is applied to normalize the outputs of the student network g_θs and the teacher network g_θt, yielding the student-network probability density P_s and the teacher-network probability density P_t of the sample image. The probability density is given by:
P_s(x)^(i) = exp(g_θs(x)^(i) / τ_s) / Σ_{k=1}^{K} exp(g_θs(x)^(k) / τ_s)  (1)
where K is the number of dimensions of the normalized output, i indexes the i-th dimension, and τ_s is a temperature parameter with τ_s > 0 that controls the sharpness of the output distribution; P_t is computed in the same way as formula (1) with a temperature parameter τ_t > 0. Given a fixed teacher network, the student network parameters can be learned by minimizing the cross-entropy loss:
min_θs H(P_t(x), P_s(x))  (2)
where H(P_t(x), P_s(x)) = -P_t(x) log P_s(x).
In this embodiment, a set V of views with invariance to the sample x is generated through strategies such as cropping. The set contains two global views x_1^g and x_2^g and several lower-resolution local views; the local-view features are used for contrastive learning through the student network, and the global-view features through the teacher network. Applying formula (2), the loss minimized over the student and teacher networks is:
min_θs Σ_{x ∈ {x_1^g, x_2^g}} Σ_{x' ∈ V, x' ≠ x} H(P_t(x), P_s(x'))  (3)
where x_1^g and x_2^g are the global-view sample images, x ranges over the global views, and x' ranges over the other sample images in the set V.
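A minimal PyTorch sketch of formulas (1) to (3) is given below, assuming that the student and teacher networks map a view to K-dimensional logits; the function and variable names are illustrative, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def sharpened_probs(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Formula (1): temperature-scaled softmax over the K output dimensions."""
    return F.softmax(logits / tau, dim=-1)

def self_distillation_loss(student_net, teacher_net, global_views, local_views,
                           tau_s: float = 0.1, tau_t: float = 0.04):
    """Formula (3): every global view (through the teacher) is compared
    against every other view (through the student) via H(P_t, P_s)."""
    loss, n_terms = 0.0, 0
    all_views = global_views + local_views
    for g in global_views:
        with torch.no_grad():                  # no gradient into the teacher
            p_t = sharpened_probs(teacher_net(g), tau_t)
        for v in all_views:
            if v is g:                         # skip the identical view
                continue
            p_s = sharpened_probs(student_net(v), tau_s)
            loss = loss - (p_t * torch.log(p_s + 1e-8)).sum(dim=-1).mean()
            n_terms += 1
    return loss / n_terms
```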
In this embodiment, the teacher network and the student network share the same architecture g but have different parameters. The student network optimizes its parameters by stochastic gradient descent, i.e. formula (3) is used to optimize the student network parameters. Unlike conventional knowledge distillation, the unsupervised self-distillation network does not use preset teacher network parameters θ_t; an exponential moving average (EMA), i.e. a momentum encoder, is applied directly to the student network parameters. In this embodiment the teacher network parameters θ_t are updated as:
θ_t ← λ θ_t + (1 - λ) θ_s  (4)
where λ is a coefficient with λ ∈ [0.996, 1]. Momentum encoders were originally used as queues in contrastive learning; in the unsupervised self-distillation network of this embodiment, the momentum encoder is used mainly to average the teacher parameters during training.
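Formula (4) can be sketched as follows, iterating over matched student and teacher parameter pairs:

```python
import torch

@torch.no_grad()
def update_teacher(student_net, teacher_net, lam: float = 0.996):
    """Formula (4): the teacher parameters follow an exponential moving
    average of the student parameters (momentum encoder)."""
    for p_s, p_t in zip(student_net.parameters(), teacher_net.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1.0 - lam)
```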
Further, during contrastive learning, the local views input into the student network and the global views input into the teacher network may each be positive or negative samples. If the student and teacher networks both receive positive samples or both receive negative samples, the similarity between them is learned; if the student network receives positive samples and the teacher network receives negative samples, the difference between them is learned, and likewise when the student network receives negative samples and the teacher network positive samples.
Further, when the unsupervised self-distillation network performs teacher-network contrastive learning on the image features of the positive and negative global views, the teacher network outputs are optimized and corrected with a momentum centering operation and a sharpening operation to prevent the model from collapsing. Centering prevents any single dimension from dominating, while sharpening has the opposite effect; the two operations complement each other and avoid collapse when the teacher network parameters are updated with a momentum encoder. The centering operation can be expressed as:
g_θt(x) ← g_θt(x) + c  (5)
where c is the center point, which is updated by an exponential moving average:
c ← m c + (1 - m) (1/B) Σ_{i=1}^{B} g_θt(x_i)  (6)
where m is a rate parameter with m > 0 and B is the batch size. Sharpening is then achieved by using a lower temperature τ_t in the teacher network's softmax normalization.
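A hedged sketch of formulas (5) and (6) follows; it assumes the common implementation in which the running mean is subtracted from the teacher output before the softmax (the sign convention of c in formula (5) absorbs this), and the rate value is illustrative.

```python
import torch

class TeacherCentering:
    """Formulas (5)-(6): maintain a center c as an exponential moving
    average of teacher outputs and apply it before softmax normalization."""
    def __init__(self, dim: int, m: float = 0.9):  # rate m assumed
        self.center = torch.zeros(dim)
        self.m = m

    @torch.no_grad()
    def __call__(self, teacher_out: torch.Tensor) -> torch.Tensor:
        self.center = self.center.to(teacher_out.device)
        # Formula (6): EMA update of the center over the batch of size B.
        self.center = self.m * self.center + (1 - self.m) * teacher_out.mean(dim=0)
        # Formula (5): center the teacher output; sharpening then comes from
        # the low teacher temperature used in the softmax.
        return teacher_out - self.center
```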
With the cross dual-channel network as its backbone, the unsupervised self-distillation network can generate semantic information and pseudo labels on positive and negative sample images through contrastive learning, thereby forming attention feature maps. Although the unsupervised self-distillation network does not output label predictions, the attention feature maps reveal the regions of the objects of interest, enabling high accuracy in the subsequent video detection.
Preferably, the feature extraction backbone network is trained on the large labeled ImageNet-1k dataset with an AdamW optimizer; to ensure the stability of the visual Transformer, data enhancement and regularization techniques such as Mixup, CutMix, Random Erasing, RandAugment and Stochastic Depth are adopted. Training runs for 300 epochs with the AdamW optimizer, a batch size of 1024, a weight decay of 0.05, an initial learning rate of 0.001, and an input local view size of 14 × 14 in the visual Transformer. The unsupervised self-distillation network is trained on the unlabeled ImageNet-1k dataset with the AdamW optimizer and a batch size of 1024. The training process is divided into a warm-up phase and a formal phase: the learning rate rises linearly to its initial value over the first 10 epochs following the linear scaling rule lr = 0.0005 × batch size / 256, and after warm-up a cosine decay schedule is adopted, with the weight decay following a schedule over [0.04, 0.4]. The student network temperature is τ_s = 0.1, while the teacher temperature is linearly warmed up over τ_t ∈ [0.04, 0.07] during the first 30 epochs. To ensure the stability of the unsupervised self-distillation network, the data are further enhanced with techniques such as color jittering, Gaussian blur and solarization (overexposure).
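For illustration, the learning rate scaling and the teacher temperature warm-up described above can be sketched as follows; the helper functions are assumptions, not the disclosed training code.

```python
import math

def learning_rate(step, warmup_steps, total_steps, batch_size=1024):
    """Linear warm-up to the scaled base learning rate, then cosine decay."""
    base_lr = 0.0005 * batch_size / 256  # linear scaling rule from the text
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

def teacher_temperature(epoch, warmup_epochs=30):
    """Teacher temperature warmed linearly from 0.04 to 0.07, then fixed."""
    if epoch >= warmup_epochs:
        return 0.07
    return 0.04 + (0.07 - 0.04) * epoch / warmup_epochs
```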
S3: detecting a moving object according to the attention feature map of the continuous video frames, and acquiring a detection result;
specifically, feature learning is carried out on the smoke video data set through an unsupervised self-distillation network to form an attention feature map of continuous video frames, the attention feature map of each frame is obtained to determine an attention area, and the motion extraction module only carries out moving object detection on the attention area to reduce information redundancy and improve video processing efficiency. Inputting the attention feature map of the continuous video frames into a motion extraction module, and acquiring continuous three-frame video data F in the attention feature map of the continuous video frames by the motion extraction module i-1 (x,y)、F i (x, y) and F i+1 (x, y) calculating the frame difference D of two continuous frames i-1,i (x, y) and D i,i+1 (x, y), the frame difference result is expressed as:
Figure BDA0003685778070000091
Figure BDA0003685778070000092
in the formula T min Minimum threshold, T, representing frame difference max Representing a maximum threshold for frame differences. In order to alleviate the ghost problem of the moving object in the conventional frame difference method, the frame difference result of the two consecutive frames is subjected to an and operation, which is expressed as:
Figure BDA0003685778070000093
If the AND operation outputs 1, the motion region of the attention feature map of the continuous video frames is labeled and segmented; if it outputs 0, no motion region is labeled or segmented. The forest fire smoke detection result and the forest fire smoke localization result of the forest fire smoke video are generated from the labeled and segmented motion regions.
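A compact sketch of formulas (7) to (9), assuming 8-bit grayscale frames as NumPy arrays; the threshold values are illustrative assumptions.

```python
import numpy as np

def motion_mask(f_prev, f_curr, f_next, t_min=15, t_max=255):
    """Dual-threshold frame differencing over three consecutive frames,
    combined with a logical AND to suppress ghosting."""
    d1 = np.abs(f_curr.astype(np.int16) - f_prev.astype(np.int16))
    d2 = np.abs(f_next.astype(np.int16) - f_curr.astype(np.int16))
    m1 = (d1 >= t_min) & (d1 <= t_max)   # formula (7)
    m2 = (d2 >= t_min) & (d2 <= t_max)   # formula (8)
    return m1 & m2                        # formula (9): AND operation
```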
Further, after the detection result is obtained, a result evaluation step follows. The forest fire smoke detection result of the forest fire smoke video is evaluated with the average accuracy rate (mAP), the average true positive rate (ATPR) and the average true negative rate (ATNR); the forest fire smoke localization result of a single-frame forest fire smoke video is evaluated with the intersection over union (IoU), and the forest fire smoke localization result of continuous-frame forest fire smoke video with the mean intersection over union (mIoU).
The average accuracy rate is calculated as:
mAP = (1/M_3) Σ_{i=1}^{M_3} AP_i  (10)
The average true positive rate is calculated as:
ATPR = (1/M_1) Σ_{j=1}^{M_1} TP_j / (TP_j + FN_j)  (11)
The average true negative rate is calculated as:
ATNR = (1/M_1) Σ_{j=1}^{M_1} TN_j / (TN_j + FP_j)  (12)
The intersection over union is calculated as:
IoU = AO / AU  (13)
The mean intersection over union is calculated as:
mIoU = (1/M_2) Σ_{j=1}^{M_2} IoU_j  (14)
where TP (true positive) is the number of correctly identified video frames in a smoke video sequence, FN (false negative) the number of incorrectly identified video frames in a smoke video sequence, TN (true negative) the number of correctly identified video frames in a non-smoke video sequence, FP (false positive) the number of incorrectly identified video frames in a non-smoke video sequence, M_1 the number of video sequences, AO the intersection of the predicted smoke localization area and the real localization area of a single frame image, AU the union of the predicted smoke localization area and the real localization area of a single frame image, M_2 the number of frames in a smoke video sequence, M_3 the number of classes, and AP_i the average of the precision values over each recall.
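The per-sequence rates and the overlap measures can be sketched as follows, assuming axis-aligned bounding boxes (x1, y1, x2, y2) for the localization results; the box representation is an assumption.

```python
import numpy as np

def atpr(tp_per_seq, fn_per_seq):
    """ATPR, formula (11): mean over sequences of TP / (TP + FN)."""
    return float(np.mean([tp / max(1, tp + fn)
                          for tp, fn in zip(tp_per_seq, fn_per_seq)]))

def atnr(tn_per_seq, fp_per_seq):
    """ATNR, formula (12): mean over sequences of TN / (TN + FP)."""
    return float(np.mean([tn / max(1, tn + fp)
                          for tn, fp in zip(tn_per_seq, fp_per_seq)]))

def iou(pred, gt):
    """IoU = AO / AU for axis-aligned boxes, formula (13)."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    ao = max(0, x2 - x1) * max(0, y2 - y1)
    au = ((pred[2] - pred[0]) * (pred[3] - pred[1])
          + (gt[2] - gt[0]) * (gt[3] - gt[1]) - ao)
    return ao / au if au > 0 else 0.0

def miou(per_frame_ious):
    """mIoU, formula (14): mean IoU over the M_2 frames of a sequence."""
    return float(np.mean(per_frame_ious))
```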
Further, the forest fire smoke video detection method based on the contrastive self-supervised learning network is compared with the 3D-PFCN and 3D-VSSNet smoke video detection methods to evaluate its effectiveness on the public forest fire smoke video dataset. The smoke video dataset is divided into green-background, gray-background and complex-background categories according to the background information of the forest fire smoke videos. To ensure the fairness of the comparison experiments, the methods are compared on three evaluation indices: average true positive rate, average true negative rate and mean intersection over union. The comparison results are shown in Table 1, where CSLN denotes the forest fire smoke video detection method based on the contrastive self-supervised learning network.
Table 1. Comparison of different smoke video detection methods on the public forest fire smoke dataset (the table figure is not reproduced here; its key values are quoted in the discussion below).
As shown in Table 1, the forest fire smoke video detection method based on the contrastive self-supervised learning network (CSLN) exhibits the best performance across the green forest background, the gray forest and blue sky background, and the complex weather background. Specifically, in the green forest background, the three methods achieve the same 100% average true positive rate (ATPR). Although the ATNR (99.79%) and mIoU (86.46%) of the detection method of this embodiment are slightly better than those of the 3D-PFCN method (ATNR 98.54%, mIoU 78.48%) and the 3D-VSSNet method (ATNR 99.62%, mIoU 86.32%), no obvious advantage is shown, indicating that all three methods are suitable for forest fire smoke detection in the simple scene of a green forest background. Against the gray forest and blue sky background, the ATPR and mIoU of the present detection method are respectively 1.06% and 1.99% higher than those of the 3D-VSSNet method, and 3.19% and 6.51% higher than those of the 3D-PFCN method, showing that the forest fire smoke video detection method based on the contrastive self-supervised learning network is highly competitive when detecting smoke video data against a blue sky and white cloud background. Against a complex background, the detection method of this embodiment clearly improves all three evaluation indices over the 3D-PFCN and 3D-VSSNet methods: the ATPR, ATNR and mIoU improve by 5.17%, 2.95% and 5.85% over 3D-VSSNet, and by 12.22%, 5.1% and 9.91% over 3D-PFCN, indicating that weather interference factors such as fog and haze affect this detection method the least. The forest fire smoke video detection method based on the contrastive self-supervised learning network is therefore better suited than the 3D-VSSNet and 3D-PFCN smoke video detection methods to detecting forest fire smoke video targets in complex scenes, verifying the effectiveness of the contrastive self-supervised learning network for forest fire smoke detection against complex backgrounds including clouds, fog and haze.
To verify the effectiveness of the forest fire smoke video detection method based on the contrastive self-supervised learning network in detecting forest fire smoke in a complex environment, detection is performed on the public forest fire smoke video dataset; the results are shown in fig. 2. The DINO attention map is the continuous-video-frame attention feature map output by the unsupervised self-distillation network, the DINO attention region map marks the positions in the original image corresponding to the attention regions of the DINO attention map, and the motion region extraction map is the result of performing continuous-frame motion region detection directly on the attention regions. As shown in figs. 2(a) and 2(d), interference from moving objects such as swaying trees, flying unmanned aerial vehicles and camera shake is eliminated during detection, achieving accurate localization of long-range smoke. As shown in figs. 2(b) and 2(e), complex weather backgrounds including fog and haze are handled, so weather interference is eliminated during detection and long-range smoke is accurately detected in low-resolution video frames. As shown in fig. 2(c), small-target smoke is accurately identified and located against a complex background. In summary, the forest fire smoke video detection method based on the contrastive self-supervised learning network shows high effectiveness and stability when detecting small-target smoke of long-range forest fires under complex background conditions.
To further verify the effectiveness of the forest fire smoke video detection method based on the contrastive self-supervised learning network in detecting long-range forest fire smoke, detection is performed on the forest fire smoke video dataset established in step S1; the results are shown in fig. 3. The video sequences include weather factors such as blue sky, clouds, fog and over-strong illumination: figs. 3(a), 3(c), 3(f) and 3(h) show blue sky backgrounds, fig. 3(b) an over-strong illumination background, figs. 3(d) and 3(g) cloud backgrounds, and fig. 3(i) a fog background, while figs. 3(e) and 3(j) include other non-smoke moving objects. Under the interference of these five weather factors, the detection method accurately detects the smoke while accurately locating the long-range smoke video target, showing good detection performance. These results show that the forest fire smoke video detection method based on the contrastive self-supervised learning network is suitable not only for detecting small-target forest fire smoke video data against complex backgrounds, but also for other open video target detection scenarios, verifying the stability and generalization capability of the detection method.
Based on contrastive self-supervised learning, the method builds an unsupervised self-distillation network to detect forest fire smoke in video. With a cross dual-channel network as its backbone, the unsupervised self-distillation network extracts the local and global features of smoke video against complex environmental backgrounds, learns the semantic information and pseudo labels of continuous video frames, obtains attention feature maps of the continuous video frames, and determines the moving targets of the attention regions within them, achieving high-precision recognition and localization of forest fire smoke in video, improving smoke detection efficiency and performance, and suiting the method to detecting small-target smoke videos of early-stage, long-range forest fires against complex backgrounds.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being covered by the appended claims and their equivalents.

Claims (8)

1. A forest fire smoke video detection method based on a contrastive self-supervised learning network, characterized by comprising the following steps:
acquiring a forest fire smoke video, and establishing a smoke video dataset from the forest fire smoke video;
performing feature learning on the smoke video dataset with a pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames;
and detecting moving objects according to the attention feature maps of the continuous video frames, and acquiring a detection result.
2. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 1, further comprising, before the forest fire smoke video is obtained, pre-constructing the contrastive self-supervised learning network, wherein the contrastive self-supervised learning network comprises a data input module, an unsupervised self-distillation network, a feature extraction backbone network, a motion region extraction module and an output module;
and the feature extraction backbone network is built into the unsupervised self-distillation network.
3. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 2, wherein the feature extraction backbone network is a cross dual-channel network based on a convolutional neural network and a visual Transformer.
4. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 2, wherein performing feature learning on the smoke video dataset with the pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames specifically comprises:
inputting the smoke video dataset into the data input module for data enhancement to generate positive and negative video sample images, wherein the positive and negative video sample images comprise global views and local views of the positive and negative video samples;
extracting the features of the positive and negative video sample images with the feature extraction backbone network;
and performing student-network and teacher-network contrastive learning on the positive and negative video sample image features with the unsupervised self-distillation network, and generating semantic information and pseudo labels on the positive and negative video sample images to form the attention feature maps of continuous video frames.
5. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 4, further comprising: optimizing the parameters of the student network with stochastic gradient descent, and optimizing and correcting the parameters of the teacher network with a momentum centering operation.
6. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 4, wherein detecting moving objects according to the attention feature maps of the continuous video frames and acquiring a detection result specifically comprises:
inputting the attention feature maps of the continuous video frames into the motion region extraction module;
acquiring three consecutive frames of video data from the attention feature maps of the continuous video frames with the motion region extraction module, and separately calculating the frame difference of each pair of consecutive frames;
performing an AND operation on the frame difference results of the two pairs of consecutive frames in the motion region extraction module, and labeling and segmenting the motion regions of the attention feature maps according to the AND result, to generate the forest fire smoke detection result and the forest fire smoke localization result of the forest fire smoke video;
and outputting the forest fire smoke detection result and the forest fire smoke localization result with the output module.
7. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 6, further comprising, after the detection result is obtained, a result evaluation step, specifically:
evaluating the forest fire smoke detection result of the forest fire smoke video with the average accuracy rate, the average true positive rate and the average true negative rate;
and evaluating the forest fire smoke localization result of a single-frame forest fire smoke video with the intersection over union, and the forest fire smoke localization result of continuous-frame forest fire smoke video with the mean intersection over union.
8. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 2, further comprising training and optimizing the contrastive self-supervised learning network, specifically comprising:
training the feature extraction backbone network on the large labeled ImageNet-1k dataset with an AdamW optimizer;
and training the unsupervised self-distillation network on the unlabeled ImageNet-1k dataset with an AdamW optimizer.
CN202210645586.1A 2022-06-09 2022-06-09 Forest fire smoke video detection method based on contrast self-supervision learning network Pending CN114998801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210645586.1A CN114998801A (en) 2022-06-09 2022-06-09 Forest fire smoke video detection method based on contrast self-supervision learning network

Publications (1)

Publication Number Publication Date
CN114998801A true CN114998801A (en) 2022-09-02

Family

ID=83033504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210645586.1A Pending CN114998801A (en) 2022-06-09 2022-06-09 Forest fire smoke video detection method based on contrast self-supervision learning network

Country Status (1)

Country Link
CN (1) CN114998801A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409529A (en) * 2023-10-13 2024-01-16 国网江苏省电力有限公司南通供电分公司 Multi-scene electrical fire on-line monitoring method and system
CN117409529B (en) * 2023-10-13 2024-05-24 国网江苏省电力有限公司南通供电分公司 Multi-scene electrical fire on-line monitoring method and system

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111209810B (en) Boundary frame segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time through visible light and infrared images
CN111310862A (en) Deep neural network license plate positioning method based on image enhancement in complex environment
CN103700114B (en) A kind of complex background modeling method based on variable Gaussian mixture number
CN110363770B (en) Training method and device for edge-guided infrared semantic segmentation model
CN109919073B (en) Pedestrian re-identification method with illumination robustness
CN111008608B (en) Night vehicle detection method based on deep learning
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN113537226A (en) Smoke detection method based on deep learning
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN114821374A (en) Knowledge and data collaborative driving unmanned aerial vehicle aerial photography target detection method
CN114998801A (en) Forest fire smoke video detection method based on contrast self-supervision learning network
CN116453033A (en) Crowd density estimation method with high precision and low calculation amount in video monitoring scene
CN111815529B (en) Low-quality image classification enhancement method based on model fusion and data enhancement
CN110334703B (en) Ship detection and identification method in day and night image
CN110084160B (en) Video forest smoke and fire detection method based on motion and brightness significance characteristics
CN116740572A (en) Marine vessel target detection method and system based on improved YOLOX
CN116958780A (en) Cross-scale target detection method and system
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN114387484B (en) Improved mask wearing detection method and system based on yolov4
CN110796008A (en) Early fire detection method based on video image
CN115861595A (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN115690770A (en) License plate recognition method based on space attention characteristics in non-limited scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination