CN114998801A - Forest fire smoke video detection method based on contrastive self-supervised learning network

Forest fire smoke video detection method based on contrastive self-supervised learning network

Info

Publication number
CN114998801A
Authority
CN
China
Prior art keywords
video
forest fire
network
smoke
fire smoke
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210645586.1A
Other languages
Chinese (zh)
Inventor
Zhang Junguo (张军国)
Li Tingting (李婷婷)
Hu Chunhe (胡春鹤)
Tian Ye (田野)
Zhang Changchun (张长春)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Forestry University
Original Assignee
Beijing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Forestry University filed Critical Beijing Forestry University
Priority to CN202210645586.1A priority Critical patent/CN114998801A/en
Publication of CN114998801A publication Critical patent/CN114998801A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06T 7/215 Motion-based segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/254 Analysis of motion involving subtraction of images
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20224 Image subtraction
    • G06T 2207/30204 Marker
    • Y02A 40/28 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture specially adapted for farming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a forest fire smoke video detection method based on a contrastive self-supervised learning network, comprising the following steps: acquiring a forest fire smoke video, and establishing a smoke video dataset from the forest fire smoke video; performing feature learning on the smoke video dataset with a pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames; and detecting moving objects according to the attention feature maps of the continuous video frames, and acquiring a detection result. Based on contrastive self-supervised learning, the method builds an unsupervised self-distillation network with a cross dual-channel network as its backbone, extracts smoke video features against complex environmental backgrounds, learns semantic information, obtains attention feature maps of continuous video frames, and determines the moving targets within the attention regions, thereby achieving high-precision recognition and localization of forest fire smoke in video. The method improves smoke detection efficiency and performance and is suited to detecting small-target smoke videos of early-stage, long-range forest fires against complex backgrounds.

Description

Forest fire smoke video detection method based on contrastive self-supervised learning network
Technical Field
The invention relates to the technical field of forest fire smoke detection, and in particular to a forest fire smoke video detection method based on a contrastive self-supervised learning network.
Background
Forest fires are sudden, highly destructive natural disasters that are difficult to contain and respond to; timely detection of forest fire smoke plays an important role in rescue operations and in reducing the losses caused by fire.
In the prior art, convolutional neural networks are widely applied to forest fire smoke video detection tasks. Although a convolutional neural network can acquire discriminative local smoke feature representations, it cannot capture the global information of a smoke image. Thanks to its self-attention mechanism and multi-layer perceptron structure, the visual Transformer can represent global information such as complex spatial transformations and long-range feature dependencies; however, because of these same two structures, the visual Transformer usually ignores the local features of an image, so that the difference between image background and foreground is not obvious.
To address the limited data volume of small-target smoke in early forest fires, researchers have proposed few-shot learning methods, but many of these methods do not consider interference from clouds, fog, haze and other objects whose color, contour, texture and other characteristics resemble those of smoke, so their false alarm rate is slightly higher than that of supervised learning. In fact, the static characteristics of clouds, fog and haze are highly similar to those of forest fire smoke, and the human eye can hardly distinguish smoke, especially small-target smoke, directly and accurately from a small number of labeled images; from video data, however, the human eye can easily and accurately identify and locate small-target smoke, and thus distinguish smoke from similar objects such as clouds, fog and haze. Therefore, introducing the dynamic characteristics of smoke is of great significance for few-shot detection of early forest fire smoke.
Most existing forest fire smoke video detection methods rely on large amounts of labeled smoke data and are not suited to the limited data volume of small-target smoke in early forest fires. In addition, existing methods rarely consider the information redundancy among consecutive video frames when processing smoke video data, so their computational complexity is excessive, which affects the efficiency and performance of smoke detection.
Disclosure of Invention
To address the defects of the prior art, the invention provides a forest fire smoke video detection method based on a contrastive self-supervised learning network, comprising the following steps:
acquiring a forest fire smoke video, and establishing a smoke video dataset from the forest fire smoke video;
performing feature learning on the smoke video dataset with a pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames;
and detecting moving objects according to the attention feature maps of the continuous video frames, and acquiring a detection result.
Further, before the forest fire smoke video is obtained, the contrastive self-supervised learning network is constructed in advance; the contrastive self-supervised learning network comprises a data input module, an unsupervised self-distillation network, a feature extraction backbone network, a motion region extraction module and an output module;
the feature extraction backbone network is built into the unsupervised self-distillation network.
Further, the feature extraction backbone network is a cross dual-channel network based on a convolutional neural network and a visual Transformer.
Further, performing feature learning on the smoke video dataset with the pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames specifically includes:
inputting the smoke video dataset into the data input module for data enhancement to generate positive and negative video sample images, wherein the positive and negative video sample images comprise global views and local views of the positive and negative video samples;
extracting the features of the positive and negative video sample images with the feature extraction backbone network;
and performing student-network and teacher-network contrastive learning on the positive and negative video sample image features with the unsupervised self-distillation network, and generating semantic information and pseudo labels on the positive and negative video sample images to form the attention feature maps of continuous video frames.
Further, the method also includes: optimizing the parameters of the student network with stochastic gradient descent, and optimizing and correcting the parameters of the teacher network with a momentum centering operation.
Further, detecting moving objects according to the attention feature maps of the continuous video frames and acquiring a detection result specifically includes:
inputting the attention feature maps of the continuous video frames into the motion region extraction module;
acquiring three consecutive frames of video data from the attention feature maps of the continuous video frames with the motion region extraction module, and separately calculating the frame difference of each pair of consecutive frames;
performing an AND operation on the frame difference results of the two pairs of consecutive frames in the motion region extraction module, and labeling and segmenting the motion regions of the attention feature maps according to the AND result, to generate the forest fire smoke detection result and the forest fire smoke localization result of the forest fire smoke video;
and outputting the forest fire smoke detection result and the forest fire smoke localization result with the output module.
Further, after the detection result is obtained, the method includes a result evaluation step, specifically:
evaluating the forest fire smoke detection result of the forest fire smoke video with the average accuracy rate, the average true positive rate and the average true negative rate;
and evaluating the forest fire smoke localization result of a single-frame forest fire smoke video with the intersection over union, and the forest fire smoke localization result of continuous-frame forest fire smoke video with the mean intersection over union.
Further, the method includes training and optimizing the contrastive self-supervised learning network, specifically:
training the feature extraction backbone network on the large labeled ImageNet-1k dataset with an AdamW optimizer;
and training the unsupervised self-distillation network on the unlabeled ImageNet-1k dataset with an AdamW optimizer.
The invention has the following beneficial effects. Based on contrastive self-supervised learning, an unsupervised self-distillation network is built to detect forest fire smoke in video. With a cross dual-channel network as its backbone, the unsupervised self-distillation network extracts the local and global features of smoke video against complex environmental backgrounds, learns the semantic information and pseudo labels of continuous video frames, and obtains attention feature maps of the continuous video frames while determining the moving targets of the attention regions within them. This achieves high-precision recognition and localization of forest fire smoke in video, improves smoke detection efficiency and performance, and suits the method to detecting small-target smoke videos of early-stage, long-range forest fires against complex backgrounds.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
Fig. 1 is a flowchart of a forest fire smoke video detection method based on a contrast self-supervised learning network according to an embodiment of the present invention;
fig. 2 is a detection result of the forest fire smoke video detection method based on the contrast self-supervised learning network according to the embodiment of the present invention on a public forest fire smoke video data set;
fig. 3 is a detection result of the forest fire smoke video detection method based on the contrast self-supervised learning network according to the embodiment of the present invention on the established forest fire smoke video data set.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only used as examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the present invention belongs.
As shown in fig. 1, the forest fire smoke video detection method based on the contrastive self-supervised learning network includes the following steps:
s1: acquiring a forest fire smoke video, and establishing a smoke video data set according to the forest fire smoke video;
specifically, a forest fire smoke video is obtained, the forest fire smoke video is from a lookout tower visible light camera, the shooting distance is long, the shooting time is the initial stage of fire occurrence, and therefore the shot smoke target is small, and the forest fire smoke video has the advantages of being uneven in illumination, complex in background information, multiple in interferent and the like. In order to introduce the dynamic characteristics of the forest fire smoke video, a smoke video data set is constructed on the basis of a static forest fire smoke less sample image data set according to the forest fire smoke video, so that the performance of a smoke detection model is improved by adding time sequence information. Further, the smoke video data set comprises 87 visible light forest fire small target smoke videos and 53 non-smoke videos, and because the forest fire smoke video sequence contains context information, the smoke size in continuous frames can change continuously with time, so that the small target smoke defined in the embodiment is smoke with the minimum detectable size of 20 × 20 pixels in the video sequence.
S2: performing feature learning on the smoke video data set by adopting a pre-constructed contrast self-supervision learning network to obtain an attention feature map of continuous video frames;
specifically, before a forest fire smoke video is obtained, a contrast self-supervision learning network is constructed in advance, and a network model of the contrast self-supervision learning network comprises a data input module, an unsupervised self-distillation network, a feature extraction backbone network, a motion region extraction module and an output module. The data input module is used for inputting video data and performing data enhancement on the video data to obtain a positive and negative sample global view and a local view of a video; the feature extraction backbone network is built in the unsupervised self-distillation learning network and is used for extracting features of a positive and negative sample global view and a local view; the unsupervised self-distillation network is used for carrying out comparison learning on the characteristics of the positive and negative sample global view and the local view through a student network and a teacher network so as to obtain an attention characteristic diagram of continuous video frames; the motion region extraction module is used for detecting a moving object according to the attention feature map of the continuous video frames; an output module: and is used for outputting the detection result of the moving object.
Further, the feature extraction backbone network is a cross dual-channel network based on a convolutional neural network and a visual Transformer, specifically comprising a stem network, a convolutional neural network branch, a visual Transformer branch, feature coupling units and classifiers. The stem network comprises a 7 × 7 convolution with stride 2 followed by a 3 × 3 max pooling with stride 2, extracting initial local features such as contours and textures. The feature coupling unit fuses the intermediate features of the convolutional neural network and the visual Transformer; because the two branch networks share the same stem, the feature coupling unit is applied from the first stage of the cross dual-channel network, and after feature fusion across four stages, the features of the convolutional neural network and the visual Transformer are aggregated separately and fed to different classifiers.
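For illustration, the stem described above can be sketched in PyTorch as follows; the channel width and the use of batch normalization are assumptions for clarity, not details fixed by the disclosure.

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Shared stem of the cross dual-channel backbone: a 7x7 stride-2
    convolution followed by 3x3 stride-2 max pooling, extracting initial
    local features such as contours and textures."""
    def __init__(self, out_channels: int = 64):  # channel width assumed
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels, kernel_size=7, stride=2,
                              padding=3, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.act(self.bn(self.conv(x))))

# A 224x224 input yields a 56x56 feature map that feeds both the CNN
# branch and the visual Transformer branch.
stem = Stem()
feat = stem(torch.randn(1, 3, 224, 224))  # -> (1, 64, 56, 56)
```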
After the contrastive self-supervised learning network is constructed, forest fire smoke video detection begins, and feature learning is performed on the constructed smoke video dataset with the constructed contrastive self-supervised learning network. The smoke video dataset is input into the data input module for data enhancement, generating positive and negative video sample images that comprise global views and local views of the positive and negative video samples. The feature extraction backbone network extracts the features of the positive and negative video sample images. With contrastive self-supervised learning as its main learning mode, the unsupervised self-distillation network performs student-network and teacher-network contrastive learning on these features and generates semantic information and pseudo labels on the positive and negative video sample images; knowledge distillation is introduced to reduce the parameters of the unsupervised self-distillation network and increase processing speed.
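The data enhancement described above corresponds to a multi-crop scheme. A hedged sketch is given below; the crop sizes, scales and jitter strengths are illustrative assumptions, not values from the disclosure.

```python
from torchvision import transforms

# Two high-resolution global views and several low-resolution local views
# are generated per input frame (parameters assumed for illustration).
global_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
local_aug = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(frame, n_local=6):
    """Generate the view set V: two global views plus n_local local views."""
    return [global_aug(frame) for _ in range(2)] + \
           [local_aug(frame) for _ in range(n_local)]
```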
Let the student network parameters be θ_s and the teacher network parameters be θ_t. Given an input sample image x, a softmax function is applied to normalize the outputs of the student network g_θs and the teacher network g_θt, yielding the student-network probability density P_s and the teacher-network probability density P_t of the sample image. The probability density is given by:
P_s(x)^(i) = exp(g_θs(x)^(i) / τ_s) / Σ_{k=1}^{K} exp(g_θs(x)^(k) / τ_s)  (1)
where K is the number of dimensions of the normalized output, i indexes the i-th dimension, and τ_s is a temperature parameter with τ_s > 0 that controls the sharpness of the output distribution; P_t is computed in the same way as formula (1) with a temperature parameter τ_t > 0. Given a fixed teacher network, the student network parameters can be learned by minimizing the cross-entropy loss:
min_θs H(P_t(x), P_s(x))  (2)
where H(P_t(x), P_s(x)) = -P_t(x) log P_s(x).
In this embodiment, a set V of views with invariance to the sample x is generated through strategies such as cropping. The set contains two global views x_1^g and x_2^g and several lower-resolution local views; the local-view features are used for contrastive learning through the student network, and the global-view features through the teacher network. Applying formula (2), the loss minimized over the student and teacher networks is:
min_θs Σ_{x ∈ {x_1^g, x_2^g}} Σ_{x' ∈ V, x' ≠ x} H(P_t(x), P_s(x'))  (3)
where x_1^g and x_2^g are the global-view sample images, x ranges over the global views, and x' ranges over the other sample images in the set V.
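A minimal PyTorch sketch of formulas (1) to (3) is given below, assuming that the student and teacher networks map a view to K-dimensional logits; the function and variable names are illustrative, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def sharpened_probs(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Formula (1): temperature-scaled softmax over the K output dimensions."""
    return F.softmax(logits / tau, dim=-1)

def self_distillation_loss(student_net, teacher_net, global_views, local_views,
                           tau_s: float = 0.1, tau_t: float = 0.04):
    """Formula (3): every global view (through the teacher) is compared
    against every other view (through the student) via H(P_t, P_s)."""
    loss, n_terms = 0.0, 0
    all_views = global_views + local_views
    for g in global_views:
        with torch.no_grad():                  # no gradient into the teacher
            p_t = sharpened_probs(teacher_net(g), tau_t)
        for v in all_views:
            if v is g:                         # skip the identical view
                continue
            p_s = sharpened_probs(student_net(v), tau_s)
            loss = loss - (p_t * torch.log(p_s + 1e-8)).sum(dim=-1).mean()
            n_terms += 1
    return loss / n_terms
```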
In this embodiment, the teacher network and the student network share the same architecture g but have different parameters. The student network optimizes its parameters by stochastic gradient descent, i.e. formula (3) is used to optimize the student network parameters. Unlike conventional knowledge distillation, the unsupervised self-distillation network does not use preset teacher network parameters θ_t; an exponential moving average (EMA), i.e. a momentum encoder, is applied directly to the student network parameters. In this embodiment the teacher network parameters θ_t are updated as:
θ_t ← λ θ_t + (1 - λ) θ_s  (4)
where λ is a coefficient with λ ∈ [0.996, 1]. Momentum encoders were originally used as queues in contrastive learning; in the unsupervised self-distillation network of this embodiment, the momentum encoder is used mainly to average the teacher parameters during training.
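Formula (4) can be sketched as follows, iterating over matched student and teacher parameter pairs:

```python
import torch

@torch.no_grad()
def update_teacher(student_net, teacher_net, lam: float = 0.996):
    """Formula (4): the teacher parameters follow an exponential moving
    average of the student parameters (momentum encoder)."""
    for p_s, p_t in zip(student_net.parameters(), teacher_net.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1.0 - lam)
```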
Further, during contrastive learning, the local views input into the student network and the global views input into the teacher network may each be positive or negative samples. If the student and teacher networks both receive positive samples or both receive negative samples, the similarity between them is learned; if the student network receives positive samples and the teacher network receives negative samples, the difference between them is learned, and likewise when the student network receives negative samples and the teacher network positive samples.
Further, when the unsupervised self-distillation network performs teacher-network contrastive learning on the image features of the positive and negative global views, the teacher network outputs are optimized and corrected with a momentum centering operation and a sharpening operation to prevent the model from collapsing. Centering prevents any single dimension from dominating, while sharpening has the opposite effect; the two operations complement each other and avoid collapse when the teacher network parameters are updated with a momentum encoder. The centering operation can be expressed as:
g_θt(x) ← g_θt(x) + c  (5)
where c is the center point, which is updated by an exponential moving average:
c ← m c + (1 - m) (1/B) Σ_{i=1}^{B} g_θt(x_i)  (6)
where m is a rate parameter with m > 0 and B is the batch size. Sharpening is then achieved by using a lower temperature τ_t in the teacher network's softmax normalization.
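A hedged sketch of formulas (5) and (6) follows; it assumes the common implementation in which the running mean is subtracted from the teacher output before the softmax (the sign convention of c in formula (5) absorbs this), and the rate value is illustrative.

```python
import torch

class TeacherCentering:
    """Formulas (5)-(6): maintain a center c as an exponential moving
    average of teacher outputs and apply it before softmax normalization."""
    def __init__(self, dim: int, m: float = 0.9):  # rate m assumed
        self.center = torch.zeros(dim)
        self.m = m

    @torch.no_grad()
    def __call__(self, teacher_out: torch.Tensor) -> torch.Tensor:
        self.center = self.center.to(teacher_out.device)
        # Formula (6): EMA update of the center over the batch of size B.
        self.center = self.m * self.center + (1 - self.m) * teacher_out.mean(dim=0)
        # Formula (5): center the teacher output; sharpening then comes from
        # the low teacher temperature used in the softmax.
        return teacher_out - self.center
```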
With the cross dual-channel network as its backbone, the unsupervised self-distillation network can generate semantic information and pseudo labels on positive and negative sample images through contrastive learning, thereby forming attention feature maps. Although the unsupervised self-distillation network does not output label predictions, the attention feature maps reveal the regions of the objects of interest, enabling high accuracy in the subsequent video detection.
Preferably, the feature extraction backbone network is trained on the large labeled ImageNet-1k dataset with an AdamW optimizer; to ensure the stability of the visual Transformer, data enhancement and regularization techniques such as Mixup, CutMix, Random Erasing, RandAugment and Stochastic Depth are adopted. Training runs for 300 epochs with the AdamW optimizer, a batch size of 1024, a weight decay of 0.05, an initial learning rate of 0.001, and an input local view size of 14 × 14 in the visual Transformer. The unsupervised self-distillation network is trained on the unlabeled ImageNet-1k dataset with the AdamW optimizer and a batch size of 1024. The training process is divided into a warm-up phase and a formal phase: the learning rate rises linearly to its initial value over the first 10 epochs following the linear scaling rule lr = 0.0005 × batch size / 256, and after warm-up a cosine decay schedule is adopted, with the weight decay following a schedule over [0.04, 0.4]. The student network temperature is τ_s = 0.1, while the teacher temperature is linearly warmed up over τ_t ∈ [0.04, 0.07] during the first 30 epochs. To ensure the stability of the unsupervised self-distillation network, the data are further enhanced with techniques such as color jittering, Gaussian blur and solarization (overexposure).
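For illustration, the learning rate scaling and the teacher temperature warm-up described above can be sketched as follows; the helper functions are assumptions, not the disclosed training code.

```python
import math

def learning_rate(step, warmup_steps, total_steps, batch_size=1024):
    """Linear warm-up to the scaled base learning rate, then cosine decay."""
    base_lr = 0.0005 * batch_size / 256  # linear scaling rule from the text
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

def teacher_temperature(epoch, warmup_epochs=30):
    """Teacher temperature warmed linearly from 0.04 to 0.07, then fixed."""
    if epoch >= warmup_epochs:
        return 0.07
    return 0.04 + (0.07 - 0.04) * epoch / warmup_epochs
```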
S3: detecting a moving object according to the attention feature map of the continuous video frames, and acquiring a detection result;
specifically, feature learning is carried out on the smoke video data set through an unsupervised self-distillation network to form an attention feature map of continuous video frames, the attention feature map of each frame is obtained to determine an attention area, and the motion extraction module only carries out moving object detection on the attention area to reduce information redundancy and improve video processing efficiency. Inputting the attention feature map of the continuous video frames into a motion extraction module, and acquiring continuous three-frame video data F in the attention feature map of the continuous video frames by the motion extraction module i-1 (x,y)、F i (x, y) and F i+1 (x, y) calculating the frame difference D of two continuous frames i-1,i (x, y) and D i,i+1 (x, y), the frame difference result is expressed as:
Figure BDA0003685778070000091
Figure BDA0003685778070000092
in the formula T min Minimum threshold, T, representing frame difference max Representing a maximum threshold for frame differences. In order to alleviate the ghost problem of the moving object in the conventional frame difference method, the frame difference result of the two consecutive frames is subjected to an and operation, which is expressed as:
Figure BDA0003685778070000093
If the AND operation outputs 1, the motion region of the attention feature map of the continuous video frames is labeled and segmented; if it outputs 0, no motion region is labeled or segmented. The forest fire smoke detection result and the forest fire smoke localization result of the forest fire smoke video are generated from the labeled and segmented motion regions.
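A compact sketch of formulas (7) to (9), assuming 8-bit grayscale frames as NumPy arrays; the threshold values are illustrative assumptions.

```python
import numpy as np

def motion_mask(f_prev, f_curr, f_next, t_min=15, t_max=255):
    """Dual-threshold frame differencing over three consecutive frames,
    combined with a logical AND to suppress ghosting."""
    d1 = np.abs(f_curr.astype(np.int16) - f_prev.astype(np.int16))
    d2 = np.abs(f_next.astype(np.int16) - f_curr.astype(np.int16))
    m1 = (d1 >= t_min) & (d1 <= t_max)   # formula (7)
    m2 = (d2 >= t_min) & (d2 <= t_max)   # formula (8)
    return m1 & m2                        # formula (9): AND operation
```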
Further, after the detection result is obtained, a result evaluation step follows. The forest fire smoke detection result of the forest fire smoke video is evaluated with the average accuracy rate (mAP), the average true positive rate (ATPR) and the average true negative rate (ATNR); the forest fire smoke localization result of a single-frame forest fire smoke video is evaluated with the intersection over union (IoU), and the forest fire smoke localization result of continuous-frame forest fire smoke video with the mean intersection over union (mIoU).
The average accuracy rate is calculated as:
mAP = (1/M_3) Σ_{i=1}^{M_3} AP_i  (10)
The average true positive rate is calculated as:
ATPR = (1/M_1) Σ_{j=1}^{M_1} TP_j / (TP_j + FN_j)  (11)
The average true negative rate is calculated as:
ATNR = (1/M_1) Σ_{j=1}^{M_1} TN_j / (TN_j + FP_j)  (12)
The intersection over union is calculated as:
IoU = AO / AU  (13)
The mean intersection over union is calculated as:
mIoU = (1/M_2) Σ_{j=1}^{M_2} IoU_j  (14)
where TP (true positive) is the number of correctly identified video frames in a smoke video sequence, FN (false negative) the number of incorrectly identified video frames in a smoke video sequence, TN (true negative) the number of correctly identified video frames in a non-smoke video sequence, FP (false positive) the number of incorrectly identified video frames in a non-smoke video sequence, M_1 the number of video sequences, AO the intersection of the predicted smoke localization area and the real localization area of a single frame image, AU the union of the predicted smoke localization area and the real localization area of a single frame image, M_2 the number of frames in a smoke video sequence, M_3 the number of classes, and AP_i the average of the precision values over each recall.
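The per-sequence rates and the overlap measures can be sketched as follows, assuming axis-aligned bounding boxes (x1, y1, x2, y2) for the localization results; the box representation is an assumption.

```python
import numpy as np

def atpr(tp_per_seq, fn_per_seq):
    """ATPR, formula (11): mean over sequences of TP / (TP + FN)."""
    return float(np.mean([tp / max(1, tp + fn)
                          for tp, fn in zip(tp_per_seq, fn_per_seq)]))

def atnr(tn_per_seq, fp_per_seq):
    """ATNR, formula (12): mean over sequences of TN / (TN + FP)."""
    return float(np.mean([tn / max(1, tn + fp)
                          for tn, fp in zip(tn_per_seq, fp_per_seq)]))

def iou(pred, gt):
    """IoU = AO / AU for axis-aligned boxes, formula (13)."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    ao = max(0, x2 - x1) * max(0, y2 - y1)
    au = ((pred[2] - pred[0]) * (pred[3] - pred[1])
          + (gt[2] - gt[0]) * (gt[3] - gt[1]) - ao)
    return ao / au if au > 0 else 0.0

def miou(per_frame_ious):
    """mIoU, formula (14): mean IoU over the M_2 frames of a sequence."""
    return float(np.mean(per_frame_ious))
```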
Further, the forest fire smoke video detection method based on the contrastive self-supervised learning network is compared with the 3D-PFCN and 3D-VSSNet smoke video detection methods to evaluate its effectiveness on the public forest fire smoke video dataset. The smoke video dataset is divided into green-background, gray-background and complex-background categories according to the background information of the forest fire smoke videos. To ensure the fairness of the comparison experiments, the methods are compared on three evaluation indices: average true positive rate, average true negative rate and mean intersection over union. The comparison results are shown in Table 1, where CSLN denotes the forest fire smoke video detection method based on the contrastive self-supervised learning network.
Table 1. Comparison of different smoke video detection methods on the public forest fire smoke dataset (the table figure is not reproduced here; its key values are quoted in the discussion below).
As shown in Table 1, the forest fire smoke video detection method based on the contrastive self-supervised learning network (CSLN) exhibits the best performance across the green forest background, the gray forest and blue sky background, and the complex weather background. Specifically, in the green forest background, the three methods achieve the same 100% average true positive rate (ATPR). Although the ATNR (99.79%) and mIoU (86.46%) of the detection method of this embodiment are slightly better than those of the 3D-PFCN method (ATNR 98.54%, mIoU 78.48%) and the 3D-VSSNet method (ATNR 99.62%, mIoU 86.32%), no obvious advantage is shown, indicating that all three methods are suitable for forest fire smoke detection in the simple scene of a green forest background. Against the gray forest and blue sky background, the ATPR and mIoU of the present detection method are respectively 1.06% and 1.99% higher than those of the 3D-VSSNet method, and 3.19% and 6.51% higher than those of the 3D-PFCN method, showing that the forest fire smoke video detection method based on the contrastive self-supervised learning network is highly competitive when detecting smoke video data against a blue sky and white cloud background. Against a complex background, the detection method of this embodiment clearly improves all three evaluation indices over the 3D-PFCN and 3D-VSSNet methods: the ATPR, ATNR and mIoU improve by 5.17%, 2.95% and 5.85% over 3D-VSSNet, and by 12.22%, 5.1% and 9.91% over 3D-PFCN, indicating that weather interference factors such as fog and haze affect this detection method the least. The forest fire smoke video detection method based on the contrastive self-supervised learning network is therefore better suited than the 3D-VSSNet and 3D-PFCN smoke video detection methods to detecting forest fire smoke video targets in complex scenes, verifying the effectiveness of the contrastive self-supervised learning network for forest fire smoke detection against complex backgrounds including clouds, fog and haze.
To verify the effectiveness of the forest fire smoke video detection method based on the contrastive self-supervised learning network in detecting forest fire smoke in a complex environment, detection is performed on the public forest fire smoke video dataset; the results are shown in fig. 2. The DINO attention map is the continuous-video-frame attention feature map output by the unsupervised self-distillation network, the DINO attention region map marks the positions in the original image corresponding to the attention regions of the DINO attention map, and the motion region extraction map is the result of performing continuous-frame motion region detection directly on the attention regions. As shown in figs. 2(a) and 2(d), interference from moving objects such as swaying trees, flying unmanned aerial vehicles and camera shake is eliminated during detection, achieving accurate localization of long-range smoke. As shown in figs. 2(b) and 2(e), complex weather backgrounds including fog and haze are handled, so weather interference is eliminated during detection and long-range smoke is accurately detected in low-resolution video frames. As shown in fig. 2(c), small-target smoke is accurately identified and located against a complex background. In summary, the forest fire smoke video detection method based on the contrastive self-supervised learning network shows high effectiveness and stability when detecting small-target smoke of long-range forest fires under complex background conditions.
To further verify the effectiveness of the forest fire smoke video detection method based on the contrastive self-supervised learning network in detecting long-range forest fire smoke, detection is performed on the forest fire smoke video dataset established in step S1; the results are shown in fig. 3. The video sequences include weather factors such as blue sky, clouds, fog and over-strong illumination: figs. 3(a), 3(c), 3(f) and 3(h) show blue sky backgrounds, fig. 3(b) an over-strong illumination background, figs. 3(d) and 3(g) cloud backgrounds, and fig. 3(i) a fog background, while figs. 3(e) and 3(j) include other non-smoke moving objects. Under the interference of these five weather factors, the detection method accurately detects the smoke while accurately locating the long-range smoke video target, showing good detection performance. These results show that the forest fire smoke video detection method based on the contrastive self-supervised learning network is suitable not only for detecting small-target forest fire smoke video data against complex backgrounds, but also for other open video target detection scenarios, verifying the stability and generalization capability of the detection method.
Based on contrastive self-supervised learning, the method builds an unsupervised self-distillation network to detect forest fire smoke in video. With a cross dual-channel network as its backbone, the unsupervised self-distillation network extracts the local and global features of smoke video against complex environmental backgrounds, learns the semantic information and pseudo labels of continuous video frames, obtains attention feature maps of the continuous video frames, and determines the moving targets of the attention regions within them, achieving high-precision recognition and localization of forest fire smoke in video, improving smoke detection efficiency and performance, and suiting the method to detecting small-target smoke videos of early-stage, long-range forest fires against complex backgrounds.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being covered by the appended claims and their equivalents.

Claims (8)

1. A forest fire smoke video detection method based on a contrastive self-supervised learning network, characterized by comprising the following steps:
acquiring a forest fire smoke video, and establishing a smoke video dataset from the forest fire smoke video;
performing feature learning on the smoke video dataset with a pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames;
and detecting moving objects according to the attention feature maps of the continuous video frames, and acquiring a detection result.
2. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 1, further comprising, before the forest fire smoke video is obtained, pre-constructing the contrastive self-supervised learning network, wherein the contrastive self-supervised learning network comprises a data input module, an unsupervised self-distillation network, a feature extraction backbone network, a motion region extraction module and an output module;
and the feature extraction backbone network is built into the unsupervised self-distillation network.
3. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 2, wherein the feature extraction backbone network is a cross dual-channel network based on a convolutional neural network and a visual Transformer.
4. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 2, wherein performing feature learning on the smoke video dataset with the pre-constructed contrastive self-supervised learning network to obtain attention feature maps of continuous video frames specifically comprises:
inputting the smoke video dataset into the data input module for data enhancement to generate positive and negative video sample images, wherein the positive and negative video sample images comprise global views and local views of the positive and negative video samples;
extracting the features of the positive and negative video sample images with the feature extraction backbone network;
and performing student-network and teacher-network contrastive learning on the positive and negative video sample image features with the unsupervised self-distillation network, and generating semantic information and pseudo labels on the positive and negative video sample images to form the attention feature maps of continuous video frames.
5. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 4, further comprising: optimizing the parameters of the student network with stochastic gradient descent, and optimizing and correcting the parameters of the teacher network with a momentum centering operation.
6. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 4, wherein detecting moving objects according to the attention feature maps of the continuous video frames and acquiring a detection result specifically comprises:
inputting the attention feature maps of the continuous video frames into the motion region extraction module;
acquiring three consecutive frames of video data from the attention feature maps of the continuous video frames with the motion region extraction module, and separately calculating the frame difference of each pair of consecutive frames;
performing an AND operation on the frame difference results of the two pairs of consecutive frames in the motion region extraction module, and labeling and segmenting the motion regions of the attention feature maps according to the AND result, to generate the forest fire smoke detection result and the forest fire smoke localization result of the forest fire smoke video;
and outputting the forest fire smoke detection result and the forest fire smoke localization result with the output module.
7. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 6, further comprising, after the detection result is obtained, a result evaluation step, specifically:
evaluating the forest fire smoke detection result of the forest fire smoke video with the average accuracy rate, the average true positive rate and the average true negative rate;
and evaluating the forest fire smoke localization result of a single-frame forest fire smoke video with the intersection over union, and the forest fire smoke localization result of continuous-frame forest fire smoke video with the mean intersection over union.
8. The forest fire smoke video detection method based on the contrastive self-supervised learning network according to claim 2, further comprising training and optimizing the contrastive self-supervised learning network, specifically comprising:
training the feature extraction backbone network on the large labeled ImageNet-1k dataset with an AdamW optimizer;
and training the unsupervised self-distillation network on the unlabeled ImageNet-1k dataset with an AdamW optimizer.
CN202210645586.1A 2022-06-09 2022-06-09 Forest fire smoke video detection method based on contrast self-supervision learning network Pending CN114998801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210645586.1A CN114998801A (en) 2022-06-09 2022-06-09 Forest fire smoke video detection method based on contrast self-supervision learning network

Publications (1)

Publication Number Publication Date
CN114998801A true CN114998801A (en) 2022-09-02

Family

ID=83033504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210645586.1A Pending CN114998801A (en) 2022-06-09 2022-06-09 Forest fire smoke video detection method based on contrast self-supervision learning network

Country Status (1)

Country Link
CN (1) CN114998801A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409529A (en) * 2023-10-13 2024-01-16 国网江苏省电力有限公司南通供电分公司 Multi-scene electrical fire on-line monitoring method and system
CN117409529B (en) * 2023-10-13 2024-05-24 国网江苏省电力有限公司南通供电分公司 Multi-scene electrical fire on-line monitoring method and system

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111209810B (en) Boundary frame segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time through visible light and infrared images
CN111310862A (en) Deep neural network license plate positioning method based on image enhancement in complex environment
CN103700114B (en) A kind of complex background modeling method based on variable Gaussian mixture number
CN110363770B (en) Training method and device for edge-guided infrared semantic segmentation model
CN109919073B (en) Pedestrian re-identification method with illumination robustness
CN111008608B (en) Night vehicle detection method based on deep learning
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN113537226A (en) Smoke detection method based on deep learning
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN114821374A (en) Knowledge and data collaborative driving unmanned aerial vehicle aerial photography target detection method
CN114998801A (en) Forest fire smoke video detection method based on contrast self-supervision learning network
CN116453033A (en) Crowd density estimation method with high precision and low calculation amount in video monitoring scene
CN111815529B (en) Low-quality image classification enhancement method based on model fusion and data enhancement
CN110334703B (en) Ship detection and identification method in day and night image
CN110084160B (en) Video forest smoke and fire detection method based on motion and brightness significance characteristics
CN116740572A (en) Marine vessel target detection method and system based on improved YOLOX
CN116958780A (en) Cross-scale target detection method and system
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN114387484B (en) Improved mask wearing detection method and system based on yolov4
CN110796008A (en) Early fire detection method based on video image
CN115861595A (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN115690770A (en) License plate recognition method based on space attention characteristics in non-limited scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination