CN116912782B - Firework detection method based on overlapping annotation training - Google Patents

Firework detection method based on overlapping annotation training

Info

Publication number
CN116912782B
CN116912782B (application CN202311181870.9A)
Authority
CN
China
Prior art keywords
model
data
training
module
smoke
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311181870.9A
Other languages
Chinese (zh)
Other versions
CN116912782A (en)
Inventor
刘云川
贺亮
岑亮
易炜
吴雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Hongbao Technology Co ltd
Sichuan Hongbaorunye Engineering Technology Co ltd
Original Assignee
Chongqing Hongbao Technology Co ltd
Sichuan Hongbaorunye Engineering Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Hongbao Technology Co ltd, Sichuan Hongbaorunye Engineering Technology Co ltd filed Critical Chongqing Hongbao Technology Co ltd
Priority to CN202311181870.9A priority Critical patent/CN116912782B/en
Publication of CN116912782A publication Critical patent/CN116912782A/en
Application granted granted Critical
Publication of CN116912782B publication Critical patent/CN116912782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a smoke and fire detection method based on overlapping annotation training, which mainly comprises the following steps: labeling the cleaned standard data with an overlapping labeling method to obtain labeled data; constructing a target detection model and training it with the labeled data; selecting the optimal model through the picture-level F1 index IF1 instead of the traditional mAP index; deploying the optimal model to perform smoke and fire detection; and post-processing all detected smoke and fire target boxes, outputting whether smoke or fire is present together with the coordinates of the target boxes. By selecting the optimal model through the picture-level F1 index mIF1 instead of the traditional mAP index, the selected model meets actual deployment requirements to the greatest extent, is better suited to targets such as smoke and fire whose shape and labeling style are not fixed, and can be deployed on edge devices.

Description

Firework detection method based on overlapping annotation training
Technical Field
The invention belongs to the field of computer vision and artificial intelligence, and particularly relates to a smoke and fire detection method based on overlapping annotation training.
Background
With the continuous progress of technology in artificial intelligence and computer vision, more and more artificial intelligence algorithms are applied to fields such as intelligent security and monitoring and early warning. Computer vision technology allows video image information to be analyzed automatically and intelligently, enabling rapid response to and handling of security events. In scenes such as industrial parks, construction sites, oil and gas well sites and forestry engineering, fire is a major hidden danger: once a fire breaks out and is not extinguished in time, it can cause extremely serious casualties and property loss, so automatic real-time monitoring and alarming of open smoke and open fire is very important work.
For the above requirements there are three popular solutions. The first treats the task as a classification problem and classifies each video frame as containing open smoke or open fire or not; its advantage is that labeling is simple, and its disadvantage is that the position of the smoke or fire cannot be displayed, which is unfavorable for visual presentation. The second treats the task as an ordinary object detection problem, defining open smoke and open fire as targets and labeling them with bounding boxes; its advantage is that target positions can be displayed, and its disadvantages are that the labeling style of the boxes is not unique, which easily makes training difficult, and that, because of the irregular shape of smoke and fire, a lot of background information is easily enclosed in the boxes. The third treats the task as a semantic segmentation problem, performing pixel-level segmentation of the smoke and fire regions in the picture; its advantage is higher precision, and its disadvantages are high labeling cost, slower model inference and unsuitability for edge deployment.
On the premise of providing a visual display of target positions, devising a suitable scheme that maximizes detection accuracy with a simpler labeling style and lower computational resource consumption is therefore a valuable problem.
Disclosure of Invention
In order to solve the above technical problems, the invention discloses a smoke and fire detection method based on overlapping annotation training, which comprises the following steps:
s100: collecting training data;
s200: cleaning the collected training data to obtain cleaned standard data;
s300: labeling the cleaned standard data with an overlapping labeling method to obtain labeled data;
s400: constructing a target detection model and training it with the labeled data;
s500: selecting the optimal model through the picture-level F1 index IF1;
the principle of selecting the optimal model is: judge only whether the picture contains open smoke or open fire, while ensuring that no false alarm is raised during training;
s600: deploying the optimal model to perform smoke and fire detection;
s700: post-processing all detected smoke and fire target boxes, and outputting whether smoke or fire is present together with the coordinates of the target boxes.
Preferably, the step S500 further includes:
s501: adopting the picture-level F1 index IF1 during training of the target detection model;
s502: averaging the IF1 values of the smoke class and the fire class to obtain the final evaluation index mIF1 = (1/n) × Σ IF1_i, wherein n = 2;
s503: after training, selecting the model with the highest mIF1 value as the optimal model.
Preferably, IF1 is calculated as IF1 = 2 × IP × IR / (IP + IR), with IP = ITP / (ITP + IFP) and IR = ITP / (ITP + IFN);
wherein IP represents the picture-level precision, IR represents the picture-level recall, ITP indicates that a detected picture contains a certain class of target and the label also contains that target; IFP indicates that the predicted picture contains a certain class of target but the label does not; IFN indicates that the predicted picture does not contain a certain class of target but the label does.
Preferably, the training data includes: open-source smoke and fire detection datasets, pictures crawled with fire- and flame-related keywords, and data collected on site.
Preferably, the data cleaning means: keeping only the picture data of the open-source datasets, or only removing irrelevant pictures, or converting videos into pictures.
Preferably, the overlapping labeling method labels smoke and fire with multiple overlapping, densely placed small boxes.
Preferably, the post-processing in step S700 includes performing non-maximum suppression and confidence-threshold filtering on the obtained smoke and fire target boxes.
According to the technical scheme, the data are labeled with the overlapping labeling method for subsequent training, which greatly increases the number of positive samples, alleviates the positive-negative sample imbalance problem in object detection, reduces the mislabeling of background information and effectively improves model accuracy. Finally, the optimal model is selected through the picture-level F1 index mIF1 instead of the traditional mAP index, so that the selected model meets actual deployment requirements to the greatest extent and is better suited to targets such as smoke and fire whose shape and labeling style are not fixed. The overall scheme has high development efficiency, low data labeling cost and a large accuracy gain, while reducing the computational resource requirements of edge devices.
Drawings
FIG. 1 is a flow chart of a method for smoke detection based on overlay annotation training provided in one embodiment of the invention;
FIG. 2 is a comparison of labeling schemes provided in one embodiment of the invention;
FIG. 3 is a schematic diagram of the detection model calculation flow of the present invention provided in one embodiment of the present invention.
Detailed Description
In order for those skilled in the art to understand the technical solutions disclosed in the present invention, the technical solutions of the various embodiments will be described below with reference to the embodiments and the related fig. 1 to 3, where the described embodiments are some embodiments, but not all embodiments of the present invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate that the embodiments described herein may be combined with other embodiments.
Referring to FIG. 1, in one embodiment, the invention discloses a smoke and fire detection method based on overlapping annotation training, the method comprising the steps of:
s100: collecting training data;
s200: cleaning the collected training data to obtain cleaned data;
s300: labeling the cleaned data with an overlapping labeling method;
s400: constructing a target detection model and training it with the labeled data;
s500: selecting the optimal model through the picture-level F1 index IF1;
the principle of selecting the optimal model is: judge only whether the picture contains open smoke or open fire, while ensuring that no false alarm is raised during training;
that is, the invention only needs to judge whether the picture contains open smoke or open fire, with no false alarm during training, and does not consider whether the overlap between a target box predicted by the model and a labeled box is small; the traditional mAP index is therefore replaced by the picture-level F1 index;
s600: deploying the optimal model to perform smoke and fire detection;
s700: post-processing all detected smoke and fire target boxes, and outputting whether smoke or fire is present together with the coordinates of the target boxes.
For this embodiment, the method is better suited to targets such as smoke and fire whose shape and labeling style are not fixed. First the training data are labeled with the overlapping labeling method; then an object detection algorithm is trained on the collected image data, the optimal model is selected during training with the picture-level F1 index, and finally the obtained model is deployed. At deployment time, model inference is performed on the input image, all detected smoke and fire target boxes are post-processed, and finally whether open smoke or open fire is present is output together with the coordinates of the target boxes.
Compared with the mainstream labeling style of directly drawing one large box, labeling the raw data with the overlapping labeling method effectively increases the number of positive samples, greatly reduces interference from background information and makes the model pay more attention to the texture information of the target, which lowers training difficulty and improves model accuracy. Compared with the mainstream mAP index, the picture-level F1 index selects an optimal model that performs better in practice and is better suited to targets such as smoke and fire whose shape and labeling style are not fixed. The overall scheme is simple and effective, can greatly improve the detection accuracy of open smoke and open fire, reduces false alarms, and is particularly suitable for deployment on edge devices with limited computing resources.
In another embodiment, the training data comprises: open-source smoke and fire detection datasets, pictures crawled with fire- and flame-related keywords, and data collected on site.
For this embodiment, the training data are first collected and fall into three categories. The first category is open-source smoke and fire detection datasets that are directly searched for and downloaded. The second category is pictures crawled with a custom crawler using keywords related to fire and flame. The third category is on-site collection: cameras are first deployed on site at different angles and distances, and then, on the premise of ensuring safety, different materials (such as paper, plastic and cloth) are ignited and recorded on video, simulating different time periods (early morning, noon and evening) and different illumination intensities (natural light and strong artificial light) as far as possible.
In another embodiment, the data cleaning means: keeping only the picture data of the open-source datasets, or only removing irrelevant pictures, or converting videos into pictures.
For this embodiment, after data collection is completed, the data need initial cleaning and related pre-processing. Because a new labeling scheme is proposed, the labels provided by the open-source datasets are not applicable: for the first category only the picture data of the open-source datasets are kept, for the second category only a few irrelevant pictures need to be removed, and for the third category the videos need to be converted into pictures, for example by sampling frames as sketched below.
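For illustration only, the following is a minimal Python sketch of converting the third category of data (on-site videos) into picture files, assuming OpenCV is available; the file paths and the sampling interval are illustrative and are not specified by the patent.

import cv2
import os

def video_to_frames(video_path, out_dir, every_n=25):
    # Save one frame out of every `every_n` frames of the video as a JPEG picture.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved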
In another embodiment, the overlapping labeling method labels smoke and fire with multiple overlapping, densely placed small boxes.
For this embodiment, after data cleaning and pre-processing, the data need to be annotated. A general object detection model would label a cluster of flames or a cloud of smoke as one large box, which easily causes several problems. 1. Unlike specific targets such as human bodies or cars, smoke and fire have no fixed shape, so for a large amount of smoke and fire mixed together it is difficult to label each instance with a clear box; if individual boxes are drawn anyway, different annotators label differently, which raises the convergence difficulty of the model and easily leads to missed labels, while drawing one whole box introduces more ambiguity, so the model cannot distinguish the smoke and fire targets well. 2. Even an isolated plume of smoke or flame is usually not a standard rectangle; labeling it directly with one rectangular box encloses a large amount of background area. Unlike other detection targets, smoke and fire detection mainly learns texture information, so a directly labeled large box easily contains irrelevant background information and raises training difficulty; after conventional data augmentation operations such as translation and mosaic stitching, it can even happen that a labeled smoke or fire box contains only background area, which harms model accuracy even more.
The overlapping labeling method proposed by the invention labels smoke and fire with multiple overlapping dense small boxes. This increases the number of positive samples, reduces the background information enclosed by the boxes, and makes the model pay more attention to texture information rather than contour shape. At deployment time, non-maximum suppression is applied in post-processing, so the large number of predicted overlapping boxes can be de-duplicated and the visualization remains concise and intuitive. A comparison is shown in FIG. 2: the left-hand picture of FIG. 2 is the general labeling scheme and the right-hand picture of FIG. 2 is the overlapping labeling scheme.
The overlapping dense small boxes are not specially restricted; the main requirement is that the small boxes completely cover the smoke and fire targets without containing background. Because labeled boxes are axis-aligned while smoke and fire targets are not rectangular, each box can be made as large as possible in the horizontal and vertical directions as long as it does not reach the background. Enclosed background matters here: targets such as human bodies, vehicles and animals have shape and semantic features, so a little enclosed background has small influence, whereas smoke and fire only have texture features, so enclosed background interferes with training, especially after data augmentation such as translation and mosaic, when a box containing only background area may fall within the picture range. In summary, there are two main characteristics: 1. overlapping is mainly used to increase the number of positive target samples, and since the same picture admits different labeling possibilities, overlapping also alleviates this variation to a certain extent; 2. dense small boxes are used to cover the whole smoke and fire target area as fully as possible without background. A sketch of generating such overlapping boxes is given below.
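For illustration only, the following minimal Python sketch shows one way to generate overlapping dense small boxes from a binary smoke/fire mask: a small window slides with a stride smaller than the window size (so neighbouring boxes overlap) and every window that is mostly foreground is kept. The window size, stride and the 0.6 coverage threshold are illustrative assumptions, not values fixed by the patent, and the actual annotation is performed manually.

import numpy as np

def overlapping_boxes(mask, box=64, stride=32, min_cover=0.6):
    # mask: H x W array with 1 for smoke/fire pixels and 0 for background.
    mask = np.asarray(mask, dtype=float)
    h, w = mask.shape
    boxes = []
    for y in range(0, max(h - box, 1), stride):
        for x in range(0, max(w - box, 1), stride):
            window = mask[y:y + box, x:x + box]
            if window.mean() >= min_cover:      # keep boxes that lie mostly on the target
                boxes.append((x, y, x + box, y + box))
    return boxes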
In another embodiment, a one-stage object detection model is selected as the detection model.
For this embodiment, after the data annotation is completed, construction of the detection model can begin. To balance speed and accuracy, the model may be chosen from commonly used one-stage object detection models of the YOLO series, such as YOLOv3, YOLOv4, YOLOv5, YOLOv6 and YOLOv7.
In another embodiment, the step S500 further includes:
s501: adopting the picture-level F1 index IF1 during training of the target detection model;
s502: averaging the IF1 values of the two classes, smoke and fire, to obtain the final evaluation index mIF1 = (1/n) × Σ IF1_i, wherein n = 2;
s503: after training, selecting the model with the highest mIF1 value as the optimal model.
For this embodiment, a model is built and trained with the aforementioned data, and then the best model is selected according to the picture-level F1 index proposed by the invention.
Object detection models are usually trained and selected with the mAP index, but it is not suitable for targets such as smoke and fire whose shape and labeling style are not fixed. For example, a target box predicted by the model may overlap only slightly with a labeled box yet still correctly enclose smoke or fire; mAP gives it no credit, although it is acceptable in the actual scene. In other words, it only matters whether the picture is judged to contain open smoke or open fire, not whether the overlap between the predicted box and the labeled box is below some preset range. Therefore mAP is replaced by a picture-level F1 index, referred to here as IF1. It is understood that during training the predicted target boxes must not produce false alarms.
Because the invention detects two targets, open smoke and open fire, the IF1 values of the two classes are averaged to obtain the final evaluation index mIF1 = (1/n) × Σ IF1_i.
In this example n = 2; the model of each iteration and the corresponding mIF1 value are kept during training, and after training the model with the highest mIF1 value is selected as the model finally used.
In another embodiment, IF1 is calculated as IF1 = 2 × IP × IR / (IP + IR), with IP = ITP / (ITP + IFP) and IR = ITP / (ITP + IFN);
wherein IP represents the picture-level precision (Image Precision), IR represents the picture-level recall (Image Recall), ITP indicates that a detected picture contains a certain class of target and the label also contains that target; IFP indicates that the predicted picture contains a certain class of target but the label does not; IFN indicates that the predicted picture does not contain a certain class of target but the label does.
For this embodiment, the TP in conventional object detection is computed from the overlap of two boxes, whereas the picture-level TP only compares whether the picture contains smoke or fire. All of these indices are picture-level modifications of the ordinary F1, P, R, TP, FP and FN indices, instead of being computed per box.
Examples are as follows:
if there are 4 verification pictures, the numbers are 0 to 3, respectively:
0: the cigarette has bright smoke and no bright fire;
1: the cigarette has bright smoke and no bright fire;
2: no open smoke and open fire;
3: there is open smoke and open fire;
and their model predictions are:
0: there is open smoke and open fire;
1: no open smoke and no open fire;
2: no open smoke and no open fire;
3: there is open smoke and open fire;
then by definition:
ITP bright smoke=2 (2 pictures, i.e. 0, 3 predicted bright smoke and corresponding label also bright smoke);
IFP bright smoke = 0 (there is no picture detected to contain bright smoke but the label does not contain bright smoke);
ifnMing tobacco=1 (1 picture, no 1 predicts no Ming tobacco but labeled Ming tobacco);
then, according to the above formula, IF bright smoke=2/(2+0) =1, ir bright smoke=2/(2+1) =0.67, IF1 bright smoke=2×1×0.67/(1+0.67) =0.80.
The same is true for open flame targets:
ITP open flame = 1 (1 picture, i.e. number 3 predicts open flame and corresponding tag also open flame);
IFP open flame = 1 (1 picture i.e. number 0 predicts open flame but label does not contain open flame);
IFN open flame = 1 (1 picture i.e. No. 2 predicts no open flame but the label has an open flame);
then, according to the above formula, IF open fire=1/(1+1) =0.5, and IF open fire=1/(1+1) =0.5, IF1 open fire=2×0.5×0.5/(0.5+0.5) =0.5;
finally, mfr1= (IF 1 bright smoke+if1 bright fire)/2= (0.8+0.5)/2=0.65.
In another embodiment, the post-processing in step S700 includes non-maximum suppression and confidence-threshold filtering of the obtained smoke and fire target boxes.
For this embodiment, at deployment time object detection inference is performed with the detection model, and the obtained detection boxes are then subjected to non-maximum suppression and confidence-threshold filtering. If the post-processing result is empty, no early-warning information is output; if the post-processing result contains target boxes, the corresponding smoke or fire early warning is issued.
In another embodiment, the DeepStream streaming-media framework is adopted, and the object detection algorithm is the s model of YOLOv5 version 6.1.
YOLOv5 version 6.1 is adopted, and the small s model is selected to facilitate edge deployment. Any picture (or video frame) to be detected is first scaled so that its longest side becomes 640, and the short side is then padded with pixel value 114 up to 640. Since a natural image has 3 RGB channels, the input dimension of the model is 3x640x640; a sketch of this pre-processing is given below.
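For illustration only, the following minimal Python sketch implements the pre-processing described above (longest side scaled to 640, short side padded with pixel value 114); placing the padding at the bottom/right and normalizing to [0, 1] are assumptions, since the text only specifies the target size and the padding value.

import cv2
import numpy as np

def letterbox_640(img, size=640, pad_value=114):
    h, w = img.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    canvas[:resized.shape[0], :resized.shape[1]] = resized   # pad the short side with 114
    # HWC uint8 -> CHW float32 in [0, 1], giving a 3x640x640 input
    return canvas.transpose(2, 0, 1).astype(np.float32) / 255.0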
The picture data are input into the model: features are first extracted by the backbone network, feature maps are then spliced through the SPPF structure and the subsequent layers, and finally the target feature vectors are output through several 2D convolution output heads. The overall model flow is shown in FIG. 3 and proceeds as follows:
for an input picture, model layer 0 is first a Conv module consisting of a 2-dimensional convolution layer, a BatchNorm layer and a SiLU activation function layer. The 2-dimensional convolution layer has a 6x6 kernel, 32 kernels, stride 2x2 and padding 2, so after the Conv module the output dimension becomes 32x320x320. The convolution operation multiplies the input data element-wise by a matrix within each window of kernel size and sums the result, and the weights of each convolution kernel are reused, i.e. the same kernel is applied at different coordinate positions of the input data;
in other words, the convolution layer takes a matrix (the convolution kernel) of a specific size and scans it sequentially over the input feature map, computing the inner product with each region of the same size.
The calculation steps of BatchNorm are as follows:
1) for a batch of input data B = {x_1, ..., x_m}, first compute the mathematical expectation (mean) of all x_i: μ_B = (1/m) × Σ x_i, and the corresponding variance: σ_B² = (1/m) × Σ (x_i - μ_B)²;
2) then normalize x: x̂_i = (x_i - μ_B) / sqrt(σ_B² + ε);
3) in the most important step, introduce the scaling and translation variables γ and β and compute the normalized output: y_i = γ × x̂_i + β;
where B is one batch of input data, m is the number of samples in the batch, ε is a small constant (for example 0.000001) that prevents the denominator from being 0, x̂_i is the normalized data, the scaling and translation variables γ and β are trainable parameters, and y_i is the final batch output.
The SiLU activation function is calculated as: SiLU(x) = x × sigmoid(x) = x / (1 + e^(-x)),
where e is the base of the natural logarithm.
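For illustration only, the following minimal PyTorch sketch shows a Conv module as described above (2-dimensional convolution + BatchNorm + SiLU), instantiated with the layer-0 parameters from the text (3 input channels, 32 kernels of size 6x6, stride 2, padding 2); it is a sketch consistent with the description, not the exact implementation.

import torch
import torch.nn as nn

class Conv(nn.Module):
    def __init__(self, c_in, c_out, k, s, p):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

layer0 = Conv(3, 32, k=6, s=2, p=2)
print(layer0(torch.zeros(1, 3, 640, 640)).shape)   # torch.Size([1, 32, 320, 320])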
Then the model layer 1 is also a Conv module, and the model layer is also composed of a 2-dimensional convolution layer, a BatchNorm layer and a SiLU activation function layer, wherein the convolution kernel size of the 2-dimensional convolution layer is 3x3, the number of convolution kernels is 64, the step size is 2x2, and the filling is 1. The output dimension is changed into 64x160x160 after the Conv module.
Then, through model layer 2, a C3 module, wherein the C3 module is composed of 3 Conv modules and a Bottleneck module, the Bottleneck module is a residual structure composed of two Conv modules, namely, for an input feature x, x1 is obtained through the first Conv module, x2 is obtained through the second Conv module through the x1, then x and x2 are added to be used as final output, the number of convolution kernels of the first Conv module of the Bottleneck is 32, the convolution kernel size is 1x1, the step size is 1x1, the number of convolution kernels of the second Conv module of the Bottleneck is 32, the convolution kernel size is 3x3, the step size is 1x1, the filling is 1, and the activation function is SiLU. For the C3 module here, the first and second Conv modules have the same structure, i.e. the number of convolution kernels is 32, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the convolution kernels of the third Conv module is 64, the step sizes of the convolution kernels are all 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 64x160x160, the output is output as 32x160x160 through a first Conv module, then the output is output as 32x160x160 through a Bottleneck module, then the original input is output as 32x160x160 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain the output of 64x160x160, and then the final output or 64x160x160 is obtained through a third Conv module.
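For illustration only, the following minimal PyTorch sketch shows the Bottleneck and C3 modules as described above, reusing the Conv class from the previous sketch; the channel widths follow the layer-2 description (64 input channels, hidden width 32, 64 output channels), and it is a sketch consistent with the description rather than the exact implementation.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # residual structure built from two Conv modules
    def __init__(self, c):
        super().__init__()
        self.cv1 = Conv(c, c, k=1, s=1, p=0)
        self.cv2 = Conv(c, c, k=3, s=1, p=1)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    # 3 Conv modules and n Bottleneck modules, with channel-wise concatenation
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        hidden = c_out // 2
        self.cv1 = Conv(c_in, hidden, k=1, s=1, p=0)
        self.cv2 = Conv(c_in, hidden, k=1, s=1, p=0)
        self.cv3 = Conv(2 * hidden, c_out, k=1, s=1, p=0)
        self.m = nn.Sequential(*(Bottleneck(hidden) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

layer2 = C3(64, 64, n=1)
print(layer2(torch.zeros(1, 64, 160, 160)).shape)   # torch.Size([1, 64, 160, 160])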
And then passing through a model layer 3 and a Conv module, wherein the Conv module consists of a 2-dimensional convolution layer and a SiLU activation function layer, the convolution kernel of the 2-dimensional convolution layer has a size of 3x3, the number of convolution kernels is 128, the step size is 2x2, the filling is 1, and the output dimension is 128x80x80 after passing through the Conv module.
And then passing through a model layer 4, and a C3 module, wherein the model layer 4 comprises 3 Conv modules and two Bottleneck modules, the first Bottleneck module is a residual structure formed by the two Conv modules, the number of convolution kernels of the first Conv module is 128, the size of the convolution kernels is 1x1, the step size is also 1x1, the activation function is SiLU, the number of convolution kernels of the second Conv module of Bottleneck is also 64, the size of the convolution kernels is 3x3, the step size is also 1x1, the filling is 1, and the activation function is SiLU. The second Bottleneck modular structure core is the same as the first. For the C3 module herein, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 128, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 256, the size of the convolution kernels is 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 128x80x80, the output is 64x80x80 through a first Conv module, then the output is 64x80x80 through two Bottleneck modules, then the original input is 64x80x80 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain 128x80x80 output, and then the final output or 128x80x80 is obtained through a third Conv module.
And then passing through a model layer 5, a Conv module, which consists of a 2-dimensional convolution layer and a SiLU activation function layer, wherein the convolution kernel of the 2-dimensional convolution layer has a size of 3x3, the number of the convolution kernels is 256, the step size is 2x2, the filling is 1, and the output dimension is changed into 256x40x40 after passing through the Conv module.
And a C3 module is formed by 3 Conv modules and 3 Bottleneck modules after passing through a model layer 6, wherein the first Bottleneck module is a residual structure formed by two Conv modules, the number of convolution kernels of the first Conv module is 128, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, and the structures of the second and third Bottleneck modules are the same as those of the first Bottleneck module. For the C3 module herein, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 128, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 256, the size of the convolution kernels is 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 256x40x40, the output is 128x40x40 through a first Conv module, then the output is 128x40x40 through two Bottleneck modules, then the original input is 128x40x40 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain 256x40x40 output, and then the final output or 256x40x40 is obtained through a third Conv module.
And then passing through a model layer 7, a Conv module consisting of a 2-dimensional convolution layer and a SiLU activation function layer, wherein the convolution kernel size of the 2-dimensional convolution layer is 3x3, the number of the convolution kernels is 512, the step size is 2x2, the filling is 1, and the output dimension is 512x20x20 after passing through the Conv module.
And then through the model layer 8, a C3 module is composed of 3 Conv modules and 1 Bottleneck module, wherein the Bottleneck module is a residual structure composed of two Conv modules, the number of convolution kernels of the first Conv module is 256, the size of the convolution kernels is 1x1, the step size is 1x1, the number of convolution kernels of the second Conv module of the Bottleneck module is 256, the size of the convolution kernels is 3x3, the step size is 1x1, the filling is 1, and the activation function is SiLU. For the C3 module here, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 256, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 512, the step sizes of the convolution kernels are both 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 512x20x20, the output is 256x20x20 through a first Conv module, then the output is 256x20x20 through two Bottleneck modules, then the original input is 256x20x20 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain 512x20x20 output, and then the final output or 512x20x20 is obtained through a third Conv module.
And then passing through a model layer 9, SPPF module, which consists of two Conv modules and a maximum pooling layer. The number of convolution kernels of the first Conv module is 256, the size of the convolution kernel is 1x1, the step length is also 1x1, the activation function is SiLU, the number of convolution kernels of the second Conv module of the Bottleneck is 512, the size of the convolution kernel is 3x3, the step length is also 1x1, the filling is 1, and the activation function is SiLU. For the feature vector input 512x20x20, the first Conv module outputs 256x20x20 to obtain x1, then the maximum pooling output 256x20x20 of x1 to obtain y1, the maximum pooling output 256x20x20 of y1 to obtain y2, the maximum pooling output 256x20x20 of y2 to obtain y3, and then the joint output 1024x20x20 of x1, y2 and y3 re-channel dimensions are processed through the second Conv module to obtain the feature map with the final output 512x20x20.
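For illustration only, the following minimal PyTorch sketch shows the SPPF module as described above (two Conv modules around three chained max-pooling steps), again reusing the Conv class from the earlier sketch; the 5x5 pooling kernel is an assumption taken from common YOLOv5 configurations, since the text only gives the channel dimensions.

import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, c_in=512, c_out=512, k=5):
        super().__init__()
        hidden = c_in // 2                              # 256 in the description
        self.cv1 = Conv(c_in, hidden, k=1, s=1, p=0)
        self.cv2 = Conv(hidden * 4, c_out, k=1, s=1, p=0)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x1 = self.cv1(x)                                # 256x20x20
        y1 = self.pool(x1)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat((x1, y1, y2, y3), dim=1))   # 1024 -> 512 channels

print(SPPF()(torch.zeros(1, 512, 20, 20)).shape)   # torch.Size([1, 512, 20, 20])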
And then passing through a model layer 10 and Conv modules, wherein the model layer comprises a 2-dimensional convolution layer and a SiLU activation function layer, the convolution kernel size of the 2-dimensional convolution layer is 1x1, the number of convolution kernels is 256, the step size is 1x1, and the output dimension is changed into 256x20x20 after passing through the Conv modules.
Then through model 11 th layer, an up sampling layer, up sampling is carried out by 2 times by using the nearest interpolation mode, and the output is changed to 256x40x40.
And then the output characteristics 256x40x40 of the 6 th layer and the output characteristics 256x40x40 of the 6 th layer are spliced in the channel dimension through a Concat splicing layer to obtain the output of 512x40 x40.
And a C3 module is formed by 3 Conv modules and 1 Bottleneck module through a model layer 13, wherein the Bottleneck module is a residual structure formed by two Conv modules, the number of convolution kernels of the first Conv module is 128, the convolution kernel size is 1x1, the step size is 1x1, the number of convolution kernels of the second Conv module of the Bottleneck is 128, the convolution kernel size is 3x3, the step size is 1x1, the filling is 1, and the activation function is SiLU. For the C3 module herein, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 128, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 256, the size of the convolution kernels is 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 512x40x40, the output is 128x40x40 through a first Conv module, then the output is 128x40x40 through two Bottleneck modules, then the original input is 128x40x40 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain 256x40x40 output, and finally the output 256x40x40 is obtained through a third Conv module.
And then passing through a model layer 14, a Conv module, which consists of a 2-dimensional convolution layer and a SiLU activation function layer, wherein the convolution kernel of the 2-dimensional convolution layer has a size dimension of 1x1, the number of convolution kernels is 128, the step size is 1x1, and the output dimension is 128x40x40 after passing through the Conv module.
Then through model layer 15, an up-sampling layer, up-sampling is performed by 2 times by using the nearest interpolation mode, and the output becomes 128x80x80.
And then the output characteristics 128x80x80 of the previous layer 4 and the output characteristics 128x80x80 are spliced to obtain the output of 256x80x80 in the channel dimension through a Concat splicing layer of the model layer 16.
And then through a model layer 17, a C3 module is composed of 3 Conv modules and 1 Bottleneck module, wherein the Bottleneck module is a residual structure composed of two Conv modules, the number of convolution kernels of the first Conv module is 64, the size of the convolution kernels is 1x1, the step length is also 1x1, the number of convolution kernels of the second Conv module of the Bottleneck module is also 64, the size of the convolution kernels is 3x3, the step length is also 1x1, the filling is 1, and the activation function is SiLU. For the C3 module herein, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 64, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 128, the step sizes of the convolution kernels are both 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 256x80x80, the input feature vector is output as 64x80x80 through a first Conv module, then the output is output through two Bottleneck modules to obtain 64x80x80, then the original input is output as 64x80x80 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain 128x80x80 output, and finally the output 128x80x80 is obtained through a third Conv module.
And then passing through a model 18 th layer, a Conv module, which consists of a 2-dimensional convolution layer, a BatchNorm layer and a SiLU activation function layer, wherein the convolution kernel of the 2-dimensional convolution layer has a size of dimension 1x1, the number of convolution kernels is 128, the step size is 1x1, and the output dimension is 128x40x40 after passing through the Conv module.
And then the output characteristics 128x40x40 of the 14 th layer and the output characteristics 128x40x40 of the 14 th layer are spliced to obtain the output of 256x40x40 in the channel dimension through a Concat splicing layer of the model 19 th layer.
And a C3 module passing through a model layer 20 is composed of 3 Conv modules and 1 Bottleneck module, wherein the Bottleneck module is a residual structure composed of two Conv modules, the number of convolution kernels of the first Conv module is 64, the size of the convolution kernels is 1x1, the step size is 1x1, the number of convolution kernels of the second Conv module of the Bottleneck module is 64, the size of the convolution kernels is 3x3, the step size is 1x1, the filling is 1, and the activation function is SiLU. For the C3 module herein, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 64, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 128, the step sizes of the convolution kernels are both 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 256x40x40, the output is output as 64x40x40 through a first Conv module, then the output is output as 64x40x40 through two Bottleneck modules, then the original input is output as 64x40x40 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain output of 128x40x40, and finally the output is output as 256x40x40 through a third Conv module.
And then passing through a model 21 st layer, a Conv module, which consists of a 2-dimensional convolution layer, a BatchNorm layer and a SiLU activation function layer, wherein the convolution kernel of the 2-dimensional convolution layer has a size of dimension 1x1, the number of convolution kernels is 256, and the step size is 1x1. After passing through the Conv module, the output dimension is changed into 256x20x20.
Then the model layer 22 and a Concat splicing layer splice the previous layer 10 output characteristics 256x20x20 and 256x20x20 in the channel dimension to obtain 512x20x20 output.
And a C3 module passing through a model 23 layer, wherein the model is composed of 3 Conv modules and 1 Bottleneck module, the Bottleneck module is a residual structure composed of two Conv modules, the number of convolution kernels of the first Conv module is 256, the size of the convolution kernels is 1x1, the step size is 1x1, the number of convolution kernels of the second Conv module of the Bottleneck module is 256, the size of the convolution kernels is 3x3, the step size is 1x1, the filling is 1, and the activation function is SiLU. For the C3 module here, the first and second Conv modules are the same in structure, i.e., the number of convolution kernels is 256, the size of the convolution kernels is 1x1, the step size is 1x1, the activation function is SiLU, the number of the third Conv modules is 512, the step sizes of the convolution kernels are both 1x1, and the activation function is SiLU. The calculation flow of the C3 module is that for the input feature vector 512x20x20, the output is 256x20x20 through a first Conv module, then the output is 256x20x20 through two Bottleneck modules, then the original input is 256x20x20 through a second Conv module, then the two outputs are spliced in the channel dimension to obtain 512x20x20 output, and finally the output 512x20x20 is obtained through a third Conv module.
Finally, the layer-17 output 128x80x80, the layer-20 output 256x40x40 and the layer-23 output 512x20x20 are fed together into the detection module. For each of the three feature maps a group of anchor boxes preset by the Kmeans clustering method is provided, 3 anchors per group, and the dimension to be predicted by the network per anchor is 7: 1 open-smoke class + 1 open-flame class + 1 objectness (target/background) score + 4 coordinate values. The three groups of output feature maps are then each passed through a 2-dimensional convolution layer with 3 x 7 = 21 convolution kernels of size 1x1 and stride 1, after which the dimension order is rearranged to obtain [3, 80, 80, 7], [3, 40, 40, 7] and [3, 20, 20, 7], respectively; a sketch of this head is given below.
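For illustration only, the following minimal PyTorch sketch shows the detection-head computation described above: each of the three feature maps passes through a 1x1 convolution with 3 x 7 = 21 output channels (3 anchors per position, 7 predicted values per anchor), and the result is rearranged so that the last dimension carries the 7 values; it is a sketch of the output layout, not the exact implementation.

import torch
import torch.nn as nn

heads = nn.ModuleList([nn.Conv2d(c, 3 * 7, kernel_size=1) for c in (128, 256, 512)])
features = [torch.zeros(1, 128, 80, 80),
            torch.zeros(1, 256, 40, 40),
            torch.zeros(1, 512, 20, 20)]

outputs = []
for head, feat in zip(heads, features):
    b, _, h, w = feat.shape
    out = head(feat).view(b, 3, 7, h, w).permute(0, 1, 3, 4, 2)   # [B, 3, H, W, 7]
    outputs.append(out)

print([tuple(o.shape[1:]) for o in outputs])   # [(3, 80, 80, 7), (3, 40, 40, 7), (3, 20, 20, 7)]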
The Kmeans anchor clustering algorithm proceeds as follows: 1) traverse the training dataset and read the width and height of every labeled box, each width-height pair forming one set of coordinates; 2) randomly select k labeled-box coordinates as the center points of the cluster sets (k = 9 here); 3) compute the distance from each labeled box to the k set centers and assign each box to the set whose center is closest in Euclidean distance; 4) if the boxes in every set no longer change, terminate and output the widths and heights represented by the cluster centers as the result; otherwise update the cluster centers by taking the mean of all box width-height coordinates in each set as the new center, and repeat. A sketch of this clustering is given below.
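For illustration only, the following minimal Python sketch implements the anchor clustering described above: plain k-means with Euclidean distance on the (width, height) pairs of all labeled boxes, with k = 9; the random initialization and the iteration cap are illustrative, and some YOLO implementations cluster on an IoU-based distance instead.

import numpy as np

def kmeans_anchors(wh, k=9, iters=300, seed=0):
    # wh: [N, 2] array of labeled-box widths and heights
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the cluster center closest in Euclidean distance
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):    # assignments no longer change
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]   # anchors sorted by area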
During training, the feature maps output by the three groups of head convolution layers are taken directly and the loss against the labels is computed: the class loss of the boxes uses BCELoss, and the loss of the box coordinates, width and height uses CIoULoss.
The BCELoss is calculated as:
BCELoss = -[target × log(pred) + (1 - target) × log(1 - pred)],
where pred represents the model prediction value and target represents the label value.
The CIoULoss is calculated as:
CIoULoss = 1 - IoU(A, B) + ρ²(b, b_gt) / c² + α × v,
v = (4/π²) × (arctan(w_gt / h_gt) - arctan(w / h))²,
α = v / ((1 - IoU(A, B)) + v),
wherein A and B are the two target boxes to be compared and IoU(A, B) is their intersection-over-union; w_gt and h_gt represent the width and height of the labeled box; w and h represent the width and height of the predicted box; ρ(·) denotes the Euclidean distance; b represents the center point of the predicted box and b_gt the center point of the labeled box; c represents the diagonal length of the smallest box enclosing both boxes; and α and v are influencing factors.
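For illustration only, the following minimal PyTorch sketch computes the CIoU loss according to the standard formulation reconstructed above, for boxes given as (cx, cy, w, h); the box format and the epsilon terms are assumptions, since the text only names the quantities involved.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # pred, target: tensors of shape [..., 4] holding (cx, cy, w, h)
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)
    # intersection and union
    inter_w = (torch.min(px + pw / 2, tx + tw / 2) - torch.max(px - pw / 2, tx - tw / 2)).clamp(0)
    inter_h = (torch.min(py + ph / 2, ty + th / 2) - torch.max(py - ph / 2, ty - th / 2)).clamp(0)
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union
    # squared center distance over squared diagonal of the smallest enclosing box
    cw = torch.max(px + pw / 2, tx + tw / 2) - torch.min(px - pw / 2, tx - tw / 2)
    ch = torch.max(py + ph / 2, ty + th / 2) - torch.min(py - ph / 2, ty - th / 2)
    rho2 = (px - tx) ** 2 + (py - ty) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (torch.atan(tw / (th + eps)) - torch.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v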
At prediction time, the three groups of outputs are concatenated and reshaped into data of dimension [25200, 7], where the 7 values are the box center coordinate x, box center coordinate y, box width w, box height h, the foreground (objectness) confidence, the confidence of class 1 and the confidence of class 2; 25200 is the total number of anchor boxes over all positions of all feature layers.
The result then needs post-processing: a large number of invalid boxes are first removed by filtering out boxes whose foreground objectness confidence is smaller than 0.1, and the remaining boxes are sent to the NMS algorithm for de-duplication.
The NMS procedure is as follows: 1) take the box with the highest confidence score out of the candidate box set A; 2) compute the intersection-over-union IoU between the taken box and every remaining box in set A, delete from A all boxes whose IoU is greater than the threshold (set to 0.5 here), and add the taken box to a new set B; 3) repeat steps 1 and 2 until the candidate box set A is empty, and output set B as the NMS result. This gives the final detection result of the detection model; a sketch of the post-processing is given below.
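For illustration only, the following minimal Python sketch implements the post-processing described above: predictions with foreground (objectness) confidence below 0.1 are discarded, and the remaining boxes are de-duplicated by NMS with an IoU threshold of 0.5; the class-agnostic treatment and the (cx, cy, w, h) input layout follow the 7-value rows described above but are otherwise assumptions.

import numpy as np

def iou_xyxy(a, b):
    # IoU between one box a [4] and an array of boxes b [M, 4], both in (x1, y1, x2, y2)
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-7)

def postprocess(pred, conf_thr=0.1, iou_thr=0.5):
    # pred: [N, 7] rows of (cx, cy, w, h, objectness, class-1 conf, class-2 conf)
    pred = pred[pred[:, 4] >= conf_thr]                 # confidence-threshold filtering
    if len(pred) == 0:
        return pred
    boxes = np.column_stack([pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2,
                             pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2])
    order = np.argsort(-pred[:, 4])                     # highest confidence first
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)                                  # keep the best remaining box
        rest = order[1:]
        order = rest[iou_xyxy(boxes[i], boxes[rest]) <= iou_thr]
    return pred[keep]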
In summary: in the training stage, the collected training data are labeled with the overlapping labeling method, the model parameters are updated through forward inference and the back-propagation algorithm, and the optimal model is finally selected with the picture-level mIF1 index; in the deployment stage, the boxes obtained by forward inference are filtered and merged by the NMS post-processing. The 24-layer structure of the model is used both during training and during inference: during training the outputs of the 24-layer structure are used to compute the loss and update the model parameters, while during inference the outputs of the 24-layer structure are passed through NMS post-processing to obtain the final detection boxes.
Finally, it is pointed out that a person skilled in the art, given the benefit of this disclosure, can make numerous variants, all of which fall within the scope of protection of the invention, without thereby departing from the scope of protection of the claims.

Claims (3)

1. A smoke and fire detection method based on overlapping annotation training, the method comprising the steps of:
s100: collecting training data;
s200: cleaning the collected training data to obtain cleaned standard data;
s300: labeling the cleaned standard data with an overlapping labeling method to obtain labeled data;
s400: constructing a target detection model and training it with the labeled data;
s500: selecting the optimal model through the picture-level F1 index IF1;
the principle of selecting the optimal model is: judge only whether the picture contains open smoke or open fire, while ensuring that no false alarm is raised during training;
s600: deploying the optimal model to perform smoke and fire detection;
s700: post-processing all detected smoke and fire target boxes, and outputting whether smoke or fire is present together with the coordinates of the target boxes;
wherein the step S500 further includes:
s501: adopting the picture-level F1 index IF1 during training of the target detection model;
s502: averaging the IF1 values of the smoke class and the fire class to obtain the final evaluation index mIF1 = (1/n) × Σ IF1_i, wherein n = 2;
s503: after training, selecting the model with the highest mIF1 value as the optimal model;
IF1 is calculated as IF1 = 2 × IP × IR / (IP + IR), with IP = ITP / (ITP + IFP) and IR = ITP / (ITP + IFN);
wherein IP represents the picture-level precision, IR represents the picture-level recall, ITP indicates that a detected picture contains a certain class of target and the label also contains that target; IFP indicates that the predicted picture contains a certain class of target but the label does not; IFN indicates that the predicted picture does not contain a certain class of target but the label does;
wherein,
the post-processing in the step S700 includes performing non-maximum suppression and confidence-threshold filtering on the obtained smoke and fire target boxes;
wherein,
the overlapping labeling method labels smoke and fire with multiple overlapping, densely placed small boxes; the overlapping labeling not only increases the number of positive samples but also reduces the background information enclosed by the boxes, and makes the model pay more attention to texture information rather than contour shape; at deployment time, non-maximum suppression is applied in post-processing, so the large number of predicted overlapping boxes can be de-duplicated and the visualization remains concise and intuitive.
2. The method of claim 1, wherein the training data comprises: open-source smoke and fire detection datasets, pictures crawled with fire- and flame-related keywords, and data collected on site.
3. The method of claim 1, wherein the data cleaning means: keeping only the picture data of the open-source datasets, or only removing irrelevant pictures, or converting videos into pictures.
CN202311181870.9A 2023-09-14 2023-09-14 Firework detection method based on overlapping annotation training Active CN116912782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311181870.9A CN116912782B (en) 2023-09-14 2023-09-14 Firework detection method based on overlapping annotation training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311181870.9A CN116912782B (en) 2023-09-14 2023-09-14 Firework detection method based on overlapping annotation training

Publications (2)

Publication Number Publication Date
CN116912782A CN116912782A (en) 2023-10-20
CN116912782B (en) 2023-11-14

Family

ID=88355106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311181870.9A Active CN116912782B (en) 2023-09-14 2023-09-14 Firework detection method based on overlapping annotation training

Country Status (1)

Country Link
CN (1) CN116912782B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814635A (en) * 2020-06-30 2020-10-23 神思电子技术股份有限公司 Smoke and fire identification model establishing method and smoke and fire identification method based on deep learning
CN111882810A (en) * 2020-07-31 2020-11-03 广州市微智联科技有限公司 Fire identification and early warning method and system
CN112132090A (en) * 2020-09-28 2020-12-25 天地伟业技术有限公司 Smoke and fire automatic detection and early warning method based on YOLOV3
US11295131B1 (en) * 2021-06-15 2022-04-05 Knoetik Solutions, Inc. Smoke and fire recognition, fire forecasting, and monitoring
CN114078218A (en) * 2021-11-24 2022-02-22 南京林业大学 Self-adaptive fusion forest smoke and fire identification data augmentation method
CN114283354A (en) * 2021-12-04 2022-04-05 郑州轻工业大学 Fire scene visualization fire fighting helmet fused with YOLOv4-F image detection algorithm
CN114998637A (en) * 2022-02-28 2022-09-02 上海应用技术大学 Improved YOLOv 4-based fire and smoke target detection method, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tong Yang et al., "A Smoke and Flame Detection Method Using an Improved YOLOv5 Algorithm," 2022 IEEE International Conference on Real-time Computing and Robotics (RCAR), pp. 366-371. *
Li Haibin et al., "Coal dust detection method for chute discharging based on YOLOv4-tiny" (基于YOLOv4-tiny的溜筒卸料煤尘检测方法), Opto-Electronic Engineering, vol. 48, no. 6, pp. 73-86. *
Shi Lei et al., "Video smoke and fire detection algorithm based on improved SSD" (基于改进型SSD的视频烟火检测算法), Computer Applications and Software, vol. 38, no. 12, pp. 161-167, 173. *

Also Published As

Publication number Publication date
CN116912782A (en) 2023-10-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant