CN115457428A - Improved YOLOv5 fire detection method and device integrating adjustable coordinate residual attention - Google Patents
- Publication number
- CN115457428A (application CN202210981425.XA)
- Authority
- CN
- China
- Prior art keywords
- fire
- module
- network
- attention
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention provides an improved YOLOv5 fire detection method and device integrating adjustable coordinate residual attention, wherein the method comprises the following steps: constructing a fire data set, wherein the fire data set comprises video data and first picture data of different fire degrees collected in laboratory ignition experiments, second picture data are extracted from the video data, and flame and/or smoke annotations are added to the first picture data and the second picture data; establishing an improved YOLOv5 neural network integrating adjustable coordinate residual attention, and training the improved YOLOv5 neural network with the fire data set to serve as a fire detection model; and deploying the fire detection model to a mobile terminal which, after receiving real-time video data captured by a camera, detects and identifies fire targets in the real-time video data through the fire detection model. The invention can identify and detect not only the flame produced by a fire but also the smoke generated in its early stage, thereby reducing losses caused by missing the optimal time for early intervention.
Description
Technical Field
The invention relates to the technical field of computer vision and image type fire detection and identification, in particular to an improved YOLOv5 fire detection method and device integrating adjustable coordinate residual error attention.
Background
Accurate early detection of fire is an important means of fire safety, and researching fire monitoring and alarm systems with rapid response capability is valuable and necessary. Fire monitoring and alarm systems have been under investigation for decades.
Chinese patent application CN113869567A describes a control method, device, computer device and storage medium based on fire prediction information for multiple scenes, which mainly performs fire prediction control on fire scene data such as temperature, smoke flow and fire-fighting data. Chinese patent application CN113673748A discloses a fire prediction method based on an XGBoost model, which mainly uses the XGBoost model to predict fires; however, the XGBoost model cannot model spatial position and does not handle image data well.
Muhammad et al [Muhammad K, Ahmad J et al (2019) Efficient deep CNN-based fire detection and localization in video surveillance applications. IEEE Trans Syst Man Cybern Syst 49(7):1419-1434] classify fire detection methods into two categories: conventional fire alarms and visual-sensor-assisted fire detection. Currently, most fire detection and fire alarm systems are based on conventional fire detection or fire alarm systems. For example, Xu et al [Xu Y, Zhang J et al (2013) The structure of an automatic fire alarm system based on a visual instrument. J Tianjin Univ Technol 29(3):30-36] propose a fire alarm system based on a fire alarm controller with temperature and smoke detectors. Hu et al [Hu X (2013) Research and production of MIR flame detector system. J Zhejiang Univ 10(1):78] propose a multiband infrared fire detector. However, systems based on these sensors have a limited monitoring range, and their performance is susceptible to environmental changes.
With the popularization of video monitoring systems, research on visual-sensor-assisted fire detection has received much attention. Advantages of image/video-based fire detection include fast response, insensitivity to ambient temperature, and real-time images or video of the fire scene. In image/video-based fire detectors, fire objects are abstracted into image features generated from color, brightness, texture, shape and motion information. Toptaş proposes a remote video monitoring system and image processing technique based on a network camera for fire monitoring and alarm [Toptaş B, Hanbay D. An artificial bee colony algorithm-based color space for fire/flame detection [J]. Soft Computing, 2019(2):1-12]. Wan et al [Wan Z (2020) Fire detection from images based on single shot multibox detector. Hohai University, Nanjing] propose an improved SSD for detecting fires in images by using data augmentation and modifying the ratio and number of default boxes, but its accuracy is only 84.75%. Shen et al [Shen D, Chen X, Yan W (2018) Flame detection using deep learning. In: Proceedings of the 2018 4th International Conference on Control, Automation and Robotics, pp 416-420] propose an optimized YOLO model for detecting flame objects from video frames. However, the data set employed lacks diversity, because the samples come from only 194 images.
The difficulty of fire identification based on digital image processing lies in the segmentation and extraction of the flame target. Previously, flame and smoke targets were extracted mainly by matting methods and contour-tracing techniques. However, in practical applications the captured images are noisy, and the noisy areas are not uniform in size, which often degrades the image. This not only takes a lot of time but also requires distinguishing the target from its actual contour; because the feature-extraction stage of training does not use attention to extract more targeted feature information, misrecognition or missed recognition can occur, affecting the speed and accuracy of flame recognition.
Disclosure of Invention
In view of the above, the present invention proposes an improved YOLOv5 fire detection method and apparatus incorporating adjustable coordinate residual attention that overcomes or at least partially addresses the above-mentioned problems.
The invention provides an improved YOLOv5 fire detection method integrated with adjustable coordinate residual attention, which comprises the following steps:
constructing a fire data set, wherein the fire data set comprises video data and first picture data of different fire degrees collected in a laboratory ignition experiment, extracting second picture data from the video data, and adding marks of flame and/or smoke to the first picture data and the second picture data;
establishing an improved YOLOv5 neural network integrated with adjustable coordinate residual attention, and training the improved YOLOv5 neural network by using the fire data set to serve as a fire detection model;
and deploying the fire detection model to a mobile terminal; after receiving real-time video data captured by a camera, the mobile terminal detects and identifies fire targets through the fire detection model.
Optionally, the improved YOLOv5 neural network integrating adjustable coordinate residual attention comprises: Backbone network Backbone, Neck network Neck and Head network Head;
the Backbone network Backbone is mainly used for extracting key features from an input image; the Neck network Neck is mainly used for creating a feature pyramid; the Head network Head is primarily responsible for the final detection step, which uses anchor boxes to construct the final output vector with class probabilities, objectness scores, and bounding boxes.
Optionally, the establishing of the improved YOLOv5 neural network integrating coordinate attention includes:
adding an attention mechanism to the backbone network for YOLOv5 feature extraction, using the attention mechanism to encode the long-range dependencies and position information of the input image from the horizontal and vertical spatial directions respectively, and then aggregating the features.
Optionally, the final output of the attention mechanism is expressed as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)

where x_c represents the input feature map, and g^h and g^w represent the attention weights in the two spatial directions, respectively, expressed as:

g^h = σ(λ · F_h(f^h)), g^w = σ(λ · F_w(f^w))

where f^h and f^w are the feature tensors obtained by decomposing the information of the feature F along the two directions, F_h(·) and F_w(·) denote convolution operations with 1 × 1 kernels, σ is the sigmoid function, and λ is a hyper-parameter that can automatically adjust the feature weights in the horizontal and vertical directions.

f^h = Concat(z^h, x^h), f^w = Concat(z^w, x^w)

where x^h and x^w represent the original features in the two directions, and Concat(·) represents the splicing operation of two features.

z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i), z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)

where H and W are the height and width of the input feature map, z_c^h(h) is the output of the c-th channel at height h, z_c^w(w) is the output of the c-th channel at width w, and x_c is the input feature of the c-th channel.
Optionally, the backbone network includes four Bottleneck-CSP-New modules to replace the Bottleneck-CSP modules in the original YOLOv5 neural network;
the Bottleneck-CSP-New module comprises a first module and a second module. The first module uses a 1 × 1 convolution layer to halve the number of channels, then controls the number of hidden-layer channels through a Bottleneck module with a residual structure and its parameters, and then passes through a Conv2d module without BN or an activation function. The second module performs a shortcut connection between the unchanged input features and the output of the first module, and the result is finally output after BN + ReLU and an ordinary Conv2d convolution.
Optionally, the loss function of the fire detection model is the total loss of the model, combining a CIoU bounding-box loss and a focal loss, specifically as follows:

L_CIoU = 1 − IoU(A, B) + ρ²(a, a^gt) / c² + αv

v = (4/π²) · (arctan(w^gt / h^gt) − arctan(w / h))²

where w^gt/h^gt represents the aspect ratio of the target detection box and w/h represents the aspect ratio of the prediction detection box; A and B represent the prediction detection box and the target detection box, respectively; a and a^gt represent the center points of the prediction detection box and the target detection box; ρ(·) represents the Euclidean distance between the two center points; c represents the diagonal length of the minimum enclosing region that simultaneously contains the prediction box and the target box; and α is a trade-off hyper-parameter.

FL(p, y) = −α_t · (1 − p_t)^γ · log(p_t), with p_t = p if y = 1 and p_t = 1 − p otherwise

where α_t is set to 1, γ is set to 2, p is the predicted probability, and y indicates whether the sample is positive. The focal loss replaces the cross-entropy loss as the confidence and classification loss of the network.
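The two losses above can be sketched in plain Python (the (xmin, ymin, xmax, ymax) box format and the small stabilizing constant are assumptions, not fixed by the text):

```python
import math

def ciou_loss(p, t):
    """CIoU loss sketch for axis-aligned boxes (xmin, ymin, xmax, ymax):
    1 - IoU + normalized center distance + alpha * aspect-ratio term."""
    px1, py1, px2, py2 = p
    tx1, ty1, tx2, ty2 = t
    # intersection and union
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union
    # squared distance between box centers
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4.0
    # squared diagonal of the minimum enclosing box
    c2 = (max(px2, tx2) - min(px1, tx1)) ** 2 + (max(py2, ty2) - min(py1, ty1)) ** 2
    # aspect-ratio consistency term v and trade-off weight alpha
    v = (4.0 / math.pi ** 2) * (math.atan((tx2 - tx1) / (ty2 - ty1))
                                - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1.0 - iou + v + 1e-9)
    return 1.0 - iou + rho2 / c2 + alpha * v

def focal_loss(p, y, alpha=1.0, gamma=2.0):
    """Focal loss with alpha = 1 and gamma = 2 as stated above; p is the
    predicted probability, y marks a positive (1) or negative (0) sample."""
    p_t = p if y == 1 else 1.0 - p
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

For identical boxes the CIoU loss is zero; easy examples dominate less in the focal loss because the (1 − p_t)^γ factor suppresses well-classified samples.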
The invention also provides an improved YOLOv5 fire detection device incorporating adjustable coordinate residual attention, the device comprising:
the data collection module is used for constructing a fire data set, wherein the fire data set comprises video data and first picture data of different fire degrees collected in laboratory ignition experiments; second picture data are extracted from the video data, and flame and/or smoke annotations are added to the first picture data and the second picture data;
the model establishing module is used for establishing an improved YOLOv5 neural network fused with coordinate attention, and training the improved YOLOv5 neural network by utilizing the fire data set to serve as a fire detection model;
and the model deployment module is used for deploying the fire detection model to a mobile terminal; after receiving real-time video data captured by a camera, the mobile terminal uses the fire detection model to detect and identify fire targets in the real-time video data.
The invention also provides a computer readable storage medium for storing program code for performing the method of any of the above.
The present invention also provides a computing device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform any of the methods described above according to instructions in the program code.
Aiming at the low detection precision and speed of existing fire detection methods and sensors, the invention provides an improved YOLOv5 fire detection method and device integrating adjustable coordinate residual attention on the basis of analyzing fire image characteristics. The YOLOv5 neural network automatically extracts and learns image features. First, an attention mechanism is added and position information is embedded into the channel attention, so that the network can obtain information over a larger range, improving detection precision for small targets and fuzzy smoke boundaries. Meanwhile, the Bottleneck-CSP module in the backbone network is improved, reducing model parameters and model size and providing effective support for model deployment. The method can quickly and accurately identify detection objects and visually detect fire in real time. It can identify and detect not only the flame produced by a fire but also the smoke generated in its early stage, reducing losses from missing the optimal time for early intervention and enabling timely early fire detection.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of an improved YOLOv5 fire detection method incorporating adjustable coordinate residual attention according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the overall structure of an improved YOLOv5 network according to an embodiment of the present invention;
FIG. 3 illustrates an overall implementation of an attention mechanism of an embodiment of the present invention;
FIG. 4 is a schematic diagram showing a before-and-after comparison of the original Bottleneck-CSP module and its improvement according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of an improved YOLOv5 fire detection device incorporating adjustable coordinate residual attention according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of the present invention provides an improved YOLOv5 fire detection method incorporating adjustable coordinate residual attention, and as shown in fig. 1, the improved YOLOv5 fire detection method incorporating coordinate attention according to the embodiment of the present invention may include at least the following steps S101 to S103.
S101, a fire data set is constructed, the fire data set comprises video data and first picture data of different fire degrees collected in a laboratory ignition experiment, second picture data are extracted from the video data, and marks of flames and/or smoke are added to the first picture data and the second picture data.
Optionally, videos and pictures of fires of different degrees can be collected by performing multiple ignition tests in a laboratory, wherein burn pans of the sizes specified for image-type fire detectors in the national standard for special fire detectors are used for simulation tests, small-target fire picture data and/or video data are collected, and fire picture frames of multiple burning states are extracted from the video data. The fire pictures are annotated with a picture annotation tool (labelImg): the region of interest of each picture is marked, and the flame and smoke parts in each picture are annotated manually. Optionally, the first picture data and the second picture data can be divided into pictures containing only a flame target, only a smoke target, or both a smoke target and a flame target. That is, for any picture, the target annotations indicate whether it contains a smoke target, a flame target, or both.
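As an illustrative sketch of the annotation step (the class ids 0 = flame and 1 = smoke are assumptions; labelImg's YOLO export follows this normalized layout), a pixel-space box can be converted to a YOLO-format label line as follows:

```python
def to_yolo_label(cls_id, box, img_w, img_h):
    """Convert a pixel-space box (xmin, ymin, xmax, ymax) into a YOLO-format
    label line: 'class x_center y_center width height', all normalized."""
    xmin, ymin, xmax, ymax = box
    xc = (xmin + xmax) / 2.0 / img_w
    yc = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Hypothetical class ids: 0 = flame, 1 = smoke.
line = to_yolo_label(0, (100, 200, 300, 400), 640, 640)
print(line)  # 0 0.312500 0.468750 0.312500 0.312500
```

One such line per target is stored in a `.txt` file alongside each image, which is the format YOLOv5 training expects.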
S102, establishing an improved YOLOv5 neural network fused with coordinate attention, and training the improved YOLOv5 neural network by using the fire data set to serve as a fire detection model;
s103, deploying the fire detection model to a mobile terminal, and after receiving real-time video data captured by a camera, detecting and identifying a fire target by the mobile terminal through the fire detection model.
This embodiment improves the traditional YOLOv5 neural network by adding an attention mechanism; the annotated fire data set is formatted as required by the network and input into the improved YOLOv5 neural network for training and testing; the trained model is then deployed to a mobile terminal to perform fire detection and identification tasks.
As shown in fig. 2, the improved YOLOv5 neural network integrating coordinate attention includes: Backbone network Backbone, Neck network Neck and Head network Head. The Backbone network is mainly used for extracting key features from an input image; the Neck network is mainly used for creating a feature pyramid; the Head network is primarily responsible for the final detection step, which uses anchor boxes to construct the final output vector with class probabilities, objectness scores, and bounding boxes.
The step S102 of building an improved YOLOv5 neural network that incorporates the residual attention of the adjustable coordinate includes: an attention mechanism is added to a backbone network for YOLOv5 feature extraction, the attention mechanism is utilized to encode the remote dependency relationship and the position information of the input image from the horizontal and vertical spatial directions respectively, and then the features are aggregated.
When integrating the attention mechanism into the improvement, coordinate attention can be added to the backbone network of the YOLOv5 feature extraction. Coordinate attention is a lightweight and efficient attention mechanism that embeds position information into channel attention, so that a mobile network can acquire knowledge over a larger range. Fig. 3 shows an overall implementation schematic of the attention mechanism of the embodiment of the present invention. The attention mechanism of this embodiment is a coordinate attention mechanism: it encodes the long-range dependencies and position information from the horizontal and vertical spatial directions, respectively, and then aggregates the features. Therefore, the features need to be decomposed to capture position information from space, specifically along the horizontal and vertical directions. For an input feature map x ∈ R^{C×H×W}, pooling kernels of size (1, W) and (H, 1) are used to encode the horizontal-direction and vertical-direction features, respectively; the outputs of the c-th channel at height h and at width w are expressed as:

z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i)

z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)
where H and W are the height and width of the input feature map, z_c^h(h) is the output of the c-th channel at height h, z_c^w(w) is the output of the c-th channel at width w, and x_c is the input feature of the c-th channel. The two transformations above aggregate features along the two spatial directions (X and Y). They generate a pair of direction-aware feature maps that enable the attention mechanism to capture long-range information along one spatial direction while preserving accurate position information along the other. Attention mechanisms are widely used to improve model performance; their inspiration comes from the way the human eye observes things, always focusing on the most important aspects. Likewise, attention allows the network to focus on important features, which contributes to its accuracy. By applying an attention mechanism to the network model, classification accuracy is further improved. The essence of the attention mechanism is to weight the feature map so that the model attends to important feature information, improving its generalization capability. The SE attention mechanism computes channel attention weights using 2D global pooling and weights the feature information to optimize the model. However, SE attention weights the channel dimension of the feature map while ignoring the spatial dimension, which is crucial in computer vision tasks. CBAM uses channel pooling and convolution to weight the spatial dimension, but convolution cannot capture the relevance of long-range information, which is crucial for visual tasks. Therefore, the invention fuses a coordinate attention mechanism into the network; the coordinate attention mechanism can obtain cross-channel, position-sensitive and direction-aware information.
This helps the model focus on useful feature information. Global Average Pooling (GAP) is commonly used to compute channel attention weights and to globally encode spatial information, applying GAP to each image feature over the spatial dimension H × W. However, it computes the channel attention weight by compressing the global spatial information, thereby losing spatial information. Thus, the two-dimensional global pooling is decomposed into one-dimensional global poolings in the horizontal and vertical directions to efficiently utilize spatial and channel information. Specifically, each channel in the feature map is encoded along spatial extents (H, 1) and (1, W) using 1D horizontal and vertical global pooling. The two formulas above obtain the correlation of long-range information in one spatial direction while preserving position information in the other, which helps the network focus on information useful for classification. The two feature maps generated in the horizontal and vertical directions are then encoded into two attention weights, each capturing the long-range relevance of the input feature map in one spatial direction.
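The directional decomposition of GAP can be illustrated in NumPy (a minimal sketch, not the patented implementation):

```python
import numpy as np

def directional_pool(x):
    """Decompose 2-D global average pooling into two 1-D poolings:
    averaging over the width (kernel (1, W)) keeps the height positions,
    averaging over the height (kernel (H, 1)) keeps the width positions."""
    z_h = x.mean(axis=2)  # (C, H): one value per row of each channel
    z_w = x.mean(axis=1)  # (C, W): one value per column of each channel
    return z_h, z_w

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # (C, H, W) toy input
z_h, z_w = directional_pool(x)
```

Averaging z_h again over its spatial axis recovers the plain 2-D GAP value, so no information is lost relative to GAP while position along each axis is retained.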
The two transforms are spliced along the spatial dimension, and the channels are compressed using a 1 × 1 convolution. The spatial information in the vertical and horizontal directions is then encoded using BatchNorm and a nonlinearity, the encoded information is split, and the number of attention channels is adjusted to equal the number of channels in the input feature map using 1 × 1 convolutions. Then, normalization and weighted fusion are performed using the sigmoid function. The final output of the attention mechanism is expressed as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)

where x_c represents the input feature map, c denotes the c-th channel, h and w denote the height and width of the input feature map, and the subscripts i and j index the height and width of the current vector. g^h and g^w are the attention weights in the two spatial directions, respectively, expressed as:

g^h = σ(λ · F_h(f^h)), g^w = σ(λ · F_w(f^w))

where f^h and f^w are the feature tensors obtained by decomposing the information of the feature F along the two directions, F_h(·) and F_w(·) denote convolution operations with 1 × 1 kernels, σ is the sigmoid function, and λ is a hyper-parameter that can automatically adjust the feature weights in the horizontal and vertical directions.
The method performs target detection on flames, considering that the flame form constantly changes over time and has different variation characteristics in the horizontal and vertical directions. Thus, the hyper-parameter λ is used to adjust the influence of horizontal and vertical variations on recognition, respectively. Meanwhile, the initial features of the flame are preserved through a residual connection and combined with the coordinate attention features to achieve a better recognition effect:

f^h = Concat(z^h, x^h), f^w = Concat(z^w, x^w)

where x^h and x^w represent the original features in the two directions, and Concat(·) represents the splicing operation of two features.
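Putting the pieces together, a NumPy sketch of the adjustable coordinate residual attention might look as follows. This is an assumption-laden illustration: the 1 × 1 convolutions are replaced by plain channel-mixing matrices `w_h` and `w_w`, and the exact fusion order in the patent may differ.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coord_attention(x, w_h, w_w, lam=1.0):
    """Sketch of adjustable coordinate residual attention: 1-D poolings give
    direction-aware descriptors, channel-mixing matrices w_h and w_w stand in
    for the 1x1 convolutions, the hyper-parameter lam rescales the two
    directional weights, and a residual keeps the original features."""
    z_h = x.mean(axis=2)                 # (C, H) horizontal pooling
    z_w = x.mean(axis=1)                 # (C, W) vertical pooling
    g_h = sigmoid(lam * (w_h @ z_h))     # (C, H) attention along height
    g_w = sigmoid(lam * (w_w @ z_w))     # (C, W) attention along width
    y = x * g_h[:, :, None] * g_w[:, None, :]
    return x + y                         # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))     # (C, H, W) toy feature map
w_h = np.eye(8)
w_w = np.eye(8)
out = coord_attention(x, w_h, w_w)
```

Note that with λ = 0 both directional weights collapse to 0.5 everywhere, so the output reduces to 1.25·x: the residual guarantees the original flame features always pass through.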
Further, the backbone network of the embodiment of the invention comprises four Bottleneck-CSP-New modules to replace the Bottleneck-CSP modules in the original YOLOv5 neural network. The Bottleneck-CSP-New module comprises a first module and a second module. The first module uses a 1 × 1 convolution layer to halve the number of channels, then controls the number of hidden-layer channels through a Bottleneck module with a residual structure and its parameters, and then passes through a Conv2d module without BN or an activation function. The second module performs a shortcut connection between the unchanged input features and the output of the first module, and the result is finally output after BN + ReLU and an ordinary Conv2d convolution.
The modified model size should be reduced as much as possible for deployment on a hardware device. Therefore, the backbone network of the YOLOv5 model is modified. The backbone network of the YOLOv5s architecture includes four Bottleneck-CSP modules, each having a number of convolution layers. Although the convolution process can extract image features, a convolution kernel has many parameters, resulting in many parameters in the recognition model. Thus, the convolution layers on the second branch of the original CSP module are deleted, and the input feature map of the Bottleneck-CSP module is connected directly to the output feature map on that branch, which greatly reduces the number of parameters in the module. The four stages of the original backbone network using Bottleneck-CSP modules are replaced by four Bottleneck-CSP-New modules. Although the lightweight Bottleneck-CSP-New modules alone could weaken deep feature extraction, combined with the attention mechanism the model can still extract image feature information well while reducing parameters, making the model convenient to deploy.
Fig. 4 shows a before-and-after comparison schematic diagram of the original Bottleneck-CSP module and its improvement according to an embodiment of the present invention. The original Bottleneck-CSP module is divided into a Bottleneck part and a CSP part, and the input features pass through two different modules. First, the first module uses a 1 × 1 convolution layer (Conv2d + BN + Hardswish) to halve the number of channels, then controls the number of hidden-layer channels through a Bottleneck module with a residual structure and its parameters, and then passes through a Conv2d module without BN or an activation function. Next, in the second module, the input features pass through a Conv2d module without BN or an activation function. The outputs of the two modules are combined by a shortcut connection, and the result is finally output after BN + ReLU and an ordinary Conv2d convolution.
The modified Bottleneck-CSP-New module is likewise divided into two branches. The first branch keeps the original feature-extraction flow; in the second branch, the input features are passed through unchanged and combined with the output of the first branch by a shortcut connection, and the result is finally output after BN + ReLU and an ordinary Conv2D 1×1 convolution.
After the modified YOLOv5 neural network is obtained, the training of the modified YOLOv5 neural network in step S102 may be continued. Training the improved YOLOv5 neural network may be performed in the following manner.
1. First, the labeled data set is input in the format required by the neural network. The input stage adopts Mosaic data augmentation: images are stitched together by random scaling, random cropping, and random arrangement, which noticeably improves detection of small-target fire images. Adaptive anchor box calculation is also performed, computing the optimal anchor box values for different training sets adaptively during each training run.
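The Mosaic stitching described above can be sketched as follows (a minimal NumPy illustration only: the quadrant layout, gray fill value, and fixed input size are assumptions, and YOLOv5's actual implementation also rescales images and remaps their labels):

```python
import numpy as np

def mosaic(images, out_size=640, seed=0):
    """Stitch four HxWx3 images into one out_size x out_size mosaic.

    Each image is cropped into one quadrant around a random stitch
    center -- a simplified stand-in for YOLOv5's Mosaic augmentation.
    """
    rng = np.random.default_rng(seed)
    # Random stitch center, kept away from the borders.
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    # (row-slice, col-slice) for top-left, top-right, bottom-left, bottom-right.
    regions = [
        (slice(0, cy), slice(0, cx)),
        (slice(0, cy), slice(cx, out_size)),
        (slice(cy, out_size), slice(0, cx)),
        (slice(cy, out_size), slice(cx, out_size)),
    ]
    for img, (ys, xs) in zip(images, regions):
        h = ys.stop - ys.start
        w = xs.stop - xs.start
        canvas[ys, xs] = img[:h, :w]  # crop each image to its quadrant
    return canvas

imgs = [np.full((640, 640, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
out = mosaic(imgs)
```

In the real augmentation the bounding-box labels of each source image must be shifted and clipped along with the pixels; this sketch shows only the image-stitching step.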
2. Secondly, in the network training stage, the detailed steps of each module are as follows:
(1) The Backbone network Backbone is mainly used for extracting key features from input images.
The Focus layer is the initial layer of the backbone network, used to simplify model computation and improve training speed. It works as follows: using a slicing technique, the three-channel image is first split into four slices, each of size 3 × 320 × 320; the four slices are then concatenated along the channel dimension, giving an output feature map of size 12 × 320 × 320.
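The slicing operation described above can be reproduced with plain array indexing (a NumPy sketch of the Focus transform; a 640 × 640 input is assumed so the sizes match the text, and YOLOv5 itself follows this slicing with a convolution):

```python
import numpy as np

def focus_slice(img):
    """Split a (C, H, W) image into four spatial slices and stack them
    on the channel axis: (C, H, W) -> (4C, H/2, W/2)."""
    return np.concatenate(
        [
            img[:, 0::2, 0::2],  # even rows, even cols
            img[:, 1::2, 0::2],  # odd rows, even cols
            img[:, 0::2, 1::2],  # even rows, odd cols
            img[:, 1::2, 1::2],  # odd rows, odd cols
        ],
        axis=0,
    )

x = np.arange(3 * 640 * 640, dtype=np.float64).reshape(3, 640, 640)
y = focus_slice(x)  # shape (12, 320, 320), matching the text
```

No information is lost: the transform is a pure rearrangement of pixels, trading spatial resolution for channel depth before the first convolution.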
The Conv layer is the second layer of the backbone network. It applies a convolution composed of 32 convolution kernels to the input feature map, producing an output feature map of size 32 × 320 × 320. The result is then passed through a BN (batch normalization) layer and a Hardswish activation function to the next layer.
The BottleneckCSP-New module is the third layer of the backbone network and is designed to extract the depth information of the image more effectively. It is mainly composed of a Bottleneck module, which connects a convolution layer with a 1 × 1 kernel (Conv2d + BN + ReLU activation) to one with a 3 × 3 kernel; through the residual structure, the final output of the Bottleneck module is the sum of this output and the initial input. The input to the BottleneckCSP-New module is split into two branches, each halving the number of feature-map channels with a convolution; the output feature maps of branch 1 (which passes through the Bottleneck module) and branch 2 are then concatenated along the channel dimension. Finally, after passing through BN and Conv2d layers, the module's output feature map is produced.
The CA-Res module is the fourth layer of the backbone network and adopts an attention mechanism that encodes the long-range dependencies and position information of the input image along the horizontal and vertical spatial directions respectively, learns the horizontal- and vertical-direction features through an adjustable residual structure, and aggregates them. Specifically, the feature map output by the BottleneckCSP-New module is encoded along the horizontal and vertical directions using pooling kernels of sizes (1, W) and (H, 1); the directional features are learned through the adjustable residual structure and aggregated, and the horizontal, vertical, and input features are normalized and weighted-fused using a sigmoid function.
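The directional pooling and sigmoid re-weighting at the heart of this step can be illustrated in NumPy (a deliberately simplified sketch: the learned 1 × 1 convolutions and the patent's adjustable residual term λ are omitted, since their exact form is given only in the patent's figures):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coord_attention_lite(x):
    """Simplified coordinate attention over a (C, H, W) feature map.

    Encodes each channel along the two spatial directions with average
    pooling (kernels (1, W) and (H, 1)), turns the pooled vectors into
    sigmoid attention weights, and re-weights the input. The learned
    1x1 convolutions and adjustable residual of the actual CA-Res
    module are omitted here.
    """
    z_h = x.mean(axis=2, keepdims=True)  # (C, H, 1): pool across width
    z_w = x.mean(axis=1, keepdims=True)  # (C, 1, W): pool across height
    g_h = sigmoid(z_h)                   # attention weights along height
    g_w = sigmoid(z_w)                   # attention weights along width
    return x * g_h * g_w                 # broadcast re-weighting

x = np.random.default_rng(0).normal(size=(8, 20, 20))
y = coord_attention_lite(x)
```

Because each weight lies in (0, 1), the module can only attenuate features, emphasizing positions whose row and column statistics are both strong.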
Then, the Conv module, the BottleneckCSP-New module, and the CA module are repeated twice, and a further Conv convolution operation is applied to the output image features.
The SPP block (spatial pyramid pooling) is the twelfth layer of the backbone network and is designed to increase the receptive field of the network by converting an arbitrarily sized feature map into a fixed-size feature vector. After the convolutional layers, a feature map of size 256 × 20 × 20 is output; the convolution kernel size is 1 × 1. The feature map is then sub-sampled by three concurrent max-pooling layers, and the results are concatenated with the input feature map along the channel dimension, giving an output feature map of size 1024 × 20 × 20.
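The SPP concatenation can be illustrated as follows (a NumPy sketch with stride-1, 'same'-padded max pooling; the kernel sizes 5, 9, and 13 are the usual YOLOv5 choices and are an assumption here):

```python
import numpy as np

def maxpool_same(x, k):
    """k x k max pooling with stride 1 and 'same' padding on a (C, H, W) map."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    C, H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(x, kernels=(5, 9, 13)):
    """SPP: concatenate the input with three max-pooled copies along
    the channel axis: (C, H, W) -> (4C, H, W)."""
    return np.concatenate([x] + [maxpool_same(x, k) for k in kernels], axis=0)

x = np.random.default_rng(1).normal(size=(256, 20, 20))
y = spp(x)  # (1024, 20, 20), matching the sizes in the text
```

Because the pooling uses stride 1 with 'same' padding, the spatial size is preserved and only the channel count quadruples, which is why 256 channels become 1024.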
(2) The Neck model is mainly used to create a feature pyramid. The feature pyramid helps the model generalize across object scales and identify the same object at different sizes and scales.
A feature-map level is selected using the formula k = ⌊k0 + log2(√(x·y)/224)⌋, where 224 is the canonical ImageNet pre-training size, x and y are the width and height of the RoI (the region of interest), and k0 is the target level onto which an RoI with x × y = 224 × 224 should map. The feature pyramid greatly helps the model perform well on unknown data.
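The RoI-to-level mapping above can be computed directly (assuming the canonical level k0 = 4, as in the original FPN formulation):

```python
import math

def fpn_level(w, h, k0=4, canonical=224):
    """Map an RoI of width w and height h to a pyramid level:
    k = floor(k0 + log2(sqrt(w*h) / canonical)).
    An RoI of canonical x canonical pixels maps exactly to level k0;
    halving the RoI side length moves it one level down the pyramid."""
    return math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
```

For example, a 224 × 224 RoI lands on level 4, a 112 × 112 RoI on level 3, and a 448 × 448 RoI on level 5, so larger objects are assigned to coarser feature maps.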
(3) The Head model is primarily responsible for the final detection step, using anchor boxes to construct the final output vector with class probabilities, objectness scores, and bounding boxes. The detection network of the YOLOv5s structure includes three detection layers, whose input feature maps have sizes of 80 × 80, 40 × 40, and 20 × 20, for detecting image objects of various sizes. Each detection layer outputs a 21-channel vector: (2 class probabilities + 1 objectness score + 4 bounding-box coordinates) × 3 anchor boxes. Predicted bounding boxes and categories of the targets in the original image are then generated and marked, realizing the detection of image targets.
The loss function of the model is its total loss, composed of the bounding-box regression loss, the confidence loss, and the classification loss.
When the prediction detection box and the target detection box do not intersect, IoU cannot reflect the distance between the two; at that point the loss function is not differentiable and cannot be optimized. The second problem is that two prediction boxes with the same IoU but different positions cannot be distinguished by IoU_Loss. DIoU_Loss solves these problems by considering both the overlapping area of the two boxes and the distance between their center points. CIoU_Loss introduces the aspect ratios of the two boxes on top of DIoU_Loss, so that convergence is faster even when the overlap is 0 and the regression result is better:

L_CIoU = 1 − IoU(A, B) + ρ²(a, b)/c² + α·v

The aspect-ratio term v and its weight α are expressed as follows:

v = (4/π²)·(arctan(w_B/h_B) − arctan(w_A/h_A))²,  α = v/((1 − IoU) + v)

where w_B/h_B is the aspect ratio of the target detection box and w_A/h_A is the aspect ratio of the prediction detection box. A and B denote the prediction detection box and the target detection box respectively; a and b denote the center points of the prediction detection box and the target detection box; ρ(·,·) represents the Euclidean distance between the two center points; c represents the diagonal length of the minimum enclosing region containing both the prediction box and the target box; and α is a trade-off hyperparameter.
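As a concrete illustration of the CIoU terms described above (a sketch of the standard CIoU formulation for axis-aligned (x1, y1, x2, y2) boxes, not code from the patent):

```python
import math

def ciou_loss(box_p, box_t):
    """CIoU loss for a prediction box and a target box, each given as
    (x1, y1, x2, y2): 1 - IoU + rho^2 / c^2 + alpha * v."""
    px1, py1, px2, py2 = box_p
    tx1, ty1, tx2, ty2 = box_t
    # Intersection and union areas.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter)
    # Squared distance between the two center points.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + \
           ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    # Squared diagonal of the minimum region enclosing both boxes.
    c2 = (max(px2, tx2) - min(px1, tx1)) ** 2 + \
         (max(py2, ty2) - min(py1, ty1)) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan((tx2 - tx1) / (ty2 - ty1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give zero loss, while disjoint boxes still produce a distance-dependent penalty, which is exactly the property that plain IoU_Loss lacks.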
To address the imbalance between positive and negative samples, focal loss is adopted in place of the cross-entropy loss function as the confidence and classification loss of the network. Focal loss assigns a higher loss weight to the foreground, so that the model concentrates more on classifying the foreground.
FL(p) = −α·(1 − p_t)^γ·log(p_t), with p_t = p when y = 1 and p_t = 1 − p otherwise, where α is set to 1, γ is set to 2, p is the predicted probability, and y indicates whether the sample is a positive sample.
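The focal loss with these settings can be written out as a small helper (a sketch assuming the common binary form in which the weight α applies to the true-class probability p_t; the patent's exact variant may differ):

```python
import math

def focal_loss(p, y, alpha=1.0, gamma=2.0):
    """Binary focal loss with the settings alpha = 1, gamma = 2.

    p_t is the model's probability for the true class; the factor
    (1 - p_t)^gamma down-weights easy, well-classified examples so
    that training focuses on hard ones.
    """
    p_t = p if y == 1 else 1 - p
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)
```

With γ = 2, a confident correct prediction (p_t = 0.9) contributes two orders of magnitude less loss than an uncertain one, which is how the modulating factor counteracts the flood of easy negatives.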
The trained fire detection model is deployed to a mobile terminal to detect and identify fire targets: the trained weights and model are deployed to the mobile terminal; video is captured by a camera and streamed to the mobile terminal; the fire detection model then performs feature analysis on each frame of the input video stream, judging whether a smoke target and/or flame target is present, so that emerging fires are detected in real time. When smoke and/or flame is detected, an alarm is output and a detection box pops up, prompting the relevant personnel to carry out fire-extinguishing measures.
The image with fire marks is input into the improved network model and passes through three modules: the backbone network, the neck network, and the head network. The backbone network includes a slicing network, a convolutional network, the modified BottleneckCSP network, the coordinate attention mechanism, and the SPP network. Using the slicing technique, the three-channel image is first split into four slices, each of size 3 × 320 × 320; the four slices are then concatenated along the channel dimension, giving an output feature map of size 12 × 320 × 320. The input features of the image are extracted by a series of feature-extraction networks, with a final output feature map of size 1024 × 20 × 20. The neck network retains spatial information through up-sampling and down-sampling operations. Finally, feature maps of different sizes are sampled and processed to the same size, and feature fusion and convolution operations yield three feature layers of 20 × 20 × 255, 40 × 40 × 255, and 80 × 80 × 255; the GIoU loss function is used in the calculation, and the prediction results are generated. Multiple target boxes are then screened using non-maximum suppression.
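The final screening step, removing overlapping detections with non-maximum suppression, can be sketched as a greedy procedure (a NumPy illustration; the 0.45 IoU threshold is an assumed default, not a value from the patent):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns the indices of the boxes kept, highest score first.
    """
    order = np.argsort(scores)[::-1]  # candidates, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the top-scoring box with all remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop boxes overlapping too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

Here the second box heavily overlaps the first and is suppressed, while the distant third box survives, so one detection remains per physical target.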
An embodiment of the present invention further provides an improved YOLOv5 fire detection apparatus incorporating adjustable coordinate residual attention, and as shown in fig. 5, the improved YOLOv5 fire detection apparatus incorporating coordinate attention according to an embodiment of the present invention may include:
a data collecting module 510, configured to construct a fire data set, where the fire data set includes video data and first picture data of different fire degrees collected in a laboratory ignition experiment, extract second picture data from the video data, and add signs of flame and/or smoke to the first picture data and the second picture data;
a model establishing module 520, configured to establish an improved YOLOv5 neural network that incorporates coordinate attention, and train the improved YOLOv5 neural network using the fire data set as a fire detection model;
a model deployment module 530, configured to deploy the fire detection model to a mobile terminal, and after the mobile terminal receives real-time video data captured by a camera, the mobile terminal uses the fire detection model to detect and identify a fire target for the real-time video data.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium is used for storing a program code, and the program code is used for executing the method described in the above embodiment.
An embodiment of the present invention further provides a computing device, where the computing device includes a processor and a memory: the memory is used for storing program codes and transmitting the program codes to the processor; the processor is configured to perform any of the methods described above according to instructions in the program code.
It is clear to those skilled in the art that the specific working processes of the above-described systems, devices, modules and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that: the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention when the instructions are executed. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Alternatively, all or part of the steps of the method embodiments may be implemented by program instructions directing the relevant hardware (such as a personal computer, a server, or a network device); the program instructions may be stored in a computer-readable storage medium, and when executed by a processor of the computing device, cause the computing device to execute all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.
Claims (9)
1. An improved YOLOv5 fire detection method incorporating adjustable coordinate residual attention, the method comprising:
constructing a fire data set, wherein the fire data set comprises video data and first picture data of different fire degrees collected in a laboratory ignition experiment, extracting second picture data from the video data, and adding marks of flame and/or smoke to the first picture data and the second picture data;
establishing an improved YOLOv5 neural network integrated with adjustable coordinate residual attention, and training the improved YOLOv5 neural network by using the fire data set to serve as a fire detection model;
and deploying the fire detection model to a mobile terminal, and after the mobile terminal receives real-time video data captured by a camera, detecting and identifying a fire target by the mobile terminal through the fire detection model.
2. The method of claim 1, wherein the improved YOLOv5 neural network that blends in adjustable coordinate residual attention comprises: backbone network Backbone, neck network Neck and Head network Head;
the Backbone network backhaul is mainly used for extracting key features from input images; the Neck network Neck is mainly used for creating a characteristic pyramid; the Head network Head is primarily responsible for the final detection step, which uses the anchor box to construct the final output vector with class probabilities, objectification scores, and bounding boxes.
3. The method of claim 2, wherein the building of the improved YOLOv5 neural network that blends in adjustable coordinate residual attention comprises:
adding an attention mechanism into the backbone network used for YOLOv5 feature extraction, using the attention mechanism to encode the long-range dependencies and position information of the input image from the horizontal and vertical spatial directions respectively, learning the horizontal- and vertical-direction features through an adjustable residual structure, and aggregating the features in the horizontal and vertical directions.
4. The method of claim 3, wherein the final output of the adjustable coordinate residual attention mechanism is represented as follows:
wherein F represents the input feature map, and g^h and g^w respectively represent the attention weights in the two spatial directions, given by the following formulas:
wherein z^h and z^w are the feature tensors obtained by decomposing the information of the feature F in the two directions, F_h(·) and F_w(·) respectively represent convolution operations with 1 × 1 kernels, and λ is a hyperparameter that automatically adjusts the feature weights in the horizontal and vertical directions;
wherein x^h and x^w respectively represent the original features in the two directions, and Concat(·) represents the concatenation operation on the two features;
5. The method of claim 2, wherein the backbone network comprises four Bottleneck-CSP-New modules to replace the Bottleneck-CSP modules in the original YOLOv5 neural network;
the Bottleneck-CSP-New module comprises a first module and a second module; the first module uses a 1 × 1 convolution layer to halve the number of channels, then controls the number of hidden-layer channels through a Bottleneck module with a residual structure and its parameters, and then passes through a Conv2D 1×1 module without BN or an activation function; the second module performs a shortcut connection between the unchanged input features and the output of the first module, and the result is finally output after BN + ReLU and an ordinary Conv2D 1×1 convolution.
6. The method of claim 2, wherein the loss function of the fire detection model is as follows:
wherein the aspect-ratio term is computed from the aspect ratio of the target detection box and the aspect ratio of the prediction detection box;
wherein A and B respectively denote the prediction detection box and the target detection box, a and b respectively denote the center points of the prediction detection box and the target detection box, ρ(·,·) represents the Euclidean distance between the two center points, c represents the diagonal length of the minimum enclosing region containing both the prediction box and the target box, and α is a trade-off hyperparameter;
wherein α is set to 1, γ is set to 2, p is the predicted probability, and y indicates whether the sample is a positive sample; the focal loss function replaces the cross-entropy loss function as the confidence and classification loss of the network.
7. An improved YOLOv5 fire detection device incorporating adjustable coordinate residual attention, the device comprising:
the data collection module is used for constructing a fire data set, wherein the fire data set comprises video data and first picture data of different fire degrees collected in a laboratory ignition experiment, extracting second picture data from the video data, and adding marks of flame and/or smoke for the first picture data and the second picture data;
the model building module is used for building an improved YOLOv5 neural network which is fused with adjustable coordinate residual attention, and training the improved YOLOv5 neural network by using the fire data set to serve as a fire detection model;
and the model deployment module is used for deploying the fire detection model to a mobile terminal, and after the mobile terminal receives the real-time video data captured by the camera, the mobile terminal utilizes the fire detection model to detect and identify the fire target for the real-time video data.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program code for performing the method of any of claims 1-6.
9. A computing device, the computing device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-6 according to instructions in the program code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210981425.XA CN115457428A (en) | 2022-08-16 | 2022-08-16 | Improved YOLOv5 fire detection method and device integrating adjustable coordinate residual attention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115457428A true CN115457428A (en) | 2022-12-09 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||