CN116824335A - YOLOv5 improved algorithm-based fire disaster early warning method and system - Google Patents

YOLOv5 improved algorithm-based fire disaster early warning method and system Download PDF

Info

Publication number
CN116824335A
CN116824335A CN202310756773.1A
Authority
CN
China
Prior art keywords
feature map
module
fire
detection
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310756773.1A
Other languages
Chinese (zh)
Inventor
邱云周
贾根团
张静
郑春雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202310756773.1A priority Critical patent/CN116824335A/en
Publication of CN116824335A publication Critical patent/CN116824335A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/817Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level by voting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B17/00Fire alarms; Alarms responsive to explosion
    • G08B17/12Actuation by presence of radiation or particles, e.g. of infrared radiation or of ions
    • G08B17/125Actuation by presence of radiation or particles, e.g. of infrared radiation or of ions by using a video camera to detect fire or smoke
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The application relates to a fire disaster early warning method based on a YOLOv5 improved algorithm, which comprises the following steps: S1, obtaining streaming media data; S2, preprocessing the streaming media data to obtain an image sequence to be detected; S3, detecting the image sequence to be detected by using a fire detection model to obtain a detection result, the fire detection model being constructed based on a YOLOv5 improvement algorithm in which: an attention module CAB is placed in the deep layers of the backbone network for feature extraction; and a feature fusion module is constructed to perform multi-scale feature fusion and generate four detection heads with different receptive fields; S4, judging whether a fire occurs based on the detection result of the fire detection model. The application can accurately detect early-stage fires in real time.

Description

YOLOv5 improved algorithm-based fire disaster early warning method and system
Technical Field
The application relates to the technical field of target detection, in particular to a fire disaster early warning method and system based on a YOLOv5 improved algorithm.
Background
Fire is one of the world's recognized disasters and seriously endangers human life and property. For the security construction of smart cities, early and effective fire detection and early warning are of vital importance. Sensors based on physical signals, such as smoke sensors, pyroelectric infrared flame sensors and ultraviolet flame sensors, are widely used in fire alarm systems. Because these conventional physical sensors must be located near the fire, they cannot work effectively in semi-enclosed large-space buildings and open underground spaces, and they cannot provide detailed disaster information such as the fire location, fire size and degree of combustion; fire detection technology based on visual sensors can meet these demands.
In the early stages of visual fire detection research, fire scenes were described mainly by manually extracting static and dynamic characteristics of flames, such as color, texture, shape, edges and motion, according to the application scene, and a suitable classifier was then designed in combination with machine learning methods for further classification and identification. These traditional methods improved the accuracy of fire identification through hand-designed feature extractors and promoted the development of visual fire detection technology to a certain extent. However, owing to the high complexity of fire scenes in video, manually designed features are highly redundant and depend on fixed scenes, the extractable information covers only the shallow characteristics of flames, and such heuristic methods have poor robustness and struggle to adapt to fire detection in complex scenes. Accordingly, in recent years visual fire detection has been studied with deep learning methods. Existing two-stage target detection algorithms rely on large, deep networks, making it difficult to meet the requirements of both high precision and high real-time performance, while single-stage target detection algorithms trade precision for real-time detection and suffer from low detection precision and inaccurate localization of small fire targets, so that accurate, real-time fire detection and early warning cannot be achieved.
Disclosure of Invention
The technical problem that the application aims to solve is to provide a fire early warning method and a fire early warning system based on a YOLOv5 improved algorithm that can accurately detect fires and give early warning in real time.
The technical scheme adopted for solving the technical problems is as follows: the fire disaster early warning method based on the YOLOv5 improved algorithm comprises the following steps:
s1, obtaining streaming media data;
s2, preprocessing the streaming media data to obtain an image sequence to be detected;
s3, detecting the image sequence to be detected by using a fire detection model to obtain a detection result; the fire detection model is constructed based on a YOLOv5 improvement algorithm and comprises the following steps:
an attention module CAB is placed in the deep layers of the backbone network for feature extraction;
a feature fusion module is constructed to perform multi-scale feature fusion and generate four detection heads with different receptive fields;
s4, judging whether fire occurs or not based on the detection result of the fire detection model.
Further, placing the attention module CAB in the deep layers of the backbone network for feature extraction includes:
the image to be detected that is put into the fire detection model is subjected to feature extraction twice, through a first CBS module and a second CBS module, then input into a first CSP1 module, and a first feature map is output;
the first feature map is subjected to feature extraction through a third CBS module and then is input into a second CSP1 module, and a second feature map is output;
the second feature map is subjected to feature extraction through a fourth CBS module and then is input into a third CSP1 module, and a third feature map is output;
and the third feature map is subjected to feature extraction through a fifth CBS module and then input into an attention module CAB to increase the weight of target features, then input into a sixth CBS module after spatial pyramid pooling, and a fourth feature map is output as the input of the feature fusion module.
Further, the spatial pyramid pooling is performed through a bi-directional pyramid network.
Further, constructing the feature fusion module to perform multi-scale feature fusion and generate four detection heads with different receptive fields includes:
the fourth feature map is upsampled and then spliced with the third feature map, and a first fusion feature map is obtained through a first CSP2 module and a seventh CBS module;
the first fusion feature map is upsampled and then spliced with the second feature map, and a second fusion feature map is obtained through a second CSP2 module and an eighth CBS module;
the second fusion feature map is upsampled and then spliced with the first feature map, a third fusion feature map is obtained through a third CSP2 module, and a convolution operation is performed on the third fusion feature map to obtain a first detection feature map;
the third fusion feature map passes through a ninth CBS module and is then spliced with the second fusion feature map, a fourth fusion feature map is obtained through a fourth CSP2 module, and a convolution operation is performed on the fourth fusion feature map to obtain a second detection feature map;
the fourth fusion feature map passes through a tenth CBS module and is then spliced with the first fusion feature map and with the third feature map that has undergone a downsampling operation, a fifth fusion feature map is obtained through a fifth CSP2 module, and a convolution operation is performed on the fifth fusion feature map to obtain a third detection feature map;
and the fifth fusion feature map passes through an eleventh CBS module and is then spliced with the fourth feature map and with the second feature map that has undergone a downsampling operation, a sixth fusion feature map is obtained through a sixth CSP2 module, and a convolution operation is performed on the sixth fusion feature map to obtain a fourth detection feature map.
Further, the size of the first detection feature map is 1/4 of the image to be detected, the size of the second detection feature map is 1/8 of the image to be detected, the size of the third detection feature map is 1/16 of the image to be detected, and the size of the fourth detection feature map is 1/32 of the image to be detected.
Further, the attention module CAB is constructed based on a coordinate attention mechanism, and a Mish function is adopted as the activation function of the batch normalization layer.
Further, the preprocessing the streaming media data to obtain an image sequence to be detected includes:
storing the streaming media data into an image sequence according to interval frames;
and carrying out normalization processing on the image sequence to obtain the image sequence to be detected.
Further, the determining whether the fire occurs based on the detection result of the fire detection model includes:
analyzing and obtaining fire category and occurrence probability based on the detection result;
analyzing and comparing the occurrence probability with a threshold value to obtain a predicted voting value;
and judging whether fire occurs or not by using the predicted voting value.
Further, the step of acquiring and preprocessing training data when the fire detection model is trained includes:
acquiring a multi-scene fire image set containing two targets of flame and smoke;
normalizing the images in the fire image set to a preset size, and filling the background with gray to obtain a standard image set;
carrying out affine transformation, perspective transformation and combination transformation on the images in the standard image set to obtain an enhanced image set;
and selecting a certain number of images in the enhanced image set as a training image set.
The technical scheme adopted for solving the technical problems is as follows: the application also provides a fire early warning system based on the YOLOv5 improved algorithm, which is characterized by comprising:
the data acquisition unit is used for acquiring streaming media data;
the data input unit is used for preprocessing the streaming media data to obtain an image sequence to be detected, and sequentially placing images to be detected in the image sequence to be detected into the fire detection unit for detection;
the fire detection unit is used for detecting the image to be detected and obtaining a detection result; it is constructed based on a YOLOv5 improved algorithm and comprises a backbone network for extracting features, a feature fusion module for carrying out multi-scale feature fusion, and four detection heads with different receptive fields, wherein an attention module CAB is arranged in the deep layers of the backbone network;
and a result output unit for judging whether fire occurs based on the detection result.
Advantageous effects
Due to the adoption of the technical scheme, compared with the prior art, the application has the following advantages and positive effects:
(1) According to the application, a fire detection model is constructed based on a YOLOv5 improved algorithm; the model introduces an attention mechanism into the backbone network, which strengthens the weight representation of the target position and thereby improves the average detection precision;
(2) Based on the principle of a bidirectional feature pyramid network, the application converts part of the path aggregation network into bidirectional cross-scale connections, so that features of each scale can be fused better through simple splicing operations;
(3) According to the application, on the basis of the original structure of YOLOv5, a small target detection head is added for focusing and detecting a small target in a visual task, and real-time early warning of early fire is realized through a video frame voting mechanism;
(4) The detection model is constructed based on the dynamic neural network with variable depth and width, and the size of the network model can be adjusted to be deployed to different hardware devices;
(5) The application constructs a multi-scene flame and smoke image data set, and uses a plurality of data enhancement methods to preprocess the image data set, thereby solving the problem of unbalanced flame and smoke targets in the training image.
Drawings
FIG. 1 is a schematic view of a fire detection model of the present application;
FIG. 2 is a block diagram of a CAB module according to the present application;
FIG. 3 is a flow chart of the present application;
FIG. 4 is a flow chart of a training phase in a first embodiment of the application;
FIG. 5 is a graph showing confidence levels of a fire detection model in a first embodiment of the present application;
FIG. 6 is a graph of accuracy versus confidence of a fire detection model in a first embodiment of the present application;
FIG. 7 is a recall graph of a fire detection model in a first embodiment of the application;
FIG. 8 is a recall-confidence curve for a fire detection model of a first embodiment of the present application;
FIG. 9 is a graph comparing the performance of the present application with other methods.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The first embodiment of the application relates to a fire early warning method based on a YOLOv5 improved algorithm, as shown in fig. 3, comprising the following steps:
s1: and acquiring streaming media data through a tunnel monitoring acquisition module.
S2: preprocessing streaming media data to obtain an image sequence to be detected, including:
storing the streaming media data as an image sequence according to the interval frames;
and carrying out normalization processing on the image sequence to obtain an image sequence to be detected.
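By way of illustration only, a minimal sketch of this preprocessing step is given below; it assumes OpenCV is used to read the stream, and the frame interval, target size and function names are illustrative rather than part of the claimed method:

```python
import cv2
import numpy as np

def stream_to_image_sequence(stream_url, frame_interval=25, size=640):
    """Save frames from a streaming source at a fixed interval and normalize them (sketch)."""
    capture = cv2.VideoCapture(stream_url)        # open the RTSP/HTTP stream or video file
    images, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % frame_interval == 0:           # keep one frame per interval
            resized = cv2.resize(frame, (size, size))
            images.append(resized.astype(np.float32) / 255.0)   # scale pixels to [0, 1]
        index += 1
    capture.release()
    return images
```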
S3: and detecting the image sequence to be detected by using a fire detection model to obtain an area containing two targets of flame and smoke and the probability of fire, marking the area in the original image sequence, and finally framing to form a video.
S4: and (3) comparing the fire probability obtained in the step (S3) with a threshold value, deducing N predicted voting values, and judging by utilizing the N voting values to realize early warning of the fire in an initial stage.
Before detection, a fire detection model needs to be built and trained, as shown in fig. 4, and the method comprises the following steps:
step A1, establishing a multi-scene fire data set, preprocessing data to obtain a training sample set { train1, …, train, …, train } and a test sample set { test1, …, test, …, test n };
a2, building a deep learning network model Fire-Yolov5 based on a Yolov5 improvement algorithm;
and step A3, continuously iterating the training minimization loss function to obtain a trained model, and deploying the trained model into an edge server for tunnel monitoring.
Step A1 is specifically described below.
A101: a multi-scene fire Image containing two types of targets of flames and smoke is obtained from an open source data set, image1, … Image … and Image N are produced, sample labels in a uniform format are produced, label1, … Labeli … and LabelN, each Label Labeli represents the center point position coordinate (Xij, YIj) of the jth target in the corresponding sample Image, the width and height (Wij, hij) of the target and the category {0,1}, image represents the ith sample in the data set, i epsilon [0, N ] represents the total number of images, and the category {0,1} represents { flames and smoke }, respectively.
A102: the normalization process was 640 pixels by 640 pixels for each sample in the dataset, with the background gray filled. Scaling of different aspect ratio imagesThe image is scaled to +.>Pixels, where max (w, h) and min (w, h) are the maximum and minimum values between the image width and height, respectively, +.>To round up, the gray fill value is (114,114,114).
A103: the data set after normalization processing is divided into a training set part Train and a Test set part Test, 80% of the data set is selected as a training set and the remaining 20% of the data set is selected as a Test set for each type of image.
A104: and setting a parameter vector for data enhancement, and carrying out affine transformation, perspective transformation and combination transformation on the image samples in the training set to enrich the training sample set.
Step A2 is specifically described below.
Step A201: The depth and width coefficients of the neural network are set so that the size of the network model can be adjusted to suit different hardware platforms. The network depth (number of layers) and the network width (number of output channels) are controlled by a depth factor DM and a width factor WM, respectively: the number of layers of a module is max(round(number × DM), 1), where number is the number of layers of that module and round(·) denotes rounding to the nearest integer; the number of output channels is ⌈channel × WM⌉, where channel is the number of channels of that module and ⌈·⌉ denotes rounding up.
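As a sketch, the scaling described above can be written as follows; note that in the reference YOLOv5 code the channel count is additionally rounded up to a multiple of 8, which is an assumption here rather than something stated in this description:

```python
import math

def scaled_depth(number, dm):
    """Repetitions of a module after applying the depth factor DM."""
    return max(round(number * dm), 1)

def scaled_channels(channel, wm, divisor=8):
    """Output channels after applying the width factor WM, rounded up to a multiple of `divisor`."""
    return int(math.ceil(channel * wm / divisor) * divisor)

# Example: a module with 3 repetitions and 256 channels under DM = 0.33, WM = 0.5
print(scaled_depth(3, 0.33), scaled_channels(256, 0.5))   # -> 1 128
```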
Step A202: a coordinate attention module CAB is constructed based on an improved coordinate attention (Coordinate Attention) mechanism, using the mich activation function as the activation function for the batch normalization layer. As shown in FIG. 2, the module uses two pooling kernels in two spatial ranges to perform one-dimensional feature encoding on each channel along the horizontal coordinate and the vertical coordinate, and the two one-dimensional feature encoding of the c-th channel is output as
Wherein, the liquid crystal display device comprises a liquid crystal display device,output of the c-th channel with height h, -/->Output of the c-th channel with width w, x c (h, i) and x c (j, W) is the value in the feature map vector, W and H are the width and height of the c-th channel, the number of channels is transformed using a1×1 convolution kernel and a dash activation function to obtain global spatial information in the horizontal and vertical directions, the output f=δ (F 1 ([z h ,z w ])),[z h ,z w ]Representing a two-direction tensor stitching operation in the horizontal and vertical directions, the intermediate feature map is split into two independent tensors along the spatial dimension, and the channels are transformed using two 1 x 1 convolutions to conform to the input channels. Conversion process
Wherein f h And f w Representing the output value of the corresponding delta conversion function, F h And F w Representing two 1 x 1 convolution transforms, m representing the Mish activation function, two tensors g obtained h And g w As a weight parameter for attention. The Mish activation function m=x×tanh (ln (1+e) x ) A smooth curve, not completely truncated in the negative portion, allows a smaller negative gradient inflow and more favorable information penetration into the neural network, resulting in higher accuracy and generalization. With the increase of the layer depth, the ReLU activation function can quickly reduce the training precision, and the Mish activation function has comprehensive improvement in the aspects of training stability, average precision, peak value precision and the like. Finally, the output of the attention module is obtained as
Wherein, the liquid crystal display device comprises a liquid crystal display device,and->Is a weight parameter corresponding to the m function.
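A compact PyTorch sketch of a coordinate-attention block along these lines is given below; the reduction ratio is an assumed hyperparameter, and the use of Mish as the gating activation follows the description above (the standard coordinate attention block uses a sigmoid gate instead):

```python
import torch
import torch.nn as nn

class CAB(nn.Module):
    """Coordinate attention block with a Mish activation, as sketched from the description."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along the width  -> (C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along the height -> (C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Mish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                              # (N, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)          # (N, C, W, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = self.act(self.conv_h(f_h))                          # (N, C, H, 1)
        g_w = self.act(self.conv_w(f_w.permute(0, 1, 3, 2)))      # (N, C, 1, W)
        return x * g_h * g_w                               # position-wise reweighting

# Example: a 256-channel deep feature map of size 20 x 20
print(CAB(256)(torch.randn(1, 256, 20, 20)).shape)   # torch.Size([1, 256, 20, 20])
```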
Step A203: The coordinate attention module CAB is used instead of the CSP2_X module in the YOLOv5 backbone network to enhance the weight parameter representation of the region of interest.
Step A204: Feature maps of different scales are fused using Concat-based bidirectional cross-scale links to realize multi-level fusion of semantics. Following the principle of a bidirectional feature pyramid network, the Fire-YOLOv5 algorithm connects input and output nodes of the same level across layers, shortening the path along which low-level semantics are propagated to high levels, and combines adjacent layers by concatenation rather than addition, organically combining rich high-level semantic features with low-level features and markedly improving prediction accuracy. This embodiment adopts a bidirectional cross-scale connection with the fusion weights removed, aiming to improve detection accuracy without affecting the inference speed of the network.
Step A205: A group of small-target anchor boxes and an additional detection head are added, so that pixel-level targets in the original image, whose information would otherwise be lost after 32× downsampling, can be detected. Because repeated downsampling in the network model loses small-target information, and the resolution and context information available to the network model are limited, a group of anchor boxes and a small-target detection layer are added to solve the problem that small fire targets cannot be detected: the feature map output by the 19th layer of the network model, i.e. by the seventh CBS module, is upsampled to obtain a feature map of size 160 × 160, which is spliced with the feature map output by the 3rd layer of the backbone network and then followed by a CSP2_X layer and a convolution layer. The input image size is uniformly adjusted to 640 × 640 pixels; the 160 × 160 feature map is used to detect objects of 4 × 4 pixels and above, the 80 × 80 feature map objects of 8 × 8 pixels and above, the 40 × 40 feature map objects of 16 × 16 pixels and above, and the 20 × 20 feature map objects of 32 × 32 pixels and above. With the added small-target detection layer, the four detection layers cover different receptive fields, realizing fast detection and accurate localization of ultra-small-pixel targets.
More specifically, as shown in fig. 1, the network model Fire-YOLOv5 includes a backbone network for feature extraction, a feature fusion module for multi-scale feature fusion, and four detection heads with different receptive fields. The model is used to do the following:
the image to be detected that is placed into the fire detection model is subjected to feature extraction through two CBS modules, then input into a CSP1_1 module, and a first feature map is output;
the first feature map is subjected to feature extraction through a CBS module and then is input into a CSP1_2 module, and a second feature map is output;
the second feature map is subjected to feature extraction through a CBS module and then is input into a CSP1_3 module, and a third feature map is output;
the third feature map is subjected to feature extraction through a CBS module and then input into an attention module CAB, then input into a CBS module after passing through a bidirectional pyramid network, and a fourth feature map is output as the input of the feature fusion module;
the fourth feature map is subjected to up-sampling and then is subjected to splicing operation with the third feature map, and then a first fusion feature map is obtained through a CSP2_1 module and a CBS module;
the first fusion feature map is subjected to up-sampling and then is subjected to splicing operation with the second feature map, and then the second fusion feature map is obtained through a CSP2_2 module and a CBS module;
the second fusion feature map is subjected to up-sampling and then is subjected to splicing operation with the first feature map, a third fusion feature map is obtained through a CSP2_3 module, and convolution operation is carried out on the third fusion feature map to obtain a first detection feature map, wherein the size of the first detection feature map is 1/4 of that of the image to be detected;
the third fusion feature map passes through a CBS module and is then spliced with the second fusion feature map, a fourth fusion feature map is obtained through a CSP2_4 module, and a convolution operation is performed on the fourth fusion feature map to obtain a second detection feature map, whose size is 1/8 of the image to be detected;
the fourth fusion feature map passes through a CBS module and is then spliced with the first fusion feature map and with the third feature map that has undergone a downsampling operation, a fifth fusion feature map is obtained through a CSP2_5 module, and a convolution operation is performed on the fifth fusion feature map to obtain a third detection feature map, whose size is 1/16 of the image to be detected;
and the fifth fusion feature map passes through a CBS module and is then spliced with the fourth feature map and with the second feature map that has undergone a downsampling operation, a sixth fusion feature map is obtained through a CSP2_6 module, and a convolution operation is performed on the sixth fusion feature map to obtain a fourth detection feature map, whose size is 1/32 of the image to be detected.
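To make the wiring above concrete, the following shape-only sketch traces the spatial sizes through the fusion path for a 640 × 640 input. CBS and CSP modules are replaced by plain pooling or omitted, channel handling is ignored, and where the description leaves a skip connection ambiguous the downsampling factor is simply chosen so that spatial sizes match; only the splice pattern and the four output resolutions are illustrated:

```python
import torch
import torch.nn.functional as F

def up(x):    # 2x nearest-neighbour upsampling
    return F.interpolate(x, scale_factor=2, mode="nearest")

def down(x, factor=2):  # stride-`factor` pooling stands in for a downsampling CBS
    return F.max_pool2d(x, kernel_size=factor)

# Backbone outputs for a 640 x 640 input: first to fourth feature maps
f1 = torch.randn(1, 1, 160, 160)   # 1/4
f2 = torch.randn(1, 1, 80, 80)     # 1/8
f3 = torch.randn(1, 1, 40, 40)     # 1/16
f4 = torch.randn(1, 1, 20, 20)     # 1/32

fuse1 = torch.cat([up(f4), f3], 1)                    # 40 x 40   (first fusion map)
fuse2 = torch.cat([up(fuse1), f2], 1)                 # 80 x 80   (second fusion map)
fuse3 = torch.cat([up(fuse2), f1], 1)                 # 160 x 160 (third fusion map, head 1)
fuse4 = torch.cat([down(fuse3), fuse2], 1)            # 80 x 80   (fourth fusion map, head 2)
fuse5 = torch.cat([down(fuse4), fuse1, f3], 1)        # 40 x 40   (fifth fusion map, head 3)
fuse6 = torch.cat([down(fuse5), f4, down(f2, 4)], 1)  # 20 x 20   (sixth fusion map, head 4)

for name, t in [("head1", fuse3), ("head2", fuse4), ("head3", fuse5), ("head4", fuse6)]:
    print(name, tuple(t.shape[2:]))   # (160, 160) (80, 80) (40, 40) (20, 20)
```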
Step A3 is specifically described below.
S301: setting the maximum iteration number Itera, the learning rate eta, the training batch size B, and inputting B pictures of the training data sets { train1, …, train, … and train } each time, wherein the input number Num isWhere m is the total number of samples in the training dataset. The loss function L is
L=L class +L CIoU +L obj +L noobj
Wherein L is class To classify losses, L CIoU To locate the loss, L obj 、L noobj Positive and negative sample confidence loss, respectively. Definition T is the number of output feature graphs T, S 2 Is the number of grid cells divided by the feature map, N is the number of anchor frames on each grid N, w is the width of the prediction frame, h is the height of the prediction frame, 1 r<4 And (3) judging the condition of positive samples, and setting the ratio of the width and the height of the calibration frame to the width and the height of the predicted mania to be smaller than 4. Error between class of class loss calculation reasoning and corresponding calibration class
Wherein x is i For one of the N calibrated classes, the value {0,1, …, N-1}, y i For the normalized class probability,the probability of the target class is inferred for the network. Calculating error between prediction frame and calibration frame by positioning loss
Wherein, the liquid crystal display device comprises a liquid crystal display device,
w gt is the width of the calibration frame, h gt Is the height of the calibration frame, ioU is the ratio of the intersection union of the calibration frame and the prediction frame, ρ 2 (b,b gt ) Is the center point distance of the calibration frame and the prediction frame. Computing positive sample confidence loss for a network
Computing negative sample confidence loss for a network
Wherein the confidence level of the C calibration, the value {0,1},0 represents not the target, 1 represents the target, gr is the set probability factor,is the confidence of the reasoning, and the confidence of the negative sample is zero.
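For illustration, a sketch of the CIoU localization term in PyTorch is given below; it follows the standard CIoU formulation, while the exact weighting and reduction used in the patented loss are not specified here and remain assumptions:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (x1, y1, x2, y2); returns 1 - CIoU per box (sketch)."""
    # Intersection area and IoU
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared centre distance rho^2 and squared diagonal c^2 of the smallest enclosing box
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2 +
            (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and trade-off coefficient alpha
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)
```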
S302: using gradient descentIterative optimization is carried out on the network by the minimized loss function, an SGD learning optimizer is adopted, and the global initial learning rate is eta, wherein omega t+1 Omega as a network parameter and prediction t Is the current network weight parameter, +.>Is the gradient value for the next iteration.
S303: when the iteration times do not reach the set minimum iteration times Itera, if the loss function L is not reduced any more, stopping training; when the iteration times reach the set minimum iteration times Itera, stopping training to obtain a trained network model; otherwise, continuing to perform iterative optimization.
In a specific implementation, network training and testing were completed using the open-source PyTorch deep learning framework in an Ubuntu 20.04 system environment with CUDA 10.0 and a Python 3.7 programming environment; the hardware platform GPU was an NVIDIA GeForce RTX 2070 Max-Q with 8 GB of video memory, and the CPU was an Intel(R) Core(TM) i-10750H CPU @ 2.60GHz with 12 GB of memory. Owing to hardware limitations, the training batch size was set to 2, the SGD optimizer was adopted, and the global initial learning rate was set to 0.001.
Experimental results show that Fire-YOLOv5x achieves a good balance of performance and efficiency and is more robust in flame and smoke detection tasks, as shown in FIGS. 5-9. The network has 70.7 M parameters, 18.0% fewer than YOLOv5x; the detection precision is 93.5%, 2.0% higher than YOLOv5x; the average detection precision at an IoU threshold of 0.5 is 71.8%, 0.2% higher than that of YOLOv5x; and the inference speed is comparable to that of YOLOv5x. The F1, precision and recall curves of Fire-YOLOv5x show that the average precision and recall of the detected classes reach 93.5% and 96%, respectively, indicating that the new method has higher detection precision and a lower miss rate. Tests on a public data set show that the detection precision of Fire-YOLOv5x is improved by 1.6% and 2% over EfficientDet-D4 and YOLOv5, respectively, the detection recall is improved by 1.7% over EfficientDet-D4, and the average detection precision at an IoU threshold of 0.5 is improved by 14.5% over EfficientDet-D4, with a detection speed comparable to that of EfficientDet-D4. The performance is better than that of existing deep-learning-based flame and smoke detection methods, especially when dealing with ultra-small-pixel and dense fire targets. Video detection results for a tunnel fire show that fast detection and timely early warning of fire can be achieved. The depth and width of the deep neural network model can be adjusted flexibly, so networks of different scales can be trained and deployed on hardware devices with different computing power.
A second embodiment of the present application relates to a YOLOv5 improvement algorithm-based fire early warning system, comprising:
the data acquisition unit is used for acquiring streaming media data;
the data input unit is used for preprocessing streaming media data to obtain an image sequence to be detected, and sequentially placing images to be detected in the image sequence to be detected into the fire detection unit for detection;
the fire detection unit is used for detecting the image to be detected and obtaining a detection result; it is constructed based on a YOLOv5 improved algorithm and comprises a backbone network for extracting features, a fusion module for carrying out multi-scale feature fusion and four detection heads with different receptive fields, wherein an attention module CAB is arranged in the deep layers of the backbone network;
and a result output unit for judging whether a fire occurs based on the detection result.

Claims (10)

1. A fire disaster early warning method based on a YOLOv5 improved algorithm is characterized by comprising the following steps:
s1, obtaining streaming media data;
s2, preprocessing the streaming media data to obtain an image sequence to be detected;
s3, detecting the image sequence to be detected by using a fire detection model to obtain a detection result; the fire detection model is constructed based on a YOLOv5 improvement algorithm and comprises the following steps:
an attention module CAB is placed in the deep layers of the backbone network for feature extraction;
a feature fusion module is constructed to perform multi-scale feature fusion and generate four detection heads with different receptive fields;
s4, judging whether fire occurs or not based on the detection result of the fire detection model.
2. The fire early warning method based on the YOLOv5 improvement algorithm according to claim 1, wherein the feature extraction performed with the attention module CAB placed in the deep layers of the backbone network comprises:
the image to be detected that is placed into the fire detection model is subjected to feature extraction through a first CBS module and a second CBS module, then input into a first CSP1 module, and a first feature map is output;
the first feature map is subjected to feature extraction through a third CBS module and then is input into a second CSP1 module, and a second feature map is output;
the second feature map is subjected to feature extraction through a fourth CBS module and then is input into a third CSP1 module, and a third feature map is output;
and the third feature map is subjected to feature extraction through a fifth CBS module and then input into an attention module CAB to increase the weight of target features, then input into a sixth CBS module after spatial pyramid pooling, and a fourth feature map is output as the input of the feature fusion module.
3. The YOLOv5 improvement algorithm-based fire early warning method of claim 2, wherein the spatial pyramid pooling is performed through a bi-directional pyramid network.
4. The fire early warning method based on the YOLOv5 improvement algorithm according to claim 2, wherein constructing the feature fusion module to perform multi-scale feature fusion and generate four detection heads with different receptive fields comprises:
the fourth feature map is upsampled and then spliced with the third feature map, and a first fusion feature map is obtained through a first CSP2 module and a seventh CBS module;
the first fusion feature map is upsampled and then spliced with the second feature map, and a second fusion feature map is obtained through a second CSP2 module and an eighth CBS module;
the second fusion feature map is upsampled and then spliced with the first feature map, a third fusion feature map is obtained through a third CSP2 module, and a convolution operation is performed on the third fusion feature map to obtain a first detection feature map;
the third fusion feature map passes through a ninth CBS module and is then spliced with the second fusion feature map, a fourth fusion feature map is obtained through a fourth CSP2 module, and a convolution operation is performed on the fourth fusion feature map to obtain a second detection feature map;
the fourth fusion feature map passes through a tenth CBS module and is then spliced with the first fusion feature map and with the third feature map that has undergone a downsampling operation, a fifth fusion feature map is obtained through a fifth CSP2 module, and a convolution operation is performed on the fifth fusion feature map to obtain a third detection feature map;
and the fifth fusion feature map passes through an eleventh CBS module and is then spliced with the fourth feature map and with the second feature map that has undergone a downsampling operation, a sixth fusion feature map is obtained through a sixth CSP2 module, and a convolution operation is performed on the sixth fusion feature map to obtain a fourth detection feature map.
5. The YOLOv5 improvement algorithm-based fire early warning method of claim 4, wherein the first detection feature map has a size of 1/4 of the image to be detected, the second detection feature map has a size of 1/8 of the image to be detected, the third detection feature map has a size of 1/16 of the image to be detected, and the fourth detection feature map has a size of 1/32 of the image to be detected.
6. The fire early warning method based on the YOLOv5 improvement algorithm according to claim 1, wherein the attention module CAB is constructed based on a coordinate attention mechanism, and a Mish function is adopted as the activation function of the batch normalization layer.
7. The fire early warning method based on YOLOv5 improvement algorithm according to claim 1, wherein the preprocessing the streaming media data to obtain the image sequence to be detected comprises:
storing the streaming media data into an image sequence according to interval frames;
and carrying out normalization processing on the image sequence to obtain the image sequence to be detected.
8. The YOLOv5 improvement algorithm-based fire early warning method of claim 1, wherein the determining whether a fire occurs based on the detection result of the fire detection model comprises:
analyzing and obtaining fire category and occurrence probability based on the detection result;
analyzing and comparing the occurrence probability with a threshold value to obtain a predicted voting value;
and judging whether fire occurs or not by using the predicted voting value.
9. The YOLOv5 improvement algorithm-based fire early warning method of claim 1, wherein the step of acquiring training data and preprocessing the training data when the fire detection model is trained comprises:
acquiring a multi-scene fire image set containing two targets of flame and smoke;
normalizing the images in the fire image set to a preset size, and filling the background with gray to obtain a standard image set;
carrying out affine transformation, perspective transformation and combination transformation on the images in the standard image set to obtain an enhanced image set;
and selecting a certain number of images in the enhanced image set as a training image set.
10. A YOLOv5 improvement algorithm-based fire early warning system, comprising:
the data acquisition unit is used for acquiring streaming media data;
the data input unit is used for preprocessing the streaming media data to obtain an image sequence to be detected, and sequentially placing images to be detected in the image sequence to be detected into the fire detection unit for detection;
the fire detection unit is used for detecting the image to be detected and obtaining a detection result; it is constructed based on a YOLOv5 improved algorithm and comprises a backbone network for extracting features, a feature fusion module for carrying out multi-scale feature fusion and four detection heads with different receptive fields, wherein an attention module CAB is arranged in the deep layers of the backbone network;
and a result output unit for judging whether fire occurs based on the detection result.
CN202310756773.1A 2023-06-26 2023-06-26 YOLOv5 improved algorithm-based fire disaster early warning method and system Pending CN116824335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310756773.1A CN116824335A (en) 2023-06-26 2023-06-26 YOLOv5 improved algorithm-based fire disaster early warning method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310756773.1A CN116824335A (en) 2023-06-26 2023-06-26 YOLOv5 improved algorithm-based fire disaster early warning method and system

Publications (1)

Publication Number Publication Date
CN116824335A true CN116824335A (en) 2023-09-29

Family

ID=88114002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310756773.1A Pending CN116824335A (en) 2023-06-26 2023-06-26 YOLOv5 improved algorithm-based fire disaster early warning method and system

Country Status (1)

Country Link
CN (1) CN116824335A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036365A (en) * 2023-10-10 2023-11-10 南京邮电大学 Third molar tooth root number identification method based on deep attention network
CN117253333A (en) * 2023-11-20 2023-12-19 深圳市美安科技有限公司 Fire camera shooting detection device, fire detection alarm method and system


Similar Documents

Publication Publication Date Title
CN111178183B (en) Face detection method and related device
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN109858389B (en) Vertical ladder people counting method and system based on deep learning
Zhan et al. A high-precision forest fire smoke detection approach based on ARGNet
CN116824335A (en) YOLOv5 improved algorithm-based fire disaster early warning method and system
CN105574550A (en) Vehicle identification method and device
CN114220035A (en) Rapid pest detection method based on improved YOLO V4
CN107220603A (en) Vehicle checking method and device based on deep learning
CN110309747A (en) It is a kind of to support multiple dimensioned fast deep pedestrian detection model
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN116343077A (en) Fire detection early warning method based on attention mechanism and multi-scale characteristics
CN101303726A (en) System for tracking infrared human body target based on corpuscle dynamic sampling model
CN115512387A (en) Construction site safety helmet wearing detection method based on improved YOLOV5 model
CN114565842A (en) Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware
CN112232411A (en) Optimization method of HarDNet-Lite on embedded platform
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
CN115424237A (en) Forward vehicle identification and distance detection method based on deep learning
CN112907138B (en) Power grid scene early warning classification method and system from local to whole perception
CN114202803A (en) Multi-stage human body abnormal action detection method based on residual error network
CN113936299A (en) Method for detecting dangerous area in construction site
CN109284752A (en) A kind of rapid detection method of vehicle
CN117011274A (en) Automatic glass bottle detection system and method thereof
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination