CN112052797A - MaskRCNN-based video fire identification method and system - Google Patents

MaskRCNN-based video fire identification method and system

Info

Publication number
CN112052797A
Authority
CN
China
Prior art keywords
maskrcnn
network
target
image
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010931021.0A
Other languages
Chinese (zh)
Other versions
CN112052797B (en)
Inventor
陈锐
钱廷柱
刘洪奎
郭云正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Kedalian Safety Technology Co ltd
Original Assignee
Hefei Kedalian Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Kedalian Safety Technology Co ltd filed Critical Hefei Kedalian Safety Technology Co ltd
Priority to CN202010931021.0A priority Critical patent/CN112052797B/en
Publication of CN112052797A publication Critical patent/CN112052797A/en
Application granted granted Critical
Publication of CN112052797B publication Critical patent/CN112052797B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Fire-Detection Mechanisms (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a MaskRCNN-based video fire identification method and system. A MaskRCNN deep learning model is adopted to detect targets such as smoke and flame in video images: each image in the video stream is passed to the trained MaskRCNN model, which, through a series of operations such as convolution and pooling, fully extracts the feature information in the image and accurately outputs the predicted coordinate information and confidence scores of smoke and flame targets. The MaskRCNN model achieves high accuracy and reliability but still produces a small number of false alarms; to reduce these further, a frame-difference-based dynamic energy detection method filters the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, a deep neural network performs a final classification decision on the image region corresponding to each detected object, further reducing the false alarm rate.

Description

MaskRCNN-based video fire identification method and system
Technical Field
The invention relates to the technical field of fire identification, in particular to a MaskRCNN-based video fire identification method and system.
Background
Currently, existing fire detection methods can be roughly divided into two categories: fire detection methods based on traditional image processing and fire detection methods based on deep learning. Traditional image-processing methods generally rely on the visual characteristics of images. Early researchers proposed modeling the color information in an image to extract suspected flame areas; this approach runs in real time but, because it considers only color features, suffers from low accuracy. Later researchers proposed dynamic background modeling to detect dynamic regions in the video, then obtained candidate regions through a color model, and finally completed the screening using shape features such as morphology and texture. Compared with a pure color model, this approach performs much better, and the dynamic background modeling greatly reduces false alarms from static objects. These traditional image-processing methods all rely to some extent on color features; however, in some real scenes, factors such as camera color cast, overexposure and illumination greatly reduce the reliability of the algorithms. Subsequent researchers introduced machine learning algorithms such as support vector machines and artificial neural networks into fire identification, extracting image features from suspected regions and then classifying the regions with these algorithms. This approach achieves better accuracy under interference factors such as illumination and occlusion, and its performance is superior to most fire detection algorithms. However, these algorithms cannot effectively exploit massive data sets to improve their performance, and they require researchers to design features manually, which is cumbersome. Fire detection methods based on deep learning generally adopt existing mature algorithms such as FasterRCNN, MaskRCNN, SSD and YOLO, perform target detection of flame and smoke directly on a single static image through a convolutional neural network, and locate suspected flame and suspected smoke targets in the image.
For example, the video smoke detection method based on a convolutional neural network disclosed in application No. 201911323715.X first extracts a suspected smoke area and then uses a convolutional neural network for image classification; this method suffers from low detection precision.
In addition, traditional image processing methods require complex manual feature extraction and design, depend strongly on the manually selected features, are generally suitable only for a single specific scene, perform poorly in real complex scenes, and commonly suffer from a low detection rate and severe false alarms.
Disclosure of Invention
The invention aims to solve the technical problem of how to improve the accuracy of video fire identification and reduce the false alarm rate.
The invention solves the technical problems through the following technical means:
the MaskRCNN-based video fire identification method is characterized by comprising the following steps of: the method comprises the following steps:
s01, training a MaskRCNN network to obtain a target MaskRCNN network;
s02, transmitting the images in the video stream to a target MaskRCNN network, and extracting characteristic information in the images to obtain an image classification data set;
s03, constructing an EfficientNet network, initializing the weight of the network, and training by using the image classification data set constructed in the step S02 to obtain a target EfficientNet network;
s04, detecting smoke and fire targets in the video by using a target MaskRCNN network to obtain coordinates, probability values and categories of target frames of suspected smoke and flame;
s05, judging frame difference energy: filtering static-object false alarms by judging the frame difference energy of the target frames obtained in step S04;
and s06, the classification network makes the final decision: the target frames retained after the frame difference energy judgment are classified by the EfficientNet network.
The method adopts a MaskRCNN deep learning model to detect targets such as smoke and flame in video images: each image in the video stream is passed to the trained MaskRCNN model, which, through a series of operations such as convolution and pooling, fully extracts the feature information in the image and accurately outputs the predicted coordinate information and confidence scores of smoke and flame targets. The MaskRCNN model achieves high accuracy and reliability but still produces a small number of false alarms; to reduce these further, a frame-difference-based dynamic energy detection method filters the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, a deep neural network performs a final classification decision on the image region corresponding to each detected object, further reducing the false alarm rate.
Further, the training of the MaskRCNN network in step S01 specifically includes: collecting fire image samples, performing sample labeling, constructing a training data set in which the coordinates and categories of targets such as smoke and flame are labeled, and performing data enhancement on the data set; constructing a MaskRCNN network, initializing its weights, and training it with the constructed training set to obtain the target MaskRCNN network; the sample labeling is specifically: marking the coordinates of the upper left corner (x1, y1) and the lower right corner (x2, y2) of the smoke and flame targets, and performing data enhancement operations on the samples in the image data set, wherein the data enhancement operations comprise random cropping, random brightness jittering, random saturation jittering, random contrast jittering, random hue jittering and Mixup, and the processing formula of Mixup is as follows:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i represents image i to be fused, x_j represents image j to be fused, and y_i and y_j respectively represent the label information of image i and image j;
the training process of the MaskRCNN network comprises the following steps:
step S011, initializing the MaskRCNN network by using the trained network parameters on the COCO data set;
step S012: after the image samples in the training data set are scaled to 1024x1024, extracting an overall feature map of each training sample image by using the ResNet101+FPN network in MaskRCNN;
step S013: inputting the overall feature map into an RPN network, predicting an ROI (region of interest), and selecting positive and negative samples according to the overlapping ratio of a target frame of a candidate region and a labeled target frame;
step S014: ROIAlign pooling calculation is carried out on the ROI area on the feature map corresponding to the positive and negative samples, so as to obtain a candidate area feature map with a fixed size;
when ROIAlign pooling calculation is carried out, firstly, the ROI area target frame is mapped onto the feature map, then the ROI area of the feature map is obtained according to a minimum circumscribed rectangle algorithm, the ROI area is divided into m×m grids, each grid selects 4 points on the feature map for bilinear interpolation, and finally a feature map with the size of m×m is obtained;
step S015: classifying the ROI area feature map after ROIAlign calculation and performing regression calculation on the target frame;
step S016: calculating a loss function of the MaskRCNN, performing gradient calculation on the loss function through a random gradient descent algorithm, and updating the weight of the MaskRCNN network;
step S017: and repeating the steps S012 to S016 until reaching the preset iteration number, stopping training and storing the MaskRCNN network.
Further, the step S02 specifically includes: detecting a large amount of video data and image data by using the target MaskRCNN network, cutting out suspected smoke and flame targets, and constructing an image classification data set comprising 3 categories: smoke, flame and false alarm.
Further, the step S03 is specifically: initializing the parameters of the EfficientNet network with EfficientNet classification network parameters trained on the ImageNet data set, inputting the image classification data set, performing end-to-end training, performing gradient calculation on the classification loss function through the Adam optimization algorithm, updating the parameters of the EfficientNet network, and stopping training after the set number of rounds is completed, to obtain the target EfficientNet network.
Further, the step S04 is specifically: acquiring images from the video stream, scaling the video images to 1024x1024, inputting them into the target MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; and performing NMS non-maximum suppression on all the predicted targets to filter out overlapped invalid target frames.
Further, the step S05 is specifically: for each target frame, cutting out the corresponding image area from each of the nearest N adjacent frames, wherein N is larger than 2; performing frame difference calculation on each pair of adjacent images among the N acquired images, and performing binarization calculation according to a threshold T to obtain N-1 binary images; counting the number of non-zero pixel values in all the binary images and dividing this number by the area of the target frame to obtain the final energy value; if the energy value is larger than the set energy threshold, the target frame is valid, otherwise it is considered a static-object false alarm and discarded.
The invention also provides a MaskRCNN-based video fire identification system, which comprises:
a target MaskRCNN network construction module trains a MaskRCNN network to obtain a target MaskRCNN network;
the image classification data set construction module is used for transmitting the images in the video stream to a target MaskRCNN network, extracting characteristic information in the images and obtaining an image classification data set;
the target EfficientNet network construction module is used for constructing an EfficientNet network, initializing the weight of the network, and training by using the image classification data set constructed in the step S02 to obtain the target EfficientNet network;
the smoke and fire target detection module is used for detecting smoke and fire targets in the video by using a target MaskRCNN network to obtain target frame coordinates, probability values and categories of suspected smoke and flame;
the frame difference energy judgment module is used for filtering static-object false alarms through frame difference energy judgment on the target frames;
and the classification network final decision module is used for classifying, with the EfficientNet network, the target frames retained after the frame difference energy judgment.
Further, the training of the MaskRCNN network in the target MaskRCNN network construction module specifically includes: collecting fire image samples, performing sample labeling, constructing a training data set in which the coordinates and categories of targets such as smoke and flame are labeled, and performing data enhancement on the data set; constructing a MaskRCNN network, initializing its weights, and training it with the constructed training set to obtain the target MaskRCNN network; the sample labeling is specifically: marking the coordinates of the upper left corner (x1, y1) and the lower right corner (x2, y2) of the smoke and flame targets, and performing data enhancement operations on the samples in the image data set, wherein the data enhancement operations comprise random cropping, random brightness jittering, random saturation jittering, random contrast jittering, random hue jittering and Mixup, and the processing formula of Mixup is as follows:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i represents image i to be fused, x_j represents image j to be fused, and y_i and y_j respectively represent the label information of image i and image j;
the training process of the MaskRCNN network comprises the following steps:
step S011, initializing the MaskRCNN network by using the trained network parameters on the COCO data set;
step S012: after the image samples in the training data set are scaled to 1024x1024, extracting an overall feature map of each training sample image by using the ResNet101+FPN network in MaskRCNN;
step S013: inputting the overall feature map into an RPN network, predicting an ROI (region of interest), and selecting positive and negative samples according to the overlapping ratio of a target frame of a candidate region and a labeled target frame;
step S014: ROIAlign pooling calculation is carried out on the ROI area on the feature map corresponding to the positive and negative samples, so as to obtain a candidate area feature map with a fixed size;
when ROIAlign pooling calculation is carried out, firstly, the ROI area target frame is mapped onto the feature map, then the ROI area of the feature map is obtained according to a minimum circumscribed rectangle algorithm, the ROI area is divided into m×m grids, each grid selects 4 points on the feature map for bilinear interpolation, and finally a feature map with the size of m×m is obtained;
step S015: classifying the ROI area feature map after ROIAlign calculation and performing regression calculation on the target frame;
step S016: calculating a loss function of the MaskRCNN, performing gradient calculation on the loss function through a random gradient descent algorithm, and updating the weight of the MaskRCNN network;
step S017: and repeating the steps S012 to S016 until reaching the preset iteration number, stopping training and storing the MaskRCNN network.
Further, the specific execution process of the smoke and fire target detection module is as follows: acquiring images from the video stream, scaling the video images to 1024x1024, inputting them into the target MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; and performing NMS non-maximum suppression on all the predicted targets to filter out overlapped invalid target frames.
Further, the frame difference energy determination module specifically executes the following process: for each target frame, cutting out the corresponding image area from each of the nearest N adjacent frames (N > 3); performing frame difference calculation on each pair of adjacent images among the N acquired images, and performing binarization calculation according to a threshold T to obtain N-1 binary images; counting the number of non-zero pixel values in all the binary images and dividing this number by the area of the target frame to obtain the final energy value; if the energy value is larger than the set energy threshold, the target frame is valid, otherwise it is considered a static-object false alarm and discarded.
The invention has the advantages that:
the method adopts a MaskRCNN deep learning model to detect targets such as smoke, flame and the like in the video image, transmits the image in the video stream to the trained MaskRCNN model, can fully extract the characteristic information in the image through a series of operations such as convolution, pooling and the like, and can accurately output the predicted coordinate information and score results of the targets such as smoke, flame and the like. The MaskRCNN model has higher accuracy and reliability, but a small amount of false alarms exist, and in order to further reduce the false alarms, the embodiment uses a frame difference-based dynamic energy detection method to filter the detection result of the MaskRCNN, so that most of the false alarms of static objects can be removed. Finally, the deep neural network is adopted to carry out the final classification decision on the image area corresponding to the detected object, so that the false alarm rate is further reduced.
Based on a deep learning algorithm, the convolutional neural network autonomously learns the various features of flame and smoke in images, and the learned features allow flame and smoke in the image to be identified and localized, giving the method advantages such as a high detection rate, a low false alarm rate and strong robustness. In actual deployment, with combined software and hardware acceleration, the method can basically meet real-time requirements.
Drawings
FIG. 1 is a flow chart of MaskRCNN network and EfficientNet network training in the embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for video fire identification to detect a fire in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of MaskRCNN network fire detection according to the present invention;
FIG. 4 is a flow chart of dynamic detection in an embodiment of the present invention;
FIG. 5 is a diagram illustrating the effect of the Mixup method in the embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a MaskRCNN-based video fire identification method, which comprises the following steps:
step S1: as shown in fig. 1, collecting fire image samples, performing sample labeling, constructing a training data set, labeling the coordinates and categories of targets such as smoke and flame in the images, and performing data enhancement on the data set; the data enhancement operations include random cropping, random brightness jittering, random saturation jittering, random contrast jittering, random hue jittering, Mixup and the like. In particular, Mixup is an image fusion method in which two different images are fused in a certain ratio, which can effectively increase the diversity of the training data set, as shown in fig. 5;
wherein the coordinates of the upper left corner (x1, y1) and the lower right corner (x2, y2) of the smoke and flame targets need to be marked. The processing formula of Mixup is as follows:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i represents image i to be fused, x_j represents image j to be fused, and y_i and y_j respectively represent the label information (target frame and category information) of image i and image j.
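For concreteness, the Mixup fusion above can be sketched as follows; this is a minimal illustration assuming NumPy arrays and soft (one-hot) category labels, and the helper name and the soft-label simplification are editorial assumptions rather than part of the patent:

```python
import numpy as np

def mixup(img_i, img_j, label_i, label_j, alpha=1.5):
    """Fuse two images and their labels in a Beta-sampled ratio (Mixup)."""
    lam = np.random.beta(alpha, alpha)   # lambda ~ Beta(1.5, 1.5)
    # Images are float arrays of identical shape; pixels are blended linearly
    fused_img = lam * img_i + (1.0 - lam) * img_j
    # Labels are blended with the same ratio (soft labels); for detection,
    # the target-frame annotations of both images would also be merged
    fused_label = lam * label_i + (1.0 - lam) * label_j
    return fused_img, fused_label
```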
Step S2: as shown in fig. 2, a MaskRCNN network is constructed, the weights of the network are initialized, the MaskRCNN network is trained by using the training set constructed in step S1, and after 100 rounds of training, the training is stopped;
the MaskRCNN model training process is specifically as follows:
s2.1, initializing parameters of the MaskRCNN network by using model parameters trained on a COCO data set;
step S2.2: after the image samples in the training data set are scaled to 1024x1024, extracting an overall feature map of each training sample image by using the ResNet101+FPN network in MaskRCNN;
step S2.3: inputting the overall feature map into an RPN network, predicting a candidate Region (ROI), and selecting positive and negative samples according to the overlapping ratio of a target frame of the candidate region and a labeled target frame;
step S2.4: ROIAlign pooling calculation is carried out on the ROI area on the feature map corresponding to the positive and negative samples, so as to obtain a candidate area feature map with a fixed size;
when ROIAlign pooling calculation is carried out, firstly, the candidate region target frame is mapped onto the feature map, then the ROI region of the feature map is obtained according to a minimum circumscribed rectangle algorithm, the ROI region is divided into m×m grids, each grid selects 4 points on the feature map for bilinear interpolation, and finally a feature map with the size of m×m is obtained (see the ROIAlign sketch after step S2.7);
step S2.5: classifying the candidate region feature map after ROIAlign calculation and performing regression calculation on the target frame;
step S2.6: calculating a loss function of the MaskRCNN, performing gradient calculation on the loss function through a random gradient descent algorithm, and updating the weight of the MaskRCNN network;
step S2.7: repeating the steps S2.2 to S2.6 until the preset iteration times are reached, stopping training, and storing the MaskRCNN network model;
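As referenced in step S2.4, the ROIAlign pooling described above can be illustrated with torchvision's roi_align operator. This is a hedged sketch: the feature-map shape, the stride-4 scale and the m = 7 grid size are illustrative assumptions, not values fixed by the patent:

```python
import torch
from torchvision.ops import roi_align

# One FPN feature level for a 1024x1024 input at stride 4 (assumed values)
feature_map = torch.randn(1, 256, 256, 256)

# A single ROI in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0.0, 100.0, 150.0, 420.0, 600.0]])

# m x m output grid (m = 7 here); spatial_scale maps image coordinates onto
# the stride-4 feature map; sampling_ratio=2 takes 2x2 = 4 bilinear sampling
# points per grid cell, matching the 4-point interpolation described above
pooled = roi_align(feature_map, rois, output_size=(7, 7),
                   spatial_scale=0.25, sampling_ratio=2)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```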
step S3: as shown in fig. 1, a trained MaskRCNN model is used to detect a large amount of video data and image data, and suspected smoke and flame targets are cut out, so as to construct an image classification dataset including 3 categories of smoke, flame and false alarm;
step S4: as shown in fig. 1, an EfficientNet network is constructed, a pre-trained model on an ImageNet dataset is used to perform parameter initialization on the EfficientNet network, the image classification dataset constructed in step S3 is used for training, and after 50 rounds of training, the training is stopped;
initializing the parameters of the EfficientNet network with EfficientNet classification model parameters trained on the ImageNet data set, inputting the training data images, performing end-to-end training, performing gradient calculation on the classification loss function through the Adam optimization algorithm, updating the parameters of the EfficientNet network, and stopping training after 50 rounds of training are completed;
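A minimal fine-tuning sketch of this step, assuming PyTorch and torchvision's EfficientNet-B0 as a stand-in for the unspecified EfficientNet variant; the three-class head (smoke / flame / false alarm) and Adam follow the text, while the learning rate and everything else are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B0; replace the 1000-way head with the
# 3 categories used here: smoke, flame and false alarm
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_one_epoch(loader):
    # loader is a hypothetical DataLoader yielding (image batch, label batch)
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # classification loss
        loss.backward()                          # gradient calculation
        optimizer.step()                         # Adam parameter update
```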
as shown in fig. 2, after the target MaskRCNN network and the target EfficientNet network are obtained, the method for identifying the video fire specifically includes the following steps:
step S5, as shown in fig. 3: a single frame image obtained by decoding the video stream is preprocessed; the preprocessed image is passed through the ResNet101+FPN network in MaskRCNN for overall feature extraction; the feature map is input into the RPN network to predict candidate regions; ROIAlign pooling calculation on the candidate region feature maps outputs candidate feature maps of fixed size; and finally the classification and regression network in MaskRCNN classifies the candidate feature maps and regresses the target frames. Considering that the MaskRCNN predictions contain strongly overlapping targets, NMS calculation is applied to the MaskRCNN predictions to suppress targets whose overlap region is too large.
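The NMS step can be sketched with torchvision's nms operator; the boxes, scores and the 0.5 IoU threshold below are dummy illustrative values, not parameters stated in the patent:

```python
import torch
from torchvision.ops import nms

# Predicted target frames (x1, y1, x2, y2) and their probability values
boxes = torch.tensor([[100., 100., 300., 300.],
                      [110., 105., 310., 305.],   # heavily overlaps the first
                      [500., 400., 640., 560.]])
scores = torch.tensor([0.92, 0.88, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)   # indices of retained boxes
boxes, scores = boxes[keep], scores[keep]      # overlapped duplicates removed
```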
Step S6, frame difference energy determination: after MaskRCNN detection and NMS calculation are completed, dynamic detection is performed on each suspected target region, as shown in fig. 4. First, the target ROI region in N consecutive adjacent frames (N > 2) is obtained, and frame difference calculation is performed between each pair of adjacent frames. Pixels whose absolute frame-difference value is larger than the threshold T are set to 1 and the remaining pixels are set to 0, yielding binary images; the number of non-zero pixels in all the binary images is counted and divided by the area of the target frame to obtain the final energy value. If the energy value is larger than the set threshold, the target frame is judged to be dynamic and is valid; otherwise it is judged to be static and is discarded.
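A minimal sketch of this frame-difference energy computation, assuming OpenCV and grayscale crops of the same target region from N consecutive frames; the function name and the example threshold T = 15 are illustrative assumptions:

```python
import cv2
import numpy as np

def frame_diff_energy(crops, T=15):
    """crops: list of N (N > 2) uint8 grayscale patches of one target region,
    taken from N consecutive frames. Returns the energy value."""
    nonzero = 0
    for prev, curr in zip(crops, crops[1:]):      # N - 1 adjacent pairs
        diff = cv2.absdiff(prev, curr)            # frame difference
        # pixels with |difference| > T become 1, the rest 0 (binary image)
        _, binary = cv2.threshold(diff, T, 1, cv2.THRESH_BINARY)
        nonzero += int(np.count_nonzero(binary))  # accumulate non-zero pixels
    area = crops[0].shape[0] * crops[0].shape[1]  # target-frame area
    return nonzero / area

# A target frame is kept only if frame_diff_energy(crops) exceeds the preset
# energy threshold; otherwise it is discarded as a static-object false alarm.
```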
Step S7, the classification network makes the final decision: if a suspected target region is judged to be dynamic, further judgment is needed. The suspected target region image is input into the target EfficientNet network, which classifies the image as false alarm, smoke or flame. If smoke or flame is predicted, an alarm is raised directly; otherwise no alarm is raised and the process proceeds to the detection of the next frame.
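The final decision of step S7, using the 128x128 input size mentioned for the classification module later in the text, might look like the following sketch; the class ordering, the simple 0-1 normalization and the helper name are assumptions for illustration:

```python
import cv2
import torch

CLASSES = ("false_alarm", "smoke", "flame")  # assumed class order

def final_decision(frame, box, model):
    """Crop a retained target frame, resize it to 128x128 and classify it
    with the trained EfficientNet; returns True if an alarm should be raised."""
    x1, y1, x2, y2 = box
    crop = cv2.resize(frame[y1:y2, x1:x2], (128, 128))
    tensor = (torch.from_numpy(crop).permute(2, 0, 1)  # HWC -> CHW
                   .float().unsqueeze(0) / 255.0)      # simple 0-1 scaling
    model.eval()
    with torch.no_grad():
        pred = CLASSES[model(tensor).argmax(dim=1).item()]
    return pred in ("smoke", "flame")  # alarm only on smoke or flame
```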
The embodiment provides a deep-learning-based video fire detection method that adopts a MaskRCNN deep learning model to detect targets such as smoke and flame in video images: each image in the video stream is passed to the trained MaskRCNN model, which, through a series of operations such as convolution and pooling, fully extracts the feature information in the image and accurately outputs the predicted coordinate information and confidence scores of smoke and flame targets. The MaskRCNN model achieves high accuracy and reliability but still produces a small number of false alarms; to reduce these further, the embodiment uses a frame-difference-based dynamic energy detection method to filter the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, a deep neural network performs the final classification decision on the image region corresponding to each detected object, further reducing the false alarm rate.
Considering that the MaskRCNN algorithm requires the training data to provide mask labeling information, while this embodiment does not need a semantic segmentation function, the embodiment removes the semantic-segmentation branch from the MaskRCNN network.
The embodiment also provides a video fire identification system based on MaskRCNN, including:
the target MaskRCNN network construction module collects fire image samples, carries out sample marking, constructs a training data set, needs to mark coordinates and types of targets such as smoke, flame and the like in the images, and carries out data enhancement on the data set;
wherein the coordinates of the upper left corner (x1, y1) and the lower right corner (x2, y2) of the smoke and flame targets need to be marked, and the data enhancement operations on the samples in the image data set include random cropping, random brightness jittering, random saturation jittering, random contrast jittering, random hue jittering, Mixup and the like. The processing formula of Mixup is as follows:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i represents image i to be fused, x_j represents image j to be fused, and y_i and y_j respectively represent the label information (target frame and category information) of image i and image j.
Constructing a MaskRCNN network, initializing the weight of the network, training the MaskRCNN network by using the training set constructed in the step S1, and stopping training after 100 rounds of training;
the MaskRCNN model training process is specifically as follows:
s2.1, initializing parameters of the MaskRCNN network by using model parameters trained on a COCO data set;
step S2.2: after the image samples in the training data set are scaled to 1024x1024, extracting an overall feature map of each training sample image by using the ResNet101+FPN network in MaskRCNN;
step S2.3: inputting the overall feature map into an RPN network, predicting a candidate Region (ROI), and selecting positive and negative samples according to the overlapping ratio of a target frame of the candidate region and a labeled target frame;
step S2.4: ROIAlign pooling calculation is carried out on the ROI area on the feature map corresponding to the positive and negative samples, so as to obtain a candidate area feature map with a fixed size;
when ROIAlign pooling calculation is carried out, firstly, the candidate region target frame is mapped onto the feature map, then the ROI region of the feature map is obtained according to a minimum circumscribed rectangle algorithm, the ROI region is divided into m×m grids, each grid selects 4 points on the feature map for bilinear interpolation, and finally a feature map with the size of m×m is obtained;
step S2.5: classifying the candidate region feature map after ROIAlign calculation and performing regression calculation on the target frame;
step S2.6: calculating a loss function of the MaskRCNN, performing gradient calculation on the loss function through a random gradient descent algorithm, and updating the weight of the MaskRCNN network;
step S2.7: repeating the steps S2.2 to S2.6 until the preset iteration times are reached, stopping training, and storing the MaskRCNN network model;
the image classification data set construction module is used for detecting a large amount of video data and image data by using a trained MaskRCNN model, cutting out a suspected smoke and flame target, and constructing an image classification data set with the 3 categories of smoke, flame and false alarm;
the target EfficientNet network construction module constructs an EfficientNet network, initializes the weights of the network, trains it with the image classification data set constructed in step S3, and stops training after 50 rounds of training;
initializing the parameters of the EfficientNet network with EfficientNet classification model parameters trained on the ImageNet data set, inputting the training data images, performing end-to-end training, performing gradient calculation on the classification loss function through the Adam optimization algorithm, updating the parameters of the EfficientNet network, and stopping training after 50 rounds of training are completed;
after the target MaskRCNN network and the target EfficientNet network are obtained, the video fire identification process specifically comprises:
the smoke and fire target detection module, in which MaskRCNN detects smoke and fire targets in the video: acquiring images from the video stream, scaling the video images to 1024x1024, inputting them into the MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; NMS non-maximum suppression is then performed on all the predicted targets, filtering out a large number of overlapped invalid target frames;
the frame difference energy judging module, which cuts out the image area corresponding to each target frame (assuming the target frame has width w and height h) in each of the nearest N adjacent frames (N > 3) according to the target frames obtained in the previous step, performs frame difference calculation on each pair of adjacent images, performs binarization calculation according to the threshold T to obtain N-1 binary images, counts the number of non-zero pixel values in all the binary images, and divides this number by the area of the target frame to obtain the final energy value; if the energy value is larger than the set energy threshold, the target frame is valid, otherwise it is considered a static-object false alarm and discarded;
and the classification network final decision module, which directly crops each target retained after the frame difference energy judgment using its target frame, scales it to 128x128, and feeds it to the EfficientNet network for final classification; if the prediction is smoke or flame, an alarm is raised, otherwise no alarm is raised.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A MaskRCNN-based video fire identification method, characterized by comprising the following steps:
s01, training a MaskRCNN network to obtain a target MaskRCNN network;
s02, transmitting the images in the video stream to a target MaskRCNN network, and extracting characteristic information in the images to obtain an image classification data set;
s03, constructing an EfficientNet network, initializing the weight of the network, and training by using the image classification data set constructed in the step S02 to obtain a target EfficientNet network;
s04, detecting smoke and fire targets in the video by using a target MaskRCNN network to obtain coordinates, probability values and categories of target frames of suspected smoke and flame;
s05, judging frame difference energy: filtering static-object false alarms by judging the frame difference energy of the target frames obtained in step S04;
and s06, the classification network makes the final decision: the target frames retained after the frame difference energy judgment are classified by the EfficientNet network.
2. The MaskRCNN-based video fire identification method according to claim 1, wherein: the training of the MaskRCNN network in step S01 specifically includes: collecting fire image samples, performing sample labeling, constructing a training data set in which the coordinates and categories of targets such as smoke and flame are labeled, and performing data enhancement on the data set; constructing a MaskRCNN network, initializing its weights, and training it with the constructed training set to obtain the target MaskRCNN network; the sample labeling is specifically: marking the coordinates of the upper left corner (x1, y1) and the lower right corner (x2, y2) of the smoke and flame targets, and performing data enhancement operations on the samples in the image data set, wherein the data enhancement operations comprise random cropping, random brightness jittering, random saturation jittering, random contrast jittering, random hue jittering and Mixup, and the processing formula of Mixup is as follows:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i represents image i to be fused, x_j represents image j to be fused, and y_i and y_j respectively represent the label information of image i and image j;
the training process of the MaskRCNN network comprises the following steps:
step S011, initializing the MaskRCNN network by using the trained network parameters on the COCO data set;
step S012: after the image samples in the training data set are scaled to 1024x1024, extracting an overall feature map of each training sample image by using the ResNet101+FPN network in MaskRCNN;
step S013: inputting the overall feature map into an RPN network, predicting an ROI (region of interest), and selecting positive and negative samples according to the overlapping ratio of a target frame of a candidate region and a labeled target frame;
step S014: ROIAlign pooling calculation is carried out on the ROI area on the feature map corresponding to the positive and negative samples, so as to obtain a candidate area feature map with a fixed size;
when ROIAlign pooling calculation is carried out, firstly, the ROI area target frame is mapped onto the feature map, then the ROI area of the feature map is obtained according to a minimum circumscribed rectangle algorithm, the ROI area is divided into m×m grids, each grid selects 4 points on the feature map for bilinear interpolation, and finally a feature map with the size of m×m is obtained;
step S015: classifying the ROI area feature map after ROIAlign calculation and performing regression calculation on the target frame;
step S016: calculating a loss function of the MaskRCNN, performing gradient calculation on the loss function through a random gradient descent algorithm, and updating the weight of the MaskRCNN network;
step S017: and repeating the steps S012 to S016 until reaching the preset iteration number, stopping training and storing the MaskRCNN network.
3. The MaskRCNN-based video fire identification method according to claim 2, wherein: the step S02 specifically includes: detecting a large amount of video data and image data by using the target MaskRCNN network, cutting out suspected smoke and flame targets, and constructing an image classification data set comprising 3 categories: smoke, flame and false alarm.
4. The MaskRCNN-based video fire identification method according to claim 1, wherein: the step S03 specifically includes: initializing the parameters of the EfficientNet network with EfficientNet classification network parameters trained on the ImageNet data set, inputting the image classification data set, performing end-to-end training, performing gradient calculation on the classification loss function through the Adam optimization algorithm, updating the parameters of the EfficientNet network, and stopping training after the set number of rounds is completed, to obtain the target EfficientNet network.
5. The MaskRCNN-based video fire identification method according to claim 1, wherein: the step S04 specifically includes: acquiring images from the video stream, scaling the video images to 1024x1024, inputting them into the target MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; and performing NMS non-maximum suppression on all the predicted targets to filter out overlapped invalid target frames.
6. The MaskRCNN-based video fire identification method according to claim 1, wherein: the step S05 specifically includes: for each target frame, cutting out the corresponding image area from each of the nearest N adjacent frames, wherein N is larger than 2; performing frame difference calculation on each pair of adjacent images among the N acquired images, and performing binarization calculation according to a threshold T to obtain N-1 binary images; counting the number of non-zero pixel values in all the binary images and dividing this number by the area of the target frame to obtain the final energy value; if the energy value is larger than the set energy threshold, the target frame is valid, otherwise it is considered a static-object false alarm and discarded.
7. A MaskRCNN-based video fire identification system, characterized by comprising:
a target MaskRCNN network construction module trains a MaskRCNN network to obtain a target MaskRCNN network;
the image classification data set construction module is used for transmitting the images in the video stream to a target MaskRCNN network, extracting characteristic information in the images and obtaining an image classification data set;
the target EfficientNet network construction module is used for constructing an EfficientNet network, initializing the weight of the network, and training by using the image classification data set constructed in the step S02 to obtain the target EfficientNet network;
the smoke and fire target detection module is used for detecting smoke and fire targets in the video by using a target MaskRCNN network to obtain target frame coordinates, probability values and categories of suspected smoke and flame;
the frame difference energy judgment module is used for filtering static-object false alarms through frame difference energy judgment on the target frames;
and the classification network final decision module is used for classifying, with the EfficientNet network, the target frames retained after the frame difference energy judgment.
8. The MaskRCNN-based video fire identification system according to claim 7, wherein: the training of the MaskRCNN network in the target MaskRCNN network construction module specifically includes: collecting fire image samples, performing sample labeling, constructing a training data set in which the coordinates and categories of targets such as smoke and flame are labeled, and performing data enhancement on the data set; constructing a MaskRCNN network, initializing its weights, and training it with the constructed training set to obtain the target MaskRCNN network; the sample labeling is specifically: marking the coordinates of the upper left corner (x1, y1) and the lower right corner (x2, y2) of the smoke and flame targets, and performing data enhancement operations on the samples in the image data set, wherein the data enhancement operations comprise random cropping, random brightness jittering, random saturation jittering, random contrast jittering, random hue jittering and Mixup, and the processing formula of Mixup is as follows:
x̃ = λ·x_i + (1 − λ)·x_j
ỹ = λ·y_i + (1 − λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i represents image i to be fused, x_j represents image j to be fused, and y_i and y_j respectively represent the label information of image i and image j;
the training process of the MaskRCNN network comprises the following steps:
step S011, initializing the MaskRCNN network by using the trained network parameters on the COCO data set;
step S012: after the image samples in the training data set are scaled to 1024x1024, extracting an overall feature map of each training sample image by using the ResNet101+FPN network in MaskRCNN;
step S013: inputting the overall feature map into an RPN network, predicting an ROI (region of interest), and selecting positive and negative samples according to the overlapping ratio of a target frame of a candidate region and a labeled target frame;
step S014: ROIAlign pooling calculation is carried out on the ROI area on the feature map corresponding to the positive and negative samples, so as to obtain a candidate area feature map with a fixed size;
when ROIAlign pooling calculation is carried out, firstly, the ROI area target frame is mapped onto the feature map, then the ROI area of the feature map is obtained according to a minimum circumscribed rectangle algorithm, the ROI area is divided into m×m grids, each grid selects 4 points on the feature map for bilinear interpolation, and finally a feature map with the size of m×m is obtained;
step S015: classifying the ROI area feature map after ROIAlign calculation and performing regression calculation on the target frame;
step S016: calculating a loss function of the MaskRCNN, performing gradient calculation on the loss function through a random gradient descent algorithm, and updating the weight of the MaskRCNN network;
step S017: and repeating the steps S012 to S016 until reaching the preset iteration number, stopping training and storing the MaskRCNN network.
9. The MaskRCNN-based video fire identification system according to claim 7, wherein: the specific execution process of the smoke and fire target detection module is as follows: acquiring images from the video stream, scaling the video images to 1024x1024, inputting them into the target MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; and performing NMS non-maximum suppression on all the predicted targets to filter out overlapped invalid target frames.
10. The MaskRCNN-based video fire identification system according to claim 7, wherein: the frame difference energy judging module specifically executes the following process: for each target frame, cutting out the corresponding image area from each of the nearest N adjacent frames (N > 3); performing frame difference calculation on each pair of adjacent images among the N acquired images, and performing binarization calculation according to a threshold T to obtain N-1 binary images; counting the number of non-zero pixel values in all the binary images and dividing this number by the area of the target frame to obtain the final energy value; if the energy value is larger than the set energy threshold, the target frame is valid, otherwise it is considered a static-object false alarm and discarded.
CN202010931021.0A 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system Active CN112052797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010931021.0A CN112052797B (en) 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010931021.0A CN112052797B (en) 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Publications (2)

Publication Number Publication Date
CN112052797A 2020-12-08
CN112052797B (en) 2024-07-16

Family

ID=73609865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010931021.0A Active CN112052797B (en) 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Country Status (1)

Country Link
CN (1) CN112052797B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089593A1 (en) * 2016-09-26 2018-03-29 Acusense Technologies, Inc. Method and system for an end-to-end artificial intelligence workflow
CN109903507A (en) * 2019-03-04 2019-06-18 上海海事大学 Intelligent fire monitoring system and method based on deep learning
CN109977790A (en) * 2019-03-04 2019-07-05 浙江工业大学 Video smoke detection and recognition method based on transfer learning
CN110232380A (en) * 2019-06-13 2019-09-13 应急管理部天津消防研究所 Fire night scene restoration method based on Mask R-CNN neural network
CN111539265A (en) * 2020-04-02 2020-08-14 申龙电梯股份有限公司 Method for detecting abnormal behaviors in elevator car

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
REDDY, N. DINESH: "Occlusion-Net: 2D/3D Occluded Keypoint Localization Using Graph Networks", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 16 July 2020 (2020-07-16) *
ZHOU, Yuxin; BAI, Hongyang; LI, Wei; GUO, Hongwei; XU, Xiaokang: "Research on a Lightweight Action Recognition Method Based on Key Frames", Chinese Journal of Scientific Instrument, no. 07, pages 3 *
ZHA, Weiwei; BAI, Tian: "Real-Time Vehicle Detection, Classification and Traffic Flow Statistics Algorithm for Highway Video", Information Technology and Network Security, no. 03 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699801A (en) * 2020-12-30 2021-04-23 上海船舶电子设备研究所(中国船舶重工集团公司第七二六研究所) Fire identification method and system based on video image
CN112633231A (en) * 2020-12-30 2021-04-09 珠海大横琴科技发展有限公司 Fire disaster identification method and device
CN112686190A (en) * 2021-01-05 2021-04-20 北京林业大学 Forest fire smoke automatic identification method based on self-adaptive target detection
CN113627223A (en) * 2021-01-07 2021-11-09 广州中国科学院软件应用技术研究所 Flame detection algorithm based on deep learning target detection and classification technology
CN112861635B (en) * 2021-01-11 2024-05-14 西北工业大学 Fire disaster and smoke real-time detection method based on deep learning
CN112861635A (en) * 2021-01-11 2021-05-28 西北工业大学 Fire and smoke real-time detection method based on deep learning
CN112907885A (en) * 2021-01-12 2021-06-04 中国计量大学 Distributed centralized household image fire alarm system and method based on SCNN
CN112800929A (en) * 2021-01-25 2021-05-14 安徽农业大学 On-line monitoring method for bamboo shoot quantity and high growth rate based on deep learning
CN113762314A (en) * 2021-02-02 2021-12-07 北京京东振世信息技术有限公司 Smoke and fire detection method and device
CN113762314B (en) * 2021-02-02 2023-11-03 北京京东振世信息技术有限公司 Smoke and fire detection method and device
CN112907886A (en) * 2021-02-07 2021-06-04 中国石油化工股份有限公司 Refinery plant fire identification method based on convolutional neural network
CN112819001B (en) * 2021-03-05 2024-02-23 浙江中烟工业有限责任公司 Complex scene cigarette packet recognition method and device based on deep learning
CN112819001A (en) * 2021-03-05 2021-05-18 浙江中烟工业有限责任公司 Complex scene cigarette packet identification method and device based on deep learning
CN113033553A (en) * 2021-03-22 2021-06-25 深圳市安软科技股份有限公司 Fire detection method and device based on multi-mode fusion, related equipment and storage medium
CN113033553B (en) * 2021-03-22 2023-05-12 深圳市安软科技股份有限公司 Multi-mode fusion fire detection method, device, related equipment and storage medium
CN113192038B (en) * 2021-05-07 2022-08-19 北京科技大学 Method for recognizing and monitoring abnormal smoke and fire in existing flame environment based on deep learning
CN113192038A (en) * 2021-05-07 2021-07-30 北京科技大学 Method for identifying and monitoring abnormal smoke and fire in existing flame environment based on deep learning
CN113409923A (en) * 2021-05-25 2021-09-17 济南大学 Error correction method and system in bone marrow image individual cell automatic marking
CN113409923B (en) * 2021-05-25 2022-03-04 济南大学 Error correction method and system in bone marrow image individual cell automatic marking
CN113553985A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 High-altitude smoke detection and identification method based on artificial intelligence, storage device and server
CN113657233A (en) * 2021-08-10 2021-11-16 东华大学 Unmanned aerial vehicle forest fire smoke detection method based on computer vision
CN113657250A (en) * 2021-08-16 2021-11-16 南京图菱视频科技有限公司 Flame detection method and system based on monitoring video
CN113792684A (en) * 2021-09-17 2021-12-14 中国科学技术大学 Fire-fighting robot multi-mode visual flame detection method under weak alignment condition
CN113792684B (en) * 2021-09-17 2024-03-29 中国科学技术大学 Multi-mode visual flame detection method for fire-fighting robot under weak alignment condition
CN113947744A (en) * 2021-10-26 2022-01-18 华能盐城大丰新能源发电有限责任公司 Fire image detection method, system, equipment and storage medium based on video
CN114120171A (en) * 2021-10-28 2022-03-01 华能盐城大丰新能源发电有限责任公司 Fire smoke detection method, device and equipment based on video frame and storage medium
CN114120208A (en) * 2022-01-27 2022-03-01 青岛海尔工业智能研究院有限公司 Flame detection method, device, equipment and storage medium
CN114664047A (en) * 2022-05-26 2022-06-24 长沙海信智能系统研究院有限公司 Expressway fire identification method and device and electronic equipment
CN115170894A (en) * 2022-09-05 2022-10-11 深圳比特微电子科技有限公司 Smoke and fire detection method and device
CN115862258A (en) * 2022-11-22 2023-03-28 中国科学院合肥物质科学研究院 Fire monitoring and handling system, method, equipment and storage medium
CN115862258B (en) * 2022-11-22 2023-09-22 中国科学院合肥物质科学研究院 Fire monitoring and handling system, method, equipment and storage medium
CN115861898A (en) * 2022-12-27 2023-03-28 浙江创悦诚科技有限公司 Flame smoke identification method applied to gas field station
CN117011785A (en) * 2023-07-06 2023-11-07 华新水泥股份有限公司 Smoke and fire detection method, device and system based on space-time correlation and Gaussian heat map
CN117011785B (en) * 2023-07-06 2024-04-05 华新水泥股份有限公司 Smoke and fire detection method, device and system based on space-time correlation and Gaussian heat map

Also Published As

Publication number Publication date
CN112052797B (en) 2024-07-16

Similar Documents

Publication Publication Date Title
CN112052797B (en) MaskRCNN-based video fire disaster identification method and system
CN109147254B (en) Video field fire smoke real-time detection method based on convolutional neural network
CN111223088B (en) Casting surface defect identification method based on deep convolutional neural network
CN110298297B (en) Flame identification method and device
CN111598066A (en) Helmet wearing identification method based on cascade prediction
CN111222478A (en) Construction site safety protection detection method and system
CN112598713A (en) Offshore submarine fish detection and tracking statistical method based on deep learning
CN105404847A (en) Real-time detection method for object left behind
CN114639075B (en) Method and system for identifying objects thrown from height, and computer readable medium
CN110490043A (en) Forest smoke and fire detection method based on region division and feature extraction
CN112364740B (en) Unmanned aerial vehicle room monitoring method and system based on computer vision
CN113449606B (en) Target object identification method and device, computer equipment and storage medium
WO2023273010A1 (en) High-rise littering detection method, apparatus, and device, and computer storage medium
CN113033523B (en) Method and system for constructing falling judgment model and falling judgment method and system
CN107944403A (en) Method and device for detecting pedestrian attributes in images
CN111062974A (en) Method and system for extracting foreground target by removing ghost
CN114399734A (en) Forest fire early warning method based on visual information
CN112115878B (en) Forest fire smoke root node detection method based on smoke area density
CN112417955A (en) Patrol video stream processing method and device
CN113177439B (en) Pedestrian crossing road guardrail detection method
CN115311623A (en) Equipment oil leakage detection method and system based on infrared thermal imaging
KR101690050B1 (en) Intelligent video security system
CN111275733A (en) Method for rapid tracking of multiple ships based on deep learning target detection
CN117475353A (en) Video-based abnormal smoke identification method and system
CN117557937A (en) Monitoring camera image anomaly detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant