CN112052797B - MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system - Google Patents

MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Info

Publication number
CN112052797B
CN112052797B CN202010931021.0A
Authority
CN
China
Prior art keywords
maskrcnn
network
target
image
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010931021.0A
Other languages
Chinese (zh)
Other versions
CN112052797A (en)
Inventor
陈锐
钱廷柱
刘洪奎
郭云正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Kedalian Safety Technology Co ltd
Original Assignee
Hefei Kedalian Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Kedalian Safety Technology Co ltd filed Critical Hefei Kedalian Safety Technology Co ltd
Priority to CN202010931021.0A priority Critical patent/CN112052797B/en
Publication of CN112052797A publication Critical patent/CN112052797A/en
Application granted granted Critical
Publication of CN112052797B publication Critical patent/CN112052797B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Fire-Detection Mechanisms (AREA)

Abstract

The invention provides a MaskRCNN-based video fire identification method and system. A MaskRCNN deep learning model detects smoke and flame targets in video images: frames from the video stream are fed into the trained MaskRCNN model, which fully extracts the feature information in the images through a series of convolution and pooling operations and accurately outputs the coordinate information and confidence scores of the predicted smoke and flame targets. Although the MaskRCNN model offers high accuracy and reliability, it still produces a small number of false alarms; to reduce these further, a frame-difference-based dynamic energy detection method filters the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, a deep neural network makes the final classification decision on the image region of each detected target, further reducing the false alarm rate.

Description

MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system
Technical Field
The invention relates to the technical field of fire disaster identification, in particular to a MaskRCNN-based video fire disaster identification method and system.
Background
Currently, existing fire detection methods can be broadly divided into two categories: those based on traditional image processing and those based on deep learning. Early researchers, working from the visual characteristics of images, proposed modeling the color information in an image to extract suspected flame regions; this approach runs in real time but, because it considers only color features, suffers from low accuracy. Later researchers proposed detecting dynamic regions in video through dynamic background modeling, obtaining candidate regions with a color model, and completing the final screening with shape features such as morphology and texture. Compared with a pure color model, this approach performs much better, and dynamic background modeling greatly reduces false alarms from static objects. All of these traditional image-processing fire detection methods rely to some extent on color features; in real scenes, however, factors such as camera color cast, overexposure and illumination can substantially reduce their reliability. Subsequent researchers introduced machine learning algorithms such as support vector machines and artificial neural networks into fire identification, extracting image features from suspected regions and then classifying the regions with a machine learning algorithm. These methods achieve good accuracy under interference factors such as illumination and occlusion and outperform most fire detection algorithms, but they cannot effectively exploit massive datasets to improve performance and require researchers to design features by hand, which is cumbersome. Fire detection methods based on deep learning generally adopt an existing mature algorithm, such as FasterRCNN, MaskRCNN, SSD or YOLO, detecting flame and smoke in a single static image directly with a convolutional neural network and locating suspected flame and smoke targets in the image.
For example, the application numbered 201911323715.X first extracts a suspected smoke region and then performs image classification with a convolutional neural network; this method suffers from low detection accuracy.
In addition, traditional image processing methods require complex manual feature extraction and design, depend heavily on hand-selected features, are generally suited only to a single specific scene, perform poorly in real complex scenes, and commonly suffer from low detection rates and severe false alarms.
Disclosure of Invention
The invention aims to improve the accuracy of video fire identification and to reduce the false alarm rate.
The invention solves the technical problems by the following technical means:
The MaskRCNN-based video fire identification method comprises the following steps (a high-level pipeline sketch follows the list):
S01, training a MaskRCNN network to obtain a target MaskRCNN network;
S02, transmitting the images in the video stream to the target MaskRCNN network and extracting the feature information in the images to obtain an image classification dataset;
S03, constructing an EfficientNet network, initializing its weights, and training it with the image classification dataset constructed in step S02 to obtain a target EfficientNet network;
S04, detecting smoke and flame targets in the video with the target MaskRCNN network to obtain the target frame coordinates, probability values and categories of suspected smoke and flame;
S05, frame-difference energy judgment: for the target frames obtained in step S04, filtering out static-object false alarms through frame-difference energy judgment;
S06, final decision by the classification network: classifying the target frames retained after the frame-difference energy judgment with the EfficientNet network.
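For orientation, the following minimal Python sketch shows how steps S01 to S06 fit together at inference time. The helper functions detect_targets, frame_diff_energy and classify_crop are hypothetical placeholders for the operations detailed below, and the thresholds are illustrative.

```python
# Hypothetical end-to-end sketch of the S01-S06 inference pipeline.
# detect_targets, frame_diff_energy and classify_crop are placeholders
# for the MaskRCNN detection, energy filter and EfficientNet decision
# described in the following sections.

def identify_fire(video_stream, maskrcnn, efficientnet,
                  energy_threshold=0.1, num_frames=3):
    """Yield (frame, box, category) alarms from a decoded video stream."""
    recent = []
    for frame in video_stream:                       # S04: per-frame detection
        recent = (recent + [frame])[-num_frames:]
        if len(recent) < num_frames:
            continue                                 # wait for a full frame buffer
        for box, score, label in detect_targets(maskrcnn, frame):
            # S05: frame-difference energy filter rejects static objects
            if frame_diff_energy(recent, box) <= energy_threshold:
                continue
            # S06: final decision by the EfficientNet classifier
            category = classify_crop(efficientnet, frame, box)
            if category in ("smoke", "flame"):
                yield frame, box, category           # raise an alarm
```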
According to the invention, a MaskRCNN deep learning model detects smoke and flame targets in video images: the images in the video stream are transmitted to the trained MaskRCNN model, which fully extracts the feature information in the images through a series of convolution and pooling operations and accurately outputs the coordinate information and confidence scores of the predicted smoke and flame targets. Although the MaskRCNN model offers high accuracy and reliability, it still produces a small number of false alarms; to reduce these further, a frame-difference-based dynamic energy detection method filters the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, a deep neural network makes the final classification decision on the image region of each detected target, further reducing the false alarm rate.
Further, training the MaskRCNN network in step S01 specifically includes: collecting fire image samples, labeling the samples, constructing a training dataset by marking the coordinates and categories of smoke and flame targets in the images, and applying data enhancement to the dataset; then constructing a MaskRCNN network, initializing its weights, and training the MaskRCNN network with the constructed training set to obtain the target MaskRCNN network. Sample labeling is as follows: the upper-left (x1, y1) and lower-right (x2, y2) corner coordinates of each smoke or flame target are marked. The data enhancement operations on the samples in the image dataset include random cropping, random brightness jitter, random saturation jitter, random contrast jitter, random hue jitter and mixup; mixup is computed as follows:
x_mix = λ·x_i + (1 - λ)·x_j
y_mix = λ·y_i + (1 - λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i denotes image i to be fused, x_j denotes image j to be fused, and y_i and y_j denote the labeling information of image i and image j, respectively;
The MaskRCNN network is trained as follows:
Step S011: initializing the parameters of the MaskRCNN network with network parameters pre-trained on the COCO dataset;
Step S012: scaling the image samples in the training dataset to 1024x1024, then extracting the overall feature map of each training sample image with the ResNet101+FPN network in MaskRCNN;
Step S013: inputting the overall feature map into the RPN network to predict regions of interest (ROIs), and selecting positive and negative samples according to the overlap ratio between candidate-region target frames and labeled target frames;
Step S014: performing ROIAlign pooling on the ROI regions of the feature maps corresponding to the positive and negative samples to obtain fixed-size candidate-region feature maps (a ROIAlign sketch follows step S017);
During ROIAlign pooling, the ROI target frame is first mapped onto the feature map, the feature-map ROI region is then obtained by a minimum-bounding-rectangle calculation, the ROI region is divided into an m x m grid, each grid cell samples 4 points on the feature map by bilinear interpolation, and a feature map of size m x m is finally obtained;
Step S015: classifying the ROI feature maps computed by ROIAlign and performing regression on the target frames;
Step S016: computing the MaskRCNN loss function, computing its gradients by stochastic gradient descent, and updating the MaskRCNN network weights;
Step S017: repeating steps S012 to S016 until the preset number of iterations is reached, then stopping training and saving the MaskRCNN network.
Further, step S02 specifically includes: detecting a large amount of video data and image data with the target MaskRCNN network, cropping out suspected smoke and flame targets, and constructing an image classification dataset with 3 categories: smoke, flame and false alarm.
Further, step S03 specifically includes: initializing the parameters of the EfficientNet network with EfficientNet classification network parameters pre-trained on the ImageNet dataset, inputting the image classification dataset for end-to-end training, computing gradients of the classification loss function with the Adam optimization algorithm and updating the EfficientNet network parameters, and stopping training after the set number of rounds to obtain the target EfficientNet network.
Further, step S04 specifically includes: acquiring images from the video stream, scaling each video image to 1024x1024, inputting it into the target MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; NMS non-maximum suppression is then applied to all predicted targets to filter out overlapping invalid target frames.
Further, step S05 specifically includes: for each target frame, cropping the corresponding image region from each of the most recent N adjacent frames, where N > 2; for the N images obtained for each target frame, computing the frame difference between each pair of adjacent images and binarizing it with a threshold T to obtain N-1 binary images; counting the non-zero pixels across all binary images and dividing the count by the target frame area to obtain the final energy value; if the energy value is greater than the set energy threshold the target frame is valid, otherwise it is discarded as a static-object false alarm.
The invention also provides a MaskRCNN-based video fire identification system, which comprises:
The target MaskRCNN network construction module, which trains a MaskRCNN network to obtain a target MaskRCNN network;
the image classification dataset construction module, which transmits images in the video stream to the target MaskRCNN network and extracts the feature information in the images to obtain an image classification dataset;
the target EfficientNet network construction module, which constructs an EfficientNet network, initializes its weights, and trains it with the image classification dataset built by the image classification dataset construction module to obtain a target EfficientNet network;
the smoke and flame target detection module, which detects smoke and flame targets in the video with the target MaskRCNN network to obtain the target frame coordinates, probability values and categories of suspected smoke and flame;
the frame-difference energy judgment module, which performs frame-difference energy judgment on the target frames and filters out static-object false alarms;
and the classification network final decision module, which classifies the target frames retained after the frame-difference energy judgment with the EfficientNet network.
Further, training the MaskRCNN network in the target MaskRCNN network construction module specifically includes: collecting fire image samples, labeling the samples, constructing a training dataset by marking the coordinates and categories of smoke and flame targets in the images, and applying data enhancement to the dataset; then constructing a MaskRCNN network, initializing its weights, and training the MaskRCNN network with the constructed training set to obtain the target MaskRCNN network. Sample labeling is as follows: the upper-left (x1, y1) and lower-right (x2, y2) corner coordinates of each smoke or flame target are marked. The data enhancement operations on the samples in the image dataset include random cropping, random brightness jitter, random saturation jitter, random contrast jitter, random hue jitter and mixup; mixup is computed as follows:
x_mix = λ·x_i + (1 - λ)·x_j
y_mix = λ·y_i + (1 - λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i denotes image i to be fused, x_j denotes image j to be fused, and y_i and y_j denote the labeling information of image i and image j, respectively;
The MaskRCNN network is trained as follows:
Step S011: initializing the parameters of the MaskRCNN network with network parameters pre-trained on the COCO dataset;
Step S012: scaling the image samples in the training dataset to 1024x1024, then extracting the overall feature map of each training sample image with the ResNet101+FPN network in MaskRCNN;
Step S013: inputting the overall feature map into the RPN network to predict regions of interest (ROIs), and selecting positive and negative samples according to the overlap ratio between candidate-region target frames and labeled target frames;
Step S014: performing ROIAlign pooling on the ROI regions of the feature maps corresponding to the positive and negative samples to obtain fixed-size candidate-region feature maps;
During ROIAlign pooling, the ROI target frame is first mapped onto the feature map, the feature-map ROI region is then obtained by a minimum-bounding-rectangle calculation, the ROI region is divided into an m x m grid, each grid cell samples 4 points on the feature map by bilinear interpolation, and a feature map of size m x m is finally obtained;
Step S015: classifying the ROI feature maps computed by ROIAlign and performing regression on the target frames;
Step S016: computing the MaskRCNN loss function, computing its gradients by stochastic gradient descent, and updating the MaskRCNN network weights;
Step S017: repeating steps S012 to S016 until the preset number of iterations is reached, then stopping training and saving the MaskRCNN network.
Further, the smoke and flame target detection module specifically: acquires images from the video stream, scales each video image to 1024x1024, inputs it into the target MaskRCNN network, and predicts the target frame coordinates, probability values and categories of suspected smoke and flame; NMS non-maximum suppression is then applied to all predicted targets to filter out overlapping invalid target frames.
Further, the frame-difference energy judgment module specifically: for each target frame, crops the corresponding image region from each of the most recent N adjacent frames, where N > 2; computes the frame difference between each pair of adjacent images and binarizes it with a threshold T to obtain N-1 binary images; counts the non-zero pixels across all binary images and divides the count by the target frame area to obtain the final energy value; if the energy value is greater than the set energy threshold the target frame is valid, otherwise it is discarded as a static-object false alarm.
The invention has the advantages that:
According to the invention, a MaskRCNN deep learning model detects smoke and flame targets in video images: the images in the video stream are transmitted to the trained MaskRCNN model, which fully extracts the feature information in the images through a series of convolution and pooling operations and accurately outputs the coordinate information and confidence scores of the predicted smoke and flame targets. Although the MaskRCNN model offers high accuracy and reliability, it still produces a small number of false alarms; to reduce these further, a frame-difference-based dynamic energy detection method filters the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, a deep neural network makes the final classification decision on the image region of each detected target, further reducing the false alarm rate.
Based on a deep learning algorithm, a convolutional neural network automatically learns diverse features of flame and smoke in images and uses the learned features to identify and locate flame and smoke, giving a high detection rate, a low false alarm rate and strong robustness. In actual deployment, with software and hardware acceleration, the method essentially meets real-time requirements.
Drawings
FIG. 1 is a diagram of the MaskRCNN network and EfficientNet network training process in an embodiment of the invention;
FIG. 2 is a flow chart of fire detection by the video fire identification method according to an embodiment of the invention;
FIG. 3 is a flow chart of fire detection by the MaskRCNN network in the invention;
FIG. 4 is a flow chart of dynamic detection in an embodiment of the invention;
FIG. 5 is a diagram showing the effect of the mixup method according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment provides a MaskRCNN-based video fire identification method, which comprises the following steps:
Step S1: as shown in FIG. 1, collecting fire image samples, labeling the samples, constructing a training dataset by marking the coordinates and categories of smoke and flame targets in the images, and applying data enhancement to the dataset; the data enhancement operations include random cropping, random brightness jitter, random saturation jitter, random contrast jitter, random hue jitter, mixup and the like. The embodiment uses the mixup image fusion method, which fuses two different images in a certain proportion and effectively increases the diversity of the training dataset, as shown in FIG. 5;
Specifically, the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of each smoke and flame target need to be marked (a short augmentation sketch follows the formula below). The mixup processing formula is as follows:
x_mix = λ·x_i + (1 - λ)·x_j
y_mix = λ·y_i + (1 - λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
where x_i denotes image i to be fused, x_j denotes image j to be fused, and y_i and y_j denote the labeling information (target frame and category information) of image i and image j, respectively.
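A compact sketch of these enhancement operations follows, assuming NumPy and torchvision; the jitter magnitudes are illustrative, and the Beta(1.5, 1.5) mixing coefficient mirrors the formula above.

```python
# Illustrative data-enhancement sketch: colour jitter plus mixup.
import numpy as np
import torchvision.transforms as T

# Random brightness / contrast / saturation / hue jitter (magnitudes illustrative).
color_jitter = T.ColorJitter(brightness=0.3, contrast=0.3,
                             saturation=0.3, hue=0.05)

def mixup(x_i, x_j, y_i, y_j, alpha=1.5):
    """Fuse two images per the formula above, with lambda ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j        # fused image
    # For detection labels (target frames plus categories), a common choice is
    # to keep both images' annotations and weight their losses by lam and 1-lam.
    return x, (y_i, y_j, lam)
```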
Step S2: as shown in FIG. 2, constructing a MaskRCNN network, initializing its weights, and training the MaskRCNN network with the training set constructed in step S1; training stops after 100 rounds;
The MaskRCNN model is trained as follows (a minimal training-loop sketch follows step S2.7):
Step S2.1: initializing the parameters of the MaskRCNN network with model parameters pre-trained on the COCO dataset;
Step S2.2: scaling the image samples in the training dataset to 1024x1024, then extracting the overall feature map of each training sample image with the ResNet101+FPN network in MaskRCNN;
Step S2.3: inputting the overall feature map into the RPN network to predict candidate regions (ROIs), and selecting positive and negative samples according to the overlap ratio between candidate-region target frames and labeled target frames;
Step S2.4: performing ROIAlign pooling on the ROI regions of the feature maps corresponding to the positive and negative samples to obtain fixed-size candidate-region feature maps;
During ROIAlign pooling, the candidate-region target frame is first mapped onto the feature map, the feature-map ROI region is then obtained by a minimum-bounding-rectangle calculation, the ROI region is divided into an m x m grid, each grid cell samples 4 points on the feature map by bilinear interpolation, and a feature map of size m x m is finally obtained;
Step S2.5: classifying the candidate-region feature maps computed by ROIAlign and performing regression on the target frames;
Step S2.6: computing the MaskRCNN loss function, computing its gradients by stochastic gradient descent, and updating the MaskRCNN network weights;
Step S2.7: repeating steps S2.2 to S2.6 until the preset number of iterations is reached, then stopping training and saving the MaskRCNN network model;
Step S3: as shown in FIG. 1, detecting a large amount of video data and image data with the trained MaskRCNN model, cropping out suspected smoke and flame targets, and constructing a three-category image classification dataset containing smoke, flame and false-alarm samples;
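A sketch of this dataset construction step, assuming Pillow and a hypothetical detect() helper that wraps the trained MaskRCNN forward pass and returns (box, score, label) triples:

```python
# Build the three-category classification dataset by cropping detections.
from pathlib import Path
from PIL import Image

def build_classification_dataset(image_paths, detect, out_dir="cls_dataset"):
    """Crop every suspected target into smoke / flame / false_alarm folders."""
    for n, path in enumerate(image_paths):
        image = Image.open(path).convert("RGB")
        for k, (box, score, label) in enumerate(detect(image)):
            x1, y1, x2, y2 = (int(v) for v in box)
            crop = image.crop((x1, y1, x2, y2))
            # Crops are staged under the predicted class; manual review then
            # moves wrong detections into the false_alarm category.
            dest = Path(out_dir) / label
            dest.mkdir(parents=True, exist_ok=True)
            crop.save(dest / f"{n:06d}_{k}.jpg")
```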
Step S4: as shown in FIG. 1, constructing an EfficientNet network, initializing its parameters with a model pre-trained on the ImageNet dataset, and training with the image classification dataset constructed in step S3; training stops after 50 rounds;
Specifically, the parameters of the EfficientNet network are initialized with EfficientNet classification model parameters pre-trained on the ImageNet dataset; the training data images are input for end-to-end training, gradients of the classification loss function are computed by the Adam optimization algorithm, the EfficientNet network parameters are updated, and training stops after 50 rounds;
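A fine-tuning sketch for this step, assuming torchvision's EfficientNet-B0 as a stand-in for the EfficientNet variant used: the ImageNet-pretrained weights initialize the network, the classification head is replaced for the 3 categories (smoke, flame, false alarm), and Adam drives the updates. cls_loader is a hypothetical classification DataLoader and the learning rate is illustrative.

```python
# EfficientNet fine-tuning sketch with the Adam optimizer (step S4).
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.efficientnet_b0(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)  # 3 categories

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
model.train()

for epoch in range(50):                  # stop after 50 rounds of training
    for images, labels in cls_loader:    # hypothetical classification DataLoader
        logits = model(images)
        loss = criterion(logits, labels) # classification loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                 # Adam update of the network parameters
```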
As shown in FIG. 2, after the target MaskRCNN network and the target EfficientNet network are obtained, the video fire identification method proceeds as follows:
Step S5: as shown in FIG. 3, a single frame decoded from the video stream is preprocessed; the overall image features of the preprocessed image are extracted by the ResNet101+FPN network in MaskRCNN; the feature map is input into the RPN network to predict candidate regions; ROIAlign pooling is performed on the candidate-region feature maps to output fixed-size candidate feature maps; and finally the classification-regression network in MaskRCNN classifies the candidate feature maps and regresses the target frames. Because the MaskRCNN predictions contain a large amount of target overlap, NMS must be applied to the MaskRCNN predictions to suppress targets with excessive overlapping area.
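The suppression step can be sketched with torchvision's nms operator, applied per class so that smoke and flame detections do not suppress each other; the IoU threshold is illustrative (torchvision.ops.batched_nms performs the same class-wise suppression in one call):

```python
# Class-wise non-maximum suppression over one frame's raw predictions.
import torch
from torchvision.ops import nms

def suppress_overlaps(boxes, scores, labels, iou_threshold=0.5):
    """boxes: [N, 4] as (x1, y1, x2, y2); scores: [N]; labels: [N] class ids."""
    kept = []
    for cls in labels.unique():
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        keep = nms(boxes[idx], scores[idx], iou_threshold)  # indices to keep
        kept.append(idx[keep])
    return torch.cat(kept)
```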
Step S6, frame-difference energy judgment: after MaskRCNN detection and NMS are completed, the suspected target regions must be checked for motion. As shown in FIG. 4, the target ROI region is first taken from N (N > 2) consecutive adjacent frames, and the frame difference between each pair of adjacent frames is computed; pixels whose absolute difference exceeds the threshold T are set to 1 and all others to 0, yielding binary images; the non-zero pixels across all binary images are counted, and the count is divided by the target frame area to obtain the final energy value. If the energy value is greater than the set threshold, the target frame is judged dynamic and is valid; otherwise it is judged static and discarded.
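A NumPy sketch of this dynamic check, following the description above; the difference threshold T and the shape of the inputs (grayscale frames, box in pixel coordinates) are illustrative assumptions:

```python
# Frame-difference energy of one target frame over the last N frames (step S6).
import numpy as np

def frame_diff_energy(frames, box, T=15):
    """frames: list of N (> 2) grayscale frames; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    crops = [f[y1:y2, x1:x2].astype(np.int16) for f in frames]
    changed = 0
    for a, b in zip(crops[:-1], crops[1:]):        # N-1 frame differences
        diff = np.abs(a - b)
        changed += np.count_nonzero(diff > T)      # binarize at threshold T
    area = max((x2 - x1) * (y2 - y1), 1)
    return changed / area                          # final energy value

# The target frame is kept only if the energy exceeds the preset threshold:
# is_dynamic = frame_diff_energy(last_n_frames, box) > energy_threshold
```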
Step S7, final decision by the classification network: if a suspected target region is judged dynamic, further confirmation is required. The image of the suspected target region is input into the target EfficientNet network, which classifies it and determines whether it is a false alarm, smoke or flame. If smoke or flame is identified, an alarm is raised directly; otherwise no alarm is raised and detection proceeds to the next frame.
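A sketch of this final decision, assuming the fine-tuned EfficientNet from step S4 in eval() mode; the class order, the 128x128 input size used in the system description below, and the preprocessing are illustrative:

```python
# Final classification decision on a dynamic suspected target region (step S7).
import torch
import torchvision.transforms as T

CLASSES = ("smoke", "flame", "false_alarm")        # assumed class order
preprocess = T.Compose([T.ToTensor(), T.Resize((128, 128))])

@torch.no_grad()
def final_decision(model, frame_rgb, box):
    """Return (category, raise_alarm) for the target frame's image region."""
    x1, y1, x2, y2 = box
    crop = frame_rgb[y1:y2, x1:x2]                 # suspected target region
    inputs = preprocess(crop).unsqueeze(0)         # [1, 3, 128, 128]
    probs = torch.softmax(model(inputs), dim=1)[0]
    category = CLASSES[int(probs.argmax())]
    return category, category in ("smoke", "flame")
```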
The embodiment thus provides a deep-learning-based video fire detection method: a MaskRCNN deep learning model detects smoke and flame targets in video images, the images in the video stream are transmitted to the trained MaskRCNN model, and the feature information in the images is fully extracted through a series of convolution and pooling operations so that the coordinate information and confidence scores of the predicted smoke and flame targets are accurately output. Although the MaskRCNN model offers high accuracy and reliability, it still produces a small number of false alarms; to reduce these further, the embodiment uses a frame-difference-based dynamic energy detection method to filter the MaskRCNN detection results, removing most false alarms caused by static objects. Finally, the embodiment uses a deep neural network to make the final classification decision on the image region of each detected target, further reducing the false alarm rate.
Considering that the MaskRCNN algorithm requires mask annotation information in the training data, and that this embodiment does not need the semantic segmentation function, the semantic segmentation branch of the MaskRCNN network is removed in this embodiment.
The embodiment also provides a video fire identification system based on MaskRCNN, which comprises:
The target MaskRCNN network construction module collects fire image samples, labels the samples, constructs a training dataset by marking the coordinates and categories of smoke and flame targets in the images, and applies data enhancement to the dataset;
The upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of each smoke and flame target need to be marked; the data enhancement operations on the samples in the image dataset include random cropping, random brightness jitter, random saturation jitter, random contrast jitter, random hue jitter and mixup. The mixup processing formula is as follows:
x_mix = λ·x_i + (1 - λ)·x_j
y_mix = λ·y_i + (1 - λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
where x_i denotes image i to be fused, x_j denotes image j to be fused, and y_i and y_j denote the labeling information (target frame and category information) of image i and image j, respectively.
The module constructs a MaskRCNN network, initializes its weights, trains the MaskRCNN network with the constructed training set, and stops training after 100 rounds;
The MaskRCNN model is trained as follows:
Step S2.1: initializing the parameters of the MaskRCNN network with model parameters pre-trained on the COCO dataset;
Step S2.2: scaling the image samples in the training dataset to 1024x1024, then extracting the overall feature map of each training sample image with the ResNet101+FPN network in MaskRCNN;
Step S2.3: inputting the overall feature map into the RPN network to predict candidate regions (ROIs), and selecting positive and negative samples according to the overlap ratio between candidate-region target frames and labeled target frames;
Step S2.4: performing ROIAlign pooling on the ROI regions of the feature maps corresponding to the positive and negative samples to obtain fixed-size candidate-region feature maps;
During ROIAlign pooling, the candidate-region target frame is first mapped onto the feature map, the feature-map ROI region is then obtained by a minimum-bounding-rectangle calculation, the ROI region is divided into an m x m grid, each grid cell samples 4 points on the feature map by bilinear interpolation, and a feature map of size m x m is finally obtained;
Step S2.5: classifying the candidate-region feature maps computed by ROIAlign and performing regression on the target frames;
Step S2.6: computing the MaskRCNN loss function, computing its gradients by stochastic gradient descent, and updating the MaskRCNN network weights;
Step S2.7: repeating steps S2.2 to S2.6 until the preset number of iterations is reached, then stopping training and saving the MaskRCNN network model;
The image classification dataset construction module detects a large amount of video data and image data with the trained MaskRCNN model, crops out suspected smoke and flame targets, and constructs a three-category image classification dataset of smoke, flame and false alarm;
The target EfficientNet network construction module constructs an EfficientNet network, initializes its weights, trains it with the image classification dataset built by the image classification dataset construction module, and stops training after 50 rounds;
Specifically, the parameters of the EfficientNet network are initialized with EfficientNet classification model parameters pre-trained on the ImageNet dataset; the training data images are input for end-to-end training, gradients of the classification loss function are computed by the Adam optimization algorithm, the EfficientNet network parameters are updated, and training stops after 50 rounds;
After the target MaskRCNN network and the target EfficientNet network are obtained, the system performs video fire identification as follows:
The smoke and flame target detection module detects smoke and flame targets in the video with the target MaskRCNN network: images are acquired from the video stream, each video image is scaled to 1024x1024 and input into the MaskRCNN network, and the target frame coordinates, probability values and categories of suspected smoke and flame are predicted; NMS non-maximum suppression is applied to all predicted targets, filtering out the large number of overlapping invalid target frames;
The frame-difference energy judgment module, for each target frame obtained in the previous step (assuming the target frame has width w and height h), crops the corresponding image region from each of the most recent N adjacent frames, where N > 2; the frame difference between each pair of adjacent images is computed and binarized with a threshold T to obtain N-1 binary images; the non-zero pixels across all binary images are counted, and the count is divided by the target frame area to obtain the final energy value. If the energy value is greater than the set energy threshold the target frame is valid; otherwise it is regarded as a static-object false alarm and discarded;
The classification network final decision module directly crops each target via the target frame retained after the frame-difference energy judgment, scales it to 128x128, and inputs it into the EfficientNet network for final classification; if the prediction is smoke or flame an alarm is raised, otherwise no alarm is raised.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (2)

1. A MaskRCNN-based video fire identification method, characterized by comprising the following steps:
S01, training a MaskRCNN network to obtain a target MaskRCNN network;
S02, transmitting the images in the video stream to the target MaskRCNN network and extracting the feature information in the images to obtain an image classification dataset;
S03, constructing an EfficientNet network, initializing its weights, and training it with the image classification dataset constructed in step S02 to obtain a target EfficientNet network;
S04, detecting smoke and flame targets in the video with the target MaskRCNN network to obtain the target frame coordinates, probability values and categories of suspected smoke and flame;
S05, frame-difference energy judgment: for the target frames obtained in step S04, filtering out static-object false alarms through frame-difference energy judgment;
S06, final decision by the classification network: classifying the target frames retained after the frame-difference energy judgment with the EfficientNet network;
wherein training the MaskRCNN network in step S01 specifically comprises: collecting fire image samples, labeling the samples, constructing a training dataset by marking the coordinates and categories of smoke and flame targets in the images, and applying data enhancement to the dataset; constructing a MaskRCNN network, initializing its weights, and training the MaskRCNN network with the constructed training set to obtain the target MaskRCNN network; sample labeling is as follows: the upper-left (x1, y1) and lower-right (x2, y2) corner coordinates of each smoke or flame target are marked, and the data enhancement operations on the samples in the image dataset include random cropping, random brightness jitter, random saturation jitter, random contrast jitter, random hue jitter and mixup; mixup is computed as follows:
x_mix = λ·x_i + (1 - λ)·x_j
y_mix = λ·y_i + (1 - λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i denotes image i to be fused, x_j denotes image j to be fused, and y_i and y_j denote the labeling information of image i and image j, respectively;
the MaskRCNN network is trained as follows:
Step S011: initializing the parameters of the MaskRCNN network with network parameters pre-trained on the COCO dataset;
Step S012: scaling the image samples in the training dataset to 1024x1024, then extracting the overall feature map of each training sample image with the ResNet101+FPN network in MaskRCNN;
Step S013: inputting the overall feature map into the RPN network to predict regions of interest (ROIs), and selecting positive and negative samples according to the overlap ratio between candidate-region target frames and labeled target frames;
Step S014: performing ROIAlign pooling on the ROI regions of the feature maps corresponding to the positive and negative samples to obtain fixed-size candidate-region feature maps;
During ROIAlign pooling, the ROI target frame is first mapped onto the feature map, the feature-map ROI region is then obtained by a minimum-bounding-rectangle calculation, the ROI region is divided into an m x m grid, each grid cell samples 4 points on the feature map by bilinear interpolation, and a feature map of size m x m is finally obtained;
Step S015: classifying the ROI feature maps computed by ROIAlign and performing regression on the target frames;
Step S016: computing the MaskRCNN loss function, computing its gradients by stochastic gradient descent, and updating the MaskRCNN network weights;
Step S017: repeating steps S012 to S016 until the preset number of iterations is reached, then stopping training and saving the MaskRCNN network;
step S02 specifically comprises: detecting a large amount of video data and image data with the target MaskRCNN network, cropping out suspected smoke and flame targets, and constructing an image classification dataset with 3 categories: smoke, flame and false alarm;
step S03 specifically comprises: initializing the parameters of the EfficientNet network with EfficientNet classification network parameters pre-trained on the ImageNet dataset, inputting the image classification dataset for end-to-end training, computing gradients of the classification loss function with the Adam optimization algorithm and updating the EfficientNet network parameters, and stopping training after the set number of rounds to obtain the target EfficientNet network;
step S04 specifically comprises: acquiring images from the video stream, scaling each video image to 1024x1024, inputting it into the target MaskRCNN network, and predicting the target frame coordinates, probability values and categories of suspected smoke and flame; performing NMS non-maximum suppression on all predicted targets and filtering out overlapping invalid target frames;
step S05 specifically comprises: for each target frame, cropping the corresponding image region from each of the most recent N adjacent frames, where N > 2; for the N images obtained for each target frame, computing the frame difference between each pair of adjacent images and binarizing it with a threshold T to obtain N-1 binary images; counting the non-zero pixels across all binary images and dividing the count by the target frame area to obtain the final energy value; if the energy value is greater than the set energy threshold, the target frame is valid, otherwise it is discarded as a static-object false alarm.
2. A MaskRCNN-based video fire identification system, characterized by comprising:
the target MaskRCNN network construction module, which trains a MaskRCNN network to obtain a target MaskRCNN network;
the image classification dataset construction module, which transmits images in the video stream to the target MaskRCNN network and extracts the feature information in the images to obtain an image classification dataset;
the target EfficientNet network construction module, which constructs an EfficientNet network, initializes its weights, and trains it with the image classification dataset built by the image classification dataset construction module to obtain a target EfficientNet network;
the smoke and flame target detection module, which detects smoke and flame targets in the video with the target MaskRCNN network to obtain the target frame coordinates, probability values and categories of suspected smoke and flame;
the frame-difference energy judgment module, which performs frame-difference energy judgment on the target frames and filters out static-object false alarms;
and the classification network final decision module, which classifies the target frames retained after the frame-difference energy judgment with the EfficientNet network;
wherein training the MaskRCNN network in the target MaskRCNN network construction module specifically comprises: collecting fire image samples, labeling the samples, constructing a training dataset by marking the coordinates and categories of smoke and flame targets in the images, and applying data enhancement to the dataset; constructing a MaskRCNN network, initializing its weights, and training the MaskRCNN network with the constructed training set to obtain the target MaskRCNN network; sample labeling is as follows: the upper-left (x1, y1) and lower-right (x2, y2) corner coordinates of each smoke or flame target are marked, and the data enhancement operations on the samples in the image dataset include random cropping, random brightness jitter, random saturation jitter, random contrast jitter, random hue jitter and mixup; mixup is computed as follows:
x_mix = λ·x_i + (1 - λ)·x_j
y_mix = λ·y_i + (1 - λ)·y_j
wherein λ ~ Beta(1.5, 1.5);
wherein x_i denotes image i to be fused, x_j denotes image j to be fused, and y_i and y_j denote the labeling information of image i and image j, respectively;
the MaskRCNN network is trained as follows:
Step S011: initializing the parameters of the MaskRCNN network with network parameters pre-trained on the COCO dataset;
Step S012: scaling the image samples in the training dataset to 1024x1024, then extracting the overall feature map of each training sample image with the ResNet101+FPN network in MaskRCNN;
Step S013: inputting the overall feature map into the RPN network to predict regions of interest (ROIs), and selecting positive and negative samples according to the overlap ratio between candidate-region target frames and labeled target frames;
Step S014: performing ROIAlign pooling on the ROI regions of the feature maps corresponding to the positive and negative samples to obtain fixed-size candidate-region feature maps;
During ROIAlign pooling, the ROI target frame is first mapped onto the feature map, the feature-map ROI region is then obtained by a minimum-bounding-rectangle calculation, the ROI region is divided into an m x m grid, each grid cell samples 4 points on the feature map by bilinear interpolation, and a feature map of size m x m is finally obtained;
Step S015: classifying the ROI feature maps computed by ROIAlign and performing regression on the target frames;
Step S016: computing the MaskRCNN loss function, computing its gradients by stochastic gradient descent, and updating the MaskRCNN network weights;
Step S017: repeating steps S012 to S016 until the preset number of iterations is reached, then stopping training and saving the MaskRCNN network;
the smoke and flame target detection module specifically: acquires images from the video stream, scales each video image to 1024x1024, inputs it into the target MaskRCNN network, and predicts the target frame coordinates, probability values and categories of suspected smoke and flame; NMS non-maximum suppression is performed on all predicted targets, filtering out overlapping invalid target frames;
the frame-difference energy judgment module specifically: for each target frame, crops the corresponding image region from each of the most recent N adjacent frames, where N > 2; for the N images obtained for each target frame, computes the frame difference between each pair of adjacent images and binarizes it with a threshold T to obtain N-1 binary images; counts the non-zero pixels across all binary images and divides the count by the target frame area to obtain the final energy value; if the energy value is greater than the set energy threshold, the target frame is valid, otherwise it is discarded as a static-object false alarm.
CN202010931021.0A 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system Active CN112052797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010931021.0A CN112052797B (en) 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010931021.0A CN112052797B (en) 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Publications (2)

Publication Number Publication Date
CN112052797A CN112052797A (en) 2020-12-08
CN112052797B true CN112052797B (en) 2024-07-16

Family

ID=73609865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010931021.0A Active CN112052797B (en) 2020-09-07 2020-09-07 MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system

Country Status (1)

Country Link
CN (1) CN112052797B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633231B (en) * 2020-12-30 2022-08-02 珠海大横琴科技发展有限公司 Fire disaster identification method and device
CN112699801B (en) * 2020-12-30 2022-11-11 上海船舶电子设备研究所(中国船舶重工集团公司第七二六研究所) Fire identification method and system based on video image
CN112686190A (en) * 2021-01-05 2021-04-20 北京林业大学 Forest fire smoke automatic identification method based on self-adaptive target detection
CN112861635B (en) * 2021-01-11 2024-05-14 西北工业大学 Fire disaster and smoke real-time detection method based on deep learning
CN112907885B (en) * 2021-01-12 2022-08-16 中国计量大学 Distributed centralized household image fire alarm system and method based on SCNN
CN112800929B (en) * 2021-01-25 2022-05-31 安徽农业大学 Bamboo shoot quantity and high growth rate online monitoring method based on deep learning
CN113762314B (en) * 2021-02-02 2023-11-03 北京京东振世信息技术有限公司 Firework detection method and device
CN112907886A (en) * 2021-02-07 2021-06-04 中国石油化工股份有限公司 Refinery plant fire identification method based on convolutional neural network
CN112819001B (en) * 2021-03-05 2024-02-23 浙江中烟工业有限责任公司 Complex scene cigarette packet recognition method and device based on deep learning
CN113033553B (en) * 2021-03-22 2023-05-12 深圳市安软科技股份有限公司 Multi-mode fusion fire detection method, device, related equipment and storage medium
CN113192038B (en) * 2021-05-07 2022-08-19 北京科技大学 Method for recognizing and monitoring abnormal smoke and fire in existing flame environment based on deep learning
CN113409923B (en) * 2021-05-25 2022-03-04 济南大学 Error correction method and system in bone marrow image individual cell automatic marking
CN113553985A (en) * 2021-08-02 2021-10-26 中再云图技术有限公司 High-altitude smoke detection and identification method based on artificial intelligence, storage device and server
CN113657233A (en) * 2021-08-10 2021-11-16 东华大学 Unmanned aerial vehicle forest fire smoke detection method based on computer vision
CN113657250A (en) * 2021-08-16 2021-11-16 南京图菱视频科技有限公司 Flame detection method and system based on monitoring video
CN113792684B (en) * 2021-09-17 2024-03-29 中国科学技术大学 Multi-mode visual flame detection method for fire-fighting robot under weak alignment condition
CN113947744A (en) * 2021-10-26 2022-01-18 华能盐城大丰新能源发电有限责任公司 Fire image detection method, system, equipment and storage medium based on video
CN114120171A (en) * 2021-10-28 2022-03-01 华能盐城大丰新能源发电有限责任公司 Fire smoke detection method, device and equipment based on video frame and storage medium
CN114120208A (en) * 2022-01-27 2022-03-01 青岛海尔工业智能研究院有限公司 Flame detection method, device, equipment and storage medium
CN114664047A (en) * 2022-05-26 2022-06-24 长沙海信智能系统研究院有限公司 Expressway fire identification method and device and electronic equipment
CN115170894B (en) * 2022-09-05 2023-07-25 深圳比特微电子科技有限公司 Method and device for detecting smoke and fire
CN115862258B (en) * 2022-11-22 2023-09-22 中国科学院合肥物质科学研究院 Fire monitoring and disposing system, method, equipment and storage medium
CN115861898A (en) * 2022-12-27 2023-03-28 浙江创悦诚科技有限公司 Flame smoke identification method applied to gas field station
CN117011785B (en) * 2023-07-06 2024-04-05 华新水泥股份有限公司 Firework detection method, device and system based on space-time correlation and Gaussian heat map

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232380A (en) * 2019-06-13 2019-09-13 应急管理部天津消防研究所 Fire night scenes restored method based on Mask R-CNN neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936969B2 (en) * 2016-09-26 2021-03-02 Shabaz Basheer Patel Method and system for an end-to-end artificial intelligence workflow
CN109977790A (en) * 2019-03-04 2019-07-05 浙江工业大学 A kind of video smoke detection and recognition methods based on transfer learning
CN109903507A (en) * 2019-03-04 2019-06-18 上海海事大学 A kind of fire disaster intelligent monitor system and method based on deep learning
CN111539265B (en) * 2020-04-02 2024-01-09 申龙电梯股份有限公司 Method for detecting abnormal behavior in elevator car

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232380A (en) * 2019-06-13 2019-09-13 应急管理部天津消防研究所 Fire night scenes restored method based on Mask R-CNN neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhou Yuxin. Research on a lightweight key-frame-based behavior recognition method. Chinese Journal of Scientific Instrument, 2020, 41(7); page 3, right column, fourth paragraph from the bottom, to page 5, right column, third paragraph. *

Also Published As

Publication number Publication date
CN112052797A (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN112052797B (en) MaskRCNN-based video fire disaster identification method and MaskRCNN-based video fire disaster identification system
CN105404847B (en) A kind of residue real-time detection method
CN105844295B (en) A kind of video smoke sophisticated category method based on color model and motion feature
CN112001339A (en) Pedestrian social distance real-time monitoring method based on YOLO v4
CN109460754B (en) A kind of water surface foreign matter detecting method, device, equipment and storage medium
CN110298297B (en) Flame identification method and device
CN110490043A (en) A kind of forest rocket detection method based on region division and feature extraction
CN112598713A (en) Offshore submarine fish detection and tracking statistical method based on deep learning
CN105654508B (en) Monitor video method for tracking moving target and system based on adaptive background segmentation
CN109389185B (en) Video smoke identification method using three-dimensional convolutional neural network
CN111914665B (en) Face shielding detection method, device, equipment and storage medium
CN112364740B (en) Unmanned aerial vehicle room monitoring method and system based on computer vision
CN109242826B (en) Mobile equipment end stick-shaped object root counting method and system based on target detection
CN111062974A (en) Method and system for extracting foreground target by removing ghost
CN114639075B (en) Method and system for identifying falling object of high altitude parabola and computer readable medium
CN112115878B (en) Forest fire smoke root node detection method based on smoke area density
CN106650638A (en) Abandoned object detection method
CN112417955A (en) Patrol video stream processing method and device
CN112435257A (en) Smoke detection method and system based on multispectral imaging
CN115311623A (en) Equipment oil leakage detection method and system based on infrared thermal imaging
CN111950357A (en) Marine water surface garbage rapid identification method based on multi-feature YOLOV3
CN110991245A (en) Real-time smoke detection method based on deep learning and optical flow method
CN113177439B (en) Pedestrian crossing road guardrail detection method
CN113657250A (en) Flame detection method and system based on monitoring video
CN117557937A (en) Monitoring camera image anomaly detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant