CN109829429B - Security sensitive article detection method based on YOLOv3 under monitoring scene - Google Patents
- Publication number: CN109829429B (application CN201910097619.1A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to a method for detecting security-sensitive articles in a surveillance scene based on YOLOv3. The method comprises the following steps. Step S1: collect image data of knives, firearms, luggage, handbags and flames to form a security-sensitive-article image set. Step S2: manually label the downloaded images and generate xml files that meet the requirements of YOLOv3 training. Step S3: apply several kinds of data augmentation to the processed data. Step S4: train the YOLOv3 neural network model. Step S5: preprocess the surveillance-video stream and detect the articles it contains. The method focuses on how computer vision can assist the security field and is innovative in that respect; it offers high accuracy and good timeliness, and is effective for detecting sensitive articles in a surveillance scene.
Description
Technical Field
The invention relates to the fields of target recognition and computer vision, and in particular to a YOLOv3-based method for detecting security-sensitive articles in a surveillance scene.
Background
As security protection grows in importance, surveillance products are becoming increasingly common across industries, and the concept of intelligent security continues to take hold. Traditional surveillance mainly records events mechanically in real time; its purpose is chiefly to deter thieves and to provide evidence after the fact, and it cannot stop a crime in progress. Intelligent security surveillance based on computer-vision analysis of the video stream, by contrast, can analyze suspicious articles and behaviors in the monitored scene in real time, raise an alarm promptly, and nip incidents in the bud.
Intelligent security combines many technologies, of which target detection is the core. In recent years target-detection algorithms have matured and been applied widely. In 2013, Ross Girshick et al. proposed R-CNN, a milestone in applying CNNs to target detection: exploiting the strong feature-extraction and classification performance of CNNs, it recast target detection via a region-proposal method. Several obvious problems remained, however: the image crops corresponding to the many candidate regions must be extracted in advance, occupying large disk space, and the normalization step loses information. Addressing the excessive cost of R-CNN feature extraction, Kaiming He proposed SPP-Net, a substantial improvement over R-CNN that drops the image-normalization step, avoids the information loss, and obtains the features of all candidate regions from a single pass over the original image, greatly improving speed. Problems still remained: the training stages are isolated, so the parameters cannot be trained as a whole, and the convolutional layers and fully connected layers on the two sides of the SPP layer cannot be fine-tuned simultaneously, greatly limiting the effect of deep CNNs. In 2015, the original R-CNN author Ross Girshick went on to propose Fast R-CNN, which borrows the idea of SPP-Net, introduces a simplified ROI pooling layer, and adds a candidate-box mapping function so that the network can back-propagate. This solves the whole-network training problem of SPP-Net, and a multi-task loss layer further integrates the deep network, unifying the training process and improving accuracy.
However, Fast R-CNN still did not solve the time cost of region proposals. In 2016, Faster R-CNN, jointly proposed by Kaiming He and Ross Girshick, incorporated candidate-box extraction into the deep network by adding an RPN branch network, and achieved near-real-time performance through alternating training with shared features. In 2015, Joseph Redmon et al. proposed YOLO, which offers a different idea from the R-CNN family: it converts object detection into a regression problem. Given an input image, YOLO directly regresses the bounding boxes and their class labels at multiple positions of the image; it is a convolutional neural network that predicts the positions and classes of many boxes in one pass, achieves end-to-end target detection and recognition, and is extremely fast. Moreover, because YOLO trains directly on whole images, it distinguishes target regions from background better. In 2016, the YOLO author proposed YOLOv2. Because YOLO predicted boxes from fully connected layers, losing much spatial information and localizing inaccurately, the author removed the fully connected layers from the network and added anchor boxes. YOLOv2 uses k-means clustering over the training bounding boxes to find better box width-height priors automatically. YOLOv2 also adds a batch-normalization operation after every convolutional layer and drops the dropout technique, which accelerates model convergence, gives some protection against overfitting, and markedly improves precision.
YOLOv3 then adopts upsampling and feature-map fusion to combine shallow detail information with deep semantic information and outputs predictions at multiple scales, greatly improving the detection of small objects.
With the development of computer technology, new intelligent security surveillance systems have successively appeared on the market. They monitor crowds with computer-vision techniques and can recognize not only the behavior of particular individuals but also that of the crowd as a whole, so that abnormal individual or crowd behavior in the surveillance video (such as fighting, or dangerous conditions such as crushing and trampling) can be judged and flagged, with an automatic alarm sent to remind the staff, greatly improving the real-time value and practicality of video surveillance. Such systems greatly accelerate incident handling, but they have a shortcoming: in most public-security incidents the suspect carries the crime tool, and the bags and cases carried may also conceal one, so detecting such sensitive articles helps discover danger in time. If a potential offender holding a dangerous article can be spotted promptly in the surveillance feed, security personnel can respond in time, remedying this shortcoming of existing intelligent security systems to some extent. The invention therefore adopts YOLOv3 to detect security-sensitive articles. It first enumerates several categories of articles that deserve close observation in a monitored environment, such as knives, guns, bottles, luggage, handbags and backpacks. Image sets are downloaded from the web and screened, and a degree of image augmentation is applied. The data set is then trained with YOLOv3, suitable parameters are tuned, and the trained model is saved.
In the detection stage, each frame of the surveillance video is cut into four mutually overlapping sub-images that serve as test inputs, and one round of non-maximum suppression is applied to the combined output to obtain the final detection result for the frame. The invention focuses on how computer vision can assist the security field and is innovative in that respect; the proposed method offers high accuracy and good timeliness, and is effective for detecting sensitive articles in a surveillance scene.
Disclosure of Invention
The invention aims to provide a YOLOv3-based method for detecting security-sensitive articles in a surveillance scene. The method focuses on how computer vision can assist the security field, offers high accuracy and good timeliness, and is effective for detecting sensitive articles in a surveillance scene.
To this end, the technical scheme of the invention is as follows. A YOLOv3-based method for detecting security-sensitive articles in a surveillance scene comprises the following steps:
Step S1: collect image data of security-sensitive articles to form a security-sensitive-article image set;
Step S2: label the images in the image set and generate xml files that meet the requirements of YOLOv3 training;
Step S3: apply data augmentation to the processed data set;
Step S4: train the YOLOv3 neural network model;
Step S5: preprocess the surveillance-video stream and detect security-sensitive articles in it with the trained YOLOv3 neural network model.
In an embodiment of the present invention, data collection in step S1 proceeds as follows:
Step S11: analyze the kinds of articles that must be observed in the surveillance scene and determine the security-sensitive articles;
Step S12: download pictures of the security-sensitive articles with a web crawler;
Step S13: screen the downloaded pictures and discard incorrect ones.
In an embodiment of the invention, the image data of security-sensitive articles comprises image data of knives, firearms, luggage, handbags and flames.
In an embodiment of the present invention, image annotation in step S2 proceeds as follows:
Step S21: download and configure labelImg;
Step S22: draw a rectangular box around each object with labelImg, and store the position and class of each box in an xml file.
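labelImg saves each annotation in the Pascal VOC xml layout, which is what darknet conversion scripts consume. As a minimal sketch of what step S22 produces, the following parser reads one such annotation back into (class, box) tuples; the sample annotation and file name are illustrative, not taken from the patent's data set.

```python
import xml.etree.ElementTree as ET

def parse_voc_xml(xml_string):
    """Parse a labelImg (Pascal VOC) annotation and return a list of
    (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_string)
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        coords = tuple(int(bb.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name,) + coords)
    return boxes

# Hypothetical annotation of one knife, in the layout labelImg writes.
sample = """<annotation>
  <filename>knife_001.jpg</filename>
  <object><name>knife</name>
    <bndbox><xmin>48</xmin><ymin>60</ymin><xmax>210</xmax><ymax>140</ymax></bndbox>
  </object>
</annotation>"""

print(parse_voc_xml(sample))  # -> [('knife', 48, 60, 210, 140)]
```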
In an embodiment of the present invention, data augmentation in step S3 proceeds as follows:
Step S31: apply contrast stretching to every picture in the data set obtained in step S2, keep the annotation in the corresponding xml unchanged, and add the stretched pictures to the new data set;
Step S32: apply a multi-scale transformation to every picture in the data set obtained in step S2, resizing each to 1/2 and 2 times its initial size, transform the coordinates in the xml annotation accordingly, and add the results to the new data set;
Step S33: crop every picture in the data set obtained in step S2, removing 1/10 of each edge and keeping the center, transform the coordinates in the xml annotation accordingly, and add the results to the new data set;
Step S34: add random noise to every picture in the data set obtained in step S2, keep the annotation in the corresponding xml unchanged, and add the noisy pictures to the new data set.
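Steps S32 and S33 require the xml coordinates to be transformed together with the picture. Below is a minimal sketch of the two coordinate transforms on (xmin, ymin, xmax, ymax) boxes: resizing by a factor, and cropping 1/10 off every edge. The helper names are my own, not from the patent.

```python
def scale_box(box, factor):
    """Step S32: rescale a (xmin, ymin, xmax, ymax) box for an image
    resized by `factor` (here 0.5 or 2.0)."""
    return tuple(round(c * factor) for c in box)

def crop_box(box, img_w, img_h, margin=0.1):
    """Step S33: shift a box into the frame of an image whose edges were
    each cropped by `margin` of the original size, clipping to the new
    canvas. Returns None if the box lies entirely in the removed margin."""
    dx, dy = int(img_w * margin), int(img_h * margin)
    new_w, new_h = img_w - 2 * dx, img_h - 2 * dy
    xmin = max(box[0] - dx, 0)
    ymin = max(box[1] - dy, 0)
    xmax = min(box[2] - dx, new_w)
    ymax = min(box[3] - dy, new_h)
    if xmax <= xmin or ymax <= ymin:
        return None
    return (xmin, ymin, xmax, ymax)

print(scale_box((48, 60, 210, 140), 0.5))      # -> (24, 30, 105, 70)
print(crop_box((48, 60, 210, 140), 640, 480))  # -> (0, 12, 146, 92)
```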
In an embodiment of the present invention, the YOLOv3 neural network model in step S4 is trained through the following steps:
Step S41: train with the deep-learning framework darknet, with the following initial parameters:
initial learning rate (learning_rate): 0.001;
polynomial learning-rate decay (policy = poly), power: 4;
weight decay (decay): 0.0005;
momentum: 0.9;
Step S42: generate the anchor boxes required by YOLOv3 through k-means clustering, and predict bounding boxes with these anchor boxes;
Step S43: for each bounding box, predict an objectness score through logistic regression; each bounding box carries the five basic parameters (x, y, w, h, confidence), where (x, y) is the center of the box, (w, h) its width and height, and confidence its confidence score;
Step S44: output feature maps at three different scales by means of downsampling and upsampling;
Step S45: change tensor sizes in forward propagation by changing the stride of the convolution kernels;
Step S46: let the loss function be

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]\\
&+\lambda_{coord}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]\\
&+\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2\\
&+\lambda_{noobj}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2\\
&+\sum_{i=0}^{S^2-1}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$

where the first row is the sum-squared error of the position prediction, the second row the sum-squared error of the square-rooted width and height, the third and fourth rows use SSE as the loss for the confidence, and the fifth row uses SSE as the loss for the class probabilities; $S^2$ is the number of grid cells (here $7 \times 7 = 49$), $B$ the number of boxes predicted per cell (3 in this formula), $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$; $\mathbb{1}_{i}^{obj}$ indicates whether an object appears in grid cell $i$, and $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th bounding box in cell $i$ is responsible for that object;
Step S47: compute the updated weights and biases of the convolutional network by stochastic gradient descent;
Step S48: after 10000 training iterations, lower the learning rate to $10^{-4}$ and continue training;
Step S49: stop training after 50000 iterations and save the trained model.
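Step S42's anchor generation can be sketched in plain Python. As in YOLOv2/v3, the clustering distance is 1 − IoU computed from widths and heights alone (boxes assumed concentric); the toy box list below is illustrative, not the patent's data.

```python
import random

def iou_wh(a, b):
    """IoU of two boxes given only (w, h), assuming shared centres."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """k-means over (w, h) pairs with distance 1 - IoU; returns the
    k centroid (w, h) anchors, sorted."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # assign each box to the centroid it overlaps most
            idx = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[idx].append(b)
        new = [(sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
               if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return sorted(centroids)

boxes = [(10, 10), (12, 11), (11, 12), (100, 90), (95, 100), (105, 95)]
print(kmeans_anchors(boxes, 2))  # -> [(11.0, 11.0), (100.0, 95.0)]
```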
In an embodiment of the present invention, security-sensitive articles are detected in step S5 through the following steps:
Step S51: extract each frame of the surveillance video as an input image;
Step S52: cut the input image into 4 mutually overlapping sub-images;
Step S53: the YOLOv3 neural network model resizes each input sub-image to 448 × 448 and divides it evenly into 7 × 7 = 49 grid cells, each of size 64 × 64;
Step S54: for each grid cell, predict 2 bounding boxes, each carrying 5 predicted quantities, together with 5 class probabilities;
Step S55: from the 7 × 7 × 2 = 98 target windows predicted in step S54, discard windows whose confidence is below a set threshold, then remove redundant windows by non-maximum suppression to obtain the article positions within each of the 4 sub-images;
Step S56: transform the coordinates of the articles detected in the 4 sub-images into frame coordinates, apply non-maximum suppression again, and remove the duplicate detections of the same article in different sub-images.
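The greedy non-maximum suppression used in steps S55 and S56 can be sketched as follows. Boxes are (score, xmin, ymin, xmax, ymax) tuples; the 0.45 overlap threshold is an assumed value, since the patent does not state one.

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_thresh=0.45):
    """Greedy NMS over (score, xmin, ymin, xmax, ymax) tuples: keep the
    highest-scoring box, drop every remaining box that overlaps it by more
    than iou_thresh, and repeat."""
    dets = sorted(detections, reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if iou(best[1:], d[1:]) < iou_thresh]
    return keep

dets = [(0.9, 0, 0, 100, 100), (0.8, 10, 10, 110, 110), (0.7, 200, 200, 300, 300)]
print([d[0] for d in nms(dets)])  # -> [0.9, 0.7]; the 0.8 box overlaps the 0.9 box
```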
Compared with the prior art, the invention has the following beneficial effects: the proposed method focuses on how computer vision can assist the security field and is innovative in that respect; it offers high accuracy and good timeliness, and is effective for detecting sensitive articles in a surveillance scene.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The embodiment of the invention provides a YOLOv3-based method for detecting security-sensitive articles in a surveillance scene, shown in FIG. 1. It addresses the fact that most target-detection applications on the market are still based on the traditional R-CNN family of algorithms, which are slow and struggle to keep up with a surveillance scene. The invention adopts YOLOv3 to detect security-sensitive articles. It first enumerates several categories of articles that deserve close observation in a monitored environment, namely knives, guns, luggage, handbags and flames; flames are detected at the same time so that dangerous situations can be contained. Image sets are downloaded from the web and screened, and a degree of image augmentation is applied. The data set is then trained with YOLOv3, suitable parameters are tuned, and the trained model is saved. In the detection stage, each frame of the surveillance video is cut into four mutually overlapping sub-images that serve as test inputs, and one round of non-maximum suppression is applied to the combined output to obtain the final detection result for each frame. The invention focuses on how computer vision can assist the security field and is innovative in that respect; the proposed method offers high accuracy and good timeliness, and is effective for detecting sensitive articles in a surveillance scene. The specific steps are as follows:
Step S1: collect image data of knives, firearms, luggage, handbags and flames to form a security-sensitive-article image set.
Step S2: manually label the downloaded images and generate xml files that meet the requirements of YOLOv3 training.
Step S3: apply several kinds of data augmentation to the processed data.
Step S4: train the YOLOv3 neural network model.
Step S5: preprocess the surveillance-video stream and detect the articles it contains.
In this embodiment, data augmentation in step S3 proceeds as follows:
Step S31: apply contrast stretching to every picture in the data set obtained in step S2, keep the annotation in the corresponding xml unchanged, and add the stretched pictures to the new data set.
Step S32: apply a multi-scale transformation to every picture in the data set obtained in step S2, resizing each to 1/2 and 2 times its initial size, transform the coordinates in the xml annotation accordingly, and add the results to the new data set.
Step S33: crop every picture in the data set obtained in step S2, removing 1/10 of each edge and keeping the center, transform the coordinates in the xml annotation accordingly, and add the results to the new data set.
Step S34: add random noise to every picture in the data set obtained in step S2, keep the annotation in the corresponding xml unchanged, and add the noisy pictures to the new data set.
In this embodiment, the neural network in step S4 is trained through the following steps:
Step S41: train with the deep-learning framework darknet, with the following initial parameters:
initial learning rate (learning_rate): 0.001;
polynomial learning-rate decay (policy = poly), power: 4;
weight decay (decay): 0.0005;
momentum: 0.9.
Step S42: generate the anchor boxes required by YOLOv3 through k-means clustering, and predict bounding boxes with these anchor boxes.
Step S43: for each bounding box, predict an objectness score through logistic regression; each bounding box carries the five basic parameters (x, y, w, h, confidence), where (x, y) is the center of the box, (w, h) its width and height, and confidence its confidence score.
Step S44: output feature maps at three different scales by means of downsampling and upsampling.
Step S45: change tensor sizes in forward propagation by changing the stride of the convolution kernels.
Step S46: let the loss function be

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]\\
&+\lambda_{coord}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]\\
&+\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2\\
&+\lambda_{noobj}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2\\
&+\sum_{i=0}^{S^2-1}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$

where the first row is the sum-squared error of the position prediction, the second row the sum-squared error of the square-rooted width and height, the third and fourth rows use SSE as the loss for the confidence, and the fifth row uses SSE as the loss for the class probabilities; $S^2$ is the number of grid cells (here $7 \times 7 = 49$), $B$ the number of boxes predicted per cell (3 in this formula), $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$; $\mathbb{1}_{i}^{obj}$ indicates whether an object appears in grid cell $i$, and $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th bounding box in cell $i$ is responsible for that object.
Step S47: compute the updated weights and biases of the convolutional network by stochastic gradient descent.
Step S48: after 10000 training iterations, lower the learning rate to $10^{-4}$ and continue training.
Step S49: stop training after 50000 iterations and save the trained model.
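The learning-rate schedule and the stochastic-gradient-descent update described above can be sketched numerically. The poly decay and the momentum/weight-decay update below follow darknet's usual conventions, but the patent does not quote darknet's internals, so treat this as an illustrative approximation:

```python
def poly_lr(base_lr, step, max_steps, power=4):
    """Polynomial learning-rate decay with the configured power of 4."""
    return base_lr * (1.0 - step / max_steps) ** power

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay, applied
    element-wise to flat lists of parameters."""
    new_v = [momentum * v - lr * (g + weight_decay * x)
             for v, g, x in zip(velocity, grad, w)]
    new_w = [x + v for x, v in zip(w, new_v)]
    return new_w, new_v

print(poly_lr(0.001, 0, 50000))      # -> 0.001 at the start of training
print(poly_lr(0.001, 25000, 50000))  # -> 6.25e-05, i.e. 0.001 * 0.5**4
```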
In this embodiment, articles are detected in step S5 through the following steps:
Step S51: extract each frame of the surveillance video as an input image.
Step S52: cut the input image into 4 mutually overlapping sub-images.
Step S53: the network resizes each input sub-image to 448 × 448 and divides it evenly into 7 × 7 = 49 grid cells, each of size 64 × 64.
Step S54: for each grid cell, predict 2 bounding boxes, each carrying 5 predicted quantities, together with 5 class probabilities.
Step S55: from the 7 × 7 × 2 = 98 target windows predicted in the previous step, discard windows whose confidence is below a threshold, then remove redundant windows by non-maximum suppression to obtain the article positions within each of the 4 sub-images.
Step S56: transform the coordinates of the articles detected in the 4 sub-images into frame coordinates, apply non-maximum suppression again, and remove the duplicate detections of the same article in different sub-images, further improving accuracy.
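The last two steps, cutting a frame into four overlapping tiles and mapping each tile's detections back into frame coordinates before the final non-maximum suppression, can be sketched as follows. The 20% overlap ratio is an assumption; the patent does not state how large the overlap is.

```python
def tile_offsets(frame_w, frame_h, overlap=0.2):
    """Top-left corners of 4 mutually overlapping tiles covering the frame.
    Each tile spans (1 + overlap)/2 of each frame dimension; the overlap
    ratio is an assumed parameter, not specified in the patent."""
    tw = int(frame_w * (1 + overlap) / 2)
    th = int(frame_h * (1 + overlap) / 2)
    xs = [0, frame_w - tw]
    ys = [0, frame_h - th]
    return [(x, y) for y in ys for x in xs], (tw, th)

def to_frame_coords(box, offset):
    """Translate a (xmin, ymin, xmax, ymax) box from tile coordinates to
    frame coordinates by adding the tile's top-left offset."""
    ox, oy = offset
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)

offsets, size = tile_offsets(640, 480)
print(offsets, size)  # -> [(0, 0), (256, 0), (0, 192), (256, 192)] (384, 288)
print(to_frame_coords((10, 10, 50, 50), offsets[3]))  # -> (266, 202, 306, 242)
```

After translation, one global NMS pass (as in step S56) merges the duplicate boxes that fall inside more than one tile.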
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the invention that produce equivalent effects, without exceeding the scope of that scheme, belong to the protection scope of the invention.
Claims (5)
1. A YOLOv3-based method for detecting security-sensitive articles in a surveillance scene, characterized by comprising the following steps:
Step S1: collecting image data of security-sensitive articles to form a security-sensitive-article image set;
Step S2: labeling the images in the image set and generating xml files that meet the requirements of YOLOv3 training;
Step S3: applying data augmentation to the processed data set;
Step S4: training a YOLOv3 neural network model;
Step S5: preprocessing the surveillance-video stream and detecting security-sensitive articles in it through the YOLOv3 neural network model;
wherein in step S4, the YOLOv3 neural network model is trained through the following steps:
Step S41: training with the deep-learning framework darknet, with the following initial parameters: initial learning rate (learning_rate): 0.001; polynomial learning-rate decay (policy = poly), power: 4; weight decay (decay): 0.0005; momentum: 0.9;
Step S42: generating the anchor boxes required by YOLOv3 through k-means clustering, and predicting bounding boxes with these anchor boxes;
Step S43: predicting, for each bounding box, an objectness score through logistic regression, each bounding box carrying the five basic parameters (x, y, w, h, confidence), where (x, y) is the center of the box, (w, h) its width and height, and confidence its confidence score;
Step S44: outputting feature maps at three different scales by means of downsampling and upsampling;
Step S45: changing tensor sizes in forward propagation by changing the stride of the convolution kernels;
Step S46: letting the loss function be

$$
\begin{aligned}
loss ={}& \lambda_{coord}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]\\
&+\lambda_{coord}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]\\
&+\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2\\
&+\lambda_{noobj}\sum_{i=0}^{S^2-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2\\
&+\sum_{i=0}^{S^2-1}\mathbb{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$

where the first row is the sum-squared error of the position prediction, the second row the sum-squared error of the square-rooted width and height, the third and fourth rows use SSE as the loss for the confidence, and the fifth row uses SSE as the loss for the class probabilities; $S^2$ is the number of grid cells (here $7 \times 7 = 49$), $B$ the number of boxes predicted per cell (3 in this formula), $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$; $\mathbb{1}_{i}^{obj}$ indicates whether an object appears in grid cell $i$, and $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th bounding box in cell $i$ is responsible for that object;
Step S47: computing the updated weights and biases of the convolutional network by stochastic gradient descent;
Step S48: after 10000 training iterations, lowering the learning rate to $10^{-4}$ and continuing training;
Step S49: stopping training after 50000 iterations and saving the trained model;
and wherein in step S5, the security-sensitive articles are detected through the following steps:
Step S51: extracting each frame of the surveillance video as an input image;
Step S52: cutting the input image into 4 mutually overlapping sub-images;
Step S53: the YOLOv3 neural network model resizing each input sub-image to 448 × 448 and dividing it evenly into 7 × 7 = 49 grid cells, each of size 64 × 64;
Step S54: predicting, for each grid cell, 2 bounding boxes, each carrying 5 predicted quantities, together with 5 class probabilities;
Step S55: from the 7 × 7 × 2 = 98 target windows predicted in step S54, discarding windows whose confidence is below a set threshold, then removing redundant windows by non-maximum suppression to obtain the article positions within each of the 4 sub-images;
Step S56: transforming the coordinates of the articles detected in the 4 sub-images into frame coordinates, applying non-maximum suppression again, and removing the duplicate detections of the same article in different sub-images.
2. The method for detecting security-sensitive articles in a surveillance scene based on YOLOv3 according to claim 1, characterized in that data collection in step S1 proceeds as follows:
Step S11: analyzing the kinds of articles that must be observed in the surveillance scene and determining the security-sensitive articles;
Step S12: downloading pictures of the security-sensitive articles with a web crawler;
Step S13: screening the downloaded pictures and discarding incorrect ones.
3. The method for detecting security-sensitive articles in a surveillance scene based on YOLOv3 according to claim 1 or 2, characterized in that the image data of security-sensitive articles includes image data of knives, firearms, luggage, handbags and flames.
4. The method for detecting security-sensitive articles in a surveillance scene based on YOLOv3 according to claim 1, characterized in that image labeling in step S2 proceeds as follows:
Step S21: downloading and configuring labelImg;
Step S22: drawing a rectangular box around each object with labelImg, and storing the position and class of each box in an xml file.
5. The method for detecting security-sensitive articles in a surveillance scene based on YOLOv3 according to claim 1, characterized in that data augmentation in step S3 proceeds as follows:
Step S31: applying contrast stretching to every picture in the data set obtained in step S2, keeping the annotation in the corresponding xml unchanged, and adding the stretched pictures to the new data set;
Step S32: applying a multi-scale transformation to every picture in the data set obtained in step S2, resizing each to 1/2 and 2 times its initial size, transforming the coordinates in the xml annotation accordingly, and adding the results to the new data set;
Step S33: cropping every picture in the data set obtained in step S2, removing 1/10 of each edge and keeping the center, transforming the coordinates in the xml annotation accordingly, and adding the results to the new data set;
Step S34: adding random noise to every picture in the data set obtained in step S2, keeping the annotation in the corresponding xml unchanged, and adding the noisy pictures to the new data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910097619.1A CN109829429B (en) | 2019-01-31 | 2019-01-31 | Security sensitive article detection method based on YOLOv3 under monitoring scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109829429A (en) | 2019-05-31 |
CN109829429B (en) | 2022-08-09 |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110363100A (en) * | 2019-06-24 | 2019-10-22 | 昆明理工大学 | A kind of video object detection method based on YOLOv3 |
CN110399905B (en) * | 2019-07-03 | 2023-03-24 | 常州大学 | Method for detecting and describing wearing condition of safety helmet in construction scene |
CN110390673B (en) * | 2019-07-22 | 2021-04-27 | 福州大学 | Cigarette automatic detection method based on deep learning in monitoring scene |
CN110378422A (en) * | 2019-07-22 | 2019-10-25 | 福州大学 | A kind of weapon recognition methods based on Faster R-CNN |
CN110472597A (en) * | 2019-07-31 | 2019-11-19 | 中铁二院工程集团有限责任公司 | Rock image rate of decay detection method and system based on deep learning |
CN110458114B (en) * | 2019-08-13 | 2022-02-01 | 杜波 | Method and device for determining number of people and storage medium |
CN111008994A (en) * | 2019-11-14 | 2020-04-14 | 山东万腾电子科技有限公司 | Moving target real-time detection and tracking system and method based on MPSoC |
CN110956102A (en) * | 2019-11-19 | 2020-04-03 | 上海眼控科技股份有限公司 | Bank counter monitoring method and device, computer equipment and storage medium |
CN111160140B (en) * | 2019-12-13 | 2023-04-18 | 浙江大华技术股份有限公司 | Image detection method and device |
CN111241959B (en) * | 2020-01-06 | 2024-06-04 | 重庆大学 | Method for detecting personnel not wearing safety helmet through construction site video stream |
CN111325279B (en) * | 2020-02-26 | 2022-06-10 | 福州大学 | Pedestrian and personal sensitive article tracking method fusing visual relationship |
CN111429418A (en) * | 2020-03-19 | 2020-07-17 | 天津理工大学 | Industrial part detection method based on YOLOv3 neural network |
CN111738056B (en) * | 2020-04-27 | 2023-11-03 | 浙江万里学院 | Heavy truck blind area target detection method based on improved YOLO v3 |
CN111723695A (en) * | 2020-06-05 | 2020-09-29 | 广东海洋大学 | Improved YOLOv3-based driver key sub-area identification and positioning method |
CN111709345A (en) * | 2020-06-12 | 2020-09-25 | 重庆电政信息科技有限公司 | Method for detecting abnormal articles in fixed ring in real time |
CN111709381A (en) * | 2020-06-19 | 2020-09-25 | 桂林电子科技大学 | Road environment target detection method based on YOLOv3-SPP |
CN111784657A (en) * | 2020-06-29 | 2020-10-16 | 福建中航赛凡信息科技有限公司 | Digital image-based system and method for automatically identifying cement pavement diseases |
CN111985334B (en) * | 2020-07-20 | 2023-09-26 | 华南理工大学 | Gun detection method, system, device and storage medium |
CN111986156A (en) * | 2020-07-20 | 2020-11-24 | 华南理工大学 | Axe-shaped sharp tool detection method, system, device and storage medium |
CN111860323A (en) * | 2020-07-20 | 2020-10-30 | 北京华正明天信息技术股份有限公司 | Method for identifying initial fire in monitoring picture based on yolov3 algorithm |
CN112241693A (en) * | 2020-09-25 | 2021-01-19 | 上海荷福人工智能科技(集团)有限公司 | Illegal welding fire image identification method based on YOLOv3 |
CN112347889B (en) * | 2020-10-29 | 2024-04-05 | 广东电网有限责任公司电力科学研究院 | Substation operation behavior identification method and device |
CN112380985A (en) * | 2020-11-13 | 2021-02-19 | 广东电力信息科技有限公司 | Real-time detection method for intrusion foreign matters in transformer substation |
CN112862150A (en) * | 2020-12-30 | 2021-05-28 | 广州智能科技发展有限公司 | Forest fire early warning method based on image and video multi-model |
CN113888825A (en) * | 2021-09-16 | 2022-01-04 | 无锡湖山智能科技有限公司 | Monitoring system and method for driving safety |
CN115908272A (en) * | 2022-10-27 | 2023-04-04 | 华能伊敏煤电有限责任公司 | Method and system for automatically detecting belt tearing state based on vision technology |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9070020B2 (en) * | 2012-08-21 | 2015-06-30 | International Business Machines Corporation | Determination of train presence and motion state in railway environments |
CN109002752A (en) * | 2018-01-08 | 2018-12-14 | 北京图示科技发展有限公司 | A kind of complicated common scene rapid pedestrian detection method based on deep learning |
CN108509860A (en) * | 2018-03-09 | 2018-09-07 | 西安电子科技大学 | Hoh Xil Tibetan antelope detection method based on convolutional neural networks |
CN108537215B (en) * | 2018-03-23 | 2020-02-21 | 清华大学 | Flame detection method based on image target detection |
CN109033934A (en) * | 2018-05-25 | 2018-12-18 | 江南大学 | A kind of floating on water surface object detecting method based on YOLOv2 network |
- 2019-01-31: CN application CN201910097619.1A filed; granted as CN109829429B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109829429A (en) | 2019-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109829429B (en) | Security sensitive article detection method based on YOLOv3 under monitoring scene | |
Zhang et al. | Multi-class object detection using faster R-CNN and estimation of shaking locations for automated shake-and-catch apple harvesting | |
Jain et al. | Weapon detection using artificial intelligence and deep learning for security applications | |
US8705861B2 (en) | Context processor for video analysis system | |
Chen et al. | Weed detection in sesame fields using a YOLO model with an enhanced attention mechanism and feature fusion | |
Wang et al. | YOLOv3‐Litchi Detection Method of Densely Distributed Litchi in Large Vision Scenes | |
WO2021146700A1 (en) | Systems for multiclass object detection and alerting and methods therefor | |
CN111931582A (en) | Image processing-based highway traffic incident detection method | |
CN110852179B (en) | Suspicious personnel invasion detection method based on video monitoring platform | |
CN112699769A (en) | Detection method and system for left-over articles in security monitoring | |
Yandouzi et al. | Investigation of combining deep learning object recognition with drones for forest fire detection and monitoring | |
Martínez-Mascorro et al. | Suspicious behavior detection on shoplifting cases for crime prevention by using 3D convolutional neural networks | |
Vijayan et al. | A fully residual convolutional neural network for background subtraction | |
Joseph et al. | Systematic advancement of YOLO object detector for real-time detection of objects | |
Ansari et al. | An expert video surveillance system to identify and mitigate shoplifting in megastores | |
Roslan et al. | Deep learning for tree crown detection in tropical forest | |
De Venâncio et al. | Fire detection based on a two-dimensional convolutional neural network and temporal analysis | |
Altowairqi et al. | A Review of the Recent Progress on Crowd Anomaly Detection | |
Lin et al. | Robust techniques for abandoned and removed object detection based on Markov random field | |
CN111275733A (en) | Method for realizing rapid tracking processing of multiple ships based on deep learning target detection technology | |
Thomas et al. | Animal intrusion detection using deep learning for agricultural fields | |
Brahmaiah et al. | Artificial intelligence and deep learning for weapon identification in security systems | |
Aljaafreh et al. | A Real-Time Olive Fruit Detection for Harvesting Robot Based on YOLO Algorithms | |
Collazos et al. | Abandoned object detection on controlled scenes using kinect | |
Terdal et al. | YOLO-Based Video Processing for CCTV Surveillance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||