CN112131983A

CN112131983A - Helmet wearing detection method based on improved YOLOv3 network

Info

Publication number: CN112131983A
Application number: CN202010953092.0A
Authority: CN
Inventors: 董明刚; 魏雪影; 敬超
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2020-12-25

Abstract

The invention discloses a helmet wearing detection method based on an improved YOLOv3 network. A safety helmet wearing detection data set (SHWDS) is built by capturing frames from construction site video surveillance data and by screening samples from a public data set. Based on the YOLOv3 network and according to the size characteristics of the data set samples, the 8x and 16x downsampled feature information are each fused with deep semantic information to establish feature fusion target detection layers at these two scales. To avoid vanishing gradients and enhance feature reuse, 2 DBL units and 2 ResNet units are used before the 8x downsampled feature map. Meanwhile, to increase the stability and convergence rate of target box regression, a DIoU loss function replaces the original loss function, yielding the model network used for detection. The data are clustered using the K-means algorithm, the prior box sizes obtained by clustering are applied to the network layer of each scale, training and detection are performed on the self-made SHWDS, and the best trained model is saved for use on real construction sites.

Description

Helmet wearing detection method based on improved YOLOv3 network

Technical Field

The invention belongs to the technical field of feature extraction and target detection, and provides a helmet wearing detection method based on an improved YOLOv3 network.

Background

With the continuous development of computer vision technologies, target detection has been widely applied in industry, and helmet wearing detection is one of its important applications. Traditional safety helmet wearing detection algorithms rely on manually designed features, place high demands on the environment, and suffer from low accuracy and poor model generalization in actual detection. Deep convolutional neural networks are more robust because they can learn target features and extract key information on their own. In recent years, with the development of deep learning, various deep-learning-based target detection algorithms have been proposed, the most representative being two-stage models and one-stage models. Algorithms such as Faster R-CNN and Mask R-CNN divide target detection into two stages: first, a region proposal network (RPN) extracts candidate boxes, and then a detection network predicts the position and category of each candidate target box. Algorithms such as YOLOv3 and YOLOv4 treat object detection as a regression problem and predict spatially separated bounding boxes with associated class probabilities. Although two-stage models have higher detection precision, their detection speed is lower, whereas one-stage models are faster, so a one-stage model is selected as the basis for the improvement. The detection performance of YOLOv4 is higher than that of YOLOv3, but its hardware requirements are high, so this method detects safety helmet wearing on the basis of YOLOv3.

YOLOv3 adopts Darknet-53 as the backbone network for feature extraction, borrows the idea of ResNet, and adds residual modules to the network, which helps to solve the vanishing gradient problem of deep networks. It also adopts multi-scale detection: feature maps at 3 different scales are used for object detection, so finer-grained features can be detected. For class prediction, YOLOv3 uses a logistic function in place of the softmax function, which supports the detection of multi-label objects. However, because small targets account for about 95% of the SHWDS data set, the YOLOv3 algorithm still suffers from a high missed detection rate and a low recall rate on the SHWDS data set.

BN (batch normalization):

BN keeps the distribution of the inputs to each layer of a deep neural network stable during training by re-normalizing the activation values of the previous layer on each batch, so that the mean of the output data is close to 0 and the standard deviation is close to 1. This addresses the problem that the input distribution of each layer keeps changing, which makes training more difficult. Batch normalization also helps to regularize the model, so that overfitting can still be avoided after dropout is abandoned; it helps to avoid vanishing and exploding gradients, accelerates convergence of the network, and improves its generalization capability.
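
The per-batch normalization described above can be illustrated with a minimal sketch. It covers only the training-time computation for a single feature; the scale `gamma`, shift `beta` and the small constant `eps` are the usual batch normalization parameters and are assumptions here, since the source does not list them.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    gamma, beta and eps are assumed defaults, not values from the source.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Activations of one channel over a hypothetical batch of four samples.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
```

With the default `gamma` and `beta`, the normalized values have a mean close to 0 and a standard deviation close to 1, which is the property the paragraph refers to.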

Cross entropy loss function:

The cross entropy loss function measures how close the actual output is to the desired output. Like other loss functions, it is used to update the weights of the connections between neurons so as to reduce the training error. Compared with the squared-error loss function, the cross entropy loss function overcomes the problem of slow learning. It is often used in classification problems and, because it involves computing the probability of each class, it is mainly used together with the Sigmoid function and similar functions.
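
The "slow learning" remark can be made concrete with a small sketch, assuming a single sigmoid output unit with pre-activation `z` and a squared-error loss defined as 0.5·(a − y)². The function names are illustrative only. With cross entropy, the gradient with respect to `z` is a − y; with squared error it carries an extra factor a·(1 − a), which is close to zero when the unit is saturated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, a, eps=1e-12):
    """Binary cross entropy for target y and predicted probability a."""
    a = min(max(a, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def grad_bce_wrt_z(y, z):
    """d(BCE)/dz for a sigmoid output: simply a - y."""
    return sigmoid(z) - y

def grad_mse_wrt_z(y, z):
    """d(0.5 * (a - y)^2)/dz: the sigmoid derivative a * (1 - a) damps it."""
    a = sigmoid(z)
    return (a - y) * a * (1 - a)

# A confidently wrong prediction: target 1, pre-activation -6.
g_ce = grad_bce_wrt_z(1, -6.0)
g_mse = grad_mse_wrt_z(1, -6.0)
```

For this saturated, wrong prediction the cross entropy gradient stays close to −1, while the squared-error gradient is several hundred times smaller in magnitude, so weight updates are correspondingly slower.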

Sigmoid function:

The Sigmoid function is an S-shaped function commonly seen in biology. The probability of each category is calculated using the Sigmoid function; the categories are not necessarily mutually exclusive, so one object can be predicted as belonging to multiple categories, and the feature values can be utilized to a greater extent.
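
A short sketch may help contrast independent Sigmoid outputs with softmax. The two logits below are hypothetical scores, not values from the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    shifted = [z - max(zs) for z in zs]  # subtract the max for stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5]  # hypothetical scores for two categories

independent = [sigmoid(z) for z in logits]  # each in (0, 1), no sum constraint
exclusive = softmax(logits)                 # forced to sum to 1
```

Both Sigmoid outputs can exceed 0.5 at the same time, so one object can be assigned several labels, whereas softmax makes the categories compete for a total probability of 1.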

Leaky ReLU activation function:

Because the output of the ReLU activation function is zero when the input is negative, its gradient is also zero there, so neurons can easily stop learning during training. To solve this problem, Leaky ReLU assigns a non-zero slope to all negative values. Because the derivative is never zero, the occurrence of silent neurons is reduced, gradient-based learning remains possible, and the problem that neurons cannot learn after the ReLU function enters the negative interval is solved.
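
The difference between the two activations can be sketched in a few lines. The negative slope `alpha = 0.1` is an assumption (a value often used in Darknet-style implementations); the source does not state the slope.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.1):
    """alpha is the assumed slope for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.1):
    return 1.0 if x > 0 else alpha
```

For a negative input such as −2.0, `relu_grad` returns 0, so no gradient flows back, while `leaky_relu_grad` returns `alpha`, so the neuron can still be updated.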

DIOU loss function:

The IoU loss function reflects how well the predicted box matches the ground-truth box, but it is not sensitive to scale, and it cannot directly optimize the case where the boxes do not overlap. DIoU takes into account the distance, the overlap rate and the scale between the target and the anchor, so that the regression of the target box becomes more stable, problems such as divergence during training that occur with IoU and GIoU are avoided, and the detection accuracy is improved.
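
As an illustration, the commonly published form of the DIoU loss is 1 − IoU + ρ²/c², where ρ is the distance between the two box centers and c is the diagonal of the smallest box enclosing both. The sketch below assumes boxes given as (x1, y1, x2, y2) corner coordinates; it is not taken from the patent's own code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def diou_loss(pred, gt):
    """DIoU loss: 1 - IoU + (center distance)^2 / (enclosing diagonal)^2."""
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    center_dist2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # Smallest box that encloses both the prediction and the ground truth.
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    penalty = center_dist2 / diag2 if diag2 > 0 else 0.0
    return 1.0 - iou(pred, gt) + penalty

gt_box = (0.0, 0.0, 1.0, 1.0)
near = diou_loss((2.0, 0.0, 3.0, 1.0), gt_box)  # no overlap, close
far = diou_loss((5.0, 0.0, 6.0, 1.0), gt_box)   # no overlap, farther away
```

Both predictions have an IoU of 0 with the ground truth, so a plain IoU loss cannot tell them apart; the distance penalty makes the farther box cost more, which gives a useful gradient even without overlap.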

Disclosure of Invention

Purpose of the invention: to address the low detection rate and high missed detection rate of the YOLOv3 network on small objects, the invention provides a helmet wearing detection method based on an improved YOLOv3 network, with improved detection speed and accuracy.

The idea of the invention is as follows: an SHWDS data set is built by capturing frames from construction site video surveillance data and by screening samples from a public data set. Based on the YOLOv3 network and according to the size characteristics of the data set samples, the 8x and 16x downsampled feature information are each fused with deep semantic information to establish feature fusion target detection layers at these two scales. To avoid vanishing gradients and enhance feature reuse, 2 DBL units and 2 ResNet units are used before the 8x downsampled feature map. Meanwhile, to increase the stability and convergence rate of target box regression, a DIoU loss function replaces the original loss function, yielding the model network structure used for detection. The data are clustered using the K-means algorithm, the eight prior box sizes obtained by clustering are applied to the network layer of each scale, training and detection are performed on the self-made data set, and the best trained model is saved for use on real construction sites.

Further, the specific steps of establishing the helmet wearing detection data set in the step (1) are as follows:

(1.1) acquiring image data from a video file by using the OpenCV development library, performing multi-label annotation with the open source labeling tool LabelImg, and automatically generating corresponding xml format annotation files to obtain a data set a;

(1.2) downloading the Safety Helmet Detection data set from the Kaggle competition website according to experiment requirements, and processing the data set to obtain the required data set b;

(1.3) merging the data sets a and b and dividing the merged data set according to a certain proportion;

(1.4) expanding the training set and verification set samples by using image enhancement techniques;

(1.5) randomly selecting a certain number of pictures from the expanded data set as a training set and a verification set, and processing them according to the VOC2007 format to obtain the SHWDS data set.

Further, the specific steps of building the improved YOLOv3 network model in the step (2) are as follows:

(2.1) constructing the YOLOv3 feature extraction network Darknet53;

(2.2) fusing the 16x downsampling layer with deep semantic information to serve as the first YOLO detection layer;

(2.3) performing an UpSample operation and concatenating the result with the 8x downsampled feature map to serve as the second YOLO detection layer.

Further, the specific steps of training the improved YOLOv3 network model by using the homemade SHWDS data set in step (3) are as follows:

(3.1) defining the height and width of an input network picture;

(3.2) analyzing all xml files in the training verification set to obtain a train.

(3.3) carrying out a clustering operation on the txt file obtained to obtain the required anchor boxes;

(3.4) creating an initial model model_body;

(3.5) loading the YOLOv3 pre-training weights;

(3.6) defining a loss function model_loss;

(3.7) creating the final training model using the functional model class Model;

(3.8) defining the variables required by training and assigning values;

(3.9) freezing the first 185 layers of the training model;

(3.10) carrying out model training and updating variable values;

(3.11) when epoch <= 50, if j equals 5, performing step 3.13, otherwise performing step 3.10;

(3.12) when 50 < epoch <= 100, performing step 3.13;

(3.13) training all layers, setting batch_size to 8, the optimizer to Adam, the learning rate to 0.0001, and the loss function to the custom model_loss;

(3.14) if j is 5 or epoch is 100, saving the model weights last1.h5 and ending the model training.
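
The anchor clustering of step (3.3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's code: boxes are reduced to (width, height) pairs, the distance is 1 − IoU with both boxes aligned at a common corner (the usual choice for YOLO anchor clustering), cluster centers are updated with the mean, and the function name, `iters` and `seed` are hypothetical.

```python
import random

def wh_iou(box, center):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchors using the distance 1 - IoU."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = min(range(k), key=lambda i: 1.0 - wh_iou(box, centers[i]))
            clusters[nearest].append(box)
        new_centers = []
        for i, members in enumerate(clusters):
            if not members:  # keep the old center if a cluster is empty
                new_centers.append(centers[i])
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centers.append((mean_w, mean_h))
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])

# Toy data: three small and three large boxes, clustered into two anchors.
toy_boxes = [(10, 10), (12, 11), (11, 12), (100, 100), (110, 95), (95, 105)]
toy_anchors = kmeans_anchors(toy_boxes, k=2)
```

For the method described here, `k` would be 8 and the sorted anchors would then be split between the two detection layers.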

Further, the specific steps of saving the optimal model in the step (4) for detecting whether workers on the construction site wear safety helmets are as follows:

(4.1) preprocessing the original picture to obtain a new 416 × 416 picture and executing a normalization operation;

(4.2) inputting the processed picture into the trained YOLOv3 network model for detection;

(4.3) acquiring each category and the corresponding target box position, displaying them on the picture, and finishing the detection.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a flow chart of building the SHWDS data set in FIG. 1;

FIG. 3 is a flow chart of the construction of the improved YOLOv3 network model in FIG. 1;

FIG. 4 is a flow chart of training the improved YOLOv3 network model using the SHWDS data set in FIG. 1;

FIG. 5 is a flow chart of saving the optimal model for detecting helmet wearing by workers at a construction site in FIG. 1.

Detailed Description

The invention is further elucidated with reference to the drawings and the detailed description.

As shown in FIGS. 1-5, the present invention comprises the following steps:

step one: referring to FIG. 2, step 101 of building the homemade SHWDS data set comprises steps 201 to 205:

step 201: using the OpenCV development library to capture frames from a video file at fixed time intervals and save them as image files; using the open source labeling tool LabelImg to annotate the image files with multiple labels in the VOC2007 annotation format, automatically generating corresponding xml format annotation files that contain the object names and the coordinate information of the bounding boxes. The categories are helmet (wearing a safety helmet) and head (not wearing a safety helmet), and a data set a is obtained;

step 202: according to the experiment requirements, downloading the Safety Helmet Detection data set from the Kaggle competition website, parsing the xml files with ElementTree, and deleting the target box information whose object name is person to obtain a data set b;

step 203: merging the data sets a and b to obtain a data set c with 8174 pictures in total, and according to the data set c: 2, dividing the ratio into a training set and a test set;

step 204: using image enhancement techniques, the original images in the training set are scaled by a factor between 0.25 and 2, the width and height are distorted by a factor between 0.5 and 1.5, the hue is adjusted by 0.1, and the brightness and saturation are enhanced to 1.5 times those of the original image; data jittering is also used to expand the samples, giving an expansion set with 8208 pictures in total;

step 205: randomly selecting 6660 pictures from the expansion set as the training set and 1548 pictures as the verification set, annotating them in the VOC2007 format to obtain the SHWDS data set, and analyzing the data set to find that the ratio of medium and large objects to small objects is 5:95;
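
The annotation filtering of step 202 can be sketched with the standard library's ElementTree. The sample annotation below is a hypothetical, heavily shortened VOC-style file, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened VOC-style annotation with three labeled objects.
SAMPLE_XML = """
<annotation>
  <filename>site_0001.jpg</filename>
  <object><name>helmet</name>
    <bndbox><xmin>10</xmin><ymin>12</ymin><xmax>40</xmax><ymax>45</ymax></bndbox>
  </object>
  <object><name>person</name>
    <bndbox><xmin>5</xmin><ymin>10</ymin><xmax>80</xmax><ymax>200</ymax></bndbox>
  </object>
  <object><name>head</name>
    <bndbox><xmin>90</xmin><ymin>15</ymin><xmax>120</xmax><ymax>50</ymax></bndbox>
  </object>
</annotation>
"""

def drop_objects(xml_text, unwanted="person"):
    """Return the parsed annotation with all <object> entries of one class removed."""
    root = ET.fromstring(xml_text)
    # findall returns a list, so removing elements while looping is safe.
    for obj in root.findall("object"):
        if obj.findtext("name") == unwanted:
            root.remove(obj)
    return root

cleaned = drop_objects(SAMPLE_XML)
remaining = [obj.findtext("name") for obj in cleaned.findall("object")]
```

After filtering, only the helmet and head boxes remain, which matches the two categories used by the data set.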

step two: as shown in fig. 3, step 102 of building an improved YOLOv3 network model includes steps 301 to 306:

step 301: using successive 3 × 3 and 1 × 1 convolutions and shortcut connections to realize the backbone network Darknet53, with Leaky ReLU as the activation function;

step 302: fusing the deep semantic information output by the con75 layer with the 16x downsampling layer to serve as the first YOLO detection layer;

step 303: upsampling the 16x downsampled feature map and concatenating it with the 8x downsampled feature map to serve as the second YOLO detection layer;

step 304: using 6 DBL units and one 1 × 1 convolution before the first YOLO detection layer;

step 305: using 2 DBL units, 2 ResNet units and one 1 × 1 convolution before the second YOLO detection layer;

step 306: setting DIoU as the regression loss function of the target box, and using a cross entropy loss function as the loss function for confidence and category;
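
A quick calculation shows what the two detection layers would output. The input size of 416, the 4 anchor boxes per layer and the 2 classes are the values given elsewhere in this description; the per-anchor layout of 4 box offsets plus 1 objectness score plus the class scores is the standard YOLOv3 encoding and is an assumption here, as is the function name.

```python
def head_shape(input_size, stride, anchors_per_layer, num_classes):
    """Output shape (grid, grid, channels) of one YOLO detection layer."""
    grid = input_size // stride
    # Per anchor: 4 box offsets + 1 objectness score + class scores (assumed layout).
    channels = anchors_per_layer * (5 + num_classes)
    return (grid, grid, channels)

first_layer = head_shape(416, 16, 4, 2)   # 16x downsampling
second_layer = head_shape(416, 8, 4, 2)   # 8x downsampling
```

Under these assumptions the 16x layer yields a 26 × 26 grid and the 8x layer a 52 × 52 grid, each with 28 output channels; the finer 52 × 52 grid is what makes the second layer better suited to small helmets.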

step three: referring to FIG. 4, step 103 of training the improved YOLOv3 network model using the homemade SHWDS data set comprises steps 401 to 419:

step 401: defining the height h and the width w of the input network picture, where w = 416 and h = 416;

step 402: analyzing the training verification set by using an ElementTree to obtain a train.

step 403: performing a clustering operation on the txt file obtained above to obtain 8 anchor boxes, where the distance is calculated as d(b, c) = 1 - IoU(b, c). Here b represents the set of ground-truth boxes, c represents the set of cluster centers of the bounding boxes, IoU(b, c) represents the ratio of the intersection to the union of a ground-truth box and a cluster center, and each feature layer corresponds to 4 anchor boxes;

step 404: using the yolo_body function to create the initial model model_body, with num_anchors = 4 and num_classes = 2;

step 405: model_body loads the YOLOv3 pre-training weights yolov3_weight;

step 406: defining the loss function model_loss, with anchors set to 8, num_class set to 2, and ignore_thresh set to 0.5;

step 407: creating the final training model using the functional model class Model, with model_body.input and y_true as inputs and model_loss as output, where model_body.input is the input of the initial model and y_true is the label value;

step 408: defining the training epoch counter epoch = 0, the counting variables j1 = 0, j2 = 0 and k = 0, the val_loss variation m, the lr variation n, and the minimum val_loss improvement min_delta = 0;

step 409: freezing the first 185 layers of the model, and setting batch_size to 16, the learning rate lr to 0.001, the optimizer to Adam, and the loss function to model_loss;

step 410: carrying out model training to update the weights of the neural network connections, with epoch = epoch + 1;

step 411: if n = 0, then k = k + 1;

step 412: if m < min_delta, then j1 = j1 + 1 and j2 = j2 + 1;

step 413: if k = 2, then lr = 0.5 and k = 0;

step 414: when epoch <= 50, if j2 = 5, go to step 416; otherwise go to step 410;

step 415: when 50 < epoch < 100, if j1 = 5, go to step 418; otherwise go to step 416;

step 416: training all layers, setting batch_size to 16, lr to 0.0001, the loss function to model_loss and the optimizer to Adam;

step 417: when epoch = 100, go to step 418;

step 418: saving the model weights last1.h5;

step 419: finishing the model training.
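
One possible reading of the control flow in steps 408 to 419 is sketched below as a pure-Python replay over a list of validation losses; no network is trained. Several points are interpretations rather than statements from the source: the counters are reset whenever the validation loss improves, the learning rate is halved after 2 stalled epochs (reading the 0.5 in step 413 as a reduction factor), and training switches to the all-layers stage after 5 stalled epochs or at epoch 50. The function and parameter names are hypothetical.

```python
def run_schedule(val_losses, min_delta=0.0, lr_patience=2, stop_patience=5):
    """Replay per-epoch validation losses through a two-stage schedule.

    Returns a list of (epoch, stage, learning_rate) tuples, one per epoch run.
    """
    stage, lr = 1, 1e-3            # stage 1: first 185 layers frozen
    best = float("inf")
    stall_lr = stall_stop = 0
    log = []
    for epoch, loss in enumerate(val_losses[:100], start=1):
        if best - loss > min_delta:        # validation loss improved
            best, stall_lr, stall_stop = loss, 0, 0
        else:
            stall_lr += 1
            stall_stop += 1
        if stall_lr == lr_patience:        # plateau: halve the learning rate
            lr, stall_lr = lr * 0.5, 0
        log.append((epoch, stage, lr))
        if stage == 1 and (stall_stop == stop_patience or epoch == 50):
            stage, lr = 2, 1e-4            # stage 2: train all layers
            stall_lr = stall_stop = 0
        elif stage == 2 and stall_stop == stop_patience:
            break                          # weights would be saved here
    return log

# Hypothetical loss curve: improves, stalls for 5 epochs, improves once, stalls again.
losses = [1.0, 0.9, 0.8] + [0.8] * 5 + [0.7] + [0.7] * 5
history = run_schedule(losses)
```

In a Keras-style setup this behaviour is usually obtained with learning-rate reduction and early-stopping callbacks rather than hand-written counters; the replay above only makes the branching explicit.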

step four: referring to FIG. 5, step 104 of saving the optimal model for detecting helmet wearing by workers at a construction site comprises steps 501 to 504:

step 501: calling the letterbox_image function to generate a new 416 × 416 picture padded with "absolute gray" (R128, G128, B128);

step 502: dividing the pixel values of the new picture by 255 for normalization;

step 503: inputting the processed picture into the trained model for detection;

step 504: acquiring whether each worker wears a safety helmet and the positions of the target boxes corresponding to the categories, and displaying them on the picture to finish the detection.
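
The geometry behind steps 501 and 502 can be sketched without any image library. The helper names below are hypothetical and only compute where the resized picture would sit on the 416 × 416 gray canvas; they are not the letterbox_image function itself, whose implementation is not given in the source.

```python
GRAY = (128, 128, 128)  # padding color named in step 501

def letterbox_geometry(width, height, target=416):
    """Resized size and padding offsets that preserve the aspect ratio."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    dx, dy = (target - new_w) // 2, (target - new_h) // 2
    return new_w, new_h, dx, dy

def normalize_pixel(rgb):
    """Divide each channel by 255 so values fall in [0, 1] (step 502)."""
    return tuple(channel / 255.0 for channel in rgb)

# A hypothetical 1920 x 1080 surveillance frame.
geometry = letterbox_geometry(1920, 1080)
gray_normalized = normalize_pixel(GRAY)
```

For the 1920 × 1080 frame, the picture is resized to 416 × 234 and centered vertically with 91 rows of gray padding above and below, so the aspect ratio of the helmets is not distorted before detection.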

To better illustrate the effectiveness of the method, tests are carried out on the test set of the SHWDS data set and on surveillance video from a construction site, comparing the original YOLOv3 network with the IoU loss function against the improved YOLOv3 network with the DIoU loss function of this method. The experimental results show that the improved YOLOv3 network increases both detection speed and detection precision by about 2%.

The above description is only an example of the present invention and is not intended to limit the present invention. All equivalents which come within the spirit of the invention are therefore intended to be embraced therein. Details not described herein are well within the skill of those in the art.

Claims

1. A safety helmet wearing detection method based on an improved YOLOv3 network is characterized by comprising the following specific steps:

(1) building an SHWDS data set by capturing construction site video data and by screening samples from a public data set;

(2) building an improved YOLOv3 network model;

(3) training the model by using a self-made SHWDS data set;

(4) saving the optimal model for detecting whether workers wear safety helmets on the construction site.