CN112989958A - Helmet wearing identification method based on YOLOv4 and significance detection - Google Patents

Helmet wearing identification method based on YOLOv4 and significance detection

Info

Publication number
CN112989958A
Authority
CN
China
Prior art keywords
significance
target
detection
saliency
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110195098.0A
Other languages
Chinese (zh)
Inventor
李岳阳
兰天
罗海驰
杜鹏
朱一昕
樊启高
毕恺韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Original Assignee
Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute filed Critical Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Priority to CN202110195098.0A
Publication of CN112989958A
Legal status: Pending

Classifications

    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport video content
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045: Neural networks; Combinations of networks
    • G06N3/08: Neural networks; Learning methods
    • G06T7/11: Region-based segmentation
    • G06T7/136: Segmentation; Edge detection involving thresholding
    • G06T7/194: Segmentation involving foreground-background segmentation
    • G06T7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06V10/30: Image preprocessing; Noise filtering
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20132: Image cropping
    • G06T2207/30196: Human being; Person
    • G06V2201/07: Target detection

Abstract

The invention discloses a helmet wearing identification method based on YOLOv4 and saliency detection, which comprises the following steps: labeling an existing data set and training a YOLOv4 target detection model; downloading and expanding a saliency data set and training a saliency detection model; obtaining the target recognition result and target frame information with the trained target detection model; obtaining the saliency estimation of the image with the trained saliency detection model; and cropping the saliency estimation using the position information of the target frames. In this way, the dynamic saliency estimation of the image obtained with the saliency detection method effectively removes the background influence and yields a high-saliency, pixel-level result for moving targets. Rechecking the target detection result against the saliency detection result greatly reduces the false detection probability, effectively distinguishes background items that interfere with target detection, and improves target detection accuracy.

Description

Helmet wearing identification method based on YOLOv4 and significance detection
Technical Field
The invention relates to the field of machine vision and pattern recognition, in particular to a helmet wearing recognition method based on YOLOv4 and significance detection.
Background
The safety helmet is the most common and practical personal protection appliance, and can effectively prevent and reduce the head injury caused by external dangerous sources. For a long time, the problems of low comprehensive quality and weak safety consciousness of operating personnel in construction areas in China generally exist, the wearing consciousness of basic protection facilities such as safety helmets is lacked, the operation risk is greatly increased, and safety accidents happen occasionally. From the published network data, safety incidents due to unsafe behavior of the constructors account for 95% of the total category of all incidents.
At present, to address the safety hazards of helmets not being worn, enterprises mainly rely on inspection rounds by the relevant managers or on security personnel checking surveillance video. This not only wastes manpower and material resources but is also inefficient.
In recent years, artificial intelligence technology has matured, and successful applications of deep learning and computer vision are numerous, such as the widely known speech recognition, fingerprint recognition and face recognition. These methods have the advantages of full automation, freedom from human interference and high accuracy, and can be applied to fields such as supervision and security. Once such technology is popularized, it will bring great changes to society, free people from simple repetitive labor, and greatly improve social productivity.
Disclosure of Invention
The invention mainly solves the technical problem of providing a helmet wearing identification method based on YOLOv4 and saliency detection. The method addresses the influence of complex backgrounds in conventional target detection and saliency detection and reliably identifies workers wearing helmets: the frame positions from target detection are used to crop the corresponding saliency estimation image into saliency estimation maps of each single target, and these maps are used to recheck the targets, finally improving the identification accuracy.
In order to solve the technical problems, the invention adopts a technical scheme that: the helmet wearing identification method based on YOLOv4 and significance detection comprises the following steps:
step 1: labeling an existing data set, and training a Yolov4 target detection model;
marking the data by using marking software to obtain a file recording the size and the label of the target position in each picture, and dividing the data into two parts which are respectively a training set and a test set;
adopting a YOLOv4 network as a target detection model, during training, based on the output of target detection, carrying out iterative computation to minimize the loss function value of the detection model, and obtaining the trained target detection model after reaching the predetermined iteration times;
step 2: downloading and expanding a significance data set, and training a significance detection model;
the training set used in training the significance detection model is from a significance detection data set disclosed in a network, data expansion is carried out on the existing significance detection data set to obtain a usable video training data set, and the significance detection model with higher accuracy is trained;
and step 3: obtaining a recognition result of the target and target frame information by using the trained target detection model;
transmitting a single frame obtained from a picture or a video shot by a camera into the target detection model by using the well-trained Yolov4 target detection model in the step 1 to obtain the output of a target detection result;
and 4, step 4: obtaining the saliency estimation of the image by using the trained saliency detection model;
performing a significance detection task by using the significance model trained in the step 2, inputting each frame of image monitored by the video into a neural network, and then outputting significance mapping by the network; the static saliency model takes a single-frame image as input and generates pixel-level saliency estimation; the input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated;
and 5: clipping the significance estimation by using the position information of the target frame body;
because the original image, the result image of the target detection model and the estimated image identified by the saliency detection model are consistent in size and specification, the target frame position in the target detection result corresponds to the target position in the saliency estimated image, and the target position can be marked in the saliency estimated image by directly utilizing the output finally obtained in the step 3;
step 6: rechecking single small pictures of all targets;
and 5, judging whether the clipped single saliency map has the target or not by using the saliency estimation map of each target obtained in the step 5.
In a preferred embodiment of the present invention, in step 1, the YOLOv4 network mainly includes a backbone feature extraction network and an enhanced feature extraction network;
the main feature extraction network adopts a CSPDarkNet53 architecture, the input of the main feature extraction network is a 3-channel picture, and in order to ensure the input consistency, the original picture is scaled in an equal ratio; then, in order to ensure that the picture is not distorted, the length-width ratio of the picture is not changed when the short edge is adjusted, and the gray area is expanded up and down or left and right on the short edge; in the trunk feature extraction network, a CSPnet improved residual block is adopted for convolution for many times, and finally three results of feature extraction are input of a subsequent enhanced feature extraction network;
the reinforced feature extraction network comprises an SPP structure and a PANet structure; in the SPP structure, after the last feature layer of CSPdakrnet 53 is convoluted for three times, the processing is carried out by respectively utilizing the maximum pooling of four different scales, so that the receptive field can be greatly increased, and the most obvious context feature can be separated; the PANET is an example segmentation algorithm, and the specific structure of the PANet has the function of repeatedly improving the characteristics;
three effective characteristic layers obtained by a main characteristic extraction network and an SPP structure use multiple convolution up-sampling and convolution down-sampling to effectively realize characteristic extraction and finally obtain three YOLOread outputs.
In a preferred embodiment of the present invention, in step 1, the loss function value is calculated by the following three parts:
1) calculating the error between the position of the predicted frame and the position of the real frame by using the CIOU; compared with the IOU, when the CIOU is calculated, the errors caused by different positions and different frame width-height ratios are considered when the intersection area ratio between the two frames is the same, so that the result is more accurate;
2) errors due to target confidence; when a target is correctly detected, the higher the confidence score is, the smaller the error is, otherwise, the larger the error is;
3) errors caused by category recognition; i.e. the comparison of the predicted class result with the actual result.
In a preferred embodiment of the present invention, each of the saliency detection data sets disclosed in step 2 includes thousands of pictures and corresponding saliency maps for training a static saliency detection network; for training the dynamic significance detection network, a video frame data set adjacent to the pictures is also needed;
the difference information at the pixel level between adjacent frames is represented by an optical flow field V = (u, V), where u is a vertical direction, V is a horizontal direction, and the position of one point is represented by X (X, y). Therefore, the difference relationship between the adjacent frames I and I' can be expressed by the following formula:
Figure RE-RE-DEST_PATH_IMAGE001
the vertical direction u is taken as an example, since the principles in the horizontal and vertical directions are identical. And dividing pixels in the picture into foreground pixels f and background pixels b, and processing the foreground pixels and the background pixels separately. Extracting 10% of random initialization motion values from background pixels b, wherein the range is [ -d, d ], d = h/10, and h is the height of a picture, and jogging partial pixels in the background to simulate the noise of the background in an actual video; in the foreground pixel f, firstly, a foreground main motion mode m is assumed, the numerical value of m is the main motion direction and the distance of a foreground target between two frames, and then, the value of each pixel is randomly selected in an interval [ m-d/10, m + d/10] to generate the motion difference of different pixels, so that the effect that all pixels of the foreground have the same overall movement trend and the specific movement distance of each pixel is different and is consistent with the actual effect is achieved; thus, a new picture which enables the foreground object to move based on the original picture is generated; after the foreground pixels and the background pixels are processed, the foreground pixels and the background pixels are combined to generate a multi-frame video data set with a target moving.
In a preferred embodiment of the present invention, the overall operation formula of the static saliency network is:
$$Y = D_S\big(F_S(I;\ \theta_F);\ \theta_D\big)$$
where Y is the image output, I is the image input, F_S is the feature output generated by the convolution layers, D_S is the deconvolution operation that ensures the output Y has the same size as the input image I, and θ_F and θ_D denote the parameters used during convolution and deconvolution;
in deconvolution, the feature matrix is mutually corresponding to the first half convolution operation, and after expansion, the feature matrix is multiplied by a transposed matrix of a convolution kernel in the convolution step to obtain a feature map with the size expanded by two times.
In a preferred embodiment of the present invention, the dynamic saliency estimation is obtained by simultaneously inputting the static saliency estimation map, the original frame and its adjacent frame into a dynamic saliency detection network; the three are concatenated along the channel dimension to obtain an input of format (A, B, C), which enters the convolution operation of the first layer according to the following formula:
$$Y_1 = W * \big[I_t,\ I_{t+1},\ P_t\big] + b$$
where W is the convolution weight and b is the bias term; I_t and I_{t+1} are two adjacent frames, and P_t is the static saliency image temporally corresponding to I_t; the convolution and deconvolution operations of the subsequent dynamic saliency network are consistent with the static saliency network structure; by comparing the pixel-level optical flow of two adjacent frames, dynamic high-saliency targets can be better detected and the saliency recognition accuracy is improved.
In a preferred embodiment of the present invention, YOLOv4 divides each input image into three scales for output in step 3, each scale corresponds to three prior frames, and the three outputs total nine prior frames for detection;
the output of the first scale is subjected to convolution operation for multiple times, the compression degree is large, the method is suitable for identifying and detecting a large target, and the corresponding prior frames are the three largest prior frames;
the scale two is positioned in the middle of the output of the three scales, is suitable for detecting the target with the medium size, and utilizes three prior frames with the medium size;
the scale III is an output format with the least convolution times, so that three prior frames with smaller scales are utilized to have better identification effect on small targets in the picture;
in the recognition target positioning, YOLOv4 adopts the following formula in order to obtain the frame position information:
$$b_x = \sigma(t_x) + c_x,\quad b_y = \sigma(t_y) + c_y,\quad b_w = p_w e^{t_w},\quad b_h = p_h e^{t_h}$$
where (t_x, t_y, t_w, t_h) is the prediction output of the model, i.e. the network learning target; (c_x, c_y) is the coordinate offset of the grid cell, in units of cell side length; (p_w, p_h) is the preset side length of the anchor box; and (b_x, b_y, b_w, b_h) is the finally obtained center coordinates and width and height of the predicted bounding box.
In a preferred embodiment of the present invention, in step 4, in order to effectively distinguish the background part and the moving target part in the image, the model is divided into a static model and a dynamic model, which are combined with each other to capture the space and time information of the image at the same time, and a pixel-level saliency map is directly generated through a full convolution network;
the static saliency model takes a single-frame image as input and generates pixel-level saliency estimation; the input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated.
In a preferred embodiment of the present invention, in step 6, a highlight white in the saliency estimation is defined as an effective target, and a black is defined as a background; if the white proportion is large, the identified target is a valid target, and the recheck is passed; and if the black ratio is large, the identified target is a background, and the identification result is judged to be false detection.
The invention has the beneficial effects that: according to the method, a target detection model is established by using YOLOv4, then a saliency detection model is used for rechecking a YOLOv4 target detection result, and a Convolutional Neural Network (CNN) is adopted for both models, so that the method has good robustness and accuracy.
Compared with the method of only using YOLOv4 and the like to carry out the target detection task, the method of the invention also uses the significance detection method to obtain the dynamic significance estimation of the image, thereby effectively removing the background influence and obtaining the high significance result of the moving target pixel level. The target detection result is rechecked through the significance detection result, so that the false detection probability can be greatly reduced, interference items to the target detection in the background can be effectively distinguished, and the target detection precision is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a network structure of a Yolov4 target detection model;
FIG. 3 is a static saliency model network structure used by the present invention;
FIG. 4 is a combined structure of a static saliency module and a dynamic saliency module of the present invention;
FIG. 5 is an example of the recognition result of the present invention;
FIG. 6 is a diagram illustrating an exemplary review operation performed on a false positive test picture according to the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "front" and "back" and the like indicate orientations and positional relationships based on orientations and positional relationships shown in the drawings or orientations and positional relationships where the products of the present invention are conventionally placed in use, and are used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the devices or elements to be referred must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly, e.g., as being either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the present invention, unless otherwise expressly stated or limited, the first feature may be present on or under the second feature in direct contact with the first and second feature, or may be present in the first and second feature not in direct contact but in contact with another feature between them. Also, the first feature being above, on or above the second feature includes the first feature being directly above and obliquely above the second feature, or merely means that the first feature is at a higher level than the second feature. A first feature that underlies, and underlies a second feature includes a first feature that is directly under and obliquely under a second feature, or simply means that the first feature is at a lesser level than the second feature.
The embodiment of the invention comprises the following steps:
a helmet wearing identification method based on YOLOv4 and significance detection comprises the following steps:
step 1: labeling the existing data set, and training a Yolov4 target detection model.
Annotation software is used to carry out the data labeling. Such software is oriented toward target detection tasks and can quickly and conveniently label the bounding box and label of each picture, so it is well suited to the target detection task. After labeling is finished, an xml file is obtained for each picture that records the target positions, sizes and labels, completing the preparation of the data required for training. These data are then divided into two parts, a training set and a test set.
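As an illustration only, the following minimal sketch assumes the annotation files follow the Pascal VOC xml layout produced by common labeling tools; the directory name and split ratio are hypothetical.

```python
import glob, random
import xml.etree.ElementTree as ET

def load_voc_annotation(xml_path):
    """Read one annotation file, returning (width, height, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    w, h = int(size.find("width").text), int(size.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        label = obj.find("name").text            # e.g. "hat" or "person"
        bb = obj.find("bndbox")
        boxes.append((label,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return w, h, boxes

# Split the annotated data into a training set and a test set.
xml_files = sorted(glob.glob("annotations/*.xml"))   # hypothetical directory
random.seed(0)
random.shuffle(xml_files)
split = int(0.88 * len(xml_files))                   # split ratio is an assumption
train_set, test_set = xml_files[:split], xml_files[split:]
```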
In step 1, a YOLOv4 network is used as a target detection model, as shown in fig. 2, which mainly includes a trunk feature extraction network and a reinforced feature extraction network. The main feature extraction network adopts a CSPDarkNet53 architecture, 3-channel pictures with 416 × 416 are input, and in order to ensure the input consistency, the original pictures are scaled equally, and the long edges of the original pictures are set to 416 sizes. Then, in order to ensure that the picture is not distorted, the aspect ratio of the picture is not changed when the short edge is adjusted, and the gray area is expanded above, below, left and right of the short edge, so that the whole picture reaches the input size of 416 x 416. In a backbone network, a CSPnet improved residual block is adopted for convolution for many times, the maximum characteristic of the residual network is easy to optimize, meanwhile, a certain network depth can be increased to improve the accuracy, and three results of final feature extraction are input of a subsequent enhanced feature extraction network.
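A minimal sketch of this letterbox-style resizing, written with Pillow and assuming the 416 x 416 input size described above, might look like the following:

```python
from PIL import Image

def letterbox(image, target=416, fill=(128, 128, 128)):
    """Scale the long side to `target`, keep the aspect ratio, and pad the short side with gray."""
    w, h = image.size
    scale = target / max(w, h)
    nw, nh = int(round(w * scale)), int(round(h * scale))
    resized = image.resize((nw, nh), Image.BICUBIC)
    canvas = Image.new("RGB", (target, target), fill)                 # gray background
    canvas.paste(resized, ((target - nw) // 2, (target - nh) // 2))   # pad top/bottom or left/right
    return canvas
```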
The enhanced feature extraction network comprises an SPP structure and a PANet structure. In the SPP structure, the last feature layer of CSPDarknet53 is convolved three times and then processed with max pooling at four different scales, with kernel sizes of 13x13, 9x9, 5x5 and 1x1 (1x1 being no processing); this greatly increases the receptive field and separates out the most significant contextual features. PANet is an instance segmentation algorithm whose specific structure serves to repeatedly enhance the features. The three effective feature layers obtained from the backbone feature extraction network and the SPP structure undergo repeated convolution with up-sampling and down-sampling to effectively realize feature extraction, and finally three YOLO Head outputs are obtained. This multi-scale detection method effectively improves detection accuracy.
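For illustration, a PyTorch sketch of an SPP block of this kind (kernel sizes 13, 9 and 5 plus the untouched 1x1 branch, stride 1 with matching padding so the spatial size is preserved) could be written as:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max pooling with 13x13, 9x9 and 5x5 kernels
    (stride 1, size-preserving padding) plus the identity (the 1x1 / no-op branch),
    concatenated along the channel axis."""
    def __init__(self, pool_sizes=(13, 9, 5)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x):
        return torch.cat([p(x) for p in self.pools] + [x], dim=1)

# e.g. a 13x13x512 feature map becomes 13x13x2048 after SPP
feat = torch.randn(1, 512, 13, 13)
print(SPP()(feat).shape)   # torch.Size([1, 2048, 13, 13])
```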
When YOLOv4 is trained, the loss function value of the detection model is iteratively minimized based on the three YOLO Head outputs, and when the predetermined number of iterations is reached the trained target detection model is obtained. The loss function value is calculated from three parts (a sketch combining them follows the list below):
1) the CIOU is used to calculate the error between the predicted frame and the true frame position. Compared with the IOU, when the CIOU is calculated, the errors caused by different positions and different frame width-height ratios are considered when the intersection area ratio between the two frames is the same, so that the result is more accurate.
2) Target confidence. When an object is correctly detected, the higher the confidence score, the smaller the error and vice versa.
3) And identifying errors caused by the category. I.e. the comparison of the predicted species result with the actual result.
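The sketch below, written in PyTorch purely for illustration, combines the three terms above into one loss value; the exact weighting and masking used in YOLOv4 implementations are omitted, and the helper names are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def ciou(box_p, box_t, eps=1e-7):
    """CIoU between predicted and target boxes given as (cx, cy, w, h) tensors."""
    px1, py1 = box_p[..., 0] - box_p[..., 2] / 2, box_p[..., 1] - box_p[..., 3] / 2
    px2, py2 = box_p[..., 0] + box_p[..., 2] / 2, box_p[..., 1] + box_p[..., 3] / 2
    tx1, ty1 = box_t[..., 0] - box_t[..., 2] / 2, box_t[..., 1] - box_t[..., 3] / 2
    tx2, ty2 = box_t[..., 0] + box_t[..., 2] / 2, box_t[..., 1] + box_t[..., 3] / 2
    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = box_p[..., 2] * box_p[..., 3] + box_t[..., 2] * box_t[..., 3] - inter + eps
    iou = inter / union
    # penalty for center distance relative to the diagonal of the enclosing box
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    rho2 = (box_p[..., 0] - box_t[..., 0]) ** 2 + (box_p[..., 1] - box_t[..., 1]) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # penalty for aspect-ratio inconsistency
    v = (4 / math.pi ** 2) * (torch.atan(box_t[..., 2] / (box_t[..., 3] + eps)) -
                              torch.atan(box_p[..., 2] / (box_p[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

def yolo_loss(pred_box, true_box, pred_conf, true_conf, pred_cls, true_cls):
    """Sum of the three terms described above (weights omitted for brevity)."""
    loc_loss  = (1.0 - ciou(pred_box, true_box)).mean()                     # 1) CIoU position error
    conf_loss = F.binary_cross_entropy_with_logits(pred_conf, true_conf)    # 2) confidence error
    cls_loss  = F.binary_cross_entropy_with_logits(pred_cls, true_cls)      # 3) class error
    return loc_loss + conf_loss + cls_loss
```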
Step 2: and downloading and expanding a significance data set, and training a significance detection model.
The training set used in training the significance detection model IS from significance detection data sets disclosed in the network, including ECSSD, HKU-IS, THUR15K, PASCAL-S, DUTS, and the like. The data sets comprise an original image and a picture which corresponds to the original image and is marked with a standard significance target, wherein the picture has a simple pure-color background and a complex and various background. Based on these existing public data sets, a significance detection model with higher accuracy can be trained.
Each public data set includes thousands of pictures, and corresponding saliency maps, that train well static saliency detection networks. For training the dynamic saliency detection network, a video frame data set adjacent to the pictures is also needed, so that in the invention, the data expansion needs to be carried out on the existing saliency picture data set to obtain a usable video training data set.
For different pictures, the present invention represents the pixel-level difference information between adjacent frames by an optical flow field V = (u, v), where u is the vertical component, v is the horizontal component, and X = (x, y) denotes the position of a point. The difference relationship between the adjacent frames I and I' can therefore be expressed as:
$$I(X) = I'\big(X + V(X)\big)$$
the vertical direction u is taken as an example, since the principles in the horizontal and vertical directions are identical. And dividing pixels in the picture into foreground pixels f and background pixels b, and processing the foreground pixels and the background pixels separately. And extracting 10% of random initialization motion values from the background pixels b, wherein the motion values are in the range of [ -d, d ], d = h/10, and h is the height of the picture, and jogging partial pixels in the background to simulate the noise of the background in the actual video. In the foreground pixel f, firstly, a foreground main motion mode m is assumed, the numerical value of m is the main motion direction and the distance of the foreground target between the two frames, and then, the value of each pixel is randomly selected in an interval [ m-d/10, m + d/10] to generate the motion difference of different pixels, so that the effect that all pixels of the foreground have the same overall movement trend and the specific movement distance of each pixel is different is achieved, and the effect is consistent with the actual effect. This generates a new picture that moves the foreground object based on the original picture. After the foreground pixels and the background pixels are processed, the foreground pixels and the background pixels are combined to generate a multi-frame video data set with a target moving.
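A NumPy sketch of this augmentation scheme is given below; it is an interpretation of the procedure just described, with the dominant motion m passed in as an assumed parameter and without the hole filling or smoothing a production implementation might add.

```python
import numpy as np

def synthesize_next_frame(img, sal_mask, m=(6, 0)):
    """Create a pseudo 'next frame' from one still image and its saliency mask.

    img      : H x W x 3 uint8 image
    sal_mask : H x W boolean array, True for foreground (salient object) pixels
    m        : assumed dominant foreground motion (dy, dx) between the two frames
    """
    h, w = sal_mask.shape
    d = h // 10
    out = img.copy()

    # Background: give a random 10% of background pixels a small jitter in [-d, d]
    # to imitate background noise in a real video.
    ys, xs = np.where(~sal_mask)
    idx = np.random.choice(len(ys), size=len(ys) // 10, replace=False)
    jitter = np.random.randint(-d, d + 1, size=(len(idx), 2))
    ny = np.clip(ys[idx] + jitter[:, 0], 0, h - 1)
    nx = np.clip(xs[idx] + jitter[:, 1], 0, w - 1)
    out[ny, nx] = img[ys[idx], xs[idx]]

    # Foreground: every pixel follows the dominant motion m, perturbed per pixel
    # within [m - d/10, m + d/10], so the object moves coherently but not rigidly.
    fy, fx = np.where(sal_mask)
    dy = m[0] + np.random.randint(-d // 10, d // 10 + 1, size=len(fy))
    dx = m[1] + np.random.randint(-d // 10, d // 10 + 1, size=len(fx))
    ny = np.clip(fy + dy, 0, h - 1)
    nx = np.clip(fx + dx, 0, w - 1)
    out[ny, nx] = img[fy, fx]
    return out
```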
In the saliency detection model, a picture or video frame is input into a neural network, which outputs a saliency map, where lighter pixels represent objects with high saliency values and darker pixels represent the background.
As shown in fig. 3, the network structure of the static saliency model is mainly divided into two parts, and the convolution of the left half part is used for extracting picture features; and the right half part is deconvoluted, so that the size of the network output is the same as that of the input.
In the significance detection network, the input picture needs to be scaled to (224, 224, 3), and then subjected to 3 × 3 convolutions 13 times, so as to finally obtain the feature output with the format of (14, 14, 512), and then the feature picture is returned to the original size by using the corresponding deconvolution. Compared with convolution operation of a large convolution kernel, the method has the advantages that the convolution operation of a small convolution kernel for multiple times is used, the network depth can be improved under the condition that the receptive field is not changed, and therefore the network learning accuracy rate is improved. The overall operation formula of the static significance network is as follows:
$$Y = D_S\big(F_S(I;\ \theta_F);\ \theta_D\big)$$
where Y is the image output, I is the image input, F_S is the feature output generated by the convolutional layers, D_S is the deconvolution operation that ensures the output Y has the same size as the input image I, and θ_F and θ_D denote the parameters used during convolution and deconvolution.
During deconvolution, the feature matrix is corresponding to the first half convolution operation, after expansion, the feature matrix is multiplied by a transpose matrix of a convolution kernel during the convolution step, so that a feature graph with twice expanded size is obtained, and finally, the output with the length and the width being 224 is obtained.
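A PyTorch sketch of an encoder-decoder of this shape is shown below; the 13 3x3 convolutions and the 14x14x512 bottleneck follow the description above, while the decoder channel counts are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, n):
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return layers

class StaticSaliencyNet(nn.Module):
    """VGG16-style encoder (13 3x3 convolutions, 224x224x3 -> 14x14x512) followed by
    stride-2 transposed convolutions that bring the map back to 224x224x1."""
    def __init__(self):
        super().__init__()
        enc = conv_block(3, 64, 2) + conv_block(64, 128, 2) + \
              conv_block(128, 256, 3) + conv_block(256, 512, 3) + conv_block(512, 512, 3)
        # drop the final pooling so the bottleneck stops at 14x14x512 rather than 7x7
        self.encoder = nn.Sequential(*enc[:-1])
        self.decoder = nn.Sequential(                       # each layer doubles H and W
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (N, 3, 224, 224)
        return self.decoder(self.encoder(x))       # (N, 1, 224, 224)
```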
To obtain the dynamic saliency estimate, as shown in fig. 4, the static saliency estimation map obtained above, the original frame and its adjacent frame are simultaneously input into the dynamic saliency detection network; the three are concatenated along the channel dimension to obtain an input of format (224, 224, 7), which enters the convolution operation of the first layer according to the following formula:
$$Y_1 = W * \big[I_t,\ I_{t+1},\ P_t\big] + b$$
where W is the convolution weight and b is the bias term; I_t and I_{t+1} are two adjacent frames, and P_t is the static saliency image corresponding to I_t. The convolution and deconvolution operations of the subsequent dynamic saliency network are consistent with the structure of the static saliency network.
Through the optical flow comparison of the pixel levels of two adjacent frames, a dynamic high-significance target can be better detected, and the significance recognition accuracy is improved.
The output of the static significance network is used as part of the input of the dynamic significance network, the space-time significance result of the picture is directly generated, the dynamic significance and the static significance are fused and are explicitly embedded into the dynamic significance network instead of a double-flow network for training space-time characteristics, and repeated operation and redundant network parameters are reduced. The model utilizes the optical flow image to directly infer the time information of two adjacent frames of the video instead of the traditional method of mainly comparing the color difference of different pixels, thereby obtaining higher computational efficiency and precision.
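The channel-wise concatenation of the two frames and the static saliency map into a 7-channel input can be sketched as follows; the number of output channels of the first convolution is an assumption.

```python
import torch
import torch.nn as nn

# Assumed first layer of the dynamic branch: the two adjacent RGB frames (3 + 3 channels)
# and the static saliency map of the first frame (1 channel) are concatenated into a
# 7-channel tensor before the first convolution.
frame_t  = torch.randn(1, 3, 224, 224)   # I_t
frame_t1 = torch.randn(1, 3, 224, 224)   # I_{t+1}
static_p = torch.randn(1, 1, 224, 224)   # P_t, output of the static branch

x = torch.cat([frame_t, frame_t1, static_p], dim=1)   # shape (1, 7, 224, 224)

first_conv = nn.Conv2d(in_channels=7, out_channels=64, kernel_size=3, padding=1)
y1 = torch.relu(first_conv(x))            # y1 = W * [I_t, I_{t+1}, P_t] + b, then ReLU
print(y1.shape)                           # torch.Size([1, 64, 224, 224])
```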
And step 3: and obtaining the recognition result of the target and the target frame information by using the trained target detection model.
The trained YOLOv4 target detection model detects the test data, and there are three possible results in the output image: first, a worker correctly wearing a safety helmet, for whom a frame with the "hat" label is marked on the head; second, a worker not wearing a safety helmet, for whom a frame with the "person" label is marked on the head; and third, a safety helmet that is not worn on a worker's head, which is a potential false detection target. In general, the trained YOLOv4 will not detect and mark such a target, and if it is falsely marked with the "hat" label, it can be screened out in the subsequent recheck operation of the invention.
During detection, the YOLOv4 mainly uses a multi-scale detection method, and can effectively solve the problem that when the position of a target is different from the position of a camera, images can display different sizes.
YOLOv4 divides each input image into three scale outputs, each scale corresponding to three prior boxes, and the three outputs total nine prior boxes for detection. The first scale is 13 × 13, the output of the first scale is subjected to convolution operation for multiple times, the degree of compression is large, the method is suitable for identification and detection of a large target, and the corresponding prior frames are the three largest prior frames.
Scale two is 26 x 26, the middle of the three output scales; it is suitable for detecting medium-sized targets and uses the three medium-sized prior frames.
Scale three is 52 x 52, the output with the fewest convolutions, so the three smallest prior frames are used, giving a better recognition effect on small targets in the picture. YOLOv4 detects the fused feature maps of the multiple scales independently, finally achieving good recognition accuracy for targets of different sizes.
In the recognition target positioning, YOLOv4 adopts the following formula in order to obtain the frame position information:
$$b_x = \sigma(t_x) + c_x,\quad b_y = \sigma(t_y) + c_y,\quad b_w = p_w e^{t_w},\quad b_h = p_h e^{t_h}$$
where (t_x, t_y, t_w, t_h) is the prediction output of the model, i.e. the network learning target; (c_x, c_y) is the coordinate offset of the grid cell, in units of cell side length; (p_w, p_h) is the preset side length of the anchor box; and (b_x, b_y, b_w, b_h) is the finally obtained center coordinates and width and height of the predicted bounding box.
And 3, transmitting the single frame obtained from the picture or video shot by the camera into the target detection model by using the trained helmet detection model obtained in the step 1 to obtain the outputs of the helmet detection result hat and person, and visually seeing whether the shot worker correctly wears the helmet.
To facilitate the cropping of the image in the subsequent step 5, all detected targets and the position information of each target, namely left, top, right and bottom, are marked on the original image. These four values differ from the model's direct output (b_x, b_y, b_w, b_h): (left, top) is the pixel coordinate of the upper-left corner of the frame and (right, bottom) is the pixel coordinate of the lower-right corner of the frame, and they can be obtained by a simple conversion, as sketched below.
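A minimal sketch of this decoding and conversion follows; the use of a stride factor to map cell units and anchors to pixels is an assumption about the implementation.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride, img_w, img_h):
    """Decode one YOLO prediction into corner coordinates in the original frame.

    (tx, ty, tw, th) are raw network outputs; (cx, cy) are cell offsets; (pw, ph) are
    anchor side lengths in pixels; stride maps cell units to pixels (assumed here).
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cx) * stride        # box center, pixels
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)                  # box width and height, pixels
    bh = ph * math.exp(th)
    # convert the center format (bx, by, bw, bh) to the corner format used for cropping
    left   = max(0, int(bx - bw / 2))
    top    = max(0, int(by - bh / 2))
    right  = min(img_w, int(bx + bw / 2))
    bottom = min(img_h, int(by + bh / 2))
    return left, top, right, bottom
```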
And 4, step 4: and obtaining the saliency estimation of the image by using the trained saliency detection model.
And 4, executing a saliency detection task by using the trained saliency model obtained in the step 2, inputting each frame of image monitored by the video into a neural network, and outputting saliency mapping by the network, wherein high-brightness white is a high saliency area, and black is an insignificant background area.
As shown in fig. 4, in order to effectively distinguish a background portion and a moving target portion in an image, a model is divided into a static model and a dynamic model, and the static model and the dynamic model are combined with each other to capture space and time information of the image at the same time, so that a moving high-saliency target in the image is effectively identified, and a pixel-level saliency map is directly generated through a full convolution network.
The static saliency model takes a single frame image as input and generates pixel-level saliency estimation. The input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated.
Compared with the traditional method for comparing the color difference among all pixels, the saliency estimated picture obtained by the method greatly improves the identification accuracy of the moving target. By using the comparison between the adjacent frames, the background can be effectively judged to be low in significance, and the accuracy of the identification result is greatly improved.
To facilitate the subsequent step 5 of cropping the image, the output of the dynamic saliency model is scaled to the original image size.
And 5: and clipping the significance estimation by using the position information of the target frame body.
Since the original image, the result image of the target detection model, and the estimated image identified by the saliency detection model are all in accordance with the size specification, the target frame position in the target detection result corresponds to the target position in the saliency estimated image, and the target position can be marked in the saliency estimated image directly by using the frame coordinates left, top, right, and bottom obtained finally in step 3.
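For illustration, cropping each detected frame out of the saliency estimation map and saving the single-target patches might be done as follows (file names are hypothetical):

```python
from PIL import Image

def crop_targets(saliency_map_path, boxes, out_prefix="target"):
    """Cut each detected box out of the saliency estimation map and save it as a
    single-target picture for the later recheck step.

    boxes: list of (left, top, right, bottom) tuples taken from the YOLOv4 output.
    """
    sal = Image.open(saliency_map_path)
    crops = []
    for i, (left, top, right, bottom) in enumerate(boxes):
        crop = sal.crop((left, top, right, bottom))
        crop.save(f"{out_prefix}_{i}.png")    # stored locally for the recheck
        crops.append(crop)
    return crops
```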
As shown in fig. 5, the detection results of the present invention include: (a) the input original image used for testing; (b) the result image detected by the YOLOv4 target detection model, where, after the original image is input into the trained YOLOv4 helmet detection model, workers correctly wearing helmets are marked with "hat", workers without helmets are marked with "person", and the frame positions of these targets are saved; (c) the static saliency picture; (d) the dynamic saliency estimation map; and (e) the result of locating, in the saliency detection result (d), the positions corresponding to the coordinate information from (b) and separately cutting out all the targets, giving small pictures of each single target that are stored locally to facilitate the subsequent recheck operation.
Step 6: and (5) rechecking the single small pictures of all the targets.
And step 6, the rechecking operation utilizes the significance estimation image of each target obtained in the step 5, and because highlight white is an effective target and black is a background in the significance estimation, the rechecking operation mainly judges the cut single significance image to judge whether the target exists. If the white proportion is large, the identified target is a valid target, and the recheck is passed; and if the black ratio is large, the identified target is a background, and the identification result is judged to be false detection. Specifically, from the experimental results, for each pixel in a saliency estimation result map, if the gray value thereof is between [0, 10], it is determined as a black background pixel; otherwise, the target pixel is obtained. And after all the pixel operations are completed, calculating the black background pixel ratio in the picture.
As shown in fig. 6, (a) is the original image, (b) is the YOLOv4 detection result image, (c) is the dynamic saliency estimation image, and (d) is the cropping result at the corresponding positions of the 3 targets in the dynamic saliency image. After the black percentage of a single-target saliency crop is obtained, it is compared with a set threshold; in the invention the threshold can be set to 70%. If the black proportion in the saliency estimation map of a single target is more than 70%, the target is judged to be background, the recheck of this target fails, and the target is judged to be a false detection. If the black proportion is less than 70%, the target is judged to be a genuine target that needs to be detected, and it passes the recheck. Therefore, in fig. 6, two of the 3 targets pass the recheck (the worker wearing a helmet and the worker without a helmet), while one fails the recheck (the helmet placed on the table) and is screened out as a false detection, as sketched below.
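A sketch of this recheck rule, using the [0, 10] gray-value band for background pixels and the 70% threshold mentioned above, could look like:

```python
import numpy as np
from PIL import Image

def recheck(crop, black_level=10, black_ratio_threshold=0.70):
    """Return True if the cropped single-target saliency patch passes the recheck.

    Pixels with gray value in [0, black_level] count as background; if more than
    `black_ratio_threshold` of the patch is background, the detection is judged
    to be a false positive.
    """
    gray = np.asarray(crop.convert("L"))
    black_ratio = np.mean(gray <= black_level)
    return black_ratio <= black_ratio_threshold

# usage: keep only detections whose saliency crop passes the recheck
# kept = [box for box, crop in zip(boxes, crops) if recheck(crop)]
```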
Through the re-inspection operation after the target detection, the false detection rate in the target detection can be greatly reduced, and the overall identification accuracy is improved.
3703 pictures and a piece of video data were used in the experiment. The pictures are divided into a training set (3203) and a testing set (450). And training a YOLOv4 target detection model by using the training set data, wherein the target detection accuracy is more than 96% during testing.
For video data, a Yolov4 target detection result is obtained firstly during testing, and then rechecking is carried out through a significance detection model, and experimental results show that the method can effectively detect the target and can effectively screen the false detection target through rechecking.
When the method is used, the video data returned by the monitoring camera can be utilized to carry out target detection through the neural network, and then the original image is input into the saliency detection model and respectively passes through the static saliency estimation model and the dynamic saliency estimation model to obtain the saliency estimation about the target. The identification method solves the problem of influence of complex background in the past target detection and significance detection, and can well identify workers wearing the safety helmet. And finally, cutting the corresponding significance estimation graphs by using the frame positions in the target detection to obtain significance estimation graphs of all single targets, and rechecking the targets by using the graphs to finally achieve the effect of improving the identification accuracy.
According to the method, a target detection model is established by using YOLOv4, then a saliency detection model is used for rechecking a YOLOv4 target detection result, and a Convolutional Neural Network (CNN) is adopted for both models, so that the method has good robustness and accuracy.
Compared with the method of only using YOLOv4 and the like to carry out the target detection task, the method of the invention also uses the significance detection method to obtain the dynamic significance estimation of the image, thereby effectively removing the background influence and obtaining the high significance result of the moving target pixel level. The target detection result is rechecked through the significance detection result, so that the false detection probability can be greatly reduced, interference items to the target detection in the background can be effectively distinguished, and the target detection precision is improved.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the present specification, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A helmet wearing identification method based on YOLOv4 and significance detection is characterized by comprising the following steps:
step 1: labeling an existing data set, and training a Yolov4 target detection model;
marking the data by using marking software to obtain a file recording the size and the label of the target position in each picture, and dividing the data into two parts which are respectively a training set and a test set;
adopting a YOLOv4 network as a target detection model, during training, based on the output of target detection, carrying out iterative computation to minimize the loss function value of the detection model, and obtaining the trained target detection model after reaching the predetermined iteration times;
step 2: downloading and expanding a significance data set, and training a significance detection model;
the training set used in training the significance detection model is from a significance detection data set disclosed in a network, data expansion is carried out on the existing significance detection data set to obtain a usable video training data set, and the significance detection model with higher accuracy is trained;
and step 3: obtaining a recognition result of the target and target frame information by using the trained target detection model;
transmitting a single frame obtained from a picture or a video shot by a camera into the target detection model by using the well-trained Yolov4 target detection model in the step 1 to obtain the output of a target detection result;
and 4, step 4: obtaining the saliency estimation of the image by using the trained saliency detection model;
performing a significance detection task by using the significance model trained in the step 2, inputting each frame of image monitored by the video into a neural network, and then outputting significance mapping by the network; the static saliency model takes a single-frame image as input and generates pixel-level saliency estimation; the input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated;
and 5: clipping the significance estimation by using the position information of the target frame body;
because the original image, the result image of the target detection model and the estimated image identified by the saliency detection model are consistent in size and specification, the target frame position in the target detection result corresponds to the target position in the saliency estimated image, and the target position can be marked in the saliency estimated image by directly utilizing the output finally obtained in the step 3;
step 6: rechecking single small pictures of all targets;
and 5, judging whether the clipped single saliency map has the target or not by using the saliency estimation map of each target obtained in the step 5.
2. The method for identifying wearing of a safety helmet based on YOLOv4 and significance detection according to claim 1, wherein in step 1, the YOLOv4 network mainly comprises a trunk feature extraction network and a reinforced feature extraction network;
the main feature extraction network adopts a CSPDarkNet53 architecture, the input of the main feature extraction network is a 3-channel picture, and in order to ensure the input consistency, the original picture is scaled in an equal ratio; then, in order to ensure that the picture is not distorted, the length-width ratio of the picture is not changed when the short edge is adjusted, and the gray area is expanded up and down or left and right on the short edge; in the trunk feature extraction network, a CSPnet improved residual block is adopted for convolution for many times, and finally three results of feature extraction are input of a subsequent enhanced feature extraction network;
the enhanced feature extraction network comprises an SPP structure and a PANet structure; in the SPP structure, the last feature layer of CSPDarknet53 is convolved three times and then processed with max pooling at four different scales, which greatly increases the receptive field and separates out the most significant contextual features; PANet is an instance segmentation algorithm whose specific structure serves to repeatedly enhance the features;
the three effective feature layers obtained from the backbone feature extraction network and the SPP structure undergo repeated convolution with up-sampling and down-sampling to effectively realize feature extraction, and finally three YOLO Head outputs are obtained.
3. The method for identifying wearing of safety helmets based on YOLOv4 and significance detection according to claim 1, wherein in step 1, the loss function value is calculated by the following three parts:
1) calculating the error between the position of the predicted frame and the position of the real frame by using the CIOU; compared with the IOU, when the CIOU is calculated, the errors caused by different positions and different frame width-height ratios are considered when the intersection area ratio between the two frames is the same, so that the result is more accurate;
2) errors due to target confidence; when a target is correctly detected, the higher the confidence score is, the smaller the error is, otherwise, the larger the error is;
3) errors caused by category recognition; i.e. the comparison of the predicted class result with the actual result.
4. The method for identifying safety-helmet wearing based on YOLOv4 and significance detection according to claim 1, wherein each public saliency detection data set used in step 2 contains thousands of pictures and their corresponding saliency maps for training the static saliency detection network; training the dynamic saliency detection network additionally requires a data set of video frames adjacent to these pictures;
the pixel-level difference information between adjacent frames is expressed by an optical flow field $V = (u, v)$, where $u$ is the vertical component, $v$ is the horizontal component, and $X(x, y)$ denotes the position of a point, so the relationship between adjacent frames $I$ and $I'$ can be expressed by the following formula:
$$I(X) = I'\big(X + V(X)\big)$$
because the principle is the same in the horizontal and vertical directions, take the vertical component u as an example: the pixels in the picture are divided into foreground pixels f and background pixels b and processed separately; in the background pixels b, 10% of the pixels are selected and given random initial motion values in the range [-d, d], where d = h/10 and h is the picture height, so that slightly shifting part of the background simulates the background noise of a real video; in the foreground pixels f, a dominant foreground motion mode m is first assumed, the value of m being the main motion direction and distance of the foreground target between two frames, and the value of each pixel is then drawn randomly from the interval [m - d/10, m + d/10] to create per-pixel motion differences, so that all foreground pixels share the same overall movement trend while the exact movement distance of each pixel differs, consistent with real videos; in this way a new picture is generated in which the foreground object has moved relative to the original picture; after the foreground and background pixels have been processed, they are recombined to generate a multi-frame video data set with a moving target.
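A possible reading of the motion-synthesis procedure in claim 4 is sketched below, assuming a binary foreground mask, displacement along the vertical axis only, and simple index shifting as the warp; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def synthesize_next_frame(img, fg_mask, m, rng=None):
    """Create a synthetic 'next frame' by moving foreground pixels.

    img: HxW(xC) image; fg_mask: HxW boolean foreground mask;
    m: assumed dominant vertical motion (pixels) of the foreground target.
    """
    rng = rng or np.random.default_rng()
    h = img.shape[0]
    d = h // 10

    # per-pixel vertical displacement field
    flow = np.zeros(fg_mask.shape, dtype=np.int64)

    # jitter 10% of background pixels with motion in [-d, d] to mimic noise
    bg_idx = np.flatnonzero(~fg_mask)
    chosen = rng.choice(bg_idx, size=len(bg_idx) // 10, replace=False)
    flow.ravel()[chosen] = rng.integers(-d, d + 1, size=len(chosen))

    # foreground pixels share the dominant motion m with a small per-pixel spread
    fg_idx = np.flatnonzero(fg_mask)
    spread = max(d // 10, 1)
    flow.ravel()[fg_idx] = rng.integers(m - spread, m + spread + 1, size=len(fg_idx))

    # write each displaced source pixel into its new row of the output frame
    out = img.copy()
    ys, xs = np.nonzero(flow != 0)
    new_ys = np.clip(ys + flow[ys, xs], 0, h - 1)
    out[new_ys, xs] = img[ys, xs]
    return out
```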
5. The method for identifying safety-helmet wearing based on YOLOv4 and significance detection according to claim 4, wherein the overall operation of the static saliency network is given by the following formula:
$$Y = D_S\big(F_S(I;\ \theta_F);\ \theta_D\big)$$
where $Y$ is the image output, $I$ is the image input, $F_S$ is the feature output produced by the convolution layers, and $D_S$ is the deconvolution operation, which ensures that the output $Y$ has the same size as the input image $I$; $\theta_F$ and $\theta_D$ denote the parameters of the convolution and deconvolution stages;
in the deconvolution stage, each feature matrix corresponds to a convolution operation of the first half of the network; after expansion it is multiplied by the transposed matrix of the convolution kernel of the corresponding convolution step, yielding a feature map whose size is doubled.
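The composition Y = D_S(F_S(I; θ_F); θ_D) could be realized as a small fully convolutional encoder-decoder such as the sketch below, assuming PyTorch; the layer counts and channel widths are illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class StaticSaliencyNet(nn.Module):
    """Fully convolutional encoder-decoder producing a pixel-level
    saliency map with the same spatial size as the input image."""
    def __init__(self):
        super().__init__()
        # F_S: convolutions that halve the spatial size twice
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # D_S: transposed convolutions that double the size twice
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

img = torch.randn(1, 3, 224, 224)
print(StaticSaliencyNet()(img).shape)  # torch.Size([1, 1, 224, 224])
```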
6. The method according to claim 5, wherein the inputs of the dynamic saliency detection network are the static saliency estimation map, the original frame and its adjacent frame, fed in simultaneously; the three are concatenated along the channel dimension to obtain an input of format (A, B, C), which enters the first convolution layer according to the following formula:
$$Y_1 = W * \big[I_t,\ I_{t+1},\ P_t\big] + b$$
where $W$ is the convolution weight, $b$ is the bias term, $I_t$ and $I_{t+1}$ are two adjacent frames, $P_t$ is the static saliency image temporally corresponding to $I_t$, and $[\cdot]$ denotes channel-wise concatenation; the convolution and deconvolution operations of the remainder of the dynamic saliency network follow the same structure as the static saliency network; by comparing the pixel-level optical flow of two adjacent frames, dynamic highly salient targets can be detected more reliably, which improves the saliency recognition accuracy.
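The channel concatenation feeding the first layer of the dynamic network might look like the sketch below, assuming PyTorch, 3-channel frames and a 1-channel static saliency map (7 input channels in total); these channel counts are assumptions.

```python
import torch
import torch.nn as nn

# two adjacent RGB frames and the static saliency map of the first frame
frame_t    = torch.randn(1, 3, 224, 224)   # I_t
frame_t1   = torch.randn(1, 3, 224, 224)   # I_{t+1}
static_sal = torch.rand(1, 1, 224, 224)    # P_t

# concatenate along the channel dimension, then apply the first convolution
x = torch.cat([frame_t, frame_t1, static_sal], dim=1)   # 3 + 3 + 1 = 7 channels
first_conv = nn.Conv2d(7, 32, kernel_size=3, padding=1)  # W * [I_t, I_{t+1}, P_t] + b
print(first_conv(x).shape)  # torch.Size([1, 32, 224, 224])
```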
7. The method for identifying safety-helmet wearing based on YOLOv4 and significance detection according to claim 1, wherein in step 3 YOLOv4 divides each input image into three output scales, each scale corresponding to three prior boxes, so that the three outputs use nine prior boxes in total for detection;
the output of the first scale has undergone the most convolution operations and the greatest compression, so it is suitable for recognizing and detecting large targets and uses the three largest prior boxes;
the second scale lies in the middle of the three output scales, is suitable for detecting medium-sized targets, and uses the three medium-sized prior boxes;
the third scale is the output with the fewest convolutions, so using the three smallest prior boxes gives a better recognition effect for small targets in the picture;
when locating a recognized target, YOLOv4 uses the following formulas to obtain the box position information:
$$b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w\, e^{t_w}, \qquad b_h = p_h\, e^{t_h}$$
where $(t_x, t_y, t_w, t_h)$ are the prediction outputs of the model, i.e. the network's learning targets; $(c_x, c_y)$ are the coordinate offsets of the grid cell, in units of the cell side length; $(p_w, p_h)$ are the preset side lengths of the anchor box; and $(b_x, b_y, b_w, b_h)$ are the finally obtained center coordinates, width and height of the predicted bounding box.
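The decoding formulas above could be applied per prediction as in the sketch below, assuming the standard YOLO parameterization; sigma is the logistic sigmoid and the function name decode_box is an assumption.

```python
import math

def decode_box(t, cell_offset, anchor):
    """Decode one raw prediction (t_x, t_y, t_w, t_h) into box center,
    width and height, using the cell offset (c_x, c_y) and anchor (p_w, p_h)."""
    tx, ty, tw, th = t
    cx, cy = cell_offset
    pw, ph = anchor
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx      # center x, in grid-cell units
    by = sigmoid(ty) + cy      # center y, in grid-cell units
    bw = pw * math.exp(tw)     # width scaled from the anchor
    bh = ph * math.exp(th)     # height scaled from the anchor
    return bx, by, bw, bh

print(decode_box((0.2, -0.1, 0.3, 0.05), (7, 4), (3.6, 5.2)))
```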
8. The method for identifying safety-helmet wearing based on YOLOv4 and significance detection according to claim 1, wherein in step 4, in order to distinguish the background part of the image from the moving-target part effectively, the model is divided into a static model and a dynamic model; combined, the two capture the spatial and temporal information of the image simultaneously, and the pixel-level saliency map is generated directly by a fully convolutional network;
the static saliency model takes a single-frame image as input and produces a pixel-level saliency estimate; the dynamic saliency model takes as input two adjacent video frames and the static saliency map output by the static saliency model, and produces the final, temporally ordered dynamic saliency result.
9. The method for identifying safety-helmet wearing based on YOLOv4 and significance detection according to claim 1, wherein in step 6 the high-brightness white areas in the saliency estimate are defined as the valid target and the black areas as the background; if the proportion of white is large, the recognized target is a valid target and passes the recheck; if the proportion of black is large, the recognized target is background and the recognition result is judged to be a false detection.
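The recheck rule of claim 9 could be implemented as in the sketch below, assuming the cropped saliency patch is a grayscale array in [0, 1]; the binarization threshold and the 0.5 white-ratio threshold are illustrative assumptions, since the patent does not give numeric values. In practice both thresholds would be tuned on validation data.

```python
import numpy as np

def recheck_target(saliency_patch: np.ndarray,
                   binarize_at: float = 0.5,
                   min_white_ratio: float = 0.5) -> bool:
    """Return True if the cropped saliency patch is dominated by white
    (salient) pixels, i.e. the detection is kept; False flags it as a
    false detection on background."""
    white = saliency_patch >= binarize_at
    return white.mean() >= min_white_ratio

patch = np.random.rand(64, 48)   # stand-in for one cropped saliency map
print("valid target" if recheck_target(patch) else "false detection")
```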
CN202110195098.0A 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection Pending CN112989958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195098.0A CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110195098.0A CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Publications (1)

Publication Number Publication Date
CN112989958A true CN112989958A (en) 2021-06-18

Family

ID=76393783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195098.0A Pending CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Country Status (1)

Country Link
CN (1) CN112989958A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516069A (en) * 2021-07-08 2021-10-19 北京华创智芯科技有限公司 Road mark real-time detection method and device based on size robustness
CN113983737A (en) * 2021-10-18 2022-01-28 海信(山东)冰箱有限公司 Refrigerator and food material positioning method thereof

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8363939B1 (en) * 2006-10-06 2013-01-29 Hrl Laboratories, Llc Visual attention and segmentation system
CN109376676A (en) * 2018-11-01 2019-02-22 哈尔滨工业大学 Highway engineering site operation personnel safety method for early warning based on unmanned aerial vehicle platform
CN109784183A (en) * 2018-12-17 2019-05-21 西北工业大学 Saliency object detection method based on concatenated convolutional network and light stream
CN112149459A (en) * 2019-06-27 2020-12-29 哈尔滨工业大学(深圳) Video salient object detection model and system based on cross attention mechanism
CN111027505A (en) * 2019-12-19 2020-04-17 吉林大学 Hierarchical multi-target tracking method based on significance detection
CN111292305A (en) * 2020-01-22 2020-06-16 重庆大学 Improved YOLO-V3 metal processing surface defect detection method
AU2020100371A4 (en) * 2020-03-12 2020-04-16 Jilin University Hierarchical multi-object tracking method based on saliency detection
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯毅苇; 李林汉; 王彦: "Research on intelligent equipment target recognition with an improved YOLO network guided by infrared salient targets", Infrared Technology (红外技术), no. 07 *
魏龙生; 罗大鹏: "Salient object detection in remote sensing images based on visual attention mechanism", Computer Engineering and Applications (计算机工程与应用), vol. 50, no. 19 *

Similar Documents

Publication Publication Date Title
US8577151B2 (en) Method, apparatus, and program for detecting object
US8706663B2 (en) Detection of people in real world videos and images
CN103020992B (en) A kind of video image conspicuousness detection method based on motion color-associations
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
US20090245575A1 (en) Method, apparatus, and program storage medium for detecting object
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN109506628A (en) Object distance measuring method under a kind of truck environment based on deep learning
CN112115775A (en) Smoking behavior detection method based on computer vision in monitoring scene
CN112381061B (en) Facial expression recognition method and system
CN112989958A (en) Helmet wearing identification method based on YOLOv4 and significance detection
CN112001241A (en) Micro-expression identification method and system based on channel attention mechanism
US20090245576A1 (en) Method, apparatus, and program storage medium for detecting object
CN116259002A (en) Human body dangerous behavior analysis method based on video
CN112560584A (en) Face detection method and device, storage medium and terminal
CN114155278A (en) Target tracking and related model training method, related device, equipment and medium
CN112766145A (en) Method and device for identifying dynamic facial expressions of artificial neural network
CN115797970B (en) Dense pedestrian target detection method and system based on YOLOv5 model
CN115995097A (en) Deep learning-based safety helmet wearing standard judging method
CN111881732B (en) SVM (support vector machine) -based face quality evaluation method
CN111539434B (en) Infrared weak and small target detection method based on similarity
CN113034420B (en) Industrial product surface defect segmentation method and system based on frequency space domain characteristics
GB2467643A (en) Improved detection of people in real world videos and images.
CN114783000B (en) Method and device for detecting dressing standard of worker in bright kitchen range scene
CN116665015B (en) Method for detecting dim and small targets in infrared sequence image based on YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination