CN112070043A - Safety helmet wearing convolutional network based on feature fusion, training and detecting method - Google Patents

Safety helmet wearing convolutional network based on feature fusion, training and detecting method

Info

Publication number
CN112070043A
Authority
CN
China
Prior art keywords
conv
feature
sampling
module
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010966231.3A
Other languages
Chinese (zh)
Other versions
CN112070043B (en)
Inventor
周敏新
张方舟
王学宇
任鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changshu Institute of Technology filed Critical Changshu Institute of Technology
Priority to CN202010966231.3A priority Critical patent/CN112070043B/en
Publication of CN112070043A publication Critical patent/CN112070043A/en
Application granted granted Critical
Publication of CN112070043B publication Critical patent/CN112070043B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a feature-fusion-based convolutional network for safety helmet wearing detection, together with training and detection methods. Three modules are introduced into CenterNet in sequence. The feature pyramid module adopts a top-down process: Conv-5 is first up-sampled by a factor of n, and Conv-4 has its channel number changed by an m × m convolution kernel and is fused with the up-sampled Conv-5 feature layer; the operation on Conv-4 and Conv-3 is similar, i.e. n-fold up-sampling followed by fusion with the next layer after an m × m convolution. The global guide module comprises a pyramid pooling module and a global guide flow module. The feature integration module down-samples the fused features by factors of n, 2n and 4n, applies average pooling, up-samples by the corresponding factors and integrates the results, and then applies a convolution with a 3m × 3m kernel. The invention greatly improves detection performance, most obviously for the helmet wearing of workers that occupy a small scale in the image; the detection speed reaches 21 fps, which basically meets the real-time requirement.

Description

Safety helmet wearing convolutional network based on feature fusion, training and detecting method
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a safety helmet wearing detection method based on feature fusion, CenterNet-Feature-Fusion (CenterNetFF).
Background
Statistics show that 13 major construction accidents occurred nationwide in May of this year, causing 51 deaths, increases of 18.2% and 34.2% respectively over the same period last year. The safety helmet is an important tool for protecting a worker's head and is of great significance for life safety. However, many workers lack safety awareness and fail to wear helmets, so automatically detecting whether workers are wearing safety helmets is a matter of real importance.
According to investigation, most surveillance cameras on construction sites are installed at high positions, so workers occupy only a small proportion of the captured images and their features are difficult to identify; the problem of detecting helmet wearing for small-scale workers therefore urgently needs to be solved.
Sensor-based approaches focus on location and tracking technologies such as Radio Frequency Identification (RFID) and Wireless Local Area Networks (WLANs). Dong et al. developed a real-time location system (RTLS) and virtual configuration for worker location tracking: a pressure sensor is placed in the helmet and pressure information is transmitted via Bluetooth to determine whether the worker is wearing the helmet. Zhang et al. developed an intelligent helmet system using an Internet-of-Things-based architecture; to determine whether the helmet is in use, an infrared beam detector and a thermal infrared sensor are placed inside it.
Object detection must not only identify the class of an object but also predict its location in the image, typically marked with a bounding box. Traditional object detection methods generally use a sliding-window framework and mainly comprise three steps:
(1) slide windows of different scales over the image and select certain regions as candidate areas;
(2) extract visual features of the candidate regions, such as HOG features, Haar features, etc.;
(3) classify with a classifier, such as an SVM.
With the development of deep learning, object detection algorithms based on deep learning have become mainstream. Unlike traditional methods, the features are learned automatically by a multi-layer convolutional neural network rather than designed by hand. Current deep-learning-based object detection algorithms fall roughly into two types: two-stage methods and one-stage methods. The difference is that two-stage methods first extract candidate region proposals and then classify them, as in the R-CNN series, FPN and the like, whereas one-stage methods need no candidate proposals and treat target localization directly as a regression problem, as in the YOLO series, SSD and the like.
Compared with sensor-based methods, vision-based methods have attracted more and more attention; the spread of cameras and advances in computer vision and pattern recognition have laid a solid foundation for vision-based helmet wearing detection, which is a small application within object detection. Traditional methods based on hand-crafted features include the following. Liu Xiaohui et al. used skin color detection to locate the face region, extracted Hu moment features of that region, and classified them with an SVM. Park et al. extracted HOG features of the human body and of the safety helmet separately and then matched them according to their spatial relationship; when the person is not standing, this method detects poorly. Li et al. proposed a color-based hybrid descriptor to extract features of helmets of different colors, and then used a hierarchical support vector machine to classify objects into four categories (red, yellow, blue and no helmet). With the development of deep-learning-based object detection, Fang et al. used the Faster R-CNN algorithm for helmet detection, first selecting candidate region boxes with an RPN and then making a prediction for each candidate region. Shuai et al. used the YOLOv3 algorithm to detect safety helmets, which needs no pre-selected candidate boxes and treats target localization directly as a regression problem.
The invention introduces a novel feature fusion method on the basis of the advanced CenterNet object detection algorithm to address helmet wearing detection for small-scale workers on construction sites. As described above, existing helmet wearing detection methods are still at an early stage: sensor-based methods cannot detect once the signal range is exceeded, and the devices must be charged regularly and cannot run for long periods; methods based on hand-crafted features generalize poorly, have low universality and are limited to specific scenes. Deep-learning-based object detection methods were applied to helmet wearing detection only later, and the algorithms used so far still have the following problems: (1) they adopt an anchor-box mechanism, in which designing the sizes and aspect ratios of the anchor boxes is troublesome and a large number of redundant boxes causes severe imbalance between positive and negative samples; (2) under complex working conditions workers are scattered at varying distances from the camera, so distant workers occupy only a small proportion of the image and their features are difficult to extract.
The original CenterNet structure is shown in Fig. 1: a 512 × 512 picture is input, the features of the image are extracted by a backbone network, and, to obtain a high-resolution feature map, up-sampling by bilinear interpolation produces a 128 × 128 feature map. The prediction part uses three branches, generating the keypoint heatmap, the bounding-box size prediction (W_i, H_i) and the keypoint offset prediction (ΔX_i, ΔY_i), each using 3 × 3 and 1 × 1 convolutions. From the predicted center-point coordinates (X_i, Y_i), i.e. the peak points in the keypoint heatmap, together with the bounding-box size and the offset values, the target can be located at (X_i + ΔX_i - W_i/2, Y_i + ΔY_i - H_i/2, X_i + ΔX_i + W_i/2, Y_i + ΔY_i + H_i/2). It can be seen that although CenterNet up-samples the last feature layer to obtain a higher-resolution output feature map, it does not incorporate the details of the shallow features; CenterNet is therefore less effective at detecting the helmet wearing of small-scale workers.
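For illustration only, the three-branch prediction part described above could be written as the following minimal PyTorch sketch (PyTorch is the framework used in the embodiment below); the channel counts, the class number and the ReLU placement are assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn


class CenterNetHead(nn.Module):
    """Three prediction branches: keypoint heatmap, box size (W, H) and center offset (dX, dY)."""

    def __init__(self, in_channels=64, num_classes=2, head_channels=64):
        super().__init__()

        def branch(out_channels):
            # each branch: a 3 x 3 convolution followed by a 1 x 1 convolution
            return nn.Sequential(
                nn.Conv2d(in_channels, head_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(head_channels, out_channels, 1),
            )

        self.heatmap = branch(num_classes)  # one heatmap channel per class (e.g. hat / person)
        self.size = branch(2)               # (W_i, H_i)
        self.offset = branch(2)             # (dX_i, dY_i)

    def forward(self, feat):
        # feat: the 128 x 128 high-resolution feature map
        return torch.sigmoid(self.heatmap(feat)), self.size(feat), self.offset(feat)
```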
Disclosure of Invention
1. The invention aims to provide a novel safety helmet wearing detection method.
The invention provides a feature-fusion-based safety helmet wearing detection convolutional network, together with training and detection methods, a storage medium and a device, aiming at the problem in the prior art that as the number of network layers increases semantic information becomes rich but position information is lost, while shallow features are rich in position information but lack semantic information.
2. The technical scheme adopted by the invention is as follows.
The invention discloses a method for generating a convolutional network for safety helmet wearing detection based on feature fusion, in which three modules, namely a feature pyramid module, a global guide module and a feature integration module, are introduced into CenterNet in sequence:
the feature pyramid module adopts a top-down process: Conv-5 is first up-sampled by a factor of n, and Conv-4 has its channel number changed by an m × m convolution kernel and is fused with the up-sampled Conv-5 feature layer; the operation on Conv-4 and Conv-3 is similar, i.e. n-fold up-sampling followed by fusion with the next layer after an m × m convolution;
the global guide module comprises a pyramid pooling module and a global guide flow module; the global guide flow module adds pyramid pooling features up-sampled by factors of n, 2n and 4n respectively at each lateral connection of the feature pyramid during the top-down process;
the feature integration module down-samples the fused features by factors of n, 2n and 4n, applies average pooling, up-samples by the corresponding factors and integrates the results, and then applies a convolution with a 3m × 3m kernel.
Preferably, in step 1, a top-down process is adopted: starting from Conv-5, Conv-5 is first up-sampled by a factor of 2, and Conv-4 has its channel number changed by a 1 × 1 convolution kernel and is fused with the up-sampled Conv-5 feature layer; Conv-4 and Conv-3 are treated similarly, i.e. 2-fold up-sampling followed by fusion with the next feature layer after a 1 × 1 convolution, giving the finally fused features;
step 2, global guide module
Step 2.1, capturing global information
Average pooling is applied to Conv-5, the last layer of the CenterNet feature extraction network, to generate pooled features at the scales 1 × 1, 2 × 2, 3 × 3 and 6 × 6; a 1 × 1 convolution changes the channel number of the pooled features to 1/4 of the original; the features are then up-sampled back to the original feature-layer size by bilinear interpolation and finally merged with the original features to obtain the pyramid-pooled features, aggregating context information from different regions and thereby capturing global information;
Step 2.2, global guide flow module
Pyramid pooling features up-sampled by factors of 2, 4 and 8 are added respectively at each lateral connection in the top-down process of the feature pyramid;
Step 3, the fused features are first down-sampled by factors of 2, 4 and 8, then average-pooled, then up-sampled by the corresponding factors and integrated together, and finally convolved with a 3 × 3 kernel; this feature integration module is introduced into CenterNet.
Preferably, a feature integration module is added on the basis of step 2: feature integration is first applied to Conv-5, which is then fused with Conv-4, and feature integration is applied to the fused features; the third and second layers are treated similarly; the fused features of each layer are integrated to obtain the final fused features. Three branches then predict the keypoints, the center-point offsets and the target sizes respectively.
The invention further provides a training method for the safety helmet wearing detection convolutional network, comprising a first stage of forward propagation and a second stage of backward propagation:
(1) The weights obtained by training the original CenterNet on the COCO dataset are used as the initial weights of the network.
(2) The input image is propagated forward through the improved CenterNet network to obtain the keypoint heatmap, the center-point offsets and the target sizes.
(3) The error between the predicted values and the target values is calculated; the error function is divided into three parts:
$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases} \quad (\text{I.1})$$

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k}-s_k\right| \quad (\text{I.2})$$

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right| \quad (\text{I.3})$$

$$L = L_k + \lambda_1 L_{size} + \lambda_2 L_{off} \quad (\text{I.4})$$

wherein formula I.1 is the keypoint classification loss, for which a focal loss is adopted to handle the imbalance between positive and negative samples during training; $\hat{Y}_{xyc}=1$ denotes a detected center point and $\hat{Y}_{xyc}=0$ denotes background; the ground-truth keypoints are spread onto the heatmap $Y\in[0,1]^{\frac{W}{R}\times\frac{H}{R}\times C}$ by a Gaussian kernel; $N$ is the number of targets in the image, and $\alpha$ and $\beta$ are hyper-parameters of the loss, set to 2 and 4 respectively. Formula I.2 is the target-size loss, using the L1 loss: let $(x_1^{(k)},y_1^{(k)},x_2^{(k)},y_2^{(k)})$ be the coordinates of the $k$-th target bounding box; then $p_k=\bigl(\tfrac{x_1^{(k)}+x_2^{(k)}}{2},\tfrac{y_1^{(k)}+y_2^{(k)}}{2}\bigr)$ is the center point of the $k$-th target, $s_k=(x_2^{(k)}-x_1^{(k)},\,y_2^{(k)}-y_1^{(k)})$ is the size of the $k$-th target, and $\hat{S}_{p_k}$ is the predicted target size. Formula I.3 is the center-point offset loss: $p$ is the position of the center point in the input image, $\tilde{p}=\lfloor p/R\rfloor$ is the position of $p$ after $R$-fold down-sampling, the rounding after down-sampling introduces an offset, and $\hat{O}_{\tilde{p}}$ is the predicted center-point offset. Formula I.4 is the total loss, where $\lambda_1$ and $\lambda_2$ are the weights of the different loss terms.
The network weights are continuously adjusted by gradient descent; a learning rate of 1.25e-4 is used to adjust the weights, the number of iterations is 200, the learning-rate decay steps are 90 and 120, and the learning-rate decay factor is 0.1.
The invention further provides a safety helmet wearing detection method: the keypoint heatmap is obtained by prediction with the above convolutional network, every point is compared with its 8 adjacent points, and a point is retained if its response value is greater than or equal to those of its neighbours, so that finally all peak points satisfying this condition are kept. Let $\hat{\mathcal{P}}_c=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{n}$ be the set of peak points; the final target bounding box is $(\hat{x}_i+\delta\hat{x}_i-\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i-\hat{h}_i/2,\ \hat{x}_i+\delta\hat{x}_i+\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i+\hat{h}_i/2)$, where $(\delta\hat{x}_i,\delta\hat{y}_i)$ is the predicted center-point offset and $(\hat{w}_i,\hat{h}_i)$ is the predicted target size. The predicted coordinate values are drawn in the image in the form of boxes and the corresponding categories are displayed.
The invention further provides a storage medium storing a program for the feature-fusion-based safety helmet wearing detection method.
The invention provides a storage device, comprising:
a memory;
one or more processors, and
one or more programs stored in the memory and configured to be executed by the one or more processors, the programs, when executed by the processors, implementing the above feature-fusion-based safety helmet wearing detection method.
3. Technical effects produced by the invention.
1) The invention adopts a feature pyramid module with a U-shaped structure, fusing multiple feature layers and improving the sensitivity to small targets.
2) The invention introduces a global guide module and a feature integration module on the basis of the feature pyramid, further sharpening the details of salient targets.
3) The invention introduces global guide flows and a feature integration module to gradually refine deep semantic information; the feature integration module reduces the aliasing effect caused by high-factor up-sampling and observes local information at different spatial positions in different scale spaces, further enlarging the receptive field of the whole network.
4) The publicly available Safety-Helmet-Wearing-Dataset (SHWD) is used to train and test CenterNetFF. The results show that, compared with CenterNet, the detection performance is greatly improved, most obviously for the helmet wearing of workers that occupy a small scale in the image; the detection speed reaches 21 fps, basically meeting the real-time requirement.
Drawings
Figure 1 is a diagram of the architecture of the CenterNet network.
Fig. 2 is a structural diagram of a feature pyramid module.
FIG. 3 is a block diagram of the feature pyramid combined with the global guide module.
FIG. 4 is a pyramid pooling module structure.
FIG. 5 is a feature integration module architecture.
Fig. 6 is a diagram of the CenterNetFF network architecture.
FIG. 7 is a comparison of the results of CenterNetFF and CenterNet on the SHWD dataset.
FIG. 8 is a comparison of the results of CenterNetFF and other deep learning object detection algorithms on the SHWD data set.
Fig. 9 is a sample of a partial data set.
Fig. 10 shows the detection results of CenterNetFF on the helmet wearing of workers on a construction site.
Figure 11 is a comparison of the results of CenterNetFF and CenterNet in detecting the helmet wearing of workers on a construction site.
Detailed Description
Aiming at the problems that workers are scattered under complex working conditions at varying distances from the camera, so that distant workers occupy only a small proportion of the image and their features are difficult to extract, the invention provides CenterNetFF, a method based on a novel feature fusion, for detecting the helmet wearing of workers on construction sites. It comprises the following steps:
Step 1: as shown in Fig. 6, the sizes of the Conv_2 to Conv_5 feature maps gradually decrease as the number of network layers increases; their resolution becomes lower and their semantic information richer, but position information is lost. Shallow features have high resolution, a receptive field matched to small target sizes and rich position information, but lack semantic information. The CenterNet shown in Fig. 1 predicts only from the last feature layer, ignores the detail information of shallow features, is therefore insensitive to small targets, and detects workers far from the camera on a construction site poorly. The invention adopts a top-down process, as shown in Fig. 2, where F represents the feature fusion process, i.e. the F module in Fig. 6: starting from Conv-5, Conv-5 is first up-sampled by a factor of 2, and Conv-4 has its channel number changed by a 1 × 1 convolution kernel and is fused with the up-sampled Conv-5 feature layer. Conv-4 and Conv-3 are treated similarly, i.e. 2-fold up-sampling followed by fusion with the next feature layer after a 1 × 1 convolution, giving the final fused feature P2.
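As a minimal sketch of the top-down fusion just described (not the authoritative implementation), assuming ResNet-style channel counts for Conv-3 to Conv-5 and element-wise addition as the fusion operation:

```python
import torch.nn as nn
import torch.nn.functional as F


class TopDownFusion(nn.Module):
    """Top-down feature pyramid: up-sample the deeper layer by 2, change the
    shallower layer's channel number with a 1 x 1 convolution, then fuse."""

    def __init__(self, channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1 x 1 convolutions for Conv-3, Conv-4 and Conv-5 respectively
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in channels)

    def forward(self, conv3, conv4, conv5):
        p5 = self.lateral[2](conv5)
        # Conv-5 up-sampled 2x and fused with the channel-reduced Conv-4
        p4 = self.lateral[1](conv4) + F.interpolate(p5, scale_factor=2, mode='bilinear')
        # the same operation between that result and Conv-3 gives the final fused feature
        p2 = self.lateral[0](conv3) + F.interpolate(p4, scale_factor=2, mode='bilinear')
        return p2, p4, p5
```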
Step 2: the feature pyramid is a typical structure for fusing multiple feature layers, but it has the drawback that deep semantic information is gradually diluted during the top-down process. Previous research has shown that the actual receptive field of a convolutional neural network is smaller than its theoretical receptive field, especially for deep features, so the receptive field of the whole network is not large enough to capture global information of the input image, and a salient target is easily swallowed by the background; that is, a worker on a construction site is easily swallowed by the background of buildings and the like, causing missed detections. The invention introduces a Global Guide Module (GGM) on top of the feature pyramid; the GGM comprises two parts, a Pyramid Pooling Module (PPM) and Global Guide Flows (GGFs). First, average pooling is applied to Conv-5, the last layer of the CenterNet feature extraction network, generating pooled features at the scales 1 × 1, 2 × 2, 3 × 3 and 6 × 6; a 1 × 1 convolution changes the channel number of the pooled features to 1/4 of the original; they are then up-sampled back to the original feature-layer size by bilinear interpolation and finally merged with the original features, giving the pyramid-pooled features and aggregating context information from different regions, thereby capturing global information. Second, as shown by the dashed box in Fig. 3, the pyramid pooling features up-sampled by factors of 2, 4 and 8 are added at each lateral connection in the top-down process of the feature pyramid, so that the semantic information is not diluted.
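A hedged PyTorch sketch of the pyramid pooling module and the global guide flows described above; the projection to 256 channels, the ordering of the pyramid levels and the matching channel widths are assumptions made only to keep the example self-contained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPooling(nn.Module):
    """Pyramid pooling module: average-pool Conv-5 to 1x1, 2x2, 3x3 and 6x6 grids,
    reduce each to 1/4 of the channels with a 1x1 conv, up-sample back by bilinear
    interpolation and merge with the original feature."""

    def __init__(self, in_channels=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_channels, in_channels // 4, 1))
            for b in bins)
        # reduce the concatenated feature so it can be added onto the pyramid levels
        self.project = nn.Conv2d(in_channels * 2, 256, 1)

    def forward(self, conv5):
        h, w = conv5.shape[2:]
        pooled = [F.interpolate(s(conv5), size=(h, w), mode='bilinear') for s in self.stages]
        return self.project(torch.cat([conv5] + pooled, dim=1))


def add_global_guidance(pyramid_feats, global_feat):
    """Global guide flows: add the pyramid-pooled feature, up-sampled 2x, 4x and 8x,
    at each lateral connection of the feature pyramid (finest-to-coarsest ordering
    of pyramid_feats is an assumption of this sketch)."""
    return [p + F.interpolate(global_feat, scale_factor=2 ** (i + 1), mode='bilinear')
            for i, p in enumerate(pyramid_feats)]
```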
Step 3: the global guide module in the previous step adds global guidance information to every feature layer in the top-down process, but this also brings a problem. The traditional feature pyramid module applies a 3 × 3 convolution after each 2-fold up-sampling and fusion to eliminate the aliasing effect caused by up-sampling, whereas the global guide module requires up-sampling by factors of 4 or 8, so the differences between the GGFs and feature layers of different scales must be handled efficiently. The invention adopts a feature integration module, whose structure is shown in Fig. 5: the fused features are first down-sampled by factors of 2, 4 and 8, then average-pooled, then up-sampled by the corresponding factors and integrated together, and finally convolved with a 3 × 3 kernel. The feature integration module is introduced into CenterNet as shown in Fig. 6: on the basis of the second step, a feature integration module A is added; feature integration is first applied to Conv-5, which is then fused with Conv-4, and feature integration is applied to the fused features; the third and second layers are treated similarly, and integrating the fused features of each layer yields the final fused feature F2. The feature integration module reduces the aliasing effect caused by high-factor up-sampling and observes local information at different spatial positions in different scale spaces, further enlarging the receptive field of the whole network. Finally, as in the original CenterNet, three branches predict the keypoints, the center-point offsets and the target sizes respectively.
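A minimal sketch of the feature integration module, under the assumption that average pooling performs the 2-, 4- and 8-fold down-sampling and that the resampled copies are merged with the original feature by summation:

```python
import torch.nn as nn
import torch.nn.functional as F


class FeatureIntegration(nn.Module):
    """Feature integration module: down-sample the fused feature by 2x, 4x and 8x
    with average pooling, up-sample each copy back to the original size, merge them,
    and smooth with a 3x3 convolution to reduce up-sampling aliasing."""

    def __init__(self, channels=256, factors=(2, 4, 8)):
        super().__init__()
        self.factors = factors
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        merged = x
        for f in self.factors:
            y = F.avg_pool2d(x, kernel_size=f, stride=f)                       # down-sample by f
            merged = merged + F.interpolate(y, size=(h, w), mode='bilinear')   # back up by f
        return self.smooth(merged)
```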
The experimental data set of the invention is SHWD, with the categories hat and person; it contains 7581 images in total, with 9044 positive samples wearing helmets and 11151 negative samples not wearing helmets. The invention divides the data set into 4548 training images, 1516 validation images and 1517 test images. Samples from the data set are shown in Fig. 9.
Training process
Experimental hardware environment: Ubuntu 16.04, a Tesla P100 GPU with 16 GB of memory. Software environment: deep learning framework PyTorch 0.4.1, Python 3.6, CUDA 8.0, cuDNN 5.1.
The training process is divided into two phases: the first stage is the forward propagation stage and the second stage is the backward propagation stage. The specific process comprises the following steps:
(1) The invention uses the weights obtained by training the original CenterNet on the COCO dataset as the initial weights of the network.
(2) Images of size 512 × 512 are input, 16 images per batch, and forward propagation through the improved CenterNet network yields the keypoint heatmap, the center-point offsets and the target sizes.
(3) An error between the predicted value and the target value is calculated. The error function is divided into three parts:
$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases} \quad (\text{I.1})$$

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k}-s_k\right| \quad (\text{I.2})$$

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right| \quad (\text{I.3})$$

$$L = L_k + \lambda_1 L_{size} + \lambda_2 L_{off} \quad (\text{I.4})$$

wherein formula I.1 is the keypoint classification loss, for which a focal loss is adopted to handle the imbalance between positive and negative samples during training; $\hat{Y}_{xyc}=1$ denotes a detected center point and $\hat{Y}_{xyc}=0$ denotes background; the ground-truth keypoints are spread onto the heatmap $Y\in[0,1]^{\frac{W}{R}\times\frac{H}{R}\times C}$ by a Gaussian kernel; $N$ is the number of targets in the image, and $\alpha$ and $\beta$ are hyper-parameters of the loss, set to 2 and 4 respectively. Formula I.2 is the target-size loss, using the L1 loss: let $(x_1^{(k)},y_1^{(k)},x_2^{(k)},y_2^{(k)})$ be the coordinates of the $k$-th target bounding box; then $p_k=\bigl(\tfrac{x_1^{(k)}+x_2^{(k)}}{2},\tfrac{y_1^{(k)}+y_2^{(k)}}{2}\bigr)$ is the center point of the $k$-th target, $s_k=(x_2^{(k)}-x_1^{(k)},\,y_2^{(k)}-y_1^{(k)})$ is the size of the $k$-th target, and $\hat{S}_{p_k}$ is the predicted target size. Formula I.3 is the center-point offset loss: $p$ is the position of the center point in the input image, $\tilde{p}=\lfloor p/R\rfloor$ is the position of $p$ after $R$-fold down-sampling, the rounding after down-sampling introduces an offset, and $\hat{O}_{\tilde{p}}$ is the predicted center-point offset. Formula I.4 is the total loss; to assign specific weights to the different loss terms, $\lambda_1=0.1$ and $\lambda_2=1$.
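For illustration, the three loss terms I.1 to I.3 and the total loss I.4 could be computed roughly as follows in PyTorch; the tensor layouts and the validity mask are assumptions, not details given in the patent:

```python
import torch


def centernet_losses(pred_hm, gt_hm, pred_size, gt_size, pred_off, gt_off, mask,
                     alpha=2, beta=4, lambda1=0.1, lambda2=1.0):
    """Focal keypoint loss (I.1), L1 size loss (I.2), L1 offset loss (I.3), total (I.4).
    pred_hm/gt_hm: (B, C, H, W); size/offset tensors: (B, N, 2); mask: (B, N) valid targets."""
    pred_hm = pred_hm.clamp(1e-4, 1 - 1e-4)
    pos = gt_hm.eq(1).float()                      # ground-truth center points
    neg = 1.0 - pos                                # Gaussian-weighted background
    num_targets = pos.sum().clamp(min=1)           # N, the number of targets

    pos_loss = ((1 - pred_hm) ** alpha) * torch.log(pred_hm) * pos
    neg_loss = ((1 - gt_hm) ** beta) * (pred_hm ** alpha) * torch.log(1 - pred_hm) * neg
    l_k = -(pos_loss.sum() + neg_loss.sum()) / num_targets

    m = mask.unsqueeze(-1).float()
    l_size = (torch.abs(pred_size - gt_size) * m).sum() / num_targets
    l_off = (torch.abs(pred_off - gt_off) * m).sum() / num_targets

    return l_k + lambda1 * l_size + lambda2 * l_off
```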
(4) The network weights are continuously adjusted by gradient descent to minimize the error. The invention uses a learning rate of 1.25e-4 to adjust the weights, with 200 iterations, learning-rate decay steps of 90 and 120, and a learning-rate decay factor of 0.1.
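The optimizer itself is not named here; assuming Adam, as used by the original CenterNet implementation, the schedule above could be set up as in this sketch (model, train_loader and the loss function from the previous sketch are assumed to exist):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1.25e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)

for epoch in range(200):                           # 200 iterations (epochs) in total
    for images, targets in train_loader:           # 512x512 images, batch size 16
        pred_hm, pred_size, pred_off = model(images)
        loss = centernet_losses(pred_hm, targets['hm'], pred_size, targets['size'],
                                pred_off, targets['off'], targets['mask'])
        optimizer.zero_grad()
        loss.backward()                             # back-propagation stage
        optimizer.step()                            # weight update
    scheduler.step()                                # decay lr by 0.1 at epochs 90 and 120
```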
Detection of
A keypoint heatmap is obtained by network prediction; every point is compared with its 8 adjacent points and retained if its response value is greater than or equal to those 8 values, and finally the top 100 peak points satisfying this condition are kept. Let $\hat{\mathcal{P}}_c=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{n}$ be the set of peak points. The final target bounding box is $(\hat{x}_i+\delta\hat{x}_i-\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i-\hat{h}_i/2,\ \hat{x}_i+\delta\hat{x}_i+\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i+\hat{h}_i/2)$, where $(\delta\hat{x}_i,\delta\hat{y}_i)$ is the predicted center-point offset and $(\hat{w}_i,\hat{h}_i)$ is the predicted target size. The predicted coordinate values are drawn in the image in the form of boxes and the corresponding categories are displayed.
In order to verify the effectiveness of the feature pyramid, the global guide module and the feature integration module provided by the invention, the three modules are introduced into CenterNet in sequence. Lines 2, 6 and 10 of Fig. 7 show the detection results after the feature pyramid is added under different backbone networks: the average detection precision AP is improved by 0.4%, 1.4% and 1.8% respectively, and the average recall AR by 2.4%, 4.3% and 0.7% respectively. The detection of small-scale targets improves most markedly, with AP_small rising by 4.0%, 3.6% and 1.0% and AR_small by 4.9%, 4.9% and 2.0% respectively, showing that the introduction of the feature pyramid clearly improves the sensitivity to small targets. Lines 3, 7 and 11 of Fig. 7 give the results after the global guide module is introduced; the average detection precision AP and the average recall AR improve further, with AP rising by 0.4%, 0.7% and 0.8% and AR by 0.5%, 0.4% and 1.0% respectively. Finally, the CenterNetFF algorithm of the invention reaches an AP of 87.8% and an AR of 43.1%; for small-scale targets, AP_small reaches 33.2% and AR_small 43.1%, improvements of 2.0% and 2.7% respectively over the original CenterNet.
The CenterNetFF algorithm is compared with the currently advanced object detection algorithms Faster R-CNN, YOLOv3 and SSD under the same experimental environment; the results are shown in Fig. 8. Faster R-CNN achieves almost the same average precision as the method of this invention, reaching 87.02%, but its detection speed is only 4.92 fps, far from real-time. YOLOv3 and SSD reach real-time detection speeds, but their average precision is lower and cannot meet the accuracy requirement of helmet wearing detection under complex working conditions. The CenterNetFF of the invention reaches an average precision of 87.8% and a detection speed of 21 fps, satisfying the real-time requirement.
Fig. 10 shows detection results of the improved CenterNet algorithm on the SHWD data set, covering both distant and nearby workers. Fig. 11 compares the detection effect on the SHWD data set before and after the model improvement: Fig. 11a shows the effect before the improvement and Fig. 11b the effect after the improvement, and it can be seen that CenterNetFF solves the detection problem for small-scale workers.

Claims (8)

1. A method for generating a convolutional network for safety helmet wearing detection based on feature fusion, characterized in that three modules are introduced into CenterNet in sequence:
the feature pyramid module adopts a top-down process: Conv-5 is first up-sampled by a factor of n, and Conv-4 has its channel number changed by an m × m convolution kernel and is fused with the up-sampled Conv-5 feature layer; the operation on Conv-4 and Conv-3 is similar, i.e. n-fold up-sampling followed by fusion with the next layer after an m × m convolution;
the global guide module comprises a pyramid pooling module and a global guide flow module; the global guide flow module adds pyramid pooling features up-sampled by factors of n, 2n and 4n respectively at each lateral connection of the feature pyramid during the top-down process;
the feature integration module down-samples the fused features by factors of n, 2n and 4n, applies average pooling, up-samples by the corresponding factors and integrates the results, and then applies a convolution with a 3m × 3m kernel.
2. The feature fusion based helmet wearing detection convolutional network generating method of claim 1,
in step 1, a top-down process is adopted: starting from Conv-5, Conv-5 is first up-sampled by a factor of 2, and Conv-4 has its channel number changed by a 1 × 1 convolution kernel and is fused with the up-sampled Conv-5 feature layer; Conv-4 and Conv-3 are treated similarly, i.e. 2-fold up-sampling followed by fusion with the next feature layer after a 1 × 1 convolution, giving the finally fused features;
step 2, global guide module
Step 2.1, capturing global information
Average pooling is applied to Conv-5, the last layer of the CenterNet feature extraction network, to generate pooled features at the scales 1 × 1, 2 × 2, 3 × 3 and 6 × 6; a 1 × 1 convolution changes the channel number of the pooled features to 1/4 of the original; the features are then up-sampled back to the original feature-layer size by bilinear interpolation and finally merged with the original features to obtain the pyramid-pooled features, aggregating context information from different regions and thereby capturing global information;
Step 2.2, global guide flow module
Pyramid pooling features up-sampled by factors of 2, 4 and 8 are added respectively at each lateral connection in the top-down process of the feature pyramid;
Step 3, the fused features are first down-sampled by factors of 2, 4 and 8, then average-pooled, then up-sampled by the corresponding factors and integrated together, and finally convolved with a 3 × 3 kernel; this feature integration module is introduced into CenterNet.
3. The feature fusion based helmet wearing detection convolutional network generating method of claim 2,
a feature integration module is added on the basis of step 2: feature integration is first applied to Conv-5, which is then fused with Conv-4, and feature integration is applied to the fused features; the third and second layers are treated similarly; the fused features of each layer are integrated to obtain the final fused features; three branches then predict the keypoints, the center-point offsets and the target sizes respectively.
4. A training method for the feature-fusion-based safety helmet wearing detection convolutional network of claim 3, characterized by comprising a first stage of forward propagation and a second stage of backward propagation:
(1) the weights obtained by training the original CenterNet on the COCO dataset are used as the initial weights of the network;
(2) the input image is propagated forward through the improved CenterNet network to obtain the keypoint heatmap, the center-point offsets and the target sizes;
(3) the error between the predicted values and the target values is calculated; the error function is divided into three parts:
$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases} \quad (\text{I.1})$$

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k}-s_k\right| \quad (\text{I.2})$$

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}}-\left(\frac{p}{R}-\tilde{p}\right)\right| \quad (\text{I.3})$$

$$L = L_k + \lambda_1 L_{size} + \lambda_2 L_{off} \quad (\text{I.4})$$

wherein formula I.1 represents the keypoint classification loss, for which a focal loss is adopted to handle the imbalance between positive and negative samples during training; $\hat{Y}_{xyc}=1$ denotes a detected center point and $\hat{Y}_{xyc}=0$ denotes background; the ground-truth keypoints are spread onto the heatmap $Y\in[0,1]^{\frac{W}{R}\times\frac{H}{R}\times C}$ by a Gaussian kernel; $N$ is the number of targets in the image, and $\alpha$ and $\beta$ are hyper-parameters of the loss, set to 2 and 4 respectively; formula I.2 represents the target-size loss, using the L1 loss: let $(x_1^{(k)},y_1^{(k)},x_2^{(k)},y_2^{(k)})$ be the coordinates of the $k$-th target bounding box; then $p_k=\bigl(\tfrac{x_1^{(k)}+x_2^{(k)}}{2},\tfrac{y_1^{(k)}+y_2^{(k)}}{2}\bigr)$ is the center point of the $k$-th target, $s_k=(x_2^{(k)}-x_1^{(k)},\,y_2^{(k)}-y_1^{(k)})$ is the size of the $k$-th target, and $\hat{S}_{p_k}$ is the predicted target size; formula I.3 represents the center-point offset loss: $p$ is the position of the center point in the input image, $\tilde{p}=\lfloor p/R\rfloor$ is the position of $p$ after $R$-fold down-sampling, the rounding after down-sampling introduces an offset, and $\hat{O}_{\tilde{p}}$ is the predicted center-point offset; formula I.4 is the total loss, where $\lambda_1$ and $\lambda_2$ are the weights of the different loss terms.
5. The feature-fusion-based safety helmet wearing detection convolutional network training method of claim 4, characterized in that the network weights are continuously adjusted by gradient descent; a learning rate of 1.25e-4 is used to adjust the weights, the number of iterations is 200, the learning-rate decay steps are 90 and 120, and the learning-rate decay factor is 0.1.
6. A safety helmet wearing detection method based on feature fusion according to claim 4 or 5, characterized in that a keypoint heatmap is obtained by prediction with the convolutional network; every point is compared with its 8 adjacent points and retained if its response value is greater than or equal to those of its neighbours, so that finally all peak points satisfying this condition are kept; let $\hat{\mathcal{P}}_c=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{n}$ be the set of peak points; the final target bounding box is $(\hat{x}_i+\delta\hat{x}_i-\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i-\hat{h}_i/2,\ \hat{x}_i+\delta\hat{x}_i+\hat{w}_i/2,\ \hat{y}_i+\delta\hat{y}_i+\hat{h}_i/2)$, where $(\delta\hat{x}_i,\delta\hat{y}_i)$ is the predicted center-point offset and $(\hat{w}_i,\hat{h}_i)$ is the predicted target size; the predicted coordinate values are drawn in the image in the form of boxes and the corresponding categories are displayed.
7. A storage medium storing a program which implements the feature-fusion-based safety helmet wearing detection method of claim 6.
8. A memory device, comprising:
a memory;
one or more processors, and
one or more programs stored in the memory and configured to be executed by the one or more processors, the programs, when executed by the processors, implementing the feature-fusion-based safety helmet wearing detection method of claim 6.
CN202010966231.3A 2020-09-15 2020-09-15 Feature fusion-based safety helmet wearing convolution network, training and detection method Active CN112070043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010966231.3A CN112070043B (en) 2020-09-15 2020-09-15 Feature fusion-based safety helmet wearing convolution network, training and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010966231.3A CN112070043B (en) 2020-09-15 2020-09-15 Feature fusion-based safety helmet wearing convolution network, training and detection method

Publications (2)

Publication Number Publication Date
CN112070043A true CN112070043A (en) 2020-12-11
CN112070043B CN112070043B (en) 2023-11-10

Family

ID=73695749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010966231.3A Active CN112070043B (en) 2020-09-15 2020-09-15 Feature fusion-based safety helmet wearing convolution network, training and detection method

Country Status (1)

Country Link
CN (1) CN112070043B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949730A (en) * 2021-03-11 2021-06-11 江苏禹空间科技有限公司 Method, device, storage medium and equipment for detecting target with few samples
CN113128413A (en) * 2021-04-22 2021-07-16 广州织点智能科技有限公司 Face detection model training method, face detection method and related device thereof
CN113177486A (en) * 2021-04-30 2021-07-27 重庆师范大学 Dragonfly order insect identification method based on regional suggestion network
CN113222904A (en) * 2021-04-21 2021-08-06 重庆邮电大学 Concrete pavement crack detection method for improving PoolNet network structure
CN113222990A (en) * 2021-06-11 2021-08-06 青岛高重信息科技有限公司 Chip counting method based on image data enhancement
CN113643235A (en) * 2021-07-07 2021-11-12 青岛高重信息科技有限公司 Chip counting method based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101725271A (en) * 2009-11-17 2010-06-09 江苏南通三建集团有限公司 Rapid construction method of reinforced concrete chimney
CN109034215A (en) * 2018-07-09 2018-12-18 东北大学 A kind of safety cap wearing detection method based on depth convolutional neural networks
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109376673A (en) * 2018-10-31 2019-02-22 南京工业大学 A kind of coal mine down-hole personnel unsafe acts recognition methods based on human body attitude estimation
CN110119686A (en) * 2019-04-17 2019-08-13 电子科技大学 A kind of safety cap real-time detection method based on convolutional neural networks
CN110728223A (en) * 2019-10-08 2020-01-24 济南东朔微电子有限公司 Helmet wearing identification method based on deep learning

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949730A (en) * 2021-03-11 2021-06-11 江苏禹空间科技有限公司 Method, device, storage medium and equipment for detecting target with few samples
CN112949730B (en) * 2021-03-11 2024-04-09 无锡禹空间智能科技有限公司 Method, device, storage medium and equipment for detecting target with few samples
CN113222904A (en) * 2021-04-21 2021-08-06 重庆邮电大学 Concrete pavement crack detection method for improving PoolNet network structure
CN113128413A (en) * 2021-04-22 2021-07-16 广州织点智能科技有限公司 Face detection model training method, face detection method and related device thereof
CN113177486A (en) * 2021-04-30 2021-07-27 重庆师范大学 Dragonfly order insect identification method based on regional suggestion network
CN113177486B (en) * 2021-04-30 2022-06-03 重庆师范大学 Dragonfly order insect identification method based on regional suggestion network
CN113222990A (en) * 2021-06-11 2021-08-06 青岛高重信息科技有限公司 Chip counting method based on image data enhancement
CN113222990B (en) * 2021-06-11 2023-03-14 青岛高重信息科技有限公司 Chip counting method based on image data enhancement
CN113643235A (en) * 2021-07-07 2021-11-12 青岛高重信息科技有限公司 Chip counting method based on deep learning
CN113643235B (en) * 2021-07-07 2023-12-29 青岛高重信息科技有限公司 Chip counting method based on deep learning

Also Published As

Publication number Publication date
CN112070043B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN112070043B (en) Feature fusion-based safety helmet wearing convolution network, training and detection method
CN110502965B (en) Construction safety helmet wearing monitoring method based on computer vision human body posture estimation
Shen et al. Detecting safety helmet wearing on construction sites with bounding‐box regression and deep transfer learning
Han et al. Deep learning-based workers safety helmet wearing detection on construction sites using multi-scale features
CN111126325B (en) Intelligent personnel security identification statistical method based on video
Su et al. RCAG-Net: Residual channelwise attention gate network for hot spot defect detection of photovoltaic farms
CN102163290B (en) Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information
Shi et al. Real-time traffic light detection with adaptive background suppression filter
CN103530638B (en) Method for pedestrian matching under multi-cam
CN112149512A (en) Helmet wearing identification method based on two-stage deep learning
CN110728252B (en) Face detection method applied to regional personnel motion trail monitoring
CN113139437B (en) Helmet wearing inspection method based on YOLOv3 algorithm
CN112287827A (en) Complex environment pedestrian mask wearing detection method and system based on intelligent lamp pole
JP2014093023A (en) Object detection device, object detection method and program
CN107688830A (en) It is a kind of for case string and show survey visual information association figure layer generation method
CN113033315A (en) Rare earth mining high-resolution image identification and positioning method
CN106570471A (en) Scale adaptive multi-attitude face tracking method based on compressive tracking algorithm
CN115512387A (en) Construction site safety helmet wearing detection method based on improved YOLOV5 model
Dai et al. Real-time safety helmet detection system based on improved SSD
CN117197676A (en) Target detection and identification method based on feature fusion
CN112785564B (en) Pedestrian detection tracking system and method based on mechanical arm
CN112541403B (en) Indoor personnel falling detection method by utilizing infrared camera
CN110929711B (en) Method for automatically associating identity information and shape information applied to fixed scene
Peng et al. [Retracted] Helmet Wearing Recognition of Construction Workers Using Convolutional Neural Network
CN115273150A (en) Novel identification method and system for wearing safety helmet based on human body posture estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant