CN113792584B - Wearing detection method and system for safety protection tool - Google Patents

Wearing detection method and system for safety protection tool

Info

Publication number
CN113792584B
CN113792584B (application CN202110887741.6A)
Authority
CN
China
Prior art keywords: mask; matrix; detected; training; feature map
Legal status: Active
Application number
CN202110887741.6A
Other languages
Chinese (zh)
Other versions
CN113792584A (en)
Inventor
刘斌
段亮
刁磊
魏立力
岳昆
李忠斌
胡矿
Current Assignee: Yunnan University YNU
Original Assignee: Yunnan University YNU
Application filed by Yunnan University YNU
Priority to CN202110887741.6A
Publication of CN113792584A
Application granted
Publication of CN113792584B

Classifications

    • G PHYSICS / G06 COMPUTING; CALCULATING OR COUNTING / G06F ELECTRIC DIGITAL DATA PROCESSING / G06F18/00 Pattern recognition / G06F18/20 Analysing / G06F18/25 Fusion techniques / G06F18/253 Fusion techniques of extracted features
    • G PHYSICS / G06 COMPUTING; CALCULATING OR COUNTING / G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS / G06N3/00 Computing arrangements based on biological models / G06N3/02 Neural networks / G06N3/04 Architecture, e.g. interconnection topology / G06N3/045 Combinations of networks
    • G PHYSICS / G06 COMPUTING; CALCULATING OR COUNTING / G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS / G06N3/00 Computing arrangements based on biological models / G06N3/02 Neural networks / G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a wearing detection method and system for safety protectors. The method comprises the following steps: extracting features from a target scene image with a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected; generating anchor frames to be detected from the feature map matrix to be detected; based on the feature map matrix to be detected, reducing the coordinates in the anchor frame coordinate matrix to be detected with a trained region candidate network and non-maximum suppression to obtain samples to be detected; inputting the samples to be detected and the mask feature map matrix to be detected into a trained mask convolution model to obtain the worker masks and safety protector masks in the target scene image; and determining whether the workers in the target scene image wear the safety protectors according to the intersection relation between the worker masks and the safety protector masks. The invention improves the accuracy of safety protector detection and reduces labor cost.

Description

Wearing detection method and system for safety protection tool
Technical Field
The invention relates to the field of protective equipment detection, and in particular to a method and a system for detecting the wearing of safety protectors.
Background
Object detection is a long-standing fundamental problem in the field of computer vision. The task of object detection is to determine whether a given class of object instances exists in a given image; if so, the spatial location and coverage of each target instance is returned, such as a bounding box. As a basis for image understanding and computer vision, object detection is a basis for solving more complex and higher-level visual tasks such as segmentation, scene understanding, object tracking, image description, event detection, and activity recognition, and has wide application in many fields of artificial intelligence and information technology, including aspects of robot vision, autopilot, human-computer interaction, and the like.
Target detection can be divided into target detection based on traditional hand-crafted features and target detection based on deep learning. Early target detection was mostly built on hand-crafted features, with the Deformable Part-based Model (DPM) as a representative algorithm. The main idea of DPM is to split the detection of the whole target, as done in traditional target detection algorithms, into the detection of each part of the model, and then aggregate the part detections into the final detection result.
Empowering traditional construction with 5G technology and artificial intelligence enables intelligent detection of workers' unsafe operations, reduces the probability of worker injury, and improves workers' safety awareness and operating standards; it is the first step of 5G infrastructure construction.
Wearing detection for safety protectors belongs to the field of target detection in computer vision. The existing data set consists of photos of mobile 5G infrastructure construction sites; the data set is large, and the photos vary in quality and content. Manual review of the photos has a high time cost, and although existing target detection models can detect targets and their extents, the relationship between targets (i.e., workers and safety protectors) cannot be accurately determined from the extents alone.
Disclosure of Invention
Based on the above, the embodiment of the invention provides a method and a system for detecting the wearing of a safety protector, so as to improve the accuracy of the detection of the safety protector and reduce the labor cost.
In order to achieve the above object, the present invention provides the following solutions:
a method of safety protector wearing detection, comprising:
acquiring a target scene image;
extracting features of the target scene image by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected;
Generating anchor frames to be detected by taking each pixel point in the feature map matrix to be detected as a center, and storing coordinates of each anchor frame to be detected by adopting an anchor frame coordinate matrix to be detected;
inputting the feature map matrix to be detected into a trained region candidate network to obtain the foreground probability of each anchor frame to be detected;
reducing coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and non-maximum suppression method of the anchor frame to be detected to obtain a sample to be detected;
inputting the sample to be tested and the mask feature map matrix to be tested into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image;
and determining whether the worker wears the safety protector in the target scene image according to the intersection relation between the worker mask and the safety protector mask of the target scene image.
The invention also provides a system for detecting wearing of the safety protection device, which comprises:
the image acquisition module is used for acquiring a target scene image;
the feature extraction module is used for extracting features of the target scene image by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected;
The anchor frame generation module is used for generating an anchor frame to be detected by taking each pixel point in the feature map matrix to be detected as a center, and storing the coordinates of each anchor frame to be detected by adopting the anchor frame coordinate matrix to be detected;
the foreground probability calculation module is used for inputting the feature map matrix to be detected into a trained area candidate network to obtain the foreground probability of each anchor frame to be detected;
the sample to be detected generation module is used for reducing coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and the non-maximum value suppression method of the anchor frame to be detected to obtain a sample to be detected;
the mask detection module is used for inputting the sample to be detected and the mask feature map matrix to be detected into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image;
the safety protection tool wearing detection module is used for determining whether a worker wears the safety protection tool or not in the target scene image according to the intersection relation between the worker mask and the safety protection tool mask of the target scene image.
Compared with the prior art, the invention has the beneficial effects that:
the embodiment of the invention provides a wearing detection method and system for a safety protector, wherein feature images with different sizes in a target scene image are extracted by utilizing a ResNet-101 residual neural network, and feature fusion is carried out on the feature images by utilizing a feature pyramid network (Feature PyramidNetworks for Object Detection, FPN), so that the performance of small object detection is improved; the to-be-detected sample and the to-be-detected Mask feature map matrix are input into a trained Mask convolution model (Mask R-CNN) to obtain a worker Mask and a safety guard Mask in a target scene image, so that the relation between a target (namely, a person and the safety guard) is accurately judged, the accuracy of safety guard detection is improved, and the labor cost is reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for detecting wearing of a safety protector according to an embodiment of the present invention;
FIG. 2 is a diagram of a specific implementation process of a method for detecting wearing of a safety protector according to an embodiment of the present invention;
FIG. 3 is a block diagram of a ResNet-101 residual neural network provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature pyramid network fusion process provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a region candidate network, a classification regression network and a mask network according to an embodiment of the present invention;
fig. 6 is a structural diagram of a wearing detection system for safety protectors according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Currently, there are two mainstream deep-learning-based target detection models. The first is the target detection model based on target candidates, whose pipeline is multi-stage, such as Regions with CNN Features (R-CNN) and the Faster R-CNN model; the second is the target detection model based on an integrated convolutional network, whose pipeline is single-stage, such as You Only Look Once (YOLO) and the Single Shot MultiBox Detector (SSD). The invention adopts a deep-learning target detection model based on target candidates.
With these two types of target detection models, although the target and its extent can be detected, the relationship between targets (i.e., workers and safety protectors) cannot be accurately determined from the extent alone.
Only when the contour information of the targets is obtained can the relationship between the targets be accurately judged, thereby improving detection accuracy. Therefore, the method is based on the Mask region convolutional neural network model (Mask R-CNN): the training label data are stored in the labeling format of the COCO (Common Objects in Context) data set, the contour information of the objects to be detected, i.e. the mask information, is predicted with the trained model, the mask information output during prediction is matched by category, and the relationship between workers and safety protectors is calculated, so as to judge whether the workers wear the safety protectors.
Fig. 1 is a flowchart of a method for detecting wearing of a safety protector according to an embodiment of the present invention. Referring to fig. 1, the method for detecting wearing of a safety protector provided in this embodiment includes:
step 101: a target scene image is acquired.
Step 102: and extracting features of the target scene image by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected.
Step 102 specifically includes: 1) Extracting features from the target scene image with the ResNet-101 residual neural network to obtain an initial matrix to be detected; the initial matrix to be detected comprises feature maps of K stages. 2) Performing feature fusion on the feature map of each stage in the initial matrix to be detected with the feature pyramid network to obtain K fused feature maps. 3) Performing a maximum pooling operation with a set stride on the fused feature map of the K-th stage to obtain a newly added feature map. 4) Determining the feature map matrix to be detected and the mask feature map matrix to be detected; the feature map matrix to be detected comprises the K fused feature maps and the newly added feature map; the mask feature map matrix to be detected comprises the K fused feature maps.
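As an illustration of this feature-fusion step, the following is a minimal sketch of an FPN-style top-down fusion over four backbone stages plus one extra max-pooled level, written in PyTorch; the channel counts and the nearest-neighbour upsampling are assumptions for illustration, not the exact implementation of the invention.

import torch
import torch.nn.functional as F
from torch import nn

class SimpleFPN(nn.Module):
    """Minimal FPN-style fusion: lateral 1x1 convs + top-down upsampling,
    plus one extra level obtained by stride-2 max pooling (cf. P6).
    Channel counts are assumed for illustration only."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                       # feats = [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the coarser map by 2x and add it in
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(laterals[i + 1],
                                                      scale_factor=2, mode="nearest")
        pyramid = [s(p) for s, p in zip(self.smooth, laterals)]                 # P2..P5
        pyramid.append(F.max_pool2d(pyramid[-1], kernel_size=1, stride=2))      # P6
        return pyramid

# usage with dummy tensors shaped like the four ResNet stages of a 1024x1024 input
feats = [torch.randn(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (256, 128, 64, 32))]
p2, p3, p4, p5, p6 = SimpleFPN()(feats)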
Step 103: and generating an anchor frame to be detected by taking each pixel point in the feature map matrix to be detected as a center, and storing the coordinates of each anchor frame to be detected by adopting the anchor frame coordinate matrix to be detected.
Step 104: and inputting the feature map matrix to be detected into a trained area candidate network to obtain the foreground probability of each anchor frame to be detected.
Step 105: and reducing coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and the non-maximum suppression method of the anchor frame to be detected to obtain a sample to be detected.
Step 105 specifically includes: 1) Sorting the corresponding anchor frames in the anchor frame coordinate matrix to be detected in descending order of foreground probability to obtain an anchor frame sequence. 2) Determining the first set number of anchor frames to be detected in the anchor frame sequence as first target anchor frames. 3) Reducing the first target anchor frames to a second set number with the non-maximum suppression method to obtain second target anchor frames; the second set number is smaller than the first set number. 4) Computing the intersection over union (IoU) between the largest anchor frame to be detected and each of the other anchor frames to be detected to obtain a plurality of IoU values; the largest anchor frame to be detected is the anchor frame with the largest foreground probability among the second target anchor frames; the other anchor frames to be detected are the anchor frames in the anchor frame sequence other than the largest anchor frame. 5) Screening the second target anchor frames with the IoU values to obtain third target anchor frames, and determining the coordinates in the anchor frame coordinate matrix to be detected corresponding to the third target anchor frames as the sample to be detected.
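The anchor-frame reduction described above is standard non-maximum suppression; a minimal NumPy sketch is given below, with the IoU threshold value chosen only for illustration.

import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) foreground probabilities.
    Returns the indices of the boxes kept, highest score first."""
    order = scores.argsort()[::-1]          # descending foreground probability
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # intersection of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[[best]])[0] + area(boxes[rest]) - inter)
        order = rest[iou < iou_thresh]      # drop boxes overlapping the kept one
    return np.array(keep)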
Step 106: and inputting the sample to be tested and the mask feature map matrix to be tested into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image.
Step 106, specifically includes: 1) Inputting the sample to be tested and the mask feature map matrix to be tested into a trained classification regression network to obtain a frame matrix and a category matrix of the target scene image, wherein the frame matrix stores coordinate information of frames corresponding to various categories of each sample to be tested, and the category matrix stores category probabilities of frames corresponding to various categories of each sample to be tested, and the coordinate information comprises ordinate coordinates of pixel points on the mask feature map to be tested, abscissa coordinates of pixel points on the mask feature map to be tested, lower left corner coordinates of the anchor frame to be tested and upper right corner coordinates of the anchor frame to be tested. 2) And selecting the coordinate information of the frame corresponding to the maximum category probability of each sample to be tested from the frame matrix according to the category matrix. 3) And determining a target matrix, wherein the target matrix comprises coordinate information of a frame corresponding to the maximum class probability of all the samples to be tested and corresponding classes. 4) Reducing the number of corresponding samples to be tested in the target matrix by adopting a non-maximum suppression method to obtain a reduced matrix; the reduction matrix comprises coordinate information of a frame corresponding to the maximum class probability of the sample to be measured after reduction and corresponding classes. 5) And inputting the reduced matrix into a trained mask network to obtain a worker mask and a safety protection mask in the target scene image.
For example, the samples to be detected obtained in step 105 (suppose there are a of them) are first passed through the trained classification regression network to obtain a frame matrix of shape (1, a, θ, 4) and a category matrix of shape (1, a, θ), where θ is the number of categories. For each sample b (1 ≤ b ≤ a), the category ξ with the highest category probability is selected and recorded, and the frame corresponding to that category is taken from the frame matrix according to ξ, giving a target matrix of shape (1, a, 5) (5 = 4 + 1: the first 4 values are the coordinate information and the last value is the category ξ of that frame). The number of samples to be detected is then reduced with the non-maximum suppression method; supposing it is reduced from a to a', the reduced matrix has shape (1, a', 5). Finally, the reduced matrix is input into the trained mask network to obtain the worker masks and the safety protector masks.
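To make the shapes in this example concrete, the following NumPy sketch selects, for each of the a samples, the category with the highest probability and the corresponding frame from a (1, a, θ, 4) frame matrix and a (1, a, θ) category matrix; the array names and the random data are illustrative only.

import numpy as np

a, theta = 8, 4                               # assumed numbers of samples and categories
box_matrix = np.random.rand(1, a, theta, 4)   # per-category frame coordinates
cls_matrix = np.random.rand(1, a, theta)      # per-category probabilities

best_cls = cls_matrix[0].argmax(axis=1)                   # xi for each sample b
best_box = box_matrix[0, np.arange(a), best_cls]          # frame of that category, (a, 4)
target = np.concatenate([best_box, best_cls[:, None]], axis=1)   # (a, 5): 4 coords + category
print(target.shape)   # (8, 5) -- matches the (1, a, 5) target matrix up to the batch axis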
Step 107: and determining whether the worker wears the safety protector in the target scene image according to the intersection relation between the worker mask and the safety protector mask of the target scene image.
In step 104, the method for determining the trained area candidate network includes:
1) Data preprocessing. First, a training data set is acquired; the training data set comprises training photos and the corresponding labeling information, which consists of the photo number, the photo size, mask information and mask category information. All training photos in the training data set are then scaled and padded to a uniform size, the padding information is recorded, the mask information in the labels is scaled by the set ratio, and the real bounding boxes of the masks are extracted to obtain a real bounding box matrix. The uniformly sized training data set is in COCO (Common Objects in Context) format; adopting the COCO format as the training data format in the subsequent training improves the training and prediction effect. Referring to fig. 2, this step specifically comprises:
1, data preprocessing:
1.1: data acquisition and filtering
And taking a picture of a working scene of a worker, and filtering the photos with overexposure, underexposure, unclear blurring and incomplete pictures by adopting a manual selection mode so as to construct a photo set X.
1.2: labeling
X is labeled with the Labelme labeling tool. When labeling the data, masks and categories are marked for the contours of safety protectors and workers, forming a labeled data set X_{n×2} = {D, J}. D = {d_1, d_2, ..., d_n}, where d_i (1 ≤ i ≤ n) denotes an unlabeled training photo, i.e. an original photo in the photo set X. J = {J_1, J_2, ..., J_n}, where J_i (1 ≤ i ≤ n) denotes the labeling information of d_i, with J_i = {G_D, G_S, G_M, G_C}: G_D (1 ≤ G_D ≤ n) is the photo number; G_S = [h, w] is the photo size, where h is the height of d_i and w is its width; G_M denotes the labeled masks (the mask information), each mask preserving two-dimensional information v(j, k) ∈ {0, 1}, where a pixel covered by the mask has value 1 and a pixel without the mask has value 0; G_C denotes the category information of the masks, where t denotes the number of labeled masks. X_{n×2} is divided into a training data set X' and a test data set X'' in the ratio 9:1.
1.3: will d i Scaling and filling to uniform size
Setting the scaling s c Maximum photo length m of D l And minimum photo length m of D s 。h l =max(h,w),h s =min(h,w),s c =max(m s /h s ,s c ). If s c ×h l >m l ,s c =m l /h l For h l ,h s Performing equivalence reduction, and using pixel pair d with value of 0 i Filling, let h l =h s =m l The method comprises the steps of carrying out a first treatment on the surface of the If s c ×h l ≤m l With a pixel pair d of value 0 i Filling, let h l =h s =m l . The scaled-up filled photo is d' i . S at the beginning c =1, i.e. the photograph is not zoomed by default; m is m l =1024,m s =800, i.e. the common photo does not exceed this value.
1.4: Record the padding information of d_i
The padding information w_d of d_i is calculated according to formula (1); w_d preserves the position of d_i after padding.
Here t_p, b_p, l_p and r_p denote the widths of the pixels of value 0 padded at the top, bottom, left and right of d_i respectively, and w_d = [t_p, l_p, h+t_p, w+l_p], where (t_p, l_p) is the lower-left corner of d_i within d'_i and (h+t_p, w+l_p) is its upper-right corner.
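A minimal sketch of steps 1.3-1.4 (scaling the shorter side towards m_s, capping the longer side at m_l = 1024, zero-padding to a square and recording w_d) is given below; it assumes centred padding and uses cv2.resize for resampling, both illustrative choices rather than details stated in the invention.

import numpy as np
import cv2   # assumed here only for resampling; any resize routine works

def scale_and_pad(img, m_l=1024, m_s=800):
    h, w = img.shape[:2]
    h_l, h_s = max(h, w), min(h, w)
    s_c = max(1.0, m_s / h_s)                 # default s_c = 1, enlarge the short side to m_s
    if s_c * h_l > m_l:                       # cap so the long side does not exceed m_l
        s_c = m_l / h_l
    img = cv2.resize(img, (round(w * s_c), round(h * s_c)))
    h2, w2 = img.shape[:2]
    t_p, l_p = (m_l - h2) // 2, (m_l - w2) // 2          # top / left zero padding (assumed centred)
    b_p, r_p = m_l - h2 - t_p, m_l - w2 - l_p            # bottom / right zero padding
    padded = np.zeros((m_l, m_l) + img.shape[2:], dtype=img.dtype)
    padded[t_p:t_p + h2, l_p:l_p + w2] = img
    w_d = [t_p, l_p, h2 + t_p, w2 + l_p]      # position of d_i inside d'_i, as in formula (1)
    return padded, s_c, w_d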
1.5: obtaining a real bounding box matrix G B
Pair J i Mask G in (1) M The proceeding proportion is s c From G by the following rule M The true bounding box of the mask is extracted. Calculation ofIf there is a point where a pixel in the vertical direction is 1, the values in the vertical direction are all 1. Finding the first 1 from left to right is denoted as x 1 Then continue searching the last 1 position to record as x 2 The method comprises the steps of carrying out a first treatment on the surface of the Calculate->If there is a point where a pixel in the horizontal direction is 1, the values in the horizontal direction are all 1. Find the first 1 from top to bottom and mark y as 2 Then continue searching the last 1 position to be marked as y 1 . Then (x) 1 ,y 1 ) And (x) 2 ,y 2 ) The lower left and upper right corner coordinates of the real bounding box, respectively. Repeating n times from G M Finding out the coordinates of all bounding boxes, using the real bounding box matrix G B Record (S)/(S)>
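The bounding-box extraction rule of step 1.5 amounts to taking the extreme row and column indices of the non-zero mask pixels; a NumPy sketch under that reading:

import numpy as np

def mask_to_bbox(mask):
    """mask: (h, w) binary array. Returns (x1, y1, x2, y2), the corners of the
    tightest box containing all mask pixels, following the scan order in 1.5."""
    cols = mask.max(axis=0)          # 1 where any pixel in that column is 1
    rows = mask.max(axis=1)          # 1 where any pixel in that row is 1
    xs = np.flatnonzero(cols)        # column indices containing mask pixels
    ys = np.flatnonzero(rows)        # row indices containing mask pixels
    x1, x2 = xs[0], xs[-1]           # first and last 1 from left to right
    y2, y1 = ys[0], ys[-1]           # first 1 from the top is y2, the last is y1
    return x1, y1, x2, y2

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1
print(mask_to_bbox(mask))            # (3, 4, 6, 2) under the top-to-bottom row convention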
2) Feature extraction. First, feature extraction is performed on the uniformly sized training photos with the ResNet-101 residual neural network and the feature pyramid network to obtain the feature map matrix and the mask feature map matrix of each uniformly sized training photo. Then, anchor frames are generated with each pixel point of the feature map matrix as a center, and the coordinates of each anchor frame are stored in an anchor frame coordinate matrix. Finally, a real frame offset matrix is calculated based on the real bounding box matrix and the anchor frame coordinate matrix.
The feature extraction method comprises the steps of extracting features of training photos with uniform sizes by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix and a mask feature map matrix of each training photo with uniform sizes, wherein the feature map matrix and the mask feature map matrix specifically comprise the following steps:
(1) extracting features of the training pictures with uniform sizes by adopting a ResNet-101 residual neural network to obtain an initial training matrix; the initial training matrix comprises training feature diagrams of K stages.
(2) And carrying out feature fusion on the training feature graphs of each stage in the initial training matrix by adopting a feature pyramid network to obtain K stages of fused training feature graphs.
(3) And carrying out maximum pooling operation on the fused training feature images in the K stage according to a set step length to obtain a newly added training feature image.
(4) Determining a feature map matrix and a mask feature map matrix; the feature map matrix comprises K-stage fused training feature maps and the newly added training feature maps; the mask feature map matrix comprises K-stage fused training feature maps.
Still referring to fig. 2, in practical application, the implementation process of the step 2) is as follows:
2: feature extraction
2.1: extraction of photo d 'with ResNet-101' i Is a feature map of (1). The structure of the ResNet-101 residual neural network is shown in FIG. 3.
d' i Is 1024 x 1024 in size and (1,1024,1024,3) in shape, and 3 represents the number of channels of the color picture. ResNet-101 extracted target features were divided into 5 stages, the procedure is as in Table 1.
TABLE 1 feature map extraction Process
Feature map matrix c= [ C 2 ,C 3 ,C 4 ,C 5 ]Wherein C k ,k∈[2,5]Representing photo d' i Feature map of the kth stage. In C 5 For example, (1,32,32,2048) shows a feature map of size 32×32, and the number of channels is 2048.
2.2: feature fusion is performed on the feature map matrix C by using a Feature Pyramid Network (FPN). The fusion process of the FPN is shown in fig. 4, where 2X represents upsampling to 2 times the original feature map size.
The d 'is obtained by (2.1)' i Feature map matrix C of (C), utilize C 2 、C 3 、C 4 、C 5 The feature map pyramid structure is established, and the specific fusion process is shown in table 2.
TABLE 2 feature map fusion process
Will P 5 Performing maximum pooling operation with step length of 2 to obtain P 6 The dimension is (1,16,16,256), the size is 16×16, and the channel number is 256. Binding to the slave P 2 To P 6 Feature map, obtaining a feature map matrix R of RPN F And mask convolution network feature map matrix M R . Wherein R is F =[P 2 ,P 3 ,P 4 ,P 5 ,P 6 ]For inputting area candidate networks; m is M R =[P 2 ,P 3 ,P 4 ,P 5 ]For inputting a mask convolution network.
2.3: generating anchor frames
An anchor frame is one of a series of boxes generated at the pixel points of a feature map. Taking feature map P_2 as an example: its size is 256×256 and its downsampling stride with respect to photo d'_i is 4, so each pixel point of P_2 corresponds to a 4×4 region of d'_i, which is the anchor frame. The feature map matrix R_F is traversed and anchor frames are generated for every pixel point of P_i (2 ≤ i ≤ 6). The anchor frame sizes of the different feature maps are set to α = [α_2, α_3, α_4, α_5, α_6], so that the anchor frame corresponding to each pixel point of feature map P_j (P_j ∈ R_F) has size α_j × α_j (α_j ∈ α). The aspect ratios of the anchor frames are set to R = [r_1, r_2, r_3], so that each pixel point corresponds to three anchor frames with different heights H = [h_1, h_2, h_3] and widths W = [w_1, w_2, w_3]. H and W are generated according to formula (2),
where i ∈ [1, 3] and j ∈ [2, 6], H denotes the three different heights of the anchor frames, W denotes the three different widths, and h_i × w_i = α_j × α_j. Table 3 shows the number of anchor frames generated for each feature map.
TABLE 3 Relationship between feature maps and number of anchor frames
Feature map    Number of anchor frames
P_2            256×256×3 = 196608
P_3            128×128×3 = 49152
P_4            64×64×3 = 12288
P_5            32×32×3 = 3072
P_6            16×16×3 = 768
P_2, P_3, P_4, P_5 and P_6 together generate 261888 anchor frames. The anchor frame coordinates are calculated through formula (3):
y_1 = y_center − 0.5 × h_i, i ∈ [1, 3]
x_1 = x_center − 0.5 × w_i, i ∈ [1, 3]
y_2 = y_1 + h_i
x_2 = x_1 + w_i    (3)
where y_center and x_center denote the ordinate and abscissa of the pixel point on the feature map, and (x_1, y_1) and (x_2, y_2) denote the lower-left and upper-right corners of the anchor frame respectively. The anchor frames of P_2, P_3, P_4, P_5 and P_6 are recorded in the anchor frame matrix M_A, which has shape (1, 261888, 4).
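A NumPy sketch of the anchor generation in 2.3 for one pyramid level is given below. Since formula (2) is not reproduced legibly above, the construction h_i = α_j·sqrt(r_i), w_i = α_j/sqrt(r_i) (which satisfies h_i × w_i = α_j × α_j) and the per-level scales used in the example call are assumptions.

import numpy as np

def level_anchors(feat_size, stride, scale, ratios=(0.5, 1.0, 2.0)):
    """Anchor frames for one pyramid level, centred on every feature-map pixel.
    feat_size: feature map side length; stride: downsampling factor w.r.t. d'_i;
    scale: anchor side alpha_j. Returns (feat_size*feat_size*len(ratios), 4)
    boxes as [x1, y1, x2, y2]. The h*w = alpha_j^2 construction is an assumption."""
    ratios = np.asarray(ratios)
    heights = scale * np.sqrt(ratios)         # h_i * w_i == scale**2 for every ratio
    widths = scale / np.sqrt(ratios)
    centers = (np.arange(feat_size) + 0.5) * stride
    cx, cy = np.meshgrid(centers, centers)    # one centre per feature-map pixel
    cx, cy = cx.ravel()[:, None], cy.ravel()[:, None]
    x1 = cx - 0.5 * widths;  y1 = cy - 0.5 * heights     # formula (3)
    x2 = x1 + widths;        y2 = y1 + heights
    return np.stack([x1, y1, x2, y2], axis=-1).reshape(-1, 4)

# assumed per-level scales; the counts reproduce Table 3 and the 261888 total
counts = [level_anchors(s, 1024 // s, a).shape[0]
          for s, a in zip((256, 128, 64, 32, 16), (32, 64, 128, 256, 512))]
print(counts, sum(counts))        # [196608, 49152, 12288, 3072, 768] 261888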
2.4: calculating a true offset frame matrix I' B
Calculating an anchor frame matrix M through a formula (4) A And a true bounding box matrix G B Cross ratio I of (1) U The calculated cross-ratios are saved with a matrix IU.
I=max(x' 2 -x' 1 ,0)×max(y' 2 -y' 1 ,0)
I U =I/U (4)
Wherein, the liquid crystal display device comprises a liquid crystal display device,and->Respectively represent the left lower corner and the right upper corner of the anchor frame,/->And->Respectively represent G B Is->Left lower corner and right upper corner of (c). (x' 1 ,y' 1 ) And (x' 2 ,y' 2 ) Respectively representing the lower left corner and the upper right corner of the intersection of the two, I representing the area of the intersection, and U representing the area of the union.
Using IU max Preserving maximum IUFrom I, find the value of the cross ratio I from IU U
If I U More than or equal to 0.7, the number is 1, and 0.3 is less than I U < 0.7, marked with a number of 0, if I U < 0.3, labeled with a number of-1, using R PN The matrix records the marking data. And (5) converting coordinates of the anchor frame and the real frame through a formula (5).
The offset values between the anchor frame and the real frame are converted through formula (6):
d_h = log(h_1 / h_2)
d_w = log(w_1 / w_2)    (6)
where (h_1, w_1) denote the height and width of the anchor frame (whose center coordinates are given by formula (5)), and (h_2, w_2) denote the height and width of the real frame. The offset values are saved in the real offset frame matrix I'_B; empirically len(I'_B) = 256, i.e. 256 offset values are saved.
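Formulas (4)-(6) correspond to the usual IoU and box-offset computations; a NumPy sketch follows. The normalisation of d_x and d_y is an assumption, and the log ratios keep the anchor-over-real ordering written above (standard Mask R-CNN implementations use the inverse ratio).

import numpy as np

def iou(a, g):
    """a, g: boxes as [x1, y1, x2, y2]. Formula (4): I_U = I / U."""
    x1, y1 = max(a[0], g[0]), max(a[1], g[1])
    x2, y2 = min(a[2], g[2]), min(a[3], g[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(a) + area(g) - inter)

def box_offsets(anchor, gt):
    """Centre/size conversion (formula (5)) and offsets (formula (6)).
    The d_x, d_y normalisation (centre shift over anchor size) is assumed."""
    ah, aw = anchor[3] - anchor[1], anchor[2] - anchor[0]
    gh, gw = gt[3] - gt[1], gt[2] - gt[0]
    acy, acx = anchor[1] + 0.5 * ah, anchor[0] + 0.5 * aw
    gcy, gcx = gt[1] + 0.5 * gh, gt[0] + 0.5 * gw
    d_x, d_y = (gcx - acx) / aw, (gcy - acy) / ah
    d_h, d_w = np.log(ah / gh), np.log(aw / gw)   # h1/h2, w1/w2 as written in formula (6)
    return d_x, d_y, d_w, d_h

print(iou([0, 0, 4, 4], [2, 2, 6, 6]))   # 4 / 28 ≈ 0.143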
3) And constructing an area generation network (Region Proposal Network, RPN), and training the area generation network by adopting a feature map matrix to obtain a trained area candidate network. The regional generation network comprises a sharing layer, a first convolution layer and a second convolution layer; the first convolution layer and the second convolution layer are parallel; the sharing layer is used for keeping the size of the input characteristic diagram unchanged; the first convolution layer is used for convolving the feature images output by the sharing layer to output a prediction frame and the foreground probability of the prediction frame; and the second convolution layer is used for convolving the feature map output by the sharing layer to output a predicted frame offset value. And during training, inputting the feature map matrix into the sharing layer, calculating a regional network loss value according to the real frame offset frame matrix, the predicted frame output by the first convolution layer and the predicted frame offset value output by the second convolution layer, and training the regional generation network by taking the minimum regional network loss value as a target to obtain a trained regional candidate network.
Still referring to fig. 2, in practical application, the implementation process of the step 3) is as follows:
3: construction and training of RPN
3.1: construction of RPN
The RPN consists of two parts. The first half consists of 512 3×3 convolution kernels, whose function is to fix the number of channels of the input feature map to 512 while keeping the feature map size unchanged; the resulting feature map is called the shared layer and serves as the shared input of the second half. The second half consists of two parallel convolution layers: the first convolution layer l_1, consisting of 6 1×1 convolution kernels, and the second convolution layer l_2, consisting of 12 1×1 convolution kernels.
3.2: construction of shared layers
Using the first half of the RPN, P_2, P_3, P_4, P_5 and P_6 in R_F are convolved in turn so that their number of channels becomes 512 while their size remains unchanged, forming the shared layer.
3.3: foreground probability beta of prediction frame 1 And background probability beta 2 Calculation of (2)
The foreground represents the object to be detected, the background represents the object not to be detected, and the prediction frame represents the feature map P i And outputting an anchor frame after RPN convolution. Using a first convolution layer l 1 Convolving the shared layer to obtain data with dimensions (1, w×h, 6) and resetting to (1, w×h×3, 2), wherein the data are the front/back data predicted by RPN, and R 'is used for processing the data' C The predicted data is saved.Wherein w×h×3 represents the feature map P i The number of generated prediction frames, w, h respectively represent P i Is the width and height of the frame, 3 indicates the anchor frame type, 2 indicates that the prediction frame selection region is the foreground data V 1 And background data V 2 Is a dimension of (c). R 'is calculated by using the formula (7)' C Calculate and output the foreground/background probability beta 1 And beta 2 Using a matrix R' L Respectively saving the generated foreground probability beta 1 And background probability beta 2 Wherein
3.4: calculation of prediction frame offset value delta
Using a second convolution layer l 2 The shared layer is convolved to obtain data with dimensions (1, w×h, 12), and then deformed into (1, w×h×3, 4), denoted as δ, which is the predicted data of RPN. Wherein w×h×3 represents the feature map P i The number of predicted frames generated, w and h represent P, respectively i Is represented by [ d ], 3 represents the anchor frame type, 4 represents the dimension of the offset between the prediction frame and the anchor frame, and the content thereof is x ,d y ,log(d w ),log(d h )],d x ,d y Representing the offset of the coordinates of the central point of the predicted frame and the central point of the anchor frame, d w And d h Representing the ratio between the width and height of the prediction frame and the width and height of the anchor frame. By shifting matrix R 'with prediction frame' B Preserving the generated delta, wherein
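The two 1×1 prediction heads of 3.3-3.4 can be sketched as follows in PyTorch (the invention's own implementation follows a Keras-style Mask R-CNN, so this is only an illustrative equivalent); with 3 anchor types per location, 6 = 3×2 output channels give the foreground/background scores and 12 = 3×4 give the offsets.

import torch
from torch import nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, anchors_per_loc=3):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # shared layer
        self.cls = nn.Conv2d(512, anchors_per_loc * 2, kernel_size=1)        # l_1: 6 kernels
        self.reg = nn.Conv2d(512, anchors_per_loc * 4, kernel_size=1)        # l_2: 12 kernels

    def forward(self, p):                      # p: one pyramid level, shape (1, C, h, w)
        s = torch.relu(self.shared(p))         # the activation here is an assumption
        logits = self.cls(s).permute(0, 2, 3, 1).reshape(1, -1, 2)   # (1, w*h*3, 2) -> R'_C
        probs = logits.softmax(dim=-1)                               # beta_1, beta_2 -> R'_L
        deltas = self.reg(s).permute(0, 2, 3, 1).reshape(1, -1, 4)   # (1, w*h*3, 4) -> delta
        return logits, probs, deltas

logits, probs, deltas = RPNHead()(torch.randn(1, 256, 64, 64))
print(probs.shape, deltas.shape)     # torch.Size([1, 12288, 2]) torch.Size([1, 12288, 4])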
3.5: calculation of RPN loss value
3.5.1: calculation of the classification loss of the RPN
Input R_PN and R'_C. Find the indices i of the anchor frames whose label in R_PN is not 0, and take the corresponding values from R'_C to participate in the loss calculation; the normalizing count is the total number of entries in R_PN whose value is 1. The loss value is calculated according to formula (8).
The region candidate network parameters are trained according to the back propagation of the loss function.
3.5.2: calculation of regression loss for RPN networks
Input R_PN, R'_B and I'_B. Find the indices i of the anchor frames labeled 1 in R_PN, and take the corresponding values from R'_B and I'_B. The loss value is calculated according to formula (9),
where x_ij is the difference between the j-th component of the predicted offset and the j-th component of the true offset (1 ≤ j ≤ 4), and m is the number of offsets involved. The region candidate network parameters are trained according to the back propagation of the loss function.
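Formulas (8) and (9) are not reproduced legibly above; the sketch below shows the usual choices for the two RPN loss terms, a cross-entropy over the labelled anchors and a smooth-L1 regression loss over the positive anchors, which is an assumption about their exact form. The labelling convention (1 foreground, −1 background, 0 ignored) is also assumed.

import torch
import torch.nn.functional as F

def rpn_losses(labels, logits, pred_deltas, true_deltas):
    """labels: (N,) anchor labels as in R_PN, assumed 1 = foreground, -1 = background,
    0 = ignored; logits: (N, 2) from R'_C; pred/true_deltas: (N, 4) offsets.
    The exact forms of formulas (8) and (9) are assumed (cross-entropy + smooth L1)."""
    used = labels != 0                                    # anchors taking part in the cls loss
    cls_targets = (labels[used] == 1).long()              # foreground -> class 1
    cls_loss = F.cross_entropy(logits[used], cls_targets)
    pos = labels == 1                                     # only positives drive the regression
    reg_loss = F.smooth_l1_loss(pred_deltas[pos], true_deltas[pos])
    return cls_loss, reg_loss

labels = torch.tensor([1, -1, 0, 1])
cls_l, reg_l = rpn_losses(labels, torch.randn(4, 2), torch.randn(4, 4), torch.randn(4, 4))
print(float(cls_l), float(reg_l))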
In step 106, the method for determining the trained mask convolution model includes:
1) Reducing the coordinates in the anchor frame coordinate matrix based on the foreground probability of the prediction frames and the non-maximum suppression method to obtain training samples. Specifically: the corresponding anchor frames in the anchor frame coordinate matrix are sorted in descending order of the foreground probability of the prediction frames to obtain a training anchor frame sequence; the first set number of anchor frames in the training anchor frame sequence are determined as first training target anchor frames; the first training target anchor frames are reduced to a second set number with the non-maximum suppression method to obtain second training target anchor frames, the second set number being smaller than the first set number; the IoU between the maximum anchor frame and each of the other anchor frames is computed to obtain a plurality of training IoU values, where the maximum anchor frame is the anchor frame with the largest foreground probability among the second training target anchor frames and the other anchor frames are the anchor frames in the training anchor frame sequence other than the maximum anchor frame; and the second training target anchor frames are screened with the training IoU values to obtain third training target anchor frames, whose coordinates in the anchor frame coordinate matrix are determined as the training samples.
Still referring to fig. 2, in practical application, the implementation process of the step 1) is as follows:
4: mask convolution network training data generation
4.1: anchor frame M A Screening of (C)
Extracting the prediction frame foreground/background probability matrix R 'generated in the step (3.3)' L All the foreground probabilities in the model are foreground probability matrix beta 1 By combining beta 1 For M A Ordering and reserving alpha before ranking 1 The index value is recorded. According to experience alpha 1 6000, according to the index value K' 6000 、P' 6000 、A' 6000 . Wherein K' 6000 M representing foreground probability top 6000 A A matrix of foreground probability values; p'. 6000 RPN network generated prediction box representing foreground probability top 6000 relative to M A Is a matrix of offset values, P' 6000 =[P 1 ,...,P 6000 ],P i =[d x ,d y ,log(d w ),log(d h )](1≤i≤6000);A' 6000 Representing M representing front 6000 of foreground probability ranking A Is a 'of the coordinates of A' 6000 =[A 1 ,A 2 ,...,A 6000 ],A i =[x 1 ,y 1 ,x 2 ,y 2 ](1≤i≤6000)。
4.2: by means of P' 6000 For A' 6000 Make adjustments
A i =[x 1 ,y 1 ,x 2 ,y 2 ](1≤i≤6000),P i =[d x ,d y ,log(d w ),log(d h )](1.ltoreq.i.ltoreq.6000). Pair A using equation (10) i Converting the format.
Wherein h and w each represent A i Height and width of (x) center ,y center ) Representation A i Is defined by the center coordinates of the lens. Using P i For A after transformation i And (3) performing adjustment, and completing the adjustment operation by using the formula (11).
Will be adjusted A i Converted to the original format by the formula (12)
4.3: Boundary correction of A'_6000
Some anchor frame coordinates in A'_6000 are greater than 1 or less than 0; formula (13) ensures that the anchor frame coordinates lie in the range [0, 1].
4.4: Screening of A'_6000
The anchor frame with the highest foreground probability β_1 is repeatedly selected from A'_6000, and its IoU I_U with each of the anchor frames ranked after it is calculated with formula (4). If I_U ≥ t (0 ≤ t ≤ 1), the two anchor frames select the same target and only the one with the higher β_1 is kept; otherwise both anchor frames are kept. If the number of anchor frames satisfying the condition is greater than a certain value α'_1, the anchor frames with smaller β_1 are removed until the number reaches α'_1; if it is less than α'_1, 0 is padded until the number is α'_1. Empirically, t is set to 0.7 and α'_1 to 2000. The screened anchor frames are saved in the matrix A'_2000.
4.5: will A' 2000 Divided into positive and negative training samples
4.5.1: from A' 2000 Selecting eta anchor frames as samples
1) Let η be 200. Obtaining the marking data J of the participated training photos from the step (1) i Obtaining a real bounding box matrix G from (1.5) B A 'is calculated by the formula (4)' 2000 And G B Cross ratio I of (1) U Stored in a two-dimensional matrix M with dimensions (2000, n), n=len(G B )。
2) Preservation of A with a one-dimensional matrix M' of dimension 2000 i (1.ltoreq.i.ltoreq.2000)I of (2) U M '= { M' 1 ,...,M' 2000 },0≤M' i And is less than or equal to 1. If M' i More than or equal to 0.5, M' i A corresponding to subscript i i Is a positive sample; if M' i < 0.5, M' i A corresponding to subscript i i Is a negative sample. With I p Recording the index value of positive sample, I n Negative sample subscript values are recorded.
3) The number of samples of each photo participating in training is set as e, and the proportion of positive samples participating in training is set as f '(0 < f' < 1). If I p The number of recorded subscripts is greater than e×f', then the number of recorded subscripts is randomly determined from I p E x f' subscripts are selected; if I p If the number of recorded subscripts is smaller than e×f ', 0 is filled up until the number is e×f ', and the subscript value of the positive sample is used as matrix I ' P And (5) preserving. The selection of the negative sample is consistent with that of the positive sample, and the lower standard value of the negative sample is used as matrix I' n And (5) preserving. According to I' P The subscript of (2) obtains positive sample anchor frame data by using a positive sample matrix R p The mixture is preserved and is then processed,according to I' n Obtaining negative sample anchor frame data by using a negative sample matrix R n And (5) preserving.
4.5.2: is a positive sample R p Matching a true bounding box
According to I' P The index value of (2) gets its corresponding I from M U Using a two-dimensional matrix M P Preservation from M P Find I in each row of (1) U Subscript of maximum value using one-dimensional matrix I m The mixture is preserved and is then processed, according to I m Values of (2)From G respectively B 、G C And G M Obtain R p Corresponding real bounding box matrix R b ,/>Real class name matrix R c And a real mask matrix R m . Conversion of +.>And->Is a data format of (a). Then calculate +. >And->Offset value ρ between i
Wherein, (g) x ,g y ) Sum (g) h ,g w ) Respectively representCenter coordinates and width and height of (a); (r) x ,r y ) Sum (r) h ,r w ) Respectively indicate->Center coordinates and width and height, ρ i =[d y ,d x ,d w ,d h ]Offset matrix ρ= [ ρ ] 12 ,...,ρ e*f' ]。
4.6: generating training data for mask convolution network
The bounding boxes, class names, masks and offset values of the negative samples R_n are all filled with 0 and are combined with R_b, R_c, R_m and ρ to form the new matrices R'_b, R'_c, R'_m and ρ'. ρ', R'_m and R'_c are used to calculate the loss values of the mask convolution network.
2) Constructing a mask convolution network and training it with the training samples to obtain a trained mask convolution model. The mask convolution network comprises a classification regression network and a mask network in parallel. During training, the feature map matrix, the mask feature map matrix, the labeling information and the real bounding box matrix are input into the classification regression network, a classification network loss value is calculated from the output predicted mask categories and the real categories, and training is performed with the goal of minimizing the classification network loss value to obtain the trained classification regression network; the feature map matrix, the mask feature map matrix, the labeling information and the real bounding box matrix are input into the mask network, a mask loss value is calculated from the output predicted masks and the real masks, and training is performed with the goal of minimizing the mask loss value to obtain the trained mask network. The trained mask convolution model comprises the trained classification regression network and the trained mask network in parallel.
Still referring to fig. 2, in practical application, the implementation process of the step 2) is as follows:
5: mask convolution network construction and training
5.1: construction and training of classification regression network
5.1.1: construction of classification regression network
The mask convolution network has two parallel branch networks, namely a classification regression network and a mask network. The positive samples R_p and the M_R obtained in (2.2) are input into the classification regression network, and formula (15) determines which feature map in M_R each sample of R_p belongs to,
where w and h denote the width and height of a sample in R_p, ε indicates that the sample belongs to feature map P_ε, and k_0 denotes the feature level mapped to when w = 224 and h = 224, generally taken as 4, i.e. corresponding to feature map P_4. From the found feature map P_ε, the region corresponding to the sample's coordinates is taken out and pooled with bilinear interpolation. The specific procedure is shown in Table 4.
TABLE 4 construction of a Classification regression network
where k denotes the size of the convolution kernel, i the number of input channels and o the number of output channels. TimeDistributed is a layer wrapper applied to conv2d, BN, dense and softmax, i.e. it performs the same operation on all 200 training samples, yielding 200 results. PyramidROIAlign is a bilinear interpolation layer used to scale feature maps down to a size of 7×7. mrcnn_class_logits is stored in the matrix E, E = [E_1, E_2, ..., E_200], E_i = [ω_1, ω_2, ..., ω_θ]; E is used to calculate the classification loss of the mask convolution network. mrcnn_bbox is stored in the matrix H, H = [H_1, H_2, ..., H_200], H_i = [θ, d_x, d_y, log(d_w), log(d_h)]; H is used to calculate the regression loss of the mask convolution network. mrcnn_class_probs gives the probability that each sample belongs to a certain class.
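Formula (15) is not reproduced legibly above; the description (k_0 corresponding to a 224×224 region and feature map P_4) matches the standard FPN level-assignment rule, which is sketched here as an assumption:

import math

def roi_level(w, h, k0=4, k_min=2, k_max=5):
    """Assign a region of width w and height h to a pyramid level P_k.
    Assumed standard FPN rule: k = k0 + log2(sqrt(w*h)/224), clamped to [k_min, k_max]."""
    k = k0 + math.log2(math.sqrt(w * h) / 224)
    return int(min(max(round(k), k_min), k_max))

print(roi_level(224, 224), roi_level(56, 56), roi_level(900, 900))   # 4 2 5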
5.1.2: calculating a loss function of a categorized regression network
The labeled category information G_C from (1.2), R'_c and E are input. The values of E are first mapped into [0, 1] with formula (16), and the loss function is then calculated with formula (17). G_C eliminates the prediction loss of categories that are not among the categories present in the current image,
where φ denotes the number of samples participating in the loss calculation.
5.1.3: calculating a loss function of a categorized regression network
Inputs ρ ', H and R' c The loss function value is calculated using equation (9).
5.2: mask network construction and training
5.2.1: mask network construction
The mask network was constructed as shown in table 5.
TABLE 5 construction of mask network
where mrcnn_mask, denoted π, is used to calculate the mask loss value (mask_loss) of the mask convolution network.
5.2.2: mask network loss function calculation
R'_c, π and R'_m are input. For each sample, the mask network generates an output feature map of shape θ×28×28,
that is, θ binary mask images of size 28×28, where a masked pixel has value 1 and an unmasked pixel has value 0. The shape of π is first changed to (1, 200, θ, 28, 28). The masks in R'_m are then scaled to a shape of 28×28, and a sigmoid function is applied to every pixel point of R'_m and π so that the values lie in the range [0, 1]. From R'_c, the indices whose category is not 0 are found, giving an index matrix i and a matrix j of the corresponding category values. According to index i, the relevant masks are found from R'_m; according to i and j, π' is taken from π. The loss function is calculated by formula (18),
where ψ denotes the number of elements of π' and π'_k (π'_k ∈ π') denotes an element of π'. The mask convolution network parameters are trained according to the back propagation of the loss function, and the trained parameters are saved.
5.3: model evaluation and model fixing
The test set X'' = {D, J} is input into the model, where D = {d_1, d_2, ..., d_n'} and d_i (1 ≤ i ≤ n') denotes an unlabeled photo, i.e. an original photo in the test set X''; J = {J_1, J_2, ..., J_n'}, where J_i (1 ≤ i ≤ n') denotes the labeling information of d_i, and n' is the number of photos in the test set. d_i is input into the model to calculate its prediction categories Y_C, anchor frames Y_B, class probabilities Y_S and masks Y_M. Y_B, Y_S and Y_M are partitioned according to the prediction categories Y_C, and the IoU I_U between the predicted anchor frames of each category and the real labeled frames of the corresponding category is calculated. If I_U < ε, the anchor frame is marked as F_P; if I_U ≥ ε, it is marked as T_P, where ε (0 < ε < 1) is a threshold. The anchor frames of the same category are sorted by Y_S; traversing the mark of each anchor frame in turn, the numbers of T_P and F_P that have occurred so far are denoted A_TP and A_FP, and the precision P and recall R are calculated by formula (19) and formula (20).
The average precision (AP) is calculated by formula (21),
where j denotes the j-th category in d_i, N denotes the number of instances of the j-th category, P(k) denotes the precision when k detection targets have been identified, and ΔR(k) denotes the change in recall when the number of identified targets goes from k−1 to k.
The mean of the average precisions of all categories in d_i (mAP) is calculated by formula (22),
where m denotes the number of categories present in d_i. The mAPs of different models are calculated and compared, and the model with the highest mAP value is taken as the model whose parameters are finally fixed.
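A compact NumPy sketch of the precision, recall and AP computation of formulas (19)-(21), using the running counts A_TP and A_FP described above; the non-interpolated form of the AP sum is an assumption.

import numpy as np

def average_precision(tp_flags, num_gt):
    """tp_flags: detections of one category, sorted by confidence Y_S, 1 for T_P and 0 for F_P.
    num_gt: number of ground-truth instances N of that category."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    a_tp = np.cumsum(tp_flags)                  # A_TP after each detection
    a_fp = np.cumsum(1 - tp_flags)              # A_FP after each detection
    precision = a_tp / (a_tp + a_fp)            # formula (19)
    recall = a_tp / num_gt                      # formula (20)
    delta_r = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(precision * delta_r))   # formula (21): sum of P(k) * dR(k)

ap_worker = average_precision([1, 1, 0, 1, 0], num_gt=4)
ap_vest = average_precision([1, 0, 1], num_gt=2)
print(ap_worker, (ap_worker + ap_vest) / 2)     # the second value is the mAP over 2 categories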
The training process of the above-mentioned region candidate network, classification regression network and mask network is shown in fig. 5.
In practical application, when target detection is performed on a target scene image, in step 106 and step 107, the sample to be detected and the mask feature map matrix to be detected are input into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image; then, whether the worker wears the safety guard in the target scene image is determined according to the intersection relation between the worker mask and the safety guard mask of the target scene image. Still referring to fig. 2, the specific detection process is as follows:
6: target detection according to trained network parameters
The trained parameters are loaded and the picture to be detected is input; according to the parameters, the model predicts the frame matrix, mask matrix, category matrix and category probability values of the targets. Each mask has shape 28×28 and is scaled to the size of its corresponding frame so that it fits the contour of the object in the picture to be detected. The frequency of each label in the category matrix is counted; for example, the number "1" denotes a worker frame, "2" a safety helmet frame, "3" a reflective garment, and "0" the background. The matrices T_1, T_2 and T_3 store the masks of workers, safety helmets and reflective garments respectively. The intersection relation Ω between the masks of the safety protective clothing (safety helmet, reflective garment) and the worker masks is calculated with formula (23),
where the first pair of coordinates denote the lower-left and upper-right corners of the safety protector mask and the second pair denote the lower-left and upper-right corners of the worker mask; (x'_1, y'_1) and (x'_2, y'_2) denote the lower-left and upper-right corners of their intersection, I denotes the area of the intersection, and S denotes the area of the safety protector mask.
If len(T_1) = 0, no worker is present in the photo. If len(T_1) > 0 and len(T_{2,3}) = 0, the workers in the photo are not wearing the safety protectors. If len(T_1) > 0 and len(T_{2,3}) > 0, formula (23) is evaluated for T_1 with T_2 and T_3 in turn; each time, the item that maximizes Ω is selected and the corresponding item is deleted from T_2 or T_3. If Ω < π (π ∈ (0, 1)), the worker does not intersect the safety helmet or reflective garment, i.e. the helmet or reflective garment is detected but the protective clothing is not worn on the worker, and the number of unqualified workers n_1 is increased by 1; if Ω ≥ π, the worker intersects the safety helmet or reflective garment, i.e. the helmet or reflective garment is detected and worn on the worker, and the qualified number n_2 is increased by 1. If n_1 > 0, there is a worker in the photo who is not properly wearing the safety protector; if n_2 = len(T_1), all workers in the photo are wearing the safety protectors.
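The decision logic around formula (23) can be sketched as follows; the boxes are the [x1, y1, x2, y2] corners of the worker and protector masks, and the value of the threshold π is assumed for illustration.

import numpy as np

def omega(ppe_box, worker_box):
    """Formula (23): intersection area over the area S of the safety protector mask box."""
    x1, y1 = max(ppe_box[0], worker_box[0]), max(ppe_box[1], worker_box[1])
    x2, y2 = min(ppe_box[2], worker_box[2]), min(ppe_box[3], worker_box[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    s = (ppe_box[2] - ppe_box[0]) * (ppe_box[3] - ppe_box[1])
    return inter / s

def check_wearing(workers, ppe_items, pi=0.5):
    """workers / ppe_items: lists of boxes (T_1 and T_2 + T_3). Returns True when every
    worker matches some protector with omega >= pi. The value of pi is assumed."""
    if not workers:
        return False                      # no worker in the photo
    remaining = list(ppe_items)
    qualified = 0
    for w in workers:
        if not remaining:
            break
        scores = [omega(p, w) for p in remaining]
        best = int(np.argmax(scores))     # pick the protector maximising omega, then remove it
        if scores[best] >= pi:
            qualified += 1
        remaining.pop(best)
    return qualified == len(workers)

print(check_wearing([[0, 0, 100, 200]], [[20, 150, 80, 200]]))   # helmet lies on the worker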
A specific example is given below to further explain the method for detecting the wearing of safety protectors; this example is reflective garment wearing detection based on Mask R-CNN, and comprises the following steps:
(1) Data preprocessing: and taking a picture of the working scene of the worker, sampling the picture, and filtering the sampled picture to obtain a picture set X of the safety protector and the contour of the worker. Labeling X, and forming a training set X 'and a test set X'. And scaling the pictures in the X 'into uniform size, and extracting the labeling information J of the X'.
(2) Feature extraction: a ResNet-101+FPN model is built, and the pictures in the training set X' are input into the model to obtain the feature map matrices R_F and M_R. A series of frames (i.e. anchor frames) M_A is generated with each pixel point of R_F as the center, and the coordinates of the generated anchor frames are saved.
(3) Construction and training of the region candidate network (Region Proposal Network, RPN): the region candidate network is built, the feature map matrix R_F is input into it, and the internal parameters and a foreground probability matrix β_1 = {β_1, β_2, ..., β_n} (0 ≤ β_i ≤ 1, 1 ≤ i ≤ n) are calculated.
(4) Generating training samples for the mask convolution model: the anchor frames in M_A are sorted in descending order of the probability values in β_1, and the top α_1 anchor frames are kept. M_A is then further reduced with the non-maximum suppression method until α'_1 anchor frames remain. Non-maximum suppression is a loop: the anchor frame with the highest foreground probability β_i is selected from M_A and the intersection-over-union I_U with each lower-ranked anchor frame is computed in turn, where I_U is the ratio of the area of the intersection of two anchor frames to the area of their union. If I_U ≥ a threshold t (0 ≤ t ≤ 1), only the anchor frame with the higher foreground probability is kept; otherwise both anchor frames are kept. The anchor frames reduced by the non-maximum suppression method are divided into positive and negative samples according to the set sample number and positive/negative sample ratio to form the training samples R_P (a sketch of this suppression step is given after step (6)).
(5) Building and training the mask convolution model: the mask convolution model is built, trained with R_P, and its internal parameters are calculated. The accuracy of models with different hyper-parameters is evaluated on the test set X'', and the model with the highest accuracy is fixed as the final model.
(6) Based on the models trained in steps (3)-(5), a photo to be detected is input, the target detection result is output, logical judgment is performed on that result, and the judgment of whether the workers in the photo wear the safety protection tool is output.
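The following is a minimal NumPy sketch of the non-maximum suppression step described in (4), assuming boxes stored as (x1, y1, x2, y2) rows and a vector of foreground probabilities; the function name nms and the keep_max parameter (standing in for α'_1) are illustrative assumptions.

```python
import numpy as np

def nms(boxes, scores, t=0.7, keep_max=None):
    """Plain non-maximum suppression: repeatedly keep the highest-scoring
    box and drop lower-ranked boxes whose IoU with it is >= t.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) foreground probs."""
    order = np.argsort(scores)[::-1]                  # descending by score
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if keep_max is not None and len(keep) >= keep_max:
            break
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)   # I_U
        order = rest[iou < t]        # keep only boxes with IoU below t
    return keep
```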
The implementation of this example is described in detail below.
1. Preprocessing
According to (1.1), the worker working scenes are photographed and sampled, and overexposed, underexposed, blurred and incomplete photos are filtered out by manual selection to form the photo set X.
According to (1.2), the photo set X is labeled with Labelme to construct the labeled data set X_{n×2} = {D, J}, where D = {d_1, d_2, ..., d_n}, J = {J_1, J_2, ..., J_n} and J_i = {G_D, G_S, G_M, G_C}. Depending on the value of the category label, the category of the i-th mask is either worker or reflective clothing. Examples of the labeling information are shown in Table 6.
TABLE 6 labeling information examples
According to (1.3), d_i is scaled and padded to 1024×1024 to form the new photo d'_i, and the padding information is recorded, as shown in Table 7 (a minimal sketch of this scale-and-pad step follows Table 7).
TABLE 7 filling information example Table
Lower-left abscissa  Lower-left ordinate  Upper-right abscissa  Upper-right ordinate  Scaling ratio
112 112 912 912 1.6
112 112 912 912 1.6
0 172 1024 852 1.36
112 230 912 1001 2.78
0 127 1024 896 1.538
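Below is a minimal Pillow-based sketch of the scale-and-pad step of (1.3); the function name letterbox, the centred padding and the returned record fields are illustrative assumptions, since the description only states that the scaling and padding information (cf. Table 7) is recorded so the output can later be mapped back to the original size.

```python
from PIL import Image

def letterbox(img, target=1024):
    """Scale an image to fit a target square, pad the remainder, and record
    the scale and padding needed to undo the transform on output boxes/masks."""
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)

    canvas = Image.new("RGB", (target, target))    # black padding
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    canvas.paste(resized, (pad_x, pad_y))
    return canvas, {"scale": scale, "pad_x": pad_x, "pad_y": pad_y}
```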
According to (1.4), the masks in J_i are scaled by the proportion s_c, and the real bounding boxes G_B are extracted from the scaled masks by the following rule.
For the mask, if any pixel in a column (vertical direction) is 1, all values in that column are set to 1. Scanning from left to right, the position of the first 1 is recorded as x_1 and the position of the last 1 as x_2.
Similarly, if any pixel in a row (horizontal direction) is 1, all values in that row are set to 1. Scanning from top to bottom, the position of the first 1 is recorded as y_2 and the position of the last 1 as y_1. Then (x_1, y_1) and (x_2, y_2) are the lower-left and upper-right corner coordinates of the real bounding box, respectively. This is repeated n times to obtain the coordinates of all bounding boxes from G_M (a sketch of this extraction is given after Table 8). The information records are shown in Table 8.
TABLE 8 real bounding box coordinate example Table
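The following is a minimal NumPy sketch of this bounding-box extraction rule, assuming a 2-D 0/1 mask whose row index increases downward, so the first occupied row gives y_2 and the last gives y_1; the function name mask_to_bbox is illustrative.

```python
import numpy as np

def mask_to_bbox(mask):
    """Extract the real bounding box (G_B) from a binary mask.

    mask: 2-D array of 0/1 values; returns lower-left and upper-right corners."""
    cols = mask.any(axis=0)          # columns containing at least one 1
    rows = mask.any(axis=1)          # rows containing at least one 1
    xs = np.where(cols)[0]
    ys = np.where(rows)[0]
    if xs.size == 0 or ys.size == 0:
        return None                  # empty mask: no bounding box
    x1, x2 = xs[0], xs[-1]           # first / last occupied column
    y2, y1 = ys[0], ys[-1]           # first (top) row -> y_2, last row -> y_1
    return (x1, y1), (x2, y2)
```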
2: feature extraction
According to (2.1), the feature maps of photo d'_i are extracted with ResNet-101. d'_i is 1024×1024 in size with shape (1, 1024, 1024, 3), where 3 is the number of channels of the color picture. The feature maps extracted by ResNet are C_1, C_2, C_3, C_4 and C_5. According to (2.2), the feature pyramid network performs feature fusion on C_1-C_5 to form P_2, P_3, P_4 and P_5. P_5 is max-pooled with a stride of 2 to obtain P_6, whose shape is (1, 16, 16, 256), i.e. size 16×16 with 256 channels. Combining P_2 through P_6 gives the RPN feature map matrix R_F and the mask convolution network feature map matrix M_R, where R_F = [P_2, P_3, P_4, P_5, P_6] is input to the region candidate network and M_R = [P_2, P_3, P_4, P_5] is input to the mask convolution network (a sketch of the fusion step follows).
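The PyTorch sketch below shows the usual top-down FPN fusion, assuming the standard ResNet-101 channel counts (256/512/1024/2048 for C_2-C_5) and a 256-channel output; it fuses C_2-C_5 into P_2-P_5 and derives P_6 by stride-2 max pooling, as in the text. It is an illustration of the technique (C_1 is typically not fused), not the patented network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """1x1 lateral convs bring C2..C5 to a common channel count, a top-down
    upsample-and-add pass builds P2..P5, and a stride-2 max pool of P5 adds P6."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        # top-down pathway: upsample the coarser map and add it to the finer one
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [s(l) for s, l in zip(self.smooth, laterals)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)   # extra level for the RPN
        return p2, p3, p4, p5, p6
```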
According to (2.3), the anchor frame sizes α = [32, 64, 128, 256, 512] and aspect ratios r = [0.5, 1, 2] are set, and the anchor frames are generated. Taking the picture numbered 10 as an example, the anchor frame coordinate information is shown in Table 9 (a sketch of the generation step is given after the table).
TABLE 9 Anchor frame coordinate information
Lower-left abscissa  Lower-left ordinate  Upper-right abscissa  Upper-right ordinate
-22.627 -11.313 22.627 11.313
-16 -16 16 16
...... ..... ...... ......
597.961 778.981 1322.039 1141.0193
704 704 704 704
778.981 597.961 1141.019 1322.039
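A minimal NumPy sketch of such anchor generation is shown below. It assumes the aspect ratio is interpreted as height/width and anchor centres placed at pixel positions times the feature-map stride, which reproduces the first row of Table 9 (±22.627, ±11.313) for size 32 and ratio 0.5 at the top-left cell; the exact conventions of the patented method may differ.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, sizes=(32, 64, 128, 256, 512),
                     ratios=(0.5, 1, 2)):
    """Generate anchor frames (x1, y1, x2, y2) for one feature map.

    Every size/ratio combination is centred on each feature-map cell; in an
    FPN setting each pyramid level usually gets a single size, but all
    combinations are emitted here for simplicity."""
    anchors = []
    for yi in range(feat_h):
        for xi in range(feat_w):
            cx, cy = xi * stride, yi * stride      # cell centre in image coords
            for s in sizes:
                for r in ratios:                   # r interpreted as h/w
                    w = s / np.sqrt(r)
                    h = s * np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.asarray(anchors)

# e.g. size 32, ratio 0.5 at the top-left cell gives (-22.63, -11.31, 22.63, 11.31)
```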
According to (2.4), the real offset bounding box matrix I'_B and the anchor frame label matrix R_PN are computed, as shown in Tables 10 and 11, respectively (a sketch of the offset encoding follows Table 10).
TABLE 10 offset of anchor boxes from real bounding boxes
d_x  d_y  d_w  d_h
-0.50 -0.22 0.65 0.88
1.19 0.14 1.20 -0.23
0.31 0.14 1.20 -0.23
-0.58 0.14 1.20 -0.23
-1.46 0.14 1.20 -0.23
0 0 0 0
...... ...... ...... ......
0 0 0 0
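The sketch below shows the standard box-regression encoding (d_x, d_y, d_w, d_h) of a ground-truth box relative to an anchor, the usual Faster R-CNN parameterization that Table 10 appears to tabulate; the patent defines its own formula earlier in the description, so treat this as an illustration under that assumption.

```python
import numpy as np

def encode_offsets(anchors, gt_boxes):
    """Encode ground-truth boxes relative to anchors as (dx, dy, dw, dh).

    anchors, gt_boxes: (N, 4) arrays of (x1, y1, x2, y2), matched row by row."""
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    acx = anchors[:, 0] + 0.5 * aw
    acy = anchors[:, 1] + 0.5 * ah

    gw = gt_boxes[:, 2] - gt_boxes[:, 0]
    gh = gt_boxes[:, 3] - gt_boxes[:, 1]
    gcx = gt_boxes[:, 0] + 0.5 * gw
    gcy = gt_boxes[:, 1] + 0.5 * gh

    dx = (gcx - acx) / aw              # centre shift, normalized by anchor size
    dy = (gcy - acy) / ah
    dw = np.log(gw / aw)               # log scale change
    dh = np.log(gh / ah)
    return np.stack([dx, dy, dw, dh], axis=1)
```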
TABLE 11 R_PN labeling data
Anchor frame id Marking information
250821 1
251013 1
259428 1
259424 1
259620 1
142827 -1
144771 -1
...... ......
261888 0
3: construction and training of RPN
The RPN is constructed according to (3.1). According to (3.2), the first half of the RPN convolves each of [P_2, P_3, P_4, P_5, P_6] in R_F in turn, changing the number of channels to 512 while keeping the spatial size unchanged, which forms the shared layer. According to (3.3), the foreground/background prediction data R'_C is obtained; the foreground probability β_1 and background probability β_2 are calculated from R'_C with formula (7) and saved in the matrix R'_L.
According to (3.4), the predicted offsets are obtained and saved in the prediction frame offset matrix R'_B.
According to (3.5), the loss function of the RPN is computed and back-propagated to train the RPN parameters (a sketch of the RPN heads is given below).
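The following PyTorch sketch shows an RPN head of the kind described above: a shared 3×3 convolution maps each pyramid level to 512 channels with the size unchanged, a 1×1 convolution predicts the two foreground/background scores per anchor, and a parallel 1×1 convolution predicts the four box offsets per anchor. The channel counts follow the text; the rest is the usual RPN layout and not necessarily the patented configuration.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Shared conv + parallel classification and regression heads."""

    def __init__(self, in_channels=256, shared_channels=512, anchors_per_cell=3):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, shared_channels, 3, padding=1)
        self.cls = nn.Conv2d(shared_channels, anchors_per_cell * 2, 1)   # -> R'_C
        self.reg = nn.Conv2d(shared_channels, anchors_per_cell * 4, 1)   # -> R'_B

    def forward(self, feature_maps):
        scores, deltas = [], []
        for p in feature_maps:                 # [P2, P3, P4, P5, P6]
            shared = torch.relu(self.shared(p))
            scores.append(self.cls(shared))    # softmax over 2 gives beta_1, beta_2
            deltas.append(self.reg(shared))
        return scores, deltas
```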
4: generating mask convolution network training data
The anchor frames generated in (2.3) are screened according to (4.1).
The anchor frames sorted in (4.1) are adjusted according to (4.2) and (4.3).
The anchor frames screened in (4.3) are re-screened according to (4.4).
According to (4.5), the anchor frames from (4.4) are divided into positive and negative samples and saved in the sample anchor frame matrix R_P.
5: mask convolution network construction and training
The classification regression network is built according to (5.1.1) and trained on the input data; its loss function is calculated according to (5.1.2) and back-propagated to compute the parameters of the classification regression network. The mask network is built according to (5.2.1) and trained on the input data; its loss function is calculated according to (5.2.2) and back-propagated to compute the parameters of the mask network. Model evaluation and model fixing are performed according to (5.3): using the test set X'' obtained in (1.2), 7 photos are randomly selected for testing under different thresholds, and the evaluation results are shown in Table 12.
TABLE 12 mAP values at different thresholds
Threshold = 0.6 Threshold = 0.7 Threshold = 0.8 Threshold = 0.9
Test photo 1 1 1 1 1
Test photo 2 1 1 1 1
Test photo 3 1 1 1 1
Test photo 4 1 1 1 0
Test photo 5 1 1 1 1
Test photo 6 1 1 1 0.5
Test photo 7 1 1 1 0.5
mAP 1 1 1 0.72
6: target detection according to trained network parameters
The trained parameters are loaded and the picture to be detected is input; the model predicts the frame matrix, mask matrix, category matrix and category probability values of the targets according to these parameters. The frequency of each label in the category matrix is counted: the numeral "1" denotes a worker frame, "2" a reflective garment frame, and "0" the background. The worker masks and reflective clothing masks are stored in the matrices T_1 and T_2, respectively. The intersection relation Ω between the reflective clothing and worker masks is calculated with formula (23). Ten photos to be detected are randomly selected, and the detection results are shown in Table 13.
TABLE 13 model test results
In the safety protection tool wearing detection method described above, the COCO format is used as the format of the training data, which improves training and prediction; based on a deep learning model, feature maps of different sizes are extracted from the photo with the ResNet-101 residual neural network, and adjacent feature maps are fused with the feature pyramid network, which improves small-object detection; the potential relations between the feature maps and the labels in the training data are mined, the safety protection tool wearing detection model (the trained mask convolution model) is built from them, and the output of the model prediction is fully used to analyze the relations between the targets to be detected, which effectively improves detection accuracy and greatly reduces labor cost. The method has the following advantages:
(1) The invention adopts a Mask R-CNN network, which can accurately identify the categories of safety protection tools and persons, mark object regions with frames, and extract the object regions from the picture to obtain the contour information of the targets and thus more detailed information about them.
(2) The invention is tolerant of the size of the input picture, i.e. it places no fixed requirement on it. The input picture is scaled and padded to a uniform size, the scaling and padding information is recorded, and the picture size is restored from that recorded information when the data is output.
(3) The invention adopts a residual network (ResNet-101) as the backbone network to extract image features; since the residual structure does not increase the number of model parameters, the difficulty of gradient vanishing and of training is reduced. When calculating the positional relation between the safety protection tool mask and the person mask, the mask information output by the Mask R-CNN network is fully utilized, which improves the detection efficiency for safety protection tools.
(4) The invention adopts a feature pyramid network (FPN) to fuse the feature maps, so that the information of two adjacent feature maps is fused together. The FPN mainly addresses the multi-scale problem in object detection; with a simple change of network connections and essentially no increase in the computation of the original model, it greatly improves the detection of small objects and thus widens the detectable field of view of the method.
(5) The invention applies RoIAlign to resize each RoI to a fixed size, extracting the corresponding features of each resized RoI from the feature map by bilinear interpolation. This replaces the rounding operation in Faster R-CNN, reduces the introduced error, allows the mask to cover the original image more accurately, and improves the detection accuracy for safety protection tools (a sketch of the interpolation is given below).
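Below is a minimal NumPy sketch of bilinear interpolation at a single fractional point, the operation RoIAlign uses in place of rounding; RoIAlign averages several such samples per output bin. The function name and the assumption that (x, y) lies inside the feature map are illustrative.

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Bilinearly interpolate a 2-D feature map at fractional (x, y).

    Assumes 0 <= x <= W-1 and 0 <= y <= H-1."""
    h, w = feature.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom
```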
The present invention also provides a system for detecting the wearing of a safety protector, see fig. 6, the system comprising:
an image acquisition module 601 is configured to acquire an image of a target scene.
And the feature extraction module 602 is configured to perform feature extraction on the target scene image by using a ResNet-101 residual neural network and a feature pyramid network, so as to obtain a feature map matrix to be detected and a mask feature map matrix to be detected.
And an anchor frame generating module 603, configured to generate an anchor frame to be tested with each pixel point in the feature map matrix to be tested as a center, and store coordinates of each anchor frame to be tested by adopting the anchor frame coordinate matrix to be tested.
The foreground probability calculation module 604 is configured to input the feature map matrix to be tested into a trained area candidate network, so as to obtain the foreground probability of each anchor frame to be tested.
And the sample to be detected generating module 605 is configured to reduce coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and the non-maximum suppression method of the anchor frame to be detected, so as to obtain a sample to be detected.
The mask detection module 606 is configured to input the sample to be detected and the mask feature map matrix to be detected into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image;
the safety guard wearing detection module 607 is configured to determine whether the worker wears the safety guard in the target scene image according to an intersection relationship between the worker mask and the safety guard mask of the target scene image.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical and similar parts, the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points may be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core ideas; modifications made by those of ordinary skill in the art in light of these teachings remain within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A method of detecting the wearing of a safety brace, comprising:
acquiring a target scene image;
extracting features of the target scene image by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected;
generating anchor frames to be detected by taking each pixel point in the feature map matrix to be detected as a center, and storing coordinates of each anchor frame to be detected by adopting an anchor frame coordinate matrix to be detected;
inputting the feature map matrix to be detected into a trained region candidate network to obtain the foreground probability of each anchor frame to be detected;
reducing coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and non-maximum suppression method of the anchor frame to be detected to obtain a sample to be detected;
inputting the sample to be tested and the mask feature map matrix to be tested into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image;
and determining whether the worker wears the safety protector in the target scene image according to the intersection relation between the worker mask and the safety protector mask of the target scene image.
2. The method for detecting the wearing of the safety protection device according to claim 1, wherein the feature extraction is performed on the target scene image by using a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected, specifically comprising:
Extracting features of the target scene image by adopting a ResNet-101 residual neural network to obtain an initial matrix to be detected; the initial matrix to be tested comprises K stages of feature graphs;
feature pyramid network is adopted to perform feature fusion on the feature graphs of each stage in the initial matrix to be detected, and K fused feature graphs are obtained;
carrying out maximum pooling operation on the fused feature images in the K stage by a set step length to obtain a newly added feature image;
determining a feature map matrix to be detected and a mask feature map matrix to be detected; the feature map matrix to be detected comprises K-stage fused feature maps and the newly added feature maps; the mask feature map matrix to be tested comprises K-stage fused feature maps.
3. The method for detecting wear of a safety brace according to claim 1, wherein the method for reducing coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and the non-maximum suppression of the anchor frame to be detected comprises:
according to the foreground probability, descending order arrangement is carried out on the corresponding anchor frames to be detected in the anchor frame coordinate matrix to be detected, and an anchor frame sequence is obtained;
Determining a first set number of anchor frames to be detected in the anchor frame sequence as first target anchor frames;
reducing the first target anchor frames to a second set number by adopting a non-maximum suppression method to obtain second target anchor frames; the second set number is smaller than the first set number;
carrying out cross-over ratio operation on the largest anchor frame to be detected and other anchor frames to be detected respectively to obtain a plurality of cross-over ratio values; the largest anchor frame to be detected is the anchor frame to be detected with the largest foreground probability in the second target anchor frame; the other anchor frames to be detected are anchor frames to be detected except the largest anchor frame to be detected in the anchor frame sequence;
and screening the second target anchor frame by adopting the intersection ratio to obtain a third target anchor frame, and determining coordinates in the coordinate matrix of the anchor frame to be detected corresponding to the third target anchor frame as a sample to be detected.
4. The method for detecting the wearing of the safety protector according to claim 1, wherein the step of inputting the sample to be detected and the mask feature map matrix to be detected into a trained mask convolution model to obtain a worker mask and a safety protector mask in the target scene image specifically comprises the following steps:
Inputting a sample to be tested and a mask feature map matrix to be tested into a trained classification regression network to obtain a frame matrix and a category matrix of a target scene image, wherein the frame matrix stores coordinate information of frames corresponding to various categories of each sample to be tested, and the category matrix stores category probabilities of frames corresponding to various categories of each sample to be tested;
selecting coordinate information of a frame corresponding to the maximum category probability of each sample to be tested from the frame matrix according to the category matrix;
determining a target matrix; the target matrix comprises coordinate information of frames corresponding to the maximum class probability of all the samples to be tested and corresponding classes;
reducing the number of corresponding samples to be tested in the target matrix by adopting a non-maximum suppression method to obtain a reduced matrix; the reduction matrix comprises coordinate information of a frame corresponding to the maximum class probability of the sample to be detected after reduction and corresponding classes;
and inputting the reduced matrix into a trained mask network to obtain a worker mask and a safety protection mask in the target scene image.
5. The method for detecting the wearing of a safety brace according to claim 1, wherein the method for determining the trained area candidate network is as follows:
Acquiring a training data set; the training data set comprises training photos and corresponding labeling information; the labeling information comprises a photo number, a photo size, mask information and mask category information;
scaling mask information in the labeling information according to a set proportion, and extracting a real boundary frame of the mask to obtain a real boundary frame matrix;
extracting features of the training photos by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix and a mask feature map matrix of each training photo;
generating an anchor frame by taking each pixel point in the feature map matrix as a center, and storing the coordinates of each anchor frame by adopting an anchor frame coordinate matrix;
calculating a real frame offset frame matrix based on the real boundary frame matrix and the anchor frame coordinate matrix;
constructing a regional generation network; the region generation network comprises a sharing layer, a first convolution layer and a second convolution layer; the first convolution layer and the second convolution layer are parallel; the sharing layer is used for keeping the size of the input characteristic diagram unchanged; the first convolution layer is used for convolving the feature images output by the sharing layer to output a prediction frame and the foreground probability of the prediction frame; the second convolution layer is used for convolving the feature map output by the sharing layer to output a predicted frame offset value;
And inputting the feature map matrix into the sharing layer, calculating a regional network loss value according to the real frame offset frame matrix, the predicted frame output by the first convolution layer and the predicted frame offset value output by the second convolution layer, and training the regional generation network by taking the minimum regional network loss value as a target to obtain a trained regional candidate network.
6. The method for detecting the wearing of a safety brace according to claim 5, wherein the method for determining the trained mask convolution model is as follows:
reducing coordinates in the anchor frame coordinate matrix based on the foreground probability and non-maximum suppression method of the prediction frame to obtain a training sample;
constructing a mask convolution network; the mask convolution network comprises a parallel classification regression network and a mask network;
inputting the feature map matrix, the mask feature map matrix, the labeling information and the real boundary box matrix into the classification regression network, calculating a classification network loss value according to the output predicted mask type and the real type, and training with the minimum classification network loss value as a target to obtain a trained classification regression network;
Inputting the feature map matrix, the mask feature map matrix, the labeling information and the real boundary box matrix into the mask network, calculating a mask loss value according to the output predicted mask and the real mask, and training with the minimum mask loss value as a target to obtain a trained mask convolution model; the trained mask convolution model includes the trained classification regression network and the trained mask network in parallel.
7. The method for detecting the wearing of the safety protector according to claim 5, wherein the feature extraction is performed on the training photographs by using a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix and a mask feature map matrix of each training photograph, and the method specifically comprises:
scaling and filling all the training photos in the training data set into uniform sizes;
and extracting the characteristics of the training pictures with the uniform size by adopting a ResNet-101 residual neutral network and a characteristic pyramid network to obtain a characteristic map matrix and a mask characteristic map matrix of each training picture with the uniform size.
8. The method for detecting the wearing of the safety protector according to claim 7, wherein the feature extraction is performed on the training photographs with uniform sizes by using a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix and a mask feature map matrix of each training photograph with uniform sizes, and the method specifically comprises the following steps:
Extracting features of the training pictures with uniform sizes by adopting a ResNet-101 residual neural network to obtain an initial training matrix; the initial training matrix comprises training feature graphs of K stages;
feature pyramid network is adopted to perform feature fusion on training feature graphs of each stage in the initial training matrix, and K stages of fused training feature graphs are obtained;
carrying out maximum pooling operation on the fused training feature images in the K stage according to a set step length to obtain a newly added training feature image;
determining a feature map matrix and a mask feature map matrix; the feature map matrix comprises K-stage fused training feature maps and the newly added training feature maps; the mask feature map matrix comprises K-stage fused training feature maps.
9. The method for detecting the wearing of the safety protection device according to claim 6, wherein the method for reducing the coordinates in the anchor frame coordinate matrix based on the foreground probability and the non-maximum suppression of the prediction frame, to obtain training samples, specifically comprises:
the corresponding anchor frames in the anchor frame coordinate matrix are arranged in a descending order according to the foreground probability of the prediction frame, and a training anchor frame sequence is obtained;
Determining a first set number of anchor frames in the training anchor frame sequence as first training target anchor frames;
reducing the first training target anchor frames to a second set number by adopting a non-maximum suppression method to obtain second training target anchor frames; the second set number is smaller than the first set number;
performing cross-correlation operation on the maximum anchor frame and other anchor frames respectively to obtain a plurality of training cross-correlation values; the largest anchor frame is the anchor frame with the largest foreground probability in the second training target anchor frame; the other anchor frames are the anchor frames except the maximum anchor frame in the training anchor frame sequence;
and screening the second training target anchor frame by adopting the training intersection ratio to obtain a third training target anchor frame, and determining coordinates in an anchor frame coordinate matrix corresponding to the third training target anchor frame as training samples.
10. A system for detecting the wearing of a safety brace, comprising:
the image acquisition module is used for acquiring a target scene image;
the feature extraction module is used for extracting features of the target scene image by adopting a ResNet-101 residual neural network and a feature pyramid network to obtain a feature map matrix to be detected and a mask feature map matrix to be detected;
The anchor frame generation module is used for generating an anchor frame to be detected by taking each pixel point in the feature map matrix to be detected as a center, and storing the coordinates of each anchor frame to be detected by adopting the anchor frame coordinate matrix to be detected;
the foreground probability calculation module is used for inputting the feature map matrix to be detected into a trained area candidate network to obtain the foreground probability of each anchor frame to be detected;
the sample to be detected generation module is used for reducing coordinates in the coordinate matrix of the anchor frame to be detected based on the foreground probability and the non-maximum value suppression method of the anchor frame to be detected to obtain a sample to be detected;
the mask detection module is used for inputting the sample to be detected and the mask feature map matrix to be detected into a trained mask convolution model to obtain a worker mask and a safety protection mask in the target scene image;
the safety protection tool wearing detection module is used for determining whether a worker wears the safety protection tool or not in the target scene image according to the intersection relation between the worker mask and the safety protection tool mask of the target scene image.
CN202110887741.6A 2021-08-03 2021-08-03 Wearing detection method and system for safety protection tool Active CN113792584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110887741.6A CN113792584B (en) 2021-08-03 2021-08-03 Wearing detection method and system for safety protection tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110887741.6A CN113792584B (en) 2021-08-03 2021-08-03 Wearing detection method and system for safety protection tool

Publications (2)

Publication Number Publication Date
CN113792584A CN113792584A (en) 2021-12-14
CN113792584B true CN113792584B (en) 2023-10-27

Family

ID=79181338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110887741.6A Active CN113792584B (en) 2021-08-03 2021-08-03 Wearing detection method and system for safety protection tool

Country Status (1)

Country Link
CN (1) CN113792584B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111328A (en) * 2019-05-16 2019-08-09 上海中认尚科新能源技术有限公司 A kind of blade crack of wind driven generator detection method based on convolutional neural networks
CN111210443A (en) * 2020-01-03 2020-05-29 吉林大学 Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111815577A (en) * 2020-06-23 2020-10-23 深圳供电局有限公司 Method, device, equipment and storage medium for processing safety helmet wearing detection model
CN112464701A (en) * 2020-08-26 2021-03-09 北京交通大学 Method for detecting whether people wear masks or not based on light weight characteristic fusion SSD

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037051B2 (en) * 2018-11-28 2021-06-15 Nvidia Corporation 3D plane detection and reconstruction using a monocular image


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on an image description method based on safety helmet wearing detection; Xu Shoukun; Ni Chuhan; Ji Chenchen; Li Ning; Journal of Chinese Computer Systems (04); full text *

Also Published As

Publication number Publication date
CN113792584A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
WO2020253629A1 (en) Detection model training method and apparatus, computer device, and storage medium
CN108875600A (en) A kind of information of vehicles detection and tracking method, apparatus and computer storage medium based on YOLO
CN111126325B (en) Intelligent personnel security identification statistical method based on video
CN104978567B (en) Vehicle checking method based on scene classification
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
CN105809651B (en) Image significance detection method based on the comparison of edge non-similarity
CN107862698A (en) Light field foreground segmentation method and device based on K mean cluster
CN105844621A (en) Method for detecting quality of printed matter
CN108764325A (en) Image-recognizing method, device, computer equipment and storage medium
CN112560675B (en) Bird visual target detection method combining YOLO and rotation-fusion strategy
CN108629319B (en) Image detection method and system
CN107230203A (en) Casting defect recognition methods based on human eye vision attention mechanism
CN109559324A (en) A kind of objective contour detection method in linear array images
CN111445459A (en) Image defect detection method and system based on depth twin network
CN109685045A (en) A kind of Moving Targets Based on Video Streams tracking and system
CN106203237A (en) The recognition methods of container-trailer numbering and device
CN109949227A (en) Image split-joint method, system and electronic equipment
CN112101195B (en) Crowd density estimation method, crowd density estimation device, computer equipment and storage medium
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN110334760A (en) A kind of optical component damage detecting method and system based on resUnet
CN113435407A (en) Small target identification method and device for power transmission system
CN112686872B (en) Wood counting method based on deep learning
CN114387592A (en) Character positioning and identifying method under complex background
CN113792584B (en) Wearing detection method and system for safety protection tool
CN110889418A (en) Gas contour identification method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant