CN113902958A - Anchor point self-adaption based infrastructure field personnel detection method - Google Patents


Info

Publication number
CN113902958A
CN113902958A
Authority
CN
China
Prior art keywords
anchor point
personnel
training
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111186748.1A
Other languages
Chinese (zh)
Inventor
许斌斌
陈畅
黄均才
刘鉴栋
袁晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Guangzhou Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202111186748.1A priority Critical patent/CN113902958A/en
Publication of CN113902958A publication Critical patent/CN113902958A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses an anchor-point-adaptive method for detecting personnel at infrastructure construction sites. Pictures of personnel at infrastructure construction sites are acquired and labeled; an anchor-point-adaptive deep learning network model is established by combining a Faster R-CNN detection network model with an FPN feature pyramid network model and adopting an anchor-point-adaptive method; the sample pictures are randomly divided into a training set and a testing set; after data enhancement of the input images, the anchor-point-adaptive deep learning network model is trained on the training set and then tested and tuned on the testing set; an image to be detected is fed into the solidified personnel detection model, and the detection result is output. The invention achieves automatic detection of personnel at infrastructure sites with high accuracy, good stability, strong anti-interference capability, high generality, and good robustness, and can be applied to intelligent supervision systems for infrastructure sites.

Description

Anchor point self-adaption based infrastructure field personnel detection method
Technical Field
The invention relates to a method for detecting personnel at capital construction sites, and in particular to an anchor-point-adaptive method for detecting personnel at capital construction sites.
Background
A large amount of personnel work takes place on infrastructure construction sites, with strict requirements on personnel dress and operating specifications, so dedicated personnel are needed for real-time supervision. Existing manual supervision is time-consuming and labor-intensive on the one hand, and on the other is easily affected by the supervisor's mental state and concentration, so potential construction risks arise on the work site. Intelligent supervision of field personnel by deep learning, in place of human supervisors, is therefore the trend of automated supervision. The most important and fundamental problem to solve is how to obtain accurate and fast person-detection results in a complex field environment.
Disclosure of Invention
To solve the problems described in the background art, the invention provides an anchor-point-adaptive method for detecting personnel at infrastructure construction sites. The method achieves automatic detection of site personnel with high accuracy, good stability, strong anti-interference capability, high generality, and good robustness, and can be applied to intelligent site supervision systems.
The technical solution of the invention comprises the following steps:
1) acquiring sample pictures of personnel in a site of a capital construction scene, and making a corresponding sample label file for each picture;
2) establishing an anchor point self-adaptive deep learning network model by combining a fast R-CNN detection network model with an FPN characteristic pyramid network model and adopting an anchor point self-adaptive method;
3) randomly dividing all acquired onsite personnel sample pictures of the infrastructure scene into a training set and a testing set;
4) performing data enhancement on the training set to obtain a training set after data enhancement;
5) training an anchor point self-adaptive deep learning network model by using a training set after data enhancement to obtain a primarily trained infrastructure field personnel detection model;
6) testing the performance of the detection model of the personnel in the infrastructure site after the initial training by adopting a test set, adjusting a training parameter and a detection confidence coefficient threshold according to a test result, and optimizing and solidifying the detection model of the personnel in the infrastructure site;
7) inputting the image to be detected into the solidified site-personnel detection model and outputting the detection result.
The sample pictures of site personnel are acquired in various infrastructure construction scenarios, with personnel on the construction operation site as the target object, the camera facing the target at a left-right deviation of no more than 15 degrees, and a shooting distance within 5-25 meters.
The step 2) is specifically as follows:
The second- to fifth-layer feature maps of the backbone network in the Faster R-CNN detection network model are input into the FPN feature pyramid network model, and the output of the FPN model is input into the region-of-interest module. The manually preset anchor-generation method of the region-of-interest module in Faster R-CNN is replaced with the anchor-point-adaptive method, thereby establishing the anchor-point-adaptive deep learning network model.
Step 4) specifically comprises sequentially applying random flipping, random brightness enhancement, and color-channel standardization to the site-personnel sample pictures in the training set, obtaining the augmented training set.
The color-channel standardization specifically applies the following formula to each color channel:

x' = (x - μ) / σ

where μ and σ are the mean and standard deviation of the channel, computed over the RGB channel values of the site-personnel sample pictures in the training set, and x and x' are the pixel value of the channel before and after standardization.
In step 5), the training pictures are uniformly scaled to the same size, and the parameters of the backbone network are pre-trained on the public ImageNet data. The parameter update method during training is SGD, with an initial learning rate of 0.01, a momentum term of 0.9, a weight-decay coefficient of 1×10⁻⁴, a batch size of 4, and 50000 training iterations; training begins with a 2000-iteration slow start, and the learning rate decreases in stages.
The slow start means that stage 1 of training begins at 0.001 times the initial learning rate, which increases linearly to the initial learning rate over a preset number of iterations; stage 2 then continues training at the initial learning rate.
The stage-wise learning-rate decrease means that the learning rate is multiplied by 1/10 at 35000 iterations and again at 45000 iterations.
Testing the performance of the initially trained site-personnel detection model with the test set in step 6) specifically comprises: counting the proportion of ground-truth boxes for which the overlap between a predicted box and the ground-truth box exceeds the overlap threshold, relative to the total number of ground-truth boxes, and using this proportion as the test result.
The invention has the beneficial effects that:
Compared with traditional methods for detecting construction-site personnel, the method has high accuracy, good robustness, and generality across various construction environments.
In the invention, a Feature Pyramid Network (FPN) is added to a Faster R-CNN target detection model with ResNet50 as the backbone, fusing the shallow geometric information and the deep semantic information of the image and improving the model's detection of distant personnel; the manually set anchors of Faster R-CNN are replaced by anchors generated adaptively from the input features, improving detection of human bodies at different positions and scales in complex environments.
The method thus achieves high detection precision at high efficiency and has strong anti-interference capability.
Drawings
Fig. 1 is a picture of an example training sample.
FIG. 2 shows an example of embodiment notation.
Fig. 3 is a diagram showing an overall network structure of the embodiment.
FIG. 4 is a diagram illustrating an anchor adaptive structure according to an embodiment.
FIG. 5 is a diagram of the detection and location of the person in the infrastructure scene according to the embodiment.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
The implementation process of the complete method implemented according to the content of the invention is as follows:
1) A typical sample picture of personnel at an infrastructure construction site is shown in FIG. 1. A corresponding label file is made for each picture; the label file conforms to the xml annotation standard of the Pascal VOC data set. The label records the image name, the image path, the image height and width, and the center position and size of each ground-truth target box; a typical annotation document is shown in FIG. 2.
The sample pictures are acquired by a supervision camera in various infrastructure construction scenarios, with personnel on the construction operation site as the target object, the camera facing the target at a left-right deviation of no more than 15 degrees, and a shooting distance within 5-25 meters.
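Since the label files above follow the Pascal VOC xml standard, they can be read with a small stdlib parser. This is a sketch: the tag names used below (`filename`, `size`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`) are the standard VOC layout and are assumed, not quoted from the patent's own files.

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_string):
    """Parse a Pascal-VOC-style annotation string.

    Returns (filename, (width, height), [(xmin, ymin, xmax, ymax), ...]).
    """
    root = ET.fromstring(xml_string)
    filename = root.findtext("filename")
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append(tuple(int(bb.findtext(t))
                           for t in ("xmin", "ymin", "xmax", "ymax")))
    return filename, (width, height), boxes
```

In practice one would call `ET.parse(path)` on each file of the data set instead of `fromstring`.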
2) Establishing an anchor point self-adaptive deep learning network model by combining a fast R-CNN detection network model with an FPN characteristic pyramid network model and adopting an anchor point self-adaptive method, as shown in FIG. 3;
the step 2) is specifically as follows:
The second- to fifth-layer feature maps of the backbone network in the Faster R-CNN detection network model are input into the FPN feature pyramid network model, and the output of the FPN model is input into the region-of-interest module. The manually preset anchor-generation method of the region-of-interest module in Faster R-CNN is replaced with the anchor-point-adaptive method, thereby establishing the anchor-point-adaptive deep learning network model.
and the backbone network of the network uses ResNet50 to extract the features in stages, and then uses an FPN structure to mix the features of each stage, so as to supplement the missing semantic information in the bottom layer large-receptive-field low-semantic feature map of the network and supplement the missing accurate position information of the top layer small-receptive-field high-semantic feature map of the network. Namely, the image is input into an input layer of a ResNet50 network frame model, the outputs from the second feature extraction stage to the fifth feature extraction stage of the ResNet50 network frame model are all connected to the input of an FPN feature pyramid network model, and the FPN structure interpolates and fuses the stage feature maps output from the second feature extraction stage to the fifth feature extraction stage of the ResNet50 network frame model and outputs feature maps with different scales. Specifically, a small-size feature map in a high stage layer by layer is subjected to bilinear interpolation to obtain a feature map with the same size as that of the previous stage, training fusion parameters are supervised, information fusion is carried out on the feature maps with different scales, and finally a detection frame is generated on a feature map group obtained through fusion.
A schematic diagram of the anchor-point-adaptive method of step 2) is shown in FIG. 4. The scheme abandons the idea of manually set anchors and introduces two new training branches in place of the traditional anchor-generation process: a position-prediction branch and a shape-prediction branch, which together generate sparse anchors from the local characteristics of the feature map. The framework first compares the outputs of the two branches with a set threshold to find positions on the feature map where a target center may exist, and then predicts the most likely anchor shape from the local features near each such center.
Specifically, the joint distribution of the anchor parameters is separated into two independent conditional distributions, and feature calibration is performed by combining the semantic information of the input feature map so as to output more accurate features for detection. The joint conditional distribution of an anchor and the conditional distributions of anchor position and anchor shape satisfy the following formula:
p(x,y,w,h|I)=p(x,y|I)p(w,h|x,y,I)
where (x, y) is the position of an anchor, (w, h) its shape, and I the input picture; p(x, y, w, h | I) is the joint conditional distribution of anchor position and shape given the input picture, p(x, y | I) the conditional distribution of anchor position, and p(w, h | x, y, I) the conditional distribution of anchor shape given the picture and the anchor position. The anchor location branch in FIG. 4 produces a probability map p(· | Fi) of the same size as the input feature map Fi; the value at position (i, j), p(i, j | Fi), is the probability that this position is the center of an anchor, and the position corresponds to the coordinate ((i + 1/2)s, (j + 1/2)s) on the original image I, where s is the down-sampling factor of the current FPN feature map relative to the input image. Specifically, the anchor location branch applies a 1×1 convolution kernel to the input feature map to obtain a score map, then converts the scores to probability values with a Sigmoid function, balancing efficiency and accuracy.
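A minimal sketch of the location branch just described: a 1×1 convolution (here a dot product over channels, with placeholder weights standing in for the learned ones) followed by a Sigmoid and a threshold on the resulting probability map.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def anchor_location_branch(feature_map, w, b, threshold=0.5):
    """Score every spatial position as a potential anchor centre.

    feature_map: (C, H, W) array; w: (C,) weights of a 1x1 convolution
    with a single output channel; b: scalar bias. Returns the (H, W)
    probability map and the (i, j) positions whose probability exceeds
    `threshold`. In the network the weights are learned; here they are
    placeholders.
    """
    score = np.tensordot(w, feature_map, axes=([0], [0])) + b  # (H, W)
    prob = sigmoid(score)
    centres = list(zip(*np.where(prob > threshold)))
    return prob, centres
```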
The anchor shape branch in fig. 4 produces a 2-channel output that is used to predict the relative length and relative width of the anchor. Specifically, we predict dw and dh in the following equations:
w = σ·s·e^dw,  h = σ·s·e^dh

where s is the down-sampling factor of the current FPN feature map relative to the input image, σ is an empirical parameter (8 in this embodiment), w and h are the actual width and height of the anchor, and dw and dh its relative width and height.
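The decoding of the shape-branch outputs can be written directly from the formula, with σ = 8 as in the embodiment. The exponential lets a bounded network output cover a wide range of box sizes.

```python
import math

def decode_anchor_shape(dw, dh, s, sigma=8):
    """Map the shape-branch outputs (dw, dh) to an actual anchor size.

    w = sigma * s * e^dw and h = sigma * s * e^dh, where s is the
    stride (down-sampling factor) of the feature map and sigma an
    empirical scale (8 in the embodiment).
    """
    return sigma * s * math.exp(dw), sigma * s * math.exp(dh)
```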
With the anchor-point-adaptive method shown in FIG. 4, each position predicts only one dynamically changing anchor rather than a densely distributed anchor matrix. This fits well the complex anchor-shape distribution caused by the varying poses and positions of personnel in infrastructure-scene pictures, and the generated anchors achieve a higher recall rate than the original scheme at a lower computational cost.
The anchors generated by the anchor-point-adaptive method are used for the subsequent region-of-interest pooling; classification and box regression are then performed with the same downstream network structure as Faster R-CNN.
3) Randomly dividing all acquired onsite personnel sample pictures of the infrastructure scene into a training set and a testing set;
4) performing data enhancement on the training set to obtain a training set after data enhancement;
Step 4) specifically comprises sequentially applying random flipping, random brightness enhancement, and color-channel standardization to the site-personnel sample pictures in the training set, obtaining the augmented training set.
The color-channel standardization specifically applies the following formula to each color channel:

x' = (x - μ) / σ

where μ and σ are the mean and standard deviation of the channel, computed over the RGB channel values of the site-personnel sample pictures in the training set, and x and x' are the pixel value of the channel before and after standardization.
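The standardization formula translates directly into code. The mean and standard-deviation values in the usage below are placeholders, not the statistics of the patent's data set.

```python
import numpy as np

def normalize_channels(image, mean, std):
    """Per-channel standardisation x' = (x - mu) / sigma.

    image: (H, W, 3) float array; mean/std: per-channel statistics
    computed once over the whole training set.
    """
    return (image - np.asarray(mean)) / np.asarray(std)
```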
5) Training an anchor point self-adaptive deep learning network model by using a training set after data enhancement to obtain a primarily trained infrastructure field personnel detection model;
In step 5), the training pictures are uniformly scaled to the same size, and the parameters of the backbone network are pre-trained on the public ImageNet data. The parameter update method during training is SGD, with an initial learning rate of 0.01, a momentum term of 0.9, a weight-decay coefficient of 1×10⁻⁴, a batch size of 4, and 50000 training iterations; training begins with a 2000-iteration slow start, and the learning rate decreases in stages.
The slow start means that stage 1 of training begins at 0.001 times the initial learning rate, which increases linearly to the initial learning rate over a preset number of iterations; stage 2 then continues training at the initial learning rate.
The stage-wise learning-rate decrease means that the learning rate is multiplied by 1/10 at 35000 iterations and again at 45000 iterations.
The total number of experimental pictures was 13574: 12574 were used for training and the remaining 1000 as the test set. Data enhancement was performed before the training pictures entered model training, using random flipping and color-channel standardization. The enhanced pictures were uniformly scaled to 1333 × 800, and ResNet50 parameters pre-trained on ImageNet were used. The parameter update method was SGD with an initial learning rate of 0.01, a momentum term of 0.9, a weight-decay coefficient of 1×10⁻⁴, a batch size of 4, and 50000 training iterations. Training started slowly over 2000 iterations, and the learning rate was divided by 10 at 35000 and at 45000 iterations.
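The warm-up and step schedule described above can be sketched as a pure function of the iteration count; this is a common implementation style, not the patent's own code.

```python
def learning_rate(it, base_lr=0.01, warmup_iters=2000,
                  warmup_factor=0.001, steps=(35000, 45000), gamma=0.1):
    """Warm-up plus step-decay schedule of the embodiment.

    Starts at warmup_factor * base_lr, rises linearly to base_lr over
    warmup_iters iterations, then is multiplied by gamma at each
    milestone in `steps`.
    """
    if it < warmup_iters:
        alpha = it / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    lr = base_lr
    for step in steps:
        if it >= step:
            lr *= gamma
    return lr
```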
6) Testing the performance of the detection model of the personnel in the infrastructure site after the initial training by adopting a test set, adjusting a training parameter and a detection confidence coefficient threshold according to a test result, and optimizing and solidifying the detection model of the personnel in the infrastructure site;
In step 6), the test set is used to test the detection performance of the initially trained model. The test statistic is the proportion of ground-truth boxes whose overlap with a predicted box exceeds the overlap threshold, relative to the total number of ground-truth boxes; the overlap threshold is generally chosen as 0.5.
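The test statistic of step 6) can be sketched as ground-truth recall at a fixed IoU threshold:

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_recall(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by some prediction with
    IoU above the threshold."""
    matched = sum(
        any(iou(p, g) > iou_threshold for p in pred_boxes)
        for g in gt_boxes)
    return matched / len(gt_boxes) if gt_boxes else 0.0
```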
7) Inputting the image to be detected into the solidified site-personnel detection model and outputting the detection result.
Compared with traditional models, the method clearly improves detection. Table 1 compares the method with a Faster R-CNN detection network using ResNet50 as the backbone, and with the same network combined with FPN. Detection precision is the proportion of valid detection boxes obtained on the test set relative to the total number of target boxes, where a valid detection box is one whose overlap with the annotated box exceeds 0.5.
TABLE 1 comparison of test models
[Table 1 is rendered as an image in the original publication; its numerical values are not recoverable from this text.]
The trained model is run on the test set, predicted boxes are drawn on the test pictures together with their prediction confidences, and a typical result is shown in FIG. 5. The average precision of the detection model is then computed, and the better-performing model is solidified.
The foregoing detailed description is intended to illustrate rather than limit the invention. Any changes and modifications within the spirit and scope of the appended claims are intended to be covered by them.

Claims (9)

1. A method for detecting personnel in a capital construction field based on anchor point self-adaptation comprises the following steps:
1) acquiring sample pictures of personnel in a site of a capital construction scene, and making a corresponding sample label file for each picture;
2) establishing an anchor point self-adaptive deep learning network model by combining a fast R-CNN detection network model with an FPN characteristic pyramid network model and adopting an anchor point self-adaptive method;
3) randomly dividing all acquired onsite personnel sample pictures of the infrastructure scene into a training set and a testing set;
4) performing data enhancement on the training set to obtain a training set after data enhancement;
5) training an anchor point self-adaptive deep learning network model by using a training set after data enhancement to obtain a primarily trained infrastructure field personnel detection model;
6) testing the performance of the detection model of the personnel in the infrastructure site after the initial training by adopting a test set, adjusting a training parameter and a detection confidence coefficient threshold according to a test result, and optimizing and solidifying the detection model of the personnel in the infrastructure site;
7) inputting the image to be detected into the solidified site-personnel detection model and outputting the detection result.
2. The anchor point self-adaption based infrastructure field personnel detection method according to claim 1, wherein: the sample pictures of site personnel are acquired in various infrastructure construction scenarios, with personnel on the construction operation site as the target object, the camera facing the target at a left-right deviation of no more than 15 degrees, and a shooting distance within 5-25 meters.
3. The method for detecting personnel on a construction site based on anchor point self-adaptation as claimed in claim 1, wherein step 2) specifically comprises:
inputting the second- to fifth-layer feature maps of the backbone network of the Faster R-CNN detection network model into the FPN feature pyramid network model; inputting the output of the FPN model into the region-of-interest module; replacing the manually preset anchor-generation method of the region-of-interest module in Faster R-CNN with the anchor-point-adaptive method; and thereby establishing the anchor-point-adaptive deep learning network model.
4. The anchor point self-adaption based infrastructure field personnel detection method as claimed in claim 1, wherein step 4) specifically comprises sequentially applying random flipping, random brightness enhancement, and color-channel standardization to the site-personnel sample pictures in the training set to obtain the augmented training set.
5. The anchor point self-adaption based construction field personnel detection method as claimed in claim 4, wherein the color-channel standardization applies the following formula to each color channel:
x' = (x - μ) / σ
where μ and σ are the mean and standard deviation of the channel, computed over the RGB channel values of the site-personnel sample pictures in the training set, and x and x' are the pixel value of the channel before and after standardization.
6. The anchor point self-adaption based infrastructure field personnel detection method according to claim 1, wherein: in step 5), the training pictures are uniformly scaled to the same size, and the parameters of the backbone network are pre-trained on the public ImageNet data; the parameter update method during training is SGD, with an initial learning rate of 0.01, a momentum term of 0.9, a weight-decay coefficient of 1×10⁻⁴, a batch size of 4, and 50000 training iterations; training begins with a 2000-iteration slow start, and the learning rate decreases in stages.
7. The anchor point self-adaption based infrastructure field personnel detection method according to claim 6, wherein: the slow start means that stage 1 of training begins at 0.001 times the initial learning rate, which increases linearly to the initial learning rate over a preset number of iterations; stage 2 then continues training at the initial learning rate.
8. The anchor point self-adaption based infrastructure field personnel detection method according to claim 6, wherein: the stage-wise learning-rate decrease means that the learning rate is multiplied by 1/10 at 35000 iterations and again at 45000 iterations.
9. The anchor point self-adaption based infrastructure field personnel detection method according to claim 1, wherein: testing the performance of the initially trained site-personnel detection model with the test set in step 6) specifically comprises: counting the proportion of ground-truth boxes for which the overlap between a predicted box and the ground-truth box exceeds the overlap threshold, relative to the total number of ground-truth boxes, and using this proportion as the test result.
CN202111186748.1A 2021-10-12 2021-10-12 Anchor point self-adaption based infrastructure field personnel detection method Pending CN113902958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111186748.1A CN113902958A (en) 2021-10-12 2021-10-12 Anchor point self-adaption based infrastructure field personnel detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111186748.1A CN113902958A (en) 2021-10-12 2021-10-12 Anchor point self-adaption based infrastructure field personnel detection method

Publications (1)

Publication Number Publication Date
CN113902958A true CN113902958A (en) 2022-01-07

Family

ID=79191597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111186748.1A Pending CN113902958A (en) 2021-10-12 2021-10-12 Anchor point self-adaption based infrastructure field personnel detection method

Country Status (1)

Country Link
CN (1) CN113902958A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764799A (en) * 2022-05-07 2022-07-19 广东电网有限责任公司广州供电局 Material detection method based on Guided adsorbing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401418A (en) * 2020-03-05 2020-07-10 浙江理工大学桐乡研究院有限公司 Employee dressing specification detection method based on improved Faster r-cnn
CN111582072A (en) * 2020-04-23 2020-08-25 浙江大学 Transformer substation picture bird nest detection method combining ResNet50+ FPN + DCN
CN111583198A (en) * 2020-04-23 2020-08-25 浙江大学 Insulator picture defect detection method combining FasterR-CNN + ResNet101+ FPN
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAQI WANG ET AL: ""Region Proposal by Guided Anchoring"", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION(CVPR)》, 9 January 2020 (2020-01-09), pages 2960 - 2969 *
YI XIAO ET AL: ""Object Detection Based on Faster R-CNN Algorithm with Skip Pooling and Fusion of Contextual Information"", 《SENSORS》, 25 September 2020 (2020-09-25), pages 1 - 20 *
LIU FANG ET AL: ""Adaptive UAV Target Detection Based on Multi-scale Feature Fusion"", 《光学学报》 (ACTA OPTICA SINICA), vol. 40, no. 10, 31 May 2020 (2020-05-31), pages 1 - 10 *

Similar Documents

Publication Publication Date Title
CN110781838B (en) Multi-mode track prediction method for pedestrians in complex scene
CN111583198A (en) Insulator picture defect detection method combining FasterR-CNN + ResNet101+ FPN
CN110929569B (en) Face recognition method, device, equipment and storage medium
CN111582072A (en) Transformer substation picture bird nest detection method combining ResNet50+ FPN + DCN
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN108805070A (en) A kind of deep learning pedestrian detection method based on built-in terminal
CN110781964A (en) Human body target detection method and system based on video image
CN111768388A (en) Product surface defect detection method and system based on positive sample reference
CN110084165A (en) The intelligent recognition and method for early warning of anomalous event under the open scene of power domain based on edge calculations
US9299011B2 (en) Signal processing apparatus, signal processing method, output apparatus, output method, and program for learning and restoring signals with sparse coefficients
CN114998673B (en) Dam defect time sequence image description method based on local self-attention mechanism
CN112580443B (en) Pedestrian detection method based on embedded device improved CenterNet
CN109766828A (en) A kind of vehicle target dividing method, device and communication equipment
CN113537561A (en) Ultra-short-term solar radiation prediction method and system based on foundation cloud picture
CN116935332A (en) Fishing boat target detection and tracking method based on dynamic video
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN113902958A (en) Anchor point self-adaption based infrastructure field personnel detection method
CN110807369B (en) Short video content intelligent classification method based on deep learning and attention mechanism
CN116563844A (en) Cherry tomato maturity detection method, device, equipment and storage medium
CN111126185A (en) Deep learning vehicle target identification method for road intersection scene
CN113903002A (en) Tower crane below abnormal intrusion detection method based on tower crane below personnel detection model
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
CN115641498A (en) Medium-term rainfall forecast post-processing correction method based on space multi-scale convolutional neural network
CN112380985A (en) Real-time detection method for intrusion foreign matters in transformer substation
CN114998926A (en) Top-down tool fitting compliance wearing detection algorithm method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination