CN112070075B - Human body detection method based on collaborative regression - Google Patents

Human body detection method based on collaborative regression

Info

Publication number
CN112070075B
CN112070075B CN202011264121.9A
Authority
CN
China
Prior art keywords
human body
point
central point
head
human
Prior art date
Legal status
Active
Application number
CN202011264121.9A
Other languages
Chinese (zh)
Other versions
CN112070075A (en)
Inventor
张逸
何鹏飞
王军
徐晓刚
张文广
朱岳江
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202011264121.9A
Publication of CN112070075A
Application granted
Publication of CN112070075B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body detection method based on collaborative regression. A deep convolutional neural network backbone model is first pre-trained; a head center point, a body center point and the corresponding body width and height are then predicted on the output feature layer of the model, while a vector pointing to the body center point and a vector pointing to the head center point are regressed on the same output feature layer; the human body detection model is obtained by training with a minimized loss function. Finally, test-set pictures are input into the trained model, the body position is determined from the predicted head center points, the predicted body center points and the vectors pointing to each other, and the detection result is obtained by combining the predicted body width and height. Fully considering the low body detection rate in crowd-dense scenes, and exploiting the fact that heads are less likely than bodies to occlude one another in such scenes, the method collaboratively regresses the head center point and the body center point and achieves a high body detection rate in crowd-dense scenes.

Description

Human body detection method based on collaborative regression
Technical Field
The invention belongs to the technical field of artificial intelligence and computer vision, and particularly relates to a human body detection method based on collaborative regression.
Background
With the development of intelligent technology, safety in production and daily life has drawn increasing attention. Cameras have been installed at industrial production sites and in many corners of cities, creating good conditions for automated monitoring based on computer vision techniques.
In security scenarios, human body detection is widely applied as a basic component of intelligent video monitoring, and various fine-grained analyses such as real-time people-flow statistics, human posture recognition and human behavior recognition can be built on top of it. Human body detection therefore occupies an important position in intelligent video monitoring, and its accuracy and recall rate directly affect the performance of all downstream intelligent applications.
In recent years, with the development of deep learning, the performance of object detection has improved greatly, and detectors represented by YOLO have been widely adopted in industry, where human body detection in particular is becoming mature. Dense crowd scenes, however, remain one of the difficulties of human body detection. Cameras in urban scenes are often mounted at subway entrances, squares and roads where crowds gather densely, so in the captured images pedestrians are packed together and their bodies partially occlude one another. When a mainstream object detector is applied directly to human body detection in such scenes, several people who are close together or mutually occluded are easily detected as a single person, which lowers the detection recall rate. A human body detection technique with high recall, high accuracy and fuller use of the image information is therefore urgently needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a human body detection method based on collaborative regression. By exploiting the fact that, in crowd-dense scenes, human heads are less likely than human bodies to occlude one another, it solves the problem that human body detection in crowd-dense video images cannot reach a satisfactory recall rate with existing methods.
The purpose of the invention is realized by the following technical scheme:
a human body detection method based on collaborative regression comprises the following steps:
S1: pre-training a deep convolutional neural network backbone model;
S2: predicting a head center point and a body center point, respectively, on the output feature layer of the deep convolutional neural network backbone model;
S3: regressing, on the output feature layer of the deep convolutional neural network backbone model, a vector from each head center point to the corresponding body center point and a vector from each body center point to the corresponding head center point;
S4: predicting the body width and height on the output feature layer of the deep convolutional neural network backbone model;
S5: training the model by minimizing a loss function to obtain a trained human body detection model;
S6: inputting test-set pictures into the trained human body detection model, determining the body position from the predicted head center points, body center points and their mutual pointing vectors, and obtaining the detection result by combining the predicted body width and height.
Further, the deep convolutional neural network backbone model adopts ResNet and is pre-trained on the ImageNet data set.
Further, the operation of S2 is as follows:
S2.1: the output feature layer of the deep convolutional neural network backbone model is downsampled by a factor of x relative to the input picture size; after a convolution operation is applied to the output feature layer, sigmoid normalization is performed to obtain, for each pixel position of the output feature layer, the predicted probability that it is a head center point or a body center point, thereby producing the head center point and body center point prediction heatmaps;
S2.2: the prediction heatmaps are pooled, i.e. only the pixels whose predicted probability is greater than that of their 8 surrounding pixels keep their predicted probability, and the predicted probabilities of all other pixel positions are set to zero; with the body center point detection threshold set to T_b and the head center point detection threshold set to T_h, the pixel positions in the pooled head center point heatmap whose predicted probability exceeds T_h are the predicted head center point positions, and the pixel positions in the pooled body center point heatmap whose predicted probability exceeds T_b are the predicted body center point positions.
Further, the specific operation of S3 is as follows:
A convolution operation is applied to the output feature layer of the deep convolutional neural network backbone model; for each predicted head center point position a vector pointing to the nearest body center point is regressed, and for each predicted body center point position a vector pointing to the nearest head center point is regressed.
Further, the specific operation of S4 is as follows:
A convolution operation is applied to the output feature layer of the deep convolutional neural network backbone model to obtain, for each pixel position of the feature layer, the body width and height predicted when that pixel is taken as a body center point.
Further, the step of calculating the loss function in S5 is as follows:
(1) The head center point prediction loss L_kh and the body center point prediction loss L_kb are computed from the formula for the center point prediction loss L_k; the regression loss L_vb of vectors pointing to the body center point and the regression loss L_vh of vectors pointing to the head center point are computed from the formula for the pointing vector regression loss L_v; and the body width-height prediction loss L_size is computed:
[Equation image: formulas for the center point prediction loss L_k, the pointing vector regression loss L_v and the body width-height prediction loss L_size]
where N is the number of ground-truth center points, Y_p is the predicted probability value, Y_t is the ground-truth probability value, and α is an adjustment parameter; V_p is the predicted pointing vector and V_t is the ground-truth pointing vector; S_p is the predicted width and height and S_t is the ground-truth width and height;
(2) The final loss function L_det is obtained by accumulating the center point prediction losses, the pointing vector regression losses and the body width-height prediction loss, with the calculation formula:
[Equation image: formula for the final loss L_det]
where θ and γ are adjustment coefficients.
Further, in S6, relatively low head center point and body center point detection thresholds are set so that more head center point and body center point predictions are obtained than the actual number of people; on this basis, pairs of head center points and body center points are screened and matched using the pointing vectors, the body position is determined from each matched body center point and head center point pair, and the body width and height corresponding to the body center point position are then taken from the width-height prediction to obtain the final human body detection result.
Further, the head center point and the body center point are matched as follows:
a single body center point Point_b plus the body-to-head pointing vector V_bh at its pixel location gives a position Point_hp; a single head center point Point_h plus the head-to-body pointing vector V_hb at its pixel location gives a position Point_bp; if Point_hp lies within a circle of radius r1 around Point_h and, at the same time, Point_bp lies within a circle of radius r2 around Point_b, the body center point Point_b and the head center point Point_h are considered to match each other.
The invention has the beneficial effects that:
(1) The method exploits the fact that human heads are less likely than human bodies to occlude one another; by collaboratively regressing the vector from the head center point to the body center point and the vector from the body center point to the head center point, and determining the body position through matching of this pair of vectors, it overcomes the difficulty of directly detecting bodies in crowd-dense images and effectively improves the detection recall rate and accuracy.
(2) The prediction of the body center point, the head center point, the vector pointing to the body center point, the vector pointing to the head center point and the body width and height are all based on the output feature layer of the backbone network; the overall structure is simple, inference is fast, and the method is well suited to industrial applications.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a collaborative regression-based human detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a matching manner between a human head center point and a human body center point in the human body detection method based on collaborative regression according to the embodiment of the present invention.
Fig. 3 is a schematic diagram of a human body detection result based on collaborative regression in a crowd intensive scene according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, the method for detecting a human body based on collaborative regression of the present invention includes the following steps:
S1: pre-training a deep convolutional neural network backbone model;
In particular, the deep convolutional neural network backbone model adopts ResNet and is pre-trained on the ImageNet data set.
S2: predicting a head center point and a body center point, respectively, on the output feature layer of the deep convolutional neural network backbone model;
Specifically, the output feature layer of the deep convolutional neural network backbone model is downsampled by a factor of 4 relative to the input picture size. A set of convolutions with kernel sizes 3 × 3, 1 × 1, 3 × 3, 1 × 1 and 3 × 3 is applied to the output feature layer, followed by sigmoid normalization, giving for each pixel position of the output feature layer the predicted probability that it is a head center point or a body center point, i.e. the head center point and body center point prediction heatmaps. The prediction heatmaps are then pooled: only the pixels whose predicted probability is greater than that of their 8 surrounding pixels keep their value, and the predicted probabilities of all other pixel positions are set to zero. The body center point detection threshold T_b is set to 0.3 and the head center point detection threshold T_h to 0.5; pixel positions in the pooled head center point heatmap whose predicted probability exceeds T_h are the predicted head center point positions, and pixel positions in the pooled body center point heatmap whose predicted probability exceeds T_b are the predicted body center point positions. A sketch of this peak-extraction step is given below.
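As a non-limiting illustration only, the peak-extraction step just described (3 × 3 pooling followed by thresholding) can be sketched in Python with PyTorch roughly as follows; the tensor layout and the function name are assumptions of this sketch rather than part of the disclosure:

    import torch
    import torch.nn.functional as F

    def extract_center_points(heatmap, threshold):
        # heatmap: (1, 1, H, W) sigmoid-normalized prediction heatmap.
        # Keep only pixels whose score is not smaller than their 8 neighbours,
        # then keep those above the detection threshold (T_h or T_b).
        pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
        peaks = heatmap * (heatmap == pooled).float()
        ys, xs = torch.where(peaks[0, 0] > threshold)
        scores = peaks[0, 0, ys, xs]
        return torch.stack([xs, ys], dim=1), scores  # (K, 2) positions on the stride-4 feature map

    # head_points, head_scores = extract_center_points(head_heatmap, 0.5)  # T_h
    # body_points, body_scores = extract_center_points(body_heatmap, 0.3)  # T_b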
Step S3: regressing, on the output feature layer of the deep convolutional neural network backbone model, a vector from each head center point to the corresponding body center point and a vector from each body center point to the corresponding head center point;
Specifically, a set of convolutions with kernel sizes 3 × 3, 1 × 1, 3 × 3, 1 × 1 and 3 × 3 is applied to the output feature layer of the deep convolutional neural network backbone model; for each predicted head center point position a vector pointing to the nearest body center point is regressed, and for each predicted body center point position a vector pointing to the nearest head center point is regressed.
Step S4: predicting the width and height of a human body on an output characteristic layer of a deep convolutional neural network backbone model;
Specifically, a set of convolutions with kernel sizes 3 × 3, 1 × 1, 3 × 3, 1 × 1 and 3 × 3 is applied to the output feature layer of the deep convolutional neural network backbone model, giving for each pixel position of the output feature layer the body width and height predicted when that pixel is taken as a body center point. A sketch of the prediction branches used in steps S2 to S4 is given below.
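For illustration only, the five prediction branches (head heatmap, body heatmap, head-to-body vector, body-to-head vector, body width-height), each built from convolutions with kernel sizes 3 × 3, 1 × 1, 3 × 3, 1 × 1 and 3 × 3, could be organized as in the following sketch; the intermediate channel width, the ReLU activations and the use of a separate stack per branch are assumptions of the sketch:

    import torch.nn as nn

    def prediction_branch(in_ch, out_ch, mid_ch=64):
        # Conv kernels 3x3, 1x1, 3x3, 1x1, 3x3 as described in the text.
        return nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 3, padding=1),
        )

    class CollaborativeRegressionHeads(nn.Module):
        # Five branches over the shared backbone feature map (stride 4 w.r.t. the input).
        def __init__(self, in_ch=64):
            super().__init__()
            self.head_heatmap = prediction_branch(in_ch, 1)  # head center point heatmap
            self.body_heatmap = prediction_branch(in_ch, 1)  # body center point heatmap
            self.v_hb = prediction_branch(in_ch, 2)          # vector from head center to body center
            self.v_bh = prediction_branch(in_ch, 2)          # vector from body center to head center
            self.body_wh = prediction_branch(in_ch, 2)       # body width and height

        def forward(self, feat):
            return {
                "head_hm": self.head_heatmap(feat).sigmoid(),
                "body_hm": self.body_heatmap(feat).sigmoid(),
                "v_hb": self.v_hb(feat),
                "v_bh": self.v_bh(feat),
                "wh": self.body_wh(feat),
            }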
Step S5: training the model by minimizing a loss function to obtain a human body detection model;
In particular, a crowd-dense scene picture dataset is used; the Adam optimizer is employed with an initial learning rate of 0.0002, the number of epochs is set to 50 and the batch size to 64; the loss function is minimized during training to obtain the human body detection model. A sketch of such a training loop is given below.
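A minimal training-loop sketch matching the stated hyperparameters (Adam optimizer, initial learning rate 0.0002, 50 epochs, batch size 64); the dataset object and the combined loss function L_det are placeholders assumed to be provided elsewhere:

    import torch
    from torch.utils.data import DataLoader

    def train(model, dataset, detection_loss, device="cuda"):
        # dataset: crowd-dense scene picture dataset (placeholder);
        # detection_loss: combined loss L_det from step S5 (placeholder).
        loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
        optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial learning rate 0.0002
        model.to(device).train()
        for epoch in range(50):  # 50 epochs
            for images, targets in loader:
                preds = model(images.to(device))
                loss = detection_loss(preds, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()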
The loss function is divided into three parts: the center point prediction loss L_k, the pointing vector regression loss L_v, and the body width-height prediction loss L_size, each given by the formulas below.
In the center point prediction loss formula, N is the number of ground-truth center points, Y_p is the predicted probability value, Y_t is the ground-truth probability value, and α is an adjustment parameter, set here to 2;
[Equation image: center point prediction loss L_k]
In the pointing vector regression loss formula, V_p is the predicted pointing vector and V_t is the ground-truth pointing vector;
[Equation image: pointing vector regression loss L_v]
In the body width-height prediction loss formula, S_p is the predicted width and height and S_t is the ground-truth width and height;
[Equation image: body width-height prediction loss L_size]
The final loss function is obtained by accumulating the center point prediction loss, the pointing vector regression loss and the body width-height prediction loss, where the center point prediction loss comprises the head center point prediction loss L_kh and the body center point prediction loss L_kb; the pointing vector regression loss comprises the regression loss L_vb of vectors pointing to the body center point and the regression loss L_vh of vectors pointing to the head center point; the body width-height prediction loss is L_size; θ and γ are adjustment coefficients, with θ set to 0.6 and γ set to 0.4.
[Equation image: final loss L_det]
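Because the equation images above are not reproduced in the text, the following LaTeX sketch gives one plausible reconstruction consistent with the variables defined above, assuming a CenterNet-style focal loss for the center point heatmaps, L1 losses for the pointing vectors and the width-height, and that θ and γ weight the vector and size terms; the exact formulas of the patent may differ:

    L_k = -\frac{1}{N}\sum_i
      \begin{cases}
        (1 - Y_{p,i})^{\alpha}\,\log Y_{p,i}, & Y_{t,i} = 1 \\
        Y_{p,i}^{\alpha}\,\log(1 - Y_{p,i}), & \text{otherwise}
      \end{cases}

    L_v = \frac{1}{N}\sum_i \lVert V_{p,i} - V_{t,i} \rVert_1, \qquad
    L_{size} = \frac{1}{N}\sum_i \lVert S_{p,i} - S_{t,i} \rVert_1

    L_{det} = L_{kh} + L_{kb} + \theta\,(L_{vb} + L_{vh}) + \gamma\,L_{size}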
Step S6: inputting the test-set pictures into the trained human body detection model, determining the body position from the predicted head center points, body center points and their mutual pointing vectors, and obtaining the detection result by combining the predicted body width and height.
Specifically, each test-set picture is downsampled by nearest-neighbour interpolation so that its longest edge is 608 pixels, the short edge is padded to give a 608 × 608 image, and after normalization the image is input into the trained human body detection model (a preprocessing sketch is given below). To improve the detection rate in dense scenes, relatively low head center point and body center point detection thresholds of 0.5 and 0.3 respectively are set, so that more head center point and body center point predictions are obtained than the actual number of people. On this basis, pairs of head center points and body center points are screened and matched using the pointing vectors, the body position is determined from each matched body and head center point pair, and the body width and height corresponding to the body center point position are then taken from the width-height prediction to obtain the final detection result.
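A possible sketch of the resize-pad-normalize preprocessing just described; the zero padding placed at the bottom/right and the division-by-255 normalization are assumptions of the sketch:

    import cv2
    import numpy as np

    def preprocess(image_bgr, size=608):
        # Resize the longest edge to 608 px with nearest-neighbour interpolation,
        # pad to 608 x 608, then normalize to [0, 1].
        h, w = image_bgr.shape[:2]
        scale = size / max(h, w)
        resized = cv2.resize(image_bgr, (int(round(w * scale)), int(round(h * scale))),
                             interpolation=cv2.INTER_NEAREST)
        canvas = np.zeros((size, size, 3), dtype=np.uint8)
        canvas[:resized.shape[0], :resized.shape[1]] = resized
        return canvas.astype(np.float32) / 255.0, scale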
As one embodiment, the head center point and the body center point are matched as follows (see fig. 2): a single body center point Point_b plus the body-to-head pointing vector V_bh at its pixel location gives a position Point_hp; a single head center point Point_h plus the head-to-body pointing vector V_hb at its pixel location gives a position Point_bp. With the threshold r1 taken as 2 and the threshold r2 taken as 3, if Point_hp lies within a circle of radius r1 around Point_h and, at the same time, Point_bp lies within a circle of radius r2 around Point_b, the body center point Point_b and the head center point Point_h are considered to match each other. A sketch of this matching step is given below.
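The matching rule just described could be sketched as follows; the array-based interface and the greedy first-match pairing strategy are assumptions of the sketch, since the text only states the mutual radius conditions:

    import numpy as np

    def match_centers(body_points, head_points, v_bh, v_hb, r1=2.0, r2=3.0):
        # body_points (N, 2), head_points (M, 2): predicted center positions;
        # v_bh[i]: body-to-head vector at body point i; v_hb[j]: head-to-body vector at head point j.
        pairs = []
        for i, point_b in enumerate(body_points):
            point_hp = point_b + v_bh[i]           # where this body expects its head
            for j, point_h in enumerate(head_points):
                point_bp = point_h + v_hb[j]       # where this head expects its body
                if (np.linalg.norm(point_h - point_hp) <= r1
                        and np.linalg.norm(point_b - point_bp) <= r2):
                    pairs.append((i, j))           # Point_b and Point_h match
                    break
        return pairs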
Fig. 3 shows the human body detection results of the method of the present invention in a dense crowd scene. On the Crowd_human crowd-dense validation data set, the invention reaches a mean average precision (mAP) of 85.7%, effectively improving human body detection performance in crowd-dense scenes compared with existing methods.

Claims (5)

1. A human body detection method based on collaborative regression is characterized by comprising the following steps:
S1: pre-training a deep convolutional neural network backbone model;
S2: predicting a head center point and a body center point, respectively, on the output feature layer of the deep convolutional neural network backbone model;
S2.1: the output feature layer of the deep convolutional neural network backbone model is downsampled by a factor of x relative to the input picture size; after a convolution operation is applied to the output feature layer, sigmoid normalization is performed to obtain, for each pixel position of the output feature layer, the predicted probability that it is a head center point or a body center point, thereby producing the head center point and body center point prediction heatmaps;
S2.2: the prediction heatmaps are pooled, i.e. only the pixels whose predicted probability is greater than that of their 8 surrounding pixels keep their predicted probability, and the predicted probabilities of all other pixel positions are set to zero; with the body center point detection threshold set to T_b and the head center point detection threshold set to T_h, the pixel positions in the pooled head center point heatmap whose predicted probability exceeds T_h are the predicted head center point positions, and the pixel positions in the pooled body center point heatmap whose predicted probability exceeds T_b are the predicted body center point positions;
S3: regressing, on the output feature layer of the deep convolutional neural network backbone model, a vector from each head center point to the corresponding body center point and a vector from each body center point to the corresponding head center point;
S4: predicting the body width and height on the output feature layer of the deep convolutional neural network backbone model;
S5: training the model by minimizing a loss function to obtain a trained human body detection model;
S6: inputting test-set pictures into the trained human body detection model; by setting relatively low head center point and body center point detection thresholds, more head center point and body center point predictions are obtained than the actual number of people; on this basis, pairs of head center points and body center points are screened and matched using the pointing vectors, the body position is determined from each matched body center point and head center point pair, and the body width and height corresponding to the body center point position are then taken from the width-height prediction to obtain the final human body detection result;
the head center point and the body center point are matched as follows:
a single body center point Point_b plus the body-to-head pointing vector V_bh at its pixel location gives a position Point_hp; a single head center point Point_h plus the head-to-body pointing vector V_hb at its pixel location gives a position Point_bp; if Point_hp lies within a circle of radius r1 around Point_h and, at the same time, Point_bp lies within a circle of radius r2 around Point_b, the body center point Point_b and the head center point Point_h are considered to match each other.
2. The collaborative regression-based human body detection method according to claim 1, wherein the deep convolutional neural network backbone model adopts ResNet and is pre-trained on the ImageNet data set.
3. The collaborative regression-based human body detection method according to claim 1, wherein the specific operation of S3 is as follows:
A convolution operation is applied to the output feature layer of the deep convolutional neural network backbone model; for each predicted head center point position a vector pointing to the nearest body center point is regressed, and for each predicted body center point position a vector pointing to the nearest head center point is regressed.
4. The collaborative regression-based human body detection method according to claim 1, wherein the specific operation of S4 is as follows:
A convolution operation is applied to the output feature layer of the deep convolutional neural network backbone model to obtain, for each pixel position of the feature layer, the body width and height predicted when that pixel is taken as a body center point.
5. The collaborative regression-based human detection method according to claim 1, wherein the loss function in S5 is calculated as follows:
(1) The head center point prediction loss L_kh and the body center point prediction loss L_kb are computed from the formula for the center point prediction loss L_k; the regression loss L_vb of vectors pointing to the body center point and the regression loss L_vh of vectors pointing to the head center point are computed from the formula for the pointing vector regression loss L_v; and the body width-height prediction loss L_size is computed:
[Equation image: formulas for the center point prediction loss L_k, the pointing vector regression loss L_v and the body width-height prediction loss L_size]
where N is the number of ground-truth center points, Y_p is the predicted probability value, Y_t is the ground-truth probability value, and α is an adjustment parameter; V_p is the predicted pointing vector and V_t is the ground-truth pointing vector; S_p is the predicted width and height and S_t is the ground-truth width and height;
(2) The final loss function L_det is obtained by accumulating the center point prediction losses, the pointing vector regression losses and the body width-height prediction loss, with the calculation formula:
[Equation image: formula for the final loss L_det]
where θ and γ are adjustment coefficients.
CN202011264121.9A 2020-11-12 2020-11-12 Human body detection method based on collaborative regression Active CN112070075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011264121.9A CN112070075B (en) 2020-11-12 2020-11-12 Human body detection method based on collaborative regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011264121.9A CN112070075B (en) 2020-11-12 2020-11-12 Human body detection method based on collaborative regression

Publications (2)

Publication Number Publication Date
CN112070075A CN112070075A (en) 2020-12-11
CN112070075B (en) 2021-02-09

Family

ID=73655383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011264121.9A Active CN112070075B (en) 2020-11-12 2020-11-12 Human body detection method based on collaborative regression

Country Status (1)

Country Link
CN (1) CN112070075B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876968B (en) * 2024-03-11 2024-05-28 盛视科技股份有限公司 Dense pedestrian detection method combining multiple targets

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178208A (en) * 2019-12-20 2020-05-19 华瑞新智科技(北京)有限公司 Pedestrian detection method, device and medium based on deep learning
CN111507266A (en) * 2020-04-17 2020-08-07 四川长虹电器股份有限公司 Human body detection method and device based on depth image
CN111914727A (en) * 2020-07-28 2020-11-10 联芯智能(南京)科技有限公司 Small target human body detection method based on balance sampling and nonlinear feature fusion

Also Published As

Publication number Publication date
CN112070075A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN107967451B (en) Method for counting crowd of still image
CN110135243B (en) Pedestrian detection method and system based on two-stage attention mechanism
CN107529650B (en) Closed loop detection method and device and computer equipment
CN111199556B (en) Indoor pedestrian detection and tracking method based on camera
CN113159466B (en) Short-time photovoltaic power generation prediction system and method
CN112084869A (en) Compact quadrilateral representation-based building target detection method
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN106815576B (en) Target tracking method based on continuous space-time confidence map and semi-supervised extreme learning machine
CN111161309B (en) Searching and positioning method for vehicle-mounted video dynamic target
CN113393457B (en) Anchor-frame-free target detection method combining residual error dense block and position attention
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN107833239A (en) A kind of searching of optimal matching method for tracking target based on weighted model constraint
CN111832393A (en) Video target detection method and device based on deep learning
CN107944354A (en) A kind of vehicle checking method based on deep learning
CN112084952B (en) Video point location tracking method based on self-supervision training
Zhang et al. Modeling long-and short-term temporal context for video object detection
CN111401207A (en) Human body action recognition method based on MARS depth feature extraction and enhancement
CN115147456A (en) Target tracking method based on time sequence adaptive convolution and attention mechanism
CN116503725A (en) Real-time detection method and device for infrared weak and small target
CN109993772B (en) Example level feature aggregation method based on space-time sampling
CN113936034B (en) Apparent motion combined weak and small moving object detection method combined with inter-frame optical flow
CN112070075B (en) Human body detection method based on collaborative regression
CN113609904B (en) Single-target tracking algorithm based on dynamic global information modeling and twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant