CN111783683A - Human body detection method based on feature balance and relationship enhancement - Google Patents


Info

Publication number: CN111783683A
Application number: CN202010634855.5A
Authority: CN (China)
Prior art keywords: human body, feature, scale, training, balance
Legal status: Withdrawn
Other languages: Chinese (zh)
Inventor: 安玉山
Current Assignee: Beijing Shizhen Intelligent Technology Co ltd
Original Assignee: Beijing Shizhen Intelligent Technology Co ltd
Application filed by Beijing Shizhen Intelligent Technology Co ltd
Priority to CN202010634855.5A
Publication of CN111783683A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 — Arrangements for image or video recognition or understanding
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components

Abstract

The invention discloses a human body detection method based on feature balance and relationship enhancement, applied to detecting human targets of different scenes at multiple scales. The method balances the feature information of each scale by fusing multi-scale feature information, and further strengthens feature expression by exploiting the implicit relationship between the background and the human body, giving better feature extraction and detection performance for human bodies of different scales and postures. Compared with the prior art, the method fuses and enhances the semantic information of multi-layer features, improves the model's ability to extract multi-scale human features, and copes with diverse human prediction tasks in different scenes. By using multiple pre-training stages and a balanced training-sampling technique, it needs fewer human training samples than other detection methods, improves the generalization and robustness of the model for human detection in different scenes, and is suitable for wide adoption.

Description

Human body detection method based on feature balance and relationship enhancement
Technical Field
The invention relates to the technical field of human body detection in computer vision, in particular to a human body detection method based on feature balance and relationship enhancement.
Background
Human body detection aims to determine whether a human target is present in an image scene and to give its position. Across application scenarios, the main difficulties of human detection are the diversity of human actions and postures, the complexity of the background, occlusion of humans by buildings and vehicles, mutual occlusion among humans, changes in illumination and viewing angle, and the diversity of human scales caused by shooting distance and scene changes. These difficulties pose many challenges to human detection research and reduce the detection accuracy of existing algorithms in diverse scenes.
At present, mainstream human detection methods treat the person as a whole, represent the human body with a rectangular box, extract traditional hand-crafted features such as wavelet features and histograms of oriented gradients, and then classify them with a classifier. In complex backgrounds, a conventional image pyramid or feature pyramid is used to fit human features at different scales and improve the detection accuracy of multi-scale human targets; however, for very small targets the boundary and appearance remain blurred, making them hard to distinguish from cluttered backgrounds and other overlapping targets. Dividing the torso into several parts and analysing or predicting the occlusion of each part can address occlusion, but this requires more complex annotation data and model inference, so the model cost is higher.
Given the shortcomings of the prior art, there is an urgent need for a human body detection method that addresses low detection accuracy, blurred boundaries and appearance of very small targets, high model cost, poor generalization and robustness across scenes, and the inability to cope with spoofing attacks such as copying and replaying.
Disclosure of Invention
In view of the above defects, the technical problem to be solved by the present invention is to provide a human body detection method based on feature balance and relationship enhancement, so as to solve the prior-art problems of low human detection accuracy in diverse scenes, blurred boundaries and appearance of very small targets, high model cost, poor generalization and robustness of the model for human detection in different scenes, and the inability to cope with spoofing attacks such as copying and replaying.
The invention provides a human body detection method based on feature balance and relationship enhancement, which comprises the following specific steps:
step 1, performing model pre-training on a detection model to obtain the detection model with better extraction capability and sensitivity on the characteristic features of a human body;
step 2, performing multi-scale feature fusion on the detection model to obtain a multi-scale feature pyramid;
step 3, enhancing the relation of image features of the image features after the multi-scale feature fusion to obtain fusion features after the relation enhancement;
step 4, performing multi-scale feature redistribution on the features of the detection model based on the feature result after the previous fusion and enhancement;
step 5, carrying out balanced sampling on the real sample frame data in the detection model by adopting a negative sample sampling and positive sample sampling method;
and 6, predicting and training the characteristics of the human bodies with different scales on different levels respectively by using a detection model according to the adjusted characteristic pyramid.
Preferably, the step 1 specifically comprises the following steps:
step 1.1, carrying out first-round pre-training on a detection model by adopting a huge universal object detection data set to obtain a detection model with higher generalization characteristic extraction capability;
step 1.2, after the first round of pre-training is finished, adjusting and detecting the top layer structure of the model;
and step 1.3, performing secondary pre-training by mixing a sample containing a human body target in a general scene to obtain a detection model with better extraction capability and sensitivity on the human body characteristic features.
Preferably, the step 2 specifically comprises the following steps:
2.1, selecting a proper intermediate scale, wherein the selection rule is: when the feature pyramid has n layers, the scale of the features at the ⌊n/2⌋-th layer (n/2 rounded down) is selected as the intermediate scale;
step 2.2, scaling the features of the detection model with bilinear interpolation to obtain a feature pyramid that retains as much of the information in the original features as possible;
2.3, simply stacking on the channel, and fusing to obtain a new characteristic diagram containing all levels of information;
and 2.4, compressing the number of channels of the new feature map to the number of channels before fusion by using an additional module to obtain a fused multi-scale feature pyramid.
Preferably, the specific steps of step 2.2 include:
step 2.2.1, setting relevant parameters, wherein the parameters comprise coefficients which need to be multiplied by the central value;
step 2.2.2, performing linear interpolation in one direction by using the grey levels of the four pixels adjacent to the pixel to be computed; since the grey level changes linearly from f(i, j) to f(i, j+1):
f(i, j+v) = [f(i, j+1) − f(i, j)] × v + f(i, j),
f(i+1, j+v) = [f(i+1, j+1) − f(i+1, j)] × v + f(i+1, j);
step 2.2.3, since the grey level also changes linearly from f(i, j+v) to f(i+1, j+v), obtaining the bilinearly interpolated pixel value:
f(i+u, j+v) = [f(i+1, j+v) − f(i, j+v)] × u + f(i, j+v).
preferably, the specific method for relationship enhancement in step 3 includes using a relationship metric function in a training process
Figure BDA0002567878860000032
It is derived that the relation between the 256 one-dimensional vectors H and the 256 one-dimensional vectors G measures the function value as:
Figure BDA0002567878860000033
further obtaining a relation-enhanced fused characteristic F'mComprises the following steps:
Figure BDA0002567878860000034
wherein α, β are parameters that can be learned in training, FmThe fused features are shown, the 256 one-dimensional vector H is the pooled features of the human instance, and the 256 one-dimensional vector G is the pooled features of a certain area around.
Preferably, the step 4 comprises:
step 4.1, performing the corresponding scaling operation on the fused and enhanced feature F′_m;
step 4.2, for the characteristic that the original scale is smaller than the intermediate scale after the fusion, adopting a pooling layer to carry out down-sampling;
4.3, performing up-sampling on the characteristics of which the original scale is larger than the intermediate scale after fusion by adopting bilinear interpolation;
and 4.4, adjusting the characteristics that the original scale is equal to the intermediate scale after fusion by adopting a convolution layer with unchanged size.
Preferably, the step 5 specifically comprises the following steps:
step 5.1, representing the training difficulty of a negative sample box by its intersection-over-union (IoU) with the real sample box, and applying an IoU-balanced negative sampling method to the sample data of the detection model;
and 5.2, measuring the representativeness of positive sampling by the number of positive samples matched to each real sample box, and applying an instance-balanced positive sampling method to the sample data of the detection model.
Preferably, the method in step 5.1 specifically divides all negative samples into K levels by using the label information of the image, and the number of negative samples randomly sampled at the k-th level is:
M_k = N / K,
wherein N is the total number of negative samples to be drawn; if the number N_k of negative samples at the k-th level is less than M_k, all negative samples at the k-th level are drawn, the negative samples at the (k+1)-th level are sorted by IoU, and the first M_k − N_k in ascending order are added as negative samples for the k-th level.
Preferably, the method in step 5.2 specifically assigns all positive samples to the P human body annotation boxes, and the number of positive samples randomly drawn around each annotation box is:
M / P,
wherein M is the number of positive samples to be drawn for training and P is the total number of human annotation boxes in the image.
Preferably, the step 6 is specifically:
6.1, in the training process, matching human bodies of different scales to a single optimal level each and training them there independently of one another;
and 6.2, in the prediction process, merging the prediction results of all levels with equal priority and obtaining the final prediction result with a single-class NMS algorithm.
According to the scheme, the human body detection method based on feature balance and relationship enhancement overcomes the shortcomings of existing general-purpose human detection techniques. The detection method, based on multi-scale feature fusion and fusion-feature enhancement, is applied to detecting human targets of different scenes at multiple scales; it balances the feature information of each scale by fusing multi-scale feature information and further strengthens feature expression by exploiting the implicit relationship between the background and the human body, giving better feature extraction and detection performance for human bodies of different scales and postures. Compared with the prior art, the method fuses and enhances the semantic information of multi-layer features, improves the model's ability to extract multi-scale human features, and copes with diverse human prediction tasks in different scenes. By using multiple pre-training stages and a balanced training-sampling technique, it needs fewer human training samples than other detection methods, improves the generalization and robustness of the model for human detection in different scenes, and is suitable for wide adoption.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is the first process block diagram of a human body detection method based on feature balance and relationship enhancement according to an embodiment of the present invention;
Fig. 2 is the second process block diagram of a human body detection method based on feature balance and relationship enhancement according to an embodiment of the present invention;
Fig. 3 is the third process block diagram of a human body detection method based on feature balance and relationship enhancement according to an embodiment of the present invention;
Fig. 4 is the fourth process block diagram of a human body detection method based on feature balance and relationship enhancement according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1 to 4, a specific embodiment of a human body detection method based on feature balance and relationship enhancement according to the present invention will now be described. The human body detection method based on feature balance and relationship enhancement balances feature information of each scale by fusing multi-scale feature information, and further enhances feature expression by using implicit relationship between a background and a human body, and the method specifically comprises the following steps:
and S1, performing model pre-training on the detection model to obtain the detection model with better extraction capability and sensitivity on the characteristic features of the human body.
Model pre-training: the same or a similar learning task is first carried out on a large-scale data set, so that all or part of the parameters of the model are initialized more reasonably.
This addresses the problem that the human samples in a single data set cover a single scene with low sample diversity, so that training on it directly causes overfitting and makes it hard to obtain a detection model with good generalization and robustness. By learning feature extraction for similar or other object classes during pre-training, the detection model avoids the pitfalls of a small or highly similar final training set, extracts features with better generalization and robustness, prevents overfitting, and achieves better performance.
The specific implementation steps of the step can be as follows:
s1.1, performing first-round pre-training on a detection model by adopting a huge universal object detection data set to obtain model parameters with better generalization performance, improving the feature extraction capability of the model, and finally obtaining the detection model with higher generalization feature extraction capability;
specific universal object detection datasets are, for example, PASCAL VOC, MS COCO and ImageNet.
S1.2, adjusting the top layer structure of the model after the first round of pre-training is finished;
the following are exemplary: when the number of categories of the general object detection data set is 80, the number of classification convolution channels at the top layer of the model is 80. At this time, the parameter value of the classified convolution is deleted, the number of the convolution channels is changed into the number of the classes required by human body detection, and the parameter value of the new convolution is initialized randomly.
S1.3, performing secondary pre-training by mixing a sample containing a human body target in a general scene, improving the extraction capability and sensitivity of the model to the human body characteristic features, and obtaining a detection model with better extraction capability and sensitivity to the human body characteristic features.
And S2, performing multi-scale feature fusion on the detection model, and obtaining a multi-scale feature pyramid based on the feature pyramid frame.
The feature pyramid is a basic component in a recognition system for detecting objects of different dimensions.
The specific implementation steps of the step can be as follows:
s2.1, selecting a proper intermediate scale, wherein the selection rule of the intermediate scale is as follows: when the number of the feature pyramid layers is n, selecting the scale of the feature of the first floor (n/2) (rounding down) layer as the middle scale;
s2.2, scaling by using bilinear interpolation to obtain a characteristic pyramid which retains as much information as possible in the original characteristics;
the step solves the problem that the influence caused by different feature sizes of the features of each layer is caused by the fact that the multi-scale feature pyramid obtained based on the feature pyramid frame, so that as much information as possible in the original features is reserved as much as possible.
The bilinear interpolation is used in the up-sampling process of images or features, the bilinear interpolation method has no defect of discontinuous gray scale, the result is satisfactory, and the specific calculation steps are as follows:
s2.2.1, setting relevant parameters, wherein the parameters comprise coefficients to be multiplied by the central value;
s2.2.2, linear interpolation is carried out in two directions by utilizing the gray scales of four adjacent pixels of the pixel to be solved, and the gray scale change from f (i, j) to f (i, j +1) is a linear relation, and the result is that:
for f (i, j + v), f (i, j + v) ═ f (i, j +1) -f (i, j) ] × v + f (i, j)
For (i +1, j + v), f (i +1, j + v) ═ f (i +1, j +1) -f (i +1, j) ] × v + f (i +1, j)
S2.2.3, obtaining the pixel gray value of bilinear interpolation according to the linear relationship of the gray change from f (i, j + v) to f (i +1, j + v):
Figure BDA0002567878860000071
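The two-step interpolation above can be sketched in Python (a minimal NumPy sketch; the function name `bilinear_sample` and the clamping of border indices are illustrative assumptions, not the patent's code):

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample img at fractional coordinates (y, x) = (i+u, j+v)
    by interpolating linearly along each axis in turn."""
    i, j = int(np.floor(y)), int(np.floor(x))
    u, v = y - i, x - j
    # clamp the far neighbours so border pixels stay in range
    i1 = min(i + 1, img.shape[0] - 1)
    j1 = min(j + 1, img.shape[1] - 1)
    # first interpolate along j on rows i and i+1 ...
    top = (img[i, j1] - img[i, j]) * v + img[i, j]
    bot = (img[i1, j1] - img[i1, j]) * v + img[i1, j]
    # ... then along i between the two intermediate values
    return (bot - top) * u + top
```

For the 2 × 2 image [[0, 1], [2, 3]], sampling at the centre (0.5, 0.5) averages all four pixels and returns 1.5.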
s2.3, simply overlapping on the channel, fusing to obtain a new feature map containing all levels of information, and not destroying the original semantic and spatial information of the zoomed features as far as possible;
and S2.4, compressing the number of channels of the new feature graph to the number of channels before fusion by using an additional module, and completing the whole feature fusion process.
The features are all three-dimensional matrices with dimensions (C, H, W), where C denotes the channel dimension; here C is 256, i.e. each feature has 256 channels.
Features can be fused in two ways; for example, to fuse 2 features: method one, add them element by element, which again yields a feature of size (C, H, W); method two, stack them along the channel dimension, which yields a feature of size (n × C, H, W), and then use an additional module (a 3 × 3 convolution plus a ReLU activation function) to convert the (n × C, H, W) feature back to size (C, H, W).
And S3, performing image feature relationship enhancement on the image features subjected to multi-scale feature fusion to obtain fusion features subjected to relationship enhancement.
Specifically, F_m denotes the fused feature. The correlation between the human instances in the image and the appearance of the surrounding background is exploited by a relationship metric function used in the training process,
R(H, G) = (H · G) / (‖H‖ ‖G‖),
which gives the relation metric value between the 256-dimensional vector H and the 256-dimensional vector G. The relation-enhanced fused feature F′_m is then:
F′_m = α · F_m + β · R(H, G) · F_m,
wherein α, β are parameters learned in training, the 256-dimensional vector H is the pooled feature of a human instance, and the 256-dimensional vector G is the pooled feature of a surrounding area.
The implicit relationship between the background and the human body is computed as follows: suppose the feature is F of size (C, H, W); the relationship value at position (h, w) is the average cosine similarity between F(h, w) (a C × 1 vector) and the 24 surrounding vectors F(h−2, w−2), F(h−2, w−1), …, F(h+2, w+2) in its 5 × 5 neighbourhood. When a surrounding vector does not exist, for example when h = 1, a zero vector is used instead.
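The neighbourhood-cosine computation above can be sketched as follows (a NumPy sketch; `relation_map` is a hypothetical name, and zero padding implements the zero-vector rule at the border):

```python
import numpy as np

def relation_map(F):
    """Relation value at each (h, w): mean cosine similarity between the
    C-dim vector F[:, h, w] and its surrounding vectors in a 5x5 window
    (out-of-range neighbours are treated as zero vectors via padding)."""
    C, H, W = F.shape
    Fp = np.pad(F, ((0, 0), (2, 2), (2, 2)))  # zero-pad spatial borders
    R = np.zeros((H, W))
    for h in range(H):
        for w in range(W):
            center = F[:, h, w]
            cn = np.linalg.norm(center)
            vals = []
            for dh in range(-2, 3):
                for dw in range(-2, 3):
                    if dh == 0 and dw == 0:
                        continue  # skip the centre vector itself
                    g = Fp[:, h + 2 + dh, w + 2 + dw]
                    gn = np.linalg.norm(g)
                    vals.append(0.0 if cn * gn == 0 else center @ g / (cn * gn))
            R[h, w] = np.mean(vals)
    return R
```

For an all-ones 3 × 3 × 3 feature, the centre position sees 8 identical in-range neighbours (cosine 1) and 16 zero-padded ones, so its relation value is 8/24.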
And S4, performing multi-scale feature reallocation on the features of the detection model based on the feature results after the previous fusion and enhancement.
The specific implementation steps of the step can be as follows:
s4.1, based on the size of the feature pyramid before the previous fusion, and according to the scale of each layer of features in the previous feature pyramid, fusing and enhancing the feature F'mCarrying out corresponding scaling operation;
s4.2, for the characteristic that the original scale is smaller than the intermediate scale after fusion, adopting a pooling layer to perform down-sampling;
s4.3, performing up-sampling on the characteristic that the original scale is larger than the intermediate scale after fusion by adopting bilinear interpolation;
in upsampling bilinear interpolation, an exemplary: when a certain pixel value of the original feature scale is f (i, j), the adjacent four pixel values of the (i, j), (2 × i +1, 2 × j), (2 × i, 2 × j +1), (2 × i +1, 2 × j +1) positions after upsampling are equal to f (i, j).
And S4.4, adjusting the characteristics that the original scale is equal to the intermediate scale after fusion by adopting a convolution layer with unchanged size.
And S5, carrying out balanced sampling on the real sample frame data in the detection model by adopting a negative sample sampling method and a positive sample sampling method.
Illustratively, of the 256 samples in each training, 128 negative samples and 128 positive samples are taken.
The specific implementation steps of the step can be as follows:
s5.1, representing the training difficulty of the negative sample box by using the size of the cross-over ratio (IOU) of the negative sample box and the real sample box, and carrying out a negative sample sampling method based on the cross-over ratio balance on the sample data of the detection model.
Specifically, the selection probability of the hard samples is increased to improve the representativeness of the sampled negative samples, and all the negative samples are used for detecting the accuracy of corresponding objects in a specific data set according to the IOU (interaction over Unit) measurement of a real human body frame by utilizing the marking information of the imageOne standard) into K levels, the number of negative samples randomly sampled at the K level is:
Figure BDA0002567878860000081
wherein N is the total number of negative samples to be extracted, if the negative samples of the k-th level appear, the number NkLess than MkIf yes, all negative samples in the kth level are extracted, the negative samples in the (k +1) th level are sorted according to IOU, and M is selected according to ascending orderk-NkThe negative examples are supplemented as negative examples for the kth level. The finally extracted negative samples have different IOU from the real frames and are distributed uniformly.
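The IoU-balanced negative sampling described above can be sketched as follows (a simplified Python sketch; the function name `iou_balanced_negatives`, the equal-width IoU binning, and the deficit carry-over into the next bin are illustrative assumptions):

```python
import random

def iou_balanced_negatives(negatives, N, K):
    """negatives: list of (sample_id, iou) pairs.  Split the IoU range
    into K equal bins, draw about N/K from each; if a bin is short,
    top up from the next bin in ascending IoU order."""
    random.seed(0)
    hi = max(iou for _, iou in negatives) + 1e-9
    bins = [[] for _ in range(K)]
    for s, iou in negatives:
        k = min(int(iou / hi * K), K - 1)
        bins[k].append((s, iou))
    per = N // K
    chosen, deficit = [], 0
    for k in range(K):
        want = per + deficit
        if len(bins[k]) <= want:
            chosen += bins[k]                 # take the whole short bin
            deficit = want - len(bins[k])     # carry shortfall forward
        else:
            pool = sorted(bins[k], key=lambda p: p[1])
            chosen += pool[:deficit]          # top-ups: ascending IoU first
            chosen += random.sample(pool[deficit:], per)
            deficit = 0
    return chosen
```

With N = 4 and K = 2, two easy negatives (low IoU) and two hard negatives (high IoU) are drawn, instead of the mostly easy negatives that uniform sampling tends to produce.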
S5.2, measuring the representativeness of positive sampling by the number of positive samples matched to each real sample box, and applying an instance-balanced positive sampling method to the sample data of the detection model.
Specifically, all positive samples are assigned to the P human body annotation boxes, and the number of positive samples randomly drawn around each annotation box is:
M / P,
wherein M is the number of positive samples to be drawn for training and P is the total number of human annotation boxes in the image.
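The instance-balanced positive sampling rule can be sketched as follows (a minimal Python sketch; `instance_balanced_positives` and its handling of boxes with too few matches are illustrative assumptions):

```python
import random

def instance_balanced_positives(pos_by_box, M):
    """pos_by_box: dict mapping each ground-truth box id to its list of
    matched positive samples.  Draw about M/P from each of the P boxes
    so that no single human instance dominates the positives."""
    random.seed(0)
    P = len(pos_by_box)
    per = max(M // P, 1)
    chosen = []
    for box, samples in pos_by_box.items():
        # a box with fewer than per matches contributes all it has
        chosen += random.sample(samples, min(per, len(samples)))
    return chosen
```

With M = 4 and two annotation boxes, two positives are drawn around each box, even if one box has many more candidate positives than the other.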
And S6, predicting and training the characteristics of the human body with different scales on different levels respectively by using a detection model according to the adjusted characteristic pyramid.
The specific implementation steps of the step can be as follows:
s6.1, in the training process, independently matching human bodies with different scales at a certain optimal level for training, wherein the human bodies are independent of each other;
the optimal level is related to anchor point frame design of the model, anchor point frames of different levels are different in size, after the size of the anchor point frames is compared with the size of a marking frame of a specific human body, the level with smaller frame size difference is selected as the optimal level used in training, and training is not carried out on other levels.
S6.2, during prediction, merging the prediction results of all levels with equal priority and obtaining the final result with a single-class NMS (non-maximum suppression) algorithm.
"Multi-layer result fusion with equal priority" means that the predictions of all levels are pooled together and sorted directly by prediction score; the final prediction keeps the highest-scoring boxes that survive suppression.
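The merge-then-suppress step can be sketched as follows (a generic single-class NMS sketch, not the patent's exact code; boxes are (x1, y1, x2, y2) and the 0.5 IoU threshold is an assumption):

```python
def nms(boxes, scores, iou_thr=0.5):
    """Single-class NMS: sort all merged predictions by score (equal
    priority across pyramid levels), greedily keep the best box and
    drop any remaining box overlapping it above iou_thr."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)  # highest remaining score wins
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep
```

Two heavily overlapping boxes collapse to the higher-scoring one, while a distant box is kept regardless of its score.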
The method is applied to detecting human targets of different scenes at multiple scales. It balances the feature information of each scale by fusing multi-scale feature information and further strengthens feature expression by exploiting the implicit relationship between the background and the human body, giving better feature extraction and detection performance for human bodies of different scales and postures. By fusing the semantic information of features at different depths with surrounding background information, the feature-balance and relationship-enhancement based method overcomes the shortcomings of existing human detection techniques, improves the robustness of human detection, and thus achieves better detection performance.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A human body detection method based on feature balance and relationship enhancement is characterized by comprising the following specific steps:
step 1, performing model pre-training on a detection model to obtain the detection model with better extraction capability and sensitivity on the characteristic features of a human body;
step 2, performing multi-scale feature fusion on the detection model to obtain a multi-scale feature pyramid;
step 3, performing relationship enhancement on the image features after the multi-scale feature fusion to obtain relationship-enhanced fusion features;
step 4, performing multi-scale feature redistribution on the features of the detection model based on the fused and relationship-enhanced features obtained above;
step 5, performing balanced sampling of the ground-truth sample box data in the detection model using negative-sample and positive-sample sampling methods;
step 6, according to the adjusted feature pyramid, separately predicting and training human body features of different scales at different levels with the detection model.
2. The human body detection method based on feature balance and relationship enhancement as claimed in claim 1, wherein the specific steps of step 1 include:
step 1.1, performing a first round of pre-training of the detection model on a large general object detection dataset to obtain a detection model with strong generalized feature extraction capability;
step 1.2, after the first round of pre-training is finished, adjusting the top-layer structure of the detection model;
step 1.3, performing a second round of pre-training on mixed samples containing human targets in general scenes to obtain a detection model with better extraction capability for, and sensitivity to, distinctive human body features.
3. The human body detection method based on feature balance and relationship enhancement as claimed in claim 2, wherein the step 2 comprises the following specific steps:
step 2.1, selecting a proper intermediate scale, wherein the selection rule of the intermediate scale is: when the feature pyramid has n layers, the scale of the features at the floor(n/2)-th layer (rounding down) is selected as the intermediate scale;
step 2.2, rescaling the feature maps of the detection model to the intermediate scale using bilinear interpolation, obtaining a feature pyramid that preserves as much information of the original features as possible;
step 2.3, stacking the rescaled features directly along the channel dimension and fusing them to obtain a new feature map containing information of all levels;
and 2.4, compressing the number of channels of the new feature map to the number of channels before fusion by using an additional module to obtain a fused multi-scale feature pyramid.
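The fusion procedure of claim 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a nearest-neighbour resize stands in for the bilinear interpolation of step 2.2, a fixed channel average stands in for the learned compression module of step 2.4, and the 0-indexed `levels[n // 2]` approximates the floor(n/2)-layer rule of step 2.1.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    # feat: (C, H, W); nearest-neighbour stand-in for the bilinear resize of step 2.2
    c, h, w = feat.shape
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return feat[:, ys][:, :, xs]

def fuse_pyramid(levels):
    """levels: list of (C, H_l, W_l) feature maps, fine to coarse.
    Returns the fused (C, H_m, W_m) map at the middle scale."""
    n = len(levels)
    mid = levels[n // 2]                                   # step 2.1: middle-scale level
    h, w = mid.shape[1:]
    rescaled = [resize_nearest(f, h, w) for f in levels]   # step 2.2: bring all to one scale
    stacked = np.concatenate(rescaled, axis=0)             # step 2.3: stack on channel axis
    # step 2.4: compress back to C channels; a fixed per-level average stands in
    # for the learned extra module the claim leaves unspecified.
    c = mid.shape[0]
    return stacked.reshape(n, c, h, w).mean(axis=0)
```

Each level contributes equally to every spatial location of the fused map, which is the "balance" the claim aims for.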
4. The human body detection method based on feature balance and relationship enhancement as claimed in claim 3, wherein the specific steps of step 2.2 include:
step 2.2.1, setting relevant parameters, wherein the relevant parameters comprise coefficients which need to be multiplied by the central value;
step 2.2.2, performing linear interpolation in two directions using the gray values of the four pixels adjacent to the pixel to be computed; from the linear variation of the gray level between f(i, j) and f(i, j + 1):
for (i, j + v): f(i, j + v) = [f(i, j + 1) − f(i, j)] × v + f(i, j),
for (i + 1, j + v): f(i + 1, j + v) = [f(i + 1, j + 1) − f(i + 1, j)] × v + f(i + 1, j);
step 2.2.3, since the gray level also varies linearly from f(i, j + v) to f(i + 1, j + v), the bilinearly interpolated pixel gray value is:
f(i + u, j + v) = [f(i + 1, j + v) − f(i, j + v)] × u + f(i, j + v) = (1 − u)(1 − v)f(i, j) + (1 − u)v·f(i, j + 1) + u(1 − v)f(i + 1, j) + uv·f(i + 1, j + 1).
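The interpolation of steps 2.2.2 and 2.2.3 can be checked with a few lines of code; the helper name `bilinear_sample` is hypothetical:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample img at fractional coordinates (y, x) = (i + u, j + v):
    two horizontal linear interpolations (step 2.2.2) followed by one
    vertical interpolation between the intermediate values (step 2.2.3)."""
    i, j = int(np.floor(y)), int(np.floor(x))
    u, v = y - i, x - j
    f = img.astype(float)
    top = (f[i, j + 1] - f[i, j]) * v + f[i, j]              # f(i, j+v)
    bot = (f[i + 1, j + 1] - f[i + 1, j]) * v + f[i + 1, j]  # f(i+1, j+v)
    return (bot - top) * u + top                             # f(i+u, j+v)
```

At the center of a 2x2 patch the result is simply the mean of the four neighbours, a quick sanity check of the formula.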
5. the human body detection method based on feature balance and relationship enhancement as claimed in claim 4, wherein the specific method of relationship enhancement in step 3 comprises using a relationship metric function in a training process
Figure FDA0002567878850000022
it is derived that the value of the relation metric function between the 256-dimensional vector H and the 256-dimensional vector G is:
Figure FDA0002567878850000023
and further the relationship-enhanced fusion feature Fm′ is obtained as:
Figure FDA0002567878850000024
wherein α and β are parameters learnable during training, Fm denotes the fused feature, the 256-dimensional vector H is the pooled feature of a human instance, and the 256-dimensional vector G is the pooled feature of a certain surrounding region.
6. The human body detection method based on feature balance and relationship enhancement as claimed in claim 5, wherein the step 4 comprises:
step 4.1, performing the corresponding scaling operations on the fused and enhanced feature Fm′;
step 4.2, down-sampling, with a pooling layer, the features whose original scale is smaller than the intermediate scale after fusion;
step 4.3, up-sampling, with bilinear interpolation, the features whose original scale is larger than the intermediate scale after fusion;
step 4.4, adjusting, with a size-preserving convolution layer, the features whose original scale is equal to the intermediate scale after fusion.
7. The human body detection method based on feature balance and relationship enhancement as claimed in claim 6, wherein the specific steps of the step 5 include:
step 5.1, using the intersection-over-union (IOU) between a negative sample box and the ground-truth sample box to represent the training difficulty of the negative sample, and applying an IOU-balanced negative sample sampling method to the sample data of the detection model;
step 5.2, measuring the representativeness of positive sampling by the number of positive samples matched to each ground-truth sample box, and applying an instance-balanced positive sample sampling method to the sample data of the detection model.
8. The human body detection method based on feature balance and relationship enhancement as claimed in claim 7, wherein the method of step 5.1 specifically comprises dividing all negative samples into K levels using the annotation information of the image, the number of negative samples randomly sampled at the k-th level being:
Mk = N/K,
wherein N is the total number of negative samples to be drawn; if the number Nk of negative samples appearing at the k-th level is less than Mk, all negative samples at the k-th level are extracted, the negative samples at the (k + 1)-th level are sorted by IOU, and the first Mk − Nk in ascending order are selected to supplement the negative samples of the k-th level.
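Claim 8's IOU-balanced negative sampling might look like the following sketch; the function name, the uniform bin edges over [0, 1), and the exact backfill bookkeeping are assumptions beyond what the claim states:

```python
import random

def iou_balanced_negatives(neg_ious, n_total, k_bins):
    """Split negatives into K equal IOU bins and draw N/K from each;
    when a bin underflows, backfill from the next bin in ascending IOU order."""
    per_bin = n_total // k_bins                  # Mk = N/K
    bins = [[] for _ in range(k_bins)]
    for idx, iou in enumerate(neg_ious):
        b = min(int(iou * k_bins), k_bins - 1)   # assumed uniform bins over [0, 1)
        bins[b].append(idx)
    chosen = []
    for k in range(k_bins):
        if len(bins[k]) >= per_bin:
            chosen += random.sample(bins[k], per_bin)
        else:
            chosen += bins[k]                    # take all Nk < Mk samples
            deficit = per_bin - len(bins[k])
            if k + 1 < k_bins:
                # supplement with the Mk - Nk lowest-IOU samples of the next bin
                pool = sorted(bins[k + 1], key=lambda i: neg_ious[i])
                take = pool[:deficit]
                chosen += take
                bins[k + 1] = [i for i in bins[k + 1] if i not in take]
    return chosen
```

The effect is that easy negatives (low IOU) cannot crowd out the rarer hard negatives (high IOU) during training.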
9. The human body detection method based on feature balance and relationship enhancement according to claim 8, wherein the method of step 5.2 specifically comprises: all positive samples correspond to P human body annotation boxes, and the number of positive samples randomly drawn around each annotation box is:
M/P,
wherein M is the number of positive samples to be drawn for training and P is the total number of human body annotation boxes in the image.
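Claim 9's instance-balanced positive sampling is simpler; a hypothetical sketch (the function name and the per-box candidate-list input format are assumptions):

```python
import random

def instance_balanced_positives(pos_by_box, m_total):
    """pos_by_box: list of P candidate lists, one per ground-truth box.
    Draw M/P positives around each box so no single instance dominates."""
    p = len(pos_by_box)
    per_box = m_total // p                       # M/P per annotation box
    chosen = []
    for cands in pos_by_box:
        chosen += random.sample(cands, min(per_box, len(cands)))
    return chosen
```

This keeps heavily-matched large instances from dominating the positive set at the expense of small or occluded ones.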
10. The human body detection method based on feature balance and relationship enhancement according to claim 9, wherein the step 6 is specifically:
step 6.1, during training, matching human bodies of different scales each to a single optimal level for training, independently of one another;
step 6.2, during prediction, sorting the prediction results of all levels together with equal priority by means of multilayer result fusion, and obtaining the final prediction result with a single-class NMS algorithm.
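Step 6.2 ends with a single-class NMS over the pooled predictions of all levels; a minimal greedy sketch, with the (x1, y1, x2, y2) box format assumed:

```python
def nms_single_class(boxes, scores, iou_thr=0.5):
    """Merge the boxes of every pyramid level into one pool, sort by score
    with equal priority across levels, then greedily keep boxes that do not
    overlap an already-kept box by more than iou_thr."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter) if inter else 0.0
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```

Because the pool is single-class (human only), one pass over the sorted list suffices; no per-class partitioning is needed.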
CN202010634855.5A 2020-07-03 2020-07-03 Human body detection method based on feature balance and relationship enhancement Withdrawn CN111783683A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010634855.5A CN111783683A (en) 2020-07-03 2020-07-03 Human body detection method based on feature balance and relationship enhancement


Publications (1)

Publication Number Publication Date
CN111783683A true CN111783683A (en) 2020-10-16

Family

ID=72758523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010634855.5A Withdrawn CN111783683A (en) 2020-07-03 2020-07-03 Human body detection method based on feature balance and relationship enhancement

Country Status (1)

Country Link
CN (1) CN111783683A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium


Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN112287860B (en) Training method and device of object recognition model, and object recognition method and system
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN111461213A (en) Training method of target detection model and target rapid detection method
CN115019201B (en) Weak and small target detection method based on feature refinement depth network
CN112287859A (en) Object recognition method, device and system, computer readable storage medium
CN112395962A (en) Data augmentation method and device, and object identification method and system
CN114037640A (en) Image generation method and device
CN113095152A (en) Lane line detection method and system based on regression
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
Barodi et al. An enhanced artificial intelligence-based approach applied to vehicular traffic signs detection and road safety enhancement
WO2023284255A1 (en) Systems and methods for processing images
Pervej et al. Real-time computer vision-based bangla vehicle license plate recognition using contour analysis and prediction algorithm
CN111783683A (en) Human body detection method based on feature balance and relationship enhancement
CN115861595A (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN114359955A (en) Object visual field estimation method based on appearance features and space constraints
CN117237830B (en) Unmanned aerial vehicle small target detection method based on dynamic self-adaptive channel attention
CN115761552B (en) Target detection method, device and medium for unmanned aerial vehicle carrying platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201016