CN114677707A - Human body posture estimation method based on multi-resolution feature fusion network - Google Patents

Human body posture estimation method based on multi-resolution feature fusion network

Info

Publication number
CN114677707A
CN114677707A
Authority
CN
China
Prior art keywords
layer
human body
convolution
resolution
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210262826.XA
Other languages
Chinese (zh)
Inventor
段维柏
吴小杰
王啸
周奂斌
王超
伍腾飞
黄海徽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd
Original Assignee
Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd filed Critical Hubei Sanjiang Aerospace Wanfeng Technology Development Co Ltd
Priority to CN202210262826.XA priority Critical patent/CN114677707A/en
Publication of CN114677707A publication Critical patent/CN114677707A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a human body posture estimation method based on a multi-resolution feature fusion network. The method mainly comprises: constructing a multi-resolution feature fusion network model having a plurality of feature extraction network branches in a parallel network structure, acquiring image features with different resolutions, and obtaining a corresponding single-person posture estimation result. A multi-receptive-field fusion module is embedded in the multi-resolution feature fusion network model; the module comprises a plurality of convolution kernels with different sizes arranged in parallel. The image features are subjected to a convolution operation by the first layer of convolution kernels to extract the first layer of receptive field features; each subsequent layer of convolution kernels respectively obtains the image features and the receptive field features output by the previous layer of convolution kernels, and the receptive field features output by each layer of convolution kernels are fused. The invention adopts a multi-receptive-field fusion module, makes full use of context information, enhances the discriminability of each resolution feature, realizes the detection of "difficult key points" and improves the detection accuracy.

Description

Human body posture estimation method based on multi-resolution feature fusion network
Technical Field
The application relates to the technical field of deep learning and computer vision, in particular to a human body posture estimation method based on a multi-resolution feature fusion network.
Background
Human Pose Estimation (HPE), as one of the most challenging problems in the field of computer vision, has long been a focus of researchers. How to let a computer understand human behavior is a very important research problem, and human posture estimation can be regarded as a basic task of this problem.
The goal of human body posture estimation is to accurately detect key points of human body parts such as the wrist, elbow and shoulder from a given image, and to connect the detected key points according to the structure of the human skeleton to form the posture of the human body. Deep-learning-based human posture estimation follows two ideas. One is the Top-down idea: all human instance boxes in the input image are first detected, and the posture of the human body in each box is then estimated independently. The other is the Bottom-up idea: all key points of all human instances contained in the input image are detected at once, and the detected key points are then assigned to their corresponding human instances by some grouping algorithm.
For human posture estimation following the Top-down idea, the main task is to accurately locate the positions of the key points. Therefore, from the feature perspective, the two key problems in achieving high-accuracy human posture estimation are: (1) key-point localization places high demands on spatial sensitivity; (2) human postures are complex and varied, so key points are difficult to distinguish.
Regarding key problem (1), the high demand of key-point localization on spatial sensitivity is one of the challenges of the human body posture estimation task. With the rise of deep learning, various deep convolutional neural networks have been used for image feature extraction; most of these networks are based on the structure of a traditional classification network and extract deep semantic information from the image for the corresponding task. Much spatial information is lost in this process, and the resulting low-resolution feature map is not suitable for the human body posture estimation task.
Regarding key problem (2), for images of difficult scenes such as complex backgrounds, complex human motions or occlusion (which contain "difficult key points"), human body posture estimation is a hard task. Most existing methods improve the detection accuracy of "difficult key points" through multi-scale feature fusion or by training with corresponding additional data sets. Multi-scale feature fusion, while making the learned features richer, may lack context information as guidance. Adding additional data sets is expensive and consumes large amounts of resources (annotating data sets is time-consuming and labor-intensive).
Human body posture estimation is widely applied in many fields, including human-computer interaction, behavior recognition and video surveillance, all of which are closely related to people's daily life. With the development of science and technology, people's demands on quality of life keep increasing, so research on human posture estimation has good prospects. For a long period of development, the performance of human posture estimation algorithms could not reach the level required for practical application, but the emergence of deep learning technology changed this situation. Current methods mainly improve existing network models, for example by migrating networks from classification tasks to the human body posture estimation task. However, classification networks generally ignore the importance of spatial information, while human posture estimation places very high requirements on the accurate localization of each joint point.
Disclosure of Invention
In view of at least one defect or improvement requirement of the prior art, the present invention provides a human body posture estimation method based on a multi-resolution feature fusion network, which aims to improve the detection accuracy of human body key points (including "difficult key points") in an image of a complex scene.
In order to achieve the above object, according to an aspect of the present invention, there is provided a human body posture estimation method based on a multi-resolution feature fusion network, including the following steps:
Acquiring an image to be detected and inputting the image to be detected into a human body detection network model to acquire a suggestion frame of each human body example in the image to be detected;
inputting each suggestion box into a trained multi-resolution feature fusion network model respectively to obtain a corresponding single posture estimation result;
the multi-resolution feature fusion network model is provided with a plurality of feature extraction network branches forming a parallel network structure, the feature extraction network branches being respectively used for acquiring image features with different resolutions in a suggestion frame, and one or more multi-receptive-field fusion modules being embedded in each feature extraction network branch; posture estimation is carried out based on the feature obtained by fusing the image features with different resolutions, so as to obtain a single-person posture estimation result;
each multi-receptive-field fusion module comprises a plurality of convolution kernels with different sizes, the convolution kernels of the first layer carry out convolution operation on image features, and the first layer of receptive field features are extracted; the next layer of convolution kernel respectively acquires the image characteristics and the receptive field characteristics output by the previous layer of convolution kernel and carries out convolution operation; and fusing the receptive field characteristics output by each layer of convolution kernel.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the multi-receptive field fusion module includes four layers of convolution kernels, which are convolution kernels with sizes of 1 × 1, 3 × 3, 5 × 5 and 7 × 7 respectively, and are used for extracting the receptive field features with corresponding sizes respectively.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the multi-receptive-field fusion module further includes a preceding 1 × 1 convolution kernel for obtaining the input image features, adjusting the number of their channels, and outputting the image features with the adjusted number of channels to each layer of convolution kernels.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the multi-receptive field fusion module further includes a post-positioned 1 × 1 convolution kernel for obtaining the receptive field features output by each layer of convolution kernels and restoring to the original channel number.
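The channel bookkeeping implied by the preceding and succeeding 1 × 1 kernels can be sanity-checked with a short sketch. The four-way split into quarter-sized channel groups follows the embodiment described later in the specification; the function itself is illustrative only, not part of the patent:

```python
def mrf_channel_flow(c_in):
    """Channels through the module: compress, four kernel layers, concat, restore."""
    c_branch = c_in // 4                # preceding 1x1 conv: C -> C/4
    branch_channels = [c_branch] * 4    # 1x1, 3x3, 5x5, 7x7 kernel layers
    c_concat = sum(branch_channels)     # channel-wise concat of Y1..Y4 -> C
    c_out = c_in                        # succeeding 1x1 conv restores C
    return branch_channels, c_concat, c_out
```

The concat of the four quarter-width branches lands exactly back on the original channel count, which is what lets the module fuse its output with the input feature element-wise.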
Furthermore, in the human body posture estimation method based on the multi-resolution feature fusion network, in a plurality of feature extraction network branches of the multi-resolution feature fusion network model,
a first layer of feature extraction network branches to obtain a suggestion frame of each human body example, and image features with preset resolution are extracted and maintained from the suggestion frame;
and each subsequent layer of feature extraction network branch downsamples the image features maintained by the previous layer of feature extraction network branch; after multi-receptive-field fusion, the generated image features are restored to the image features with the preset resolution through upsampling and are fused with the image features maintained by the first layer of feature extraction network branch.
Further, in the above method for estimating a human body pose based on a multi-resolution feature fusion network, the first layer of feature extraction network branches includes a convolution module, and the other layers of feature extraction network branches include a convolution module and a down-sampling module;
the convolution module consists of a convolution layer, a batch normalization (BN) layer and an activation function; the down-sampling module consists of a strided convolution layer, a BN layer and an activation function.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the training process of the multi-resolution feature fusion network model is as follows:
acquiring a training image sample;
inputting training image samples into a human body detection network model to obtain a suggestion box of each human body example in an image;
cutting and zooming the suggestion box of each human body example, and inputting the cut and zoomed suggestion box into a multi-resolution feature fusion network model for single posture estimation;
and calculating a target loss function according to the prediction result of the single attitude estimation and the label truth value of the training image sample, and performing iterative training on the multi-resolution feature fusion network model based on the target loss function to obtain the trained multi-resolution feature fusion network model.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the target loss function is the mean squared error (MSE) function.
Further, in the human body posture estimation method based on the multi-resolution characteristic fusion network, the human body detection network model adopts a Faster R-CNN network model.
Further, in the human body posture estimation method based on the multi-resolution feature fusion network, the training and testing of the multi-resolution feature fusion network model are completed under the deep learning PyTorch framework.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The invention provides a human body posture estimation method based on a multi-resolution feature fusion network. Through a plurality of convolution kernels with different sizes arranged in parallel, the first layer of convolution kernels performs a convolution operation on the image features and extracts the first layer of receptive field features; each subsequent layer of convolution kernels respectively acquires the image features and the receptive field features output by the previous layer and performs a convolution operation, and the receptive field features of different ranges are fused to enrich the information content of the learned features. The process of multi-receptive-field fusion makes full use of context information, enhances the discriminability of each resolution feature, realizes the detection of "difficult key points" in difficult scenes such as complex backgrounds, complex human body motions or occlusion, and improves the detection accuracy;
(2) The invention provides a human body posture estimation method based on a multi-resolution feature fusion network in which a plurality of feature extraction network branches form a parallel network structure. The first layer of feature extraction network branch acquires the suggestion frame of each human body instance and extracts and maintains image features with a preset resolution from it, preserving shallow spatial information; each subsequent layer of feature extraction network branch downsamples the image features maintained by the previous layer, and the generated image features, after multi-receptive-field fusion, are restored to the image features with the preset resolution through upsampling and fused with the image features maintained by the first layer of feature extraction network branch, thereby enhancing the expressive capability of semantic information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic flow chart of human body pose estimation for the top-down approach of the present embodiment;
FIG. 2 is a flowchart of a human body posture estimation method based on a multi-resolution feature fusion network according to the present embodiment;
FIG. 3 is a block diagram of the multi-receptive-field fusion module according to the present embodiment;
FIG. 4 is a diagram illustrating a multi-resolution feature fusion network according to the present embodiment;
fig. 5 is a flowchart of a multi-resolution feature fusion network model training process in this embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In other instances, well-known or widely used techniques, elements, structures and processes may not be described or shown in detail in order to avoid obscuring the understanding of the invention by the skilled artisan. Although the drawings represent exemplary embodiments of the present invention, the drawings are not necessarily to scale and certain features may be exaggerated or omitted in order to better illustrate and explain the present invention.
The application provides a human body posture estimation method based on the top-down idea. Fig. 1 is a schematic flow chart of top-down human body posture estimation according to the embodiment of the present invention. Referring to fig. 1, top-down human body posture estimation first uses a human body detector to detect all human body instances in the input image, then crops each detected human body instance into a single-person image, and sends it into a single-person posture estimation network to obtain the posture estimation results of all human body instances. In particular, the human body detector may use an existing advanced object detection network.
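The top-down pipeline above can be sketched in a few lines. Here `detect` and `estimate` stand in for the human body detector (e.g. Faster R-CNN) and the single-person posture network, the nearest-neighbour resize is a simplification of the crop-and-zoom step, and the 256 × 192 input size is a commonly used assumption rather than a value specified by the patent:

```python
import numpy as np

def crop_and_resize(image, box, out_hw):
    """Crop an instance box (x0, y0, x1, y1) and resize it to out_hw
    with nearest-neighbour sampling (a real pipeline would use bilinear)."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    oh, ow = out_hw
    rows = np.arange(oh) * h // oh
    cols = np.arange(ow) * w // ow
    return crop[np.ix_(rows, cols)]

def top_down_pose(image, detect, estimate, in_hw=(256, 192)):
    """Detect every human instance, crop it, and run single-person estimation."""
    return [estimate(crop_and_resize(image, box, in_hw)) for box in detect(image)]
```

Any detector that returns instance boxes and any single-person network can be plugged into the two callables, which is exactly the modularity the top-down idea relies on.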
Fig. 2 is a diagram illustrating a human body posture estimation method based on a multi-resolution feature fusion network according to an embodiment of the present invention, referring to fig. 2, the method includes the following steps;
(1) acquiring an image to be detected and inputting the image to be detected into a human body detection network model to acquire a suggestion frame of each human body example in the image to be detected;
In one specific example, the human detection network model employs the Faster R-CNN network model.
(2) Inputting each suggestion box into a trained multi-resolution feature fusion network model respectively to obtain a corresponding single posture estimation result;
the multi-resolution feature fusion network model is provided with a plurality of feature extraction network branches forming a parallel network structure, the feature extraction network branches being respectively used for acquiring image features with different resolutions in a suggestion frame, and one or more multi-receptive-field fusion modules being embedded in each feature extraction network branch; posture estimation is carried out based on the feature obtained by fusing the image features with different resolutions, so as to obtain a single-person posture estimation result;
each multi-receptive-field fusion module comprises a plurality of convolution kernels with different sizes, the convolution kernels of the first layer carry out convolution operation on image features, and the first layer of receptive field features are extracted; the next layer of convolution kernel respectively acquires the image characteristics and the receptive field characteristics output by the previous layer of convolution kernel and carries out convolution operation; and fusing the receptive field characteristics output by each layer of convolution kernel.
Fig. 3 is a structural diagram of the multi-receptive-field fusion module in this embodiment. Referring to fig. 3, the multi-receptive-field fusion module includes four layers of convolution kernels, with sizes of 1 × 1, 3 × 3, 5 × 5 and 7 × 7 respectively, each extracting the receptive field features of the corresponding size. The module is described mathematically by the following group of expressions:
X1 = X2 = X3 = X4 = f(X) (1)

Yi = Fi(Xi), i = 1; Yi = Fi(Xi + Yi-1), i = 2, 3, 4 (2)

Y = concat[Y1; Y2; Y3; Y4] (3)

Z = F(X + Y) (4)

where f(X) and F denote the convolution layer functions of the preceding and succeeding 1 × 1 convolution kernels; X1, X2, X3, X4 are the image feature branches output after the preceding 1 × 1 convolution operation; Fi are the convolution layer composite functions corresponding to the convolution kernels of the four sizes 1 × 1, 3 × 3, 5 × 5 and 7 × 7; Yi is the receptive field feature output by the i-th layer; concat denotes channel-wise merging of the features; and Z is the fused output.
Further, the fusion process of the multiple receptive fields is as follows. The resolution feature images in the feature extraction network branches of the parallel network structure are input to the multi-receptive-field fusion module. To reduce the amount of computation, a 1 × 1 convolution kernel operation is first performed on the original input feature X to compress the number of channels to one quarter of the original, and the resulting feature is divided into four branches X1, X2, X3, X4 that are fed to the respective layers of convolution kernels. The first layer of convolution kernels performs a 1 × 1 convolution operation on the image feature X1 and outputs the first-layer receptive field feature Y1; the second layer takes the image feature X2 and the receptive field feature Y1 output by the previous layer, performs a 3 × 3 convolution operation, and outputs the second-layer receptive field feature Y2; the third layer takes the image feature X3 and the receptive field feature Y2 output by the previous layer, performs a 5 × 5 convolution operation, and outputs the third-layer receptive field feature Y3; the fourth layer takes the image feature X4 and the receptive field feature Y3 output by the previous layer, performs a 7 × 7 convolution operation, and outputs the fourth-layer receptive field feature Y4.
The receptive field features Y1, Y2, Y3, Y4 output by each layer are merged channel-wise through a concat operation to obtain the receptive field feature Y, which is fused with the original input feature X to enhance feature discriminability. A 1 × 1 convolution kernel operation is then performed to increase the learnable parameters of the network and restore the number of channels to that of the original input feature X. Finally, the number of output feature channels is reduced to the number of human body key points to be detected. In this embodiment, an element-wise add operation is used to fuse the original input feature X and the receptive field feature Y; other fusion methods may also be used, and this embodiment is not specifically limited in this respect.
Fig. 4 is a structural diagram of a multiresolution feature fusion network in this embodiment, please refer to fig. 4, the multiresolution feature fusion network has a plurality of feature extraction network branches forming a parallel network structure, the first layer of feature extraction network branches obtains a suggestion box of each human body instance, and extracts and maintains image features with a preset resolution therefrom; and the next layer of feature extraction network branch performs down-sampling on the image features maintained by the previous layer of feature extraction network branch, and the generated image features are restored to the image features with the first resolution through up-sampling after being fused with multiple sensing fields and are fused with the image features maintained by the first layer of feature extraction network branch.
In a specific embodiment, the first layer of feature extraction network branch extracts and maintains image features at 1/4 of the original image resolution; each subsequent feature extraction network branch downsamples the image features extracted by the previous branch, yielding image features at 1/8, 1/16 and 1/32 of the original image resolution in turn. Because spatial information is lost as the feature resolution decreases, each lower branch downsamples the 1/4-resolution image features maintained by the branch above it, and the generated image features, after multi-receptive-field fusion, are restored to the first resolution through upsampling. The multi-receptive-field fusion modules are embedded in the feature extraction network branches; after the resolution features pass through these modules, context information is fully utilized and the discriminability of each resolution feature is enhanced. Finally, the multi-resolution feature image, after passing through a multi-receptive-field fusion module, predicts the key-point position information in the form of heat maps, yielding the human body posture predicted by the network. In this embodiment, the upper branch includes a convolution module, and the lower branches include a convolution module and a down-sampling module. The convolution module consists of a convolution layer (Conv), a batch normalization (BN) layer and an activation function; the down-sampling module consists of a strided convolution layer, a BN layer and an activation function.
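Under the resolution schedule just described, and assuming a network input whose sides are divisible by 32, the feature sizes held by the four branches can be computed directly (the 256 × 192 example input is an assumed, commonly used crop size, not one fixed by the patent):

```python
def branch_resolutions(h, w, n_branches=4):
    """Branch i keeps features at 1/4, 1/8, 1/16, 1/32 of the input resolution."""
    return [(h // (4 * 2 ** i), w // (4 * 2 ** i)) for i in range(n_branches)]

# e.g. for a 256 x 192 network input:
pyramid = branch_resolutions(256, 192)
```

Each step halves both sides, so every lower branch trades spatial detail for a larger effective receptive field before its features are upsampled back and fused.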
In this embodiment, the upsampling uses a linear interpolation method; methods such as transposed convolution and dilated convolution may also be used, without specific limitation.
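A minimal sketch of linear-interpolation upsampling for a single-channel feature map follows. The index mapping is written in an align-corners style; the exact interpolation variant used by the embodiment is not specified, so this is one plausible reading:

```python
import numpy as np

def bilinear_upsample(x, scale=2):
    """Upsample a 2D array by `scale` using bilinear interpolation."""
    H, W = x.shape
    rows = np.linspace(0, H - 1, H * scale)
    cols = np.linspace(0, W - 1, W * scale)
    r0 = np.floor(rows).astype(int); r1 = np.minimum(r0 + 1, H - 1)
    c0 = np.floor(cols).astype(int); c1 = np.minimum(c0 + 1, W - 1)
    fr = (rows - r0)[:, None]   # fractional row offsets
    fc = (cols - c0)[None, :]   # fractional column offsets
    top = x[np.ix_(r0, c0)] * (1 - fc) + x[np.ix_(r0, c1)] * fc
    bot = x[np.ix_(r1, c0)] * (1 - fc) + x[np.ix_(r1, c1)] * fc
    return top * (1 - fr) + bot * fr
```

Because the interpolation is a convex combination of the four neighbouring values, a constant feature map stays constant after upsampling, which is a quick correctness check.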
Fig. 5 is a flowchart of a training process of the multi-resolution feature fusion network model in this embodiment, please refer to fig. 5, where the training process includes the following steps:
(1) acquiring a training image sample;
further, the training image sample is a sample image carrying the labeling information;
the COCO dataset contains over 200000 pictures, which collectively contain 250000 human instances. For the complete human body example, the label contains information for 17 key points. The model was trained on the COCO training set. The training set included 57000 pictures that collectively contained 150000 personal instances. After model training, the model was evaluated on a COCO validation set containing 5000 pictures and a test set containing 20000 pictures.
The MPII human pose estimation dataset collects pictures from various activities in the real world, approximately 25000 pictures, which contain 40000 human instances in total. For the complete human body example, the full body posture label is contained, and the label contains 16 key point information of the full body. The model was trained on the MPII training set, which contained about 28000 human instances. After model training, the models were evaluated on an MPII test set containing 12000 human instances.
In this embodiment, the training image samples and the test image samples are selected from the COCO and MPII data sets. Of course, other data sets may be used, and the embodiment is not limited in particular.
(2) Inputting the training image samples into a human body detection network model to obtain a suggestion box of each human body example in the image;
in one particular example the human detection network model employs the Faster R-CNN network model.
(3) Cutting and zooming the suggestion box of each human body example, and inputting the cut and zoomed suggestion box into a multi-resolution feature fusion network model for single posture estimation;
(4) and calculating a target loss function according to the prediction result of the single attitude estimation and the label truth value of the training image sample, and performing iterative training on the multi-resolution feature fusion network model based on the target loss function to obtain the trained multi-resolution feature fusion network model.
In this embodiment, the set target loss function is the mean squared error (MSE) function; other functions may also be used, and this embodiment is not limited in this respect. The training and testing of the multi-resolution feature fusion network model are completed under the deep learning PyTorch framework.
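The heat-map supervision described above can be sketched as follows: a Gaussian target heatmap per labelled key point (a common convention in heat-map-based pose estimation, assumed here rather than detailed by the patent), the MSE objective, and argmax decoding of predicted heatmaps back to coordinates:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Target heatmap: a 2D Gaussian centred on the labelled key point (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def mse_loss(pred, target):
    """Mean squared error between predicted and target heatmaps."""
    return np.mean((pred - target) ** 2)

def decode_heatmaps(hm):
    """Argmax decoding: (K, H, W) heatmaps -> (K, 2) array of (x, y) positions."""
    K, H, W = hm.shape
    flat = hm.reshape(K, -1).argmax(axis=1)
    return np.stack([flat % W, flat // W], axis=1)
```

During training the loss is averaged over all K key-point heatmaps; a perfect prediction gives zero loss, and decoding it recovers the annotated coordinates exactly.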
According to the human body posture estimation method based on the multi-resolution feature fusion network, through the mode that a plurality of convolution kernels with different sizes are arranged in parallel, the first layer of convolution kernels carries out convolution operation on image features, and the first layer of receptive field features are extracted; the next layer of convolution kernel respectively acquires image characteristics and the receptive field characteristics output by the previous layer of convolution kernel to carry out convolution operation, and the receptive field characteristics in different ranges are fused to enhance the information content of the learning characteristics; the multi-resolution feature fusion enables the network to have rich space and semantic information, the process of multi-receptive field fusion fully utilizes context information, the discriminativity of each resolution feature is enhanced, the detection of 'difficult key points' is realized under the difficult scenes of complex background, complex human body action or shielding and the like, and the detection precision is improved.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A human body posture estimation method based on a multi-resolution feature fusion network, characterized by comprising the following steps:
acquiring an image to be detected and inputting it into a human body detection network model to obtain a proposal box for each human body instance in the image to be detected;
inputting each proposal box into a trained multi-resolution feature fusion network model to obtain a corresponding single-person pose estimation result;
wherein the multi-resolution feature fusion network model comprises a plurality of feature extraction network branches forming a parallel network structure, the feature extraction network branches respectively extract image features of different resolutions from the proposal box, and one or more multi-receptive-field fusion modules are embedded in each feature extraction network branch; pose estimation is performed on the feature obtained by fusing the image features of different resolutions to obtain the single-person pose estimation result;
each multi-receptive-field fusion module comprises a plurality of convolution kernels of different sizes arranged in parallel; the first layer of convolution kernels performs a convolution operation on the image features to extract first-layer receptive-field features; each subsequent layer of convolution kernels acquires both the image features and the receptive-field features output by the previous layer of convolution kernels and performs a convolution operation; and the receptive-field features output by each layer of convolution kernels are fused.
2. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein the multi-receptive-field fusion module comprises four layers of convolution kernels of sizes 1×1, 3×3, 5×5 and 7×7, respectively, for extracting receptive-field features of the corresponding scales.
3. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1 or 2, wherein the multi-receptive-field fusion module further comprises a preceding 1×1 convolution kernel for acquiring the input image features, adjusting their channel number, and outputting the channel-adjusted image features to each layer of convolution kernels.
4. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1 or 2, wherein the multi-receptive-field fusion module further comprises a succeeding 1×1 convolution kernel for acquiring the receptive-field features output by each layer of convolution kernels and restoring them to the original channel number.
5. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein, among the plurality of feature extraction network branches of the multi-resolution feature fusion network model,
the first-layer feature extraction network branch acquires the proposal box of each human body instance and extracts and maintains image features of a preset resolution from the proposal box; and
each subsequent feature extraction network branch down-samples the image features maintained by the previous branch; after multi-receptive-field fusion, the generated image features are restored to the preset resolution by up-sampling and fused with the image features maintained by the first-layer feature extraction network branch.
6. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 5, wherein the first-layer feature extraction network branch comprises a convolution module, and each other feature extraction network branch comprises a convolution module and a down-sampling module;
the convolution module consists of a convolution layer, a BN layer, and an activation function; the down-sampling module consists of a strided convolution layer, a BN layer, and an activation function.
7. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein the training process of the multi-resolution feature fusion network model is as follows:
acquiring training image samples;
inputting the training image samples into a human body detection network model to obtain a proposal box for each human body instance in the image;
cropping and scaling the proposal box of each human body instance, and inputting the cropped and scaled proposal box into the multi-resolution feature fusion network model for single-person pose estimation; and
calculating a target loss function from the single-person pose estimation predictions and the ground-truth labels of the training image samples, and iteratively training the multi-resolution feature fusion network model based on the target loss function to obtain the trained multi-resolution feature fusion network model.
8. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 7, wherein the target loss function is a mean squared error (MSE) function.
9. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein the human body detection network model is a Faster R-CNN network model.
10. The human body posture estimation method based on a multi-resolution feature fusion network according to claim 1, wherein training and testing of the multi-resolution feature fusion network model are performed under the deep-learning PyTorch framework.
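As an illustration of the convolution and down-sampling modules named in claims 5 and 6, a hedged PyTorch sketch follows; the 3×3 kernel size and ReLU activation are assumptions, while the conv + BN + activation composition and the strided convolution for down-sampling follow the claims:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # convolution module: convolution layer + BN layer + activation function
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def down_block(cin, cout):
    # down-sampling module: strided convolution layer halves the resolution,
    # followed by a BN layer and an activation function
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

x = torch.rand(1, 32, 64, 48)
kept = conv_block(32, 32)(x)      # resolution maintained (claim 5, first branch)
down = down_block(32, 64)(x)      # lower-resolution branch input
assert tuple(kept.shape) == (1, 32, 64, 48)
assert tuple(down.shape) == (1, 64, 32, 24)
```

Restoring the down-sampled features to the preset resolution for fusion, as in claim 5, could then use a standard up-sampling operation such as `torch.nn.functional.interpolate`.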
CN202210262826.XA 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network Pending CN114677707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210262826.XA CN114677707A (en) 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210262826.XA CN114677707A (en) 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network

Publications (1)

Publication Number Publication Date
CN114677707A true CN114677707A (en) 2022-06-28

Family

ID=82073424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210262826.XA Pending CN114677707A (en) 2022-03-17 2022-03-17 Human body posture estimation method based on multi-resolution feature fusion network

Country Status (1)

Country Link
CN (1) CN114677707A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524546A (en) * 2023-07-04 2023-08-01 南京邮电大学 Low-resolution human body posture estimation method based on heterogeneous image cooperative enhancement
CN116524546B (en) * 2023-07-04 2023-09-01 南京邮电大学 Low-resolution human body posture estimation method based on heterogeneous image cooperative enhancement
CN117853432A (en) * 2023-12-26 2024-04-09 北京长木谷医疗科技股份有限公司 Hybrid model-based osteoarthropathy identification method and device
CN117764988A (en) * 2024-02-22 2024-03-26 山东省计算中心(国家超级计算济南中心) Road crack detection method and system based on heteronuclear convolution multi-receptive field network
CN117764988B (en) * 2024-02-22 2024-04-30 山东省计算中心(国家超级计算济南中心) Road crack detection method and system based on heteronuclear convolution multi-receptive field network

Similar Documents

Publication Publication Date Title
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN114677707A (en) Human body posture estimation method based on multi-resolution feature fusion network
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN109919223B (en) Target detection method and device based on deep neural network
CN113313082B (en) Target detection method and system based on multitask loss function
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN114117614A (en) Method and system for automatically generating building facade texture
CN110781980A (en) Training method of target detection model, target detection method and device
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
CN113516126A (en) Adaptive threshold scene text detection method based on attention feature fusion
CN112784756A (en) Human body identification tracking method
CN114187454A (en) Novel significance target detection method based on lightweight network
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
Kavitha et al. Convolutional Neural Networks Based Video Reconstruction and Computation in Digital Twins.
CN111582057B (en) Face verification method based on local receptive field
CN116805360B (en) Obvious target detection method based on double-flow gating progressive optimization network
CN114049541A (en) Visual scene recognition method based on structural information characteristic decoupling and knowledge migration
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN111767919A (en) Target detection method for multi-layer bidirectional feature extraction and fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination