CN113221626A - Human body posture estimation method based on Non-local high-resolution network - Google Patents

Human body posture estimation method based on Non-local high-resolution network

Info

Publication number
CN113221626A
Authority
CN
China
Prior art keywords
convolution
resolution
channels
network
convolution kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110241318.9A
Other languages
Chinese (zh)
Other versions
CN113221626B (en)
Inventor
何宁 (He Ning)
孙琪翔 (Sun Qixiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202110241318.9A priority Critical patent/CN113221626B/en
Publication of CN113221626A publication Critical patent/CN113221626A/en
Application granted granted Critical
Publication of CN113221626B publication Critical patent/CN113221626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Abstract

The invention discloses a human body posture estimation method based on a Non-local high-resolution network, in which a Non-local network module is newly designed. Its residual connection allows a new Non-local network module to be inserted into any network without damaging the original network structure, so the algorithm can reuse the initial pre-trained weights of the high-resolution network. The non-local operation preserves the size of its input: the non-local network module can change the input values, but the parameters are the same at the input and output of the non-local network. In the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, and the number of channels is set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features, yielding better experimental results.

Description

Human body posture estimation method based on Non-local high-resolution network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a human body posture estimation method based on a non-local high-resolution network, which can be applied to human body posture estimation and human body skeleton extraction.
Background
Human body pose estimation (Human Pose Estimation) is the process of detecting the key parts or main joints of a human body in a given image or video and finally outputting all or part of the body's limb parameters (the relative positions of the joint points), such as the human body outline, the position and orientation of the head, and the positions and types of the body's joints. Human behavior recognition based on computer vision technology is widely applied in many fields of human life, such as video surveillance, motion retrieval, human-computer interaction, smart homes and medical care. Human pose estimation is a fundamental problem in computer vision.
Approaches to pose estimation fall into two main categories: top-down methods and bottom-up methods. The top-down approach relies on a detector to detect human instances, produces one bounding box per person, and then reduces the problem to single-person pose estimation. Fang et al. propose the Regional Multi-Person Pose Estimation (RMPE) framework to mitigate the impact of erroneous detections on single-person pose estimation. Chen Y et al. propose the Cascaded Pyramid Network (CPN) method, which builds on a Feature Pyramid Network and uses a RefineNet to further refine the keypoints that are hard to predict. Similarly, Huang et al. propose a Coarse-Fine network that uses multi-scale supervision and fuses coarse and fine features to obtain the final prediction. Xiao et al. propose a very simple yet very effective network for multi-person pose estimation and tracking. When human pose estimation is performed top-down, detection boxes of different sizes are normalized to one scale for learning, so the method is insensitive to scale and predicts small-scale people relatively easily; however, it places high demands on the object detector, and detection errors are difficult to recover from. In addition, the amount of computation is proportional to the number of people in the picture: the more people, the larger the computation.
The bottom-up approach first finds all keypoints in the picture, such as all heads, left hands and knees, and then assembles these keypoints into individuals. After the introduction of deep learning, Pishchulin et al. proposed DeepCut, which models the assignment problem as an Integer Linear Programming (ILP) problem on a fully connected graph; much subsequent work centered on solving this ILP problem, and Iqbal et al. proposed solving the ILP problem locally, though with limited success. Insafutdinov et al. proposed DeeperCut, which estimates keypoints with a deeper ResNet and improves inference efficiency using image-dependent pairwise terms and an incremental optimization strategy. Levinkov et al. proposed two local search algorithms that monotonically converge to a local optimum, providing a feasible solution to the ILP problem. Varadarajan et al. proposed a greedy assignment algorithm that exploits the intrinsic structural information of the human body to reduce the complexity of the ILP problem. OpenPose, proposed by Cao Z et al., models the relationships between keypoints as Part Affinity Fields (PAFs) to assign keypoints better and faster; it greatly improved accuracy and speed over previous methods and for the first time brought multi-person pose estimation to real-time speed. Later work learns keypoint locations and keypoint assignment at the same time to avoid complex post-processing. Newell A et al. proposed Associative Embedding, which predicts a tag heatmap alongside the keypoints to help group and assign them. Nie et al. proposed a Pose Partition Network that uses global information from the embedding space to make local judgments, improving inference accuracy and efficiency. PersonLab, proposed by Papandreou et al., learns keypoints and their relative displacements at the same time, directly grouping the keypoints of the same person. The bottom-up method is not affected by the errors of a separate detection task, and its computation is independent of the number of people in the image, which greatly improves efficiency; however, it is sensitive to scale, and small-scale people are difficult to predict.
Deep learning algorithms are based on convolutional neural networks, and the convolution operations currently applied to images are local spatial operations. For example, in the commonly used 3×3 convolution, the filter covers 9 pixels, and the value of the target pixel is computed with reference only to itself and the surrounding 8 pixels. This means convolution can only use local information to compute the target pixel, which introduces bias because global information is not visible. To alleviate this problem, larger convolution filters or deeper networks with more convolution layers are typically used. Although this brings improvements, it also leads to computational inefficiency and optimization difficulties. To address these problems, this invention first designs a non-local neural network module, then fuses this module with the high-resolution network, the current best backbone for human pose estimation; finally, 3 NLHR network structures are designed, and the validity of the corresponding algorithms is verified on the MPII (MPII Human Pose dataset) and COCO human keypoint datasets.
Disclosure of Invention
Aiming at the problems and shortcomings of traditional convolutional neural networks, an NLHR network structure using a Non-local network module is proposed, and a novel network structure, the NLHR (Non-local High-Resolution) network, is designed; this network structure greatly improves the accuracy of human body posture estimation while keeping the number of parameters relatively small. A human body posture estimation method based on a Non-local high-resolution network comprises the following steps:
Step 1: acquire an image; a local image is read directly through a function, and noise is removed from the RGB image.
Step 2: detect the human body using the YOLOv3 network to obtain the human bounding box (bbox).
Step 3: expand the height or width of the human detection box to the fixed aspect ratio height : width = 4:3, then crop the detection box out of the image and resize it to a fixed size of 256×192.
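For illustration, the preprocessing of step 3 can be sketched in Python as follows; the use of OpenCV and the rounding and clipping details are assumptions not specified in the patent.

import cv2

def expand_to_aspect(x, y, w, h, target_ratio=4.0 / 3.0):
    # Grow height or width about the box center until height:width = 4:3.
    cx, cy = x + w / 2.0, y + h / 2.0
    if h / w < target_ratio:      # box too wide: grow the height
        h = w * target_ratio
    else:                         # box too tall: grow the width
        w = h / target_ratio
    return cx - w / 2.0, cy - h / 2.0, w, h

def crop_person(image, bbox, out_size=(192, 256)):  # (width, height) for cv2
    # Crop the expanded detection box out of the image and resize to 256x192.
    x, y, w, h = expand_to_aspect(*bbox)
    x0, y0 = max(int(round(x)), 0), max(int(round(y)), 0)
    x1 = min(int(round(x + w)), image.shape[1])
    y1 = min(int(round(y + h)), image.shape[0])
    return cv2.resize(image[y0:y1, x0:x1], out_size)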
Step 4: extract the human skeleton. The crop from step 3 is fed into the NLHR network, and the statement python tools/train.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml is run to extract the human skeleton; there are 17 keypoints. The main contribution of the invention is the design of the novel NLHR network, whose parameters and flow are as follows:
the initial part of the network is a convolution with two layers of step size 2 and a convolution kernel of 3 x 3, so that the resolution is reduced to 1/4 at the input, the number of channels becomes 64, followed by the body structure of the network, which contains 4 stages, 4 parallel convolution branches. The resolutions are 1/4, 1/8, 1/16, 1/32, respectively.
Stage 1 comprises 4 Bottleneck residual units. The 1st residual unit increases the number of channels from 64 to 256 through three convolution layers and is followed by 3 Bottleneck residual units with 256 channels. The features then enter the transition1 module, which splits into two branches: one passes through a convolution with stride 1 and a 3×3 kernel, the resolution remaining 1/4 of the input and the number of channels becoming 32, denoted x0; the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 of the input and the number of channels becoming 64, denoted x1.
In stage 2, x0 passes through 4 consecutive BasicBlocks with 32 input channels, and x1 passes through 4 consecutive BasicBlocks with 64 input channels. Then comes a fusion stage (a sketch of this fusion is given after the stage-4 description below). x0 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64. x1 likewise splits into two branches: one remains unchanged, and the other passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The branches with the same number of channels are then merged into a new x0 and x1. Next the features enter the transition2 module, in which x0 remains unchanged and x1 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, denoted x2.
In stage 3, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, and x2 through 4 consecutive BasicBlocks with 128 input channels. Then comes a fusion stage. x0 splits into 3 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x1 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x2 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged. The branches with the same number of channels are then fused into a new x0, x1 and x2. Next comes the transition3 stage, in which x0 and x1 remain unchanged and x2 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256, denoted x3.
In stage 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, x2 through 4 consecutive BasicBlocks with 128 input channels, and x3 through 4 consecutive BasicBlocks with 256 input channels followed by the non-local network module, whose input and output sizes are kept unchanged. Then comes a fusion stage. x0 splits into 4 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x1 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x2 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x3 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/16 and the number of channels becoming 128; the 4th remains unchanged. The branches with the same number of channels are then fused into a new x0, x1, x2 and x3. Next, x0 remains unchanged; x1 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x2 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x3 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The 4 branches are then fused with each other.
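The fusion pattern that recurs in stages 2-4 can be illustrated with a minimal PyTorch sketch of the two-branch case from stage 2; the module name and the nearest-neighbor upsampling after the 1×1 convolution are assumptions, not the patent's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFuse(nn.Module):
    # Exchanges information between the 1/4-resolution (32-channel) and
    # 1/8-resolution (64-channel) branches and sums maps of the same scale.
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        self.down = nn.Conv2d(c_high, c_low, 3, stride=2, padding=1)  # 1/4 -> 1/8
        self.up = nn.Conv2d(c_low, c_high, 1)                         # 1/8 -> 1/4

    def forward(self, x0, x1):  # x0: (N, 32, H, W), x1: (N, 64, H/2, W/2)
        new_x0 = x0 + F.interpolate(self.up(x1), scale_factor=2, mode="nearest")
        new_x1 = x1 + self.down(x0)
        return new_x0, new_x1

# x0, x1 = TwoBranchFuse()(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))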
Next comes the last layer: the fused features pass through a convolution with stride 1 and a 1×1 kernel, and the number of output channels corresponds to the number of keypoints of the dataset. The corresponding network structure parameters can be found in Table 1, where the square brackets give the module structure, the number multiplying the brackets is the number of residual units, and the last number is the number of modules.
TABLE 1 High-resolution network structure parameters
(The table is provided as an image in the original publication.)
A non-local network module:
Essentially, the high-resolution network is a standard approach for computer vision tasks, like the conventional CNN (Convolutional Neural Network). Convolutional neural networks are limited: for example, in a 5×5 convolution the filter covers 25 pixels, and the value of the target pixel is computed with reference only to itself and the surrounding 24 pixels. This means convolution can only use local information to compute the target pixel, which can cause errors because global information is not visible. There are of course many ways to alleviate this problem, such as larger convolution filters or deeper networks with more convolution layers. However, these methods considerably increase the amount of computation while improving the results only to a limited extent. To address this problem, the algorithm introduces the concept of the non-local mean.
The non-local mean is a classic filtering algorithm proposed by Buades et al. It is essentially an image denoising technique that makes full use of the redundant information in an image and preserves the image's details as much as possible while denoising. Its core idea is that the estimate of the current pixel is obtained by a weighted average of the pixels in the image whose neighborhoods have a similar structure. When computing the output at each pixel position, the correlation is computed with all positions in the image rather than only a neighborhood, and this correlation is then used as a weight representing the similarity between the other positions and the position currently being computed. The non-local mean is defined as follows: given a discrete noisy image v = {v(i) | i ∈ I}, for a pixel i the estimate NL[v](i) is computed as a weighted average of all pixels in the image, as shown in equation (1):
NL[v](i) = Σ_{j∈I} w(i,j) v(j)    (1)
where the family of weights {w(i,j)}_j depends on the similarity between pixels i and j and satisfies 0 ≤ w(i,j) ≤ 1 and Σ_j w(i,j) = 1. The similarity between two pixels i and j depends on the gray-level intensity vectors v(N_i) and v(N_j), where N_k denotes a square neighborhood of fixed size centered at pixel k. The similarity is measured as a decreasing function of the weighted Euclidean distance shown in equation (2):
‖v(N_i) − v(N_j)‖²_{2,a}    (2)

where a > 0 is the standard deviation of the Gaussian kernel.
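As a concrete illustration of equations (1) and (2), the following NumPy sketch estimates one pixel by non-local means; the patch size and the filtering parameter h are assumptions, and the Gaussian weighting of the patch distance (the parameter a) is omitted for brevity.

import numpy as np

def nl_means_pixel(v, i, patch=3, h=10.0):
    # Estimate pixel i = (r0, c0) of the noisy image v as a weighted average of
    # all pixels, weighting by the similarity of their square neighborhoods N_k.
    half = patch // 2
    pad = np.pad(v.astype(np.float64), half, mode="reflect")
    neighborhood = lambda r, c: pad[r:r + patch, c:c + patch]
    r0, c0 = i
    ref = neighborhood(r0, c0)
    weights = np.zeros_like(v, dtype=np.float64)
    for r in range(v.shape[0]):
        for c in range(v.shape[1]):
            d2 = np.sum((ref - neighborhood(r, c)) ** 2)  # squared patch distance
            weights[r, c] = np.exp(-d2 / (h * h))         # decreasing function of d2
    weights /= weights.sum()                              # so that sum_j w(i, j) = 1
    return np.sum(weights * v)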
To combine the concept of the non-local mean with deep learning, the important statistical and machine-learning concepts of variance and covariance are introduced. Both are defined for random variables: variance describes the deviation of a single random variable from its mean, while covariance describes the similarity between two random variables. If the distributions of two random variables are similar, their covariance is large; otherwise, their covariance is small. If each pixel in a feature map is treated as a random variable and the pairwise covariance between all pixels is calculated, the value of each predicted pixel can be enhanced or suppressed according to its similarity to the other pixels in the image.
To give each pixel a global reference, Wang et al., combining the above characteristics of the non-local mean, proposed a generic, simple non-local network module that can be embedded directly into an existing network and can capture long-range dependencies in the image. The specific formula of the non-local network module is given first:
y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)    (3)
where i is the index of an output position (in space, time or space-time) whose response is to be computed, and j is an index that enumerates all possible positions. x is the input signal (image, sequence or video; usually their features) and y is the output signal of the same size as x. The pairwise function f computes the correlation coefficient between i and all j (representing a relationship such as the degree of similarity). The unary function g computes a representation of the input signal at position j, and the response is normalized by the factor C(x). In contrast to a convolution operation, a non-local operation takes all positions into account (∀j), whereas a convolution accumulates weighted inputs only over a local neighborhood.
The function g uses a linear embedding g(x_j) = W_g x_j, where W_g is a learnable weight matrix, implemented in the experiments as a 1×1 convolution layer. The structure of the non-local network module is shown in the non-local network block diagram (FIG. 2).
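A minimal PyTorch sketch of such a non-local module follows, using the embedded-Gaussian form of f, with θ, φ and g implemented as 1×1 convolutions and with the output projection W_z zero-initialized (see equation (4) below) so that the block initially acts as an identity; the channel sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels=256, inner=128):
        super().__init__()
        self.theta = nn.Conv2d(channels, inner, 1)  # embedding of x_i
        self.phi = nn.Conv2d(channels, inner, 1)    # embedding of x_j
        self.g = nn.Conv2d(channels, inner, 1)      # g(x_j) = W_g x_j
        self.w_z = nn.Conv2d(inner, channels, 1)    # output projection W_z
        nn.init.zeros_(self.w_z.weight)             # W_z = 0 => z_i = x_i at start
        nn.init.zeros_(self.w_z.bias)

    def forward(self, x):                             # x: (N, C, H, W)
        n, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, inner)
        k = self.phi(x).flatten(2)                    # (N, inner, HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, inner)
        attn = F.softmax(q @ k, dim=-1)               # f(x_i, x_j) / C(x)
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return self.w_z(y) + x                        # z_i = W_z y_i + x_i

The input and output sizes are identical, which matches the requirement that the module can be inserted without changing the surrounding network.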
In the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, and the number of channels is set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features and yields better experimental results. The specific structure is shown in the NLHRv3 network diagram.
Step 4 also calculates the mean square error between the keypoint heatmaps obtained through the network and the ground-truth heatmaps, where P denotes the number of keypoints, l denotes the keypoint position information, y'_p denotes the predicted heatmap of keypoint p, and y_p denotes the ground-truth heatmap, as shown in equation (5):
L = (1/P) Σ_{p=1}^{P} ‖y'_p − y_p‖²    (5)
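A short PyTorch sketch of this loss, assuming heatmaps of shape (N, P, H, W) and a mean reduction over keypoints and pixels:

import torch

def keypoint_mse(pred, target):
    # pred, target: (N, P, H, W) predicted and ground-truth keypoint heatmaps.
    return ((pred - target) ** 2).mean()

# loss = keypoint_mse(torch.rand(2, 17, 64, 48), torch.rand(2, 17, 64, 48))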
Step 5: generate a json file containing the corresponding heatmaps and keypoint information.
Step 6: run the statement: python visualization/plot_coco.py --prediction output/coco/pose_hrnet/w32_256x192_adam_lr1e-3/results/keypoints_val2017_results_0.json --save-path visualization/results
Step 7: obtain the human skeleton diagram and display it overlaid on the original image as the result diagram.
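For illustration, the overlay of step 7 can be sketched with OpenCV as follows; the helper and the limb pairs (a subset of the COCO 17-keypoint skeleton) are assumptions, not the patent's code.

import cv2

# A few example limb pairs over the 17 COCO keypoints; the full skeleton has more edges.
LIMBS = [(5, 7), (7, 9), (6, 8), (8, 10), (5, 6),
         (11, 13), (13, 15), (12, 14), (14, 16)]

def draw_skeleton(image, keypoints):
    # keypoints: list of 17 (x, y) pixel coordinates predicted by the network.
    out = image.copy()
    for a, b in LIMBS:
        cv2.line(out, tuple(map(int, keypoints[a])),
                 tuple(map(int, keypoints[b])), (0, 255, 0), 2)
    for x, y in keypoints:
        cv2.circle(out, (int(x), int(y)), 3, (0, 0, 255), -1)
    return out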
Compared with the prior art, the invention has the following advantages:
1. The invention designs a non-local network module, which is specifically defined by equation (4):
z_i = W_z y_i + x_i    (4)
where y_i is given by equation (3) and "+ x_i" denotes a residual connection [9]. When W_z is initialized to 0, the residual connection allows any new non-local network module to be inserted into any network without destroying the original network structure; that is, the present algorithm can use the initial pre-trained weights of the high-resolution network. It follows from equation (3) that the non-local operation preserves the size of its input: the non-local network module can change the input values, but the parameters are the same at the input and output of the non-local network.
2. The invention designs and verifies the NLHR network: in the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, with the number of channels set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features and yields better experimental results. The specific structure is shown in the NLHR network diagram.
Drawings
FIG. 1 is a flow chart of the present invention
FIG. 2 is a non-local network block diagram
FIG. 3 is a simplified diagram of the non-local network module
FIG. 4 is an NLHR network structure diagram
FIG. 5 is a diagram of human body posture estimation results of the present invention
Detailed Description
Aiming at the problems and shortcomings of traditional convolutional neural networks, an NLHR network structure using a Non-local network module is proposed, and a novel network structure, the NLHR (Non-local High-Resolution) network, is designed; this network structure greatly improves the accuracy of human body posture estimation while keeping the number of parameters relatively small. A human body posture estimation method based on a Non-local high-resolution network comprises the following steps:
Step 1: acquire an image; a local image is read directly through a function, and noise is removed from the RGB image.
Step 2: detect the human body using the YOLOv3 network to obtain the human bounding box (bbox).
Step 3: expand the height or width of the human detection box to the fixed aspect ratio height : width = 4:3, then crop the detection box out of the image and resize it to a fixed size of 256×192.
Step 4: extract the human skeleton. The crop from step 3 is fed into the NLHR network, and the statement python tools/train.py --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml is run to extract the human skeleton; there are 17 keypoints. The main contribution of the invention is the design of the novel NLHR network, whose parameters and flow are as follows:
the initial part of the network is a convolution with two layers of step size 2 and a convolution kernel of 3 x 3, so that the resolution is reduced to 1/4 at the input, the number of channels becomes 64, followed by the body structure of the network, which contains 4 stages, 4 parallel convolution branches. The resolutions are 1/4, 1/8, 1/16, 1/32, respectively.
Stage 1 comprises 4 Bottleneck residual units. The 1st residual unit increases the number of channels from 64 to 256 through three convolution layers and is followed by 3 Bottleneck residual units with 256 channels. The features then enter the transition1 module, which splits into two branches: one passes through a convolution with stride 1 and a 3×3 kernel, the resolution remaining 1/4 of the input and the number of channels becoming 32, denoted x0; the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 of the input and the number of channels becoming 64, denoted x1.
In stage 2, x0 passes through 4 consecutive BasicBlocks with 32 input channels, and x1 passes through 4 consecutive BasicBlocks with 64 input channels. Then comes a fusion stage. x0 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64. x1 likewise splits into two branches: one remains unchanged, and the other passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The branches with the same number of channels are then merged into a new x0 and x1. Next the features enter the transition2 module, in which x0 remains unchanged and x1 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, denoted x2.
In stage 3, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, and x2 through 4 consecutive BasicBlocks with 128 input channels. Then comes a fusion stage. x0 splits into 3 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x1 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128. x2 splits into 3 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged. The branches with the same number of channels are then fused into a new x0, x1 and x2. Next comes the transition3 stage, in which x0 and x1 remain unchanged and x2 splits into two branches: one remains unchanged, and the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256, denoted x3.
In stage 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, x2 through 4 consecutive BasicBlocks with 128 input channels, and x3 through 4 consecutive BasicBlocks with 256 input channels followed by the non-local network module, whose input and output sizes are kept unchanged. Then comes a fusion stage. x0 splits into 4 branches: the 1st remains unchanged; the 2nd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x1 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd remains unchanged; the 3rd passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x2 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd remains unchanged; the 4th passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256. x3 splits into 4 branches: the 1st passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 2nd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64; the 3rd passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/16 and the number of channels becoming 128; the 4th remains unchanged. The branches with the same number of channels are then fused into a new x0, x1, x2 and x3. Next, x0 remains unchanged; x1 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x2 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; x3 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32. The 4 branches are then fused with each other.
Next comes the last layer: the fused features pass through a convolution with stride 1 and a 1×1 kernel, and the number of output channels corresponds to the number of keypoints of the dataset. The corresponding network structure parameters can be found in Table 1, where the square brackets give the module structure, the number multiplying the brackets is the number of residual units, and the last number is the number of modules.
TABLE 1 High-resolution network structure parameters
(The table is provided as an image in the original publication.)
A non-local network module:
Essentially, the high-resolution network is a standard approach for computer vision tasks, like the conventional CNN (Convolutional Neural Network). Convolutional neural networks are limited: for example, in a 5×5 convolution the filter covers 25 pixels, and the value of the target pixel is computed with reference only to itself and the surrounding 24 pixels. This means convolution can only use local information to compute the target pixel, which can cause errors because global information is not visible. There are of course many ways to alleviate this problem, such as larger convolution filters or deeper networks with more convolution layers. However, these methods considerably increase the amount of computation while improving the results only to a limited extent. To address this problem, the algorithm introduces the concept of the non-local mean.
The non-local mean is a classic filtering algorithm proposed by Buades et al. It is essentially an image denoising technique that makes full use of the redundant information in an image and preserves the image's details as much as possible while denoising. Its core idea is that the estimate of the current pixel is obtained by a weighted average of the pixels in the image whose neighborhoods have a similar structure. When computing the output at each pixel position, the correlation is computed with all positions in the image rather than only a neighborhood, and this correlation is then used as a weight representing the similarity between the other positions and the position currently being computed. The non-local mean is defined as follows: given a discrete noisy image v = {v(i) | i ∈ I}, for a pixel i the estimate NL[v](i) is computed as a weighted average of all pixels in the image, as shown in equation (1):
NL[v](i) = Σ_{j∈I} w(i,j) v(j)    (1)
where the family of weights {w(i,j)}_j depends on the similarity between pixels i and j and satisfies 0 ≤ w(i,j) ≤ 1 and Σ_j w(i,j) = 1. The similarity between two pixels i and j depends on the gray-level intensity vectors v(N_i) and v(N_j), where N_k denotes a square neighborhood of fixed size centered at pixel k. The similarity is measured as a decreasing function of the weighted Euclidean distance shown in equation (2):
‖v(N_i) − v(N_j)‖²_{2,a}    (2)

where a > 0 is the standard deviation of the Gaussian kernel.
To combine the concept of the non-local mean with deep learning, the important statistical and machine-learning concepts of variance and covariance are introduced. Both are defined for random variables: variance describes the deviation of a single random variable from its mean, while covariance describes the similarity between two random variables. If the distributions of two random variables are similar, their covariance is large; otherwise, their covariance is small. If each pixel in a feature map is treated as a random variable and the pairwise covariance between all pixels is calculated, the value of each predicted pixel can be enhanced or suppressed according to its similarity to the other pixels in the image.
To give each pixel a global reference, Wang et al., combining the above characteristics of the non-local mean, proposed a generic, simple non-local network module that can be embedded directly into an existing network and can capture long-range dependencies in the image. The specific formula of the non-local network module is given first:
y_i = (1/C(x)) Σ_{∀j} f(x_i, x_j) g(x_j)    (3)
where i is the index of an output position (in space, time or space-time) whose response is to be computed, and j is an index that enumerates all possible positions. x is the input signal (image, sequence or video; usually their features) and y is the output signal of the same size as x. The pairwise function f computes the correlation coefficient between i and all j (representing a relationship such as the degree of similarity). The unary function g computes a representation of the input signal at position j, and the response is normalized by the factor C(x). In contrast to a convolution operation, a non-local operation takes all positions into account (∀j), whereas a convolution accumulates weighted inputs only over a local neighborhood.
The function g uses a linear embedding g(x_j) = W_g x_j, where W_g is a learnable weight matrix, implemented in the experiments as a 1×1 convolution layer. The structure of the non-local network module is shown in the non-local network block diagram (FIG. 2).
In the 4th stage of the high-resolution network, a non-local module is added to the branch with the smallest resolution, and the number of channels is set to 256. The non-local module is added at this stage because the smallest-resolution features are obtained here; the smaller resolution carries high-level features with strong semantic information, and adding the non-local module to the small-resolution branch highlights the main features and yields better experimental results. The specific structure is shown in the NLHRv3 network diagram.
Step 4 also calculates the mean square error between the keypoint heatmaps obtained through the network and the ground-truth heatmaps, where P denotes the number of keypoints, l denotes the keypoint position information, y'_p denotes the predicted heatmap of keypoint p, and y_p denotes the ground-truth heatmap, as shown in equation (5):
L = (1/P) Σ_{p=1}^{P} ‖y'_p − y_p‖²    (5)
Step 5: generate a json file containing the corresponding heatmaps and keypoint information.
Step 6: run the statement: python visualization/plot_coco.py --prediction output/coco/pose_hrnet/w32_256x192_adam_lr1e-3/results/keypoints_val2017_results_0.json --save-path visualization/results
Step 7: obtain the human skeleton diagram and display it overlaid on the original image as the result diagram.
The effectiveness is further verified experimentally:
here, experiments were first performed on the MPII dataset with 3 versions of NLHR networks, respectively, and then the optimal version of the network was verified again on the COCO dataset.
PCKh evaluation criterion [17]. Detection accuracy is obtained by giving an explicit boundary definition for each person in the test image. Given a candidate region in a bounding box (h, w) containing the original keypoint coordinates, the relevant threshold is controlled to obtain different accuracies when judging whether a predicted keypoint is reasonably located; here the threshold r = 0.5 is selected. PCKh uses the size of the head box, rather than the size of the torso, as the scale that normalizes the distances of the other parts, the distance being the Euclidean distance. If the Euclidean distance between a detected keypoint and the labeled keypoint is within the threshold range, the detection result is correct. Taking the k-th human keypoint as an example, PCKh is calculated as follows:
PCKh(k) = (1/N) Σ_{i=1}^{N} δ(‖p_i^k − g_i^k‖ ≤ r · s_h^i)

where PCKh(k) is the PCKh value of the k-th keypoint, and the final result is the average of these values over all keypoints; g_i^k is the labeled position of the k-th class of human keypoint in the i-th picture, p_i^k is the corresponding predicted position, N is the total number of samples, and s_h is the size of the head box, combined with the distance normalization coefficient r as the judgment condition. The smaller the threshold, the stricter the evaluation criterion; PCKh@0.5 means that with r = 0.5 the distance threshold 0.5·s_h is used when comparing the distance between the true value and the predicted value.
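A small NumPy sketch of this computation, assuming arrays of predicted and labeled coordinates and per-image head-box sizes:

import numpy as np

def pckh(pred, gt, head_size, r=0.5):
    # pred, gt: (N, K, 2) predicted and labeled keypoint coordinates;
    # head_size: (N,) head-box sizes s_h. Returns per-keypoint PCKh values (K,).
    dist = np.linalg.norm(pred - gt, axis=-1)   # (N, K) Euclidean distances
    correct = dist <= r * head_size[:, None]    # within the threshold r * s_h
    return correct.mean(axis=0)

# PCKh@0.5 averaged over all keypoints:
# score = pckh(pred, gt, head_size, r=0.5).mean()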
OKS evaluation criterion [18]. The COCO evaluation index is based on OKS; AP normalizes over different keypoint types and human sizes and is the average keypoint similarity between keypoints, lying in [0, 1]: the closer the prediction is to the original value, the closer it tends to 1, and otherwise it tends to 0. OKS is defined as in equation (6):
OKS = Σ_i [exp(−d_i² / (2 s² k_i²)) · δ(v_i > 0)] / Σ_i δ(v_i > 0)    (6)

where d_i is the Euclidean distance between the ground-truth coordinate θ^(p) and the predicted coordinate of the i-th keypoint, s² is the area occupied by the human body in the image, k_i is a normalization factor, and δ(v_i > 0) indicates that the visibility of the keypoint is greater than 0. The AP of human posture estimation is the average accuracy, calculated as in equation (7):
AP = Σ_p δ(OKS_p > t) / Σ_p 1    (7)

where t takes the OKS thresholds (0.50, 0.55, …, 0.90, 0.95); the prediction accuracy is calculated from the OKS of every person in all pictures of the test set and averaged over these thresholds.
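The following NumPy sketch illustrates equations (6) and (7); treating the person area as s² and the handling of people without visible keypoints are assumptions:

import numpy as np

def oks(pred, gt, vis, area, k):
    # pred, gt: (K, 2) coordinates; vis: (K,) visibility flags; area: the person
    # area s^2; k: (K,) per-keypoint normalization factors k_i.
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    e = np.exp(-d2 / (2.0 * area * k ** 2))
    mask = vis > 0
    return e[mask].sum() / max(mask.sum(), 1)

def average_precision(oks_values, thresholds=np.arange(0.50, 1.00, 0.05)):
    # Fraction of predictions with OKS > t, averaged over t = 0.50, 0.55, ..., 0.95.
    oks_values = np.asarray(oks_values)
    return np.mean([(oks_values > t).mean() for t in thresholds])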
Experimental results
Table 1 Experimental results on the MPII validation set (provided as an image in the original publication)
Table 2 Experimental results on the COCO validation set (provided as an image in the original publication)
Table 3 Parameters, GFLOPs and experimental results of human posture estimation networks (provided as an image in the original publication)

Claims (7)

1. A human body posture estimation method based on a Non-local high-resolution network, characterized by comprising the following steps:
step 1, acquiring an image: a local image is read directly through a function, and noise is removed from the RGB image;
step 2, detecting the human body by using the YOLOv3 network to obtain the human body bounding box bbox;
step 3, expanding the height or width of the human body detection frame to the fixed aspect ratio height : width = 4:3, then cutting the human body detection frame out of the image and resizing it to a fixed size of 256×192;
step 4, extracting the human skeleton: the crop from step 3 is fed into the NLHR network and a statement is run to extract the human skeleton; there are 17 keypoints; the NLHR network parameters and flow are as follows:
the initial part of the network consists of two layers of convolution with stride 2 and 3×3 kernels, so that the resolution is reduced to 1/4 of the input and the number of channels becomes 64; then comes the main structure of the network, which comprises 4 stages with 4 parallel convolution branches at resolutions 1/4, 1/8, 1/16 and 1/32;
step 4, calculating the mean square error between the keypoint heatmaps obtained through the network and the ground-truth heatmaps;
step 5, generating a json file containing the corresponding heat map and the key point information;
and step 6, running the statement to obtain the human skeleton diagram and displaying it overlaid on the original image.
2. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in the NLHR network of step 4, the 1st stage comprises 4 Bottleneck residual units; the 1st residual unit increases the number of channels from 64 to 256 through three convolution layers and is followed by 3 Bottleneck residual units with 256 channels; the data then enter a transition1 module, which splits into two branches: one passes through a convolution with stride 1 and a 3×3 kernel, the resolution remains 1/4 of the input, the number of channels becomes 32, and it is denoted x0; the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution is reduced to 1/8 of the input, the number of channels becomes 64, and it is denoted x1.
3. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in stage 2 of the NLHR network of step 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels and x1 through 4 consecutive BasicBlocks with 64 input channels; then comes a fusion stage: x0 splits into two branches, one remaining unchanged while the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64; x1 likewise splits into two branches, one remaining unchanged while the other passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the branches with the same number of channels are then merged into a new x0 and x1; next the data enter a transition2 module, in which x0 remains unchanged and x1 splits into two branches, one remaining unchanged while the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, denoted x2.
4. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in stage 3 of the NLHR network of step 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, and x2 through 4 consecutive BasicBlocks with 128 input channels; then comes a fusion stage: x0 splits into 3 branches, the 1st remaining unchanged, the 2nd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64, and the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; x1 splits into 3 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd remaining unchanged, and the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128; x2 splits into 3 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64, and the 3rd remaining unchanged; the branches with the same number of channels are then fused into a new x0, x1 and x2; next comes the transition3 stage, in which x0 and x1 remain unchanged and x2 splits into two branches, one remaining unchanged while the other passes through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256, denoted x3.
5. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in stage 4 of the NLHR network of step 4, x0 passes through 4 consecutive BasicBlocks with 32 input channels, x1 through 4 consecutive BasicBlocks with 64 input channels, x2 through 4 consecutive BasicBlocks with 128 input channels, and x3 through 4 consecutive BasicBlocks with 256 input channels followed by the non-local network module, whose input and output sizes are kept unchanged; then comes a fusion stage: x0 splits into 4 branches, the 1st remaining unchanged, the 2nd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/8 and the number of channels becoming 64, the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, and the 4th passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256; x1 splits into 4 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd remaining unchanged, the 3rd passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/16 and the number of channels becoming 128, and the 4th passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256; x2 splits into 4 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64, the 3rd remaining unchanged, and the 4th passing through a convolution with stride 2 and a 3×3 kernel, the resolution dropping to 1/32 and the number of channels becoming 256; x3 splits into 4 branches, the 1st passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, the 2nd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/8 and the number of channels becoming 64, the 3rd passing through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/16 and the number of channels becoming 128, and the 4th remaining unchanged; the branches with the same number of channels are then fused into a new x0, x1, x2 and x3; next, x0 remains unchanged, x1 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, x2 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32, and x3 passes through a convolution with stride 1 and a 1×1 kernel, the resolution rising to 1/4 and the number of channels becoming 32; the 4 branches are then fused with each other.
6. The human body posture estimation method based on a Non-local high-resolution network according to claim 1, characterized in that: in the NLHR network of step 4, the last layer passes the fused features through a convolution with stride 1 and a 1×1 kernel and outputs a number of channels corresponding to the number of keypoints of the dataset; the corresponding network structure parameters are given in Table 1, where the square brackets indicate the module structure, the number multiplying the brackets is the number of residual units, and the last number is the number of modules.
7. The human body posture estimation method based on Non-local high-resolution network according to claim 5, characterized in that: the non-local module is added at this stage because the lowest-resolution features are obtained here; these small-resolution features carry strong semantic information, and adding the non-local module to the low-resolution branch highlights the main features.
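The claims leave the internal structure of the non-local module to the description; the sketch below shows the standard embedded-Gaussian non-local block with a residual connection (after Wang et al., 2018), which matches the claimed properties that input and output sizes are equal and that the block can be inserted without disturbing the pretrained network. The 256-channel default follows the abstract; zero-initialising the output projection so the block starts as an identity mapping is an additional assumption, not claim text.

import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    # Embedded-Gaussian non-local block with a residual connection, so the
    # input and output shapes match and the block can be dropped into an
    # existing network. The patent's module may differ in detail.
    def __init__(self, channels=256):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inter, 1)     # key embedding
        self.g = nn.Conv2d(channels, inter, 1)       # value embedding
        self.out = nn.Conv2d(inter, channels, 1)     # output projection
        nn.init.zeros_(self.out.weight)              # start as identity
        nn.init.zeros_(self.out.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinity
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection


# Example: the block keeps the 256-channel, 1/32-resolution map unchanged
# in size, as required by claim 5.
z = NonLocalBlock(256)(torch.randn(1, 256, 8, 6))      # shape (1, 256, 8, 6)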
CN202110241318.9A 2021-03-04 2021-03-04 Human body posture estimation method based on Non-local high-resolution network Active CN113221626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241318.9A CN113221626B (en) 2021-03-04 2021-03-04 Human body posture estimation method based on Non-local high-resolution network


Publications (2)

Publication Number Publication Date
CN113221626A true CN113221626A (en) 2021-08-06
CN113221626B CN113221626B (en) 2023-10-20

Family

ID=77084763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241318.9A Active CN113221626B (en) 2021-03-04 2021-03-04 Human body posture estimation method based on Non-local high-resolution network

Country Status (1)

Country Link
CN (1) CN113221626B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229445A (en) * 2018-02-09 2018-06-29 深圳市唯特视科技有限公司 A kind of more people's Attitude estimation methods based on cascade pyramid network
CN108334847A (en) * 2018-02-06 2018-07-27 哈尔滨工业大学 A kind of face identification method based on deep learning under real scene
CN109523470A (en) * 2018-11-21 2019-03-26 四川长虹电器股份有限公司 A kind of depth image super resolution ratio reconstruction method and system
US20190220993A1 (en) * 2018-01-18 2019-07-18 Samsung Electronics Co., Ltd. Pose estimation method, method of displaying virtual object using estimated pose, and apparatuses performing the same
CN110175575A (en) * 2019-05-29 2019-08-27 南京邮电大学 A kind of single Attitude estimation method based on novel high-resolution network model
CN110175566A (en) * 2019-05-27 2019-08-27 大连理工大学 A kind of hand gestures estimating system and method based on RGBD converged network
CN110930306A (en) * 2019-10-28 2020-03-27 杭州电子科技大学 Depth map super-resolution reconstruction network construction method based on non-local perception
CN110969105A (en) * 2019-11-22 2020-04-07 清华大学深圳国际研究生院 Human body posture estimation method
US10701394B1 (en) * 2016-11-10 2020-06-30 Twitter, Inc. Real-time video super-resolution with spatio-temporal networks and motion compensation
CN111460928A (en) * 2020-03-17 2020-07-28 中国科学院计算技术研究所 Human body action recognition system and method
CN111597976A (en) * 2020-05-14 2020-08-28 杭州相芯科技有限公司 Multi-person three-dimensional attitude estimation method based on RGBD camera
CN112131959A (en) * 2020-08-28 2020-12-25 浙江工业大学 2D human body posture estimation method based on multi-scale feature reinforcement
CN112232134A (en) * 2020-09-18 2021-01-15 杭州电子科技大学 Human body posture estimation method based on hourglass network and attention mechanism
CN112232106A (en) * 2020-08-12 2021-01-15 北京工业大学 Two-dimensional to three-dimensional human body posture estimation method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG Congcong; HE Ning: "Human action recognition method based on key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition) *
WEN Boge: "Research on predicting action boundaries of metro drivers based on a temporal-convolution non-local mean neural network", Railway Locomotive & Motor Car *
LU Hao; SHI Min; LI Hao; ZHU Dengming: "Camera pose estimation method for dynamic scenes based on deep learning", High Technology Letters *

Also Published As

Publication number Publication date
CN113221626B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN109800628B (en) Network structure for enhancing detection performance of SSD small-target pedestrians and detection method
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN111460926B (en) Video pedestrian detection method fusing multi-target tracking clues
CN102682302B (en) Human body posture identification method based on multi-characteristic fusion of key frame
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN105069434B (en) A kind of human action Activity recognition method in video
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN110765906A (en) Pedestrian detection algorithm based on key points
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
US20190108400A1 (en) Actor-deformation-invariant action proposals
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN109697727A (en) Method for tracking target, system and storage medium based on correlation filtering and metric learning
CN111242985B (en) Video multi-pedestrian tracking method based on Markov model
CN114842553A (en) Behavior detection method based on residual shrinkage structure and non-local attention
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN113269038B (en) Multi-scale-based pedestrian detection method
CN113850221A (en) Attitude tracking method based on key point screening
CN110309729A (en) Tracking and re-detection method based on anomaly peak detection and twin network
CN111862147B (en) Tracking method for multiple vehicles and multiple lines of human targets in video
Wang et al. Summary of object detection based on convolutional neural network
CN116416503A (en) Small sample target detection method, system and medium based on multi-mode fusion
CN116109673A (en) Multi-frame track tracking system and method based on pedestrian gesture estimation
CN113221626A (en) Human body posture estimation method based on Non-local high-resolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant