CN114694176A

CN114694176A - Lightweight human body posture estimation method based on deep learning

Info

Publication number: CN114694176A
Application number: CN202210220002.6A
Authority: CN
Inventors: 陆大鹏; 闫胜业
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2022-07-01

Abstract

The invention discloses a lightweight human body posture estimation method based on deep learning, comprising the following steps. Step 1: build a data set from the specified pictures. Step 2: preprocess the pictures in the data set. Step 3: use an improved high-resolution network, HRNet, as the backbone network; a residual structure addresses the network degradation problem of deep neural networks, and the overall parameter count and computation of the model are reduced. Step 4: perform scale fusion on four parallel subnets of different scales by up-sampling the low-resolution feature maps and fusing them with the feature map of the high-resolution subnet, then generate Gaussian heatmaps for the different types of key points. The method improves on the original human body posture estimation model, reduces the parameter count and computation of the model without sacrificing much accuracy, and enables the model to run on low-compute platforms; the effect of the human body posture estimation model based on the enhanced high-resolution network has been verified.

Description

Lightweight human body posture estimation method based on deep learning

Technical Field

The invention relates to the technical field of computer vision, in particular to a lightweight human posture estimation method based on deep learning.

Background

At present, mainstream human body posture estimation methods are based on deep learning. In 2014, Toshev et al. presented DeepPose at CVPR, applying deep learning to the human body posture estimation task for the first time. DeepPose abandons the hand-crafted features of traditional methods and uses a multi-stage deep convolutional network to extract global features from a picture and directly regress the key points of the human body. However, because a convolutional neural network cannot regress long-distance offsets well, the model ends up regressing only a rough offset, and its generalization capability is poor. To address these problems, Tompson et al. proposed heatmap-based regression of human body key points using a convolutional neural network together with a graphical model. Deep learning methods for human body posture estimation have largely followed these approaches ever since, with each link in the pipeline continuously optimized.

Thanks to the continued efforts of many researchers, the human body posture estimation task has achieved relatively good experimental results. However, one problem remains: as model accuracy improves, the parameter count of the model grows by multiples and the computational cost keeps rising, so developing and deploying the model places higher demands on hardware.

Disclosure of Invention

The purpose of the invention is as follows: to overcome the defects of the prior art, the invention provides a lightweight human body posture estimation method based on deep learning, which improves on an original human body posture estimation model, reduces the parameter count and computation of the model without sacrificing much accuracy, and enables the model to run on a low-compute platform.

In order to achieve the purpose, the invention adopts the technical scheme that: a lightweight human body posture estimation method based on deep learning comprises the following steps:

Step 1: randomly extract pictures containing human bodies from a video website to form the data set, name the pictures sequentially with numbers, place them in a picture folder, annotate them manually, and write the annotation information and picture names into a JSON (JavaScript Object Notation) file.

Step 2: before the pictures in the data set are used for training, preprocess the pictures from step 1 and apply small-scale erasing occlusion near the key points of the pictures using the HR random erasing method.

Step 3: adopt a high-resolution network HRNet comprising an HR bottleneck module (HRBottle Block) and an HR basic module (HRBasic Block); after preprocessing and random erasing, the pictures enter the HRNet. The HRNet is divided into 4 stages, each followed by a transition layer. Stage one comprises 4 HRBottle Blocks; stage two comprises 2 branches, each with 4 HRBasic Blocks; stage three comprises 3 branches, each with 4 HRBasic Blocks, and is repeated 4 times in the whole network before connecting to stage four; stage four comprises 4 branches, each with 4 HRBasic Blocks, and is repeated 3 times in the whole network.

The HRBasic module adopts depthwise separable convolution, which decomposes a standard convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution performs the filtering, with one independent filter per input channel; the pointwise convolution converts the channels, combining the outputs of the depthwise convolution across channels with a 1 × 1 convolution.
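By way of illustration only, the two-stage decomposition can be sketched in NumPy as follows. The function names, the (M, H, W) array layout, the stride of 1 and the zero padding are assumptions made for this sketch and are not taken from the HRBasic module itself.

```python
import numpy as np


def depthwise_conv(x, dw_kernels):
    """Depthwise step: filter each input channel with its own K x K kernel.

    x: (M, H, W) input feature map; dw_kernels: (M, K, K), one filter per channel.
    Assumes stride 1, odd K and zero padding, so the spatial size is preserved.
    """
    m, h, w = x.shape
    k = dw_kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((m, h, w), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            # Shifted window times the per-channel weight at kernel position (i, j).
            out += xp[:, i:i + h, j:j + w] * dw_kernels[:, i, j][:, None, None]
    return out


def pointwise_conv(x, pw_weights):
    """Pointwise step: a 1 x 1 convolution mixing channels; pw_weights is (N, M)."""
    return np.einsum("nm,mhw->nhw", pw_weights, x)


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise filtering followed by pointwise channel conversion."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)
```

In this sketch the depthwise step never mixes channels and the pointwise step never looks at neighbouring pixels, which is what allows the two stages to be much cheaper together than a single standard convolution.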

Step 4: at the tail of the HRNet, perform scale fusion of the four parallel subnets of different scales by up-sampling the low-resolution feature maps and fusing them with the feature map of the high-resolution subnet, and generate Gaussian heatmaps for the different types of key points.
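As a non-limiting illustration of what a Gaussian heatmap for a key point looks like, a minimal sketch is given below. The 64 × 64 heatmap size, the value sigma = 2.0 and the function names are assumptions for the example only; the text above does not specify them.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Return a height x width heatmap with a Gaussian peak of value 1 at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def keypoint_heatmaps(keypoints, height=64, width=64, sigma=2.0):
    """One heatmap per key point; keypoints is a list of (x, y) in heatmap coordinates."""
    return [gaussian_heatmap(height, width, x, y, sigma) for x, y in keypoints]


# Two hypothetical key points; each gets its own heatmap channel.
maps = keypoint_heatmaps([(10, 20), (30, 5)])
```

Each key point type is given its own channel, so a model predicting 16 key points would output 16 such maps, and the predicted position is read off as the location of the peak.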

As a preferred embodiment of the present invention: in step 1, the 16 manually labeled key points of the human body are the right ankle joint, right knee joint, right hip joint, left hip joint, left knee joint, left ankle joint, pelvis, chest, upper neck, vertex, right wrist joint, right elbow joint, right shoulder joint, left shoulder joint, left elbow joint and left wrist joint.

As a preferred embodiment of the present invention: the pictures fed into the model in step 2 are unified to a size of 256 × 256.

As a preferred embodiment of the present invention: in step 2, the data is preprocessed by random rotation, flipping and similar operations, and the input variables of the HR random erasing method comprise: the input picture input; the picture size C, H, W; the picture area S = H × W; the erasing probability p of the picture; the probability p_c that the erased area is circular; the range (s1, s2) of the ratio of the rectangular erased area to the picture area; the aspect ratio range (r1, r2) of the rectangular erase box; and the ratio R of the circular erased area to the overall image area.

First, two random numbers p1 and p2 are drawn from (0, 1) and compared with p and p_c to decide whether the picture is erased and whether the erased shape is a rectangle or a circle. If p1 ≥ p, the picture is not erased and is output directly; if p1 < p, the calculation continues. S_e is the computed area of the rectangular erase region and R_e is its aspect ratio; the height H_e and width W_e of the rectangular erase region are determined from these two quantities, and the area of the fixed circular erase region is found in a similar way. Next, p2 is compared with p_c. If p2 ≥ p_c, the erase region is rectangular: the coordinates (x1, y1) of its lower-left corner are selected at random on the image, which fixes the position of the erase box, and the pixel values of the region I_e of the original image selected by the erase box are replaced by random numbers in (0, 255), thereby realizing random erasure. If p2 < p_c, the erase region is circular: a point (x2, y2) is selected at random on the image as the center of the circular erase region, which fixes the position of the circular erase box, and the pixel values of the region I_e of the original image selected by the erase box are replaced by random numbers in (0, 255).

As a preferred embodiment of the present invention: the output of the shallow convolution in step 3 is connected to the output of the last convolution by a skip connection.

As a preferred embodiment of the present invention: in step 3 the convolution is split into a two-stage process, which reduces the parameter count and the computation.

The parameter count of the depthwise separable convolution relative to that of the standard convolution is given by formula (1):

$$\frac{K \times K \times M + M \times N}{K \times K \times M \times N} = \frac{1}{N} + \frac{1}{K^{2}} \tag{1}$$

The computation of the depthwise separable convolution relative to that of the standard convolution is given by formula (2):

$$\frac{H \times W \times M \times K \times K + H \times W \times M \times N}{H \times W \times M \times N \times K \times K} = \frac{1}{N} + \frac{1}{K^{2}} \tag{2}$$

Here H × W × M is the size of the input feature map, H × W × N is the size of the generated output feature map, W and H are the width and height of the input and output feature maps, and K × K is the kernel size of the standard convolution. As formulas (1) and (2) show, converting an ordinary standard convolution into a depthwise separable convolution reduces the overall parameter count and computation of the model.
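The reduction can be checked numerically. The following sketch counts weights and multiply-accumulate operations for an H × W × M input, N output channels and a K × K kernel, assuming stride 1, preserved spatial size and no bias terms; the function name and the example sizes are chosen for illustration only.

```python
def conv_costs(h, w, m, n, k):
    """Parameter and multiply-accumulate counts for a standard vs. separable convolution."""
    standard = {"params": k * k * m * n, "macs": h * w * m * n * k * k}
    separable = {
        # Depthwise weights (K*K per input channel) plus pointwise weights (M*N).
        "params": k * k * m + m * n,
        "macs": h * w * m * k * k + h * w * m * n,
    }
    return standard, separable


# Hypothetical layer: 64 x 64 feature map, 32 input channels, 64 output channels, 3 x 3 kernel.
standard, separable = conv_costs(h=64, w=64, m=32, n=64, k=3)
ratio = separable["params"] / standard["params"]  # roughly 0.127, i.e. 1/N + 1/K^2
```

For a 3 × 3 kernel the ratio is dominated by the 1/K² term, so under these assumptions the separable form needs a little over one ninth of the weights and operations of the standard convolution.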

As a preferred embodiment of the present invention: in step 3 the HRBottle module uses the Mish activation function in place of the ReLU activation function.

As a preferred embodiment of the present invention: in step 4, the subnets with different scales are subjected to scale fusion by using a channel cascading method.

Compared with the prior art, the invention has the following beneficial effects:

The method improves on an original human body posture estimation model, reduces the parameter count and computation of the model without sacrificing much accuracy, and enables the model to run on a low-compute platform. The effect of the human body posture estimation model based on the enhanced high-resolution network was verified by comparative experiments on the internationally published MPII and COCO data sets. The results show that, with accuracy almost equal to that of the best method, the parameter count and the computation are reduced by 63% and 40% respectively.

Drawings

FIG. 1 is a pseudocode diagram of the HR random erasing method;

FIG. 2 is a schematic diagram of HR random erasing;

FIG. 3 is a diagram of the HRBottle module and the HRBasic module;

FIG. 4 is a diagram of the overall structure of the model;

FIG. 5 is an optimization diagram of the network output structure;

FIG. 6 is a diagram of the network structure of the HRNet.

Detailed Description

The present invention is further illustrated by the accompanying drawings and the following detailed description. It is to be understood that these examples are included solely for purposes of illustration and are not intended to limit the scope of the invention; various equivalent modifications will become apparent to those skilled in the art after reading this specification, and all such modifications fall within the scope of the invention as defined in the appended claims.

As shown in FIG. 5, a lightweight human body posture estimation method based on deep learning includes the following steps:

Step 1: pictures containing human bodies are randomly extracted from a video website to form the data set, named sequentially with numbers, and placed in a picture folder. The pictures are then annotated manually with a labeling tool, which marks 16 key points of the human body on the original pictures (0-right ankle joint, 1-right knee joint, 2-right hip joint, 3-left hip joint, 4-left knee joint, 5-left ankle joint, 6-pelvis, 7-chest, 8-upper neck, 9-vertex, 10-right wrist joint, 11-right elbow joint, 12-right shoulder joint, 13-left shoulder joint, 14-left elbow joint and 15-left wrist joint). The annotation information, picture names and related information are written into a JSON (JavaScript Object Notation) file, which is stored in an annotation folder for subsequent training of the model.
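For illustration, one possible layout of such an annotation file is sketched below. The field names ("image", "joints", "id", "name", "x", "y"), the file name "000001.jpg" and the coordinates are hypothetical; only the order of the 16 key points follows the numbering above.

```python
import json

# Key point order 0..15 as listed in step 1 ("head_top" stands for the vertex).
KEYPOINT_NAMES = [
    "right_ankle", "right_knee", "right_hip", "left_hip", "left_knee", "left_ankle",
    "pelvis", "chest", "upper_neck", "head_top",
    "right_wrist", "right_elbow", "right_shoulder",
    "left_shoulder", "left_elbow", "left_wrist",
]


def make_annotation(image_name, points):
    """Build one annotation record; points is a list of 16 (x, y) pixel coordinates."""
    if len(points) != len(KEYPOINT_NAMES):
        raise ValueError("expected 16 key points")
    return {
        "image": image_name,
        "joints": [
            {"id": i, "name": name, "x": x, "y": y}
            for i, (name, (x, y)) in enumerate(zip(KEYPOINT_NAMES, points))
        ],
    }


# Dummy coordinates, used only to show the structure of the serialized file.
record = make_annotation("000001.jpg", [(10 * i, 5 * i) for i in range(16)])
text = json.dumps([record], indent=2)
```

Keeping the key point index explicit in every record makes the file independent of list order, which can help when annotations from several labeling sessions are merged.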

Step 2: before the pictures in the data set are fed into the model for training, the pictures from step 1 are preprocessed and uniformly resized to 256 × 256.

To enrich the data, it is augmented by random rotation, flipping and similar operations, and a new data augmentation method, HR random erasing, is designed to strengthen small-scale erasing near the key points. In human body posture estimation, many occlusions are caused by small objects such as balls and wheels. HR random erasing is a small-scale erasing scheme designed around the key points of the human body. As shown in FIG. 2, the new erasing scheme focuses on small-scale circular erasing as occlusion, while retaining large-scale rectangular erasing as occlusion on a larger scale.

The input variables of the algorithm are: the input picture input; the picture size C, H, W; the picture area S = H × W; the erasing probability p of the picture; the probability p_c that the erased area is circular; the range (s1, s2) of the ratio of the rectangular erased area to the picture area; the aspect ratio range (r1, r2) of the rectangular erase box; and the ratio R of the circular erased area to the overall image area.

In the first step, the algorithm draws two random numbers p1 and p2 from (0, 1) and compares them with p and p_c to decide whether the picture is erased and whether the erased shape is a rectangle or a circle. If p1 ≥ p, the picture is not erased and is output directly. If p1 < p, the calculation continues. S_e is the computed area of the rectangular erase region and R_e is its aspect ratio; the height H_e and width W_e of the rectangular erase region are determined from these two quantities, and the area of the fixed circular erase region is found in a similar way. Next, p2 is compared with p_c. If p2 ≥ p_c, the erase region is rectangular: the coordinates (x1, y1) of its lower-left corner are selected at random on the image, which fixes the position of the erase box, and the pixel values of the region I_e of the original image selected by the erase box are replaced by random numbers in (0, 255), thereby realizing random erasure. If p2 < p_c, the erase region is circular: a point (x2, y2) is selected at random on the image as the center of the circular erase region, which fixes the position of the circular erase box, and the pixel values of the region I_e of the original image selected by the erase box are replaced by random numbers in (0, 255), thereby realizing random erasure.
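The procedure described above can be sketched in NumPy as follows. This is an illustrative sketch, not the pseudocode of FIG. 1: the default parameter values, the relations H_e = sqrt(S_e × R_e) and W_e = sqrt(S_e / R_e), the treatment of (x1, y1) as the corner with the smallest array indices, and the function name are all assumptions.

```python
import numpy as np


def hr_random_erase(img, p=0.5, p_c=0.5, s=(0.02, 0.2), r=(0.3, 3.3), R=0.01, rng=None):
    """Erase a random rectangle or circle of a (C, H, W) uint8 image with random pixels."""
    rng = np.random.default_rng() if rng is None else rng
    c, h, w = img.shape
    area = h * w                                  # S = H x W
    p1, p2 = rng.random(), rng.random()
    if p1 >= p:                                   # no erasing: output the picture as is
        return img
    out = img.copy()
    if p2 >= p_c:                                 # rectangular erase region
        s_e = rng.uniform(*s) * area              # erased area S_e
        r_e = rng.uniform(*r)                     # aspect ratio R_e (assumed H_e / W_e)
        h_e = min(h, max(1, int(round(np.sqrt(s_e * r_e)))))
        w_e = min(w, max(1, int(round(np.sqrt(s_e / r_e)))))
        y1 = rng.integers(0, h - h_e + 1)         # random corner keeps the box in the image
        x1 = rng.integers(0, w - w_e + 1)
        out[:, y1:y1 + h_e, x1:x1 + w_e] = rng.integers(
            0, 256, size=(c, h_e, w_e), dtype=np.uint8)
    else:                                         # circular erase region of fixed area R * S
        radius = np.sqrt(R * area / np.pi)
        y2, x2 = rng.integers(0, h), rng.integers(0, w)
        yy, xx = np.ogrid[:h, :w]
        mask = (yy - y2) ** 2 + (xx - x2) ** 2 <= radius ** 2
        noise = rng.integers(0, 256, size=img.shape, dtype=np.uint8)
        out[:, mask] = noise[:, mask]
    return out
```

A caller would typically apply this to each training picture after resizing, for example `hr_random_erase(picture, rng=np.random.default_rng(0))`. In this sketch the erase position is drawn uniformly over the image, as in the description; biasing the position toward annotated key points would be a further design choice not shown here.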

FIG. 1 shows the flow of the HR random erasing method, which can simulate many real-world occlusions and thereby augments the data. The preprocessed data can then be sent to the backbone network for feature extraction and subsequent key point prediction.

Step 3: an improved high-resolution network HRNet is used as the backbone network. The improvement consists in designing two lightweight HRBlocks to replace the ResBlocks that the original network inherits from ResNet. In ResNet, the network degradation problem of deep neural networks is solved by introducing a residual structure, and the problem of vanishing or exploding gradients is effectively addressed by normalized initialization. The original HRNet takes the basic module BasicBlock and the bottleneck module BottleBlock from ResNet as the basic units for building the whole network. The HR modules designed here, namely the HR bottleneck module HRBottle Block and the HR basic module HRBasic Block shown in FIG. 3, can be embedded directly into the HRNet network as replacements, improving the feature extraction efficiency of the human body posture estimation network.

The original HRNet can be divided into four stages. The highest-resolution subnet, formed by four bottleneck modules, is the first stage; at the transition from the first stage to the second, a parallel subnet with half the resolution is added to the existing network. Likewise, each transition to the next stage adds a subnet with half the resolution of the previous one. The parallel subnets in the second, third and fourth stages use basic modules to extract feature information. Counting the subnets of the different stages, the whole HRNet structure contains 108 ResNet modules in total: 4 bottleneck modules and 104 basic modules. These modules perform feature extraction and connect the different stages. However, arranging so many original ResNet modules in parallel makes the model very large and the data processing redundant.
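The module totals can be reproduced from the stage layout given in step 3 (branches per stage, 4 modules per branch, stage three repeated 4 times and stage four repeated 3 times). The table structure and variable names below are chosen for illustration only.

```python
# (stage name, parallel branches, modules per branch, times the stage is repeated)
STAGES = [
    ("stage 1 (bottleneck)", 1, 4, 1),
    ("stage 2 (basic)", 2, 4, 1),
    ("stage 3 (basic)", 3, 4, 4),
    ("stage 4 (basic)", 4, 4, 3),
]

counts = {name: branches * modules * repeats
          for name, branches, modules, repeats in STAGES}
bottleneck_total = counts["stage 1 (bottleneck)"]                          # 4
basic_total = sum(v for name, v in counts.items() if "basic" in name)      # 8 + 48 + 48 = 104
total_modules = bottleneck_total + basic_total                             # 108
```

Because 104 of the 108 modules are basic modules, replacing the two 3 × 3 standard convolutions in each basic module is where most of the saving in parameters and computation can be expected.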

Each bottleneck module computes one 3 × 3 standard convolution and two 1 × 1 convolutions, while each basic module computes two 3 × 3 standard convolutions. With a stride of 1, a 3 × 3 standard convolution produces an output feature map of the same size as the input feature map. Therefore, when a high-resolution picture is input, standard convolutions easily incur a huge parameter count and amount of computation.

As shown in FIG. 6, the improved HRNet uses the HR modules to make the human body posture estimation task more efficient. The proposed HR modules, namely the HR bottleneck module HRBottle Block and the HR basic module HRBasic Block, can be embedded directly into the HRNet network as replacements, improving the feature extraction efficiency of the human body posture estimation network. The HRBasic module replaces the original 3 × 3 convolution with a depthwise separable convolution, which decomposes a standard convolution into a depthwise convolution and a pointwise convolution. The parameter count of the depthwise separable convolution relative to that of the standard convolution is given by formula (1), and the corresponding relative computation by formula (2):

$$\frac{K \times K \times M + M \times N}{K \times K \times M \times N} = \frac{1}{N} + \frac{1}{K^{2}} \tag{1}$$

$$\frac{H \times W \times M \times K \times K + H \times W \times M \times N}{H \times W \times M \times N \times K \times K} = \frac{1}{N} + \frac{1}{K^{2}} \tag{2}$$

The depthwise convolution is responsible for filtering, with one filter per input channel; the pointwise convolution is responsible for converting the channels, combining the outputs of the depthwise convolution across channels with a 1 × 1 convolution. H × W × M is the size of the input feature map, H × W × N is the size of the generated output feature map, W and H are the width and height of the input and output feature maps, and K × K is the kernel size of the standard convolution. Splitting the convolution into a two-stage process reduces the parameter count and the computation. As formulas (1) and (2) show, converting an ordinary standard convolution into a depthwise separable convolution reduces the parameter count and computation of each convolution, and therefore of the whole model.

In the structural design of the HRBottle module, the output of the shallow convolution is connected to the output of the last convolution layer by a skip connection. Human body posture estimation must detect key points at different scales, so visual information at different scales has to be obtained from feature maps with different receptive fields. Connecting the feature maps of the earlier layers by channel-wise concatenation retains and accumulates the feature maps of multiple receptive fields, which improves the feature representation capability of the HRBottle module.

In addition, the HRBottle module uses the Mish activation function in place of the ReLU activation function to improve its optimization and generalization capability. The HRBottle module extracts features more effectively than the original bottleneck module.
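For reference, the Mish activation is commonly defined as x · tanh(softplus(x)). The scalar sketch below uses that definition next to ReLU for comparison; the helper names are chosen for the example.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x)); smooth and slightly negative for x < 0."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU activation: max(x, 0); exactly zero for all negative inputs."""
    return max(x, 0.0)
```

Unlike ReLU, Mish passes a small negative signal for negative inputs and is smooth around zero, which is the property usually cited when it is chosen to ease optimization.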

In the output structure of the original HRNet, however, feature fusion is performed by element-wise addition of the four feature maps of different scales. Element-wise addition does not increase the number of dimensions describing the image; it only increases the amount of information in each dimension. As shown in FIG. 4, unlike the original HRNet network, the last output layer of the whole network here adopts channel-wise concatenation to connect and fuse the last-layer output feature maps of the subnets of different scales. The new network design up-samples the three lower-resolution subnet feature maps and then concatenates all feature maps along the channel dimension; combining the channels both reuses the features and increases the number of features describing the image itself. Finally, the features are fused by a 1 × 1 convolution, making full use of semantic information at different scales to improve the output performance at the end of the network. The structure is shown in FIG. 5.
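The output structure described above (up-sample, concatenate along channels, then apply a 1 × 1 convolution) can be sketched as follows. Nearest-neighbour up-sampling is used here purely as a stand-in, since the up-sampling operator is not fixed by this paragraph; the function names, array layout and channel counts are likewise assumptions.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour up-sampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_by_concat(features, weights):
    """Up-sample every branch to the highest resolution, concatenate, apply a 1 x 1 conv.

    features: list of (C_i, H_i, W_i) maps, highest resolution first, each lower
              resolution an integer fraction of the first.
    weights:  (K, sum(C_i)) matrix of the 1 x 1 convolution producing K output maps.
    """
    target_h = features[0].shape[1]
    upsampled = [upsample_nearest(f, target_h // f.shape[1]) for f in features]
    stacked = np.concatenate(upsampled, axis=0)        # channel-wise concatenation
    return np.einsum("kc,chw->khw", weights, stacked)  # 1 x 1 convolution


# Four hypothetical branches at resolutions 16, 8, 4 and 2 with 2, 4, 8 and 16 channels.
feats = [np.ones((2, 16, 16)), np.ones((4, 8, 8)), np.ones((8, 4, 4)), np.ones((16, 2, 2))]
heat = fuse_by_concat(feats, np.ones((16, 30)))        # 16 output maps of size 16 x 16
```

With concatenation the fused tensor has as many channels as all branches combined (30 in this toy example), whereas element-wise addition would require all branches to share one channel count and would keep that count unchanged.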

The lightweight network design starts from the most basic modules of the high-resolution network: to reduce the parameter count and computation of the whole model, brand-new lightweight HRBlocks are designed. Then, in the intermediate feature exchange stage of the network, a more efficient pixel-reassembly up-sampling module is introduced to reduce feature loss. Finally, in the output stage of the network, feature maps of different scales are fused by channel-wise concatenation to improve the output capability of the model. Working together from the module level to the network level, these three designs form an efficient and lightweight network.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A lightweight human body posture estimation method based on deep learning is characterized by comprising the following steps:

step 1: randomly extracting pictures containing human bodies from a video website to form the data set, naming the pictures sequentially with numbers, placing them in a picture folder, annotating them manually, and writing the annotation information and picture names into a JSON (JavaScript Object Notation) file;

step 2: before the pictures in the data set are used for training, preprocessing the pictures from step 1 and applying small-scale erasing occlusion near the key points of the pictures using an HR random erasing method;

step 3: adopting a high-resolution network HRNet comprising an HR bottleneck module (HRBottle Block) and an HR basic module (HRBasic Block), the pictures entering the HRNet after preprocessing and random erasing; the HRNet is divided into 4 stages, each followed by a transition layer, wherein stage one comprises 4 HRBottle Blocks; stage two comprises 2 branches, each with 4 HRBasic Blocks; stage three comprises 3 branches, each with 4 HRBasic Blocks, and is repeated 4 times in the whole network before connecting to stage four; stage four comprises 4 branches, each with 4 HRBasic Blocks, and is repeated 3 times in the whole network;

2. The lightweight human body posture estimation method based on deep learning of claim 1, wherein: in step 1, the 16 manually labeled key points of the human body are the right ankle joint, right knee joint, right hip joint, left hip joint, left knee joint, left ankle joint, pelvis, chest, upper neck, vertex, right wrist joint, right elbow joint, right shoulder joint, left shoulder joint, left elbow joint and left wrist joint.

3. The method for estimating the lightweight human body posture based on the deep learning of claim 2, wherein: the pictures fed into the model in step 2 are unified to a size of 256 × 256.

4. The lightweight human body posture estimation method based on deep learning of claim 3, wherein: in step 2, the data is preprocessed by random rotation, flipping and similar operations, and the input variables of the HR random erasing method comprise: the input picture input; the picture size C, H, W; the picture area S = H × W; the erasing probability p of the picture; the probability p_c that the erased area is circular; the range (s1, s2) of the ratio of the rectangular erased area to the picture area; the aspect ratio range (r1, r2) of the rectangular erase box; and the ratio R of the circular erased area to the overall image area;

first, two random numbers p1 and p2 are drawn from (0, 1) and compared with p and p_c to decide whether the picture is erased and whether the erased shape is a rectangle or a circle; if p1 ≥ p, the picture is not erased and is output directly; if p1 < p, the calculation continues; S_e is the computed area of the rectangular erase region and R_e is its aspect ratio, the height H_e and width W_e of the rectangular erase region are determined from these two quantities, and the area of the fixed circular erase region is found in a similar way; next, p2 is compared with p_c; if p2 ≥ p_c, the erase region is rectangular, the coordinates (x1, y1) of its lower-left corner are selected at random on the image, which fixes the position of the erase box, and the pixel values of the region I_e of the original image selected by the erase box are replaced by random numbers in (0, 255), thereby realizing random erasure; if p2 < p_c, the erase region is circular, a point (x2, y2) is selected at random on the image as the center of the circular erase region, which fixes the position of the circular erase box, and the pixel values of the region I_e of the original image selected by the erase box are replaced by random numbers in (0, 255).

5. The method for estimating the lightweight human body posture based on the deep learning of claim 4, wherein: the output of the shallow convolution in step 3 is connected to the output of the last convolution by a skip connection.

6. The lightweight human body posture estimation method based on deep learning according to claim 5, characterized in that: in step 3, the convolution is split into a two-stage process, which reduces the parameter count and the computation, the parameter count of the depthwise separable convolution relative to the standard convolution being described in equation (1):

$$\frac{K \times K \times M + M \times N}{K \times K \times M \times N} = \frac{1}{N} + \frac{1}{K^{2}} \tag{1}$$

7. The lightweight human body posture estimation method based on deep learning of claim 6, wherein: in step 3 the HRBottle module uses the Mish activation function in place of the ReLU activation function.

8. The method for estimating a lightweight human body posture based on deep learning according to claim 7, characterized in that: in step 4, the subnets with different scales are subjected to scale fusion by using a channel cascading method.