CN112633220A - Human body posture estimation method based on bidirectional serialization modeling - Google Patents
Human body posture estimation method based on bidirectional serialization modeling
- Publication number
- CN112633220A (application CN202011610311.1A)
- Authority
- CN
- China
- Prior art keywords
- human body
- network
- posture
- attitude
- posture estimation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/40: Scenes; scene-specific elements in video content
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Neural network architectures; combinations of networks
- G06N3/08: Neural network learning methods
- G06V10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
- G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- Y02T10/40: Engine management systems
Abstract
The invention discloses a human body posture estimation method based on bidirectional serialization modeling. The method takes three consecutive frames as input and fully exploits the temporal information of the video: it first computes the approximate spatial range of each joint, then regresses the precise joint position within that smaller range. This better handles the occlusion and motion blur inherent in human posture estimation, yielding a model with stronger generalization and higher accuracy. By fully exploiting video temporal information, the method strengthens the model's reasoning ability and localizes key body parts more reliably, which is valuable for industries that must extract postures in real time for analysis, such as security and short-video platforms.
Description
Technical Field
The invention belongs to the technical field of human body posture estimation, and particularly relates to a human body posture estimation method based on bidirectional serialization modeling.
Background
Human body posture estimation is a leading research area in computer vision. Its goal is to locate key body parts (such as wrists and ankles) in pictures or videos and thereby estimate the human pose. Posture estimation bridges machines and people, has great practical significance, and is widely applied in many fields. In stage animation, recognizing a performer's postures and actions can drive real-time interactive effects; in automatic driving, predicting the motion trend of pedestrians can help avoid traffic accidents; in security, recognizing specific posture sequences can reveal abnormal behaviour.
Currently, human posture estimation methods fall into two main classes. (1) Top-down: first detect every person in the picture, usually marking each with a rectangular bounding box; then identify the joints of each person with a joint detector; finally map the cropped person's posture information back to the original picture by affine transformation, thereby estimating all human poses in the picture. Top-down methods separate person detection from joint detection and concentrate on the posture estimation itself, so they achieve high accuracy; however, their detection time grows with the number of people in the picture, they depend on an object detection technique, and the quality of the detected position coordinates directly affects the final posture estimate. (2) Bottom-up: first detect the joint positions of all people in the picture, then cluster the joint coordinates belonging to the same person, thereby estimating the posture of every person. Bottom-up methods are efficient, and their detection time is largely independent of the number of people in the picture, but their accuracy lags slightly behind.
Mainstream human body posture estimation methods, both top-down and bottom-up, use network architectures designed for static pictures and excel at single-frame estimation. In a video, however, one frame typically lasts only 1/25 s, so adjacent frames change little and are highly similar. This rich geometric consistency between neighbouring frames provides extra cues that can be used to correct keypoints that are hard to predict from a single frame, such as those affected by occlusion or motion blur.
Traditional image-based posture estimation methods cannot exploit this extra information, so they fail in video sequences where people are highly entangled, mutually occluded, or motion-blurred, and struggle to produce good results for video posture estimation. To address this, the document [Flowing ConvNets for Human Pose Estimation in Videos - Pfister, T., Charles, J. & Zisserman, A. (ICCV 2015)] proposed computing dense optical flow between every pair of frames and then correcting the initial pose estimate with flow-based temporal information. When the optical flow can be computed correctly, this works well; however, optical flow is strongly affected by picture quality and occlusion, cannot be computed accurately throughout a video, and often demands a large amount of computation. Other researchers proposed modelling the video directly with a Long Short-Term Memory network (LSTM) to capture temporal information, but due to the structural limitations of the LSTM this only works well when the people in a video frame are sparse; in complex scenes it still cannot handle occlusion or motion blur.
Disclosure of Invention
In view of the above, the present invention provides a human body posture estimation method based on bidirectional serialization modeling. It takes three consecutive frames as input, fully exploits the temporal information of the video to compute the approximate spatial range of each joint, and then regresses the precise joint position within that smaller range, thereby better handling the occlusion and motion blur inherent in posture estimation and yielding a model with stronger generalization and higher accuracy.
A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) collecting a video data set for human body posture estimation and preprocessing it;
(2) for each complete video in the data set, taking every 3 consecutive video frames as a group of samples and manually annotating the coordinates of each key body part in the video images;
(3) constructing a bidirectional continuous convolutional neural network and training it with a large number of samples to obtain a human body posture estimation model;
(4) inputting 3 consecutive video frames to be estimated into the human body posture estimation model and outputting the posture estimation result for the person in the 2nd frame, namely the coordinates of each key body part.
Further, in step (1), for each video frame in the data set, the position coordinates of each human ROI (region of interest, i.e. the person's bounding box) are detected with the YOLOv5 algorithm, and each ROI is enlarged by 25%.
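As an illustration of this preprocessing step, the sketch below enlarges a detected box by 25% about its centre and clips it to the image; the function name, the (x1, y1, x2, y2) box format and the default image size are assumptions for illustration, not taken from the patent.

```python
def enlarge_roi(box, scale=1.25, img_w=1920, img_h=1080):
    """Enlarge an (x1, y1, x2, y2) person box about its centre and clip to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

print(enlarge_roi((100, 100, 300, 300)))  # (75.0, 75.0, 325.0, 325.0)
```

The same enlarged box is reused to crop the previous and next frames, so all three crops cover the same person.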
Further, the bidirectional continuous convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network. The Backbone network computes initial posture feature vectors h_{i-1}, h_i, h_{i+1} for the human body in the three video frames of an input sample; the three feature vectors are stacked into a vector φ(h), which is fed to both the posture time merging network and the posture residual fusion network. The posture time merging network encodes the approximate spatial range of each joint into a feature vector ξ(h); the posture residual fusion network computes a posture residual vector ψ(h); finally ξ(h), stacked with ψ(h) into the feature vector η, is fed to the posture correction network, which computes the human posture prediction.
Further, the posture time merging network is formed by stacking three Residual Blocks; the vector φ(h) is regrouped by joint order and fed to the network, which outputs the feature vector ξ(h). The posture residual fusion network is formed by stacking five residual blocks: the posture feature vectors of the second and first frames, and of the second and third frames, are first differenced respectively, the differences are concatenated with weights into a tensor ζ, which is fed to the network, and the posture residual vector ψ(h) is output; the specific expression of the tensor ζ is as follows:
further, the residual block is formed by sequentially connecting a convolution layer with the size of 3 × 3, a batch normalization layer and a Relu activation layer, the residual block in the attitude time merging network adopts packet convolution, and the packet number groups is 17 (according to the key point standard of the COCO data set, there are 17 key points in total); the residual block in the posture residual fusion network does not use packet convolution, and the packet number groups is 1.
Furthermore, the posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12 and 15 respectively. Each deformable convolution takes the stack of the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap; the five heatmaps output by the five convolutions are averaged to obtain the human posture prediction.
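The role of the five dilation rates can be sanity-checked with the standard receptive-field formula for a single dilated k × k layer, rf = k + (k - 1)(d - 1), and the final fusion is a plain per-pixel average; this sketch assumes 3 × 3 kernels, which the patent does not state for this layer.

```python
def dilated_rf(kernel=3, dilation=1):
    """Receptive field (one axis) of a single dilated convolution layer."""
    return kernel + (kernel - 1) * (dilation - 1)

for d in (3, 6, 9, 12, 15):
    print(d, dilated_rf(3, d))  # larger dilation rate -> larger receptive field

def average_heatmaps(heatmaps):
    """Per-pixel mean of equally sized 2-D heatmaps (lists of rows)."""
    n = len(heatmaps)
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(h[r][c] for h in heatmaps) / n for c in range(cols)]
            for r in range(rows)]

fused = average_heatmaps([[[1.0, 0.0]], [[0.0, 1.0]]])
print(fused)  # [[0.5, 0.5]]
```

Averaging branches with mixed receptive fields blends global context (large dilation) with fine local detail (small dilation).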
Further, training the bidirectional continuous convolutional neural network in step (3) proceeds in two stages: first the Backbone network is trained; then the Backbone parameters are fixed and the posture time merging network, the posture residual fusion network and the posture correction network are trained.
Further, the specific process of training the Backbone network is as follows: the human ROIs in all video images of a sample are fed into the Backbone network one by one; the loss function L1 between the human posture prediction output by the whole bidirectional continuous convolutional neural network and the manual annotations of the sample is computed; and the Backbone parameters are repeatedly updated by back-propagation of L1 until L1 converges. Consistent with the symbols defined below, L1 can be written as

L1 = (1/N) Σ_{i=1}^{N} v_i ‖H_pred_i - H_gt_i‖₂²

wherein: N is the number of annotated key body parts; H_gt_i is the result of superimposing the Gaussian heatmaps generated from the manually annotated coordinates of the i-th key part of every human ROI in a sample group; H_pred_i is the corresponding superimposed Gaussian heatmap generated from the coordinates of the i-th key part predicted by the bidirectional continuous convolutional neural network; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image (1 if annotated, 0 otherwise).
Further, the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: the trained Backbone parameters are first fixed; the human ROIs in all video images of a sample are then fed into the Backbone network one by one; the loss function L2 between the human posture prediction output by the whole bidirectional continuous convolutional neural network and the manual annotations of the sample is computed; and the parameters of the three networks are repeatedly updated by back-propagation of L2 until L2 converges. Consistent with the symbols defined below, L2 can be written as

L2 = (1/N) Σ_{i=1}^{N} v_i ‖G_pred_i - G_gt_i‖₂²

wherein: N is the number of annotated key body parts; G_gt_i is the Gaussian heatmap generated from the manually annotated coordinates of the i-th key part of a human ROI in the 2nd frame of a sample group; G_pred_i is the Gaussian heatmap generated from the coordinates of that key part predicted by the bidirectional continuous convolutional neural network; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image (1 if annotated, 0 otherwise).
Further, step (4) is implemented as follows: the ROIs of the same person in the 3 consecutive video frames to be estimated are fed into the human body posture estimation model, which outputs a Gaussian heatmap; the heatmap is converted to obtain the coordinates of the person's key parts in the 2nd frame; the coordinates are mapped back into the 2nd frame, and the key parts are linked in order to produce a predicted human skeleton, thereby realizing human posture estimation.
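The heatmap-to-coordinate conversion can be sketched as a simple argmax decode followed by mapping back into the original image through the ROI offset and scale; the helper names and the ROI bookkeeping are illustrative assumptions:

```python
def decode_heatmap(heatmap):
    """Return the (x, y) grid position of the heatmap's maximum response."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best:
                best, best_xy = val, (x, y)
    return best_xy

def to_image_coords(xy, roi_x, roi_y, stride=4):
    """Map a heatmap position back to original-image pixels,
    assuming the heatmap is `stride` times smaller than the ROI crop."""
    x, y = xy
    return (roi_x + x * stride, roi_y + y * stride)

hm = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
peak = decode_heatmap(hm)
print(peak)                            # (1, 1)
print(to_image_coords(peak, 100, 50))  # (104, 54)
```

Production decoders often refine the argmax with a sub-pixel offset toward the second-highest neighbour, but the raw argmax shows the principle.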
The method of the invention mainly uses deformable convolution networks with different dilation (atrous) rates as the prediction model. A deformable convolution is a variant of the traditional convolutional layer: traditional convolution kernels are square, but ordinary objects such as human bodies are not, which limits traditional convolutional networks; a deformable convolution learns an offset for each kernel position and can therefore take on an arbitrary shape, adapting better to objects of various shapes. Each convolutional layer uses a different dilation rate, and hence a different receptive field: the larger the dilation rate, the larger the receptive field and the more global the captured information, while a smaller dilation rate captures finer local information. This makes the deformable-convolution design well suited to estimating human posture in video.
The invention fully exploits the temporal information of the video, strengthens the model's reasoning ability and localizes key body parts more reliably, which is valuable in industries that must extract postures in real time for analysis, such as security and short-video platforms. Its beneficial technical effects are as follows:
1. Through an accurate posture estimation algorithm, keypoints in occluded and motion-blurred images are estimated better, making detection both more accurate and faster.
2. The method is designed for video and fits a wide range of application scenarios; by adopting grouped convolution, dilated convolution and the like, it achieves good results with fewer parameters, so posture estimation can be applied in real time.
Drawings
FIG. 1 is a flowchart illustrating a human body posture estimation method according to the present invention.
FIG. 2 is a schematic diagram of a Residual Block structure and its stacking method.
FIG. 3 is a schematic structural diagram of a bi-directional continuous convolutional neural network according to the present invention.
Detailed Description
In order to more specifically describe the present invention, the following detailed description is provided for the technical solution of the present invention with reference to the accompanying drawings and the specific embodiments.
As shown in FIG. 1, the human body posture estimation method based on bidirectional continuity of the invention comprises the following steps:
(1) Collect and select a human body posture estimation video data set, and preprocess it.
This embodiment uses the PoseTrack data set as training data. PoseTrack targets human posture tracking; its videos contain frequent occlusion and motion blur, which greatly increases the difficulty of video posture estimation. The embodiment follows a top-down approach, so the data set is preprocessed: the bounding box of every person in the frame to be estimated is first detected with the YOLOv5 detection algorithm; each bounding box is then enlarged by 25% and used to crop the previous and next frames as well, yielding three images of the same person.
(2) Construct a bidirectional continuous convolutional neural network model as the human body posture estimation model.
As shown in Fig. 3, the bidirectional continuous convolutional neural network (DCPose) mainly consists of a Backbone network module, a posture time merging module (PTM), a posture residual fusion module (PRF) and a posture correction network module (PCN). In this embodiment the Backbone module uses the high-resolution network HRNet to compute initial posture features h_{i-1}, h_i, h_{i+1} for the three input pictures; the three vectors are stacked into φ(h) and fed to two parallel branches: the posture time merging module encodes the approximate spatial range ξ(h) of each joint, and the posture residual fusion module produces the posture residual vector ψ(h); then ξ(h), stacked with ψ(h) into the feature vector η, is fed to the posture correction network to obtain the final posture prediction.
The posture time merging module consists of three stacked Residual Blocks. A sample group passes through the Backbone network to give the feature vector φ(h), which is regrouped by joint order and fed to the module, outputting the feature vector ξ(h); each residual block uses grouped convolution with groups = 17 (following the COCO keypoint standard of 17 keypoints).
The posture residual fusion module consists of five stacked residual blocks. The posture feature vectors of the first and second frames and of the third and second frames of the sample group are first differenced respectively; the differences are concatenated with weights into the tensor ζ, which is fed to the module, and the posture residual vector ψ(h) is output; the tensor ζ can be formalized as:
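A minimal sketch of this residual construction, operating on per-channel feature lists; the fusion weights w_prev and w_next are illustrative assumptions, since the patent's formula for ζ is not reproduced in this text:

```python
def pose_residual_input(h_prev, h_cur, h_next, w_prev=0.5, w_next=0.5):
    """Concatenate weighted frame-difference features:
    zeta = [w_prev * (h_cur - h_prev), w_next * (h_cur - h_next)]."""
    d_prev = [w_prev * (c - p) for c, p in zip(h_cur, h_prev)]
    d_next = [w_next * (c - n) for c, n in zip(h_cur, h_next)]
    return d_prev + d_next  # channel-wise concatenation

zeta = pose_residual_input([1.0, 2.0], [3.0, 4.0], [2.0, 2.0])
print(zeta)  # [1.0, 1.0, 0.5, 1.0]
```

The differences approximate the motion of the pose features between the middle frame and its neighbours, which is the cue the PRF module learns to fuse.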
as shown in fig. 2, the residual block is composed of a 3 × 3 convolution layer, a batch normalization layer, and a Relu activation layer; the difference lies in that the groups parameter in the three residual block convolution layers forming the PTM module is 17, the corresponding PRF module does not use the packet convolution, and the groups parameter in the convolution layers is 1 at the moment.
The posture correction network consists of five parallel deformable convolutions with dilation rates set to 3, 6, 9, 12 and 15. Each deformable convolution takes the stack of the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap; finally the five heatmaps are averaged to obtain the final prediction.
(3) Feed the data preprocessed in step (1) into the model, and update the parameters and train the model with the L2 distance as the loss function.
DCPose is trained in two separate stages: the Backbone network is trained first; the Backbone is then frozen and the remaining networks are trained.
In DCPose, every frame of a video is treated in turn as the current frame to be estimated; together with one frame before and one after, the video is divided into sub-picture sequences of length 3, each carrying annotation information for the keypoint positions of all people, and each sub-picture sequence is used as one input to DCPose.
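The division into length-3 sub-sequences is an ordinary sliding window; a sketch follows (boundary frames are simply skipped here, which is an assumption, since the patent does not say how the first and last frames are handled):

```python
def split_subsequences(frames):
    """Yield (previous, current, next) triples for every interior frame."""
    return [(frames[i - 1], frames[i], frames[i + 1])
            for i in range(1, len(frames) - 1)]

print(split_subsequences([0, 1, 2, 3]))  # [(0, 1, 2), (1, 2, 3)]
```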
The Backbone network first loads the official pre-trained model parameters; a group of sub-picture sequences is then fed in, posture feature vectors are output, and the mean square error against the ground-truth posture vectors gives the loss of each frame. Consistent with the symbols defined below, the loss function L can be written as

L = (1/N) Σ_{i=1}^{N} v_i ‖H_pred_i - H_gt_i‖₂²

wherein: H_gt_i is the result of superimposing the Gaussian heatmaps generated from the ground-truth coordinates of the i-th key part of every person in the sub-sequence; H_pred_i is the superimposed Gaussian heatmap generated from the predicted coordinates of the i-th key part of every person in the sub-sequence; ‖·‖₂ denotes the L2 norm; N is the number of annotated key body parts; and v_i indicates whether the coordinate is annotated (1 if annotated, 0 otherwise).
After the Backbone network has been trained, its parameters are fixed and each sub-picture sequence is fed into the DCPose network. The Backbone produces a posture feature vector φ(h) with dimensions [4, 51, 96, 72]; the PTM network maps it to a feature vector ξ(h) with dimensions [4, 17, 96, 72], and the PRF network to a feature vector ψ(h) with dimensions [4, 128, 96, 72]. The feature vector ξ(h) and the vector η obtained by stacking ξ(h) with ψ(h) are then fed into the PCN network with dimensions [4, 145, 96, 72]; each deformable convolution layer outputs a posture feature vector with dimensions [4, 17, 96, 72], and the final Gaussian heatmap is obtained by averaging the 5 posture feature vectors.
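The channel bookkeeping in these dimensions can be checked directly (51 = 3 frames × 17 joints, 145 = 17 + 128); a small sketch using shape tuples only:

```python
BATCH, H, W = 4, 96, 72
phi = (BATCH, 3 * 17, H, W)      # Backbone output: 3 frames x 17 joints = 51 channels
xi = (BATCH, 17, H, W)           # PTM output: one channel per joint
psi = (BATCH, 128, H, W)         # PRF output

def concat_channels(a, b):
    """Shape of concatenating two feature maps along the channel axis."""
    assert a[0] == b[0] and a[2:] == b[2:]
    return (a[0], a[1] + b[1], a[2], a[3])

eta = concat_channels(xi, psi)   # PCN input
print(phi, eta)  # (4, 51, 96, 72) (4, 145, 96, 72)
```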
DCPose training mainly uses the L2 loss. In each picture sequence fed into the bidirectional continuous convolutional neural network, only frame 2 actually needs its posture estimated, so no loss is computed for frames 1 and 3. The loss function for frame 2 is essentially the same as the one used for Backbone training; the only difference is that H_gt_i is the Gaussian heatmap generated from the ground-truth coordinates of the i-th key part of the person in frame 2 of the sample, and H_pred_i is the Gaussian heatmap generated from the predicted coordinates of that key part in frame 2. By making full use of the bidirectional information of the previous and subsequent frames, the network gains more accurate prediction ability.
(4) After training, feed in the test set and output the human body posture estimation results; the specific process is as follows:
4.1 The test set is input into the trained model to obtain the Gaussian heatmap of each frame.
4.2 Using a Gaussian-heatmap coordinate-conversion algorithm, the coordinates of the key body parts are computed from the final Gaussian heatmap of step 4.1; the coordinates are then mapped back to the original picture to obtain the positions of the key parts, and finally the key parts are linked in order to produce a predicted human skeleton, achieving the goal of human posture estimation.
The embodiments described above are presented to enable a person of ordinary skill in the art to make and use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.
Claims (10)
1. A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) collecting a video data set for estimating the human body posture and preprocessing the video data set;
(2) regarding a section of complete video in a video data set, taking continuous 3 frames of video images as a group of samples, and manually marking coordinates of each key part of a human body in the video images;
(3) constructing a bidirectional continuous convolutional neural network, and training the convolutional neural network by using a large number of samples to obtain a human body posture estimation model;
(4) inputting 3 consecutive frames of video images to be estimated into the human body posture estimation model, and outputting the posture estimation result of the person in the 2nd frame video image, namely the coordinates of each key part of the human body.
2. The human body posture estimation method according to claim 1, characterized in that: in step (1), for each frame of video image in the video data set, the position coordinates of the human body ROI in the image are detected by the YOLOv5 algorithm, and the ROI is enlarged by 25%.
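Claim 2's 25% ROI enlargement can be sketched as a centre-preserving box expansion. Whether the 25% applies to each dimension or per side is not specified in the claim; this sketch assumes width and height each grow by 25% about the box centre:

```python
def expand_roi(box, ratio=0.25):
    """Enlarge a detector box (x0, y0, x1, y1) so that its width
    and height grow by `ratio`, keeping the centre fixed."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * ratio / 2  # half the extra width on each side
    dy = (y1 - y0) * ratio / 2
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)
```

The enlarged box gives the pose network some context beyond the tight detector output, so that limbs clipped by the detection are still covered.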
3. The human body posture estimation method according to claim 1, characterized in that: the bidirectional continuous convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network; the Backbone network preliminarily calculates the posture feature vectors h_{i-1}, h_i, h_{i+1} of the human body in the three frames of video images of an input sample; the three feature vectors are superposed to obtain a vector φ(h), which is input to the posture time merging network and the posture residual fusion network respectively; the posture time merging network encodes the approximate spatial range of each joint of the human body to obtain a feature vector ξ(h), and the posture residual fusion network calculates a posture residual vector ψ(h) of the human body; ξ(h), together with the feature vector η obtained by superposition from ψ(h), is input to the posture correction network, which calculates the human body posture prediction result.
4. The human body posture estimation method according to claim 3, characterized in that: the posture time merging network is formed by stacking three residual blocks; the vector φ(h) is regrouped according to joint order and then used as the input of this network, which outputs the feature vector ξ(h); the posture residual fusion network is formed by stacking five residual blocks; first the posture feature vectors of the second and first frames and of the second and third frames in a sample are differenced respectively, and the results are cascaded with weights to obtain a tensor ζ, which is used as the input of this network; the output is the posture residual vector ψ(h), and the specific expression of the tensor ζ is as follows:
5. The human body posture estimation method according to claim 4, characterized in that: each residual block is formed by sequentially connecting a 3×3 convolutional layer, a batch normalization layer and a ReLU activation layer; the residual blocks in the posture time merging network adopt grouped convolution with the number of groups set to 17, while the residual blocks in the posture residual fusion network do not use grouped convolution, i.e. the number of groups is 1.
6. The human body posture estimation method according to claim 3, characterized in that: the posture correction network is composed of five parallel deformable convolutions with dilation rates of 3, 6, 9, 12 and 15 respectively; each deformable convolution takes the result of stacking the feature vectors ξ(h) and η as input and outputs a predicted Gaussian heatmap, and the five Gaussian heatmaps output by the five convolutions are averaged to obtain the human body posture prediction result.
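The five-branch fusion in claim 6 reduces to a mean over the per-branch heatmaps. A minimal sketch of just this averaging step (the deformable convolutions themselves, and the branch count, are taken from the claim; the array layout is an illustrative assumption):

```python
import numpy as np

def fuse_branch_heatmaps(branch_heatmaps):
    """Average the Gaussian heatmaps predicted by the parallel
    deformable-convolution branches into a single prediction.

    branch_heatmaps: (B, N, H, W) array with B branches (five in
    the claim) and N key parts.
    """
    return np.mean(branch_heatmaps, axis=0)
```

Averaging branches with different dilation rates blends receptive fields of different sizes into one estimate.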
7. The human body posture estimation method according to claim 3, characterized in that: the process of training the bidirectional continuous convolutional neural network in step (3) is divided into two steps: first the Backbone network is trained, then the Backbone network parameters are fixed and the posture time merging network, the posture residual fusion network and the posture correction network are trained.
8. The human body posture estimation method according to claim 7, characterized in that: the specific process of training the Backbone network is as follows: the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L1 between the human body posture prediction result output by the whole bidirectional continuous convolutional neural network and the manual annotation information corresponding to the sample is calculated, and the Backbone network parameters are repeatedly updated by back propagation according to the loss function L1 until L1 converges; the expression of the loss function L1 is as follows:
wherein: n is the number of labeled human body key parts, Hgt_iTransforming the coordinates of manually marked ith key part of ROI in a group of samples to generate the result of superposition of Gaussian heatmaps Hpred_iConverting coordinates predicted and output by all human body ROI (region of interest) key parts in a group of samples through a bidirectional continuous convolutional neural network to generate a result after superposition of Gaussian heatmaps, | | | | sweet wind2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
9. The human body posture estimation method according to claim 7, characterized in that: the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: the trained Backbone network parameters are first fixed, then the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L2 between the human body posture prediction result output by the whole bidirectional continuous convolutional neural network and the manual annotation information corresponding to the sample is calculated, and the parameters of the posture time merging network, the posture residual fusion network and the posture correction network are repeatedly updated by back propagation according to the loss function L2 until L2 converges; the expression of the loss function L2 is as follows:
wherein: n is the number of labeled human body key parts, Ggt_iArtificially marking coordinates of the ith key part of human ROI in the 2 nd frame video image of a group of samples to generate a Gaussian heat map G through conversionpred_iPredicting a Gauss heat map generated by converting output coordinates of an ith key part of a human body ROI in a 2 nd frame video image of a group of samples through a bidirectional continuous convolution neural network, | | | | | ventilation2Denotes the L2 norm, viAnd whether the ith key part has a label in the sample image or not is shown, if so, the value is 1, otherwise, the value is 0.
10. The human body posture estimation method according to claim 1, characterized in that: the specific implementation process of step (4) is as follows: the human body ROIs of the same person in 3 consecutive frames of video images to be estimated are input into the human body posture estimation model, which outputs a Gaussian heatmap; the coordinate information of the key parts of the person in the 2nd frame video image is calculated from the Gaussian heatmap by conversion, the coordinate information is mapped into the 2nd frame video image, and the key parts are linked in order to generate the predicted human skeleton, thereby realizing human body posture estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011610311.1A CN112633220B (en) | 2020-12-30 | 2020-12-30 | Human body posture estimation method based on bidirectional serialization modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633220A true CN112633220A (en) | 2021-04-09 |
CN112633220B CN112633220B (en) | 2024-01-09 |
Family
ID=75286799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011610311.1A Active CN112633220B (en) | 2020-12-30 | 2020-12-30 | Human body posture estimation method based on bidirectional serialization modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633220B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016190934A2 (en) * | 2015-02-27 | 2016-12-01 | Massachusetts Institute Of Technology | Methods, systems, and apparatus for global multiple-access optical communications |
WO2017133009A1 (en) * | 2016-02-04 | 2017-08-10 | 广州新节奏智能科技有限公司 | Method for positioning human joint using depth image of convolutional neural network |
CN108932500A (en) * | 2018-07-09 | 2018-12-04 | 广州智能装备研究院有限公司 | A kind of dynamic gesture identification method and system based on deep neural network |
CN111695457A (en) * | 2020-05-28 | 2020-09-22 | 浙江工商大学 | Human body posture estimation method based on weak supervision mechanism |
Non-Patent Citations (1)
Title |
---|
CHEN Yukun; WANG Zhengxiang; YU Lianzhi: "Human pose estimation with a lightweight dual-path convolutional neural network and inter-frame information reasoning", Journal of Chinese Computer Systems (小型微型计算机系统), no. 10 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113205043A (en) * | 2021-04-30 | 2021-08-03 | 武汉大学 | Video sequence two-dimensional attitude estimation method based on reinforcement learning |
CN113673469A (en) * | 2021-08-30 | 2021-11-19 | 广州深灵科技有限公司 | Human body key point analysis training and reasoning method and device based on video stream |
CN113627396A (en) * | 2021-09-22 | 2021-11-09 | 浙江大学 | Health monitoring-based skipping rope counting method |
CN113627396B (en) * | 2021-09-22 | 2023-09-05 | 浙江大学 | Rope skipping counting method based on health monitoring |
CN113920545A (en) * | 2021-12-13 | 2022-01-11 | 中煤科工开采研究院有限公司 | Method and device for detecting posture of underground coal mine personnel |
CN115116132A (en) * | 2022-06-13 | 2022-09-27 | 南京邮电大学 | Human behavior analysis method for deep perception in Internet of things edge service environment |
CN115116132B (en) * | 2022-06-13 | 2023-07-28 | 南京邮电大学 | Human behavior analysis method for depth perception in Internet of things edge service environment |
CN116386089A (en) * | 2023-06-05 | 2023-07-04 | 季华实验室 | Human body posture estimation method, device, equipment and storage medium under motion scene |
CN116386089B (en) * | 2023-06-05 | 2023-10-31 | 季华实验室 | Human body posture estimation method, device, equipment and storage medium under motion scene |
Also Published As
Publication number | Publication date |
---|---|
CN112633220B (en) | 2024-01-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112633220B (en) | Human body posture estimation method based on bidirectional serialization modeling | |
US11810366B1 (en) | Joint modeling method and apparatus for enhancing local features of pedestrians | |
CN115601549A (en) | River and lake remote sensing image segmentation method based on deformable convolution and self-attention model | |
CN111986240A (en) | Drowning person detection method and system based on visible light and thermal imaging data fusion | |
CN113283525B (en) | Image matching method based on deep learning | |
CN112669350A (en) | Adaptive feature fusion intelligent substation human body target tracking method | |
CN111797688A (en) | Visual SLAM method based on optical flow and semantic segmentation | |
CN111695457A (en) | Human body posture estimation method based on weak supervision mechanism | |
CN116524062B (en) | Diffusion model-based 2D human body posture estimation method | |
CN112084952B (en) | Video point location tracking method based on self-supervision training | |
Wang et al. | MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection | |
CN106887010A (en) | Ground moving target detection method based on high-rise scene information | |
CN110705366A (en) | Real-time human head detection method based on stair scene | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN116092190A (en) | Human body posture estimation method based on self-attention high-resolution network | |
CN113269038B (en) | Multi-scale-based pedestrian detection method | |
Guo et al. | Scale region recognition network for object counting in intelligent transportation system | |
CN111680640B (en) | Vehicle type identification method and system based on domain migration | |
CN111274901B (en) | Gesture depth image continuous detection method based on depth gating recursion unit | |
CN113066074A (en) | Visual saliency prediction method based on binocular parallax offset fusion | |
Huang et al. | Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention | |
CN112950481B (en) | Water bloom shielding image data collection method based on image mosaic network | |
CN115331171A (en) | Crowd counting method and system based on depth information and significance information | |
Kim et al. | Global convolutional neural networks with self-attention for fisheye image rectification | |
CN114419729A (en) | Behavior identification method based on light-weight double-flow network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||