CN116129051A

CN116129051A - Three-dimensional human body posture estimation method and system based on graph and attention interleaving

Info

Publication number: CN116129051A
Application number: CN202310074209.1A
Authority: CN
Inventors: 刘宏; 王体; 李文豪; 游盈萱; 丁润伟
Original assignee: Peking University Shenzhen Graduate School
Current assignee: Peking University Shenzhen Graduate School
Priority date: 2023-02-07
Filing date: 2023-02-07
Publication date: 2023-05-16

Abstract

The invention relates to a three-dimensional human body posture estimation method and system based on graph and attention interleaving. The method extracts two-dimensional skeleton information of a human body from images using a pre-trained two-dimensional posture detector; embeds the two-dimensional skeleton into a high-dimensional space; mines local and global information of the skeleton using a graph and attention interleaving network module; captures multi-level information of the skeleton using a multi-layer perceptron module with a U-shaped structure; regresses the high-dimensional features to a three-dimensional skeleton using a regression head module; and uses the mean error of the joints as the loss function for model training. The invention combines the strengths of graph convolution and the attention mechanism in capturing local and global skeleton information, and allows bidirectional communication between the graph convolution block and the attention block so that the two complement each other. This strengthens the model's ability to model the human skeleton and yields estimates closer to the true three-dimensional posture.

Description

Three-dimensional human body posture estimation method and system based on graph and attention interleaving

Technical Field

The invention belongs to the field of target recognition and intelligent human-computer interaction in machine vision, and particularly relates to a three-dimensional human body posture estimation method and system based on graph and attention interleaving.

Background

Human body posture estimation aims to describe the form of the human body in media such as pictures and videos, and involves tasks such as target recognition, image segmentation and regression detection. Compared with two-dimensional posture estimation, three-dimensional human body posture estimation represents the human posture more accurately and therefore has higher research value. Three-dimensional human body posture estimation has become a research hotspot in computer vision and underlies much other research: three-dimensional postures extracted from images or videos can be further used for tasks such as action recognition and three-dimensional mesh reconstruction.

Existing three-dimensional human body posture estimation methods can be broadly divided into two categories. (1) Three-dimensional human body posture estimation based on direct regression. These methods predict three-dimensional posture coordinates directly from a two-dimensional image, without an intermediate two-dimensional posture representation. Their advantage is that the network can be trained end to end, but they place high demands on network architecture and data preprocessing. (2) Three-dimensional human body posture estimation based on a two-dimensional skeleton. These methods generally have two stages: a pre-trained two-dimensional posture estimation network first extracts a skeleton sequence, and the resulting skeleton is then input into a three-dimensional posture estimation network that lifts it to three dimensions. Because existing two-dimensional posture estimation algorithms are mature, methods based on a two-dimensional skeleton greatly reduce the complexity of the overall task, perform better than direct-regression methods, and have become the mainstream. This scheme also greatly reduces the complexity of the network structure and is easier to deploy in real environments. A typical example uses a network built from fully connected layers (Martinez J, Hossain R, Romero J, et al. A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017: 2640-2649.) to lift a two-dimensional posture to a three-dimensional posture. A series of experiments demonstrated its feasibility, showing that a simple, lightweight network can map a two-dimensional posture to a three-dimensional posture.

Although three-dimensional human body posture estimation has developed considerably in recent years, it still faces several research difficulties, mainly occlusion, the depth ambiguity inherent in two-dimensional to three-dimensional mapping, and the scarcity of datasets.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a three-dimensional human body posture estimation method and system based on graph and attention interleaving. The invention uses graph convolution and the attention mechanism to attend to the local and global information of the human skeleton simultaneously, and lets the two exchange information so that their strengths complement each other, achieving more robust human skeleton modeling. In addition, the U-shaped multi-layer perceptron designed by the invention is simple and efficient, and can capture multi-level information of the skeleton structure.

The technical scheme adopted by the invention is as follows:

A three-dimensional human body posture estimation method based on graph and attention interleaving comprises the following steps:

taking images in a three-dimensional human body posture estimation dataset as training images;

extracting two-dimensional skeleton information of a human body from an input training image by using a two-dimensional posture detector;

mapping the extracted two-dimensional skeleton information to a high-dimensional space by using a skeleton embedding module to obtain a high-dimensional vector;

mining the local and global information of the human skeleton contained in the high-dimensional vector obtained by the skeleton embedding module by using a graph and attention interleaving network module;

extracting multi-level information of the human skeleton from the output of the graph and attention interleaving network module by using a multi-layer perceptron module with a U-shaped structure;

regressing the extracted multi-level information of the human skeleton by using a regression head module, and outputting a three-dimensional skeleton;

using the mean square error of the joints as a loss function, performing supervised learning on the three-dimensional skeleton estimated by the regression head module to train a three-dimensional human body posture estimation model;

and taking the two-dimensional skeleton information extracted from an image to be estimated by the two-dimensional posture detector as the input of the trained three-dimensional human body posture estimation model, passing it sequentially through the skeleton embedding module, the graph and attention interleaving network module, the multi-layer perceptron module with the U-shaped structure and the regression head module, and finally outputting a three-dimensional human body posture estimation result.

Further, the three-dimensional human body posture estimation is performed for a human body that can be detected in an image.

Further, the first two steps in the method belong to a preprocessing stage, including acquisition of training images in a dataset and extraction of a two-dimensional skeleton.

Further, the two-dimensional skeleton information is obtained by estimating it directly from the image with an existing two-dimensional human body posture estimation algorithm.

Further, the skeleton embedding module comprises a multi-layer fully-connected network, and the two-dimensional skeleton is mapped to the high-dimensional space step by step.

Further, the graph and attention interleaving network module combines a graph convolution network with the attention mechanism to capture global and local information of the human skeleton. The module contains two strategies: 1) Graph to attention (Graph2Attention, G2A): the human body topology information extracted by the graph convolution block is injected into the attention block, so that the attention block learns the structure of the human skeleton better under the guidance of the graph convolution block; 2) Attention to graph (Attention2Graph, A2G): the global associations between skeleton nodes captured by the attention block are sent to the graph convolution block, so that the graph convolution block perceives global information better while still focusing on neighboring nodes.

Further, the structure of the graph convolution block incorporates the topology prior of the human skeleton and is used to capture local information of the human skeleton. Local information means that each key point attends to its adjacent nodes, while its connections to distant nodes tend to be ignored.

Further, the topology prior of the human skeleton means that, in the adjacency matrix representing the human skeleton, each joint is connected not only to itself and to its adjacent joints, but also to its symmetric counterpart in the skeleton. The intrinsic characteristics of the skeleton structure are expressed by means of the adjacency matrix.
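The adjacency prior described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the joint ordering, the bone list `BONES` and the left/right pairs `SYMMETRIC_PAIRS` are assumptions modeled on a common 17-joint skeleton layout, since the text does not list them.

```python
# Hypothetical 17-joint layout (assumption): 0 hip, 1-3 right leg, 4-6 left leg,
# 7 spine, 8 thorax, 9 neck, 10 head, 11-13 left arm, 14-16 right arm.
NUM_JOINTS = 17

# Physical bones between adjacent joints (assumed, not given in the text).
BONES = [
    (0, 1), (1, 2), (2, 3),        # right leg
    (0, 4), (4, 5), (5, 6),        # left leg
    (0, 7), (7, 8), (8, 9), (9, 10),  # spine to head
    (8, 11), (11, 12), (12, 13),   # left arm
    (8, 14), (14, 15), (15, 16),   # right arm
]

# Left/right counterparts that the prior additionally connects (assumed).
SYMMETRIC_PAIRS = [(1, 4), (2, 5), (3, 6), (11, 14), (12, 15), (13, 16)]


def build_adjacency(num_joints, bones, symmetric_pairs):
    """Return a 0/1 adjacency matrix with self-loops, bone edges and symmetry edges."""
    adj = [[0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        adj[i][i] = 1  # each joint is connected to itself
    for a, b in bones + symmetric_pairs:
        adj[a][b] = 1  # undirected edge, so set both directions
        adj[b][a] = 1
    return adj


ADJ = build_adjacency(NUM_JOINTS, BONES, SYMMETRIC_PAIRS)
```

In practice such a matrix is usually normalized before being used in a graph convolution; that step is left out here because the text does not describe it.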

Further, the attention block is used for capturing global information of the human skeleton. The global information means that each joint point establishes a connection with all the joint points, and each joint point has global perception on the whole skeleton.

Further, the multi-layer perceptron module with the U-shaped structure consists of a 3-layer fully connected network. The first fully connected layer halves the channel dimension of its input, the second layer keeps the channel dimension unchanged, and the third layer raises the output channel dimension back to that of the input. Shortcut connections are kept wherever the dimensions match: between the input of the first layer and the output of the third layer, and between the input and the output of the second layer.

Further, the regression head module comprises a 2-layer fully connected network for regressing the high-dimensional characteristics to specific joint point coordinates.

A three-dimensional human body pose estimation system based on graph and attention interleaving, comprising:

the preprocessing unit is used for acquiring training images in the three-dimensional human body posture estimation dataset and extracting two-dimensional skeleton information from the input training images by adopting a two-dimensional human body posture detector;

the model training unit is used for mapping the extracted two-dimensional skeleton information into a high-dimensional space by using a skeleton embedding module, capturing local and global information of the skeleton by using a graph and attention interleaving network module, capturing multi-level information of the skeleton by using a multi-layer perceptron module with a U-shaped structure, and finally regressing the high-dimensional features by using a regression head module to obtain a three-dimensional skeleton, and for training a three-dimensional human body posture estimation model by using the mean square error of the joints as the loss function for supervised learning of three-dimensional human body posture estimation;

the three-dimensional human body posture estimation unit is used for extracting two-dimensional skeleton information of a human body in an image to be estimated by using a pre-trained two-dimensional posture detector, inputting the extracted two-dimensional skeleton information sequentially into the trained skeleton embedding module, graph and attention interleaving network module, multi-layer perceptron module with the U-shaped structure and regression head module, and outputting the three-dimensional human body posture estimation result.

The beneficial effects of the invention are as follows:

By innovatively combining the graph convolution network and the attention mechanism, the invention addresses the problem that existing network structures mine the local and global information of the human skeleton insufficiently. In the graph and attention interleaving network module, the graph convolution block, which incorporates the topology prior of the human skeleton, captures local information of the human skeleton, and the attention block captures global information of the human skeleton. The two blocks communicate with each other so that their strengths are complementary, which enhances the model's perception of both the local and the global structure of the skeleton. Further, the multi-layer perceptron module with the U-shaped structure captures the multi-level information contained in the skeleton.

The effect diagram of the invention is shown in Fig. 2. It can be seen that the invention accurately estimates the three-dimensional human body postures corresponding to various complex human actions. Compared with the MGCN method (Zhiming Zou and Wei Tang, "Modulated graph convolutional network for 3D human pose estimation," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021, pp. 11477-11487.), the method of the invention estimates results that are closer to the true three-dimensional posture. The invention can be introduced into target recognition systems and human-computer interaction systems to realize more complete intelligent monitoring technology.

Drawings

Fig. 1 is a flow chart of a three-dimensional human body posture estimation method based on graph and attention interleaving of the present invention.

Fig. 2 is a three-dimensional human body posture estimation effect diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are merely some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.

Fig. 1 is a flowchart of a three-dimensional human body posture estimation method based on graph and attention interleaving of the present invention, comprising the steps of:

Step 1: Input the training set images and the corresponding data labels. In actual training, the input image data is usually fed in batches so that the model parameters are optimized stably during training.

Step 2: Extract the human body posture in the input training image with a two-dimensional posture detector. In this embodiment, the two-dimensional human body posture of the image in step 1 is estimated with the existing CPN method (Chen Y, Wang Z, Peng Y, et al. Cascaded pyramid network for multi-person pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 7103-7112), giving the two-dimensional joint coordinates as an array of shape N × 2, where N is the number of joints, set to 17.

Step 3: Encode the two-dimensional key point coordinates obtained in step 2 with the skeleton embedding module (consisting of a multi-layer fully connected network) to obtain a high-dimensional vector, where the number of channels C is set to 512.

Step 4: The graph and attention interleaving network module obtains local and global information of the skeleton. The invention combines the graph convolution network with the attention mechanism and provides two guiding strategies (Graph2Attention and Attention2Graph), so that graph convolution and the attention mechanism can better learn the representation of the human skeleton.

The graph-to-attention (Graph2Attention) strategy guides the attention block to learn the topology prior of the human skeleton. The skeleton information $f_{graph}$ captured by the graph convolution block is injected into the attention block.

In the corresponding calculation, $s_{G2A}$ is the scaling factor for $f_{graph}$; Softmax is an activation function that normalizes a numerical vector into a probability distribution vector; Q, K and V are the query matrix, key matrix and value matrix of the attention mechanism, respectively; d is the dimension of Q, K and V; and $X_{G2A}$ denotes the result of introducing the local information from the graph convolution into the matrix product in the attention block. Guided by the skeleton information from the graph convolution block, the attention mechanism becomes better at capturing information related to the human skeleton.
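The G2A formula itself is not reproduced in this text, so the sketch below is only one plausible reading of the symbol definitions. It assumes, by analogy with the A2G formula, that the scaled graph feature is added to the usual scaled dot-product attention output, i.e. X_G2A = Softmax(QKᵀ/√d)·V + s_G2A·f_graph. The function names are hypothetical and the original formula may combine the terms differently.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        row_max = max(row)
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def graph_to_attention(q, k, v, f_graph, s_g2a):
    """Assumed G2A form: attention output plus scaled graph-convolution feature."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[value / math.sqrt(d) for value in row] for row in matmul(q, k_t)]
    attended = matmul(softmax_rows(scores), v)  # Softmax(QK^T / sqrt(d)) V
    # Inject local skeleton information from the graph convolution block.
    return [
        [a + s_g2a * g for a, g in zip(a_row, g_row)]
        for a_row, g_row in zip(attended, f_graph)
    ]
```

With all-zero queries and keys the attention weights are uniform, so each output row is the column mean of V shifted by the scaled graph feature, which makes the contribution of `s_g2a` easy to inspect.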

The attention-to-graph (Attention2Graph) strategy remedies the weakness of the graph convolution block in capturing global dependencies. The global information of the human skeleton $f_{global}$ captured by the attention block is fed back to the graph convolution block, so that the graph convolution block has a better understanding of the global associations of the skeleton. The specific calculation formula is as follows:

$$X_{A2G} = G_1 + s_{A2G} \cdot f_{global},$$

where $G_1$ denotes the first graph convolution layer in the graph convolution block, $s_{A2G}$ is the scaling factor for the global information $f_{global}$ of the human skeleton, and $X_{A2G}$ denotes the result of introducing the global information from the attention block into $G_1$. In this way, the graph convolution block perceives the global information of the skeleton better.

Under the interleaved guidance of complementary information from each other, the perception capabilities of both the graph convolution block and the attention block are enhanced. Finally, the outputs of the graph convolution block and the attention block are added. The calculation formula can be expressed as:

$$X_{IGA} = G_2(X_{A2G}) + \mathrm{Proj}(X_{G2A}),$$

where $G_2(\cdot)$ denotes the second graph convolution layer in the graph convolution block, $\mathrm{Proj}(\cdot)$ is a projection head comprising 2 linear layers, and $X_{IGA}$ denotes the sum of the outputs of the graph convolution block and the attention block under the guidance of the complementary information.
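A minimal sketch of the A2G step and the final sum is given below. The text does not specify the form of the graph convolution layer, so `graph_conv` assumes a simple "aggregate over the adjacency matrix, then apply a weight matrix" layer. The projection head is written as two bias-free linear layers with no activation. Both choices, and all function and parameter names, are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    """Element-wise a + scale * b for two matrices of equal shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def graph_conv(x, adj, weight):
    """Assumed graph convolution: neighbor aggregation followed by a linear map."""
    return matmul(matmul(adj, x), weight)


def attention_to_graph(g1_out, f_global, s_a2g):
    """X_A2G = G_1 + s_A2G * f_global."""
    return add_scaled(g1_out, f_global, s_a2g)


def interleaved_output(g1_out, f_global, s_a2g, x_g2a, adj, w_g2, w_proj1, w_proj2):
    """X_IGA = G_2(X_A2G) + Proj(X_G2A)."""
    x_a2g = attention_to_graph(g1_out, f_global, s_a2g)
    graph_branch = graph_conv(x_a2g, adj, w_g2)                 # second graph conv layer
    attention_branch = matmul(matmul(x_g2a, w_proj1), w_proj2)  # 2-layer projection head
    return add_scaled(graph_branch, attention_branch)
```

Passing identity matrices for `adj` and the weights reduces the block to `g1_out + s_a2g * f_global + x_g2a`, which is a convenient sanity check of the wiring.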

Step 5: Input the features obtained in step 4 into the multi-layer perceptron module with the U-shaped structure to further extract multi-level information of the skeleton. The module performs down-sampling and up-sampling along the channel dimension. The input $X_{IGA}$ first passes through a down-sampling projection layer that halves the channel dimension, giving $X_{down}$; then through an intermediate layer that keeps the channel dimension unchanged, giving $X_{mid}$; and finally through an up-sampling projection layer that doubles the channel dimension, giving $X_{up}$. The specific formulas are as follows:

$$X_{down} = \mathrm{MLP}_{down}(\mathrm{LN}(X_{IGA})),$$

$$X_{mid} = \mathrm{MLP}_{mid}(X_{down}) + X_{down},$$

$$X_{up} = \mathrm{MLP}_{up}(X_{mid}) + X_{IGA},$$

where each $\mathrm{MLP}(\cdot)$ is an MLP block containing one linear layer, and LN denotes layer normalization. Through down-sampling and up-sampling in the channel dimension, the semantic information contained in the skeleton is effectively captured.
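The three formulas above can be traced with the following sketch. It treats each MLP block as a single linear layer with a bias and omits any activation or dropout, since the text does not mention them. The layer normalization here has no learnable scale or shift. The `params` dictionary and the function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize each row (one joint's channels) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((value - mean) ** 2 for value in row) / len(row)
        out.append([(value - mean) / math.sqrt(var + eps) for value in row])
    return out


def linear(x, weight, bias):
    """x: N x C_in, weight: C_in x C_out, bias: C_out."""
    return [
        [sum(xi * w for xi, w in zip(row, col)) + b for col, b in zip(zip(*weight), bias)]
        for row in x
    ]


def add(a, b):
    """Element-wise sum used for the shortcut connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def u_shaped_mlp(x_iga, params):
    """params maps 'down', 'mid', 'up' to (weight, bias) pairs (hypothetical layout)."""
    x_down = linear(layer_norm(x_iga), *params["down"])  # C -> C/2
    x_mid = add(linear(x_down, *params["mid"]), x_down)  # C/2 -> C/2, inner shortcut
    x_up = add(linear(x_mid, *params["up"]), x_iga)      # C/2 -> C, outer shortcut
    return x_up
```

The outer shortcut requires the output of the up-sampling layer to have the same channel count as the module input, which is why the third layer restores the dimension halved by the first.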

Step 6: Regress the features obtained in step 5 with the regression head module, a two-layer fully connected network, to obtain the predicted three-dimensional joint coordinates.

Step 7: Compute the error between the three-dimensional skeleton predicted in step 6 and the ground-truth three-dimensional posture using the mean square error loss over the joints, and use it to train the skeleton embedding module, the graph and attention interleaving network module, the multi-layer perceptron module with the U-shaped structure and the regression head module.

In the definition of the mean square error, N = 17 is the number of joints, $J_i$ is the ground-truth coordinates of the i-th three-dimensional joint, and $X_i$ is the predicted coordinates of the i-th three-dimensional joint.
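The loss formula is not reproduced in this text. The sketch below assumes the standard per-joint form, the squared Euclidean distance between each predicted joint and its ground truth averaged over the N joints. The original may differ in detail, for example by using the unsquared distance. The function name is hypothetical.

```python
def joint_mse_loss(predicted, ground_truth):
    """Assumed form: (1/N) * sum_i ||J_i - X_i||^2 over N three-dimensional joints."""
    num_joints = len(ground_truth)
    total = 0.0
    for x_i, j_i in zip(predicted, ground_truth):
        # Squared Euclidean distance between predicted and true joint i.
        total += sum((x - j) ** 2 for x, j in zip(x_i, j_i))
    return total / num_joints
```

For example, if one of two joints is off by 2 along a single axis and the other is exact, the loss is (0 + 4) / 2 = 2.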

Step 8: Perform three-dimensional human body posture estimation on the image to be estimated. First, two-dimensional skeleton information of the human body is extracted from the image to be estimated with the two-dimensional posture detector. The extracted two-dimensional information is then passed through the skeleton embedding module, the graph and attention interleaving network module, the multi-layer perceptron module with the U-shaped structure and the regression head module, and the three-dimensional human body posture estimation result is output.
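The inference path of step 8 is a straight composition of the trained modules, which can be sketched as below. Every argument is a placeholder callable standing in for the corresponding module. The function and parameter names are hypothetical and do not come from the text.

```python
def estimate_3d_posture(image, detector_2d, embed, interleave, u_mlp, head):
    """Chain the modules in the order given by the method."""
    skeleton_2d = detector_2d(image)  # N x 2 joint coordinates
    features = embed(skeleton_2d)     # skeleton embedding to a high-dimensional space
    features = interleave(features)   # graph and attention interleaving module
    features = u_mlp(features)        # U-shaped multi-layer perceptron module
    return head(features)             # regression head: N x 3 joint coordinates
```

Only the two-dimensional skeleton crosses the boundary between the detector and the lifting network, so the detector can be swapped without changing the rest of the chain.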

The effect diagram of the invention is shown in fig. 2, and it can be seen that the invention can realize accurate three-dimensional human body posture estimation for various human body actions.

Based on the same inventive concept, another embodiment of the present invention is a three-dimensional human body posture estimation system based on graph and attention interleaving, comprising:

the three-dimensional human body posture estimation unit is used for extracting two-dimensional skeleton information of a human body in an image by using a pre-trained two-dimensional posture detector, inputting the extracted two-dimensional skeleton information sequentially into the trained skeleton embedding module, graph and attention interleaving network module, multi-layer perceptron module with the U-shaped structure and regression head module, and outputting the three-dimensional human body posture estimation result.

Wherein the specific implementation of each unit and each module is referred to the previous description of the method of the invention.

Based on the same inventive concept, another embodiment of the present invention provides a computer device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps in the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.

The foregoing examples are merely illustrative of the present invention, and although the preferred embodiments of the present invention and the accompanying drawings have been disclosed for illustrative purposes, it will be understood by those skilled in the art that: various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the present invention should not be limited to the preferred embodiments and the disclosure of the drawings.

Claims

1. A three-dimensional human body posture estimation method based on graph and attention interleaving comprises the following steps:

mining local and global information of the skeleton contained in the high-dimensional vector obtained by the skeleton embedding module by using a network module of graph and attention interleaving;

using the mean square error of the joint point as a loss function of supervised learning, and performing supervised learning on the three-dimensional skeleton estimated by the regression head module to train a three-dimensional human body posture estimation model;

and taking the two-dimensional skeleton information extracted from an image to be estimated by the two-dimensional posture detector as the input of the trained three-dimensional posture estimation model, passing it sequentially through the skeleton embedding module, the graph and attention interleaving network module, the multi-layer perceptron module with the U-shaped structure and the regression head module, and finally outputting the three-dimensional human body posture estimation result.

2. The method of claim 1, wherein the graph and attention interleaving network module comprises a graph convolution block and an attention block, combines the advantages of graph convolution and the attention mechanism in capturing local and global features of a human skeleton, and allows communication between the two to enhance the model's ability to model the skeleton.

3. The method of claim 2, wherein the graph convolution block incorporates a topology prior of the human skeleton for capturing local information of the human skeleton; the local information means that each key point attends to its adjacent nodes, while its connections to distant nodes tend to be ignored.

4. The method of claim 3, wherein the topology prior of the human skeleton means that, in the adjacency matrix representing the human skeleton, each joint is connected not only to itself and to its adjacent joints, but also to its symmetric counterpart in the skeleton, and the inherent characteristics of the skeleton structure are characterized by means of the adjacency matrix.

5. The method of claim 2, wherein the attention block is used to capture global information of a human skeleton; the global information means that each joint point establishes a connection with all the joint points, and each joint point has global perception on the whole skeleton.

6. The method of claim 1, wherein the multi-layer perceptron module with the U-shaped structure consists of a 3-layer fully connected network; the output of the first fully connected layer is halved in the channel dimension compared with its input, the input and output of the second fully connected layer keep the channel dimension unchanged, and the third fully connected layer raises the channel dimension of the output to be consistent with the input; shortcut connections are kept wherever the dimensions match, namely between the input of the first fully connected layer and the output of the third fully connected layer, and between the input and the output of the second fully connected layer.

7. The method of claim 1, wherein the skeleton embedding module comprises a multi-layer fully connected network that progressively maps two-dimensional input to a high-dimensional space; the regression head module comprises a 2-layer fully connected network and is used for regressing high-dimensional characteristics to specific joint point coordinates.

8. A three-dimensional human body pose estimation system based on graph and attention interleaving, comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.