CN112926475B - Human body three-dimensional key point extraction method - Google Patents
- Publication number
- CN112926475B (application CN202110251506.XA)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- key point
- dimensional key
- human body
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a human body three-dimensional key point extraction method, applied to the field of human body three-dimensional key point detection and aimed at the poor estimation accuracy of the prior art. First, human body action behavior data are collected from two viewing angles; then a dual-branch multi-stage structure is used to detect two-dimensional key point confidence maps of the human body in the data from each view; a three-dimensional key point generation model is further established; finally, the two-dimensional key point confidence maps detected from the human body action behavior data to be measured are input into the three-dimensional key point generation model to obtain the three-dimensional key point coordinates. The method can effectively improve the estimation accuracy of human body three-dimensional key points.
Description
Technical Field
The invention belongs to the field of image processing, and particularly relates to a three-dimensional key point detection technology.
Background
Human motion capture technology is now widely used in health monitoring, film and television production, and similar applications: the motion of a virtual character is rendered more realistically according to the reproduced real motion of a human body, and human key point detection is the basis for realizing such motion reproduction. Depending on whether the detection result contains three-dimensional depth information, the task divides into two-dimensional and three-dimensional key point detection. Two-dimensional key point detection has been studied extensively, but occlusion and changes in light and shadow easily cause false and missed detections, which degrades detection accuracy.
Current three-dimensional key point detection methods fall mainly into two types. The first detects three-dimensional key points directly from an image: the Chinese invention patent "A combined target classification and three-dimensional attitude estimation method based on a residual network" (CN 108280481A) performs key point feature extraction and classification based on the residual network ResNet-50 to realize three-dimensional key point detection. The second acquires two-dimensional key point coordinates from an image and generates three-dimensional coordinates from them: the Chinese patent "A method for estimating human body three-dimensional posture based on structural information" (CN 110427877A) first inputs a monocular RGB image into a two-dimensional pose detector to obtain the two-dimensional key point coordinates, then constructs a graph convolution network based on the structural information of the two-dimensional key points and outputs the three-dimensional key point coordinates.
The existing three-dimensional key point detection methods have the following defects: 1) methods that detect three-dimensional key point coordinates directly from images usually depend on additional parameters, such as the camera projection matrix, which are usually not annotated in video data; 2) direct three-dimensional key point labeling is difficult; the existing training data come almost entirely from motion-capture systems, whose scenes and subjects are limited, so the generalization ability of the trained models is limited; 3) three-dimensional key point detection in video is generally carried out frame by frame, processing each frame as a static image and ignoring the temporal information between consecutive frames and the action changes between adjacent frames.
Disclosure of Invention
To solve the above technical problems, the invention provides a human body three-dimensional key point extraction method: features are first extracted from the original images of the two views and two-dimensional key point coordinates are preliminarily generated through two-dimensional key point confidence detection; at the same time a three-dimensional key point generation model is established, and the three-dimensional key point coordinates are predicted cooperatively from the two views, improving the detection accuracy of human body three-dimensional key points.
The technical scheme adopted by the invention is as follows: a human body three-dimensional key point extraction method comprises the following steps:
s1, collecting human body action behavior data by adopting double visual angles;
s2, detecting the confidence maps of the two-dimensional key points of the human body by adopting a double-branch multi-stage structure;
s3, establishing a three-dimensional key point generation model;
and S4, processing the human body action behavior data to be detected acquired in the step S1 in the step S2 to obtain a corresponding two-dimensional key point confidence map, and inputting the two-dimensional key point confidence map into the three-dimensional key point generation model established in the step S3 to obtain three-dimensional key point coordinates.
The step S1 specifically comprises the following steps: two cameras are adopted and marked as a camera A and a camera B, human action behavior data acquisition is carried out simultaneously, and synchronous frame sampling is carried out on acquired video data.
The dual-branch multi-stage structure in step S2 is specifically: the upper branch learns the key point positions in camera A and the lower branch learns the key point positions in camera B; both branches comprise several stages, where stage 1 uses three layers of 3 × 3 convolution and two layers of 1 × 1 convolution, and the remaining stages use five layers of 7 × 7 convolution and two layers of 1 × 1 convolution.
The method also comprises extracting original image features with one layer of three-dimensional CNN; the input of the first stage is the original image features extracted by this layer, and the input of each subsequent stage is those original image features together with the confidence map prediction result of the previous stage.
The layer of three-dimensional CNN is used for extracting image features of the current frame and frames before and after the current frame.
The three-dimensional CNN convolution kernel size of this layer is 3 × 3 × 3.
S3, the three-dimensional key point generation model is specifically: three convolutional layers and one fully connected layer, with a sigmoid function as the output unit and the ReLU function as the activation function of the convolutional layers.
The method further comprises weakly supervised training of the three-dimensional key point generation model, specifically: the optimization target is to minimize the loss function; back-propagation weight training is performed by gradient descent, and the parameters of the three-dimensional key point generation model are updated iteratively.
The loss function expression is:

Loss = L_D + L_TD + f

where L_D represents the distance-error loss function, L_TD represents the inter-frame error loss function, and f represents the two-dimensional confidence loss function of the whole dual-branch multi-stage structure.
The invention has the following beneficial effects. Video data of two views are obtained with ordinary cameras, features are extracted by a CNN, and a dual-branch multi-stage structure detects the two-dimensional key point confidence maps of the human body in the data of the two views; a three-dimensional CNN model is designed to generate three-dimensional key point coordinates from the two-dimensional key point confidence maps detected in the action behavior data to be measured, and the model is trained with weak supervision through the combination of the two views. The method of the invention has the following advantages:
1) The three-dimensional convolutional neural network simultaneously extracts the spatial features and trajectory information of the key points and the temporal features of the whole activity, and uses inter-frame correlation to reduce the error of the two-dimensional key point confidence maps.
2) The method generates three-dimensional coordinates from the detected two-dimensional confidence maps, which reduces the error introduced when converting confidence maps into two-dimensional key point coordinates. A loss-function training scheme built on the consistency of the two views in time and human pose overcomes the lack of three-dimensional key point annotations, repairs the effect of missed key point detections in a single view, and markedly improves the estimation accuracy of human body three-dimensional key points.
Drawings
FIG. 1 is a flow chart of three-dimensional keypoint coordinate estimation;
FIG. 2 is a two-branch multi-stage structure provided by the present invention;
fig. 3 is a schematic diagram of three-dimensional coordinate generation.
Detailed Description
In order to facilitate understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
As shown in fig. 1, the method of the present invention mainly includes the steps of video data acquisition, two-dimensional key point detection, three-dimensional key point coordinate generation, etc., and the specific steps are as follows:
1. data acquisition
The invention uses two ordinary cameras, camera A and camera B, whose viewing directions form a 90-degree angle, to collect human body action data simultaneously. Synchronous frame sampling is performed on the collected video data at 30 Hz, i.e. 30 frames are sampled per second, and the size of each frame is w × h pixels.
The azimuth angles of the camera a and the camera B are not limited to 90 degrees, and may be other angles.
2. Video-based three-dimensional keypoint detection
The invention performs three-dimensional key point detection by combining a three-dimensional CNN (Convolutional Neural Network) with a two-layer CPM (Convolutional Pose Machine) network. The network is designed as a multi-stage dual-view branch structure, as shown in Fig. 2; each branch corrects its confidence maps over multiple stages, and every stage is trained with supervision, which avoids the difficulty of optimizing an overly deep network. Stage 1 (block1 in Fig. 2) uses three layers of 3 × 3 convolution and two layers of 1 × 1 convolution, while the remaining stages (block2 in Fig. 2) use five layers of 7 × 7 convolution and two layers of 1 × 1 convolution. The upper branch learns the key point positions in view A, expressed as confidence maps, i.e. the probability that a key point lies at a given image location; the lower branch learns the key point positions in view B. Finally, the two branches cooperate to predict the three-dimensional key points. The letter C inside the boxes in Fig. 2 denotes convolution.
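As a rough sanity check on this stage design, the receptive field of each stage can be computed from the kernel sizes. This is a sketch assuming stride-1 convolutions and no dilation; the patent does not state the layer strides:

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer adds (k - 1) pixels of context."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# stage 1: three 3x3 layers plus two 1x1 layers
print(receptive_field([3, 3, 3, 1, 1]))        # 7
# later stages: five 7x7 layers plus two 1x1 layers
print(receptive_field([7, 7, 7, 7, 7, 1, 1]))  # 31
```

The 1 × 1 layers add nothing to the receptive field, while the five 7 × 7 layers in the later stages widen it considerably; that is what lets later stages use longer-range context when correcting the confidence maps.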
The specific process is as follows:
1) Two-dimensional confidence map detection
Images of size w × h × T collected from view A and view B, where T denotes the number of image frames, are input into the dual-branch multi-stage network. First, a single three-dimensional CNN layer extracts image features F_A and F_B of each frame together with its preceding and following frames; all convolution kernels are of size 3 × 3 × 3 so as to extract temporal features across three frames, and the data size after convolution is (w − 2) × (h − 2) × T. F_A and F_B correspond to the two branches and serve as the input of the first stage. Based on F_A and F_B, the first stage of the network generates two sets of detection confidence maps S_A^1 = {S_{A,j}^1} and S_B^1 = {S_{B,j}^1}, where S_{A,j}^1 ∈ R^{w×h} is a w × h real-valued matrix representing the confidence map of the j-th key point in view A at the first stage, j ∈ {1 … J}, and J is the number of key points; S_{B,j}^1 is defined analogously for view B.
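The (w − 2) × (h − 2) × T figure quoted above follows from the standard convolution output-size formula. A minimal check, assuming no spatial padding and one frame of temporal padding (the frame dimensions here are hypothetical):

```python
def conv_output_size(n, k, pad=0, stride=1):
    """Output length of a convolution: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

w, h, T = 64, 48, 30   # hypothetical frame width, height, and clip length
k = 3                  # the 3x3x3 feature-extraction kernel

print(conv_output_size(w, k))         # 62  (= w - 2, no spatial padding)
print(conv_output_size(h, k))         # 46  (= h - 2)
print(conv_output_size(T, k, pad=1))  # 30  (temporal padding keeps T frames)
```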
Each subsequent stage is similar to the first: it takes as input the confidence map prediction of the previous stage together with the original image features F_A and F_B, and produces a more accurate prediction. Let g_A^n and g_B^n denote the network structure of the n-th stage (it should be noted by those skilled in the art that the network structure here is equivalent to a processing function); the sets S_A^n and S_B^n obtained at the n-th stage are

S_A^n = g_A^n(F_A, S_A^{n−1}),  S_B^n = g_B^n(F_B, S_B^{n−1}),  n ≥ 2

where S_{A,j}^n denotes the confidence map of the j-th key point at the n-th stage in view A, j ∈ {1 … J}, J is the number of key points, n ∈ {1 … N}, N is the number of model stages; S_{B,j}^n denotes the confidence map of the j-th key point at the n-th stage in view B.
The confidence map loss function for stage n is calculated as follows:

f_n = Σ_{j=1}^{J} Σ_p W(p) · ( ‖S_{A,j}^n(p) − S_{A,j}^*(p)‖_2^2 + ‖S_{B,j}^n(p) − S_{B,j}^*(p)‖_2^2 )

where p ranges over the pixel locations in the image, S_{A,j}^n is the predicted confidence map of the j-th key point at the n-th stage in view A and S_{A,j}^* is the ground-truth confidence map of the j-th key point in view A; S_{B,j}^n and S_{B,j}^* are the corresponding predicted and ground-truth maps in view B. W is a binary mask matrix used to reduce errors caused by missing label values: W(p) = 0 when the label at position p is missing, and W(p) = 1 otherwise.
The overall two-dimensional confidence loss function is expressed as:

f = Σ_{n=1}^{N} f_n

where N denotes the total number of stages of the network structure. The key point confidence maps are then parsed by a greedy algorithm, and the two-dimensional key point coordinates of the human body are output.
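The patent does not detail the greedy parsing of the confidence maps; for a single person it reduces to taking the per-map argmax. A minimal sketch (the array shapes are assumptions):

```python
import numpy as np

def keypoints_from_confidence_maps(maps):
    """maps: (J, h, w) array, one confidence map per key point.
    Returns (J, 2) integer (x, y) coordinates and the peak confidences."""
    J = maps.shape[0]
    coords = np.zeros((J, 2), dtype=int)
    confs = np.zeros(J)
    for j in range(J):
        # unravel the flat argmax index into (row, col) = (y, x)
        y, x = np.unravel_index(np.argmax(maps[j]), maps[j].shape)
        coords[j] = (x, y)
        confs[j] = maps[j, y, x]
    return coords, confs

# toy check: one 5x5 map whose single peak sits at (x=3, y=1)
m = np.zeros((1, 5, 5))
m[0, 1, 3] = 0.9
coords, confs = keypoints_from_confidence_maps(m)
print(coords[0], confs[0])  # [3 1] 0.9
```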
Because of occlusion, lighting, and similar problems in the image, the two-dimensional key point coordinates detected in the above steps may contain missed detections. A threshold th = 0.4 is set: if the maximum probability value of a key point in the confidence map generated at the last stage does not exceed th, that point is judged a missed detection. Missed key points are then repaired across the two views: if a key point is judged missing in view A but successfully detected in view B, its coordinates in view A are set consistent with those in view B; conversely, if it is missing in view B but detected in view A, its coordinates in view B are set consistent with those in view A.
The threshold th = 0.4 is a compromise. Missed detections are filtered by the threshold: if it is too large, the filtering becomes more complex and repair is hindered; if it is too small, some missed detections may be overlooked and the error grows. The value therefore balances detection accuracy against algorithm efficiency.
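The cross-view repair rule described above can be sketched directly; the coordinate and confidence array layouts are assumptions:

```python
import numpy as np

def repair_missed_keypoints(coords_a, confs_a, coords_b, confs_b, th=0.4):
    """A key point whose peak confidence does not exceed th is treated as a
    missed detection and copied from the other view when that view detected
    it successfully. Operates on copies; returns the repaired coordinates."""
    coords_a, coords_b = coords_a.copy(), coords_b.copy()
    for j in range(len(confs_a)):
        missed_a = confs_a[j] <= th
        missed_b = confs_b[j] <= th
        if missed_a and not missed_b:
            coords_a[j] = coords_b[j]
        elif missed_b and not missed_a:
            coords_b[j] = coords_a[j]
    return coords_a, coords_b

ca = np.array([[0, 0], [10, 12]]); cfa = np.array([0.2, 0.9])
cb = np.array([[5, 6], [11, 13]]); cfb = np.array([0.8, 0.7])
ra, rb = repair_missed_keypoints(ca, cfa, cb, cfb)
print(ra[0], rb[1])  # [5 6] [11 13]  (key point 0 in view A repaired from B)
```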
2) Three-dimensional coordinate generation:
The two-dimensional confidence map sets S_A^N and S_B^N obtained at the last stage are input into a three-dimensional CNN, which extracts information from neighboring key points and outputs the three-dimensional coordinates of the corresponding key points; the network structure is shown in Fig. 3. The input size is w × h × J, where J represents the number of key points. The CNN comprises three convolutional layers, each with kernels of size 3 × 3 × 3. The ReLU function is used as the activation function of the convolutional layers to introduce a nonlinear mapping into the model; the ReLU activation function is:
ReLU(x)=max(0,x)
where x represents a function argument.
The convolutional layers extract the spatial features among the key points through convolution operations, and the convolution result is input into a fully connected Dense(J) layer, where J is the number of regression targets (i.e. the number of human body key points); in the invention J = 17. Finally, a sigmoid function is taken as the output unit:

sigmoid(x) = 1 / (1 + e^(−x))

The output result is inverse-normalized to obtain the corresponding three-dimensional coordinate points.
3) Coordinate point processing
The world coordinate system is fixed at camera A, and the output of the view-B branch is converted into world coordinates. The initial results P_A^t and P_B^t denote the outputs in the respective camera coordinate systems of view A and view B. The transfer matrix from the camera-B coordinate system to the world coordinate system is

T = [ R  b ]
    [ 0  1 ]

where R is an orthogonal rotation matrix constructed from (r_x, r_y, r_z), the angular deviation of camera B relative to camera A, and b = (b_x, b_y, b_z) is the position of camera B in the world coordinate system; the view-B output in world coordinates is therefore Q_B^t = R · P_B^t + b.

After this conversion of the view-B branch output, the final results of the two branches are Q_A^t and Q_B^t, the human body key point coordinates output by view A and view B in the world coordinate system, where t ∈ {1 … T} indexes the image frames; the model outputs at the same moment should be consistent.
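The camera-B-to-world conversion can be sketched with numpy. The Euler-angle composition order (Rz · Ry · Rx) is an assumption, since the patent only says R is an orthogonal rotation matrix built from (r_x, r_y, r_z):

```python
import numpy as np

def rotation_from_euler(rx, ry, rz):
    """Orthogonal rotation matrix R = Rz @ Ry @ Rx (composition order assumed)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def camera_b_to_world(points, rx, ry, rz, b):
    """points: (J, 3) coordinates in the camera-B frame; b: camera-B position
    in the world frame (fixed at camera A). Returns world-frame coordinates."""
    R = rotation_from_euler(rx, ry, rz)
    return points @ R.T + np.asarray(b)

# camera B rotated 90 degrees about the vertical axis and placed 2 m to the
# side: a point 1 m in front of camera B lands at x = 3 m in the world frame
p = np.array([[0.0, 0.0, 1.0]])
q = camera_b_to_world(p, 0.0, np.pi / 2, 0.0, [2.0, 0.0, 0.0])
print(q)  # approximately [[3, 0, 0]], up to floating-point error
```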
4) Weakly supervised model training
The three-dimensional key point model loss function comprises a distance-error loss and an inter-frame loss. The distance-error loss function is defined as

L_D = Σ_{t=1}^{T} ‖Q_A^t − Q_B^t‖_2

where T denotes the number of frames simultaneously input to the network and ‖·‖_2 denotes the Euclidean distance.
The inter-frame error loss function is

L_TD = Σ_{t=2}^{T} ( ‖Q_A^t − Q_A^{t−1}‖_2 + ‖Q_B^t − Q_B^{t−1}‖_2 )

The final joint loss function is defined as the sum of the last-stage loss of the two-dimensional key point branches and the two losses of the three-dimensional key point branch:
Loss = L_D + L_TD + f
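The joint loss above can be sketched numerically. The exact summation structure (e.g. whether the inter-frame term covers both views) is an assumption based on the definitions given:

```python
import numpy as np

def distance_error_loss(Qa, Qb):
    """L_D: Euclidean distance between the two views' world-frame key points,
    summed over frames and key points. Qa, Qb: (T, J, 3) arrays."""
    return np.linalg.norm(Qa - Qb, axis=-1).sum()

def interframe_loss(Q):
    """L_TD term for one view: distance between consecutive-frame outputs."""
    return np.linalg.norm(Q[1:] - Q[:-1], axis=-1).sum()

def joint_loss(Qa, Qb, f):
    """Loss = L_D + L_TD + f, with the inter-frame term applied to both views."""
    return distance_error_loss(Qa, Qb) + interframe_loss(Qa) + interframe_loss(Qb) + f

# toy check: two frames, one key point; the views disagree by 1 m in every frame
Qa = np.zeros((2, 1, 3))
Qb = np.zeros((2, 1, 3)); Qb[:, :, 0] = 1.0
print(joint_loss(Qa, Qb, 0.5))  # 2.5  (L_D = 2, L_TD = 0, f = 0.5)
```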
the goal of the model training is to minimize the loss function, perform back propagation weight training by using a gradient descent method, and iteratively update model parameters, which are known to those skilled in the art and specifically refer to weight parameters and bias parameters in the neural network calculation.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Various modifications and alterations will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.
Claims (4)
1. A human body three-dimensional key point extraction method is characterized by comprising the following steps:
s1, collecting human body action behavior data by adopting double visual angles;
s2, detecting two-dimensional key point confidence maps of the human body on the data of the two visual angles by adopting a double-branch multi-stage structure; the double-branch multi-stage structure in step S2 is specifically: the upper branch is used for learning the positions of key points in the camera A, the lower branch is used for learning the positions of key points in the camera B, and the upper branch and the lower branch comprise a plurality of stages, wherein 3 layers of 3 × 3 convolution and two layers of 1 × 1 convolution are adopted in the stage 1, and 5 layers of 7 × 7 convolution and two layers of 1 × 1 convolution are adopted in the rest stages;
the method also comprises extracting original image features with one layer of three-dimensional CNN, where the input of the first stage is the original image features extracted by this layer, and the input of each subsequent stage is the original image features extracted by this layer together with the confidence map prediction result of the previous stage; the three-dimensional CNN extracts image features of the current frame and its preceding and following frames; the size of the three-dimensional CNN convolution kernel of this layer is 3 × 3 × 3;
S3, establishing a three-dimensional key point generation model; in step S3 the three-dimensional key point generation model is specifically: three convolutional layers and one fully connected layer, with a sigmoid function as the output unit and the ReLU function as the activation function of the convolutional layers;
and S4, processing the human body action behavior data to be detected acquired in the step S1 in the step S2 to obtain a corresponding two-dimensional key point confidence map, and inputting the two-dimensional key point confidence map into the three-dimensional key point generation model established in the step S3 to obtain a three-dimensional key point coordinate.
2. The method for extracting three-dimensional key points of a human body according to claim 1, wherein the step S1 specifically comprises: two cameras are adopted and marked as a camera A and a camera B, human action behavior data acquisition is carried out simultaneously, and synchronous frame sampling is carried out on acquired video data.
3. The method for extracting three-dimensional key points of a human body according to claim 2, further comprising performing weak supervision training on the three-dimensional key point generation model, specifically: and the optimization target is a minimum loss function, a gradient descent method is adopted for carrying out back propagation weight training, and the parameters of the three-dimensional key point generation model are updated in an iterative manner.
4. The method for extracting three-dimensional key points of a human body according to claim 3, wherein the loss function expression is:

Loss = L_D + L_TD + f

where L_D represents the distance-error loss function, L_TD represents the inter-frame error loss function, and f represents the two-dimensional confidence loss function of the whole dual-branch multi-stage structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110251506.XA CN112926475B (en) | 2021-03-08 | 2021-03-08 | Human body three-dimensional key point extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112926475A CN112926475A (en) | 2021-06-08 |
CN112926475B true CN112926475B (en) | 2022-10-21 |
Family
ID=76171889
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110251506.XA Expired - Fee Related CN112926475B (en) | 2021-03-08 | 2021-03-08 | Human body three-dimensional key point extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112926475B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113780120A (en) * | 2021-08-27 | 2021-12-10 | 深圳云天励飞技术股份有限公司 | Method, device, server and storage medium for generating human body three-dimensional model |
CN113989283B (en) * | 2021-12-28 | 2022-04-05 | 中科视语(北京)科技有限公司 | 3D human body posture estimation method and device, electronic equipment and storage medium |
CN114299128A (en) * | 2021-12-30 | 2022-04-08 | 咪咕视讯科技有限公司 | Multi-view positioning detection method and device |
CN114757822B (en) * | 2022-06-14 | 2022-11-04 | 之江实验室 | Binocular-based human body three-dimensional key point detection method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299669A (en) * | 2018-08-30 | 2019-02-01 | 清华大学 | Video human face critical point detection method and device based on double intelligent bodies |
CN109635843A (en) * | 2018-11-14 | 2019-04-16 | 浙江工业大学 | A kind of three-dimensional object model classification method based on multi-view image |
CN109949368A (en) * | 2019-03-14 | 2019-06-28 | 郑州大学 | A kind of human body three-dimensional Attitude estimation method based on image retrieval |
CN110544301A (en) * | 2019-09-06 | 2019-12-06 | 广东工业大学 | Three-dimensional human body action reconstruction system, method and action training system |
CN110874865A (en) * | 2019-11-14 | 2020-03-10 | 腾讯科技(深圳)有限公司 | Three-dimensional skeleton generation method and computer equipment |
CN111108507A (en) * | 2017-09-22 | 2020-05-05 | 祖克斯有限公司 | Generating a three-dimensional bounding box from two-dimensional images and point cloud data |
CN111738220A (en) * | 2020-07-27 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Three-dimensional human body posture estimation method, device, equipment and medium |
CN111753747A (en) * | 2020-06-28 | 2020-10-09 | 高新兴科技集团股份有限公司 | Violent motion detection method based on monocular camera and three-dimensional attitude estimation |
2021-03-08: CN application CN202110251506.XA granted as patent CN112926475B (not active: Expired - Fee Related)
Non-Patent Citations (2)
Title |
---|
Xiao Aowen, "A CNN-based three-dimensional human pose estimation method", Journal of Wuhan Institute of Technology, 2019-04-15, Vol. 41, No. 2, pp. 168-172 * |
Zhang Guangpian, "Three-dimensional human body modeling method based on two-dimensional point cloud images", Computer Engineering and Applications, 2020-10-01, Vol. 56, No. 19, pp. 205-215 * |
Also Published As
Publication number | Publication date |
---|---|
CN112926475A (en) | 2021-06-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112926475B (en) | Human body three-dimensional key point extraction method | |
CN110135319B (en) | Abnormal behavior detection method and system | |
CN109684924B (en) | Face liveness detection method and device | |
WO2022002150A1 (en) | Method and device for constructing visual point cloud map | |
CN109684925B (en) | Depth image-based face liveness detection method and device | |
CN111814661A (en) | Human behavior recognition method based on residual-recurrent neural network | |
CN111199207B (en) | Two-dimensional multi-person pose estimation method based on deep residual neural network | |
CN111695457A (en) | Human body posture estimation method based on weak supervision mechanism | |
WO2019136591A1 (en) | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network | |
CN113361542A (en) | Local feature extraction method based on deep learning | |
CN111898566B (en) | Attitude estimation method, attitude estimation device, electronic equipment and storage medium | |
CN113139489A (en) | Crowd counting method and system based on background extraction and multi-scale fusion network | |
CN115376024A (en) | Semantic segmentation method for power accessory of power transmission line | |
CN114724185A (en) | Light-weight multi-person posture tracking method | |
CN114611600A (en) | Self-supervision technology-based three-dimensional attitude estimation method for skiers | |
CN115482523A (en) | Small object target detection method and system of lightweight multi-scale attention mechanism | |
CN116092178A (en) | Gesture recognition and tracking method and system for mobile terminal | |
CN113516232B (en) | Through-wall radar human body posture reconstruction method based on self-attention mechanism | |
CN115393928A (en) | Face recognition method and device based on depth separable convolution and additive angle interval loss | |
CN111523586A (en) | Noise-aware-based full-network supervision target detection method | |
CN112257513B (en) | Training method, translation method and system for sign language video translation model | |
CN113762009B (en) | Crowd counting method based on multi-scale feature fusion and double-attention mechanism | |
CN111274901B (en) | Gesture depth image continuous detection method based on depth gating recursion unit | |
CN111950476A (en) | Deep learning-based automatic river channel ship identification method in complex environment | |
CN116092189A (en) | Bimodal human behavior recognition method based on RGB data and bone data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20221021 |