CN111968208B - Human body animation synthesis method based on human body soft tissue grid model - Google Patents
- Publication number
- CN111968208B (application CN202010645245.5A)
- Authority
- CN
- China
- Prior art keywords
- human body
- soft tissue
- feature
- grid model
- network
- Prior art date
- Legal status
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/60—3D [Three Dimensional] animation of natural phenomena, e.g. rain, snow, water or plants
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a human body animation synthesis method based on a human body soft tissue mesh model. According to the three-dimensional pose and the soft tissue texture, the character in the original image is mapped onto a three-dimensional human soft tissue mesh model, so that the soft tissue mesh model is reconstructed; human body animation is then synthesized by retargeting the reconstructed model. By combining the human soft tissue mesh model with deep learning, the method can accurately capture the motion of human soft tissue, so that the synthesized human body animation exhibits much richer detail.
Description
Technical Field
The invention relates to the field of motion capture algorithms for film and television production and virtual reality, and in particular to a method for synthesizing human body animation with soft tissue motion based on human body mesh model retargeting.
Background
Current optical motion capture systems determine the position of a moving object by relying on expensive and bulky external equipment such as physical markers, lasers and photosensitive sensors, and require tedious post-processing; moreover, because the number of physical markers is limited, such systems can hardly capture precise soft tissue motion.
Current animation retargeting techniques are limited to skeletal animation: motion is described by a skeleton, and the relationship between the human mesh and the skeleton is represented by skinning to obtain the animation, so fine soft tissue motion cannot be captured.
Disclosure of Invention
The invention aims to provide a method for synthesizing human body animation with soft tissue motion based on human body mesh model retargeting, so as to solve the problems that the prior art can hardly capture precise soft tissue motion and requires complex post-processing.
To achieve this, the invention adopts the following technical scheme:
a human body animation synthesis method based on a human body soft tissue grid model comprises the following steps:
inputting the video frames into a feature pyramid network to extract feature graphs containing low-level feature high-resolution and high-level feature high-semantic information;
the feature images are sent to a multi-task cascading module, corresponding areas are distributed on the feature images for different task candidate frames, and the corresponding areas are output to different task branches, so that corresponding feature images are distributed for different task branches;
inputting the allocated feature map to a depth high-resolution representation network for human body two-dimensional posture estimation to generate a two-dimensional posture, and inputting the two-dimensional posture to a three-dimensional posture for human body three-dimensional posture estimation based on a time domain cavity convolution network to predict a human body soft tissue grid model;
human body part segmentation is carried out on the distributed feature graphs through a region convolution neural network for object detection and segmentation, and a soft tissue UV texture graph is regressed through a method of convolving a feature layer of the network, so that a human body surface shape of a human body soft tissue grid model is generated;
reconstructing a human soft tissue network model according to the three-dimensional posture and the human surface shape of the human soft tissue network model;
judging whether the reconstructed human body soft tissue grid model is correct or not by utilizing a human body soft tissue grid model judging pool, correcting, and finally outputting a corrected human body soft tissue grid model;
and fitting the transferred stylized animation roles to the stylized animation roles by using the output human body soft tissue model, and finally outputting the generated human body animation.
Further, inputting video frames into the feature pyramid network to extract feature maps combining low-level high resolution and high-level semantic information comprises the following steps:
for a video frame input into the feature pyramid network, features are extracted with a residual network to generate feature maps of different resolutions, which form a bottom-up path; the top-level feature maps form a top-down path through pooling operations, and after a convolution operation, feature maps of different resolutions are connected by element-wise addition with the pooled adjacent feature map; finally the feature maps are output.
Further, for the feature maps extracted by the feature pyramid, which combine the high resolution of low-level features with the high-level semantic information of high-level features, the original image contains different regions of interest for different tasks; the method processes them with a RoIAlign-based multi-task cascade module, which accurately assigns a feature map to each task's region of interest on the original image, and finally outputs these as inputs to the deep high-resolution representation network for the two-dimensional human pose estimation task and to the region-based convolutional neural network for object detection and segmentation for the human body part segmentation task.
Further, inputting the two-dimensional pose into the temporal dilated convolution network for three-dimensional human pose estimation to predict the three-dimensional pose of the human soft tissue mesh model comprises:
for the two-dimensional joint points output by the deep high-resolution representation network, predicting the three-dimensional pose with a temporal dilated convolution network; in this embodiment the network takes video frames with 34 channels per frame as input, applies a convolution kernel of size W=3 with dilation factor d=1 to output a feature map with C=1024 channels, followed by batch normalization, a ReLU activation function, and dropout regularization; at the same time, B=4 residual blocks in the residual-network style form skip connections, each residual block performing a dilated convolution followed by a convolution with filter size W=1; finally the output of each residual block is connected by element-wise addition with the corresponding regularized output to obtain the three-dimensional pose.
Further, human body part segmentation is performed on the assigned feature map with a region-based convolutional neural network for object detection and segmentation, which classifies human body parts and generates bounding boxes and masks; the features of this network are also output to the UV texture space inference task, so that part segmentation and UV texture inference share features; the part segmentation provides a smaller starting range for the UV texture inference of human pixels on the image.
Further, reconstructing the human soft tissue mesh model from the three-dimensional pose and the human surface shape comprises:
for the three-dimensional pose θ and the human surface shape β, the pose θ determines the base pose of the model and the shape β determines its soft tissue deformation; the reconstructed human soft tissue mesh model is finally output, so that the pixels of the character in the original image are mapped onto the three-dimensional dense human surface model and motion capture of the character is achieved.
Further, for the reconstructed human soft tissue mesh model, judging whether it is correct with the human soft tissue mesh model discriminator pool and correcting it comprises the following steps:
The generated human soft tissue mesh model Θ can be separated into a human surface shape β and a human pose θ, so the discriminators D_θ for the pose θ and D_β for the surface shape β can be trained independently. The discriminator D_θ for the poses θ of the k joints decomposes into k rotation-angle discriminators D_{θ_1}, …, D_{θ_k} plus one overall pose discriminator D_{θ,all}, while the surface shape β has a single discriminator D_β; thus k+2 discriminators are generated in total, with a loss function of the least-squares form

L_adv = Σ_{i=1}^{k+2} E_Θ[(D_i(Θ) − 1)²]
in the above formula, an output value of D (Θ) of 0 or 1,1 represents that the shape of the joint or surface identified by the identifier is reasonable, 0 is unreasonable, wherein D θ The discriminator judges whether the angle of the joint is reasonable or not, D β Judging whether the formed body shape accords with human body structure, and finally judging whether the output model Θ is reasonable; and by minimizing the loss function, provide weakly supervised learning to correct joint angles, character shape.
Further, fitting the output human soft tissue model to the target stylized animation character and finally outputting the generated human body animation comprises the following steps:
fitting the downloaded stylized animation character to the predicted human soft tissue mesh model Θ with a regularization method: first, the animation character is fitted to the pose θ of the model Θ as initialization to determine the base pose; then the shape change is modeled by replacing the character's initialized mesh shape with the soft tissue deformation β of the model Θ, finally generating a plausible animation.
A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above human body animation synthesis method based on a human soft tissue mesh model.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above human body animation synthesis method based on a human soft tissue mesh model.
Compared with the prior art, the invention has the following technical characteristics:
the invention utilizes the feature pyramid network to extract the features with different dimensions, the extracted low-level features ensure the edge information, the high-level features ensure the high semantic information, and rich feature representation is provided for the subsequent tasks. Different feature maps can be simultaneously and accurately mapped back to the original map through the alignment of the region of interest based on the RoIAlign multi-task cascade architecture, different feature maps are distributed to corresponding task branches, the deep supervision and lifting effects of a plurality of tasks are achieved, and the problems that a large amount of post-processing work is needed and soft tissue motion is difficult to accurately capture in the prior art can be effectively solved by utilizing the complementary advantages of the relevant task synergistic effect and different supervision sources.
Drawings
FIG. 1 is a schematic overall flow diagram of the method of the present invention;
FIG. 2 is a diagram of the HRNet network architecture;
FIG. 3 is a schematic diagram of a FPN lateral and top-down connection scheme;
FIG. 4 is a diagram of the three-dimensional temporal dilated convolution network for the video stage.
Detailed Description
The invention provides an end-to-end method for synthesizing human body animation containing soft tissue motion, which abandons the skeleton-only scheme of current animation retargeting and instead retargets a human soft tissue mesh model to the target animation model. Using deep learning, the method learns the skeleton key-point coordinates of the target character's three-dimensional pose θ, and at the same time learns the UV texture of the human soft tissue surface shape β (the coordinates of the continuous points on the soft tissue surface of the human soft tissue mesh model that correspond to the pixels captured in the video frames). The human soft tissue mesh model is then reconstructed from the three-dimensional pose θ and the human surface shape β, and the human animation is generated by retargeting this model.
The invention represents the motion capture with a human mesh model composed of a separable pose θ and human surface shape β, so that discriminators for the surface shape β and the pose θ can be trained independently. The discriminators act as weak supervision, implicitly learning the angular limits of each joint and penalizing implausible body shapes. The temporal dilated convolution network predicts the three-dimensional pose by exploring temporal information in the video, producing more stable predictions and reducing sensitivity to noise. The invention first roughly estimates pixel positions by human body part segmentation, and then a trained regressor indicates, with UV textures, the exact coordinates on this dense human body model of each pixel i within each region. How to combine deep learning with the SMPL human mesh model for motion capture, and how to improve motion capture accuracy, are the key scientific problems this work addresses.
Referring to fig. 1 to 4, the human body animation synthesis method based on the human body soft tissue grid model of the invention comprises the following steps:
step 1, inputting video frames into a feature pyramid network FPN to extract feature graphs containing low-level feature high-resolution and high-level feature high-semantic information
And for the video frame input with the FPN, carrying out feature extraction by utilizing a residual network ResNet to generate feature images with different resolutions, carrying out convolution operation on the feature images with different resolutions, adding the feature images with different resolutions with corresponding elements after pooling of adjacent feature images, and finally outputting the feature images.
In the scheme, the FPN is used for generating a feature map which simultaneously contains high resolution of low-level features and high semantic information of high-level features; these feature maps interact with each other and ultimately generate a feature map set representing lower-level information and higher-level information. When using the ResNet architecture, feature maps of different resolutions can be generated, which form bottom-up paths; the top-layer feature images form a channel from top to bottom through pooling operation, and when the feature images with different resolutions are convolved and then are added with corresponding elements after pooling of adjacent feature images, the feature images are finally output. The finally output feature images are fused with feature images with different resolutions, so that the finally output feature images can be used as input features of tasks such as human body part segmentation, UV texture reasoning, two-dimensional gesture estimation and the like in the next step.
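The lateral connections and top-down merge described above can be sketched with plain NumPy. The channel counts, the random 1x1 lateral weights, and the nearest-neighbour upsampling below are illustrative assumptions rather than the patent's exact configuration:

```python
import numpy as np

def lateral(x, w):
    # a 1x1 convolution is a per-pixel linear map over channels
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def upsample2x(x):
    # nearest-neighbour 2x upsampling along both spatial axes
    return np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)

def fpn_top_down(features, out_ch=8, seed=0):
    """Merge a bottom-up pyramid (finest level first) into output maps.

    features: list of arrays (C_i, H_i, W_i), each level half the
    resolution of the previous one, as produced by a ResNet backbone.
    """
    rng = np.random.default_rng(seed)
    laterals = [lateral(f, rng.standard_normal((out_ch, f.shape[0])) * 0.1)
                for f in features]
    merged = [laterals[-1]]                # the coarsest level starts the path
    for lat in reversed(laterals[:-1]):    # walk top-down
        merged.append(lat + upsample2x(merged[-1]))
    return merged[::-1]                    # finest level first again

# toy pyramid: 3 levels, channel count growing with depth
feats = [np.ones((4, 16, 16)), np.ones((8, 8, 8)), np.ones((16, 4, 4))]
outs = fpn_top_down(feats)
print([o.shape for o in outs])  # every level now carries out_ch channels
```

Each output level keeps its own spatial resolution while mixing in the coarser, more semantic levels above it, which is why a single set of maps can feed segmentation, UV inference, and pose estimation.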
Step 2, sending the feature maps output in step 1 into a multi-task cascade module based on region-feature alignment (RoIAlign), assigning a corresponding region on the feature maps to each task's candidate boxes (region proposals), and outputting it to the corresponding task branch, so that each task branch receives its own feature map.
For the feature maps output in step 1, the RoIAlign-based multi-task cascade module performs region assignment for the candidate boxes of the different tasks on the feature maps and finally outputs them to the different task branches.
For the feature maps extracted by the feature pyramid FPN, which combine low-level high resolution and high-level semantic information, the original image contains regions of interest for different tasks. These are processed with the RoIAlign-based multi-task cascade module, which accurately assigns a feature map to each task's region of interest on the original image; the results are finally output as inputs to the deep high-resolution representation network HRNet for the two-dimensional pose estimation task and to the region-based convolutional neural network for object detection and segmentation, Mask-RCNN, for the human body part segmentation task.
In this scheme, RoIAlign lets the subsequent two-dimensional pose estimation and body part segmentation tasks operate jointly on the FPN's output feature map: the different task candidate boxes on the original image are assigned to regions on the feature map without offset, accurately routed to the different regions, and fed to their task-specific branches, while the output features of the three-dimensional pose estimation are fused into the input feature map of the UV texture inference, enabling the cascading of related tasks. Because the two-dimensional pose estimation task and the body part segmentation task share the features on the FPN output feature map, both tasks can be cascaded at the same time.
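A minimal sketch of the bilinear sampling at the heart of RoIAlign, which is what keeps the pooled features offset-free relative to the proposal. One sample per output bin and a single-channel feature map are our simplifications:

```python
import numpy as np

def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap (H, W) at continuous coords (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx
            + fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def roi_align(fmap, box, out=2):
    """Pool box = (y0, x0, y1, x1) into an out x out grid, one sample per bin.

    Unlike RoIPool, the box corners are NOT rounded to integers, so the
    pooled features stay aligned with the proposal on the original image.
    """
    y0, x0, y1, x1 = box
    bh, bw = (y1 - y0) / out, (x1 - x0) / out
    return np.array([[bilinear(fmap, y0 + (i + 0.5) * bh, x0 + (j + 0.5) * bw)
                      for j in range(out)] for i in range(out)])

fmap = np.arange(16, dtype=float).reshape(4, 4)
# bin centres land at (1,1), (1,2), (2,1), (2,2) -> values 5, 6, 9, 10
print(roi_align(fmap, (0.5, 0.5, 2.5, 2.5)))
```

Because the sampling coordinates stay fractional, pose-estimation and segmentation branches receive features that line up with the same pixels of the original image, which is what lets the tasks share them.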
Step 3, inputting the feature map assigned in step 2 into the deep high-resolution representation network HRNet for two-dimensional human pose estimation to generate a two-dimensional pose, and inputting the two-dimensional pose into the temporal dilated convolution network VideoPose3D to predict the three-dimensional pose θ of the human soft tissue mesh model Θ.
Step 3.1, generating a two-dimensional pose from the feature map with HRNet.
Two-dimensional pose estimation is performed with HRNet on the feature map assigned in step 2, and the joint points of the two-dimensional pose finally serve as the input of VideoPose3D in step 3.2.
In this scheme, HRNet connects sub-networks of different resolutions in a new way through a parallel structure (the network structure is shown in FIG. 2) and finally outputs the two-dimensional pose estimate. Because the parallel structure fuses representations at multiple scales, maintains the high-resolution representation, and recovers it from the low-resolution representations, the two-dimensional pose estimation is markedly improved.
The temporal dilated convolution network explores temporal context information in the video, producing more stable predictions and reducing sensitivity to noise. Because the temporal dilated convolution model is a fully convolutional structure with residual connections, it takes a dense two-dimensional pose sequence as input and can process the two-dimensional pose and the time dimension simultaneously, while the convolutional structure precisely controls the temporal receptive field, which benefits the temporal dependence of the three-dimensional pose prediction model. In addition, dilated convolutions model long-term dependencies while remaining efficient. The final output is a single three-dimensional pose containing the temporal context of the whole input sequence, and is therefore more robust.
Step 3.2, predicting the character's three-dimensional pose from the two-dimensional pose with VideoPose3D.
Three-dimensional pose estimation is performed with VideoPose3D on the two-dimensional pose estimate output in step 3.1.
In this scheme, VideoPose3D predicts the three-dimensional pose from the two-dimensional joint points with a temporal dilated convolution network; exploring temporal information in the video produces more stable predictions and reduces sensitivity to noise. Its three-dimensional pose generation strategy captures long-sequence information with temporal dilated convolutions, so the generated three-dimensional pose features are more stable.
For the two-dimensional joint points output by HRNet, the three-dimensional pose is predicted with the temporal dilated convolution network. In this embodiment the network takes 243 video frames as input, each frame having 34 channels (the x and y coordinates of the 17 two-dimensional joint points); a convolution kernel of size W=3 with dilation factor d=1 is applied to output a feature map with C=1024 channels, followed by batch normalization (BatchNorm), a ReLU activation function, and dropout regularization. At the same time, B=4 residual blocks in the residual-network style form skip connections, each residual block performing a dilated convolution followed by a convolution with filter size W=1. Finally, the output of each residual block is connected by element-wise addition with the corresponding regularized output.
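The 243-frame input window quoted above is consistent with a first W=3 convolution followed by B=4 residual blocks whose dilation factor grows by a factor of W per block (d = 3, 9, 27, 81), as in the VideoPose3D architecture; that growing dilation schedule is an assumption drawn from that architecture rather than something the text spells out. The short sketch below just does the receptive-field arithmetic:

```python
def receptive_field(kernel=3, blocks=4):
    """Frames seen by one output of a stack of dilated temporal convolutions.

    The first layer uses dilation 1; each residual block then applies one
    kernel-size-`kernel` convolution with dilation kernel**(i+1).
    Each such convolution widens the receptive field by (kernel-1)*dilation.
    """
    rf, dilation = kernel, 1          # first plain W=3 convolution
    for _ in range(blocks):
        dilation *= kernel            # d = 3, 9, 27, 81
        rf += (kernel - 1) * dilation
    return rf

print(receptive_field())  # → 243, matching the 243-frame input window
```

This is why the paragraph can claim long-term temporal dependence at low cost: five convolutions suffice to cover 243 frames, where undilated W=3 layers would need over a hundred.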
Step 4, performing human body part segmentation on the feature map assigned in step 2 with the region-based convolutional neural network for object detection and segmentation, Mask-RCNN, and regressing a soft tissue UV texture map by convolving a Mask-RCNN feature layer, generating the human surface shape β of the human soft tissue mesh model Θ.
Step 4.1, performing human body part segmentation on the feature map.
For the feature map assigned in step 2, human body part segmentation is performed with Mask-RCNN, and the feature map is output for the UV texture space inference task.
In this scheme, Mask-RCNN classifies the human body parts and generates bounding boxes and masks, and its features are output to the UV texture space inference task so that part segmentation and UV texture inference share features. The part segmentation provides a smaller starting range for the UV texture inference of human pixels on the image.
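How the part segmentation narrows the starting range of the UV inference can be sketched as a simple mask gate: only pixels the segmentation marks as human receive UV coordinates. The array shapes and the NaN convention below are illustrative assumptions:

```python
import numpy as np

def uv_from_parts(part_mask, uv_head):
    """Assign UV coordinates only to pixels the part mask marks as human.

    part_mask: (H, W) integer part labels, 0 = background
    uv_head:   (H, W, 2) raw per-pixel UV regression output in [0, 1]
    Returns an (H, W, 2) map that is NaN outside the segmented parts,
    i.e. the segmentation gives the UV inference a smaller starting range.
    """
    uv = np.full_like(uv_head, np.nan)
    fg = part_mask > 0                 # pixels belonging to some body part
    uv[fg] = uv_head[fg]               # keep regressed UV only there
    return uv

mask = np.array([[0, 1], [2, 0]])      # toy 2x2 image, two body parts
raw = np.full((2, 2, 2), 0.5)          # toy UV head output
uv = uv_from_parts(mask, raw)
print(np.isnan(uv[0, 0, 0]), uv[0, 1])  # background is NaN, parts keep UV
```

In the real pipeline a separate regressor runs per part region, but the gating idea is the same: the classifier decides which region a pixel belongs to, and the regressor only has to localise it within that region.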
Step 4.2, inferring the soft tissue UV texture to generate the human surface shape β of the human soft tissue mesh model Θ.
For the feature map generated by Mask-RCNN in step 4.1, the UV texture map coordinates of the character on the image are regressed by a convolution operation, generating the human surface shape β of the human soft tissue mesh model Θ.
In this scheme, regressing the character's UV texture establishes a numerical representation of the human surface shape β and captures the character's soft tissue deformation.
Step 5, reconstructing the human soft tissue mesh model from the three-dimensional pose θ and the human surface shape β.
The three-dimensional pose θ output in step 3 determines the base pose of the model, and the human surface shape β generated in step 4 determines the soft tissue deformation of the model; the reconstructed human soft tissue mesh model is finally output, so that the pixels of the character in the original image are mapped onto the three-dimensional dense human surface model and motion capture of the character is achieved.
In this scheme, a human model with a determined pose is generated by extending the three-dimensional pose skeleton, and the human surface shape β is then attached to it; fusing the three-dimensional pose θ with the surface shape β yields the human soft tissue mesh model.
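At this level of abstraction the reconstruction can be sketched as an SMPL-style additive model: a template mesh plus shape blend shapes weighted by β, then posed by θ. The toy template, the single blend shape, and the global rotation standing in for θ below are all stand-ins, not the actual model data:

```python
import numpy as np

def reconstruct(template, shape_dirs, beta, rot):
    """Toy SMPL-style reconstruction of an (N, 3) vertex array.

    template:   (N, 3) rest-pose vertices
    shape_dirs: (K, N, 3) shape blend shapes (soft-tissue directions)
    beta:       (K,) surface-shape coefficients
    rot:        (3, 3) global rotation standing in for the pose theta
    """
    shaped = template + np.tensordot(beta, shape_dirs, axes=1)  # add deformation
    return shaped @ rot.T                                       # pose the mesh

tmpl = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
dirs = np.array([[[0.0, 0.1, 0.0], [0.0, 0.1, 0.0]]])   # K=1 blend shape
rot = np.eye(3)                                          # identity pose
print(reconstruct(tmpl, dirs, np.array([2.0]), rot))
# with the identity pose the vertices just shift by 2 * 0.1 along y
```

The full model additionally uses per-joint rotations and skinning weights rather than one global rotation, but the pose/shape separation shown here is exactly what lets the later discriminators treat θ and β independently.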
Step 6, discriminating and correcting the human soft tissue mesh model.
Whether the reconstructed human soft tissue mesh model is correct is judged with a discriminator pool for human soft tissue mesh models; the model is corrected, and the corrected human soft tissue mesh model is finally output.
In this scheme, a data-driven discriminator pool of human soft tissue mesh models is used; the discriminators act as weak supervision, implicitly learning the angular limits of each joint, so that the input can be corrected.
The generated human soft tissue mesh model Θ can be separated into a human surface shape β and a human pose θ, so the discriminators D_θ for the pose θ and D_β for the surface shape β can be trained independently. The discriminator D_θ for the poses θ of the k joints decomposes into k rotation-angle discriminators D_{θ_1}, …, D_{θ_k} plus one overall pose discriminator D_{θ,all}, while the surface shape β has a single discriminator D_β; thus k+2 discriminators are generated in total, with a loss function of the least-squares form

L_adv = Σ_{i=1}^{k+2} E_Θ[(D_i(Θ) − 1)²]

where the output value D_i(Θ) lies between 0 and 1; 1 indicates that the joint angle or surface shape examined by the discriminator is plausible, and 0 that it is not.
In this scheme, the discriminator D_θ judges whether the joint angles are plausible; D_β judges whether the resulting body shape conforms to human anatomy; together they judge whether the output model Θ is plausible. Minimizing the loss function provides weakly supervised learning that corrects the joint angles and the character's shape.
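A hedged sketch of how the k+2 discriminator scores could be turned into the weak-supervision signal described above, using the least-squares form in the spirit of the HMR adversarial prior; the toy scores and the exact loss form are our assumptions:

```python
import numpy as np

def generator_adv_loss(disc_outputs):
    """Least-squares adversarial loss on the generated model Theta.

    disc_outputs: per-discriminator scores in [0, 1] for the k per-joint
    angle discriminators, the overall-pose discriminator, and the shape
    discriminator (k + 2 values in total). A score of 1 means "plausible",
    so the generator is penalised by the squared distance from 1.
    """
    d = np.asarray(disc_outputs, dtype=float)
    return float(np.sum((d - 1.0) ** 2))

# k = 3 joints + overall pose + shape = 5 discriminator scores;
# only the two imperfect joint scores contribute to the loss
scores = [1.0, 0.8, 1.0, 0.9, 1.0]
print(generator_adv_loss(scores))
```

Driving this loss toward zero pushes every implausible joint angle or body shape back toward the discriminators' learned notion of a real human, without any explicit joint-limit annotations.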
Step 7, retargeting the human soft tissue mesh model.
The human soft tissue model Θ output in step 6 is fitted to the target stylized animation character, and the generated human body animation is finally output.
The downloaded stylized animation character is fitted to the predicted human soft tissue mesh model Θ with a regularization method. First, the animation character is fitted to the pose θ of the model Θ as initialization, determining the base pose; then the shape change is modeled by replacing the character's initialized mesh shape with the soft tissue deformation β of the model Θ, finally generating a plausible animation.
In this scheme, retargeting models the pose and the shape change of the soft tissue deformation β in the human soft tissue model Θ: the pose θ of the model Θ determines the character's base pose, and the soft tissue deformation β of the model Θ replaces, point by point, the surface shape of the stylized animation character, so that an animation character with accurate and fine soft tissue motion can be generated, and the human soft tissue mesh models output over the whole video sequence are mapped to the human animation.
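The point-by-point replacement described above can be sketched as: keep the captured pose, swap in the stylized character's own surface offsets. All arrays and the additive decomposition below are illustrative:

```python
import numpy as np

def retarget(char_template, char_shape_offset, captured_pose_rot):
    """Drive a stylized character with the captured soft-tissue model.

    char_template:     (N, 3) the character's own rest mesh, which replaces
                       the human model's surface shape point by point
    char_shape_offset: (N, 3) the character's soft-tissue offsets
    captured_pose_rot: (3, 3) rotation standing in for the captured pose theta
    """
    return (char_template + char_shape_offset) @ captured_pose_rot.T

verts = np.array([[1.0, 0.0, 0.0]])
offs = np.array([[0.0, 0.5, 0.0]])
rot = np.array([[0.0, -1.0, 0.0],
                [1.0,  0.0, 0.0],
                [0.0,  0.0, 1.0]])     # 90-degree rotation about z
print(retarget(verts, offs, rot))      # the offset vertex, rotated with the pose
```

Running this per frame over the whole sequence is the mapping from the captured soft tissue mesh models to the final character animation.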
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.
Claims (8)
1. A human body animation synthesis method based on a human body soft tissue mesh model, characterized by comprising the following steps:
inputting the video frames into a feature pyramid network to extract feature graphs containing low-level feature high-resolution and high-level feature high-semantic information;
the feature images are sent to a multi-task cascading module, corresponding areas are distributed on the feature images for different task candidate frames, and the corresponding areas are output to different task branches, so that corresponding feature images are distributed for different task branches;
inputting the allocated feature map to a depth high-resolution representation network for human body two-dimensional posture estimation to generate a two-dimensional posture, and inputting the two-dimensional posture to a three-dimensional posture for human body three-dimensional posture estimation based on a time domain cavity convolution network to predict a human body soft tissue grid model;
human body part segmentation is carried out on the distributed feature graphs through a region convolution neural network for object detection and segmentation, and a soft tissue UV texture graph is regressed through a method of convolving a feature layer of the network, so that a human body surface shape of a human body soft tissue grid model is generated;
reconstructing a human soft tissue network model according to the three-dimensional posture and the human surface shape of the human soft tissue network model;
judging with a human body soft tissue grid model discriminator pool whether the reconstructed human body soft tissue grid model is correct, correcting it, and finally outputting the corrected human body soft tissue grid model;
for the retargeted stylized animation character, fitting the output human body soft tissue model to the stylized animation character, and finally outputting the generated human animation;
wherein inputting the two-dimensional pose into the temporal dilated convolution network for three-dimensional human pose estimation to obtain the three-dimensional pose of the human body soft tissue grid model comprises the following steps:
for the two-dimensional joint points output by the deep high-resolution representation network, predicting the three-dimensional pose with a temporal dilated convolution network; the network takes video frames, with 34 channels per frame, as input, applies a convolution with kernel size W=3 and dilation factor d=1 to output a feature map with C=1024 channels, followed by batch normalization, an activation function, and regularization; B=4 residual-network-style residual blocks with skip connections are then applied, each performing a dilated convolution together with a convolution of filter size W=1; finally, the output of each residual block is added element-wise to the corresponding regularized output to obtain the three-dimensional pose;
wherein, for the reconstructed human body soft tissue grid model, judging with the human body soft tissue grid model discriminator pool whether it is correct and correcting it comprises:
for the generated human body soft tissue grid model Θ, which can be separated into a human body surface shape β and a human body pose θ, discriminators D_θ for the pose θ and D_β for the surface shape β can be trained independently; the pose discriminator D_θ for the k joints decomposes into k rotation-angle discriminators D_θ^i (i = 1, …, k) and one overall pose discriminator D_θ^all, and the surface shape β is judged by a single shape discriminator D_β, so that k + 2 discriminators are generated in total, whose loss functions are as follows:

L(D_i) = E_{Θ∼p_data}[(D_i(Θ) − 1)^2] + E_{Θ∼p_E}[D_i(Θ)^2] for each discriminator D_i, and L_adv = Σ_i E_{Θ∼p_E}[(D_i(Θ) − 1)^2] for the generating network;
in the above formula, the output value of D(Θ) lies between 0 and 1, where 1 indicates that the joint angle or surface shape judged by the discriminator is reasonable and 0 indicates that it is unreasonable; the D_θ discriminators judge whether the joint angles are reasonable, D_β judges whether the resulting body shape conforms to the human body structure, and together they judge whether the output model Θ is reasonable; minimizing the loss function provides weakly supervised learning that corrects the joint angles and character shape.
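As a concrete illustration of the adversarial correction step in claim 1, the least-squares discriminator and generator losses can be sketched in numpy. This is a minimal sketch assuming the standard HMR-style formulation; the function names are illustrative, not from the patent:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Least-squares adversarial loss for one discriminator.
    d_real: discriminator outputs on ground-truth (reasonable) parameters,
    d_fake: outputs on parameters predicted by the network.
    Real samples are pushed toward 1 ("reasonable"), fakes toward 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def generator_adv_loss(d_fake):
    """Adversarial term for the predicting network: it is penalized
    whenever a discriminator does not output 1 on its prediction."""
    return np.mean((d_fake - 1.0) ** 2)

# A prediction that fully fools the discriminator incurs zero generator loss.
print(generator_adv_loss(np.ones(4)))
```

In the full method, one such pair of losses would be instantiated for each of the k + 2 discriminators and summed.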
2. The human body animation synthesis method based on the human body soft tissue grid model according to claim 1, wherein inputting the video frames into the feature pyramid network to extract feature maps that combine the high resolution of low-level features with the high semantic content of high-level features comprises the following steps:
for a video frame input into the feature pyramid network, performing feature extraction with a residual network to generate feature maps of different resolutions, which form a bottom-up pathway; the top-level feature maps form a top-down pathway through pooling operations, and after a convolution operation, feature maps of different resolutions are connected by element-wise addition with the pooled adjacent feature map; the merged feature maps are finally output.
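The merge step of claim 2 can be sketched numerically. Note that in a standard feature pyramid network the top-down pathway is built by nearest-neighbour upsampling of the coarser map before the element-wise addition; the following is a minimal numpy sketch under that assumption, with the 1x1 lateral convolution taken as already applied:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_merge(top, lateral):
    """One merge step of the top-down pathway: bring the coarser top map
    to the finer map's resolution, then add element-wise."""
    return upsample2x(top) + lateral

# A coarse 256x4x4 top map merged into a 256x8x8 lateral map.
top = np.ones((256, 4, 4))
lateral = np.ones((256, 8, 8))
merged = top_down_merge(top, lateral)
print(merged.shape)
```

Repeating this step level by level yields the multi-resolution output maps described in the claim.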
3. The human body animation synthesis method based on the human body soft tissue grid model according to claim 1, wherein, for the feature maps extracted by the feature pyramid that combine the high resolution of low-level features with the high semantic content of high-level features, the different tasks have different regions of interest on the original image; the RoIAlign-based multi-task cascade module processes the feature maps, accurately assigning on them the feature map corresponding to each task's region of interest on the original image, and finally outputs them as input to the deep high-resolution representation network for two-dimensional human pose estimation to perform the two-dimensional pose estimation task, and to the region-based convolutional neural network for object detection and segmentation to perform the human body part segmentation task.
4. The human body animation synthesis method based on the human body soft tissue grid model according to claim 1, wherein the human body part segmentation is performed on the assigned feature map by the region-based convolutional neural network for object detection and segmentation, which classifies the human body parts, generates bounding boxes and masks, shares features between the human body part segmentation and UV texture space reasoning tasks, and outputs the network's features to the UV texture space reasoning task; the human body part segmentation provides a narrower starting range for the UV texture reasoning over the human pixel points of the image.
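A minimal numpy sketch of how the part segmentation of claim 4 narrows the starting range of the UV regression: only foreground pixels selected by the part mask are passed to the per-pixel (u, v) regressor. The linear regressor and all array shapes here are illustrative stand-ins, not the patent's network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-pixel part labels from the segmentation branch (0 = background).
part_mask = np.array([[0, 1],
                      [1, 2]])
# Shared backbone features: one 8-dimensional vector per pixel.
features = rng.standard_normal((2, 2, 8))

# The part mask narrows the search: only foreground pixels reach the regressor.
fg = part_mask > 0
fg_features = features[fg]            # (num_foreground, 8)

# Stand-in linear regressor mapping features to UV texture coordinates.
W = rng.standard_normal((8, 2)) * 0.1
uv = 1.0 / (1.0 + np.exp(-(fg_features @ W)))  # squashed into (0, 1)

print(fg_features.shape, uv.shape)
```

Background pixels are never regressed, which is the "smaller starting range" the claim refers to.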
5. The human body animation synthesis method based on the human body soft tissue grid model according to claim 1, wherein reconstructing the human body soft tissue grid model from the three-dimensional pose and the human body surface shape of the human body soft tissue grid model comprises:
for the three-dimensional pose θ and the human body surface shape β, determining the base pose of the model with the pose θ and the soft tissue deformation of the model with the shape β, and finally outputting the reconstructed human body soft tissue grid model, so that the pixels of the character in the original image are mapped onto the three-dimensional dense human surface model, realizing motion capture of the character.
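The reconstruction step of claim 5 can be illustrated with the linear shape-blend stage of a parametric body model: the rest surface is the template mesh plus a combination of shape displacement bases weighted by β. This is a minimal numpy sketch with toy dimensions; posing by θ (per-joint rotations and skinning) would follow this step and is omitted:

```python
import numpy as np

def reconstruct_surface(template, shape_dirs, beta):
    """Shape-blend step of a parametric body model: add to the template
    mesh a linear combination of per-vertex displacement bases, one basis
    per component of the shape vector beta."""
    # tensordot over the shape axis computes sum_i beta[i] * shape_dirs[i]
    return template + np.tensordot(shape_dirs, beta, axes=([0], [0]))

n_vertices, n_shapes = 5, 3
template = np.zeros((n_vertices, 3))          # toy rest mesh
shape_dirs = np.ones((n_shapes, n_vertices, 3))  # toy displacement bases
beta = np.array([0.5, -0.25, 0.0])

verts = reconstruct_surface(template, shape_dirs, beta)
print(verts.shape)
```

Varying β deforms the soft tissue while the mesh topology, and hence the pixel-to-surface mapping, stays fixed.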
6. The human body animation synthesis method based on the human body soft tissue grid model according to claim 1, wherein, for the retargeted stylized animation character, fitting the output human body soft tissue model to the stylized animation character and finally outputting the generated human animation comprises:
fitting the downloaded stylized animation character to the predicted human body soft tissue grid model Θ with a regularization method: first, the pose θ of the human body soft tissue grid model Θ is used to initialize the pose of the animation character fitted to it, determining the base pose; then shape-change modeling is performed by replacing the animation character's initialized grid shape with the soft tissue deformation β of the human body soft tissue grid model Θ, finally generating a plausible animation.
7. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645245.5A CN111968208B (en) | 2020-07-07 | 2020-07-07 | Human body animation synthesis method based on human body soft tissue grid model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111968208A CN111968208A (en) | 2020-11-20 |
CN111968208B true CN111968208B (en) | 2023-10-03 |
Family
ID=73361213
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112241731B (en) * | 2020-12-03 | 2021-03-16 | 北京沃东天骏信息技术有限公司 | Attitude determination method, device, equipment and storage medium |
CN117710468B (en) * | 2024-02-06 | 2024-05-17 | 天度(厦门)科技股份有限公司 | Gesture redirection method, device, equipment and medium based on joint grid deformation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104331164A (en) * | 2014-11-27 | 2015-02-04 | 韩慧健 | Gesture movement smoothing method based on similarity threshold value analysis of gesture recognition |
CN110008915A (en) * | 2019-04-11 | 2019-07-12 | 电子科技大学 | The system and method for dense human body attitude estimation is carried out based on mask-RCNN |
CN110598576A (en) * | 2019-08-21 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Sign language interaction method and device and computer medium |
CN110619639A (en) * | 2019-08-26 | 2019-12-27 | 苏州同调医学科技有限公司 | Method for segmenting radiotherapy image by combining deep neural network and probability map model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||