CN115965765A - Human motion capture method in deformable scene based on neural deformation - Google Patents


Info

Publication number
CN115965765A
CN115965765A (application CN202211534535.8A)
Authority
CN
China
Prior art keywords
scene
human body
deformation
mesh
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211534535.8A
Other languages
Chinese (zh)
Inventor
Wang Yangang (王雁刚)
Xie Wei (谢薇)
Gao Huan (高桓)
Zhu Mingmin (朱明敏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Yingqi Intelligent Technology Co ltd
Southeast University
Original Assignee
Nanjing Yingqi Intelligent Technology Co ltd
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Yingqi Intelligent Technology Co ltd, Southeast University filed Critical Nanjing Yingqi Intelligent Technology Co ltd
Priority to CN202211534535.8A priority Critical patent/CN115965765A/en
Publication of CN115965765A publication Critical patent/CN115965765A/en
Pending legal-status Critical Current


Abstract

The invention discloses a method for capturing human motion in a deformable scene based on neural deformation. The method first initializes the three-dimensional body pose of a human model using a three-dimensional pose estimator. It then estimates a contact probability map for the body mesh under that pose, obtains the body-mesh vertices in contact with the scene, finds the corresponding contact points on the scene mesh by ray casting, and optimizes with these body-scene contact-point pairs to obtain the global three-dimensional body pose. Next, a Transformer-based neural deformation network is built and the scene mesh is non-rigidly deformed according to the current interaction state between the body mesh and the scene mesh. Finally, optimizing the global body pose and deforming the scene mesh are alternated iteratively, achieving high-quality markerless monocular three-dimensional human motion capture together with non-rigid three-dimensional scene deformation.

Description

Human motion capture method in deformable scene based on neural deformation
Technical Field
The invention relates to a human motion capture method in a deformable scene based on neural deformation, belonging to the field of computer vision and computer graphics.
Background
Human motion capture has wide application in character animation, human-computer interaction, human behavior understanding, and so on. Conventional motion capture collects the motion information of a moving person with an optical or an inertial motion capture system. However, both optical and inertial systems require the actor to wear special equipment, which limits the range of use of motion capture and the realism of the motion, and the equipment is often expensive. In recent years, with the development of deep learning and the creation of large data sets, research on markerless motion capture has made remarkable progress, and a great deal of work can now capture three-dimensional human motion from single-view videos and images. However, reconstructing a three-dimensional human body from a monocular color image suffers from scale ambiguity, which existing methods cannot resolve well. Furthermore, most of these methods treat the background as static, ignoring potential scene changes caused by human-scene interaction. Although they use human-environment contact and penetration constraints to avoid collisions, neglecting scene deformation easily leads to large three-dimensional reconstruction errors.
Scene constraints can provide cues for capturing global three-dimensional human motion, and high-quality scene deformation can in turn guide improvements in the accuracy of global three-dimensional body-pose estimation. Existing mesh-deformation methods for deformable objects can deform a mesh under the guidance of predefined sparse control vertices. However, this problem is often severely ill-posed and under-constrained, especially for large surfaces, since many possible deformations can match the partial surface motion given by sparse control points. Encoding strong deformation priors is therefore a necessary condition for solving it. Optimization methods use various analytical priors to define natural mesh deformation, such as elasticity, Laplacian smoothness, and rigidity priors, but these merely constrain local surface patches to transform in a similar manner, making complex deformation difficult to model. Existing neural-network-based methods estimate a displacement field to model deformation, but the field is high-dimensional and hard to generalize. Instead, a Transformer can model the interrelations among points, learn a local geometric deformation prior, and from that prior infer a set of Euclidean transformations composed of displacements and rotations that deform the mesh.
Therefore, modeling the geometry of the deformable scene with a Transformer-based neural deformation network, while using environmental constraints to provide additional cues for global three-dimensional human motion capture, can effectively improve the estimation accuracy of the global three-dimensional body pose.
Disclosure of Invention
The invention provides a human motion capture method in a deformable scene based on neural deformation. The method first initializes the three-dimensional body pose of a kinematic human model using a three-dimensional pose estimator, which produces a pose relative to the root node. It then estimates a contact probability map for the body mesh under that pose, obtains the body-mesh vertices in contact with the scene, finds the corresponding contact points on the scene mesh by ray casting, and optimizes with these body-scene contact-point pairs to obtain the global three-dimensional body pose. Next, a Transformer-based neural deformation network is built and the scene mesh is non-rigidly deformed according to the current interaction state between the body mesh and the scene mesh. Finally, optimizing the global body pose and deforming the scene mesh are alternated iteratively, achieving high-quality markerless monocular three-dimensional human motion capture together with non-rigid three-dimensional scene deformation.
The invention provides a human motion capture method in a deformable scene based on neural deformation, which comprises the following steps:
step 1, initializing the three-dimensional body pose of a human model from a monocular color image using a three-dimensional body-pose estimator;
step 2, estimating a contact probability map for the body mesh under that pose, obtaining the vertices in contact with the scene, finding the corresponding contact points on the scene mesh by ray casting, and forming contact-point pairs between the body mesh and the scene mesh;
step 3, optimizing an objective function over the contact-point pairs obtained in step 2 to obtain the global three-dimensional body pose;
step 4, building a Transformer-based neural deformation network and non-rigidly deforming the scene mesh according to the interaction state between the body mesh and the scene mesh under the global pose;
step 5, iteratively alternating between optimizing the global three-dimensional body pose and performing non-rigid deformation of the scene mesh, achieving high-quality markerless monocular three-dimensional human motion capture and non-rigid three-dimensional scene deformation.
Further, step 1 initializes the human model with the optimization-based SMPLify-X, optimizing the three-dimensional body pose of the SMPL-X model by minimizing an objective function.
Further, the objective function is defined as follows:
E_init(β, θ, t) = E_J + λ_θ E_θ + λ_α E_α + λ_β E_β + λ_C E_C
the optimized parameter beta of the objective function represents the human body shape parameter, theta represents the complete set of the optimized posture parameters, t represents the global translation, and the first item E of the objective function J Is a reprojection loss representing a robust weighted distance error of the two-dimensional projection of the 2D joint position estimated from the monocular color image and the corresponding estimated three-dimensional joint of the phantom, the second term E of the objective function θ Is a VAE-based body posture prior, a third term E of an objective function α Is a priori punishing the extreme bending of the elbow and knee, the fourth term E of the objective function β Is a regularization term of human body shape, punishment of deviation from neutral state and a last term E of an objective function C Indicating a penalty for body part self-collision, λ θ 、λ α 、λ β 、λ C Respectively represent E θ 、E α 、E β 、E C The weight coefficient of (2).
Further, in step 2, for the body mesh in the current three-dimensional pose, a conditional variational autoencoder is used to generate a contact probability map for the body in that pose. The trained decoder takes the body-mesh vertices under the initialized pose and a latent variable as sampling conditions, where the latent space follows a Gaussian distribution. Thresholding the generated contact probability map yields the body-mesh vertices in contact with the environment.
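The thresholding step reduces to a mask over the per-vertex probabilities; a minimal numpy sketch follows (the 0.5 threshold is an assumption, as the text does not state a value):

```python
import numpy as np

def contact_vertices(contact_prob, threshold=0.5):
    """Return the indices of body-mesh vertices whose predicted
    contact probability exceeds the threshold (assumed 0.5)."""
    contact_prob = np.asarray(contact_prob)
    return np.flatnonzero(contact_prob > threshold)

probs = np.array([0.1, 0.9, 0.4, 0.75])  # per-vertex CVAE output (toy values)
idx = contact_vertices(probs)            # vertices 1 and 3 are in contact
```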
Further, in step 2, an existing ray-casting search strategy is used to find the corresponding contact points on the scene mesh. The body-mesh vertices in contact with the environment are re-projected into image space; if a re-projected contact point falls on an unoccluded body part, a ray is cast from the camera to find its intersection with the three-dimensional scene mesh; if it falls on an occluded body part, the nearest scene vertex is taken as the corresponding contact vertex.
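The two branches can be sketched with the standard Möller-Trumbore ray/triangle test plus a nearest-vertex fallback. This is a generic sketch of ray casting against one triangle, not the patent's implementation, which would iterate over (or spatially index) the whole scene mesh:

```python
import numpy as np

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore ray/triangle intersection; returns the hit
    point or None. Models casting a camera ray toward a re-projected
    (unoccluded) contact point to find where it meets the scene mesh."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:          # ray parallel to triangle plane
        return None
    inv = 1.0 / det
    s = origin - v0
    u = (s @ p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = (direction @ q) * inv
    if v < 0.0 or u + v > 1.0:  # outside the triangle
        return None
    t = (e2 @ q) * inv
    return origin + t * direction if t > eps else None

def nearest_vertex(point, scene_vertices):
    """Fallback for occluded contacts: index of the closest scene vertex."""
    d = np.linalg.norm(scene_vertices - point, axis=1)
    return int(np.argmin(d))
```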
Further, in step 3, after the contact-point pairs between the body mesh and the scene mesh are obtained, the contact points on the body mesh are aligned to the corresponding scene-mesh contact points, and a global objective function is further optimized, starting from the result of the step-1 optimization, to obtain the global three-dimensional body pose. The global objective function is defined as follows:
E_global(β, θ, t) = E_J + λ_C E_C + λ_P E_P + λ_T E_T
the optimized parameters β of the objective function represent the parameters of the human body shape, θ represents the complete set of the optimizable pose parameters, and t represents the global translation. First term of objective function E J Is a reprojection loss. Second term of objective function E C Indicating that a body-part self-collision is punished. Third term of objective function E P The representation minimizes the distance between the contact point on the body mesh and the contact point on the corresponding scene mesh. Last term of objective function E T Represents a temporal smoothing term, represents the L2 distance, λ, of the current frame pose and global translation from the previous frame pose and global translation C 、λ P 、λ T Respectively represent E C 、E P 、E T The weight coefficient of (c).
Further, in step 4, sparse control points are defined according to the interaction state between the body mesh and the scene mesh under the global three-dimensional pose. First, a trained three-dimensional scene-segmentation network segments the scene and estimates a semantic label for each segment. After segmentation and labeling, the rigid parts of the scene are masked out, and only the deformable parts undergo the subsequent non-rigid deformation. Next, collision detection is performed between the current body mesh and the deformable scene: if a body-mesh vertex penetrates the scene and is one of the contact points estimated in step 2, the nearest vertex of the penetrated object is set as a control point whose target position is the position of that body-mesh vertex.
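Control-point selection per this paragraph can be sketched as intersecting the contact vertices with the penetrating vertices and mapping each to its nearest scene vertex. The penetration test itself is assumed to come from an external collision detector, so its result is passed in as a list of indices:

```python
import numpy as np

def select_control_points(body_verts, contact_idx, penetrating_idx, scene_verts):
    """For each body vertex that is both an estimated contact point and
    penetrates the scene, pick the nearest scene vertex as a control
    point whose target position is the body vertex (a simplified sketch;
    the patent restricts the search to the penetrated object)."""
    controls = {}
    for i in sorted(set(contact_idx) & set(penetrating_idx)):
        bv = body_verts[i]
        j = int(np.argmin(np.linalg.norm(scene_verts - bv, axis=1)))
        controls[j] = bv  # scene vertex j should move to the body vertex
    return controls
```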
Further, in step 4, after the sparse control points and their target positions are defined, a Transformer-based neural deformation network is built to non-rigidly deform the scene mesh. From the deformable scene mesh, a fixed number N of points are uniformly sampled, including the sparse control points; if a sample is a control point, its target displacement is set to the displacement toward its target position, otherwise to zero. The positions and target displacements of the N sampled points are fed to the neural deformation network, which outputs a set of Euclidean transformations composed of displacements and rotations, namely the Euclidean transformations of M nodes. The deformed position of each vertex on the deformable scene mesh is determined by the Euclidean transformations of its m nearest nodes: it is the weighted sum of the positions obtained by applying those transformations, which can be expressed as follows:
v' = Σ_{i=1}^{m} w_i (R_i (v − g_i) + g_i + t_i)
where v is a scene-mesh vertex, v' is its deformed position, R_i and t_i are the rotation matrix and translation vector of node g_i, and w_i is the weight of node g_i for vertex v, inversely proportional to the distance between the vertex and the node.
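This blending is standard embedded-deformation-graph skinning. A numpy sketch follows, with a normalized inverse-distance weighting assumed (the patent only states that the weight is inversely proportional to the vertex-node distance):

```python
import numpy as np

def deform_vertex(v, nodes, rotations, translations, m=4, eps=1e-8):
    """Blend the Euclidean transforms (R_i, t_i) of the m nearest graph
    nodes g_i to deform vertex v, with weights inversely proportional
    to the vertex-node distance (normalized to sum to 1)."""
    d = np.linalg.norm(nodes - v, axis=1)
    idx = np.argsort(d)[:m]            # m nearest nodes
    w = 1.0 / (d[idx] + eps)
    w = w / w.sum()
    out = np.zeros(3)
    for wi, i in zip(w, idx):
        # position after applying node i's Euclidean transform to v
        out += wi * (rotations[i] @ (v - nodes[i]) + nodes[i] + translations[i])
    return out
```

With identity rotations and a common translation, every blended candidate is v plus that translation, so the result is exactly the rigidly shifted vertex, which is a quick sanity check on the weighting.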
Further, in step 5, the body pose and the scene deformation are optimized in iterative alternation. In each iteration, contact-point pairs between the body mesh and the scene mesh are first obtained as in step 2; a term penalizing body-scene penetration is added to the global objective function of step 3, which is optimized to update the global body pose; the scene mesh is then deformed and updated with the neural deformation network as described in step 4.
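The alternation in step 5 has a simple control-flow skeleton. The three callables below stand in for steps 2-4 and are purely illustrative:

```python
def alternate_optimize(init_pose, scene, find_contacts, optimize_pose,
                       deform_scene, n_iters=3):
    """Skeleton of the iterative alternation: each round re-estimates
    contact pairs (step 2), refines the global pose against them
    (step 3 plus the penetration penalty), then deforms the scene
    mesh with the neural deformation network (step 4)."""
    pose = init_pose
    for _ in range(n_iters):
        contacts = find_contacts(pose, scene)
        pose = optimize_pose(pose, contacts)
        scene = deform_scene(scene, pose)
    return pose, scene

# Toy stand-ins just to exercise the loop structure:
pose, scene = alternate_optimize(
    0, "scene",
    find_contacts=lambda p, s: [],
    optimize_pose=lambda p, c: p + 1,
    deform_scene=lambda s, p: s,
    n_iters=3)
```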
Compared with the prior art, the invention has the following advantages: 1. The method can capture three-dimensional human motion and the non-rigid deformation of a deformable scene from a single-view RGB video of a person interacting with a deformable environment. 2. The invention models the non-rigid deformation of the deformable environment with a Transformer-based neural deformation network. 3. The invention models scene deformation using the mutual constraints between the human body and the scene, effectively improving the accuracy of global three-dimensional human motion capture. 4. The invention takes only single-view RGB video as input, which is convenient to collect, low in cost, and easy to implement.
Drawings
FIG. 1 is a flow chart in an embodiment of the invention;
FIG. 2 is a structure diagram of the Transformer-based neural deformation network in an embodiment of the present invention;
fig. 3 shows reconstruction results achieved by the present invention, where column (a) is the input color image, column (b) is the initial reconstruction result, and column (c) is the final reconstruction result.
Detailed Description
The following detailed description of the embodiments of the present invention is made with reference to the accompanying drawings and examples, so that the methods and effects of the present invention can be clearly understood. It should be noted that, where there is no conflict, the features of the embodiments may be combined with each other, and the resulting technical solutions all fall within the protection scope of the present invention.
Further, the flowcharts shown in the drawings may be executed as a sequence of instructions on a computer, and in some cases the order of the steps may be appropriately changed.
Example one
Fig. 1 is a flowchart of a human motion capture method in a deformable scene based on neural deformation according to a first embodiment of the present invention, and the following steps are described in detail with reference to fig. 1.
Step S110, initializing the three-dimensional human body posture of the human body model from the monocular color image by using a three-dimensional human body posture estimator;
and initializing a human body model SMPL-X by adopting the optimized SMPLfy-X, and obtaining a three-dimensional human body posture relative to a root node by minimizing an objective function.
The objective function of the three-dimensional body pose is defined as follows:
E_init(β, θ, t) = E_J + λ_θ E_θ + λ_α E_α + λ_β E_β + λ_C E_C
the optimized parameter beta of the objective function represents the human body shape parameter, theta represents the complete set of the optimized posture parameters, t represents the global translation, and the first item E of the objective function J Is a reprojection loss representing a robust weighted distance error of the two-dimensional projection of the estimated 2D joint position from the monocular color image and the corresponding estimated three-dimensional joint of the phantom, the second term E of the objective function θ Is a VAE-based body posture prior, a third term E of an objective function α Is a priori penalizing the extreme bending of the elbow and knee, the fourth term E of the objective function β Is a regularization term of human body shape, punishment of deviation from neutral state and a last term E of an objective function C Indicating a penalty for body part self-collision, λ θ 、λ α 、λ β 、λ C Respectively represent E θ 、E α 、E β 、E C The weight coefficient of (2).
Step S120, estimating a contact probability map for the body mesh under the three-dimensional pose, obtaining the contact points in contact with the scene, finding the corresponding contact points on the scene mesh by ray casting, and forming contact-point pairs between the body mesh and the scene mesh;
for the body mesh in the current three-dimensional body posture, a conditional variational self-encoder is used for generating a contact probability map for the body in the three-dimensional posture. The trained decoder takes human body mesh vertexes and hidden variables under the initialized three-dimensional human body posture as sampling conditions, wherein the hidden variable space obeys Gaussian distribution. And performing threshold operation on the generated contact probability graph to obtain the human body mesh vertexes in contact with the environment.
The corresponding contact points on the scene mesh are found with an existing ray-casting search strategy. The generated body contact points are re-projected into image space. If a re-projected contact point falls on an unoccluded body part, a ray is cast from the camera to find its intersection with the three-dimensional scene mesh; if it falls on an occluded body part, the nearest scene vertex is taken as the corresponding contact.
And step S130, optimizing an objective function to obtain a global three-dimensional human body posture based on the obtained contact point pair of the human body grid and the scene grid.
After the contact-point pairs between the body mesh and the scene mesh are obtained, the contact points on the body mesh are aligned to the corresponding scene-mesh contact points, and a coarse global three-dimensional body pose is obtained by minimizing a global objective function.
The global objective function is defined as follows:
E_global(β, θ, t) = E_J + λ_C E_C + λ_P E_P + λ_T E_T
the optimized parameter beta of the objective function represents the human body shape parameter, theta represents the complete set of the optimized posture parameters, t represents the global translation, and the first item E of the objective function J Is the reprojection loss, the second term E of the objective function C Indicating a penalty for body part self-collision. Third term of objective function E P Representing contact points on the human body mesh and on the corresponding scene meshMinimum distance of contact point, last term of objective function E T Represents a temporal smoothing term representing the L2 distance, λ, of the current frame pose and global translation from the previous frame pose and global translation C 、λ P 、λ T Respectively represent E C 、E P 、E T The weight coefficient of (2).
Step S140, building a Transformer-based neural deformation network and non-rigidly deforming the scene mesh according to the interaction state between the body mesh and the scene mesh under the global three-dimensional pose;
and defining sparse control points according to the interaction state of the human body grids and the scene grids in the global three-dimensional human body posture. Firstly, segmenting a scene by using a trained three-dimensional scene segmentation grid, and performing semantic label estimation on a segmentation part. After scene segmentation and semantic label estimation are completed, the rigid scene is shielded, and only the deformable scene is subjected to subsequent non-rigid deformation. And then, performing collision detection on the current human body mesh and the deformable scene, and if the vertex of the current human body mesh penetrates through the scene and the vertex of the human body mesh is an estimated contact point, setting the vertex of the nearest penetrated object as a control point, wherein the target position of the control point is the position of the vertex of the human body mesh.
After the sparse control points and their target positions are defined, a Transformer-based neural deformation network is built to non-rigidly deform the scene mesh. The structure of the neural deformation network is shown in fig. 2. From the deformable scene mesh, a fixed number N of points are uniformly sampled, including the sparse control points; if a sample is a control point, its target displacement is set to the displacement toward its target position, otherwise to zero. The positions and target displacements of the N sampled points are fed to the network, which outputs a set of Euclidean transformations composed of displacements and rotations, namely the Euclidean transformations of M nodes. The deformed position of each vertex on the deformable scene mesh is determined by the Euclidean transformations of its m nearest nodes: it is the weighted sum of the positions obtained by applying those transformations, which can be expressed as follows:
v' = Σ_{i=1}^{m} w_i (R_i (v − g_i) + g_i + t_i)
where v is a scene-mesh vertex, v' is its deformed position, R_i and t_i are the rotation matrix and translation vector of node g_i, and w_i is the weight of node g_i for vertex v, inversely proportional to the distance between the vertex and the node.
And S150, iteratively and alternately optimizing the global human body posture and the scene deformation, and realizing high-quality unmarked monocular three-dimensional human body motion capture and non-rigid three-dimensional scene deformation.
The body pose and the scene deformation are optimized in iterative alternation. In each iteration, contact-point pairs between the body mesh and the scene mesh are first obtained as in step S120; a term penalizing body-scene penetration is added to the global objective function of step S130, which is optimized to update the global body pose; the scene mesh is then deformed and updated with the neural deformation network as described in step S140.
In the first embodiment, the reconstruction results are shown in fig. 3: the first column is the input color image, the second column the initial reconstruction result, and the third column the final reconstruction result.
Those skilled in the art will appreciate that the modules or steps of the invention described above can be implemented on a general-purpose computing device; they can be centralized on a single computing device or distributed across a network of computing devices, and can optionally be implemented as program code executable by a computing device, so that they are stored in a storage device and executed by a computing device, fabricated separately as integrated-circuit modules, or fabricated with several of them combined into a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

Claims (6)

1. A method for capturing human motion in a deformable scene based on neural deformation is characterized by comprising the following steps:
step 1, initializing a three-dimensional human body posture of a human body model from a monocular color image by using a three-dimensional human body posture estimator;
step 2, estimating a contact probability map for the body mesh under the three-dimensional pose, obtaining the contact points in contact with the scene, and finding the corresponding contact points on the scene mesh by ray casting to obtain contact-point pairs between the body mesh and the scene mesh;
step 3, based on the contact point pair of the human body grid and the scene grid obtained in the step 2, optimizing an objective function to obtain a global three-dimensional human body posture;
step 4, building a Transformer-based neural deformation network and non-rigidly deforming the scene mesh according to the interaction state between the body mesh and the scene mesh under the global three-dimensional pose;
and 5, iteratively and alternately optimizing the global three-dimensional human body posture and executing non-rigid deformation of the scene grid, and realizing high-quality unmarked monocular three-dimensional human body motion capture and non-rigid three-dimensional scene deformation.
2. The method for capturing human motion in a deformable scene based on neural deformation as claimed in claim 1, wherein in step 1 the three-dimensional body pose of the human model is initialized with the optimization-based SMPLify-X, and the three-dimensional body pose of the SMPL-X model is optimized by minimizing an objective function, the objective function being defined as follows:
E_init(β, θ, t) = E_J + λ_θ E_θ + λ_α E_α + λ_β E_β + λ_C E_C
the optimized parameter beta of the objective function represents the human body shape parameter, theta represents the complete set of the optimized posture parameters, t represents the global translation, and the first item E of the objective function J Is a reprojection loss representing the estimated 2D joint position from a monocular color image and the corresponding estimated manikinRobust weighted distance error of two-dimensional projection of a dimensional joint, second term of objective function E θ Is a VAE-based body posture prior, the third term E of an objective function α Is a priori penalizing the extreme bending of the elbow and knee, the fourth term E of the objective function β Is the regularization term of human body shape, penalizes the deviation from neutral state, and the last term E of the objective function C Indicating a penalty for body part self-collision, λ θ 、λ α 、λ β 、λ C Respectively represent E θ 、E α 、E β 、E C The weight coefficient of (2).
3. The method for capturing human motion in a deformable scene based on neural deformation as claimed in claim 1, wherein in step 2, for the body mesh in the current three-dimensional pose, a conditional variational autoencoder is used to generate a contact probability map for the body in that pose; the trained decoder takes the body-mesh vertices under the initialized pose and a latent variable as sampling conditions, where the latent space follows a Gaussian distribution; thresholding the generated contact probability map yields the body-mesh vertices in contact with the environment;
finding a corresponding contact point on the scene grid by using the existing light projection searching strategy, re-projecting the vertex of the human body grid in contact with the environment into an image space, and projecting light from the camera if the re-projected contact point falls on an unshielded human body part to find an intersection point with the three-dimensional scene grid; and if the re-projected contact point falls on the shielded human body part, taking the nearest scene vertex as the corresponding contact vertex.
4. The method for capturing human motion in a deformable scene based on neural deformation as claimed in claim 1, wherein in step 3, after the contact-point pairs between the body mesh and the scene mesh are obtained, the contact points on the body mesh are aligned to the corresponding scene-mesh contact points, and a global objective function is further optimized, starting from the result of the step-1 optimization, to obtain the global three-dimensional body pose, the global objective function being defined as follows:
E_global(β, θ, t) = E_J + λ_C E_C + λ_P E_P + λ_T E_T
The optimized parameters of the objective function are β, the human body shape parameters; θ, the full set of optimized pose parameters; and t, the global translation. The first term E_J of the objective function is the reprojection loss, and the second term E_C penalizes self-collisions between body parts. The third term E_P is a contact term that minimizes the distance between each contact point on the human body mesh and the corresponding contact point on the scene mesh. The last term E_T is a temporal smoothing term: the L2 distance between the pose and global translation of the current frame and those of the previous frame. λ_C, λ_P and λ_T are the weight coefficients of E_C, E_P and E_T, respectively.
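The assembly of E_global from its terms can be sketched as below. This is an illustration, not the patented solver: summing (rather than averaging) the pairwise contact distances in E_P is an assumption, since the claim only says the distance is minimized:

```python
import numpy as np

def contact_term(body_pts, scene_pts):
    """E_P sketch: total distance between paired body/scene contact points.
    body_pts, scene_pts: (P, 3) arrays of corresponding contact points."""
    return float(np.sum(np.linalg.norm(body_pts - scene_pts, axis=1)))

def temporal_term(pose, trans, prev_pose, prev_trans):
    """E_T: L2 distance of the current frame's pose and global translation
    from the previous frame's values."""
    return float(np.linalg.norm(pose - prev_pose)
                 + np.linalg.norm(trans - prev_trans))

def global_energy(E_J, E_C, E_P, E_T, lam_C, lam_P, lam_T):
    """E_global = E_J + lam_C*E_C + lam_P*E_P + lam_T*E_T, as in claim 4."""
    return E_J + lam_C * E_C + lam_P * E_P + lam_T * E_T
```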
5. The method for capturing human motion in a deformable scene based on neural deformation as claimed in claim 1, wherein the specific procedure of step 4 is:
first, sparse control points are defined according to the interaction state between the human body mesh and the scene mesh under the global three-dimensional human body pose: the scene is segmented with a trained three-dimensional scene segmentation network and semantic labels are estimated for the segmented parts; once segmentation and semantic labeling are complete, the rigid parts of the scene are masked out, and only the deformable parts of the scene undergo the subsequent non-rigid deformation. Next, collision detection is performed between the current human body mesh and the deformable scene: if a vertex of the current human body mesh penetrates the scene and that vertex is one of the contact points with the scene found in step 2, the nearest vertex of the penetrated object is set as a control point, and the target position of this control point is the position of the human mesh vertex;
then, a Transformer-based neural deformation network is built to deform the scene mesh non-rigidly. N points are uniformly sampled on the deformable scene mesh, with the sparse control points included among them; if a sampled point is a control point, its target displacement is set to the displacement towards its target position, and if it is not, its target displacement is set to zero. The positions of the N sampled points and their target displacements are fed into the neural deformation network, which outputs a set of Euclidean transformations, each consisting of a rotation and a translation, for M deformation nodes. The deformed position of each vertex on the deformable scene mesh is then determined by the Euclidean transformations of its m nearest nodes: the deformed position of a vertex is the weighted sum of the positions obtained by applying the Euclidean transformations of the m nearest nodes, and the calculation formula can be expressed as follows:
v' = Σ_{i=1}^{m} w_i (R_i (v − g_i) + g_i + t_i)
where v is a scene mesh vertex, v' is the deformed position of v, R_i and t_i are the rotation matrix and translation vector of node g_i, respectively, and w_i is the weight of node g_i for the scene mesh vertex v, inversely proportional to the distance between v and the node.
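The node-blending formula above can be sketched as follows. The exact weight normalization is an assumption (the patent only states that weights are inversely proportional to distance); everything else follows the formula term by term:

```python
import numpy as np

def deform_vertex(v, nodes, rotations, translations, m=4):
    """Blend the Euclidean transforms of the m nearest deformation nodes,
    as in claim 5: v' = sum_i w_i (R_i (v - g_i) + g_i + t_i).

    v:            (3,)      scene mesh vertex
    nodes:        (M, 3)    node positions g_i
    rotations:    (M, 3, 3) rotation matrices R_i
    translations: (M, 3)    translation vectors t_i
    """
    d = np.linalg.norm(nodes - v, axis=1)
    idx = np.argsort(d)[:m]            # indices of the m nearest nodes
    w = 1.0 / (d[idx] + 1e-8)          # inverse-distance weights
    w /= w.sum()                       # normalization is an assumption
    v_new = np.zeros(3)
    for wi, i in zip(w, idx):
        g, R, t = nodes[i], rotations[i], translations[i]
        v_new += wi * (R @ (v - g) + g + t)  # per-node Euclidean transform
    return v_new
```

With a single node carrying an identity rotation and a pure translation, the vertex is simply translated, which matches the formula.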
6. The method for capturing human motion in a deformable scene based on neural deformation as claimed in claim 1, wherein in step 5, the global three-dimensional human body pose optimization and the non-rigid deformation of the scene mesh are iterated alternately: in each iteration, contact point pairs between the human body mesh and the scene mesh are first obtained according to step 2; a term penalizing human-scene penetration is then added to the global objective function of step 3, and this objective function is optimized to update the global human body pose; finally, the scene mesh is deformed and updated with the neural deformation network as described in step 4.
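The alternation in claim 6 reduces to a simple loop skeleton. The sketch below uses hypothetical callback interfaces as stand-ins for steps 2-4; it shows only the control flow, not the actual optimizers or networks:

```python
def alternate_optimization(pose, scene, find_contacts, refine_pose,
                           deform_scene, n_iters=3):
    """Iteration skeleton of claim 6 (callback names are hypothetical):
    find_contacts - step 2: contact point pairs for the current pose
    refine_pose   - step 3 plus the added penetration penalty
    deform_scene  - step 4: neural non-rigid scene deformation
    """
    for _ in range(n_iters):
        contacts = find_contacts(pose, scene)  # re-estimate contacts
        pose = refine_pose(pose, contacts)     # update global body pose
        scene = deform_scene(scene, contacts)  # update scene mesh
    return pose, scene
```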
CN202211534535.8A 2022-12-02 2022-12-02 Human motion capture method in deformable scene based on neural deformation Pending CN115965765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211534535.8A CN115965765A (en) 2022-12-02 2022-12-02 Human motion capture method in deformable scene based on neural deformation

Publications (1)

Publication Number Publication Date
CN115965765A true CN115965765A (en) 2023-04-14

Family

ID=87359003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211534535.8A Pending CN115965765A (en) 2022-12-02 2022-12-02 Human motion capture method in deformable scene based on neural deformation

Country Status (1)

Country Link
CN (1) CN115965765A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612262A (en) * 2023-07-19 2023-08-18 武汉亘星智能技术有限公司 Automatic grid adjustment method, system, equipment and medium aligned with reference photo
CN116612262B (en) * 2023-07-19 2023-09-29 武汉亘星智能技术有限公司 Automatic grid adjustment method, system, equipment and medium aligned with reference photo

Similar Documents

Publication Publication Date Title
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
Tiwari et al. Pose-ndf: Modeling human pose manifolds with neural distance fields
Koestler et al. Tandem: Tracking and dense mapping in real-time using deep multi-view stereo
Dockstader et al. Multiple camera tracking of interacting and occluded human motion
EP2710557B1 (en) Fast articulated motion tracking
Salzmann et al. Combining discriminative and generative methods for 3d deformable surface and articulated pose reconstruction
Karunratanakul et al. A skeleton-driven neural occupancy representation for articulated hands
Pons-Moll et al. Model-based pose estimation
Bozic et al. Neural non-rigid tracking
JP2018520425A (en) 3D space modeling
JP2009514111A (en) Monocular tracking of 3D human motion using coordinated mixed factor analysis
CN113450396B (en) Three-dimensional/two-dimensional image registration method and device based on bone characteristics
CN115699088A (en) Generating three-dimensional object models from two-dimensional images
Li et al. Hybrik-x: Hybrid analytical-neural inverse kinematics for whole-body mesh recovery
CN113808047A (en) Human motion capture data denoising method
US20220327730A1 (en) Method for training neural network, system for training neural network, and neural network
CN115965765A (en) Human motion capture method in deformable scene based on neural deformation
KR20230150867A (en) Multi-view neural person prediction using implicit discriminative renderer to capture facial expressions, body posture geometry, and clothing performance
Wang et al. Deep nrsfm++: Towards unsupervised 2d-3d lifting in the wild
Huang et al. A bayesian approach to multi-view 4d modeling
Rabby et al. Beyondpixels: A comprehensive review of the evolution of neural radiance fields
Al Ismaeil et al. Real-time enhancement of dynamic depth videos with non-rigid deformations
CN116079727A (en) Humanoid robot motion simulation method and device based on 3D human body posture estimation
Jung et al. Fast point clouds upsampling with uncertainty quantification for autonomous vehicles
US20240013497A1 (en) Learning Articulated Shape Reconstruction from Imagery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination