CN117541646A - Motion capturing method and system based on parameterized model - Google Patents

Motion capturing method and system based on parameterized model

Info

Publication number
CN117541646A
CN117541646A
Authority
CN
China
Prior art keywords
human body
target person
person
coordinates
parameterized
Prior art date
Legal status
Pending
Application number
CN202311754272.6A
Other languages
Chinese (zh)
Inventor
陈靖涵
张鹏飞
苏江
Current Assignee
Dark Matter Beijing Intelligent Technology Co ltd
Original Assignee
Dark Matter Beijing Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Dark Matter Beijing Intelligent Technology Co ltd
Priority to CN202311754272.6A
Publication of CN117541646A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Graphics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion capture method and system based on a parameterized model. A human body detection module acquires RGB video, or RGBD video with matched depth information, and obtains bounding boxes for the target person and for the target person's two hands. From the region image inside the target person's bounding box, a foot contact detection module uses a binary classification model to decide whether each of the person's feet touches the ground. From the region image inside the target person's bounding box and the region images inside the two hand bounding boxes, a human body posture capture module captures and estimates the rotation value of each joint of the human body with a parameterized three-dimensional human body model. An absolute position estimation module obtains the target person's 3D coordinates in the camera coordinate system through an absolute position estimation algorithm. Finally, from the rotation values of the joints, the body's coordinates in the camera coordinate system and the binary classification results of whether the feet touch the ground, a data optimization module applies mean filtering and an inverse kinematics optimization algorithm to produce optimized rotation values that eliminate foot sliding and floating, together with optimized coordinates of the body in the camera coordinate system.

Description

Motion capturing method and system based on parameterized model
Technical Field
The invention relates to the technical field of computer vision and human motion capture, in particular to a motion capture method and system based on a parameterized model.
Background
Human body motion capture is an essential technology for digital humans and the metaverse, and relatively mature solutions already exist. Current motion capture techniques can capture fairly accurate motion without any worn equipment, using only a camera, which lowers the cost of motion capture compared with methods that require wearable devices.
However, most camera-based motion capture methods focus only on the limbs and ignore hand motion, and the captured results often suffer from foot floating and sliding, which degrades the visual quality.
Therefore, how to capture whole-body motion while eliminating foot sliding, so as to recover more realistic motion, is a problem that those skilled in the art need to solve.
Disclosure of Invention
In view of the above, the present invention provides a motion capture method and system based on a parameterized model to solve some of the technical problems mentioned in the background.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a motion capture method based on a parameterized model comprises the following steps:
s1, acquiring RGB video or RGBD video matched with depth information;
s2, positioning the positions of the target persons in the video picture, and simultaneously positioning the positions of the two hands of the target persons to obtain the target persons and the position boundary boxes of the two hands of the target persons;
s3, according to the regional image in the target person boundary box, a classification algorithm model is utilized to obtain a classification result of the person feet, and whether the person feet are in contact with the ground or not is judged;
capturing and estimating the rotation value of each joint point of the human body by utilizing the human body parameterized three-dimensional model according to the region image in the target person boundary frame and the region image of the target person double-hand boundary frame;
according to the regional image in the boundary frame of the target person, obtaining the 3D coordinates of the target person in a camera coordinate system through an absolute position estimation algorithm, and estimating the displacement information of the target person;
s4, according to the rotation value of each joint point of the human body, the coordinates of the human body in a camera coordinate system and the two classification results of whether the feet of the person are grounded or not, the rotation value for eliminating the sliding and floating of the feet and the coordinates of the human body in the camera coordinate system after optimization are obtained through an average value filtering process and an inverse kinematics optimization algorithm.
Preferably, the bounding boxes of the target person and the two hands in S2 are obtained with the mainstream object detection algorithm YOLO.
Preferably, in step S3 the binary classification model comprises a multi-layer perceptron (MLP) with an input layer, three hidden layers and an output layer; all five layers are fully connected, and the loss function of the classification model is the binary cross-entropy loss.
Preferably, the parameterized three-dimensional human body model in step S3 comprises an encoder, a spatial feature pyramid network and a regressor;
the encoder maps the input image to feature maps rich in semantic information, and the spatial feature pyramid network then extracts further features; finally, the regressor outputs the input parameters required by the parameterized three-dimensional human body model together with estimated camera parameters, the input parameters being the rotation values of the skeleton points. The 3D and 2D keypoint positions of the body are obtained through forward kinematics and the camera parameters and are used to compute the loss; the parameterized three-dimensional human body model is trained with a reconstruction loss function.
Preferably, the reconstruction loss function for training the parameterized three-dimensional human body model is specifically:
L_reg = λ_2d ||K - K_gt|| + λ_3d ||J - J_gt|| + λ_para ||Θ - Θ_gt||
where K denotes the 2D keypoint positions, J the 3D keypoint positions, Θ the input parameters of the parameterized three-dimensional human body model together with the camera parameters, λ the weights of the respective terms, and ||·|| the L2 norm.
Preferably, the absolute position estimation algorithm in step S3 comprises a backbone network and two regressors; the backbone network consists of several convolution layers and the regressors mainly of fully connected layers. The backbone network extracts features from the image, the features are fed to the two regressors to estimate, respectively, the camera parameters and the 3D coordinates relative to the root node, and the estimated camera parameters then convert the root-relative 3D coordinates into absolute 3D coordinates in the camera coordinate system.
Preferably, the loss function of the absolute position estimation algorithm is the L1 norm, specifically:
L = ||R - R_gt||_1
where R denotes the absolute 3D coordinates in the camera coordinate system.
Preferably, step S2 includes a single-person mode and a multi-person mode: in the single-person mode, if several people appear in the frame, only the bounding box occupying the largest share of the frame is output; in the multi-view mode, a matching algorithm matches the bounding boxes of the same person across the different views, locating that person in every view;
in step S3, in the multi-view mode, the classification results of the individual views are aggregated, and the result returned by the majority of views is taken as the classification result for the person's feet; the rotation values output for each view are fused into final rotation values by a multi-view fusion algorithm; and the mean of the 3D coordinates estimated from the views is taken as the target person's 3D coordinates in the camera coordinate system.
Preferably, step S4 specifically comprises:
S41, mean-filtering the rotation values of the joints of the human body and the body's coordinates in the camera coordinate system to remove jitter from the data;
S42, computing new foot keypoint positions by interpolation, according to the classification results of whether the left and right feet of the human body touch the ground;
S43, with the new foot keypoint positions as a constraint, optimizing the rotation values of the joints of the human body by iterative numerical optimization.
The motion capture system based on a parameterized model comprises a human body detection module, a foot contact detection module, a human body posture capture module, an absolute position estimation module and a data optimization module;
the human body detection module is used for collecting RGB video, or RGBD video with matched depth information, locating the position of the target person in the video frame together with the positions of the target person's two hands, and obtaining bounding boxes for the target person and for the target person's two hands;
the foot contact detection module is used for obtaining, from the region image inside the target person's bounding box, a binary classification result for the person's feet with a classification model, judging whether the feet are in contact with the ground;
the human body posture capture module is used for capturing and estimating the rotation value of each joint of the human body with the parameterized three-dimensional human body model, from the region image inside the target person's bounding box and the region images inside the two hand bounding boxes;
the absolute position estimation module is used for obtaining the target person's 3D coordinates in the camera coordinate system through an absolute position estimation algorithm, from the region image inside the target person's bounding box, and estimating the target person's displacement;
the data optimization module is used for obtaining, from the rotation values of the joints of the human body, the body's coordinates in the camera coordinate system and the binary classification results of whether the feet touch the ground, optimized rotation values that eliminate foot sliding and floating and optimized coordinates of the body in the camera coordinate system, through mean filtering and an inverse kinematics optimization algorithm.
Compared with the prior art, the motion capture method and system based on a parameterized model disclosed by the invention have the following advantages:
motion capture and the driving of digital virtual humans are achieved with a low-cost consumer RGB camera, and deployment is simple and fast;
an end-to-end motion capture system and data optimization method are provided; in actual operation, only RGB video needs to be input to obtain optimized motion capture data with a more realistic effect;
the optional multi-view scheme makes the motion capture results more accurate and stable;
detailed hand information is captured and, combined with the body information, can be applied in more practical scenarios;
the data optimization method further refines the motion capture results for a more lifelike and realistic driving effect; it is fast, flexible to configure and general, and can be applied to virtual digital human models with different skeleton structures after simple modification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention; for a person skilled in the art, other drawings can be derived from the provided drawings without inventive effort.
FIG. 1 is a schematic diagram of a motion capture method based on a parameterized model according to the present invention;
FIG. 2 is a schematic diagram of a human body posture estimation method based on a parameterized model provided by the invention;
FIG. 3 is a schematic diagram of a motion capture system based on a parameterized model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.
The embodiment of the invention discloses a motion capture method based on a parameterized model, as shown in FIG. 1, comprising the following steps:
s1, acquiring RGB video or RGBD video matched with depth information;
s2, positioning the positions of the target persons in the video picture, and simultaneously positioning the positions of the two hands of the target persons to obtain the target persons and the position boundary boxes of the two hands of the target persons;
s3, according to the regional image in the target person boundary box, a classification algorithm model is utilized to obtain a classification result of the person feet, and whether the person feet are in contact with the ground or not is judged;
capturing and estimating the rotation value of each joint point of the human body by utilizing the human body parameterized three-dimensional model according to the region image in the target person boundary frame and the region image of the target person double-hand boundary frame;
according to the regional image in the boundary frame of the target person, obtaining the 3D coordinates of the target person in a camera coordinate system through an absolute position estimation algorithm, and estimating the displacement information of the target person;
s4, according to the rotation value of each joint point of the human body, the coordinates of the human body in a camera coordinate system and the two classification results of whether the feet of the person are grounded or not, the rotation value for eliminating the sliding and floating of the feet and the coordinates of the human body in the camera coordinate system after optimization are obtained through an average value filtering process and an inverse kinematics optimization algorithm.
In order to further implement the above technical solution, the bounding boxes of the target person and the two hands in S2 are obtained with the mainstream object detection algorithm YOLO.
In this embodiment, the video captured by the camera is passed frame by frame through the YOLO detector, which outputs the position of the target person in each frame together with the positions of the two hands. The positions are represented as bounding boxes that must completely enclose the person and the hands; to guarantee this, the output bounding boxes are enlarged as a whole.
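As a concrete illustration, the per-frame detection loop could look like the following sketch. It assumes the ultralytics YOLO package and a hypothetical checkpoint person_hand.pt trained on person and hand classes; the enlargement factor 1.2 is likewise an assumed value, since the patent does not fix one.

```python
# Hedged sketch of the detection step; `person_hand.pt` and the 1.2
# enlargement factor are assumptions, not values from the patent.
import cv2
from ultralytics import YOLO

model = YOLO("person_hand.pt")  # hypothetical weights for person and hand classes

def enlarge(box, scale, w, h):
    """Grow a bounding box (x1, y1, x2, y2) about its center, clamped to
    the image, so that the person or hand is completely enclosed."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) * scale / 2, (y2 - y1) * scale / 2
    return max(0, cx - hw), max(0, cy - hh), min(w, cx + hw), min(h, cy + hh)

cap = cv2.VideoCapture("input.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    for det in model(frame)[0].boxes:  # one YOLO pass per frame
        x1, y1, x2, y2 = enlarge(det.xyxy[0].tolist(), 1.2, w, h)
        crop = frame[int(y1):int(y2), int(x1):int(x2)]
        # ... pass `crop` to the downstream classification and pose modules
cap.release()
```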
In order to further implement the above technical solution, in step S3 the binary classification model comprises a multi-layer perceptron (MLP) with an input layer, three hidden layers and an output layer; all five layers are fully connected, and the loss function of the classification model is the binary cross-entropy loss.
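A minimal PyTorch sketch of such a classifier is given below. The layer widths, the 512-dimensional input feature and the two-logit output (one per foot) are assumptions; the patent fixes only the layer count, the full connectivity and the binary cross-entropy loss.

```python
# Sketch of the foot-contact classifier; widths and the per-foot output
# layout are assumptions.
import torch
import torch.nn as nn

class FootContactMLP(nn.Module):
    def __init__(self, in_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),   # input layer
            nn.Linear(hidden, hidden), nn.ReLU(),   # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),   # hidden layer 2
            nn.Linear(hidden, hidden), nn.ReLU(),   # hidden layer 3
            nn.Linear(hidden, 2),                   # output layer: one logit per foot
        )

    def forward(self, x):
        return self.net(x)

model = FootContactMLP()
criterion = nn.BCEWithLogitsLoss()             # binary cross-entropy on the logits
feats = torch.randn(8, 512)                    # features from the person crop
labels = torch.randint(0, 2, (8, 2)).float()   # ground-truth contact per foot
loss = criterion(model(feats), labels)
loss.backward()
```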
In order to further implement the above technical solution, as shown in FIG. 2, the parameterized three-dimensional human body model in step S3 comprises an encoder, a spatial feature pyramid network and a regressor;
the encoder maps the input image to feature maps rich in semantic information, and the spatial feature pyramid network then extracts further features; finally, the regressor outputs the input parameters required by the parameterized three-dimensional human body model together with estimated camera parameters, the input parameters being the rotation values of the skeleton points. The 3D and 2D keypoint positions of the body are obtained through forward kinematics and the camera parameters and are used to compute the loss; the parameterized three-dimensional human body model is trained with a reconstruction loss function.
In order to further implement the above technical solution, the reconstruction loss function for training the parameterized three-dimensional human body model is specifically:
L_reg = λ_2d ||K - K_gt|| + λ_3d ||J - J_gt|| + λ_para ||Θ - Θ_gt||
where K denotes the 2D keypoint positions, J the 3D keypoint positions, Θ the input parameters of the parameterized three-dimensional human body model together with the camera parameters, λ the weights of the respective terms, and ||·|| the L2 norm.
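Transcribed directly into code, the loss could be computed as in this sketch; the λ values shown are placeholders, not weights given in the patent.

```python
# Sketch of L_reg = lam_2d*||K - K_gt|| + lam_3d*||J - J_gt|| + lam_para*||Theta - Theta_gt||;
# the weights are assumed placeholder values.
import torch

def reconstruction_loss(K, K_gt, J, J_gt, theta, theta_gt,
                        lam_2d=1.0, lam_3d=1.0, lam_para=0.1):
    """Weighted sum of L2-norm penalties on 2D keypoints, 3D keypoints,
    and the model input parameters together with the camera parameters."""
    return (lam_2d * torch.norm(K - K_gt)
            + lam_3d * torch.norm(J - J_gt)
            + lam_para * torch.norm(theta - theta_gt))
```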
In order to further implement the above technical solution, the absolute position estimation algorithm in step S3 comprises a backbone network and two regressors; the backbone network consists of several convolution layers and the regressors mainly of fully connected layers. The backbone network extracts features from the image, the features are fed to the two regressors to estimate, respectively, the camera parameters and the 3D coordinates relative to the root node, and the estimated camera parameters then convert the root-relative 3D coordinates into absolute 3D coordinates in the camera coordinate system.
In order to further implement the above technical solution, the loss function of the absolute position estimation algorithm is the L1 norm, specifically:
L = ||R - R_gt||_1
where R denotes the absolute 3D coordinates in the camera coordinate system.
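The two regression heads and the coordinate conversion can be illustrated as below. The backbone can be any convolutional feature extractor; the conversion rule shown, adding the estimated camera translation to the root-relative coordinates, is an assumption, since the patent does not state the exact formula.

```python
# Sketch of the two regressors and an (assumed) root-relative to absolute
# conversion; feat_dim and the camera parameterization are assumptions.
import torch
import torch.nn as nn

feat_dim = 256                        # dimensionality of the backbone features
cam_head = nn.Linear(feat_dim, 3)     # regressor 1: camera parameters
root_head = nn.Linear(feat_dim, 3)    # regressor 2: root-relative 3D coordinates

def absolute_coords(features):
    cam = cam_head(features)          # estimated camera parameters
    rel = root_head(features)         # 3D coordinates relative to the root node
    return rel + cam                  # assumed lifting to absolute camera-frame coords

# training loss L = ||R - R_gt||_1
l1_loss = nn.L1Loss()
```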
In order to further implement the above technical solution, step S2 includes a single-person mode and a multi-person mode: in the single-person mode, if several people appear in the frame, only the bounding box occupying the largest share of the frame is output; in the multi-view mode, a matching algorithm matches the bounding boxes of the same person across the different views, locating that person in every view;
in step S3, in the multi-view mode, the classification results of the individual views are aggregated, and the result returned by the majority of views is taken as the classification result for the person's feet; the rotation values output for each view are fused into final rotation values by a multi-view fusion algorithm; and the mean of the 3D coordinates estimated from the views is taken as the target person's 3D coordinates in the camera coordinate system.
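The three fusion rules can be illustrated as follows; the per-joint averaging of axis-angle rotations merely stands in for the unspecified multi-view fusion algorithm and is an assumption.

```python
# Sketch of the multi-view fusion rules; the rotation averaging is an
# assumed stand-in for the patent's unspecified fusion algorithm.
import numpy as np

def fuse_contacts(votes):
    """votes: (num_views, 2) boolean array, one column per foot; returns
    the decision of the majority of the views."""
    return votes.mean(axis=0) >= 0.5

def fuse_coordinates(coords):
    """coords: (num_views, 3); the mean of the per-view estimates is the
    target person's 3D coordinates in the camera coordinate system."""
    return coords.mean(axis=0)

def fuse_rotations(rots):
    """rots: (num_views, num_joints, 3) axis-angle values (assumed
    representation), averaged per joint across the views."""
    return rots.mean(axis=0)
```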
In order to further implement the above technical solution, step S4 specifically comprises the following steps (a condensed sketch is given after the list):
S41, mean-filtering the rotation values of the joints of the human body and the body's coordinates in the camera coordinate system to remove jitter from the data;
in practical applications the filtering handles zero-mean jitter well, but foot sliding and floating can still occur when the virtual digital human is actually driven, so an inverse-kinematics-based optimization is further applied:
S42, computing new foot keypoint positions by interpolation, according to the classification results of whether the left and right feet of the human body touch the ground;
S43, with the new foot keypoint positions as a constraint, optimizing the rotation values of the joints of the human body by iterative numerical optimization.
A motion capture system based on a parameterized model, as shown in FIG. 3, comprises a human body detection module, a foot contact detection module, a human body posture capture module, an absolute position estimation module and a data optimization module;
the human body detection module is used for collecting RGB video, or RGBD video with matched depth information, locating the position of the target person in the video frame together with the positions of the target person's two hands, and obtaining bounding boxes for the target person and for the target person's two hands;
the foot contact detection module is used for obtaining, from the region image inside the target person's bounding box, a binary classification result for the person's feet with a classification model, judging whether the feet are in contact with the ground;
the human body posture capture module is used for capturing and estimating the rotation value of each joint of the human body with the parameterized three-dimensional human body model, from the region image inside the target person's bounding box and the region images inside the two hand bounding boxes;
the absolute position estimation module is used for obtaining the target person's 3D coordinates in the camera coordinate system through an absolute position estimation algorithm, from the region image inside the target person's bounding box, and estimating the target person's displacement;
the data optimization module is used for obtaining, from the rotation values of the joints of the human body, the body's coordinates in the camera coordinate system and the binary classification results of whether the feet touch the ground, optimized rotation values that eliminate foot sliding and floating and optimized coordinates of the body in the camera coordinate system, through mean filtering and an inverse kinematics optimization algorithm.
A computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the motion capture method based on a parameterized model.
A processing terminal comprises a memory and a processor; the memory stores a computer program runnable on the processor, and the processor implements the motion capture method based on a parameterized model when executing the computer program.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments can be referred to one another. Since the disclosed system corresponds to the disclosed method, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A motion capture method based on a parameterized model, characterized by comprising the following steps:
S1, acquiring RGB video, or RGBD video with matched depth information;
S2, locating the position of the target person in the video frame and, at the same time, the positions of the target person's two hands, to obtain bounding boxes for the target person and for the target person's two hands;
S3, from the region image inside the target person's bounding box, obtaining a binary classification result for the person's feet with a classification model, judging whether each foot is in contact with the ground;
from the region image inside the target person's bounding box and the region images inside the two hand bounding boxes, capturing and estimating the rotation value of each joint of the human body with the parameterized three-dimensional human body model;
from the region image inside the target person's bounding box, obtaining the target person's 3D coordinates in the camera coordinate system through an absolute position estimation algorithm, and estimating the target person's displacement;
S4, from the rotation values of the joints of the human body, the body's coordinates in the camera coordinate system and the binary classification results of whether the feet touch the ground, obtaining, through mean filtering and an inverse kinematics optimization algorithm, optimized rotation values that eliminate foot sliding and floating and optimized coordinates of the body in the camera coordinate system.
2. The motion capture method based on a parameterized model of claim 1, characterized in that the bounding boxes of the target person and the two hands in S2 are obtained with the mainstream object detection algorithm YOLO.
3. The motion capture method based on a parameterized model of claim 1, characterized in that in step S3 the binary classification model comprises a multi-layer perceptron (MLP) with an input layer, three hidden layers and an output layer, all five layers being fully connected, and that the loss function of the classification model is the binary cross-entropy loss.
4. The motion capture method based on a parameterized model of claim 1, characterized in that the parameterized three-dimensional human body model in step S3 comprises an encoder, a spatial feature pyramid network and a regressor;
the encoder maps the input image to feature maps rich in semantic information, and the spatial feature pyramid network then extracts further features; finally, the regressor outputs the input parameters required by the parameterized three-dimensional human body model together with estimated camera parameters, the input parameters being the rotation values of the skeleton points; the 3D and 2D keypoint positions of the body are obtained through forward kinematics and the camera parameters and are used to compute the loss, and the parameterized three-dimensional human body model is trained with a reconstruction loss function.
5. The motion capture method based on a parameterized model of claim 4, characterized in that the reconstruction loss function for training the parameterized three-dimensional human body model is specifically:
L_reg = λ_2d ||K - K_gt|| + λ_3d ||J - J_gt|| + λ_para ||Θ - Θ_gt||
where K denotes the 2D keypoint positions, J the 3D keypoint positions, Θ the input parameters of the parameterized three-dimensional human body model together with the camera parameters, λ the weights of the respective terms, and ||·|| the L2 norm.
6. The motion capture method based on a parameterized model of claim 1, characterized in that the absolute position estimation algorithm in step S3 comprises a backbone network and two regressors; the backbone network consists of several convolution layers and the regressors mainly of fully connected layers; the backbone network extracts features from the image, the features are fed to the two regressors to estimate, respectively, the camera parameters and the 3D coordinates relative to the root node, and the estimated camera parameters then convert the root-relative 3D coordinates into absolute 3D coordinates in the camera coordinate system.
7. The motion capture method based on a parameterized model of claim 6, characterized in that the loss function of the absolute position estimation algorithm is the L1 norm, specifically:
L = ||R - R_gt||_1
where R denotes the absolute 3D coordinates in the camera coordinate system.
8. The motion capture method based on a parameterized model of claim 1, characterized in that step S2 includes a single-person mode and a multi-person mode, wherein in the single-person mode, if several people appear in the frame, only the bounding box occupying the largest share of the frame is output, and in the multi-view mode a matching algorithm matches the bounding boxes of the same person across the different views, locating that person in every view;
and in that, in step S3, in the multi-view mode, the classification results of the individual views are aggregated and the result returned by the majority of views is taken as the classification result for the person's feet; the rotation values output for each view are fused into final rotation values by a multi-view fusion algorithm; and the mean of the 3D coordinates estimated from the views is taken as the target person's 3D coordinates in the camera coordinate system.
9. The motion capture method based on a parameterized model of claim 1, characterized in that step S4 specifically comprises:
S41, mean-filtering the rotation values of the joints of the human body and the body's coordinates in the camera coordinate system to remove jitter from the data;
S42, computing new foot keypoint positions by interpolation, according to the classification results of whether the left and right feet of the human body touch the ground;
S43, with the new foot keypoint positions as a constraint, optimizing the rotation values of the joints of the human body by iterative numerical optimization.
10. A motion capture system based on a parameterized model, applying the motion capture method based on a parameterized model of any one of claims 1-9 and characterized by comprising a human body detection module, a foot contact detection module, a human body posture capture module, an absolute position estimation module and a data optimization module;
the human body detection module is used for collecting RGB video, or RGBD video with matched depth information, locating the position of the target person in the video frame together with the positions of the target person's two hands, and obtaining bounding boxes for the target person and for the target person's two hands;
the foot contact detection module is used for obtaining, from the region image inside the target person's bounding box, a binary classification result for the person's feet with a classification model, judging whether the feet are in contact with the ground;
the human body posture capture module is used for capturing and estimating the rotation value of each joint of the human body with the parameterized three-dimensional human body model, from the region image inside the target person's bounding box and the region images inside the two hand bounding boxes;
the absolute position estimation module is used for obtaining the target person's 3D coordinates in the camera coordinate system through an absolute position estimation algorithm, from the region image inside the target person's bounding box, and estimating the target person's displacement;
the data optimization module is used for obtaining, from the rotation values of the joints of the human body, the body's coordinates in the camera coordinate system and the binary classification results of whether the feet touch the ground, optimized rotation values that eliminate foot sliding and floating and optimized coordinates of the body in the camera coordinate system, through mean filtering and an inverse kinematics optimization algorithm.
Application CN202311754272.6A, priority and filing date 2023-12-20, title: Motion capturing method and system based on parameterized model, status: pending, publication: CN117541646A (en)

Priority Applications (1)

Application number: CN202311754272.6A; priority date: 2023-12-20; filing date: 2023-12-20; title: Motion capturing method and system based on parameterized model

Publications (1)

Publication number: CN117541646A (en); publication date: 2024-02-09

Family

ID=89792079

Family Applications (1)

Application number: CN202311754272.6A; priority date: 2023-12-20; filing date: 2023-12-20; published as CN117541646A (pending)

Country Status (1)

CN: CN117541646A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190286892A1 (en) * 2018-03-13 2019-09-19 Adobe Inc. Interaction Detection Model for Identifying Human-Object Interactions in Image Content
US11182924B1 (en) * 2019-03-22 2021-11-23 Bertec Corporation System for estimating a three dimensional pose of one or more persons in a scene
CN113033369A (en) * 2021-03-18 2021-06-25 北京达佳互联信息技术有限公司 Motion capture method, motion capture device, electronic equipment and computer-readable storage medium
WO2022241583A1 (en) * 2021-05-15 2022-11-24 电子科技大学 Family scenario motion capture method based on multi-target video
CN114550292A (en) * 2022-02-21 2022-05-27 东南大学 High-physical-reality human body motion capture method based on neural motion control
CN114519758A (en) * 2022-02-28 2022-05-20 广州虎牙科技有限公司 Method and device for driving virtual image and server
CN116386141A (en) * 2023-03-30 2023-07-04 南京大学 Multi-stage human motion capturing method, device and medium based on monocular video
CN116934972A (en) * 2023-07-26 2023-10-24 石家庄铁道大学 Three-dimensional human body reconstruction method based on double-flow network
CN116721471A (en) * 2023-08-10 2023-09-08 中国科学院合肥物质科学研究院 Multi-person three-dimensional attitude estimation method based on multi-view angles
CN116958355A (en) * 2023-08-21 2023-10-27 北京字跳网络技术有限公司 Action animation generation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINLIANG WEI et al.: "The Application of Motion Capture and 3D Skeleton Modeling in Virtual Fighting", Next Generation Computer Animation Techniques, conference paper, 1 November 2017 (2017-11-01) *
罗飘; 刘晓平: "Robust footprint detection for Kinect motion data" (面向Kinect运动数据的鲁棒足迹检测), 中国图象图形学报 (Journal of Image and Graphics), no. 02, 16 February 2016 (2016-02-16) *

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination