CN113569775A - Monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and storage medium


Info

Publication number
CN113569775A
CN113569775A (application CN202110880873.6A)
Authority
CN
China
Prior art keywords
point
positions
initial
rgb image
data set
Prior art date
Legal status
Pending
Application number
CN202110880873.6A
Other languages
Chinese (zh)
Inventor
杨凯航
李冬平
米楠
Current Assignee
Faceunity Technology Co ltd
Original Assignee
Faceunity Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Faceunity Technology Co., Ltd.
Priority to CN202110880873.6A
Publication of CN113569775A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N 3/00: Computing arrangements based on biological models
        • G06N 3/02: Neural networks
        • G06N 3/04: Architecture, e.g. interconnection topology
        • G06N 3/08: Learning methods
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T 13/00: Animation
        • G06T 13/20: 3D [Three Dimensional] animation
        • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Abstract

The invention discloses a monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, an electronic device, and a storage medium, belonging to the technical field of virtualization. The method comprises: acquiring the initial positions of all point locations of a user in an initial state, determining the serial number and name of each point location and the distance between adjacent point locations, and storing an initial attitude model; acquiring RGB image information and capturing the target positions of all point locations in the RGB image; and calculating, according to an IK algorithm, the initial positions of the point locations, and the target positions of all point locations in the image, the rotation angle of each pair of adjacent point locations relative to the initial state when the initial attitude model performs the same action as in the RGB image information, and driving the initial attitude model to perform that action. The invention realizes real-time 3D human body motion capture and visualization applications on the mobile terminal based on RGB input.

Description

Monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of virtualization, and particularly relates to a monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and a storage medium.
Background
Human body motion capture is a technology that detects the posture and motion trajectory of a human body in three-dimensional space and reproduces the human body's motion in a virtual three-dimensional environment. The motion capture technologies currently common in industry are mainly optical and inertial. Optical motion capture has high precision, down to the sub-millimeter level, but is very expensive and is generally used in fields such as automatic control and film animation. Inertial motion capture is cheaper but less precise than optical motion capture, and suffers from error accumulation and sensor magnetization, so it is generally used in fields with lower precision requirements. Although inertial motion capture devices are not extremely expensive, they have still been difficult to popularize among ordinary users, mainly because: 1) they require wearing companion equipment, which limits the usage scenarios; 2) their cost is still too high compared with already-popular devices such as mobile phones.
Disclosure of Invention
In order to solve the above problems, the present invention provides a monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, an electronic device, and a storage medium, wherein the method comprises:
acquiring initial positions of all point locations of a user in an initial state, determining the serial number and name of each point location and the distance between adjacent point locations, and storing an initial attitude model;
acquiring RGB image information, and capturing target positions of all point positions in the RGB image;
and calculating the rotation angle of each adjacent point position in the initial state when the initial attitude model performs the same action as the RGB image information according to an IK algorithm, the initial positions of the point positions and the target positions of all the point positions in the RGB image, and driving the initial attitude model to perform the same action as the RGB image information.
Preferably, the points include finger points and body points.
Preferably, the step of obtaining the finger point location includes:
detecting 2D key points of the finger by adopting a MobileNet V2 neural network on the RGB image of the hand region to obtain the 2D point position of the finger;
and adopting a fully connected neural network for the 2D point positions of the fingers, and obtaining the 3D point positions of the fingers through regression.
Preferably, the step of obtaining the body point location comprises:
collecting data;
constructing a backbone network model of the body point location;
according to the data, training a backbone network model of the body point location;
inputting the RGB image of the body area to the trained network model of the body point location to obtain the body point location of the RGB image.
Preferably, the data acquisition comprises a 3D data set and a 2D data set of the body;
the 3D data set includes:
collecting a 3D character model, and constructing a character 3D model data set;
character animation data are collected, and a basic action data set is constructed;
applying the basic action data set to the character 3D model data set and rendering with rendering software to obtain the 3D data set;
the 2D data set comprises a collected portrait video, and images rich in clothes, scenes and actions are selected from the portrait video.
Preferably, training the backbone network model of the body point location comprises:
using a loss function in the backbone network model
Figure BDA0003192235200000021
and training on the 2D data set to obtain all weight parameters of the backbone network model;
fixing all the weight parameters, and training on the 2D data set and the 3D data set with a loss function until convergence;
Figure BDA0003192235200000022
unfreezing the weights among the weight parameters, and training with the loss function Eall and the cross-supervision function Ecross until convergence;
Figure BDA0003192235200000031
where
Figure BDA0003192235200000032
is the nth point-location heat map output by the 2D branch of the network,
Figure BDA0003192235200000033
is the ground-truth point-location heat map in the 2D data set,
Figure BDA0003192235200000034
is the nth limb heat map output by the 2D branch of the network,
Figure BDA0003192235200000035
is the ground-truth limb heat map in the 2D data set;
Figure BDA0003192235200000036
is the nth point-location heat map output by the 3D branch of the network,
Figure BDA0003192235200000037
is the ground-truth point-location heat map in the 3D data set,
Figure BDA0003192235200000038
is the nth limb heat map output by the 3D branch of the network,
Figure BDA0003192235200000039
is the ground-truth limb heat map in the 3D data set; H3d∧2d denotes the point locations common to the 3D data set and the 2D data set; H2d∧3d denotes the point locations common to the 2D data set and the 3D data set.
Preferably, calculating, according to an IK algorithm, the initial positions of the point locations, and the target positions of all the point locations in the RGB image, the rotation angle of each pair of adjacent point locations relative to the initial state when the initial attitude model performs the same action as the RGB image information comprises:
obtaining all movable joints according to the initial positions of the point locations and the target positions of all the point locations in the RGB image;
calculating the rotation angle of the bone in each of the movable joints according to an IK algorithm, comprising:
placing the point location at the outer end of a movable joint at its corresponding target position;
moving the remaining point locations in the movable joint to their corresponding positions according to the formula
Figure BDA00031922352000000310
placing the point location at the inner end of the movable joint back at its initial position;
moving the remaining point locations in the movable joint to their corresponding positions according to the formula
Figure BDA00031922352000000311
where bx is the current position of any point location in the movable joint; bx-1 is the current position of the point location adjacent to bx; Bx is the target position corresponding to that point location; Bx-1 is the target position corresponding to the point location adjacent to Bx; x is a natural number;
judging whether the distance from the current position of each point location to its target position is smaller than a threshold value;
if so, calculating the rotation angle of the bone in each movable joint; if the distance from the current position of any point location to its target position is larger than the threshold value, iterating the above steps until the distance from the current position of each point location to its target position is smaller than the threshold value.
The embodiment of the invention provides a monocular RGB input-based mobile terminal real-time 3D human body motion capture system, which comprises:
the initial attitude model module is used for acquiring initial positions of all point locations of a user in an initial state, determining the serial number and the name of each point location and the distance between the adjacent point locations, and storing an initial attitude model;
the capture module is used for acquiring RGB image information and capturing target positions of all point positions in the RGB image;
and the driving module is used for calculating the rotation angle of each adjacent point position in the initial state when the initial attitude model performs the same action as the RGB image information according to an IK algorithm, the initial position of the point position and the target positions of all the point positions in the RGB image, and driving the initial attitude model to perform the same action as the RGB image information.
An embodiment of the present invention provides an electronic device, which includes at least one processing unit and at least one storage unit, where the storage unit stores a program, and when the program is executed by the processing unit, the processing unit is enabled to execute the method described above.
An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program executable by an electronic device, and when the program runs on the electronic device, the program causes the electronic device to execute the method described above.
Compared with the prior art, the invention has the beneficial effects that:
the invention constructs a model for detecting the point location of the human body, and is used for reducing the 3D data acquisition cost and the training difficulty of the End-To-End method.
Drawings
Fig. 1 is a schematic flow chart of a monocular RGB input-based mobile-end real-time 3D human body motion capture method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Referring to fig. 1, a monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, an electronic device, and a storage medium are provided, the method comprising:
acquiring initial positions of all point locations of a user in an initial state, determining the serial number and name of each point location and the distance between adjacent point locations, and storing an initial attitude model;
acquiring RGB image information, and capturing target positions of all point positions in the RGB image;
and calculating the rotation angle of each adjacent point in the initial state when the initial attitude model performs the same action as the RGB image information according to the IK algorithm, the initial positions of the point positions and the target positions of all the point positions in the image, and driving the initial attitude model to perform the same action as the RGB image information.
In this embodiment, the point locations include a finger point location and a body point location, and the step of obtaining the finger point location includes:
detecting 2D key points of the finger by adopting a MobileNet V2 neural network on the RGB image of the hand region to obtain the 2D point position of the finger;
and adopting a fully connected neural network for the 2D point positions of the fingers, and obtaining the 3D point positions of the fingers through regression.
In this embodiment, the number of the hand positions detected by this method is 21.
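As an illustration of the 2D-to-3D regression step, the sketch below lifts the 21 detected 2D finger point locations to 3D with a small fully connected network. The hidden width (128), the random weights, and the function name are assumptions for illustration only; the patent does not specify the layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def lift_2d_to_3d(kp2d, w1, b1, w2, b2):
    """Regress 21 3D finger point locations from 21 2D point locations
    with a two-layer fully connected network (sizes are illustrative)."""
    x = kp2d.reshape(-1)              # flatten 21 x 2 -> 42 inputs
    h = np.maximum(w1 @ x + b1, 0.0)  # ReLU hidden layer
    y = w2 @ h + b2                   # 63 outputs = 21 x 3
    return y.reshape(21, 3)

# Randomly initialised weights, only to demonstrate the tensor shapes;
# in practice these would be trained on labelled finger data.
w1, b1 = rng.normal(size=(128, 42)) * 0.1, np.zeros(128)
w2, b2 = rng.normal(size=(63, 128)) * 0.1, np.zeros(63)

kp2d = rng.uniform(0.0, 1.0, size=(21, 2))  # normalised 2D detections
kp3d = lift_2d_to_3d(kp2d, w1, b1, w2, b2)
```

A trained network would replace the random weights; the 21-point count matches the number of hand point locations stated above.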
Further, the step of obtaining the body point location comprises:
collecting data;
constructing a backbone network model of the body point location;
in the present embodiment, the network inputs the RGB images with size of 256 × 192 and outputs the RGB images as the limb heat map L2d、L3dAnd dot heat map H2d、H3d. The limb heat map is a rectangular area covering the corresponding limb, for example, the limb heat map of the lower leg is a rectangular area covering the range of the lower leg, wherein L2dAnd L3dDifferent, L2dEach limb corresponds to a heat map of 1, in which the values within the rectangular area of the limb are 255 and the values outside the range are 0, L3dEach limb corresponds to 3-degree of heating, and the values in the rectangular area range of the limb are 3D of the corresponding limbNormalized value of direction (ankle 3D coordinates minus knee joint 3D coordinates if calf): x, y, z. The point location heat map is a two-dimensional Gaussian distribution covering corresponding joint points, the position of a Gaussian distribution peak is the 2D position of the joint points, and each joint point is only provided with 1 point location heat map. In the network structure, H2D branching to use H only2D branch of dataset supervised training, H3D branching to use H3The D dataset supervises the branches of the training.
Training a backbone network model of the body point location according to the data;
and inputting the RGB images of the body area to the trained network model of the body point location to obtain the body point location of the RGB images.
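Conversely, point locations can be read back out of the network outputs. The decoding below (argmax of the Gaussian peak for the 2D position, averaging the L3d channels for the limb direction) is a hypothetical sketch, since the patent does not spell out this step.

```python
import numpy as np

def decode_point(heat_map):
    """2D joint position = pixel of the Gaussian peak in the point heat map."""
    y, x = np.unravel_index(np.argmax(heat_map), heat_map.shape)
    return int(x), int(y)

def decode_limb_direction(l3d, mask):
    """Average the three L3d channels over the limb rectangle and renormalise
    to recover the limb's 3D direction."""
    d = np.array([channel[mask].mean() for channel in l3d])
    n = np.linalg.norm(d)
    return d / n if n > 0 else d
```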
In this embodiment, the data acquisition comprises a 3D data set and a 2D data set of the body;
the 3D data set includes:
collecting or modeling realistic 3D character models to construct a character 3D model data set;
collecting character animation data, and additionally collecting a batch of human animation data with inertial motion capture equipment, to construct a basic action data set;
rendering the human body animations with commonly used rendering software (Blender, UE4, Unity, etc.);
in an actual use scenario, when actions that cannot be accurately detected are encountered, an animator only needs to manually keyframe the corresponding animations to extend the data set.
The 2D data set comprises collected portrait videos, from which images rich in clothing, scenes, and actions are selected. In actual use, when 2D point locations of the human body cannot be accurately detected, portrait images of the corresponding scenes and clothing can be further collected and labeled with 2D point locations to expand the data set.
In this embodiment, the 3D data set H3D contains 31 point locations and the 2D data set H2D contains 25 point locations. The order and meaning of the point locations in the two data sets are as follows:
No.   H2D point location            H3D point location
0     Right hip                     Right hip
1     Right knee                    Right knee
2     Right ankle                   Right ankle
3     Left hip                      Left hip
4     Left knee                     Left knee
5     Left ankle                    Left ankle
6     Head top                      Head top
7     Right shoulder                Right shoulder
8     Right elbow                   Right elbow
9     Right wrist                   Right wrist
10    Left shoulder                 Left shoulder
11    Left elbow                    Left elbow
12    Left wrist                    Left wrist
13    Right ear                     Right ear
14    Nose tip                      Nose tip
16    Root of right thumb           Root of right thumb
17    Root of right little finger   Root of right little finger
18    Root of left thumb            Root of left thumb
19    Root of left little finger    Root of left little finger
20    Right heel                    Right heel
21    Right tiptoe                  Right tiptoe
22    Left heel                     Left heel
23    Left tiptoe                   Left tiptoe
24    Laryngeal prominence          Neck
25    -                             Midpoint of the two hips
26    -                             Navel
27    -                             Chest
28    -                             Laryngeal prominence
29    -                             Right scapula
30    -                             Left scapula
In this embodiment, training the backbone network model of the body point location includes:
using a loss function in the backbone network model
Figure BDA0003192235200000071
and training on the 2D data set to obtain all weight parameters of the backbone network model;
fixing all the weight parameters, and training on the 2D data set and the 3D data set with a loss function until convergence;
Figure BDA0003192235200000072
unfreezing the weights among the weight parameters, and training with the loss function Eall and the cross-supervision function Ecross until convergence;
Figure BDA0003192235200000073
where
Figure BDA0003192235200000074
is the nth point-location heat map output by the 2D branch of the network,
Figure BDA0003192235200000075
is the ground-truth point-location heat map in the 2D data set,
Figure BDA0003192235200000076
is the nth limb heat map output by the 2D branch of the network,
Figure BDA0003192235200000077
is the ground-truth limb heat map in the 2D data set;
Figure BDA0003192235200000078
is the nth point-location heat map output by the 3D branch of the network,
Figure BDA0003192235200000079
is the ground-truth point-location heat map in the 3D data set,
Figure BDA00031922352000000710
is the nth limb heat map output by the 3D branch of the network,
Figure BDA00031922352000000711
is the ground-truth limb heat map in the 3D data set; H3d∧2d denotes the point locations common to the 3D data set and the 2D data set; H2d∧3d denotes the point locations common to the 2D data set and the 3D data set.
Calculating, according to the IK algorithm, the initial positions of the point locations, and the target positions of all the point locations in the image, the rotation angle of each pair of adjacent point locations relative to the initial state when the initial attitude model performs the same action as the RGB image information comprises the following steps:
obtaining all movable joints according to the initial positions of the point locations and the target positions of all the point locations in the RGB image;
calculating the rotation angle of the bone in each of the movable joints according to an IK algorithm, comprising:
placing the point location at the outer end of a movable joint at its corresponding target position;
moving the remaining point locations in the movable joint to their corresponding positions according to the formula
Figure BDA0003192235200000081
placing the point location at the inner end of the movable joint back at its initial position;
moving the remaining point locations in the movable joint to their corresponding positions according to the formula
Figure BDA0003192235200000082
where bx is the current position of any point location in the movable joint; bx-1 is the current position of the point location adjacent to bx; Bx is the target position corresponding to that point location; Bx-1 is the target position corresponding to the point location adjacent to Bx; x is a natural number;
judging whether the distance from the current position of each point location to its target position is smaller than a threshold value;
if so, calculating the rotation angle of the bone in each movable joint; if the distance from the current position of any point location to its target position is larger than the threshold value, iterating the above steps until the distance from the current position of each point location to its target position is smaller than the threshold value.
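The two-pass procedure above matches the FABRIK family of IK solvers: a forward pass pins the outer end to the target, a backward pass pins the inner end back to its initial position, and the passes repeat until the end point is close enough. A minimal sketch for a single chain ("movable joint") follows; the tolerance, iteration cap, and function name are chosen for illustration.

```python
import numpy as np

def solve_chain(points, target, root, tol=1e-3, max_iter=50):
    """FABRIK-style solve for one chain. Bone lengths between adjacent
    point locations are preserved at every step."""
    p = np.array(points, dtype=float)
    lengths = np.linalg.norm(np.diff(p, axis=0), axis=1)
    for _ in range(max_iter):
        p[-1] = target                        # outer end -> target position
        for i in range(len(p) - 2, -1, -1):   # drag the rest along, outer to inner
            d = p[i] - p[i + 1]
            p[i] = p[i + 1] + d / np.linalg.norm(d) * lengths[i]
        p[0] = root                           # inner end -> initial position
        for i in range(1, len(p)):            # drag the rest along, inner to outer
            d = p[i] - p[i - 1]
            p[i] = p[i - 1] + d / np.linalg.norm(d) * lengths[i - 1]
        if np.linalg.norm(p[-1] - np.asarray(target, dtype=float)) < tol:
            break
    return p

chain = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # two bones of length 1
solved = solve_chain(chain, target=[1.2, 1.2], root=[0.0, 0.0])
```

The joint rotation angles that drive the attitude model can then be read off from the solved positions, e.g. via atan2 of each bone vector.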
The embodiment of the invention provides a monocular RGB input-based mobile terminal real-time 3D human body motion capture system, which comprises:
the initial attitude model module is used for acquiring initial positions of all point locations of a user in an initial state, determining the serial number and name of each point location and the distance between adjacent point locations, and storing an initial attitude model;
the capture module is used for acquiring RGB image information and capturing target positions of all point positions in the RGB image;
and the driving module is used for calculating the rotation angle of each adjacent point position in the initial state when the initial attitude model performs the same action as the RGB image information according to the IK algorithm, the initial positions of the point positions and the target positions of all the point positions in the image, and driving the initial attitude model to perform the same action as the RGB image information.
An embodiment of the present invention provides an electronic device, which includes at least one processing unit and at least one storage unit, where the storage unit stores a program, and when the program is executed by the processing unit, the processing unit is enabled to execute the method.
An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program executable by an electronic device, and when the program runs on the electronic device, the electronic device is caused to execute the method described above.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A monocular RGB input-based real-time 3D human body motion capture method for a mobile terminal is characterized by comprising the following steps:
acquiring initial positions of all point locations of a user in an initial state, determining the serial number and name of each point location and the distance between adjacent point locations, and storing an initial attitude model;
acquiring RGB image information, and capturing target positions of all point positions in the RGB image;
and calculating the rotation angle of each adjacent point position in the initial state when the initial attitude model performs the same action as the RGB image information according to an IK algorithm, the initial positions of the point positions and the target positions of all the point positions in the RGB image, and driving the initial attitude model to perform the same action as the RGB image information.
2. The monocular RGB input-based mobile terminal real-time 3D human body motion capture method of claim 1, wherein the points comprise finger points and body points.
3. The monocular RGB input based mobile terminal real-time 3D human body motion capture method of claim 2, wherein the finger point location obtaining step comprises:
detecting 2D key points of the finger by adopting a MobileNet V2 neural network on the RGB image of the hand region to obtain the 2D point position of the finger;
and adopting a fully connected neural network for the 2D point positions of the fingers, and obtaining the 3D point positions of the fingers through regression.
4. The monocular RGB input based mobile terminal real-time 3D human body motion capture method of claim 2, wherein the body point location obtaining step comprises:
collecting data;
constructing a backbone network model of the body point location;
according to the data, training a backbone network model of the body point location;
inputting the RGB image of the body area to the trained network model of the body point location to obtain the body point location of the RGB image.
5. The monocular RGB input based mobile end real-time 3D human body motion capture method of claim 4, wherein the data collection comprises a 3D dataset and a 2D dataset of a body;
the 3D data set includes:
collecting a 3D character model, and constructing a character 3D model data set;
character animation data are collected, and a basic action data set is constructed;
applying the basic action data set to the character 3D model data set and rendering with rendering software to obtain the 3D data set;
the 2D data set comprises a collected portrait video, and images rich in clothes, scenes and actions are selected from the portrait video.
6. The monocular RGB input based mobile terminal real-time 3D human body motion capture method of claim 4, wherein training the backbone network model of the body point location comprises:
using a loss function in the backbone network model
Figure FDA0003192235190000021
and training on the 2D data set to obtain all weight parameters of the backbone network model;
fixing all the weight parameters, and training on the 2D data set and the 3D data set with a loss function until convergence;
Figure FDA0003192235190000022
unfreezing the weights among the weight parameters, and training with the loss function Eall and the cross-supervision function Ecross until convergence;
Figure FDA0003192235190000023
where
Figure FDA0003192235190000024
is the nth point-location heat map output by the 2D branch of the network,
Figure FDA0003192235190000025
is the ground-truth point-location heat map in the 2D data set,
Figure FDA0003192235190000026
is the nth limb heat map output by the 2D branch of the network,
Figure FDA0003192235190000027
is the ground-truth limb heat map in the 2D data set;
Figure FDA0003192235190000028
is the nth point-location heat map output by the 3D branch of the network,
Figure FDA0003192235190000029
is the ground-truth point-location heat map in the 3D data set,
Figure FDA00031922351900000210
is the nth limb heat map output by the 3D branch of the network,
Figure FDA00031922351900000211
is the ground-truth limb heat map in the 3D data set; H3d∧2d denotes the point locations common to the 3D data set and the 2D data set; H2d∧3d denotes the point locations common to the 2D data set and the 3D data set.
7. The method as claimed in claim 6, wherein calculating, according to an IK algorithm, the initial positions of the point locations, and the target positions of all the point locations in the RGB image, the rotation angle of each pair of adjacent point locations relative to the initial state when the initial attitude model performs the same action as the RGB image information comprises:
obtaining all movable joints according to the initial positions of the point locations and the target positions of all the point locations in the RGB image;
calculating the rotation angle of the bone in each of the movable joints according to an IK algorithm, comprising:
placing the point location at the outer end of a movable joint at its corresponding target position;
moving the remaining point locations in the movable joint to their corresponding positions according to the formula
Figure FDA0003192235190000031
placing the point location at the inner end of the movable joint back at its initial position;
moving the remaining point locations in the movable joint to their corresponding positions according to the formula
Figure FDA0003192235190000032
where bx is the current position of any point location in the movable joint; bx-1 is the current position of the point location adjacent to bx; Bx is the target position corresponding to that point location; Bx-1 is the target position corresponding to the point location adjacent to Bx; x is a natural number;
judging whether the distance from the current position of each point location to its target position is smaller than a threshold value;
if so, calculating the rotation angle of the bone in each movable joint; if the distance from the current position of any point location to its target position is larger than the threshold value, iterating the above steps until the distance from the current position of each point location to its target position is smaller than the threshold value.
8. A system for real-time 3D human body motion capture on a mobile terminal based on monocular RGB input, characterized by comprising:
an initial pose model module, configured to acquire the initial positions of all point locations of a user in an initial state, determine the serial number and name of each point location and the distance between adjacent point locations, and store an initial pose model;
a capture module, configured to acquire RGB image information and capture the target positions of all point locations in the RGB image;
and a driving module, configured to calculate, according to an IK algorithm, the initial positions of the point locations, and the target positions of all point locations in the RGB image, the rotation angle between each pair of adjacent point locations relative to the initial state when the initial pose model performs the same motion as in the RGB image information, and to drive the initial pose model to perform that motion.
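Viewed as software, the three modules of claim 8 compose a simple per-frame pipeline: capture target positions from an RGB frame, then solve IK against the stored initial pose. The sketch below is a hypothetical layout (all class and method names are invented here, not from the patent), with the capture and driving steps passed in as plain callables so any detector or IK solver can be plugged in.

```python
class InitialPoseModel:
    """Stores each point location's number, name, and initial position."""
    def __init__(self, point_locations):
        # point number -> (name, initial position)
        self.point_locations = dict(point_locations)

class MotionCaptureSystem:
    def __init__(self, pose_model, capture, drive):
        self.pose_model = pose_model  # initial pose model module
        self.capture = capture        # capture module: RGB frame -> target positions
        self.drive = drive            # driving module: IK solve -> rotation angles

    def process_frame(self, rgb_frame):
        # Capture target positions from the frame, then drive the model.
        targets = self.capture(rgb_frame)
        return self.drive(self.pose_model.point_locations, targets)
```

In a real mobile deployment the capture callable would wrap a pose-estimation network and the drive callable an IK solver; here they are kept abstract.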
9. An electronic device, comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the method of any one of claims 1 to 7.
10. A storage medium storing a computer program executable by an electronic device, the program, when run on the electronic device, causing the electronic device to perform the method of any one of claims 1 to 7.
CN202110880873.6A 2021-08-02 2021-08-02 Monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and storage medium Pending CN113569775A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110880873.6A CN113569775A (en) 2021-08-02 2021-08-02 Monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113569775A true CN113569775A (en) 2021-10-29

Family

ID=78169936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110880873.6A Pending CN113569775A (en) 2021-08-02 2021-08-02 Monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113569775A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294623A (en) * 2022-06-28 2022-11-04 北京聚力维度科技有限公司 Human body whole body motion capture method and device, storage medium and terminal


Similar Documents

Publication Publication Date Title
CN111460875B (en) Image processing method and apparatus, image device, and storage medium
Von Marcard et al. Sparse inertial poser: Automatic 3d human pose estimation from sparse imus
CN102971768B (en) Posture state estimation unit and posture state method of estimation
CN103210421B (en) Article detection device and object detecting method
KR101711736B1 (en) Feature extraction method for motion recognition in image and motion recognition method using skeleton information
JP2019096113A (en) Processing device, method and program relating to keypoint data
CN107590708B (en) Method and device for generating user specific body shape model
CN104813340A (en) System and method for deriving accurate body size measures from a sequence of 2d images
KR20150024899A (en) Avatar construction using depth camera
CN114495177A (en) Scene interactive human body action and balance intelligent evaluation method and system
CN105739703A (en) Virtual reality somatosensory interaction system and method for wireless head-mounted display equipment
CN109655011B (en) Method and system for measuring dimension of human body modeling
CN108153421A (en) Body feeling interaction method, apparatus and computer readable storage medium
WO2020147791A1 (en) Image processing method and device, image apparatus, and storage medium
CN106530064B (en) System and method for evaluating fitting simulation wearing comfort of shoulders
CN113569775A (en) Monocular RGB input-based mobile terminal real-time 3D human body motion capture method and system, electronic equipment and storage medium
CN114022645A (en) Action driving method, device, equipment and storage medium of virtual teacher system
CN106773050A (en) A kind of intelligent AR glasses virtually integrated based on two dimensional image
WO2020147797A1 (en) Image processing method and apparatus, image device, and storage medium
CN113229807A (en) Human body rehabilitation evaluation device, method, electronic device and storage medium
KR101787255B1 (en) Facial expression recognition method based on ratio of facial ladnmark's distance
CN112102451B (en) Wearable virtual live broadcast method and equipment based on common camera
CN112990089B (en) Method for judging human motion gesture
CN211180839U (en) Motion teaching equipment and motion teaching system
CN115880766A (en) Method and device for training posture migration and posture migration models and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination