CN112308952B - 3D character motion generation system and method for imitating human motion in given video - Google Patents

3D character motion generation system and method for imitating human motion in given video

Info

Publication number
CN112308952B
CN112308952B
Authority
CN
China
Prior art keywords
mesh
sequence
human body
action
mesh2mesh
Prior art date
Legal status
Active
Application number
CN202011101066.1A
Other languages
Chinese (zh)
Other versions
CN112308952A (en)
Inventor
姜育刚
傅宇倩
付彦伟
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202011101066.1A priority Critical patent/CN112308952B/en
Publication of CN112308952A publication Critical patent/CN112308952A/en
Application granted granted Critical
Publication of CN112308952B publication Critical patent/CN112308952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Abstract

The invention belongs to the technical field of computers, and specifically relates to a 3D character action generation system and method for imitating the motion of a person in a given video. The system comprises four modules: initial human body reconstruction, mesh cuboid construction, mesh2mesh smoothing, and human body pose transfer. For a video containing a source human action, the initial human body reconstruction module recovers the source mesh sequence of the actor; the mesh cuboid construction module arranges the initial mesh sequence into a regular data cuboid; the mesh2mesh smoothing module further smooths the initial mesh sequence with 3D convolutions, making the motion of the mesh sequence more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character. The invention can generate a mesh sequence consistent with the motion in the source video and improves the temporal coherence of the mesh sequence.

Description

3D character motion generation system and method for simulating human motion in given video
Technical Field
The invention belongs to the technical field of computers, and specifically relates to a 3D character action generation system and method.
Background
Character action generation is of great practical significance for many computer vision tasks, including multimedia interaction and visual information understanding. Humans are very good at learning and imitating actions from a few examples, an ability that itself plays a key role in human intelligence. It is therefore desirable for a 3D character to likewise learn to imitate actions from video samples and generate the same action sequence as the source video.
Despite the research value of this task, there is still little work that addresses it directly. The most closely related lines of work are character animation control and imitation learning.
Character animation control ([1], [2], [3], [4]) mainly studies how to make a static 3D character perform a specific motion and thereby produce an animation effect. Such methods are character-specific: the animation must be defined according to the skeletal structure of the particular 3D character.
Imitation learning ([5], [6], [7], [8]) mainly studies how to give intelligent robots the ability to imitate human behavior. These methods extract and summarize human knowledge, that is, they learn and analyze the behavior of human players in different environments to abstract logical ability, and then apply it to new environments. Imitation learning is mainly oriented toward intelligent machine systems and usually requires corresponding hardware devices or a virtual environment to support the experiments.
Different from existing methods, the invention introduces the mesh as the representation of the 3D model, and then leverages existing research results in human 3D reconstruction and human pose transfer to bridge the large representation gap between the video domain and the character domain, so that a target 3D character can generate the same action sequence as the video. In addition, to improve the temporal coherence of the generated mesh sequence, the invention proposes a mesh2mesh smoothing network based on 3D convolution to improve the quality of the mesh sequence produced by the initial human body reconstruction model. Compared with character animation control methods, the invention can automatically acquire motion information from a video example and then imitate it; compared with imitation learning methods, the invention requires no additional equipment and relies entirely on deep network models to transfer and imitate actions. Experiments show that the method can generate a mesh sequence consistent with the motion in the source video and substantially improve its temporal coherence.
Disclosure of Invention
The invention aims to provide a 3D character action generation system and method capable of imitating the actions of a person in a given video.
To reduce the large representation gap between video and character, the invention first proposes to use the mesh (grid) as a unified 3D model representation, i.e., every 3D model involved in the invention is represented as a mesh. A mesh is defined by a set of vertices and faces: each vertex carries the information (id, x, y, z), where id is the vertex index and x, y, z are the 3D coordinates of the vertex; each face describes the connection relationship between vertices in the form (id1, id2, id3), i.e., the indices of the three vertices that the face connects.
The 3D character action generation system capable of imitating the human actions in a given video is based on deep learning and mainly comprises the following four modules: (1) an initial human body reconstruction module; (2) a mesh cuboid construction module; (3) a mesh2mesh smoothing module; and (4) a human body pose transfer module. Given a segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then arranges the initial mesh sequence into a regular data cuboid (the mesh cuboid); to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character.
In the invention, the initial human body reconstruction module adopts existing mesh-based human 3D reconstruction models (references [10] and [11]). Its input is the source video frame images together with the corresponding OpenPose results, and its output is the initial human mesh sequence. The source video frame images are obtained by extracting and down-sampling frames from the input video data; the corresponding OpenPose results are obtained by using the OpenPose model (reference [9]) to extract the skeletal joint points of the actor in each frame, which determine the region of the human body within the whole frame.
In the invention, the mesh cuboid construction module arranges the initial mesh sequence into a regular data cuboid that serves as the input of the subsequent mesh2mesh smoothing module. The mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single mesh vertex.
In the invention, the deep neural network of the mesh2mesh smoothing module is mainly formed by stacked 3D convolution layers, and the mesh sequence is smoothed by exploiting the spatio-temporal representation capability of 3D convolution.
In the mesh2mesh smoothing module, the loss function consists of two parts:
(1) The joint loss function $L_{j3d}$ measures the deviation between the joint positions $\hat{J}_t$ regressed from the predicted mesh and the labeled joint positions $J_t$, taking the mean of the L2 distances over all joints as the error value. Its mathematical form is:

$L_{j3d} = \frac{1}{k} \sum_{i=1}^{k} \left\| \hat{J}_t^{\,i} - J_t^{\,i} \right\|_2$

where k is the total number of body joints and $\|\cdot\|$ denotes the Euclidean distance;
(2) The motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint $\hat{J}_t$ between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the displacement vectors as the error value. Its mathematical form is:

$L_{motion} = \frac{1}{k} \sum_{i=1}^{k} \left\| (\hat{J}_{t+1}^{\,i} - \hat{J}_t^{\,i}) - (J_{t+1}^{\,i} - J_t^{\,i}) \right\|_2$

where $\|\cdot\|$ denotes the Euclidean distance;
The final loss function is therefore: $L_{j3d} + L_{motion}$.
In the invention, the human body pose transfer module adopts a pose transfer network (reference [13]) to generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh. Since an action is produced by a continuously changing pose, transferring the poses of the whole source mesh sequence onto the target mesh in order makes the target mesh character perform an action consistent with the source video.
The invention provides a 3D character action generation method capable of imitating the human actions in a given video, comprising the following specific steps:
(1) First, perform frame extraction and down-sampling on the input video data. Considering that source videos are long and the action difference between adjacent frames is small, each source video is down-sampled at a certain frequency to obtain image frames;
(2) Extract the skeletal joint points of the actor in each frame with the OpenPose model (reference [9]) to determine the region of the human body within the whole frame;
(3) Input the source video frames and the corresponding OpenPose results into the initial human body reconstruction module and predict the initial human mesh sequence; in particular, any mesh-based image or video human 3D reconstruction method can serve as the initial human body reconstruction module, so the method has strong generality;
(4) Arrange the initial mesh sequence into a regular data cuboid, the mesh cuboid, which serves as the input of the subsequent network; the mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T is the number of meshes in the sequence, N is the number of vertices in a single mesh, and 3 is the coordinate dimension of a single mesh vertex;
(5) Input the constructed mesh cuboid into the mesh2mesh smoothing module to obtain the smoothed 3D mesh sequence;
(5.1) Train the mesh2mesh network, continuously optimizing it with the data in the training set; for each mesh in the smoothed 3D mesh sequence, first use the joint regressor provided by the SMPL model (reference [12]) to regress the corresponding 3D joint coordinates, then compare them with the ground-truth human joint annotations, compute the joint loss and the motion loss from the difference, and optimize the network parameters accordingly;
(5.2) In the testing stage, input the initial 3D mesh sequence of the test video obtained in step (1) into the mesh2mesh smoothing module trained in step (5.1) to obtain the smoothed 3D mesh sequence;
(6) Regard each mesh in the smoothed 3D mesh sequence as a source mesh with a specific pose, input it together with the target mesh into the human body pose transfer module, and generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh.
The method is evaluated as follows: the mesh2mesh smoothing module is evaluated quantitatively with MPJPE, the positional deviation between the joint positions regressed from the mesh and the labeled joints; the final mesh sequence is evaluated qualitatively by visualization, since the overall task is generative and has no corresponding annotations.
In summary, the innovations of the invention are as follows:
1. The invention is the first to propose the visual task of using the mesh as the 3D model representation to learn an action sequence from video so that a 3D character performs the same action. This task is important for multimedia interaction and visual action understanding;
2. The invention proposes to realize video action extraction and character action generation by combining a 3D human reconstruction method with a human pose transfer method. Video actions can be effectively transferred to the target 3D character, the whole generation process relies entirely on deep learning, and no additional equipment or specific virtual environment is needed, providing a new solution for character action generation;
3. For the problem of unsmooth mesh sequences, the invention first proposes the concept of the mesh cuboid, which simplifies the originally complex mesh sequence, and proposes a 3D-convolution-based mesh2mesh smoothing module to model the spatio-temporal information, thereby optimizing the temporal coherence of the meshes. In addition, besides the joint loss function commonly used in human body reconstruction, the invention proposes a motion loss function targeting temporal change, which is used together with the joint loss function to optimize the network parameters.
Drawings
Fig. 1 is a schematic diagram of the 3D character action generation method proposed by the invention. Given a source video (first row) and target mesh human characters (first column), the method consists of two main stages: recovering the initial 3D mesh sequence from the video, and transferring the pose of the source mesh onto the target mesh. By introducing the mesh representation, the method bridges the large gap between human characters and video data and generates a mesh sequence with the same action.
Fig. 2 is an overview of the system of the invention. It contains three key modules: the initial human body reconstruction module, the mesh2mesh smoothing module, and the pose transfer module. The initial human body reconstruction module receives a video or video frame images as input and predicts the initial 3D mesh sequence of the human body; the mesh2mesh smoothing module optimizes the initial mesh sequence to obtain a mesh sequence with better temporal coherence; and the pose transfer module transfers the human pose from the source video mesh onto the target mesh.
Fig. 3 is a schematic diagram of the mesh2mesh network structure in the invention. The mesh sequence (sub-figure a) is represented as a cuboid as shown in sub-figure b; T denotes the number of meshes, H the number of vertices of a single mesh, and W the dimension of a single vertex coordinate. 3D convolution is used to model the spatio-temporal features of the T meshes. Sub-figure c shows the final mesh2mesh module, containing 8 3D convolutional layers and the two "unsqueeze" and "squeeze" layers used for dimension adaptation.
Fig. 4 shows experimental results of mesh2mesh in the invention. For the temporal incoherence present in the initial source mesh sequence, the mesh2mesh module can effectively exploit the information of the preceding and following meshes and improve the sequence.
Fig. 5 shows results of the action generation method of the invention. Given a video containing a specific human action, the invention can transfer the action sequence of the video to the target mesh. The figure shows the generation results with meshes of 3 different identities as the target mesh.
Detailed Description
Step 1, video frame preprocessing. For the source video data, image frames are first collected at a rate of 1 frame every 25 frames. For videos used in the training stage, T consecutive frames are randomly selected from the collected frames each time as the network input. For videos used in the testing stage, the middle T frames of the collected frames are taken each time as the network input. T is set to 16. The finally selected video clip is denoted $\{I_t\}_{t=1}^{T}$, where $I_t$ denotes the t-th image frame.
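As an illustration of this sampling scheme, the following sketch collects 1 frame out of every 25 with OpenCV and returns a window of T = 16 sampled frames; the function name sample_clip and the training/testing switch are illustrative, not part of the patented method.

```python
import cv2
import random

def sample_clip(video_path, step=25, clip_len=16, training=True):
    """Down-sample a source video (1 frame every `step` frames) and pick
    a clip of `clip_len` consecutive sampled frames, as in Step 1."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    if training:   # random contiguous window for training
        start = random.randint(0, max(len(frames) - clip_len, 0))
    else:          # middle window for testing
        start = max((len(frames) - clip_len) // 2, 0)
    return frames[start:start + clip_len]
```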
Step 2, obtaining the region of the actor in the image. Each selected video frame $I_t$ is fed into the existing OpenPose network (reference [9]) to predict the corresponding human skeletal joint points, and the region of the actor is determined from the joint coordinates so as to help the initial human body reconstruction model perform 3D human mesh reconstruction.
Step 3, recovering the source mesh sequence from the input video. The human source mesh sequence is recovered directly from the input video with existing 3D human reconstruction methods. The invention runs experiments on both GraphCMR (reference [10]) and SMPLify-X (reference [11]) to verify the generality and transferability of the method. Both models are based on the SMPL human model (reference [12]); an SMPL-based 3D mesh contains 6890 vertex coordinates. GraphCMR models the SMPL mesh with a graph convolutional neural network and regresses the 6890 vertex coordinates. SMPLify-X performs human reconstruction by fitting the joints of the SMPL mesh to the labeled joint points. For a video input containing T frames, the pair <video image, OpenPose prediction> is fed frame by frame into the model to obtain the predicted 3D mesh. The final initial mesh sequence is denoted $\{M_t\}_{t=1}^{T}$, where $M_t$ denotes the t-th human 3D mesh.
Step 4, constructing the mesh cuboid. The mesh cuboid ignores the complex vertex–face connectivity of the mesh: the connection relationship between vertices is kept unchanged and untouched, and only the coordinate positions of the vertices are modeled for fine adjustment by the subsequent network. For the initial mesh sequence $\{M_t\}_{t=1}^{T}$, the vertex positions of each $M_t$ are stacked to form a mesh cuboid of shape T × N × 3, where N is the number of vertices in a single mesh; N = 6890 in the SMPL human model representation.
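A minimal sketch of this stacking step, assuming the T reconstructed meshes are already available as (N, 3) vertex arrays; the function name build_mesh_cuboid is illustrative.

```python
import numpy as np

def build_mesh_cuboid(vertex_arrays):
    """Stack the vertex arrays of T meshes into a T x N x 3 cuboid.

    vertex_arrays: list of T arrays, each of shape (N, 3), holding the
    3D vertex coordinates of one SMPL mesh (N = 6890 for SMPL).
    The face (vertex-connectivity) information is deliberately ignored
    here; it is re-attached unchanged after smoothing.
    """
    cuboid = np.stack(vertex_arrays, axis=0)   # shape (T, N, 3)
    assert cuboid.ndim == 3 and cuboid.shape[2] == 3
    return cuboid
```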
Step 5, constructing the mesh2mesh smoothing module. As shown in Fig. 3, the mesh2mesh smoothing module is mainly formed by stacked 3D convolution layers. In the network design, the invention uses a 5 × 1 × 3 convolution kernel with stride (1, 1, 1) and padding (2, 0, 1) to keep the input and output data dimensions at T × N × 3. With GraphCMR and SMPLify-X as the initial human body reconstruction models, the body of mesh2mesh consists of 8 such 3D convolutional layers. In addition, an unsqueeze layer (adding a dimension of size 1) and a squeeze layer (removing a dimension of size 1) are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input with a batch size of 1. The mesh vertex coordinates smoothed by mesh2mesh are combined with the original vertex connectivity to reconstruct the new smoothed mesh sequence $\{\hat{M}_t\}_{t=1}^{T}$.
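The following PyTorch sketch illustrates the structure described above (8 Conv3d layers with a 5 × 1 × 3 kernel, stride (1, 1, 1) and padding (2, 0, 1), plus unsqueeze/squeeze for a batch of size 1). The class name is illustrative; treating the cuboid as a single-channel volume and omitting activation functions between layers, which the text does not specify, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class Mesh2MeshSmoother(nn.Module):
    """Sketch of the mesh2mesh smoothing network (illustrative)."""
    def __init__(self, num_layers=8):
        super().__init__()
        # Kernel (5, 1, 3) with padding (2, 0, 1) preserves the T x N x 3 shape.
        self.body = nn.Sequential(*[
            nn.Conv3d(1, 1, kernel_size=(5, 1, 3),
                      stride=(1, 1, 1), padding=(2, 0, 1))
            for _ in range(num_layers)
        ])

    def forward(self, cuboid):                  # cuboid: (T, N, 3)
        x = cuboid.unsqueeze(0).unsqueeze(0)    # -> (1, 1, T, N, 3), batch size 1
        x = self.body(x)                        # shape preserved
        return x.squeeze(0).squeeze(0)          # -> (T, N, 3)
```

In use, the smoothed vertex coordinates returned by this module would be re-attached to the original face connectivity to rebuild the smoothed mesh sequence.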
Step 6, training the mesh2mesh smoothing module. Every video frame $I_t$ in the video dataset has a corresponding annotation $J_t$, the labeled 3D coordinates of the human body joints at time t, which serves as supervision for model learning. The model is therefore trained with loss functions defined on the error between the mesh sequence $\{\hat{M}_t\}_{t=1}^{T}$ predicted by mesh2mesh and the annotations. First, the invention uses the joint regressor provided with the SMPL model to regress the corresponding 3D joints $\{\hat{J}_t\}_{t=1}^{T}$ from the mesh sequence, where $\hat{J}_t \in \mathbb{R}^{k \times 3}$ and k denotes the number of joints. Accordingly, a commonly used error loss is the joint loss function. The joint loss $L_{j3d}$ measures the deviation between each predicted joint position $\hat{J}_t$ and the labeled joint position $J_t$, taking the mean of the L2 distances over all joints as the error value. Its mathematical form is:

$L_{j3d} = \frac{1}{k} \sum_{i=1}^{k} \left\| \hat{J}_t^{\,i} - J_t^{\,i} \right\|_2$

In addition, the invention proposes a new motion loss function $L_{motion}$, which measures whether the movement direction of each predicted joint $\hat{J}_t$ between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the displacement vectors as the error value. Its mathematical form is:

$L_{motion} = \frac{1}{k} \sum_{i=1}^{k} \left\| (\hat{J}_{t+1}^{\,i} - \hat{J}_t^{\,i}) - (J_{t+1}^{\,i} - J_t^{\,i}) \right\|_2$

where $\|\cdot\|$ denotes the Euclidean distance of the corresponding joints;
The final loss function is therefore: $L_{j3d} + L_{motion}$.
The method of the invention trains the network parameters with the SGD (stochastic gradient descent) optimization algorithm.
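A minimal sketch of the two loss terms, assuming the regressed and labeled joints for one clip are available as (T, k, 3) tensors; tensor and function names are illustrative.

```python
import torch

def mesh2mesh_losses(pred_joints, gt_joints):
    """Joint loss L_j3d and motion loss L_motion for one clip.

    pred_joints, gt_joints: tensors of shape (T, k, 3) holding the 3D
    joints regressed from the smoothed meshes and the labeled joints.
    """
    # L_j3d: mean Euclidean distance between predicted and labeled joints.
    l_j3d = (pred_joints - gt_joints).norm(dim=-1).mean()

    # L_motion: mean Euclidean distance between the per-joint displacement
    # vectors of consecutive frames (predicted vs. labeled).
    pred_motion = pred_joints[1:] - pred_joints[:-1]
    gt_motion = gt_joints[1:] - gt_joints[:-1]
    l_motion = (pred_motion - gt_motion).norm(dim=-1).mean()

    return l_j3d + l_motion
```

The summed loss would then be minimized over the mesh2mesh parameters with torch.optim.SGD, as stated above.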
Step 7, testing. Following steps 1–4, the initial 3D mesh sequence of the test video and the corresponding initial mesh cuboid are obtained and then fed into the mesh2mesh module trained in step 6 to obtain the smoothed 3D mesh sequence $\{\hat{M}_t\}_{t=1}^{T}$.
Step 8, pose transfer. Each source mesh $\hat{M}_t$ with its specific pose is input together with the target mesh $M_{id}$ into the pose transfer network (reference [13]), generating a new mesh $M_{new}$ that simultaneously preserves the identity information of $M_{id}$ and the pose information of $\hat{M}_t$. Since an action is produced by a continuously changing pose, transferring the poses of the whole source mesh sequence onto the target mesh in order makes the target mesh character perform an action consistent with the source video.
Method evaluation. For the mesh2mesh model, since labeled joint information is available for reference, MPJPE (mean per-joint position error) is adopted as the evaluation metric. For the overall action generation task, since no reference information exists, the method is evaluated qualitatively by visualization. The experimental results (Table 1) show that the proposed mesh2mesh smoothing module improves the performance of the initial human body reconstruction model, reduces the error against the labeled joints, and improves the reconstruction quality. Fig. 4 visualizes how mesh2mesh corrects temporally abnormal meshes in the sequence, making the overall action more coherent. The results of the character action generation method are shown in Fig. 5: given a source mesh with a specific action, the method makes the target mesh accurately imitate the action in the source video, can handle target meshes with different initial poses and different identities, and generates mesh sequences with a high degree of detail fidelity.
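For reference, MPJPE can be computed as follows; this is a sketch assuming the predicted and labeled joints are given as arrays of shape (T, k, 3), with illustrative names.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: the Euclidean distance between
    predicted and labeled 3D joints, averaged over all frames and joints."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()
```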
Table 1. Quantitative test results (MPJPE; a smaller value indicates better model performance)

                                               GraphCMR    SMPLify-X
  Initial human body reconstruction model        74.7        136.4
  Initial reconstruction model + mesh2mesh       72.8        128.4
References
[1] A. Macchietto, V. Zordan, and C. R. Shelton. Momentum control for balance. In ACM SIGGRAPH 2009 Papers, pages 1–8, 2009.
[2] I. Mordatch, M. De Lasa, and A. Hertzmann. Robust physics-based locomotion using low-dimensional planning. In ACM SIGGRAPH 2010 Papers, pages 1–8, 2010.
[3] L. Kavan, S. Collins, J. Zara, and C. O'Sullivan. Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 39–46, 2007.
[4] R. Wareham and J. Lasenby. Bone glow: An improved method for the assignment of weights for mesh deformation. In International Conference on Articulated Motion and Deformable Objects, pages 63–71. Springer, 2008.
[5] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
[6] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[7] K. R. Dixon and P. K. Khosla. Learning by observation with mobile robots: A computational approach. In IEEE International Conference on Robotics and Automation (ICRA 2004), volume 1, pages 102–107. IEEE, 2004.
[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[9] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. TPAMI, 2019.
[10] N. Kolotouros, G. Pavlakos, and K. Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019.
[11] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[12] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. In SIGGRAPH Asia, 2015.
[13] J. Wang, C. Wen, Y. Fu, H. Lin, T. Zou, X. Xue, and Y. Zhang. Neural pose transfer by spatially adaptive instance normalization. In CVPR, 2020.

Claims (6)

1. A 3D character action generation system that imitates the motion of a person in a given video, comprising the following four modules: (1) an initial human body reconstruction module; (2) a mesh cuboid construction module; (3) a mesh2mesh smoothing module; (4) a human body pose transfer module; for a given segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then arranges the initial mesh sequence into a regular data cuboid, the mesh cuboid; to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character; the mesh is used as a unified 3D model representation and is defined by a set of vertices and faces: each vertex carries the information (id, x, y, z), where id is the vertex index and x, y, z are the 3D coordinates of the vertex; each face describes the connection relationship between vertices in the form (id1, id2, id3), i.e., the indices of the three vertices that the face connects;
the mesh cuboid ignores the complex human body structure, that is, the connection relationship between vertices is not considered, and the vertex information of the mesh sequence is treated as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single mesh vertex.
2. The 3D character action generation system according to claim 1, wherein the initial human body reconstruction module adopts a mesh-based human 3D reconstruction model whose inputs are the source video frame images and the corresponding OpenPose results, yielding the initial human mesh sequence; the source video frame images are image frames obtained by extracting and down-sampling frames from the input video data; the corresponding OpenPose results are obtained by using the OpenPose model to extract the skeletal joint points of the actor in each frame, thereby determining the region of the human body within the whole frame.
3. The 3D character action generation system according to claim 1, wherein the network of the mesh2mesh smoothing module is formed by stacked 3D convolution layers and smooths the mesh sequence by exploiting the spatio-temporal representation capability of 3D convolution;
in the mesh2mesh smoothing module, the loss function consists of two parts:
(1) the joint loss function $L_{j3d}$ measures the deviation between the joint positions $\hat{J}_t$ regressed from the predicted mesh and the labeled joint positions $J_t$, taking the mean of the L2 distances over all joints as the error value; its mathematical form is:

$L_{j3d} = \frac{1}{k} \sum_{i=1}^{k} \left\| \hat{J}_t^{\,i} - J_t^{\,i} \right\|_2$

where k denotes the total number of human body joints and $\|\cdot\|$ denotes the Euclidean distance;
(2) the motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint $\hat{J}_t$ between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the displacement vectors as the error value; its mathematical form is:

$L_{motion} = \frac{1}{k} \sum_{i=1}^{k} \left\| (\hat{J}_{t+1}^{\,i} - \hat{J}_t^{\,i}) - (J_{t+1}^{\,i} - J_t^{\,i}) \right\|_2$

where $\|\cdot\|$ denotes the Euclidean distance;
the final loss function is: $L_{j3d} + L_{motion}$.
4. The 3D character action generation system according to claim 3, wherein in the mesh2mesh smoothing module, the 3D convolution layers use a 5 × 1 × 3 convolution kernel with stride (1, 1, 1) and padding (2, 0, 1) to keep the input and output data dimensions at T × N × 3; the body of mesh2mesh consists of 8 such 3D convolution layers; in addition, an unsqueeze layer, i.e., adding a dimension of size 1, and a squeeze layer, i.e., removing a dimension of size 1, are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input to the network with a batch size of 1.
5. The 3D character action generation system according to claim 4, wherein the human body pose transfer module adopts a pose transfer network to generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh, i.e., the poses of the whole source mesh sequence are transferred onto the target mesh in order so that the target mesh character performs an action consistent with the source video.
6. A 3D character action generation method based on the system of any one of claims 1 to 5, characterized by comprising the following steps:
(1) first, perform frame extraction and down-sampling on the input video data, down-sampling each source video at a certain frequency to obtain image frames;
(2) extract the skeletal joint points of the actor in each frame with the OpenPose model to determine the region of the human body within the whole frame;
(3) input the source video frames and the corresponding OpenPose results into the initial human body reconstruction module and predict the initial human mesh sequence;
(4) arrange the initial mesh sequence into a regular data cuboid, the mesh cuboid; the vertex information of the mesh sequence is treated as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single mesh vertex;
(5) input the constructed mesh cuboid into the mesh2mesh smoothing module to obtain the smoothed 3D mesh sequence;
(5.1) train the mesh2mesh network, optimizing it with the data in the training set; for each mesh in the smoothed 3D mesh sequence, first use the joint regressor provided by the SMPL model to regress the corresponding 3D joint coordinates, then compare them with the ground-truth human joint annotations, compute the joint loss and the motion loss from the difference, and optimize the network parameters accordingly;
(5.2) in the testing stage, input the initial 3D mesh sequence of the test video obtained in step (1) into the mesh2mesh smoothing module trained in step (5.1) to obtain the smoothed 3D mesh sequence;
(6) regard each mesh in the smoothed 3D mesh sequence as a source mesh with a specific pose, input it together with the target mesh into the human body pose transfer module, and generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh.
CN202011101066.1A 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video Active CN112308952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Publications (2)

Publication Number Publication Date
CN112308952A CN112308952A (en) 2021-02-02
CN112308952B true CN112308952B (en) 2022-11-18

Family

ID=74327365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101066.1A Active CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Country Status (1)

Country Link
CN (1) CN112308952B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731330A (en) * 2022-11-16 2023-03-03 北京百度网讯科技有限公司 Target model generation method, animation generation method, device and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0936576A2 (en) * 1998-02-12 1999-08-18 Mitsubishi Denki Kabushiki Kaisha A system for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
JP2006141453A (en) * 2004-11-16 2006-06-08 Bandai Networks Co Ltd Program for video game and video game device
CN108230431A (en) * 2018-01-24 2018-06-29 深圳市云之梦科技有限公司 A kind of the human action animation producing method and system of two-dimensional virtual image
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109147048A (en) * 2018-07-23 2019-01-04 复旦大学 A kind of three-dimensional grid method for reconstructing using individual cromogram
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110188668A (en) * 2019-05-28 2019-08-30 复旦大学 A method of classify towards small sample video actions
WO2020064873A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
CN111161395A (en) * 2019-11-19 2020-05-15 深圳市三维人工智能科技有限公司 Method and device for tracking facial expression and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568207B2 (en) * 2018-09-27 2023-01-31 Deepmind Technologies Limited Learning observation representations by predicting the future in latent space
US10872294B2 (en) * 2018-09-27 2020-12-22 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
SG11202107737WA (en) * 2019-01-15 2021-08-30 Shane Yang Augmented cognition methods and apparatus for contemporaneous feedback in psychomotor learning

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0936576A2 (en) * 1998-02-12 1999-08-18 Mitsubishi Denki Kabushiki Kaisha A system for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
JP2006141453A (en) * 2004-11-16 2006-06-08 Bandai Networks Co Ltd Program for video game and video game device
CN108230431A (en) * 2018-01-24 2018-06-29 深圳市云之梦科技有限公司 A kind of the human action animation producing method and system of two-dimensional virtual image
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109147048A (en) * 2018-07-23 2019-01-04 复旦大学 A kind of three-dimensional grid method for reconstructing using individual cromogram
WO2020064873A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110188668A (en) * 2019-05-28 2019-08-30 复旦大学 A method of classify towards small sample video actions
CN111161395A (en) * 2019-11-19 2020-05-15 深圳市三维人工智能科技有限公司 Method and device for tracking facial expression and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020); Samuel Albanie et al.; arXiv:2008.00744v1; 2020-08-03; pages 1–8 *
深度伪造技术的安全挑战与应对 (Security challenges and countermeasures of deepfake technology); 顾钊铨 et al.; 《信息安全》 (Information Security); 2020-09-22; pages 55–57 *

Also Published As

Publication number Publication date
CN112308952A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
Jatavallabhula et al. gradsim: Differentiable simulation for system identification and visuomotor control
Zhu et al. Inferring forces and learning human utilities from videos
Xia et al. A survey on human performance capture and animation
Petit et al. Tracking elastic deformable objects with an RGB-D sensor for a pizza chef robot
CN111028317B (en) Animation generation method, device and equipment for virtual object and storage medium
Perret et al. Interactive assembly simulation with haptic feedback
Purushwalkam et al. Bounce and learn: Modeling scene dynamics with real-world bounces
Murthy et al. gradsim: Differentiable simulation for system identification and visuomotor control
Tan et al. Realtime simulation of thin-shell deformable materials using CNN-based mesh embedding
Romero et al. Modeling and estimation of nonlinear skin mechanics for animated avatars
CN116363308A (en) Human body three-dimensional reconstruction model training method, human body three-dimensional reconstruction method and equipment
Liang et al. Machine learning for digital try-on: Challenges and progress
CN112308952B (en) 3D character motion generation system and method for imitating human motion in given video
Guo et al. Inverse simulation: Reconstructing dynamic geometry of clothed humans via optimal control
Wu et al. An unsupervised real-time framework of human pose tracking from range image sequences
Wu et al. Agentdress: Realtime clothing synthesis for virtual agents using plausible deformations
Mirolo et al. A solid modelling system for robot action planning
Zhao et al. Stability-driven contact reconstruction from monocular color images
Schröder et al. Design and evaluation of reduced marker layouts for hand motion capture
Wu et al. Example-based real-time clothing synthesis for virtual agents
Mousas et al. Efficient hand-over motion reconstruction
Walsman et al. Break and make: Interactive structural understanding using lego bricks
Kaushik et al. Imitating human movement using a measure of verticality to animate low degree-of-freedom non-humanoid virtual characters
Zhi et al. Learning from demonstration via probabilistic diagrammatic teaching
Xu Single-view and multi-view methods in marker-less 3d human motion capture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant