CN112308952A - 3D character motion generation system and method for imitating human motion in given video - Google Patents

3D character motion generation system and method for imitating human motion in given video

Info

Publication number
CN112308952A
CN112308952A
Authority
CN
China
Prior art keywords
mesh
sequence
human body
mesh2mesh
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011101066.1A
Other languages
Chinese (zh)
Other versions
CN112308952B (en)
Inventor
姜育刚 (Jiang Yugang)
傅宇倩 (Fu Yuqian)
付彦伟 (Fu Yanwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202011101066.1A priority Critical patent/CN112308952B/en
Publication of CN112308952A publication Critical patent/CN112308952A/en
Application granted granted Critical
Publication of CN112308952B publication Critical patent/CN112308952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of computers, and particularly relates to a 3D character action generation system and method for imitating the actions of a person in a given video. The system comprises four modules: initial human body reconstruction, regular-data mesh cuboid construction, mesh2mesh smoothing, and human body pose transfer. For a source video containing human actions, the initial human body reconstruction module recovers a source mesh sequence of the actor; the mesh cuboid construction module organizes the initial mesh sequence into regular data, a mesh cuboid; the mesh2mesh smoothing module further smooths the initial mesh sequence with 3D convolutions, so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character. The invention can generate a mesh sequence consistent with the actions in the source video and improves the temporal coherence of the mesh sequence.

Description

3D character motion generation system and method for imitating human motion in given video
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a 3D character action generation system and method.
Background
Character action generation has important practical significance for many computer vision tasks, including multimedia interaction and visual information understanding. Humans are very adept at learning and imitating actions from a few examples, an ability that itself plays a key role in human intelligence. It is therefore desirable for 3D characters to likewise learn to imitate actions from video samples and to generate the same action sequence as the source video.
Despite the research value of this task, there is still little work that addresses it directly. The most closely related work falls into two categories: character animation control and imitation learning.
Character animation control ([1], [2], [3], [4]) mainly studies how to make a static 3D character perform a specific motion and thereby produce an animation. Such methods are character-specific, i.e., the animation must be defined according to the skeletal structure of the particular 3D character itself.
Imitation learning ([5], [6], [7], [8]) mainly studies how to endow intelligent robots with the ability to imitate human behavior. These methods extract and summarize human knowledge by learning from and analyzing the behavior of human demonstrators in different environments, and then apply it to new environments. Imitation learning is mainly oriented towards intelligent machine systems and usually requires corresponding hardware devices or a virtual environment to support the experiments.
Different from existing methods, the invention introduces the mesh as the representation form of the 3D model, and then bridges the huge representation gap between the video domain and the character domain by exploiting existing research results in two directions, 3D human body reconstruction and human body pose transfer, thereby achieving the goal of making a target 3D character generate the same action sequence as the video. In addition, to improve the temporal coherence of the generated mesh sequence, the invention proposes a mesh2mesh smoothing network based on 3D convolution, which improves the quality of the mesh sequence produced by the initial human body reconstruction model. Compared with character animation control methods, the proposed method can automatically acquire motion information from a video example and then imitate it; compared with imitation learning methods, the proposed method requires no additional equipment and relies entirely on deep network models for transferring and imitating actions. Experiments show that the method can generate a mesh sequence consistent with the actions in the source video and effectively improve its temporal coherence.
Disclosure of Invention
The invention aims to provide a 3D character action generation system and method capable of imitating the motion of a person in a given video.
In order to reduce the huge representation gap between video and character, the invention first proposes to use the mesh (grid) as a unified 3D model representation, i.e., every 3D model involved in the invention is represented as a mesh. A mesh is defined by a set of vertices and faces: each vertex carries the vertex information of the model in the form (id, x, y, z), where id is the vertex index and x, y, z are its 3D coordinates; each face describes the connection relationship between vertices in the form (id1, id2, id3), giving the indices of the three vertices that the face connects.
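For concreteness, a minimal Python sketch of this vertex/face representation is given below; the class name, the use of NumPy arrays, and the example triangle are illustrative assumptions, not structures defined by the patent.

```python
# Minimal sketch of the mesh representation described above: vertices carry
# 3D coordinates indexed by id, and each face lists the three vertex ids it connects.
from dataclasses import dataclass
import numpy as np

@dataclass
class Mesh:
    vertices: np.ndarray  # shape (N, 3): row id holds the (x, y, z) of vertex id
    faces: np.ndarray     # shape (F, 3): each row is (id1, id2, id3)

# Example: a single triangle.
triangle = Mesh(vertices=np.array([[0.0, 0.0, 0.0],
                                   [1.0, 0.0, 0.0],
                                   [0.0, 1.0, 0.0]]),
                faces=np.array([[0, 1, 2]]))
```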
The 3D character action generation system capable of imitating the actions of a person in a given video is based on deep learning and mainly comprises the following four modules: (1) an initial human body reconstruction module; (2) a regular-data mesh cuboid construction module; (3) a mesh2mesh smoothing module; (4) a human body pose transfer module. Given a segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then organizes the initial mesh sequence into regular data, a mesh cuboid; to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character.
In the invention, the initial human body reconstruction module adopts existing mesh-based human body 3D reconstruction models (references [10] and [11]); its inputs are the source video frame images and the corresponding OpenPose results, and its output is an initial human body mesh sequence. The source video frame images are obtained by extracting and down-sampling image frames from the input video data; the corresponding OpenPose result refers to the skeletal joint information of the actor extracted from each frame with an OpenPose model (reference [9]), which determines the region of the human body within the whole frame.
In the invention, the mesh cuboid construction module organizes the initial mesh sequence into regular data, a mesh cuboid, which serves as the input of the subsequent mesh2mesh smoothing module. The mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single vertex.
In the invention, the deep neural network of the mesh2mesh smoothing module is mainly formed by stacking 3D convolution layers, and the mesh sequence is smoothed by exploiting the spatio-temporal representation capability of 3D convolution.
In the mesh2mesh smoothing module, the loss function consists of two parts:

(1) The joint loss function $L_{j3d}$ measures the error between each predicted joint position $\hat{J}_t$ of the predicted mesh and the labeled joint position $J_t$, taking the mean of the L2 distances of the per-joint offsets as the final error value. Mathematically,

$$L_{j3d} = \frac{1}{k}\sum_{i=1}^{k}\left\| \hat{J}_t^{(i)} - J_t^{(i)} \right\|_2$$

where k denotes the total number of human body joints and $\|\cdot\|_2$ denotes the Euclidean distance;

(2) The motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the offset direction vectors as the final error value. Mathematically,

$$L_{motion} = \frac{1}{k}\sum_{i=1}^{k}\left\| \left(\hat{J}_{t+1}^{(i)} - \hat{J}_t^{(i)}\right) - \left(J_{t+1}^{(i)} - J_t^{(i)}\right) \right\|_2$$

where $\|\cdot\|_2$ denotes the Euclidean distance.

The final loss function is therefore $L_{j3d} + L_{motion}$.
In the invention, the human body pose transfer module adopts a pose transfer network (reference [13]) to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh. Since an action is produced by a continuously changing pose, transferring the poses of the whole source mesh sequence into the target mesh in order makes the target mesh character generate an action consistent with the source video.
The invention further provides a 3D character action generation method capable of imitating the actions of a person in a given video, comprising the following specific steps:
(1) First, image frames are extracted from the input video data and down-sampled. Considering that source videos are long and the action difference between adjacent frames is small, each source video is down-sampled at a fixed frequency to obtain the image frames;
(2) The skeletal joint information of the actor in each frame is extracted with an OpenPose model (reference [9]) to determine the region of the human body within the whole frame;
(3) The source video frame images and the corresponding OpenPose results are fed into the initial human body reconstruction module to predict an initial human body mesh sequence; in particular, any mesh-based image or video human body 3D reconstruction method can serve as the initial human body reconstruction module, giving the method strong generality;
(4) The initial mesh sequence is organized into regular data, the mesh cuboid, which serves as the input to the subsequent network; the mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T is the number of meshes in the sequence, N is the number of vertices in a single mesh, and 3 is the coordinate dimension of a single vertex;
(5) The constructed mesh cuboid is fed into the mesh2mesh smoothing module to obtain a smoothed 3D mesh sequence;
(5.1) In the training stage, the mesh2mesh network is continuously optimized on the training set; for each mesh in the smoothed 3D mesh sequence, the joint regressor provided with the SMPL model (reference [12]) is first used to regress the corresponding 3D joint coordinates, which are then compared with the ground-truth human joint annotations; the joint loss function and the motion loss function are computed from their differences and used to optimize the network parameters;
(5.2) In the testing stage, the initial 3D mesh sequence of the test video obtained in steps (1)-(4) is fed into the mesh2mesh smoothing module trained in step (5.1) to obtain a smoothed 3D mesh sequence;
(6) Each mesh in the smoothed 3D mesh sequence is regarded as a source mesh with a specific pose; it is fed together with the target mesh into the human body pose transfer module to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh.
For evaluation, the mesh2mesh smoothing module is assessed quantitatively with MPJPE, the positional distance between the joint positions regressed from the mesh and the labeled joints; the final mesh sequence is evaluated qualitatively by visual inspection, because the overall task is generative and has no corresponding annotations.
In summary, the innovations of the invention are as follows:
1. The invention is the first to propose the visual task of learning an action sequence from video, using the mesh as the representation of the 3D model, so that a 3D character performs the same action. This task is significant for multimedia interaction and visual action understanding;
2. The invention proposes to realize video action extraction and character action generation by combining a 3D human body reconstruction method with a human body pose transfer method. Video actions can be effectively transferred to a target 3D character, and the whole generation process relies entirely on deep learning, without any additional equipment or specific virtual environment, providing a new solution for character action generation;
3. To address the problem of temporally unsmooth mesh sequences, the invention first proposes the concept of the mesh cuboid, which simplifies the originally complex-structured mesh sequence, and proposes a mesh2mesh smoothing module based on 3D convolution to model the spatio-temporal information, thereby optimizing the temporal coherence of the mesh sequence. In addition, besides the joint loss function commonly used in human body reconstruction, the invention also proposes a motion loss function targeting temporal change, which optimizes the network parameters jointly with the joint loss function.
Drawings
Fig. 1 is a schematic diagram of the 3D character action generation method proposed by the present invention. Given a source video (first row) and a target mesh human character (first column), the method comprises two main stages: recovering the original 3D mesh sequence from the video, and transferring the pose of the source mesh to the target mesh. By introducing the mesh representation, the method bridges the huge gap between human body characters and video data and generates a mesh sequence with the same action.
Fig. 2 is an overview of the system of the present invention. It contains three key modules: an initial human body reconstruction module, a mesh2mesh smoothing module, and a pose transfer module. The initial human body reconstruction module receives a video or video frame images as input and predicts an initial 3D mesh sequence of the human body; the mesh2mesh smoothing module optimizes the initial mesh sequence to obtain a mesh sequence with better temporal coherence; the pose transfer module transfers the pose of the human body from the source video mesh to the target mesh.
Fig. 3 is a schematic diagram of the mesh2mesh network structure in the present invention. The mesh sequence (subfigure a) is represented as a cuboid as shown in subfigure b; T denotes the number of meshes, H the number of vertices of a single mesh, and W the dimension of a single vertex coordinate. 3D convolution is used to model the spatio-temporal features of the T meshes. Subfigure c shows the final mesh2mesh module, containing 8 3D convolutional layers together with "unsqueeze" and "squeeze" layers for dimension adaptation.
Fig. 4 shows experimental results of mesh2mesh. For the temporal incoherence present in the initial source mesh sequence, the mesh2mesh module can effectively exploit the information of the preceding and following meshes to correct it.
Fig. 5 shows experimental results of the action generation method of the present invention. Given a video containing specific human actions, the invention can transfer the action sequence of the video to the target mesh. The figure shows generation results with meshes of 3 different identities as the target mesh.
Detailed Description
Step 1: video frame preprocessing. For the source video data, image frames are first sampled at a rate of 1 frame every 25 frames. For videos used in the training stage, T consecutive frames are randomly selected from the collected frames as the network input each time. For videos used in the testing stage, the middle T frames of the collected frames are taken as the network input each time. T is set to 16. The final selected video segment is denoted as $V = \{I_1, I_2, \dots, I_T\}$, where $I_t$ denotes the t-th image frame.
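A possible implementation of this preprocessing is sketched below in Python, assuming OpenCV for frame decoding; the sampling rate of 1 frame every 25 frames and the segment length T = 16 are taken from the text, while the function names are illustrative.

```python
# Sketch of step 1: down-sample the source video at 1 frame every 25 frames,
# then pick a contiguous segment of T = 16 frames (random for training,
# centred for testing).
import random
import cv2

def sample_frames(video_path: str, stride: int = 25):
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

def pick_segment(frames, segment_len: int = 16, training: bool = True):
    if len(frames) < segment_len:
        return frames
    if training:                       # random contiguous T frames
        start = random.randint(0, len(frames) - segment_len)
    else:                              # middle T frames at test time
        start = (len(frames) - segment_len) // 2
    return frames[start:start + segment_len]
```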
Step 2: obtain the region of the actor in the image. Each selected video frame $I_t$ is fed into the existing OpenPose network (reference [9]) to predict the corresponding human skeletal joints; the region of the actor is then determined from the joint coordinates, which helps the initial human body reconstruction model perform the 3D mesh reconstruction better.
Step 3: recover the source mesh sequence from the input video. The human source mesh sequence is recovered directly from the input video with existing 3D human body reconstruction methods. The invention conducts experiments on both GraphCMR (reference [10]) and SMPLify-X (reference [11]) to verify the generality and transferability of the method. Both models are based on the SMPL body model (reference [12]); an SMPL-based 3D mesh contains 6890 vertex coordinates. GraphCMR models the SMPL mesh with a graph convolutional neural network and obtains the 6890 vertex coordinates by regression, while SMPLify-X performs human body reconstruction by fitting the joints of the SMPL mesh to the labeled joints. For a video input containing T frames, the pair <video image, OpenPose prediction> of each frame is taken as input, and the 3D mesh is obtained after model prediction. The final initial mesh sequence is denoted as $\hat{M} = \{\hat{M}_1, \dots, \hat{M}_T\}$, where $\hat{M}_t$ denotes the t-th 3D mesh.
Step 4: construct the mesh cuboid. The mesh cuboid ignores the complex point-face connection relationship of the mesh, keeps the vertex connectivity unchanged, and only models the coordinate position of each vertex for fine-tuning by the subsequent network. For the initial mesh sequence $\hat{M} = \{\hat{M}_1, \dots, \hat{M}_T\}$, the vertex positions of each $\hat{M}_t$ are stacked to form a mesh cuboid of shape T × N × 3, where N denotes the number of vertices in a single mesh; in the SMPL body model representation, N equals 6890.
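As a concrete reading of this step, the sketch below stacks the per-mesh vertex arrays into a T × N × 3 cuboid; the input format (a list of (vertices, faces) pairs) and the function name are illustrative assumptions.

```python
# Minimal sketch of step 4: stack the vertices of each mesh in the initial
# sequence into a T x N x 3 "mesh cuboid", discarding the face connectivity.
import numpy as np

def build_mesh_cuboid(meshes):
    # meshes: list of T (vertices, faces) pairs, vertices of shape (N, 3).
    vertex_arrays = [np.asarray(vertices, dtype=np.float32)
                     for vertices, _faces in meshes]
    cuboid = np.stack(vertex_arrays, axis=0)  # shape (T, N, 3); N = 6890 for SMPL
    return cuboid
```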
Step 5: construct the mesh2mesh smoothing module. The mesh2mesh smoothing module is mainly formed by stacking 3D convolution layers, as shown in Fig. 3. In the network design, the invention uses 5 × 1 × 3 convolution kernels with stride (1,1,1) and padding (2,0,1) so that the input and output dimensions remain T × N × 3. When GraphCMR or SMPLify-X is used as the initial human body reconstruction model, the body of mesh2mesh consists of 8 such 3D convolutional layers. In addition, an unsqueeze layer (adding a dimension of size 1) and a squeeze layer (removing a dimension of size 1) are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input with a batch size of 1. The mesh vertex coordinates smoothed by mesh2mesh are combined with the original vertex connectivity to reconstruct a new smoothed mesh sequence $\tilde{M} = \{\tilde{M}_1, \dots, \tilde{M}_T\}$.
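A minimal PyTorch sketch of such a module is given below, assuming the kernel size, stride, padding, 8-layer depth, and head/tail unsqueeze/squeeze described above; the hidden channel width and the ReLU activations are illustrative assumptions not specified by the patent.

```python
# Hedged sketch of a mesh2mesh-style smoothing network built from stacked
# 3D convolutions that preserve the (T, N, 3) cuboid shape.
import torch
import torch.nn as nn

class Mesh2Mesh(nn.Module):
    def __init__(self, num_layers: int = 8, hidden_channels: int = 64):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(num_layers):
            out_ch = 1 if i == num_layers - 1 else hidden_channels
            layers.append(nn.Conv3d(in_ch, out_ch,
                                    kernel_size=(5, 1, 3),
                                    stride=(1, 1, 1),
                                    padding=(2, 0, 1)))
            if i < num_layers - 1:
                layers.append(nn.ReLU(inplace=True))
            in_ch = out_ch
        self.body = nn.Sequential(*layers)

    def forward(self, cuboid: torch.Tensor) -> torch.Tensor:
        # cuboid: (T, N, 3); add batch and channel dimensions of size 1.
        x = cuboid.unsqueeze(0).unsqueeze(0)   # (1, 1, T, N, 3)
        x = self.body(x)                       # spatial dims preserved by padding
        return x.squeeze(0).squeeze(0)         # back to (T, N, 3)

# Example: smooth a sequence of 16 SMPL meshes (6890 vertices each).
smoother = Mesh2Mesh()
smoothed = smoother(torch.randn(16, 6890, 3))
print(smoothed.shape)  # torch.Size([16, 6890, 3])
```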
Step 6: train the mesh2mesh smoothing module. Each video frame $I_t$ in the video dataset has corresponding annotation $J_t$, the 3D coordinates of the labeled human body joints at time t, which serves as supervision for model learning. The loss function is therefore designed from the error between the mesh sequence $\tilde{M}$ predicted by mesh2mesh and the annotations. First, the joint regressor provided with the SMPL model is used to regress the corresponding 3D joint information $\hat{J}_t \in \mathbb{R}^{k \times 3}$ from the mesh sequence $\tilde{M}$, where k denotes the number of joints. The commonly used error term is the joint loss function $L_{j3d}$, which measures the error between each predicted joint position $\hat{J}_t$ and the labeled joint position $J_t$, taking the mean of the L2 distances of the per-joint offsets as the final error value:

$$L_{j3d} = \frac{1}{k}\sum_{i=1}^{k}\left\| \hat{J}_t^{(i)} - J_t^{(i)} \right\|_2$$

In addition, the invention proposes a new motion loss function $L_{motion}$, which measures whether the movement direction of each predicted joint between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the offset direction vectors as the final error value:

$$L_{motion} = \frac{1}{k}\sum_{i=1}^{k}\left\| \left(\hat{J}_{t+1}^{(i)} - \hat{J}_t^{(i)}\right) - \left(J_{t+1}^{(i)} - J_t^{(i)}\right) \right\|_2$$

where $\|\cdot\|_2$ denotes the Euclidean distance of the corresponding joints.

The final loss function is therefore $L_{j3d} + L_{motion}$.
The network parameters are trained with the SGD (stochastic gradient descent) optimization algorithm.
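As a rough illustration of the two loss terms defined above, the sketch below computes $L_{j3d}$ and $L_{motion}$ for a predicted and a ground-truth joint sequence; the tensor shapes and the averaging over both frames and joints are assumptions consistent with, but not dictated by, the description.

```python
# Hedged sketch of the joint loss and motion loss described in step 6.
# pred_joints and gt_joints: (T, k, 3) tensors of 3D joint positions.
import torch

def joint_loss(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    # Mean L2 distance between predicted and labeled joint positions.
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def motion_loss(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    # Per-joint offsets between consecutive frames, compared between
    # prediction and annotation; mean L2 distance of the offset vectors.
    pred_offset = pred_joints[1:] - pred_joints[:-1]
    gt_offset = gt_joints[1:] - gt_joints[:-1]
    return (pred_offset - gt_offset).norm(dim=-1).mean()

def total_loss(pred_joints, gt_joints):
    return joint_loss(pred_joints, gt_joints) + motion_loss(pred_joints, gt_joints)
```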
Step 7: testing. The initial 3D mesh sequence of the test video and the corresponding initial mesh cuboid are obtained following steps 1-4, and then fed into the mesh2mesh module trained in step 6 to obtain the smoothed 3D mesh sequence $\tilde{M}$.
Step 8: pose transfer. Each smoothed mesh $\tilde{M}_t$, regarded as a source mesh with a specific pose, is fed together with the target mesh $M_{id}$ into the pose transfer network (reference [13]) to generate a new mesh $M_{new}$ that simultaneously retains the identity information of $M_{id}$ and the pose information of $\tilde{M}_t$. Since an action is produced by a continuously changing pose, sequentially transferring the poses of the whole source mesh sequence into the target mesh makes the target mesh character generate an action consistent with the source video.
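The frame-by-frame transfer in this step can be written as a simple loop, sketched below; the `pose_transfer` callable standing in for the pose transfer network of reference [13] is a hypothetical placeholder, not an API provided by that work.

```python
# Sketch of step 8: sequentially transfer the pose of each smoothed source
# mesh onto the target mesh, producing the generated action sequence.
# pose_transfer(source_mesh, target_mesh) is assumed to return a new mesh
# with the target's identity and the source's pose.
def generate_action_sequence(smoothed_source_meshes, target_mesh, pose_transfer):
    generated = []
    for source_mesh in smoothed_source_meshes:   # one mesh per video frame
        new_mesh = pose_transfer(source_mesh, target_mesh)
        generated.append(new_mesh)
    return generated
```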
Method evaluation. For the mesh2mesh model, since labeled joint information is available for reference, MPJPE (Mean Per Joint Position Error) is adopted as the evaluation metric. For the overall action generation task, since no reference information is available, the method is evaluated qualitatively by visual inspection. The experimental results (Table 1) show that the proposed mesh2mesh smoothing module improves the performance of the initial human body reconstruction model, reduces the error against the labeled joints, and improves the reconstruction quality. Fig. 4 visually shows that mesh2mesh corrects mesh sequences with abnormal temporal motion, making the overall action more coherent. The effect of the character action generation method is shown in Fig. 5: given a source mesh with a specific action, the method makes the target mesh accurately imitate the action in the source video; it can handle target meshes with different initial poses and different identities, and generates mesh sequences with a high degree of detail fidelity for them.
TABLE 1. Quantitative test results (MPJPE; lower values indicate better model performance)

Model                                                   GraphCMR    SMPLify-X
Initial human body reconstruction model                 74.7        136.4
Initial human body reconstruction model + mesh2mesh     72.8        128.4
References
[1] A. Macchietto, V. Zordan, and C. R. Shelton. Momentum control for balance. In ACM SIGGRAPH 2009 Papers, pages 1–8, 2009.
[2] I. Mordatch, M. De Lasa, and A. Hertzmann. Robust physics-based locomotion using low-dimensional planning. In ACM SIGGRAPH 2010 Papers, pages 1–8, 2010.
[3] L. Kavan, S. Collins, J. Zara, and C. O'Sullivan. Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 39–46, 2007.
[4] R. Wareham and J. Lasenby. Bone glow: An improved method for the assignment of weights for mesh deformation. In International Conference on Articulated Motion and Deformable Objects, pages 63–71. Springer, 2008.
[5] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
[6] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[7] K. R. Dixon and P. K. Khosla. Learning by observation with mobile robots: A computational approach. In IEEE International Conference on Robotics and Automation (ICRA 2004), volume 1, pages 102–107. IEEE, 2004.
[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[9] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. TPAMI, 2019.
[10] N. Kolotouros, G. Pavlakos, and K. Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019.
[11] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[12] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. In SIGGRAPH Asia, 2015.
[13] J. Wang, C. Wen, Y. Fu, H. Lin, T. Zou, X. Xue, and Y. Zhang. Neural pose transfer by spatially adaptive instance normalization. In CVPR, 2020.

Claims (7)

1. A 3D character action generation system that imitates the actions of a person in a given video, characterized by comprising the following four modules: (1) an initial human body reconstruction module; (2) a regular-data mesh cuboid construction module; (3) a mesh2mesh smoothing module; (4) a human body pose transfer module; for a given segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then organizes the initial mesh sequence into regular data, a mesh cuboid; to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character; the mesh is used as a unified 3D model representation and is defined by a set of vertices and faces: each vertex carries the vertex information of the model in the form (id, x, y, z), where id is the vertex index and x, y, z are its 3D coordinates; each face describes the connection relationship between vertices in the form (id1, id2, id3), giving the indices of the three vertices that the face connects.
2. The 3D character action generation system according to claim 1, wherein the initial human body reconstruction module adopts a mesh-based human body 3D reconstruction model whose inputs are the source video frame images and the corresponding OpenPose results and whose output is an initial human body mesh sequence; the source video frame images are obtained by extracting and down-sampling image frames from the input video data; the corresponding OpenPose result refers to the skeletal joint information of the actor extracted from each frame with an OpenPose model, which determines the region of the human body within the whole frame.
3. The 3D character action generation system according to claim 2, wherein the mesh cuboid construction module organizes the initial mesh sequence into regular data, a mesh cuboid, as the input of the subsequent mesh2mesh smoothing module; the mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single vertex.
4. The 3D character action generation system according to claim 3, wherein the network of the mesh2mesh smoothing module is mainly formed by stacking 3D convolution layers, and the mesh sequence is smoothed by exploiting the spatio-temporal representation capability of 3D convolution;

in the mesh2mesh smoothing module, the loss function consists of two parts:

(1) the joint loss function $L_{j3d}$ measures the error between each predicted joint position $\hat{J}_t$ of the predicted mesh and the labeled joint position $J_t$, taking the mean of the L2 distances of the per-joint offsets as the final error value:

$$L_{j3d} = \frac{1}{k}\sum_{i=1}^{k}\left\| \hat{J}_t^{(i)} - J_t^{(i)} \right\|_2$$

where k denotes the total number of human body joints and $\|\cdot\|_2$ denotes the Euclidean distance;

(2) the motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the offset direction vectors as the final error value:

$$L_{motion} = \frac{1}{k}\sum_{i=1}^{k}\left\| \left(\hat{J}_{t+1}^{(i)} - \hat{J}_t^{(i)}\right) - \left(J_{t+1}^{(i)} - J_t^{(i)}\right) \right\|_2$$

where $\|\cdot\|_2$ denotes the Euclidean distance;

the final loss function is: $L_{j3d} + L_{motion}$.
5. The 3D character action generation system according to claim 4, wherein in the mesh2mesh smoothing module the 3D convolution layers use a 5 × 1 × 3 convolution kernel with stride (1,1,1) and padding (2,0,1), so that the input and output data dimensions remain T × N × 3; the body of mesh2mesh consists of 8 such 3D convolutional layers; in addition, an unsqueeze layer, which adds a dimension of size 1, and a squeeze layer, which removes a dimension of size 1, are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input with a batch size of 1.
6. The 3D character action generation system according to claim 5, wherein the human body pose transfer module adopts a pose transfer network to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh, i.e., the poses of the whole source mesh sequence are transferred into the target mesh in order, so that the target mesh character generates an action consistent with the source video.
7. A 3D character action generation method based on the system of any one of claims 1 to 6, characterized by comprising the following steps:
(1) first, image frames are extracted from the input video data and down-sampled; each source video is down-sampled at a fixed frequency to obtain the image frames;
(2) the skeletal joint information of the actor in each frame is extracted with an OpenPose model to determine the region of the human body within the whole frame;
(3) the source video frame images and the corresponding OpenPose results are fed into the initial human body reconstruction module to predict an initial human body mesh sequence;
(4) the initial mesh sequence is organized into regular data, the mesh cuboid; the vertex information of the mesh sequence is treated as a cuboid of shape T × N × 3, where T is the number of meshes in the sequence, N is the number of vertices in a single mesh, and 3 is the coordinate dimension of a single vertex;
(5) the constructed mesh cuboid is fed into the mesh2mesh smoothing module to obtain a smoothed 3D mesh sequence;
(5.1) in the training stage, the mesh2mesh network is optimized on the training set; for each mesh in the smoothed 3D mesh sequence, the joint regressor provided with the SMPL model is first used to regress the corresponding 3D joint coordinates, which are then compared with the ground-truth human joint annotations; the joint loss function and the motion loss function are computed from their differences and used to optimize the network parameters;
(5.2) in the testing stage, the initial 3D mesh sequence of the test video obtained in steps (1)-(4) is fed into the mesh2mesh smoothing module trained in step (5.1) to obtain a smoothed 3D mesh sequence;
(6) each mesh in the smoothed 3D mesh sequence is regarded as a source mesh with a specific pose; it is fed together with the target mesh into the human body pose transfer module to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh.
CN202011101066.1A 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video Active CN112308952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Publications (2)

Publication Number Publication Date
CN112308952A true CN112308952A (en) 2021-02-02
CN112308952B CN112308952B (en) 2022-11-18

Family

ID=74327365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101066.1A Active CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Country Status (1)

Country Link
CN (1) CN112308952B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731330A (en) * 2022-11-16 2023-03-03 北京百度网讯科技有限公司 Target model generation method, animation generation method, device and electronic equipment
CN117218297A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Human body reconstruction parameter generation method, device, equipment and medium


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0936576A2 (en) * 1998-02-12 1999-08-18 Mitsubishi Denki Kabushiki Kaisha A system for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
JP2006141453A (en) * 2004-11-16 2006-06-08 Bandai Networks Co Ltd Program for video game and video game device
CN108230431A (en) * 2018-01-24 2018-06-29 深圳市云之梦科技有限公司 A kind of the human action animation producing method and system of two-dimensional virtual image
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109147048A (en) * 2018-07-23 2019-01-04 复旦大学 A kind of three-dimensional grid method for reconstructing using individual cromogram
US20200104680A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Action selection neural network training using imitation learning in latent space
WO2020064873A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
US20200104684A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
US20200222757A1 (en) * 2019-01-15 2020-07-16 Shane Yang Augmented Cognition Methods And Apparatus For Contemporaneous Feedback In Psychomotor Learning
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110188668A (en) * 2019-05-28 2019-08-30 复旦大学 A method of classify towards small sample video actions
CN111161395A (en) * 2019-11-19 2020-05-15 深圳市三维人工智能科技有限公司 Method and device for tracking facial expression and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SAMUEL ALBANIE et al.: "The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)", arXiv:2008.00744v1 *
GU ZHAOQUAN (顾钊铨) et al.: "Security Challenges and Countermeasures of Deepfake Technology" (深度伪造技术的安全挑战与应对), Information Security (《信息安全》) *


Also Published As

Publication number Publication date
CN112308952B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
Xia et al. A survey on human performance capture and animation
Zhu et al. Inferring forces and learning human utilities from videos
Ueda et al. A hand-pose estimation for vision-based human interfaces
EP3454302B1 (en) Approximating mesh deformation for character rigs
Ye et al. Synthesis of detailed hand manipulations using contact sampling
CN111028317B (en) Animation generation method, device and equipment for virtual object and storage medium
CN112308952B (en) 3D character motion generation system and method for imitating human motion in given video
Murthy et al. gradsim: Differentiable simulation for system identification and visuomotor control
Romero et al. Modeling and estimation of nonlinear skin mechanics for animated avatars
Vendrow et al. Somoformer: Multi-person pose forecasting with transformers
CN116363308A (en) Human body three-dimensional reconstruction model training method, human body three-dimensional reconstruction method and equipment
Mirolo et al. A solid modelling system for robot action planning
Walsman et al. Break and make: Interactive structural understanding using lego bricks
Krishna SignPose: Sign language animation through 3D pose lifting
Wu et al. Example-based real-time clothing synthesis for virtual agents
Wu et al. AgentDress: Realtime clothing synthesis for virtual agents using plausible deformations
Ly et al. Co-evolutionary predictors for kinematic pose inference from rgbd images
Romeo et al. Muscle Simulation with Extended Position Based Dynamics.
Kaushik et al. Imitating human movement using a measure of verticality to animate low degree-of-freedom non-humanoid virtual characters
Yang et al. Explicit-to-implicit robot imitation learning by exploring visual content change
Wang et al. A Generative Human-Robot Motion Retargeting Approach Using a Single RGBD Sensor.
Yang et al. Learning a contact potential field for modeling the hand-object interaction
Xu Single-view and multi-view methods in marker-less 3d human motion capture
Xia et al. Recent advances on virtual human synthesis
Wu et al. Video driven adaptive grasp planning of virtual hand using deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant