CN112308952A - 3D character motion generation system and method for imitating human motion in given video - Google Patents

3D character motion generation system and method for imitating human motion in given video

Info

Publication number
CN112308952A
CN112308952A
Authority
CN
China
Prior art keywords
mesh
sequence
human body
mesh2mesh
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011101066.1A
Other languages
Chinese (zh)
Other versions
CN112308952B (en)
Inventor
姜育刚 (Jiang Yugang)
傅宇倩 (Fu Yuqian)
付彦伟 (Fu Yanwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202011101066.1A priority Critical patent/CN112308952B/en
Publication of CN112308952A publication Critical patent/CN112308952A/en
Application granted granted Critical
Publication of CN112308952B publication Critical patent/CN112308952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of computers, and particularly relates to a 3D character action generation system and method for imitating the actions of a person in a given video. The system comprises four modules: initial human body reconstruction, regular-data mesh cuboid construction, mesh2mesh smoothing, and human body pose transfer. For a source video containing human actions, the initial human body reconstruction module recovers a source mesh sequence of the actor; the mesh cuboid construction module organizes the initial mesh sequence into regular data, a mesh cuboid; the mesh2mesh smoothing module further smooths the initial mesh sequence with 3D convolutions, so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character. The invention can generate a mesh sequence consistent with the actions in the source video and improves the temporal coherence of the mesh sequence.

Description

3D character motion generation system and method for imitating human motion in given video
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a 3D character action generation system and method.
Background
Character action generation has important practical significance for many computer vision tasks, including multimedia interaction and visual information understanding. Humans are very adept at learning and imitating actions from a few examples, an ability that itself plays a key role in human intelligence. It is therefore desirable for 3D characters to likewise learn to imitate actions from video samples and to generate the same action sequence as the source video.
Despite the research value of this task, there is still little work that addresses it directly. The most closely related work falls into two categories: character animation control and imitation learning.
Character animation control ([1], [2], [3], [4]) mainly studies how to make a static 3D character perform a specific motion and thereby produce an animation. Such methods are character-specific, i.e., the animation must be defined according to the skeletal structure of the particular 3D character itself.
Imitation learning ([5], [6], [7], [8]) mainly studies how to endow intelligent robots with the ability to imitate human behavior. These methods extract and summarize human knowledge by learning from and analyzing the behavior of human demonstrators in different environments, and then apply it to new environments. Imitation learning is mainly oriented towards intelligent machine systems and usually requires corresponding hardware devices or a virtual environment to support the experiments.
Different from existing methods, the invention introduces the mesh as the representation form of the 3D model, and then bridges the huge representation gap between the video domain and the character domain by exploiting existing research results in two directions, 3D human body reconstruction and human body pose transfer, thereby achieving the goal of making a target 3D character generate the same action sequence as the video. In addition, to improve the temporal coherence of the generated mesh sequence, the invention proposes a mesh2mesh smoothing network based on 3D convolution, which improves the quality of the mesh sequence produced by the initial human body reconstruction model. Compared with character animation control methods, the proposed method can automatically acquire motion information from a video example and then imitate it; compared with imitation learning methods, the proposed method requires no additional equipment and relies entirely on deep network models for transferring and imitating actions. Experiments show that the method can generate a mesh sequence consistent with the actions in the source video and effectively improve its temporal coherence.
Disclosure of Invention
The invention aims to provide a 3D character action generation system and method capable of imitating the motion of a person in a given video.
In order to reduce the huge representation gap between video and character, the invention first proposes to use the mesh (grid) as a unified 3D model representation, i.e., every 3D model involved in the invention is represented as a mesh. A mesh is defined by a set of vertices and faces: each vertex carries the vertex information of the model in the form (id, x, y, z), where id is the vertex index and x, y, z are its 3D coordinates; each face describes the connection relationship between vertices in the form (id1, id2, id3), giving the indices of the three vertices that the face connects.
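For concreteness, a minimal Python sketch of this vertex/face representation is given below; the class name, the use of NumPy arrays, and the example triangle are illustrative assumptions, not structures defined by the patent.

```python
# Minimal sketch of the mesh representation described above: vertices carry
# 3D coordinates indexed by id, and each face lists the three vertex ids it connects.
from dataclasses import dataclass
import numpy as np

@dataclass
class Mesh:
    vertices: np.ndarray  # shape (N, 3): row id holds the (x, y, z) of vertex id
    faces: np.ndarray     # shape (F, 3): each row is (id1, id2, id3)

# Example: a single triangle.
triangle = Mesh(vertices=np.array([[0.0, 0.0, 0.0],
                                   [1.0, 0.0, 0.0],
                                   [0.0, 1.0, 0.0]]),
                faces=np.array([[0, 1, 2]]))
```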
The 3D character action generation system capable of imitating the actions of a person in a given video is based on deep learning and mainly comprises the following four modules: (1) an initial human body reconstruction module; (2) a regular-data mesh cuboid construction module; (3) a mesh2mesh smoothing module; (4) a human body pose transfer module. Given a segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then organizes the initial mesh sequence into regular data, a mesh cuboid; to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character.
In the invention, the initial human body reconstruction module adopts existing mesh-based human body 3D reconstruction models (references [10] and [11]); its inputs are the source video frame images and the corresponding OpenPose results, and its output is an initial human body mesh sequence. The source video frame images are obtained by extracting and down-sampling image frames from the input video data; the corresponding OpenPose result refers to the skeletal joint information of the actor extracted from each frame with an OpenPose model (reference [9]), which determines the region of the human body within the whole frame.
In the invention, the mesh cuboid construction module organizes the initial mesh sequence into regular data, a mesh cuboid, which serves as the input of the subsequent mesh2mesh smoothing module. The mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single vertex.
In the invention, the deep neural network of the mesh2mesh smoothing module is mainly formed by stacking 3D convolution layers, and the mesh sequence is smoothed by exploiting the spatio-temporal representation capability of 3D convolution.
In the mesh2mesh smoothing module, the loss function consists of two parts:

(1) The joint loss function $L_{j3d}$ measures the error between each predicted joint position $\hat{J}_t$ of the predicted mesh and the labeled joint position $J_t$, taking the mean of the L2 distances of the per-joint offsets as the final error value. Mathematically,

$$L_{j3d} = \frac{1}{k}\sum_{i=1}^{k}\left\| \hat{J}_t^{(i)} - J_t^{(i)} \right\|_2$$

where k denotes the total number of human body joints and $\|\cdot\|_2$ denotes the Euclidean distance;

(2) The motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the offset direction vectors as the final error value. Mathematically,

$$L_{motion} = \frac{1}{k}\sum_{i=1}^{k}\left\| \left(\hat{J}_{t+1}^{(i)} - \hat{J}_t^{(i)}\right) - \left(J_{t+1}^{(i)} - J_t^{(i)}\right) \right\|_2$$

where $\|\cdot\|_2$ denotes the Euclidean distance.

The final loss function is therefore $L_{j3d} + L_{motion}$.
In the invention, the human body pose transfer module adopts a pose transfer network (reference [13]) to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh. Since an action is produced by a continuously changing pose, transferring the poses of the whole source mesh sequence into the target mesh in order makes the target mesh character generate an action consistent with the source video.
The invention further provides a 3D character action generation method capable of imitating the actions of a person in a given video, comprising the following specific steps:
(1) First, image frames are extracted from the input video data and down-sampled. Considering that source videos are long and the action difference between adjacent frames is small, each source video is down-sampled at a fixed frequency to obtain the image frames;
(2) The skeletal joint information of the actor in each frame is extracted with an OpenPose model (reference [9]) to determine the region of the human body within the whole frame;
(3) The source video frame images and the corresponding OpenPose results are fed into the initial human body reconstruction module to predict an initial human body mesh sequence; in particular, any mesh-based image or video human body 3D reconstruction method can serve as the initial human body reconstruction module, giving the method strong generality;
(4) The initial mesh sequence is organized into regular data, the mesh cuboid, which serves as the input to the subsequent network; the mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T is the number of meshes in the sequence, N is the number of vertices in a single mesh, and 3 is the coordinate dimension of a single vertex;
(5) The constructed mesh cuboid is fed into the mesh2mesh smoothing module to obtain a smoothed 3D mesh sequence;
(5.1) In the training stage, the mesh2mesh network is continuously optimized on the training set; for each mesh in the smoothed 3D mesh sequence, the joint regressor provided with the SMPL model (reference [12]) is first used to regress the corresponding 3D joint coordinates, which are then compared with the ground-truth human joint annotations; the joint loss function and the motion loss function are computed from their differences and used to optimize the network parameters;
(5.2) In the testing stage, the initial 3D mesh sequence of the test video obtained in steps (1)-(4) is fed into the mesh2mesh smoothing module trained in step (5.1) to obtain a smoothed 3D mesh sequence;
(6) Each mesh in the smoothed 3D mesh sequence is regarded as a source mesh with a specific pose; it is fed together with the target mesh into the human body pose transfer module to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh.
For evaluation, the mesh2mesh smoothing module is assessed quantitatively with MPJPE, the positional distance between the joint positions regressed from the mesh and the labeled joints; the final mesh sequence is evaluated qualitatively by visual inspection, because the overall task is generative and has no corresponding annotations.
In summary, the innovations of the invention are as follows:
1. The invention is the first to propose the visual task of learning an action sequence from video, using the mesh as the representation of the 3D model, so that a 3D character performs the same action. This task is significant for multimedia interaction and visual action understanding;
2. The invention proposes to realize video action extraction and character action generation by combining a 3D human body reconstruction method with a human body pose transfer method. Video actions can be effectively transferred to a target 3D character, and the whole generation process relies entirely on deep learning, without any additional equipment or specific virtual environment, providing a new solution for character action generation;
3. To address the problem of temporally unsmooth mesh sequences, the invention first proposes the concept of the mesh cuboid, which simplifies the originally complex-structured mesh sequence, and proposes a mesh2mesh smoothing module based on 3D convolution to model the spatio-temporal information, thereby optimizing the temporal coherence of the mesh sequence. In addition, besides the joint loss function commonly used in human body reconstruction, the invention also proposes a motion loss function targeting temporal change, which optimizes the network parameters jointly with the joint loss function.
Drawings
Fig. 1 is a schematic diagram of the 3D character action generation method proposed by the present invention. Given a source video (first row) and a target mesh human character (first column), the method comprises two main stages: recovering the original 3D mesh sequence from the video, and transferring the pose of the source mesh to the target mesh. By introducing the mesh representation, the method bridges the huge gap between human body characters and video data and generates a mesh sequence with the same action.
Fig. 2 is an overview of the system of the present invention. It contains three key modules: an initial human body reconstruction module, a mesh2mesh smoothing module, and a pose transfer module. The initial human body reconstruction module receives a video or video frame images as input and predicts an initial 3D mesh sequence of the human body; the mesh2mesh smoothing module optimizes the initial mesh sequence to obtain a mesh sequence with better temporal coherence; the pose transfer module transfers the pose of the human body from the source video mesh to the target mesh.
Fig. 3 is a schematic diagram of the mesh2mesh network structure in the present invention. The mesh sequence (subfigure a) is represented as a cuboid as shown in subfigure b; T denotes the number of meshes, H the number of vertices of a single mesh, and W the dimension of a single vertex coordinate. 3D convolution is used to model the spatio-temporal features of the T meshes. Subfigure c shows the final mesh2mesh module, containing 8 3D convolutional layers together with "unsqueeze" and "squeeze" layers for dimension adaptation.
Fig. 4 shows experimental results of mesh2mesh. For the temporal incoherence present in the initial source mesh sequence, the mesh2mesh module can effectively exploit the information of the preceding and following meshes to correct it.
Fig. 5 shows experimental results of the action generation method of the present invention. Given a video containing specific human actions, the invention can transfer the action sequence of the video to the target mesh. The figure shows generation results with meshes of 3 different identities as the target mesh.
Detailed Description
Step 1: video frame preprocessing. For the source video data, image frames are first sampled at a rate of 1 frame every 25 frames. For videos used in the training stage, T consecutive frames are randomly selected from the collected frames as the network input each time. For videos used in the testing stage, the middle T frames of the collected frames are taken as the network input each time. T is set to 16. The final selected video segment is denoted as $V = \{I_1, I_2, \dots, I_T\}$, where $I_t$ denotes the t-th image frame.
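A possible implementation of this preprocessing is sketched below in Python, assuming OpenCV for frame decoding; the sampling rate of 1 frame every 25 frames and the segment length T = 16 are taken from the text, while the function names are illustrative.

```python
# Sketch of step 1: down-sample the source video at 1 frame every 25 frames,
# then pick a contiguous segment of T = 16 frames (random for training,
# centred for testing).
import random
import cv2

def sample_frames(video_path: str, stride: int = 25):
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

def pick_segment(frames, segment_len: int = 16, training: bool = True):
    if len(frames) < segment_len:
        return frames
    if training:                       # random contiguous T frames
        start = random.randint(0, len(frames) - segment_len)
    else:                              # middle T frames at test time
        start = (len(frames) - segment_len) // 2
    return frames[start:start + segment_len]
```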
Step 2: obtain the region of the actor in the image. Each selected video frame $I_t$ is fed into the existing OpenPose network (reference [9]) to predict the corresponding human skeletal joints; the region of the actor is then determined from the joint coordinates, which helps the initial human body reconstruction model perform the 3D mesh reconstruction better.
Step 3: recover the source mesh sequence from the input video. The human source mesh sequence is recovered directly from the input video with existing 3D human body reconstruction methods. The invention conducts experiments on both GraphCMR (reference [10]) and SMPLify-X (reference [11]) to verify the generality and transferability of the method. Both models are based on the SMPL body model (reference [12]); an SMPL-based 3D mesh contains 6890 vertex coordinates. GraphCMR models the SMPL mesh with a graph convolutional neural network and obtains the 6890 vertex coordinates by regression, while SMPLify-X performs human body reconstruction by fitting the joints of the SMPL mesh to the labeled joints. For a video input containing T frames, the pair <video image, OpenPose prediction> of each frame is taken as input, and the 3D mesh is obtained after model prediction. The final initial mesh sequence is denoted as $\hat{M} = \{\hat{M}_1, \dots, \hat{M}_T\}$, where $\hat{M}_t$ denotes the t-th 3D mesh.
Step 4: construct the mesh cuboid. The mesh cuboid ignores the complex point-face connection relationship of the mesh, keeps the vertex connectivity unchanged, and only models the coordinate position of each vertex for fine-tuning by the subsequent network. For the initial mesh sequence $\hat{M} = \{\hat{M}_1, \dots, \hat{M}_T\}$, the vertex positions of each $\hat{M}_t$ are stacked to form a mesh cuboid of shape T × N × 3, where N denotes the number of vertices in a single mesh; in the SMPL body model representation, N equals 6890.
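As a concrete reading of this step, the sketch below stacks the per-mesh vertex arrays into a T × N × 3 cuboid; the input format (a list of (vertices, faces) pairs) and the function name are illustrative assumptions.

```python
# Minimal sketch of step 4: stack the vertices of each mesh in the initial
# sequence into a T x N x 3 "mesh cuboid", discarding the face connectivity.
import numpy as np

def build_mesh_cuboid(meshes):
    # meshes: list of T (vertices, faces) pairs, vertices of shape (N, 3).
    vertex_arrays = [np.asarray(vertices, dtype=np.float32)
                     for vertices, _faces in meshes]
    cuboid = np.stack(vertex_arrays, axis=0)  # shape (T, N, 3); N = 6890 for SMPL
    return cuboid
```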
Step 5: construct the mesh2mesh smoothing module. The mesh2mesh smoothing module is mainly formed by stacking 3D convolution layers, as shown in Fig. 3. In the network design, the invention uses 5 × 1 × 3 convolution kernels with stride (1,1,1) and padding (2,0,1) so that the input and output dimensions remain T × N × 3. When GraphCMR or SMPLify-X is used as the initial human body reconstruction model, the body of mesh2mesh consists of 8 such 3D convolutional layers. In addition, an unsqueeze layer (adding a dimension of size 1) and a squeeze layer (removing a dimension of size 1) are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input with a batch size of 1. The mesh vertex coordinates smoothed by mesh2mesh are combined with the original vertex connectivity to reconstruct a new smoothed mesh sequence $\tilde{M} = \{\tilde{M}_1, \dots, \tilde{M}_T\}$.
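A minimal PyTorch sketch of such a module is given below, assuming the kernel size, stride, padding, 8-layer depth, and head/tail unsqueeze/squeeze described above; the hidden channel width and the ReLU activations are illustrative assumptions not specified by the patent.

```python
# Hedged sketch of a mesh2mesh-style smoothing network built from stacked
# 3D convolutions that preserve the (T, N, 3) cuboid shape.
import torch
import torch.nn as nn

class Mesh2Mesh(nn.Module):
    def __init__(self, num_layers: int = 8, hidden_channels: int = 64):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(num_layers):
            out_ch = 1 if i == num_layers - 1 else hidden_channels
            layers.append(nn.Conv3d(in_ch, out_ch,
                                    kernel_size=(5, 1, 3),
                                    stride=(1, 1, 1),
                                    padding=(2, 0, 1)))
            if i < num_layers - 1:
                layers.append(nn.ReLU(inplace=True))
            in_ch = out_ch
        self.body = nn.Sequential(*layers)

    def forward(self, cuboid: torch.Tensor) -> torch.Tensor:
        # cuboid: (T, N, 3); add batch and channel dimensions of size 1.
        x = cuboid.unsqueeze(0).unsqueeze(0)   # (1, 1, T, N, 3)
        x = self.body(x)                       # spatial dims preserved by padding
        return x.squeeze(0).squeeze(0)         # back to (T, N, 3)

# Example: smooth a sequence of 16 SMPL meshes (6890 vertices each).
smoother = Mesh2Mesh()
smoothed = smoother(torch.randn(16, 6890, 3))
print(smoothed.shape)  # torch.Size([16, 6890, 3])
```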
Step 6: train the mesh2mesh smoothing module. Each video frame $I_t$ in the video dataset has corresponding annotation $J_t$, the 3D coordinates of the labeled human body joints at time t, which serves as supervision for model learning. The loss function is therefore designed from the error between the mesh sequence $\tilde{M}$ predicted by mesh2mesh and the annotations. First, the joint regressor provided with the SMPL model is used to regress the corresponding 3D joint information $\hat{J}_t \in \mathbb{R}^{k \times 3}$ from the mesh sequence $\tilde{M}$, where k denotes the number of joints. The commonly used error term is the joint loss function $L_{j3d}$, which measures the error between each predicted joint position $\hat{J}_t$ and the labeled joint position $J_t$, taking the mean of the L2 distances of the per-joint offsets as the final error value:

$$L_{j3d} = \frac{1}{k}\sum_{i=1}^{k}\left\| \hat{J}_t^{(i)} - J_t^{(i)} \right\|_2$$

In addition, the invention proposes a new motion loss function $L_{motion}$, which measures whether the movement direction of each predicted joint between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the offset direction vectors as the final error value:

$$L_{motion} = \frac{1}{k}\sum_{i=1}^{k}\left\| \left(\hat{J}_{t+1}^{(i)} - \hat{J}_t^{(i)}\right) - \left(J_{t+1}^{(i)} - J_t^{(i)}\right) \right\|_2$$

where $\|\cdot\|_2$ denotes the Euclidean distance of the corresponding joints.

The final loss function is therefore $L_{j3d} + L_{motion}$.
The network parameters are trained with the SGD (stochastic gradient descent) optimization algorithm.
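As a rough illustration of the two loss terms defined above, the sketch below computes $L_{j3d}$ and $L_{motion}$ for a predicted and a ground-truth joint sequence; the tensor shapes and the averaging over both frames and joints are assumptions consistent with, but not dictated by, the description.

```python
# Hedged sketch of the joint loss and motion loss described in step 6.
# pred_joints and gt_joints: (T, k, 3) tensors of 3D joint positions.
import torch

def joint_loss(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    # Mean L2 distance between predicted and labeled joint positions.
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def motion_loss(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    # Per-joint offsets between consecutive frames, compared between
    # prediction and annotation; mean L2 distance of the offset vectors.
    pred_offset = pred_joints[1:] - pred_joints[:-1]
    gt_offset = gt_joints[1:] - gt_joints[:-1]
    return (pred_offset - gt_offset).norm(dim=-1).mean()

def total_loss(pred_joints, gt_joints):
    return joint_loss(pred_joints, gt_joints) + motion_loss(pred_joints, gt_joints)
```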
Step 7: testing. The initial 3D mesh sequence of the test video and the corresponding initial mesh cuboid are obtained following steps 1-4, and then fed into the mesh2mesh module trained in step 6 to obtain the smoothed 3D mesh sequence $\tilde{M}$.
Step 8: pose transfer. Each smoothed mesh $\tilde{M}_t$, regarded as a source mesh with a specific pose, is fed together with the target mesh $M_{id}$ into the pose transfer network (reference [13]) to generate a new mesh $M_{new}$ that simultaneously retains the identity information of $M_{id}$ and the pose information of $\tilde{M}_t$. Since an action is produced by a continuously changing pose, sequentially transferring the poses of the whole source mesh sequence into the target mesh makes the target mesh character generate an action consistent with the source video.
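The frame-by-frame transfer in this step can be written as a simple loop, sketched below; the `pose_transfer` callable standing in for the pose transfer network of reference [13] is a hypothetical placeholder, not an API provided by that work.

```python
# Sketch of step 8: sequentially transfer the pose of each smoothed source
# mesh onto the target mesh, producing the generated action sequence.
# pose_transfer(source_mesh, target_mesh) is assumed to return a new mesh
# with the target's identity and the source's pose.
def generate_action_sequence(smoothed_source_meshes, target_mesh, pose_transfer):
    generated = []
    for source_mesh in smoothed_source_meshes:   # one mesh per video frame
        new_mesh = pose_transfer(source_mesh, target_mesh)
        generated.append(new_mesh)
    return generated
```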
Method evaluation. For the mesh2mesh model, since labeled joint information is available for reference, MPJPE (Mean Per Joint Position Error) is adopted as the evaluation metric. For the overall action generation task, since no reference information is available, the method is evaluated qualitatively by visual inspection. The experimental results (Table 1) show that the proposed mesh2mesh smoothing module improves the performance of the initial human body reconstruction model, reduces the error against the labeled joints, and improves the reconstruction quality. Fig. 4 visually shows that mesh2mesh corrects mesh sequences with abnormal temporal motion, making the overall action more coherent. The effect of the character action generation method is shown in Fig. 5: given a source mesh with a specific action, the method makes the target mesh accurately imitate the action in the source video; it can handle target meshes with different initial poses and different identities, and generates mesh sequences with a high degree of detail fidelity for them.
TABLE 1. Quantitative test results (MPJPE; lower values indicate better model performance)

Model                                                   GraphCMR    SMPLify-X
Initial human body reconstruction model                 74.7        136.4
Initial human body reconstruction model + mesh2mesh     72.8        128.4
References
[1] A. Macchietto, V. Zordan, and C. R. Shelton. Momentum control for balance. In ACM SIGGRAPH 2009 Papers, pages 1–8, 2009.
[2] I. Mordatch, M. De Lasa, and A. Hertzmann. Robust physics-based locomotion using low-dimensional planning. In ACM SIGGRAPH 2010 Papers, pages 1–8, 2010.
[3] L. Kavan, S. Collins, J. Zara, and C. O'Sullivan. Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 39–46, 2007.
[4] R. Wareham and J. Lasenby. Bone glow: An improved method for the assignment of weights for mesh deformation. In International Conference on Articulated Motion and Deformable Objects, pages 63–71. Springer, 2008.
[5] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
[6] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[7] K. R. Dixon and P. K. Khosla. Learning by observation with mobile robots: A computational approach. In IEEE International Conference on Robotics and Automation (ICRA 2004), volume 1, pages 102–107. IEEE, 2004.
[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[9] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. TPAMI, 2019.
[10] N. Kolotouros, G. Pavlakos, and K. Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019.
[11] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[12] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. In SIGGRAPH Asia, 2015.
[13] J. Wang, C. Wen, Y. Fu, H. Lin, T. Zou, X. Xue, and Y. Zhang. Neural pose transfer by spatially adaptive instance normalization. In CVPR, 2020.

Claims (7)

1. A 3D character action generation system that imitates the actions of a person in a given video, characterized by comprising the following four modules: (1) an initial human body reconstruction module; (2) a regular-data mesh cuboid construction module; (3) a mesh2mesh smoothing module; (4) a human body pose transfer module; for a given segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then organizes the initial mesh sequence into regular data, a mesh cuboid; to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character; the mesh is used as a unified 3D model representation and is defined by a set of vertices and faces: each vertex carries the vertex information of the model in the form (id, x, y, z), where id is the vertex index and x, y, z are its 3D coordinates; each face describes the connection relationship between vertices in the form (id1, id2, id3), giving the indices of the three vertices that the face connects.
2. The 3D character action generation system according to claim 1, wherein the initial human body reconstruction module adopts a mesh-based human body 3D reconstruction model whose inputs are the source video frame images and the corresponding OpenPose results and whose output is an initial human body mesh sequence; the source video frame images are obtained by extracting and down-sampling image frames from the input video data; the corresponding OpenPose result refers to the skeletal joint information of the actor extracted from each frame with an OpenPose model, which determines the region of the human body within the whole frame.
3. The 3D character action generation system according to claim 2, wherein the mesh cuboid construction module organizes the initial mesh sequence into regular data, a mesh cuboid, as the input of the subsequent mesh2mesh smoothing module; the mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single vertex.
4. The 3D character action generation system according to claim 3, wherein the network of the mesh2mesh smoothing module is mainly formed by stacking 3D convolution layers, and the mesh sequence is smoothed by exploiting the spatio-temporal representation capability of 3D convolution;

in the mesh2mesh smoothing module, the loss function consists of two parts:

(1) the joint loss function $L_{j3d}$ measures the error between each predicted joint position $\hat{J}_t$ of the predicted mesh and the labeled joint position $J_t$, taking the mean of the L2 distances of the per-joint offsets as the final error value:

$$L_{j3d} = \frac{1}{k}\sum_{i=1}^{k}\left\| \hat{J}_t^{(i)} - J_t^{(i)} \right\|_2$$

where k denotes the total number of human body joints and $\|\cdot\|_2$ denotes the Euclidean distance;

(2) the motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the offset direction vectors as the final error value:

$$L_{motion} = \frac{1}{k}\sum_{i=1}^{k}\left\| \left(\hat{J}_{t+1}^{(i)} - \hat{J}_t^{(i)}\right) - \left(J_{t+1}^{(i)} - J_t^{(i)}\right) \right\|_2$$

where $\|\cdot\|_2$ denotes the Euclidean distance;

the final loss function is: $L_{j3d} + L_{motion}$.
5. The 3D character action generation system according to claim 4, wherein in the mesh2mesh smoothing module the 3D convolution layers use a 5 × 1 × 3 convolution kernel with stride (1,1,1) and padding (2,0,1), so that the input and output data dimensions remain T × N × 3; the body of mesh2mesh consists of 8 such 3D convolutional layers; in addition, an unsqueeze layer, which adds a dimension of size 1, and a squeeze layer, which removes a dimension of size 1, are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input with a batch size of 1.
6. The 3D character action generation system according to claim 5, wherein the human body pose transfer module adopts a pose transfer network to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh, i.e., the poses of the whole source mesh sequence are transferred into the target mesh in order, so that the target mesh character generates an action consistent with the source video.
7. A 3D character action generation method based on the system of any one of claims 1 to 6, characterized by comprising the following steps:
(1) first, image frames are extracted from the input video data and down-sampled; each source video is down-sampled at a fixed frequency to obtain the image frames;
(2) the skeletal joint information of the actor in each frame is extracted with an OpenPose model to determine the region of the human body within the whole frame;
(3) the source video frame images and the corresponding OpenPose results are fed into the initial human body reconstruction module to predict an initial human body mesh sequence;
(4) the initial mesh sequence is organized into regular data, the mesh cuboid; the vertex information of the mesh sequence is treated as a cuboid of shape T × N × 3, where T is the number of meshes in the sequence, N is the number of vertices in a single mesh, and 3 is the coordinate dimension of a single vertex;
(5) the constructed mesh cuboid is fed into the mesh2mesh smoothing module to obtain a smoothed 3D mesh sequence;
(5.1) in the training stage, the mesh2mesh network is optimized on the training set; for each mesh in the smoothed 3D mesh sequence, the joint regressor provided with the SMPL model is first used to regress the corresponding 3D joint coordinates, which are then compared with the ground-truth human joint annotations; the joint loss function and the motion loss function are computed from their differences and used to optimize the network parameters;
(5.2) in the testing stage, the initial 3D mesh sequence of the test video obtained in steps (1)-(4) is fed into the mesh2mesh smoothing module trained in step (5.1) to obtain a smoothed 3D mesh sequence;
(6) each mesh in the smoothed 3D mesh sequence is regarded as a source mesh with a specific pose; it is fed together with the target mesh into the human body pose transfer module to generate a new mesh that simultaneously retains the identity information of the target mesh and the pose information of the source mesh.
CN202011101066.1A 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video Active CN112308952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Publications (2)

Publication Number Publication Date
CN112308952A true CN112308952A (en) 2021-02-02
CN112308952B CN112308952B (en) 2022-11-18

Family

ID=74327365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101066.1A Active CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Country Status (1)

Country Link
CN (1) CN112308952B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731330A (en) * 2022-11-16 2023-03-03 北京百度网讯科技有限公司 Target model generation method, animation generation method, device and electronic equipment
CN117218297A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Human body reconstruction parameter generation method, device, equipment and medium


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0936576A2 (en) * 1998-02-12 1999-08-18 Mitsubishi Denki Kabushiki Kaisha A system for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
JP2006141453A (en) * 2004-11-16 2006-06-08 Bandai Networks Co Ltd Program for video game and video game device
CN108230431A (en) * 2018-01-24 2018-06-29 深圳市云之梦科技有限公司 A kind of the human action animation producing method and system of two-dimensional virtual image
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109147048A (en) * 2018-07-23 2019-01-04 复旦大学 A kind of three-dimensional grid method for reconstructing using individual cromogram
US20200104680A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Action selection neural network training using imitation learning in latent space
WO2020064873A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
US20200104684A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
US20200222757A1 (en) * 2019-01-15 2020-07-16 Shane Yang Augmented Cognition Methods And Apparatus For Contemporaneous Feedback In Psychomotor Learning
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110188668A (en) * 2019-05-28 2019-08-30 复旦大学 A method of classify towards small sample video actions
CN111161395A (en) * 2019-11-19 2020-05-15 深圳市三维人工智能科技有限公司 Method and device for tracking facial expression and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SAMUEL ALBANIE et al.: "The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)", arXiv:2008.00744v1 *
GU ZHAOQUAN (顾钊铨) et al.: "Security Challenges and Countermeasures of Deepfake Technology" (深度伪造技术的安全挑战与应对), Information Security (《信息安全》) *


Also Published As

Publication number Publication date
CN112308952B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
Xia et al. A survey on human performance capture and animation
Zhu et al. Inferring forces and learning human utilities from videos
Ueda et al. A hand-pose estimation for vision-based human interfaces
EP3454302B1 (en) Approximating mesh deformation for character rigs
Ye et al. Synthesis of detailed hand manipulations using contact sampling
CN111028317B (en) Animation generation method, device and equipment for virtual object and storage medium
CN112308952B (en) 3D character motion generation system and method for imitating human motion in given video
Murthy et al. gradsim: Differentiable simulation for system identification and visuomotor control
Romero et al. Modeling and estimation of nonlinear skin mechanics for animated avatars
Vendrow et al. Somoformer: Multi-person pose forecasting with transformers
CN116363308A (en) Human body three-dimensional reconstruction model training method, human body three-dimensional reconstruction method and equipment
Mirolo et al. A solid modelling system for robot action planning
Walsman et al. Break and make: Interactive structural understanding using lego bricks
Krishna SignPose: Sign language animation through 3D pose lifting
Wu et al. Example-based real-time clothing synthesis for virtual agents
Wu et al. AgentDress: Realtime clothing synthesis for virtual agents using plausible deformations
Ly et al. Co-evolutionary predictors for kinematic pose inference from rgbd images
Romeo et al. Muscle Simulation with Extended Position Based Dynamics.
Kaushik et al. Imitating human movement using a measure of verticality to animate low degree-of-freedom non-humanoid virtual characters
Yang et al. Explicit-to-implicit robot imitation learning by exploring visual content change
Wang et al. A Generative Human-Robot Motion Retargeting Approach Using a Single RGBD Sensor.
Yang et al. Learning a contact potential field for modeling the hand-object interaction
Xu Single-view and multi-view methods in marker-less 3d human motion capture
Xia et al. Recent advances on virtual human synthesis
Wu et al. Video driven adaptive grasp planning of virtual hand using deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant