CN112308952B - 3D character motion generation system and method for imitating human motion in given video - Google Patents

3D character motion generation system and method for imitating human motion in given video

Info

Publication number
CN112308952B
CN112308952B
Authority
CN
China
Prior art keywords
mesh
sequence
human body
action
mesh2mesh
Prior art date
Legal status
Active
Application number
CN202011101066.1A
Other languages
Chinese (zh)
Other versions
CN112308952A (en)
Inventor
姜育刚
傅宇倩
付彦伟
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202011101066.1A priority Critical patent/CN112308952B/en
Publication of CN112308952A publication Critical patent/CN112308952A/en
Application granted granted Critical
Publication of CN112308952B publication Critical patent/CN112308952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Abstract

The invention belongs to the technical field of computers, and specifically relates to a 3D character action generation system and method for imitating the motion of a person in a given video. The system comprises four modules: initial human body reconstruction, mesh cuboid construction, mesh2mesh smoothing, and human body pose transfer. For a video containing a source human action, the initial human body reconstruction module recovers the source mesh sequence of the actor; the mesh cuboid construction module arranges the initial mesh sequence into a regular data cuboid; the mesh2mesh smoothing module further smooths the initial mesh sequence with 3D convolutions, making the motion of the mesh sequence more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character. The invention can generate a mesh sequence consistent with the motion in the source video and improves the temporal coherence of the mesh sequence.

Description

3D character motion generation system and method for simulating human motion in given video
Technical Field
The invention belongs to the technical field of computers, and specifically relates to a 3D character action generation system and method.
Background
Character action generation is of great practical significance for many computer vision tasks, including multimedia interaction and visual information understanding. Humans are very good at learning and imitating actions from a few examples, an ability that itself plays a key role in human intelligence. It is therefore desirable for a 3D character to likewise learn to imitate actions from video samples and generate the same action sequence as the source video.
Despite the research value of this task, there is still little work that addresses it directly. The most closely related lines of work are character animation control and imitation learning.
Character animation control ([1], [2], [3], [4]) mainly studies how to make a static 3D character perform a specific motion and thereby produce an animation effect. Such methods are character-specific: the animation must be defined according to the skeletal structure of the particular 3D character.
Imitation learning ([5], [6], [7], [8]) mainly studies how to give intelligent robots the ability to imitate human behavior. These methods extract and summarize human knowledge, that is, they learn and analyze the behavior of human players in different environments to abstract logical ability, and then apply it to new environments. Imitation learning is mainly oriented toward intelligent machine systems and usually requires corresponding hardware devices or a virtual environment to support the experiments.
Different from existing methods, the invention introduces the mesh as the representation of the 3D model, and then leverages existing research results in human 3D reconstruction and human pose transfer to bridge the large representation gap between the video domain and the character domain, so that a target 3D character can generate the same action sequence as the video. In addition, to improve the temporal coherence of the generated mesh sequence, the invention proposes a mesh2mesh smoothing network based on 3D convolution to improve the quality of the mesh sequence produced by the initial human body reconstruction model. Compared with character animation control methods, the invention can automatically acquire motion information from a video example and then imitate it; compared with imitation learning methods, the invention requires no additional equipment and relies entirely on deep network models to transfer and imitate actions. Experiments show that the method can generate a mesh sequence consistent with the motion in the source video and substantially improve its temporal coherence.
Disclosure of Invention
The invention aims to provide a 3D character action generation system and method capable of imitating the actions of a person in a given video.
To reduce the large representation gap between video and character, the invention first proposes to use the mesh (grid) as a unified 3D model representation, i.e., every 3D model involved in the invention is represented as a mesh. A mesh is defined by a set of vertices and faces: each vertex carries the information (id, x, y, z), where id is the vertex index and x, y, z are the 3D coordinates of the vertex; each face describes the connection relationship between vertices in the form (id1, id2, id3), i.e., the indices of the three vertices that the face connects.
The 3D character action generation system capable of imitating the human actions in a given video is based on deep learning and mainly comprises the following four modules: (1) an initial human body reconstruction module; (2) a mesh cuboid construction module; (3) a mesh2mesh smoothing module; and (4) a human body pose transfer module. Given a segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then arranges the initial mesh sequence into a regular data cuboid (the mesh cuboid); to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character.
In the invention, the initial human body reconstruction module adopts existing mesh-based human 3D reconstruction models (references [10] and [11]). Its input is the source video frame images together with the corresponding OpenPose results, and its output is the initial human mesh sequence. The source video frame images are obtained by extracting and down-sampling frames from the input video data; the corresponding OpenPose results are obtained by using the OpenPose model (reference [9]) to extract the skeletal joint points of the actor in each frame, which determine the region of the human body within the whole frame.
In the invention, the mesh cuboid construction module arranges the initial mesh sequence into a regular data cuboid that serves as the input of the subsequent mesh2mesh smoothing module. The mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single mesh vertex.
In the invention, the deep neural network of the mesh2mesh smoothing module is mainly formed by stacked 3D convolution layers, and the mesh sequence is smoothed by exploiting the spatio-temporal representation capability of 3D convolution.
In the mesh2mesh smoothing module, the loss function consists of two parts:
(1) The joint loss function $L_{j3d}$ measures the deviation between the joint positions $\hat{J}_t$ regressed from the predicted mesh and the labeled joint positions $J_t$, taking the mean of the L2 distances over all joints as the error value. Its mathematical form is:

$L_{j3d} = \frac{1}{k} \sum_{i=1}^{k} \left\| \hat{J}_t^{\,i} - J_t^{\,i} \right\|_2$

where k is the total number of body joints and $\|\cdot\|$ denotes the Euclidean distance;
(2) The motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint $\hat{J}_t$ between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the displacement vectors as the error value. Its mathematical form is:

$L_{motion} = \frac{1}{k} \sum_{i=1}^{k} \left\| (\hat{J}_{t+1}^{\,i} - \hat{J}_t^{\,i}) - (J_{t+1}^{\,i} - J_t^{\,i}) \right\|_2$

where $\|\cdot\|$ denotes the Euclidean distance;
The final loss function is therefore: $L_{j3d} + L_{motion}$.
In the invention, the human body pose transfer module adopts a pose transfer network (reference [13]) to generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh. Since an action is produced by a continuously changing pose, transferring the poses of the whole source mesh sequence onto the target mesh in order makes the target mesh character perform an action consistent with the source video.
The invention provides a 3D character action generation method capable of imitating the human actions in a given video, comprising the following specific steps:
(1) First, perform frame extraction and down-sampling on the input video data. Considering that source videos are long and the action difference between adjacent frames is small, each source video is down-sampled at a certain frequency to obtain image frames;
(2) Extract the skeletal joint points of the actor in each frame with the OpenPose model (reference [9]) to determine the region of the human body within the whole frame;
(3) Input the source video frames and the corresponding OpenPose results into the initial human body reconstruction module and predict the initial human mesh sequence; in particular, any mesh-based image or video human 3D reconstruction method can serve as the initial human body reconstruction module, so the method has strong generality;
(4) Arrange the initial mesh sequence into a regular data cuboid, the mesh cuboid, which serves as the input of the subsequent network; the mesh cuboid ignores the complex human body structure, i.e., the connection relationship between vertices is not considered, and treats the vertex information of the mesh sequence as a cuboid of shape T × N × 3, where T is the number of meshes in the sequence, N is the number of vertices in a single mesh, and 3 is the coordinate dimension of a single mesh vertex;
(5) Input the constructed mesh cuboid into the mesh2mesh smoothing module to obtain the smoothed 3D mesh sequence;
(5.1) Train the mesh2mesh network, continuously optimizing it with the data in the training set; for each mesh in the smoothed 3D mesh sequence, first use the joint regressor provided by the SMPL model (reference [12]) to regress the corresponding 3D joint coordinates, then compare them with the ground-truth human joint annotations, compute the joint loss and the motion loss from the difference, and optimize the network parameters accordingly;
(5.2) In the testing stage, input the initial 3D mesh sequence of the test video obtained in step (1) into the mesh2mesh smoothing module trained in step (5.1) to obtain the smoothed 3D mesh sequence;
(6) Regard each mesh in the smoothed 3D mesh sequence as a source mesh with a specific pose, input it together with the target mesh into the human body pose transfer module, and generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh.
The method is evaluated as follows: the mesh2mesh smoothing module is evaluated quantitatively with MPJPE, the positional deviation between the joint positions regressed from the mesh and the labeled joints; the final mesh sequence is evaluated qualitatively by visualization, since the overall task is generative and has no corresponding annotations.
In summary, the innovations of the invention are as follows:
1. The invention is the first to propose the visual task of using the mesh as the 3D model representation to learn an action sequence from video so that a 3D character performs the same action. This task is important for multimedia interaction and visual action understanding;
2. The invention proposes to realize video action extraction and character action generation by combining a 3D human reconstruction method with a human pose transfer method. Video actions can be effectively transferred to the target 3D character, the whole generation process relies entirely on deep learning, and no additional equipment or specific virtual environment is needed, providing a new solution for character action generation;
3. For the problem of unsmooth mesh sequences, the invention first proposes the concept of the mesh cuboid, which simplifies the originally complex mesh sequence, and proposes a 3D-convolution-based mesh2mesh smoothing module to model the spatio-temporal information, thereby optimizing the temporal coherence of the meshes. In addition, besides the joint loss function commonly used in human body reconstruction, the invention proposes a motion loss function targeting temporal change, which is used together with the joint loss function to optimize the network parameters.
Drawings
Fig. 1 is a schematic diagram of the 3D character action generation method proposed by the invention. Given a source video (first row) and target mesh human characters (first column), the method consists of two main stages: recovering the initial 3D mesh sequence from the video, and transferring the pose of the source mesh onto the target mesh. By introducing the mesh representation, the method bridges the large gap between human characters and video data and generates a mesh sequence with the same action.
Fig. 2 is an overview of the system of the invention. It contains three key modules: the initial human body reconstruction module, the mesh2mesh smoothing module, and the pose transfer module. The initial human body reconstruction module receives a video or video frame images as input and predicts the initial 3D mesh sequence of the human body; the mesh2mesh smoothing module optimizes the initial mesh sequence to obtain a mesh sequence with better temporal coherence; and the pose transfer module transfers the human pose from the source video mesh onto the target mesh.
Fig. 3 is a schematic diagram of the mesh2mesh network structure in the invention. The mesh sequence (sub-figure a) is represented as a cuboid as shown in sub-figure b; T denotes the number of meshes, H the number of vertices of a single mesh, and W the dimension of a single vertex coordinate. 3D convolution is used to model the spatio-temporal features of the T meshes. Sub-figure c shows the final mesh2mesh module, containing 8 3D convolutional layers and the two "unsqueeze" and "squeeze" layers used for dimension adaptation.
Fig. 4 shows experimental results of mesh2mesh in the invention. For the temporal incoherence present in the initial source mesh sequence, the mesh2mesh module can effectively exploit the information of the preceding and following meshes and improve the sequence.
Fig. 5 shows results of the action generation method of the invention. Given a video containing a specific human action, the invention can transfer the action sequence of the video to the target mesh. The figure shows the generation results with meshes of 3 different identities as the target mesh.
Detailed Description
Step 1, video frame preprocessing. For the source video data, image frames are first collected at a rate of 1 frame every 25 frames. For videos used in the training stage, T consecutive frames are randomly selected from the collected frames each time as the network input. For videos used in the testing stage, the middle T frames of the collected frames are taken each time as the network input. T is set to 16. The finally selected video clip is denoted $\{I_t\}_{t=1}^{T}$, where $I_t$ denotes the t-th image frame.
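As an illustration of this sampling scheme, the following sketch collects 1 frame out of every 25 with OpenCV and returns a window of T = 16 sampled frames; the function name sample_clip and the training/testing switch are illustrative, not part of the patented method.

```python
import cv2
import random

def sample_clip(video_path, step=25, clip_len=16, training=True):
    """Down-sample a source video (1 frame every `step` frames) and pick
    a clip of `clip_len` consecutive sampled frames, as in Step 1."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    if training:   # random contiguous window for training
        start = random.randint(0, max(len(frames) - clip_len, 0))
    else:          # middle window for testing
        start = max((len(frames) - clip_len) // 2, 0)
    return frames[start:start + clip_len]
```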
Step 2, obtaining the region of the actor in the image. Each selected video frame $I_t$ is fed into the existing OpenPose network (reference [9]) to predict the corresponding human skeletal joint points, and the region of the actor is determined from the joint coordinates so as to help the initial human body reconstruction model perform 3D human mesh reconstruction.
Step 3, recovering the source mesh sequence from the input video. The human source mesh sequence is recovered directly from the input video with existing 3D human reconstruction methods. The invention runs experiments on both GraphCMR (reference [10]) and SMPLify-X (reference [11]) to verify the generality and transferability of the method. Both models are based on the SMPL human model (reference [12]); an SMPL-based 3D mesh contains 6890 vertex coordinates. GraphCMR models the SMPL mesh with a graph convolutional neural network and regresses the 6890 vertex coordinates. SMPLify-X performs human reconstruction by fitting the joints of the SMPL mesh to the labeled joint points. For a video input containing T frames, the pair <video image, OpenPose prediction> is fed frame by frame into the model to obtain the predicted 3D mesh. The final initial mesh sequence is denoted $\{M_t\}_{t=1}^{T}$, where $M_t$ denotes the t-th human 3D mesh.
Step 4, constructing the mesh cuboid. The mesh cuboid ignores the complex vertex–face connectivity of the mesh: the connection relationship between vertices is kept unchanged and untouched, and only the coordinate positions of the vertices are modeled for fine adjustment by the subsequent network. For the initial mesh sequence $\{M_t\}_{t=1}^{T}$, the vertex positions of each $M_t$ are stacked to form a mesh cuboid of shape T × N × 3, where N is the number of vertices in a single mesh; N = 6890 in the SMPL human model representation.
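A minimal sketch of this stacking step, assuming the T reconstructed meshes are already available as (N, 3) vertex arrays; the function name build_mesh_cuboid is illustrative.

```python
import numpy as np

def build_mesh_cuboid(vertex_arrays):
    """Stack the vertex arrays of T meshes into a T x N x 3 cuboid.

    vertex_arrays: list of T arrays, each of shape (N, 3), holding the
    3D vertex coordinates of one SMPL mesh (N = 6890 for SMPL).
    The face (vertex-connectivity) information is deliberately ignored
    here; it is re-attached unchanged after smoothing.
    """
    cuboid = np.stack(vertex_arrays, axis=0)   # shape (T, N, 3)
    assert cuboid.ndim == 3 and cuboid.shape[2] == 3
    return cuboid
```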
Step 5, constructing the mesh2mesh smoothing module. As shown in Fig. 3, the mesh2mesh smoothing module is mainly formed by stacked 3D convolution layers. In the network design, the invention uses a 5 × 1 × 3 convolution kernel with stride (1, 1, 1) and padding (2, 0, 1) to keep the input and output data dimensions at T × N × 3. With GraphCMR and SMPLify-X as the initial human body reconstruction models, the body of mesh2mesh consists of 8 such 3D convolutional layers. In addition, an unsqueeze layer (adding a dimension of size 1) and a squeeze layer (removing a dimension of size 1) are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input with a batch size of 1. The mesh vertex coordinates smoothed by mesh2mesh are combined with the original vertex connectivity to reconstruct the new smoothed mesh sequence $\{\hat{M}_t\}_{t=1}^{T}$.
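The following PyTorch sketch illustrates the structure described above (8 Conv3d layers with a 5 × 1 × 3 kernel, stride (1, 1, 1) and padding (2, 0, 1), plus unsqueeze/squeeze for a batch of size 1). The class name is illustrative; treating the cuboid as a single-channel volume and omitting activation functions between layers, which the text does not specify, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class Mesh2MeshSmoother(nn.Module):
    """Sketch of the mesh2mesh smoothing network (illustrative)."""
    def __init__(self, num_layers=8):
        super().__init__()
        # Kernel (5, 1, 3) with padding (2, 0, 1) preserves the T x N x 3 shape.
        self.body = nn.Sequential(*[
            nn.Conv3d(1, 1, kernel_size=(5, 1, 3),
                      stride=(1, 1, 1), padding=(2, 0, 1))
            for _ in range(num_layers)
        ])

    def forward(self, cuboid):                  # cuboid: (T, N, 3)
        x = cuboid.unsqueeze(0).unsqueeze(0)    # -> (1, 1, T, N, 3), batch size 1
        x = self.body(x)                        # shape preserved
        return x.squeeze(0).squeeze(0)          # -> (T, N, 3)
```

In use, the smoothed vertex coordinates returned by this module would be re-attached to the original face connectivity to rebuild the smoothed mesh sequence.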
Step 6, training the mesh2mesh smoothing module. Every video frame $I_t$ in the video dataset has a corresponding annotation $J_t$, the labeled 3D coordinates of the human body joints at time t, which serves as supervision for model learning. The model is therefore trained with loss functions defined on the error between the mesh sequence $\{\hat{M}_t\}_{t=1}^{T}$ predicted by mesh2mesh and the annotations. First, the invention uses the joint regressor provided with the SMPL model to regress the corresponding 3D joints $\{\hat{J}_t\}_{t=1}^{T}$ from the mesh sequence, where $\hat{J}_t \in \mathbb{R}^{k \times 3}$ and k denotes the number of joints. Accordingly, a commonly used error loss is the joint loss function. The joint loss $L_{j3d}$ measures the deviation between each predicted joint position $\hat{J}_t$ and the labeled joint position $J_t$, taking the mean of the L2 distances over all joints as the error value. Its mathematical form is:

$L_{j3d} = \frac{1}{k} \sum_{i=1}^{k} \left\| \hat{J}_t^{\,i} - J_t^{\,i} \right\|_2$

In addition, the invention proposes a new motion loss function $L_{motion}$, which measures whether the movement direction of each predicted joint $\hat{J}_t$ between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the displacement vectors as the error value. Its mathematical form is:

$L_{motion} = \frac{1}{k} \sum_{i=1}^{k} \left\| (\hat{J}_{t+1}^{\,i} - \hat{J}_t^{\,i}) - (J_{t+1}^{\,i} - J_t^{\,i}) \right\|_2$

where $\|\cdot\|$ denotes the Euclidean distance of the corresponding joints;
The final loss function is therefore: $L_{j3d} + L_{motion}$.
The method of the invention trains the network parameters with the SGD (stochastic gradient descent) optimization algorithm.
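A minimal sketch of the two loss terms, assuming the regressed and labeled joints for one clip are available as (T, k, 3) tensors; tensor and function names are illustrative.

```python
import torch

def mesh2mesh_losses(pred_joints, gt_joints):
    """Joint loss L_j3d and motion loss L_motion for one clip.

    pred_joints, gt_joints: tensors of shape (T, k, 3) holding the 3D
    joints regressed from the smoothed meshes and the labeled joints.
    """
    # L_j3d: mean Euclidean distance between predicted and labeled joints.
    l_j3d = (pred_joints - gt_joints).norm(dim=-1).mean()

    # L_motion: mean Euclidean distance between the per-joint displacement
    # vectors of consecutive frames (predicted vs. labeled).
    pred_motion = pred_joints[1:] - pred_joints[:-1]
    gt_motion = gt_joints[1:] - gt_joints[:-1]
    l_motion = (pred_motion - gt_motion).norm(dim=-1).mean()

    return l_j3d + l_motion
```

The summed loss would then be minimized over the mesh2mesh parameters with torch.optim.SGD, as stated above.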
Step 7, testing. Following steps 1–4, the initial 3D mesh sequence of the test video and the corresponding initial mesh cuboid are obtained and then fed into the mesh2mesh module trained in step 6 to obtain the smoothed 3D mesh sequence $\{\hat{M}_t\}_{t=1}^{T}$.
Step 8, pose transfer. Each source mesh $\hat{M}_t$ with its specific pose is input together with the target mesh $M_{id}$ into the pose transfer network (reference [13]), generating a new mesh $M_{new}$ that simultaneously preserves the identity information of $M_{id}$ and the pose information of $\hat{M}_t$. Since an action is produced by a continuously changing pose, transferring the poses of the whole source mesh sequence onto the target mesh in order makes the target mesh character perform an action consistent with the source video.
Method evaluation. For the mesh2mesh model, since labeled joint information is available for reference, MPJPE (mean per-joint position error) is adopted as the evaluation metric. For the overall action generation task, since no reference information exists, the method is evaluated qualitatively by visualization. The experimental results (Table 1) show that the proposed mesh2mesh smoothing module improves the performance of the initial human body reconstruction model, reduces the error against the labeled joints, and improves the reconstruction quality. Fig. 4 visualizes how mesh2mesh corrects temporally abnormal meshes in the sequence, making the overall action more coherent. The results of the character action generation method are shown in Fig. 5: given a source mesh with a specific action, the method makes the target mesh accurately imitate the action in the source video, can handle target meshes with different initial poses and different identities, and generates mesh sequences with a high degree of detail fidelity.
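For reference, MPJPE can be computed as follows; this is a sketch assuming the predicted and labeled joints are given as arrays of shape (T, k, 3), with illustrative names.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: the Euclidean distance between
    predicted and labeled 3D joints, averaged over all frames and joints."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()
```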
Table 1. Quantitative test results (MPJPE; a smaller value indicates better model performance)

                                               GraphCMR    SMPLify-X
  Initial human body reconstruction model        74.7        136.4
  Initial reconstruction model + mesh2mesh       72.8        128.4
References
[1] A. Macchietto, V. Zordan, and C. R. Shelton. Momentum control for balance. In ACM SIGGRAPH 2009 Papers, pages 1–8, 2009.
[2] I. Mordatch, M. De Lasa, and A. Hertzmann. Robust physics-based locomotion using low-dimensional planning. In ACM SIGGRAPH 2010 Papers, pages 1–8, 2010.
[3] L. Kavan, S. Collins, J. Zara, and C. O'Sullivan. Skinning with dual quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, pages 39–46, 2007.
[4] R. Wareham and J. Lasenby. Bone glow: An improved method for the assignment of weights for mesh deformation. In International Conference on Articulated Motion and Deformable Objects, pages 63–71. Springer, 2008.
[5] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
[6] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[7] K. R. Dixon and P. K. Khosla. Learning by observation with mobile robots: A computational approach. In IEEE International Conference on Robotics and Automation (ICRA 2004), volume 1, pages 102–107. IEEE, 2004.
[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[9] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. TPAMI, 2019.
[10] N. Kolotouros, G. Pavlakos, and K. Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019.
[11] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[12] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. In SIGGRAPH Asia, 2015.
[13] J. Wang, C. Wen, Y. Fu, H. Lin, T. Zou, X. Xue, and Y. Zhang. Neural pose transfer by spatially adaptive instance normalization. In CVPR, 2020.

Claims (6)

1. A 3D character action generation system that imitates the motion of a person in a given video, comprising the following four modules: (1) an initial human body reconstruction module; (2) a mesh cuboid construction module; (3) a mesh2mesh smoothing module; (4) a human body pose transfer module; for a given segment of source video containing human actions, the initial human body reconstruction module first recovers the source mesh sequence of the actor; the mesh cuboid construction module then arranges the initial mesh sequence into a regular data cuboid, the mesh cuboid; to address the temporal incoherence of the initial mesh sequence, the mesh2mesh smoothing module further smooths it with 3D convolutions so that the motion of the mesh sequence becomes more coherent; finally, the human body pose transfer module transfers the pose from the source mesh to the target mesh frame by frame, so that the action sequence contained in the source video is transferred to the target 3D character; the mesh is used as a unified 3D model representation and is defined by a set of vertices and faces: each vertex carries the information (id, x, y, z), where id is the vertex index and x, y, z are the 3D coordinates of the vertex; each face describes the connection relationship between vertices in the form (id1, id2, id3), i.e., the indices of the three vertices that the face connects;
the mesh cuboid ignores the complex human body structure, that is, the connection relationship between vertices is not considered, and the vertex information of the mesh sequence is treated as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single mesh vertex.
2. The 3D character action generation system according to claim 1, wherein the initial human body reconstruction module adopts a mesh-based human 3D reconstruction model whose inputs are the source video frame images and the corresponding OpenPose results, yielding the initial human mesh sequence; the source video frame images are image frames obtained by extracting and down-sampling frames from the input video data; the corresponding OpenPose results are obtained by using the OpenPose model to extract the skeletal joint points of the actor in each frame, thereby determining the region of the human body within the whole frame.
3. The 3D character action generation system according to claim 1, wherein the network of the mesh2mesh smoothing module is formed by stacked 3D convolution layers and smooths the mesh sequence by exploiting the spatio-temporal representation capability of 3D convolution;
in the mesh2mesh smoothing module, the loss function consists of two parts:
(1) the joint loss function $L_{j3d}$ measures the deviation between the joint positions $\hat{J}_t$ regressed from the predicted mesh and the labeled joint positions $J_t$, taking the mean of the L2 distances over all joints as the error value; its mathematical form is:

$L_{j3d} = \frac{1}{k} \sum_{i=1}^{k} \left\| \hat{J}_t^{\,i} - J_t^{\,i} \right\|_2$

where k denotes the total number of human body joints and $\|\cdot\|$ denotes the Euclidean distance;
(2) the motion loss function $L_{motion}$ measures whether the movement direction of each predicted joint $\hat{J}_t$ between consecutive frames stays consistent with that of the labeled joints $J_t$, taking the mean of the L2 distances between the displacement vectors as the error value; its mathematical form is:

$L_{motion} = \frac{1}{k} \sum_{i=1}^{k} \left\| (\hat{J}_{t+1}^{\,i} - \hat{J}_t^{\,i}) - (J_{t+1}^{\,i} - J_t^{\,i}) \right\|_2$

where $\|\cdot\|$ denotes the Euclidean distance;
the final loss function is: $L_{j3d} + L_{motion}$.
4. The 3D character action generation system according to claim 3, wherein in the mesh2mesh smoothing module, the 3D convolution layers use a 5 × 1 × 3 convolution kernel with stride (1, 1, 1) and padding (2, 0, 1) to keep the input and output data dimensions at T × N × 3; the body of mesh2mesh consists of 8 such 3D convolution layers; in addition, an unsqueeze layer, i.e., adding a dimension of size 1, and a squeeze layer, i.e., removing a dimension of size 1, are added at the head and tail of the mesh2mesh smoothing module, respectively, to match data input to the network with a batch size of 1.
5. The 3D character action generation system according to claim 4, wherein the human body pose transfer module adopts a pose transfer network to generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh, i.e., the poses of the whole source mesh sequence are transferred onto the target mesh in order so that the target mesh character performs an action consistent with the source video.
6. A 3D character action generation method based on the system of any one of claims 1 to 5, characterized by comprising the following steps:
(1) first, perform frame extraction and down-sampling on the input video data, down-sampling each source video at a certain frequency to obtain image frames;
(2) extract the skeletal joint points of the actor in each frame with the OpenPose model to determine the region of the human body within the whole frame;
(3) input the source video frames and the corresponding OpenPose results into the initial human body reconstruction module and predict the initial human mesh sequence;
(4) arrange the initial mesh sequence into a regular data cuboid, the mesh cuboid; the vertex information of the mesh sequence is treated as a cuboid of shape T × N × 3, where T denotes the number of meshes in the sequence, N denotes the number of vertices in a single mesh, and 3 denotes the coordinate dimension of a single mesh vertex;
(5) input the constructed mesh cuboid into the mesh2mesh smoothing module to obtain the smoothed 3D mesh sequence;
(5.1) train the mesh2mesh network, optimizing it with the data in the training set; for each mesh in the smoothed 3D mesh sequence, first use the joint regressor provided by the SMPL model to regress the corresponding 3D joint coordinates, then compare them with the ground-truth human joint annotations, compute the joint loss and the motion loss from the difference, and optimize the network parameters accordingly;
(5.2) in the testing stage, input the initial 3D mesh sequence of the test video obtained in step (1) into the mesh2mesh smoothing module trained in step (5.1) to obtain the smoothed 3D mesh sequence;
(6) regard each mesh in the smoothed 3D mesh sequence as a source mesh with a specific pose, input it together with the target mesh into the human body pose transfer module, and generate a new mesh that simultaneously preserves the identity information of the target mesh and the pose information of the source mesh.
CN202011101066.1A 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video Active CN112308952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101066.1A CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Publications (2)

Publication Number Publication Date
CN112308952A CN112308952A (en) 2021-02-02
CN112308952B true CN112308952B (en) 2022-11-18

Family

ID=74327365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101066.1A Active CN112308952B (en) 2020-10-15 2020-10-15 3D character motion generation system and method for imitating human motion in given video

Country Status (1)

Country Link
CN (1) CN112308952B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731330A (en) * 2022-11-16 2023-03-03 北京百度网讯科技有限公司 Target model generation method, animation generation method, device and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0936576A2 (en) * 1998-02-12 1999-08-18 Mitsubishi Denki Kabushiki Kaisha A system for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
JP2006141453A (en) * 2004-11-16 2006-06-08 Bandai Networks Co Ltd Program for video game and video game device
CN108230431A (en) * 2018-01-24 2018-06-29 深圳市云之梦科技有限公司 A kind of the human action animation producing method and system of two-dimensional virtual image
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109147048A (en) * 2018-07-23 2019-01-04 复旦大学 A kind of three-dimensional grid method for reconstructing using individual cromogram
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110188668A (en) * 2019-05-28 2019-08-30 复旦大学 A method of classify towards small sample video actions
WO2020064873A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
CN111161395A (en) * 2019-11-19 2020-05-15 深圳市三维人工智能科技有限公司 Method and device for tracking facial expression and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568207B2 (en) * 2018-09-27 2023-01-31 Deepmind Technologies Limited Learning observation representations by predicting the future in latent space
US10872294B2 (en) * 2018-09-27 2020-12-22 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
SG11202107737WA (en) * 2019-01-15 2021-08-30 Shane Yang Augmented cognition methods and apparatus for contemporaneous feedback in psychomotor learning

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0936576A2 (en) * 1998-02-12 1999-08-18 Mitsubishi Denki Kabushiki Kaisha A system for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
JP2006141453A (en) * 2004-11-16 2006-06-08 Bandai Networks Co Ltd Program for video game and video game device
CN108230431A (en) * 2018-01-24 2018-06-29 深圳市云之梦科技有限公司 A kind of the human action animation producing method and system of two-dimensional virtual image
CN108537136A (en) * 2018-03-19 2018-09-14 复旦大学 The pedestrian's recognition methods again generated based on posture normalized image
CN109147048A (en) * 2018-07-23 2019-01-04 复旦大学 A kind of three-dimensional grid method for reconstructing using individual cromogram
WO2020064873A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method
CN110188668A (en) * 2019-05-28 2019-08-30 复旦大学 A method of classify towards small sample video actions
CN111161395A (en) * 2019-11-19 2020-05-15 深圳市三维人工智能科技有限公司 Method and device for tracking facial expression and electronic equipment
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111553968A (en) * 2020-05-11 2020-08-18 青岛联合创智科技有限公司 Method for reconstructing animation by three-dimensional human body

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020); Samuel Albanie et al.; arXiv:2008.00744v1; 2020-08-03; pages 1–8 *
深度伪造技术的安全挑战与应对 (Security challenges and countermeasures of deepfake technology); 顾钊铨 et al.; 《信息安全》 (Information Security); 2020-09-22; pages 55–57 *

Also Published As

Publication number Publication date
CN112308952A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
Jatavallabhula et al. gradsim: Differentiable simulation for system identification and visuomotor control
Zhu et al. Inferring forces and learning human utilities from videos
Xia et al. A survey on human performance capture and animation
Petit et al. Tracking elastic deformable objects with an RGB-D sensor for a pizza chef robot
CN111028317B (en) Animation generation method, device and equipment for virtual object and storage medium
Perret et al. Interactive assembly simulation with haptic feedback
Purushwalkam et al. Bounce and learn: Modeling scene dynamics with real-world bounces
Murthy et al. gradsim: Differentiable simulation for system identification and visuomotor control
Tan et al. Realtime simulation of thin-shell deformable materials using CNN-based mesh embedding
Romero et al. Modeling and estimation of nonlinear skin mechanics for animated avatars
CN116363308A (en) Human body three-dimensional reconstruction model training method, human body three-dimensional reconstruction method and equipment
Liang et al. Machine learning for digital try-on: Challenges and progress
CN112308952B (en) 3D character motion generation system and method for imitating human motion in given video
Guo et al. Inverse simulation: Reconstructing dynamic geometry of clothed humans via optimal control
Wu et al. An unsupervised real-time framework of human pose tracking from range image sequences
Wu et al. Agentdress: Realtime clothing synthesis for virtual agents using plausible deformations
Mirolo et al. A solid modelling system for robot action planning
Zhao et al. Stability-driven contact reconstruction from monocular color images
Schröder et al. Design and evaluation of reduced marker layouts for hand motion capture
Wu et al. Example-based real-time clothing synthesis for virtual agents
Mousas et al. Efficient hand-over motion reconstruction
Walsman et al. Break and make: Interactive structural understanding using lego bricks
Kaushik et al. Imitating human movement using a measure of verticality to animate low degree-of-freedom non-humanoid virtual characters
Zhi et al. Learning from demonstration via probabilistic diagrammatic teaching
Xu Single-view and multi-view methods in marker-less 3d human motion capture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant