CN111598998B - Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111598998B
CN111598998B (application CN202010400447.3A)
Authority
CN
China
Prior art keywords
dimensional
image
loss function
point cloud
parameters
Prior art date
Legal status
Active
Application number
CN202010400447.3A
Other languages
Chinese (zh)
Other versions
CN111598998A (en)
Inventor
葛志鹏
曹煊
葛彦昊
汪铖杰
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010400447.3A
Publication of CN111598998A
Application granted
Publication of CN111598998B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The application relates to a three-dimensional virtual model reconstruction method, a three-dimensional virtual model reconstruction device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image of a target object, the target object having a movable limb; extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales; generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales; reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image. By adopting the method, the accuracy of reconstructing the three-dimensional virtual model can be improved.

Description

Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for reconstructing a three-dimensional virtual model, a computer device, and a storage medium.
Background
Artificial intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. With the development of computer technology, artificial intelligence is being developed and applied in many fields; for example, it can be used to reconstruct three-dimensional models. Three-dimensional model reconstruction is often applied to virtual reality scenes, human special effects, human detection and the like.
The traditional way to reconstruct a three-dimensional virtual model is to match and align the irregular point cloud of a depth map with a regular three-dimensional human body mesh model. However, the matching result depends heavily on the quality of the depth map; if the resolution of the depth map is low, the reconstructed three-dimensional virtual model is inaccurate.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a three-dimensional virtual model reconstruction method, apparatus, computer device, and storage medium that can improve the accuracy of reconstruction.
A method of three-dimensional virtual model reconstruction, the method comprising:
acquiring an image of a target object, the target object having a movable limb;
extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales;
reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.
A three-dimensional virtual model reconstruction apparatus, the apparatus comprising:
The image acquisition module is used for acquiring an image of a target object, the target object having a movable limb;
the feature extraction module is used for extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
the generation module is used for generating three-dimensional parameters of the target object according to the point cloud coordinates of the different scales;
the reconstruction module is used for reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring an image of a target object, the target object having a movable limb;
extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales;
reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an image of a target object, the target object having a movable limb;
extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales;
reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.
According to the above three-dimensional virtual model reconstruction method, apparatus, computer device and storage medium, an image of a target object having a movable limb is acquired, features of the image are extracted, and graph convolution processing is performed on the extracted features to obtain point cloud coordinates of different scales; three-dimensional parameters of the target object are then generated according to the point cloud coordinates of different scales, so that the three-dimensional parameters of the target object in the image can be accurately generated through graph convolution. A three-dimensional virtual model of the target object is reconstructed based on the three-dimensional parameters of the target object, and the three-dimensional virtual model has a limb morphology that matches the target object in the image, thereby improving the accuracy of three-dimensional virtual model reconstruction.
A training method of reconstructing a network, the method comprising:
acquiring a first training image of a first object, the first object having a movable limb;
extracting features of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales;
constructing a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameters;
training the reconstruction network to be trained based on the target loss function, and obtaining a trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with a movable limb in an image into a three-dimensional virtual model with a limb shape matched with the object.
A training apparatus to reconstruct a network, the apparatus comprising:
The training image acquisition module is used for acquiring a first training image of a first object, the first object having a movable limb;
the input module is used for extracting the characteristics of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted characteristics to obtain point cloud coordinates with different scales;
The prediction module is used for generating predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales;
the construction module is used for constructing a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameters;
the training module is used for training the reconstruction network to be trained based on the target loss function, and obtaining a trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with a movable limb in an image into a three-dimensional virtual model with a limb shape matched with the object.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a first training image of a first object, the first object having a movable limb;
extracting features of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales;
Constructing a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameters;
training the reconstruction network to be trained based on the target loss function, and obtaining a trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with a movable limb in an image into a three-dimensional virtual model with a limb shape matched with the object.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a first training image of a first object, the first object having a movable limb;
extracting features of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales;
constructing a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameters;
training the reconstruction network to be trained based on the target loss function, and obtaining a trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with a movable limb in an image into a three-dimensional virtual model with a limb shape matched with the object.
According to the above training method, apparatus, computer device and storage medium for the reconstruction network, a first training image of a first object having a movable limb is acquired; features are extracted from the first training image through the reconstruction network to be trained, and graph convolution processing is performed on the extracted features to obtain point cloud coordinates of different scales; predicted three-dimensional parameters of the first object are generated based on the point cloud coordinates of different scales; a target loss function is constructed according to the point cloud coordinates of different scales and the predicted three-dimensional parameters; and the reconstruction network to be trained is trained based on the target loss function, a trained reconstruction network being obtained when the training stop condition is met. As a result, the trained reconstruction network predicts the three-dimensional parameters of a target object in a two-dimensional image more accurately, so that the three-dimensional virtual model corresponding to the target object can be accurately reconstructed from those three-dimensional parameters.
Drawings
FIG. 1 is an application environment diagram of a three-dimensional virtual model reconstruction method in one embodiment;
FIG. 2 is a flow chart of a method for reconstructing a three-dimensional virtual model in one embodiment;
FIG. 3 is a flowchart illustrating steps of extracting features from an image and performing a graph convolution process on the extracted features to obtain point cloud coordinates with different scales in one embodiment;
FIG. 4 is a schematic diagram of the convolution of the diagrams in one embodiment;
FIG. 5 is a flowchart illustrating steps for acquiring an image of a target object in one embodiment;
FIG. 6 is a flow chart of a three-dimensional virtual model reconstruction method in one embodiment;
FIG. 7 (a) is a frame diagram of a reconstructed three-dimensional virtual model in one embodiment;
FIG. 7 (b) is a flowchart of reconstructing a three-dimensional virtual model corresponding to a target human body in a video in real time according to one embodiment;
FIG. 8 is a flow diagram of a training method for reconstructing a network in one embodiment;
FIG. 9 is a flowchart showing steps for constructing an objective loss function based on point cloud coordinates of different scales and predicted three-dimensional parameters, in one embodiment;
FIG. 10 is a frame diagram of a three-dimensional virtual model reconstruction of a human body in a two-dimensional image in one embodiment;
FIG. 11 is a block diagram of a three-dimensional virtual model reconstruction apparatus in one embodiment;
FIG. 12 is a block diagram of a training apparatus for reconstructing a network in one embodiment;
FIG. 13 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The three-dimensional virtual model reconstruction method provided by the application can be applied to an application environment shown in figure 1. The terminal 102 acquires an image of a target object having an active limb. The terminal sends the image to the server 104, the server 104 extracts the characteristics of the image through a reconstruction network, and performs graph convolution processing on the extracted characteristics to obtain point cloud coordinates with different scales. The server 104 generates three-dimensional parameters of the target object according to the point cloud coordinates of different scales through the reconstruction network. The server 104 then returns the three-dimensional parameters of the target object to the terminal 102. The terminal 102 rebuilds a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches a target object in the image.
The server 104 may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 102 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
In one embodiment, the three-dimensional virtual model reconstruction method is applicable to human three-dimensional virtual model reconstruction, and comprises the following steps:
the method comprises the steps that a terminal obtains a human body training image of a first object, and point cloud coordinates of different scales in the human body training image are obtained through feature extraction and picture rolling lamination processing of the human body training image. The terminal can generate three-dimensional parameters and camera parameters corresponding to the first object in the human training image based on the point cloud coordinates of the different scales. The terminal takes the point cloud coordinates with different scales as point cloud labels with different scales, and takes the three-dimensional parameters as a first three-dimensional human body label corresponding to the first object in the human body training image. The three-dimensional parameters comprise three-dimensional human body posture parameters and three-dimensional human body shape parameters, and the terminal takes the three-dimensional human body posture parameters as three-dimensional human body posture labels corresponding to the human body training images.
The terminal converts the three-dimensional human body posture parameter corresponding to the first object in the human body training image into a two-dimensional human body posture parameter through the camera parameter, and takes the two-dimensional human body posture parameter as a two-dimensional human body posture label corresponding to the human body training image.
The terminal may acquire three-dimensional parameters of the first object through the motion capture device, the three-dimensional parameters also including three-dimensional body posture parameters and three-dimensional body shape parameters. And the terminal takes the three-dimensional parameter as a second three-dimensional human body label corresponding to the first object.
The method comprises the steps that a terminal inputs a human body training image into a reconstruction network to be trained, and the training steps of the reconstruction network comprise:
the terminal performs feature extraction on the human training image through a feature extraction layer of the reconstruction network to be trained to obtain a corresponding feature map.
And the terminal carries out graph convolution processing on the feature map through a graph convolution layer of the reconstruction network to obtain point cloud features of different scales.
And the terminal carries out regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.
The terminal generates predicted three-dimensional parameters of a first object in the human body training image and predicted camera parameters corresponding to the human body training image based on point cloud coordinates of different scales through the graph convolution layer. The predicted three-dimensional parameters include predicted three-dimensional human body posture parameters and predicted three-dimensional human body shape parameters.
And the terminal constructs a first loss function according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales.
And the terminal constructs a second loss function according to the predicted three-dimensional parameter and the first three-dimensional human body label.
And the terminal constructs a third loss function according to the predicted three-dimensional parameter and the second three-dimensional human body label.
Then, the terminal converts the predicted three-dimensional human body posture parameters into predicted two-dimensional human body posture parameters through the predicted camera parameters.
And the terminal constructs a fourth loss function according to the predicted two-dimensional human body posture parameters and the corresponding two-dimensional posture labels.
Then, the terminal acquires a human body training image of the second object, and acquires three-dimensional human body posture parameters corresponding to the second object in the human body training image. And taking the three-dimensional human body posture parameter corresponding to the second object in the human body training image as a three-dimensional human body posture label.
The terminal inputs the human body training image of the second object into the reconstruction network to be trained to obtain the predicted three-dimensional human body posture parameters of the second object in the human body training image. The human body training image of the first object is an image acquired outdoors, while the human body training image of the second object is a human body image acquired indoors; the second object and the first object may be the same object or different objects.
The terminal constructs a fifth loss function according to the predicted three-dimensional human body posture parameters and the three-dimensional human body posture label corresponding to the second object in the human body training image.
The terminal constructs a target loss function according to the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.
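For illustration, the target loss function may be assembled as a weighted sum of the five losses, as in the minimal sketch below; the equal default weights are an assumption made for the sketch, and the disclosure does not prescribe particular weight values.

```python
def target_loss(point_cloud_loss, first_3d_label_loss, second_3d_label_loss,
                pose_2d_loss, indoor_pose_3d_loss,
                weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Combine the five component losses into a single target loss value.

    Works on Python floats or on framework tensors (e.g. torch.Tensor).
    The equal default weights are illustrative assumptions only.
    """
    w1, w2, w3, w4, w5 = weights
    return (w1 * point_cloud_loss        # first loss: multi-scale point cloud supervision
            + w2 * first_3d_label_loss   # second loss: predicted params vs. first 3D human labels
            + w3 * second_3d_label_loss  # third loss: predicted params vs. motion-capture labels
            + w4 * pose_2d_loss          # fourth loss: projected 2D pose vs. 2D pose labels
            + w5 * indoor_pose_3d_loss)  # fifth loss: indoor 3D pose supervision
```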
The terminal trains the reconstruction network to be trained based on the target loss function, and the trained reconstruction network is obtained when the training stopping condition is met.
Then, the terminal uses the trained reconstruction network to reconstruct a three-dimensional virtual model corresponding to the human body in the image, comprising:
the terminal acquires an image containing a target human body, and cuts the image containing the target human body by taking the target human body as a center to obtain a human body image containing the target human body with a preset size.
The terminal inputs the human body image into a trained reconstruction network, and the image is subjected to feature extraction through a feature extraction layer of the reconstruction network to obtain a corresponding feature map.
And the terminal carries out graph convolution processing on the feature map through a graph convolution layer of the reconstruction network to obtain point cloud features of different scales.
And the terminal carries out regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.
And the terminal generates three-dimensional parameters of the target human body according to the point cloud coordinates of different scales.
The terminal rebuilds a three-dimensional virtual model of the target human body based on the three-dimensional parameters of the target human body to obtain a three-dimensional virtual model of the human body; the three-dimensional virtual model of the human body has a limb morphology that matches a target human body in the human body image.
Then, the terminal projects the three-dimensional virtual model into the virtual reality game, and displays the three-dimensional virtual model in the virtual reality game. The user performs various operations of the virtual reality game by controlling a three-dimensional virtual model in the virtual reality game.
By carrying out graph convolution processing on the human body image through the trained reconstruction network, the three-dimensional human body parameters corresponding to the target human body in the human body image are obtained quickly and accurately, the three-dimensional virtual model is accurately reconstructed from these three-dimensional human body parameters, and both the accuracy and the efficiency of reconstructing the three-dimensional virtual model of the human body are improved.
In one embodiment, as shown in fig. 2, a three-dimensional virtual model reconstruction method is provided, and the method is applied to the terminal in fig. 1 for illustration, and includes the following steps:
step 202, an image of a target object having an active extremity is acquired.
The target object is a human body or an animal for which a three-dimensional virtual model needs to be reconstructed.
Specifically, the terminal may acquire an original image of the human body or animal for which a three-dimensional virtual model needs to be reconstructed, and preprocess the original image to obtain the image of the target object. The preprocessing may include operations such as cropping, resolution adjustment, image size scaling, brightness adjustment and/or contrast adjustment. Both the original image and the image of the target object are two-dimensional images.
In this embodiment, the image may be a color image. Compared with a depth map, a color image has higher resolution and richer details, so the three-dimensional virtual model of the human body can be reconstructed more finely.
In this embodiment, the terminal may obtain the corresponding image by directly photographing the target object, or may obtain the image of the target object locally, from a network, or from a third-party device. The acquired image contains the target object.
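For illustration, the following sketch shows one way the preprocessing could be done with OpenCV: the original image is cropped around the detected target and scaled to a preset input size. The detection-box format, the 224-pixel output size and the use of OpenCV are assumptions made for the sketch.

```python
import cv2
import numpy as np

def preprocess_image(original_bgr: np.ndarray, box_xyxy, out_size: int = 224):
    """Crop the original image around the target object and scale it to a preset size.

    box_xyxy: (x1, y1, x2, y2) detection box of the target object (assumed format).
    Returns the cropped and scaled image together with the crop offset and the
    per-axis scale factors needed to map results back onto the original image.
    """
    x1, y1, x2, y2 = box_xyxy
    # Use a square region centred on the target so the aspect ratio is preserved.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = max(x2 - x1, y2 - y1) / 2.0
    h, w = original_bgr.shape[:2]
    left, top = int(max(cx - half, 0)), int(max(cy - half, 0))
    right, bottom = int(min(cx + half, w)), int(min(cy + half, h))
    crop = original_bgr[top:bottom, left:right]
    crop_h, crop_w = crop.shape[:2]
    resized = cv2.resize(crop, (out_size, out_size))
    scale_xy = (out_size / crop_w, out_size / crop_h)
    return resized, (left, top), scale_xy
```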
Step 204, extracting features of the image, and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales.
The graph convolution processing refers to convolution performed on a graph, and can be implemented through a graph convolution network (GCN). A GCN is a neural network that operates on graphs. A point cloud is a massive set of points that expresses the spatial distribution of a target and the characteristics of its surface under the same spatial reference system; after the spatial coordinates of each sampling point on the object surface are obtained, the resulting set of points is called a point cloud. In this embodiment, the point cloud refers to the grid points of the target object's surface.
Specifically, the terminal may perform feature extraction on the image of the target object to obtain the features corresponding to the image, that is, a feature map. The terminal then performs graph convolution processing on the feature map to obtain point cloud features of different scales, and performs convolution processing with 3 channels on the point cloud features of different scales to obtain point cloud coordinates of different scales. The point cloud coordinates are three-dimensional coordinates.
Step 206, generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales.
The three-dimensional parameters are Skinned Multi-Person Linear model (SMPL) parameters. The SMPL model contains 6890 human surface points as well as 24 joint points.
Specifically, the terminal can perform downsampling and full connection processing on point cloud coordinates with different scales to obtain three-dimensional parameters of the target object.
In this embodiment, when the target object is a human body, the three-dimensional model of the whole human body can be reconstructed from the Skinned Multi-Person Linear (SMPL) parameters.
Step 208, reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.
Specifically, the three-dimensional parameters include three-dimensional posture parameters and three-dimensional body shape parameters. The three-dimensional posture parameters are the joint point coordinates of the target object, and the three-dimensional body shape parameters are the coordinates of feature points on the surface of the target object. After the terminal obtains the three-dimensional posture parameters and three-dimensional body shape parameters corresponding to the target object in the image, a model is built in three-dimensional space according to the three-dimensional coordinates corresponding to the three-dimensional posture parameters and the three-dimensional coordinates corresponding to the three-dimensional body shape parameters, thereby obtaining the three-dimensional virtual model.
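For illustration, when the three-dimensional parameters are SMPL pose and shape parameters, the mesh of the virtual model can be generated with an SMPL implementation such as the open-source smplx package, roughly as sketched below. The package, the argument names and the 72/10 split of pose and shape dimensions follow common SMPL usage and are assumptions made for the sketch.

```python
import torch
import smplx  # open-source SMPL implementation; its use here is an assumption

def rebuild_virtual_model(pose_72: torch.Tensor, shape_10: torch.Tensor,
                          model_path: str = "models/smpl"):
    """Rebuild the 3D virtual model mesh from SMPL-style pose and shape parameters.

    pose_72:  (1, 72) pose vector, 24 joints x 3 axis-angle values (assumed layout).
    shape_10: (1, 10) body shape vector (assumed dimensionality).
    model_path is a placeholder for the location of the SMPL model files.
    """
    body_model = smplx.create(model_path, model_type="smpl")
    output = body_model(
        global_orient=pose_72[:, :3],   # whole-body orientation (root joint)
        body_pose=pose_72[:, 3:],       # remaining 23 joints
        betas=shape_10,                 # body shape coefficients
    )
    vertices = output.vertices          # (1, 6890, 3) surface points of the mesh
    joints = output.joints              # 3D joint positions
    return vertices, joints
```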
In this embodiment, the three-dimensional virtual model can be applied to the production of virtual reality games, virtual fitting, virtual trial and video special effects, but is not limited thereto.
According to the three-dimensional virtual model reconstruction method, the image of the target object is obtained, the target object has movable limbs, the image is subjected to feature extraction, the extracted features are subjected to graph convolution processing, point cloud coordinates of different scales are obtained, three-dimensional parameters of the target object are generated according to the point cloud coordinates of different scales, and the three-dimensional parameters of the target object in the image can be accurately generated through graph convolution. Reconstructing a three-dimensional virtual model of the target object based on three-dimensional parameters of the target object, the three-dimensional virtual model having a limb morphology that matches the target object in the image, thereby improving accuracy of the three-dimensional virtual model reconstruction.
In one embodiment, as shown in fig. 3, extracting features of the image and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales includes:
and step 302, carrying out feature extraction on the image through a feature extraction layer of the reconstructed network to obtain a corresponding feature map.
Specifically, the trained reconstruction network comprises a feature extraction layer and a graph convolution layer. The terminal inputs the image containing the target object into the feature extraction layer of the trained reconstruction network, and performs feature extraction on the image through the feature extraction layer to obtain a feature map corresponding to the image.
Step 304, graph convolution processing is performed on the feature map through the graph convolution layer of the reconstruction network to obtain point cloud features of different scales.
Specifically, the terminal outputs the feature map corresponding to the image to the graph convolution layer of the reconstruction network, and the feature expression output by each layer is obtained through layer-by-layer processing in the graph convolution layer. The feature expressions output by the layers are the point cloud features of different scales.
In this embodiment, the feature extraction layer may be a ResNet50 network. The graph convolution layer may be a GCN network.
The terminal acquires the set of feature points and edges in the feature map through the graph convolution layer, and constructs an undirected graph according to the set of adjacent points of each feature point and the set of connecting edges between each feature point and its adjacent points. The undirected graph is composed of nodes and edges, and the nodes may be feature points in the feature map. Then, the terminal acquires the characterization data of each node in the undirected graph through the graph convolution layer, calculates the distance between any two nodes, and takes the distance between the two nodes as the weight of the edge between them. An adjacency matrix is generated from these weights. The graph convolution layer performs the graph convolution operation on the undirected graph through an activation function: the terminal calculates the feature expressions of different scales according to the activation function, the characterization data of the nodes and the weights. The feature expressions of different scales are the point cloud features of different scales.
For example, the terminal may calculate the point cloud features of different scales corresponding to the image of the target object according to the following formula (1):

$$h_i^{(l+1)} = \sigma\Bigl(\sum_{j \in N_i} \frac{1}{c_{ij}}\, w_j\, h_j^{(l)}\Bigr) \qquad (1)$$

where $i$ and $j$ represent nodes in the undirected graph, $h_i^{(l+1)}$ is the feature expression of node $i$ at the current layer, $h_j^{(l)}$ is the feature expression of node $j$ at the $l$-th layer, $c_{ij}$ is a normalization factor, $N_i$ is the set of neighboring points of node $i$, and $w_j$ represents the weight of node $j$. $\sigma$ represents an activation function, which can be sigmoid or tanh.
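For illustration, a minimal PyTorch sketch of such a graph convolution step is given below; it builds edge weights from pairwise node distances as described above and aggregates neighbour features with a normalization factor, a learnable weight matrix and an activation. The use of PyTorch, the row normalization and the layer sizes are assumptions made for the sketch rather than details given in this disclosure.

```python
import torch
import torch.nn as nn

def adjacency_from_nodes(node_feats: torch.Tensor) -> torch.Tensor:
    """Edge weights taken as the distance between any two nodes, as described above."""
    return torch.cdist(node_feats, node_feats)          # (N, N) pairwise distances

class GraphConvLayer(nn.Module):
    """One graph convolution step over the undirected graph: h' = act(A_norm @ h @ W)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))
        nn.init.xavier_uniform_(self.weight)
        self.act = nn.Tanh()                             # sigmoid or tanh, per the description

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Row normalization plays the role of the 1/c_ij factor in formula (1).
        adj_norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return self.act(adj_norm @ node_feats @ self.weight)

# A final layer with 3 output channels (x, y, z) regresses the point cloud coordinates:
# coords = GraphConvLayer(in_features=feature_dim, out_features=3)(node_feats, adj)
```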
Step 306, regression processing is performed on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales.
Specifically, after the point cloud features of different scales are calculated, the graph convolution layer applies a convolution whose number of feature channels is 3 to obtain three-dimensional point cloud coordinates at different scales. The three feature channels are x, y and z.
For example, the terminal may calculate the point cloud coordinates of one scale by the following formula:

$$p = A \hat{H} W$$

where $p$ denotes the three-dimensional point cloud coordinates of one scale, $A$ is the adjacency matrix of the nodes ($A$ is an $N \times N$ real symmetric matrix), $W$ is a weight matrix, and $\hat{H}$ is the point cloud feature of one scale.
In this embodiment, after the point cloud coordinates of different scales are obtained by calculation, the terminal may perform downsampling processing and full-connection processing on the point cloud coordinates of different scales, and output three-dimensional parameters of the target object through the full-connection layer.
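For illustration, the downsampling and fully connected processing could take a form like the PyTorch sketch below, which pools the finest-scale point cloud coordinates to a fixed number of points and regresses the pose, shape and camera parameters; the 72-dimensional pose, 10-dimensional shape and 3-dimensional weak-perspective camera are common conventions assumed for the sketch.

```python
import torch
import torch.nn as nn
from typing import Dict, List

class ParameterHead(nn.Module):
    """Pool multi-scale point cloud coordinates and regress SMPL and camera parameters."""

    def __init__(self, num_points: int = 1024, pose_dim: int = 72,
                 shape_dim: int = 10, cam_dim: int = 3):
        super().__init__()
        self.pose_dim, self.shape_dim = pose_dim, shape_dim
        # Downsampling: pool the finest-scale point cloud to a fixed number of points.
        self.pool = nn.AdaptiveAvgPool1d(num_points)
        # Fully connected layers that output pose + shape + camera in one vector.
        self.fc = nn.Sequential(
            nn.Linear(num_points * 3, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, pose_dim + shape_dim + cam_dim),
        )

    def forward(self, point_clouds: List[torch.Tensor]) -> Dict[str, torch.Tensor]:
        coords = point_clouds[-1].transpose(1, 2)      # (B, 3, N), finest scale
        pooled = self.pool(coords).flatten(1)          # (B, 3 * num_points)
        out = self.fc(pooled)
        pose = out[:, :self.pose_dim]                                  # 24 joints x 3 axis-angle
        shape = out[:, self.pose_dim:self.pose_dim + self.shape_dim]   # body shape coefficients
        camera = out[:, self.pose_dim + self.shape_dim:]               # scale + 2D translation
        return {"pose": pose, "shape": shape, "camera": camera}
```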
It will be appreciated that the reconstruction network may be deployed on a terminal or on a server. The server may be a cloud server. When the reconstruction network is deployed on the cloud server, the terminal sends the image of the target object to the cloud server, and the cloud server outputs the three-dimensional parameters of the target object through the reconstruction network and returns them to the terminal. Deploying the reconstruction network on the cloud server saves storage space on the terminal.
FIG. 4 is a schematic diagram of graph convolution in one embodiment, showing an undirected graph constructed from the feature points and characterization data in a feature map. The graph convolution layer of the reconstruction network obtains point cloud features of different scales based on each node in the undirected graph and the characterization data of the nodes.
In this embodiment, the features of the image are extracted through the trained reconstruction network to obtain the key information of the image. The extracted key information is converted into point cloud features of different scales through graph convolution processing in the reconstruction network, and regression processing is performed on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales, so that the coordinates of the key feature information at each scale are output accurately. Extracting features through the reconstruction network and outputting multi-scale point cloud coordinates is both efficient and accurate.
In one embodiment, the method further comprises: determining camera parameters corresponding to the image based on the point cloud coordinates of different scales; and projecting the three-dimensional virtual model into a two-dimensional image according to camera parameters corresponding to the image.
The camera parameters are the parameters used to establish the geometric model of camera imaging. Camera parameters can generally be divided into external parameters (the camera extrinsic matrix) and internal parameters (the camera intrinsic matrix). The external parameters determine the position and orientation of the camera in three-dimensional space; from them it can be determined how a real-world point (i.e. world coordinates) is rotated and translated onto another point (i.e. camera coordinates). The internal parameters are parameters inside the camera; from them it can be determined how, after the external parameters have been applied, a real-world point becomes a pixel through the camera lens, pinhole imaging and electronic conversion. For example, taking a human body as an example, the camera parameters may include a rotation matrix R corresponding to the orientation of the human body, a translation matrix t that maps the human body to the two-dimensional image coordinates, and a scale factor. The scale factor is an internal parameter, while the rotation matrix R and the translation matrix t are external parameters.
Specifically, the terminal performs graph convolution processing on the feature map through the graph convolution layer of the reconstruction network to obtain point cloud features of different scales, and performs regression processing on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales. The terminal generates the three-dimensional parameters of the target object and the camera parameters corresponding to the image based on the point cloud coordinates of different scales through the graph convolution layer. Then, the terminal projects the three-dimensional virtual model from three-dimensional space to two-dimensional space according to the camera parameters to obtain a two-dimensional image.
In this embodiment, the terminal renders the three-dimensional virtual model into a two-dimensional image through the camera parameters. The terminal then inversely transforms the two-dimensional image according to the cropping and scaling information of the image of the target object to obtain a rendered two-dimensional image of the same size as the original image. The original image refers to the image before the image of the target object was cropped out.
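For illustration, assuming a weak-perspective camera with scale factor s, rotation matrix R and two-dimensional translation t, the projection of the model into the image and the inverse of the crop and scale transform could be sketched as follows; the weak-perspective form and the way the crop information is undone are assumptions made for the sketch.

```python
import numpy as np

def project_to_image(vertices: np.ndarray, rotation: np.ndarray, translation: np.ndarray,
                     scale: float, crop_offset=(0.0, 0.0), crop_scale=(1.0, 1.0)) -> np.ndarray:
    """Project 3D model vertices into 2D image coordinates.

    vertices:    (N, 3) 3D points of the virtual model.
    rotation:    (3, 3) rotation matrix R (external parameter).
    translation: (2,) translation t in the cropped image plane (external parameter).
    scale:       weak-perspective scale factor (internal parameter).
    crop_offset, crop_scale: the crop position and per-axis scale used during
    preprocessing, applied in reverse so the projection aligns with the original image.
    """
    rotated = vertices @ rotation.T                              # apply the object orientation
    uv = scale * rotated[:, :2] + np.asarray(translation)        # drop depth, scale and translate
    uv = uv / np.asarray(crop_scale) + np.asarray(crop_offset)   # undo the crop and resize
    return uv
```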
In this embodiment, camera parameters corresponding to the image are determined based on point cloud coordinates of different scales, and the three-dimensional virtual model is projected into a two-dimensional image according to the camera parameters corresponding to the image, so that the display mode is more attractive and visual. And the superposition degree between the two-dimensional image projected by the three-dimensional virtual model and the image of the target object can be intuitively displayed, so that the visualization of the three-dimensional virtual model is realized.
In one embodiment, acquiring an image of a target object includes: acquiring an image containing a target object in a video of the target object;
the method further comprises the steps of: projecting a three-dimensional virtual model into the image to replace the target object in the image; and generating the target video based on the image after each frame in the video replaces the target object.
Specifically, the terminal acquires a video corresponding to a target object of which the three-dimensional virtual model needs to be reconstructed, and acquires an image of the target object contained in the video. The terminal inputs the image containing the target object into a reconstruction network, and outputs the three-dimensional parameters corresponding to the target object in the image through the reconstruction network. The terminal reconstructs a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object in the image. Then, the terminal projects the three-dimensional virtual model of the target object into the image to replace the target object in the image, and an image after replacing the target object is obtained.
And carrying out the same processing on each frame of image containing the target object in the video to obtain an image of each frame of image after replacing the target object. And the terminal generates a target video according to the image of each frame in the video after the target object is replaced.
Further, the terminal acquires a video corresponding to a target object of which the three-dimensional virtual model needs to be reconstructed, and acquires each frame of image containing the target object in the video. And the terminal inputs each frame of image containing the target object into a reconstruction network, and outputs three-dimensional parameters corresponding to the target object in each frame of image through the reconstruction network. Then, the terminal reconstructs a three-dimensional virtual model based on the three-dimensional parameters of the target object in each frame of image to obtain a three-dimensional virtual model corresponding to the target object in each frame of image. The terminal projects the three-dimensional virtual model into the corresponding image, and the three-dimensional virtual model is utilized to replace a target object in the corresponding image, so that an image containing the three-dimensional virtual model in each frame is obtained. Then, the terminal can replace the image of the corresponding target object in the video with the image of each frame containing the three-dimensional virtual model to obtain the target video.
In this embodiment, after obtaining a three-dimensional virtual model corresponding to a target object in each frame of image, the terminal projects each three-dimensional virtual model into a corresponding two-dimensional image through camera parameters. And then, the terminal replaces corresponding frames of images in the video by utilizing the two-dimensional images to obtain the target video.
In this embodiment, by acquiring an image in which each frame in the video of the target object includes the target object, three-dimensional parameters corresponding to the target object in each frame image are output in real time through the reconstruction network. The three-dimensional virtual model corresponding to the target object in each frame of image is obtained based on three-dimensional parameter reconstruction, the three-dimensional virtual model corresponding to each image is projected into the corresponding image to replace the target object in the corresponding image, and a target video is obtained, so that the three-dimensional virtual model can be projected into the application of a human-computer interaction somatosensory game or a short video, and the reality of a three-dimensional virtual reality special effect in the human-computer interaction in the somatosensory game or the short video is enhanced.
In one embodiment, as shown in fig. 5, acquiring an image of a target object includes:
step 502, each frame of image of the target object is acquired in the video of the target object.
Reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches a target object in the image, comprising:
step 504, generating a three-dimensional parameter sequence based on the three-dimensional parameters of the target object in each frame of image.
Specifically, the terminal acquires a video corresponding to a target object of which the three-dimensional virtual model needs to be reconstructed, and acquires each frame of image containing the target object in the video to obtain an image sequence. The terminal inputs the image sequence into a reconstruction network, and the reconstruction network performs feature extraction on the images for each frame of image in the image sequence. And performing graph convolution processing on the extracted features through a reconstruction network to obtain point cloud coordinates of different scales corresponding to each frame of image. And generating three-dimensional parameters corresponding to the target objects of each frame of image in the image sequence based on the point cloud coordinates of different scales corresponding to each frame of image respectively, and obtaining a three-dimensional parameter sequence corresponding to the target objects.
Step 506, generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence; the three-dimensional virtual model in the three-dimensional virtual model sequence has a limb morphology that matches the target object in the corresponding image.
Specifically, the terminal reconstructs each three-dimensional virtual model based on each three-dimensional parameter in the three-dimensional parameter sequence, and generates a three-dimensional virtual model sequence according to the ordering order of each three-dimensional parameter in the three-dimensional parameter sequence. The three-dimensional virtual model in the sequence of three-dimensional virtual models has a limb morphology that matches a target object in a corresponding image in the video.
Further, for each three-dimensional parameter in the three-dimensional parameter sequence, reconstructing each three-dimensional virtual model of the target object according to each three-dimensional parameter, thereby obtaining a three-dimensional virtual model sequence. Each three-dimensional virtual model in the sequence of three-dimensional virtual models has a limb morphology that matches a corresponding target object in the sequence of images.
In this embodiment, according to each three-dimensional parameter in the three-dimensional parameter sequence, each three-dimensional virtual model is sequentially generated, so as to obtain a three-dimensional virtual model sequence corresponding to the three-dimensional parameter sequence.
In this embodiment, each frame of image including the target object in the video of the target object is obtained, and three-dimensional parameters corresponding to the target object in each frame of image are output in real time through the reconstruction network, so that the three-dimensional virtual model of the target object in each frame of image is reconstructed in real time, a three-dimensional virtual model sequence is obtained, and the efficiency of reconstructing the three-dimensional virtual model is improved.
In one embodiment, as shown in fig. 6, after generating the three-dimensional parameter sequence corresponding to the target object in each frame of image, the method further includes:
step 602, obtaining corresponding time of each frame image in the video, and obtaining a time sequence.
Specifically, the reconstruction network further includes a filter layer. The terminal acquires, through the filter layer in the reconstruction network, the moment corresponding to each frame of image containing the target object in the video, and sorts these moments in chronological order to obtain a time sequence.
Step 604, filtering processing is performed on the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence.
Specifically, the terminal performs filtering processing on the three-dimensional parameter sequence based on the time sequence through the filtering layer, so that smoothing of inter-frame transition is realized, and the three-dimensional parameter sequence after filtering is obtained.
Further, the filter layer takes each three-dimensional parameter in the three-dimensional parameter sequence in turn as the current three-dimensional parameter, acquires the previous three-dimensional parameter, and acquires the time corresponding to the current three-dimensional parameter and the time corresponding to the previous three-dimensional parameter. The current three-dimensional parameter is then filtered based on the previous three-dimensional parameter and the two times, so that the filtered current and previous three-dimensional parameters transition smoothly in time. Processing the whole sequence in the same way yields the filtered three-dimensional parameter sequence. The filtering may be, but is not limited to, bilateral filtering, Gaussian filtering, conditional filtering, pass-through filtering, or random sample consensus filtering.
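For illustration, one simple realization of such time-based filtering is an exponential smoothing whose strength depends on the time gap between consecutive frames, as sketched below; the smoothing time constant is an assumption made for the sketch, and any of the filtering methods listed above could be used instead.

```python
import numpy as np

def smooth_parameter_sequence(params: np.ndarray, timestamps: np.ndarray,
                              time_constant: float = 0.1) -> np.ndarray:
    """Temporally smooth a sequence of per-frame 3D parameters.

    params:     (T, D) array, one D-dimensional parameter vector per frame.
    timestamps: (T,) frame times in seconds, in increasing order.
    Each frame is blended with the previous smoothed frame; a larger time gap
    between frames reduces the influence of the previous frame.
    """
    smoothed = params.astype(np.float64).copy()
    for i in range(1, len(params)):
        dt = timestamps[i] - timestamps[i - 1]
        alpha = 1.0 - np.exp(-dt / time_constant)   # weight given to the current frame
        smoothed[i] = alpha * params[i] + (1.0 - alpha) * smoothed[i - 1]
    return smoothed
```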
In this embodiment, the filtering layer in the reconstruction network may be a noise reduction self-encoder, so as to implement a process of downsampling and upsampling. The network structure of the filtering layer is the RNN structure of seq2seq, the input of the training process of the filtering layer is the three-dimensional parameter time sequence added with noise, and the output is the three-dimensional parameter time sequence after denoising. And the trained filtering layer receives the three-dimensional parameter time sequence corresponding to the multi-frame discontinuous image, and predicts and outputs the continuous three-dimensional parameter time sequence.
The generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence comprises the following steps:
and step 606, generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence.
Specifically, the terminal reconstructs each three-dimensional virtual model based on the filtered three-dimensional parameter sequence, and generates a three-dimensional virtual model sequence according to the ordering order of each three-dimensional parameter in the three-dimensional parameter sequence. Further, for each three-dimensional parameter in the filtered three-dimensional parameter sequence, reconstructing according to each three-dimensional parameter to obtain a corresponding three-dimensional virtual model, thereby obtaining a three-dimensional virtual model sequence.
In this embodiment, the corresponding time of each frame image in the video is obtained to obtain a time sequence, and the three-dimensional parameter sequence is filtered according to the time sequence to obtain a filtered three-dimensional parameter sequence, so that inter-frame smoothing of the three-dimensional parameters can be realized according to time association. Based on the three-dimensional parameter sequence after filtering, reconstructing a three-dimensional virtual model of a target object in each frame of image, so that a reconstruction network can output a three-dimensional virtual model corresponding to a continuous target object, smooth transition between every two three-dimensional virtual models is realized, and the accuracy and continuity of three-dimensional virtual model reconstruction are improved.
In this embodiment, when the target object in the color image is not facing the camera, for example when the color image is a side view or a back view of the target object, the inter-frame temporal smoothing of the three-dimensional parameters can be used to smooth the parameters at these unstable angles. The output of the current frame then tends to be consistent with the outputs of the preceding and following frames, which improves the stability of human body reconstruction.
As shown in fig. 7 (a), which is a framework diagram of reconstructing a three-dimensional virtual model in one embodiment, the terminal acquires the original image of each frame in the video that contains the target human body to obtain an original image sequence. The terminal inputs the image sequence into the reconstruction network; the reconstruction network performs human body detection on the image sequence and marks the target human body in each image with a detection box. Human body detection can be achieved through the lightweight network ResNet-18. The reconstruction network then preprocesses each original image in the marked image sequence, that is, it crops an image of a preset size centered on the target human body from each original image. Then, for each cropped frame, the feature extraction layer of the reconstruction network performs feature extraction on the cropped image and outputs the extracted features to the graph convolution layer. The graph convolution layer performs graph convolution processing on the feature map to obtain point cloud coordinates of different scales. By downsampling the point cloud coordinates of different scales and applying fully connected processing, the reconstruction network outputs, through the fully connected layer, the three-dimensional parameter sequence and camera parameters corresponding to the target human body in the image sequence. Next, the three-dimensional parameter sequence is input into the filter layer. The reconstruction network obtains the time of each frame image in the video to obtain a time sequence, and the filter layer of the reconstruction network filters the three-dimensional parameter sequence based on the time sequence to obtain a filtered three-dimensional parameter sequence.
The terminal then reconstructs each three-dimensional virtual model corresponding to the target human body from the filtered three-dimensional parameter sequence to obtain a three-dimensional virtual model sequence. The terminal can then render the three-dimensional virtual model sequence into a two-dimensional human body sequence according to the camera parameters, and use the two-dimensional human body sequence to replace the target human body in the corresponding cropped images. The terminal can then apply the inverse transformation to the replaced images to generate a target image sequence with the same size as the original image sequence, and replace the original image sequence in the video with the target image sequence, i.e. each original image of the original image sequence in the video is replaced with the corresponding target image of the target image sequence.
As shown in fig. 7 (b), a flowchart of reconstructing, in real time, the three-dimensional virtual model corresponding to a target human body in a video in one embodiment. The terminal acquires the original images of each frame in the video, each frame containing the target human body, and inputs the original images into the reconstruction network in sequence. The reconstruction network reconstructs the three-dimensional virtual model of the target human body in each original image in turn. Specifically, the reconstruction network takes each input original image in turn as the current frame image, performs human body detection on the current frame original image, and marks the target human body in the image with a detection frame. A cropped and scaled image centered on the target human body is then cut out from the marked current frame original image. The reconstruction network performs feature extraction and graph convolution processing on the cropped and scaled image and outputs the SMPL parameters and camera parameters. The reconstruction network then performs temporal smoothing (i.e. filtering) on the SMPL parameters corresponding to the current frame original image using the time of the current frame original image in the video and the time of the previous frame original image in the video, obtaining the filtered SMPL parameters. The terminal reconstructs the three-dimensional virtual model based on the filtered SMPL parameters to obtain the three-dimensional virtual model corresponding to the target human body in the current frame image. The terminal then renders the three-dimensional virtual model into a two-dimensional human body based on the camera parameters and replaces the target human body in the cropped and scaled image. Next, the terminal inversely crops and scales the replaced image according to the cropping and scaling information to obtain an image with the same size as the current frame original image. Processing in the same way, the three-dimensional virtual models are output in sequence, and the inversely cropped and scaled images are output in sequence. The reconstruction network is lightweight, the whole reconstruction process takes about 20 milliseconds per frame, and real-time reconstruction of the three-dimensional virtual model can be realized.
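As a small illustration of the cropping and inverse-cropping steps above, the sketch below cuts a fixed-size patch centered on the detected human body and later pastes the processed patch back into the original frame. The crop size and the clamping strategy are assumptions; the patent only specifies cropping a preset-size region centered on the target human body and inverting the transform afterwards.

```python
import numpy as np

def crop_centered(image, center, size=224):
    """Cut a size x size patch centred on the detected person, clamped to the
    image bounds, and return it with its top-left corner so the processed
    patch can be pasted back later."""
    h, w = image.shape[:2]
    cx, cy = center
    left = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    top = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    patch = image[top:top + size, left:left + size].copy()
    return patch, (top, left)

def paste_back(image, patch, corner):
    """Inverse of crop_centered: overwrite the source region of a copy of the
    original frame with the (e.g. rendered) patch."""
    top, left = corner
    out = image.copy()
    out[top:top + patch.shape[0], left:left + patch.shape[1]] = patch
    return out
```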
In one embodiment, as shown in fig. 8, a training method for reconstructing a network is provided, and the method is applied to the terminal in fig. 1 for illustration, and includes the following steps:
Step 802, acquiring a first training image of a first object; the first object has a movable limb.
Specifically, the terminal may acquire a first training image containing a first object having a movable limb. For example, the first object is a human or an animal. Further, the terminal may obtain the first training image by directly photographing the first object, or may obtain the first training image locally, from a network, or from a third-party device.
In this embodiment, the terminal may acquire arbitrary images and screen out the images in which a human body or an animal is present, thereby obtaining first training images.
In this embodiment, the terminal may collect the three-dimensional parameters corresponding to the target object in the first training image. The three-dimensional parameters include three-dimensional pose parameters and three-dimensional body shape parameters. The terminal can then take the three-dimensional parameters as the three-dimensional label corresponding to the first training image, and take the three-dimensional pose parameters as the three-dimensional pose label. In addition, the terminal can collect the point cloud features of the first training image and determine point cloud coordinates of different scales. The terminal then takes the point cloud coordinates of different scales as the point cloud labels corresponding to the first training image.
In this embodiment, the terminal may acquire three-dimensional parameters of the target object through the motion capture device. The terminal can set the three-dimensional parameters of the target object obtained by capturing and collecting as a label. Further, the terminal can set the corresponding three-dimensional parameter in the first training image as a first three-dimensional label, and set the three-dimensional parameter of the target object obtained by capturing and collecting as a second three-dimensional label.
And step 804, extracting features of the first training image through a reconstruction network to be trained, and performing graph convolution processing on the extracted features to obtain point cloud coordinates with different scales.
Specifically, the reconstruction network to be trained comprises a feature extraction layer and a graph convolution layer. The terminal inputs the first training image into the reconstruction network to be trained. The feature extraction layer of the reconstruction network to be trained performs feature extraction on the input first training image to obtain a corresponding feature map. The graph convolution layer of the reconstruction network to be trained then performs graph convolution processing on the feature map to obtain point cloud coordinates of different scales.
In this embodiment, feature extraction is performed on a first training image through a reconstruction network to be trained, and graph convolution processing is performed on the extracted features to obtain point cloud coordinates with different scales, including:
Performing feature extraction on the first training image through the feature extraction layer of the reconstruction network to obtain a corresponding feature map; performing graph convolution processing on the feature map through the graph convolution layer of the reconstruction network to obtain point cloud features of different scales; and performing regression processing on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales.
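The following is a minimal sketch, in PyTorch, of how an image feature vector could be mapped to per-vertex features and refined with graph convolutions to regress point cloud coordinates at one scale. The layer widths, the dense-adjacency formulation and all names are assumptions and are not taken from the patent.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One dense graph-convolution layer: propagate vertex features over the
    mesh adjacency, then mix channels with a learned linear map."""
    def __init__(self, adj, in_dim, out_dim):
        super().__init__()
        self.register_buffer("adj", adj)            # (V, V) normalised adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                            # x: (B, V, in_dim)
        return torch.relu(self.linear(self.adj @ x))

class PointCloudHead(nn.Module):
    """Map an image feature vector to vertex features, refine them with graph
    convolutions, and regress 3D point cloud coordinates at one scale."""
    def __init__(self, adj, feat_dim, num_vertices, hidden=64):
        super().__init__()
        self.num_vertices, self.hidden = num_vertices, hidden
        self.to_vertices = nn.Linear(feat_dim, num_vertices * hidden)
        self.gc1 = GraphConv(adj, hidden, hidden)
        self.gc2 = GraphConv(adj, hidden, hidden)
        self.coord = nn.Linear(hidden, 3)            # regress (x, y, z) per vertex

    def forward(self, feat):                         # feat: (B, feat_dim) image features
        x = self.to_vertices(feat).view(-1, self.num_vertices, self.hidden)
        x = self.gc2(self.gc1(x))
        return self.coord(x)                         # (B, V, 3) point cloud coordinates

# Usage with a placeholder adjacency; a real model would use the (downsampled) SMPL mesh graph.
V = 431
head = PointCloudHead(torch.eye(V), feat_dim=2048, num_vertices=V)
coords = head(torch.randn(2, 2048))                  # -> (2, 431, 3)
```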
At step 806, predicted three-dimensional parameters of the first object are generated based on the point cloud coordinates of the different scales.
Specifically, the terminal may perform downsampling and full connection processing on the point cloud coordinates of different scales to obtain predicted three-dimensional parameters corresponding to the first object.
Step 808, constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters.
Specifically, the terminal acquires point cloud labels with different scales, and determines differences between point cloud coordinates with different scales and point cloud labels with corresponding scales. The terminal may obtain a three-dimensional tag, determine a difference between the three-dimensional tag and the predicted three-dimensional parameter. The terminal can construct a loss function according to the difference between the point cloud coordinates and the point cloud labels and the difference between the predicted three-dimensional parameters and the three-dimensional labels.
Step 810, training a reconstruction network to be trained based on a target loss function, and obtaining a trained reconstruction network when a training stop condition is met; the trained reconstruction network is used for reconstructing an object with a movable limb in an image into a three-dimensional virtual model with a limb shape matched with the object.
Specifically, the terminal trains the reconstructed network to be trained based on the target loss function. And adjusting parameters of the reconstruction network in the training process, and continuing training until the reconstruction network meets the training stopping condition, and stopping training to obtain the trained reconstruction network. The trained reconstruction network is used for reconstructing an object with a moving limb in an image into a three-dimensional virtual model with a limb form matched with the object.
In this embodiment, the training stop condition may be that the loss error of the reconstructed network is less than or equal to the loss threshold, or the training stop condition is that the number of iterations of reconstructing the network reaches a preset number of iterations.
For example, the loss error generated in each training is calculated through the target loss function, the parameters of the reconstruction network are adjusted based on the difference between the loss error and the loss threshold value, training is continued until training is stopped under the training stop condition, and the trained reconstruction network is obtained.
Alternatively, the terminal counts the iterations of the reconstruction network during training, and stops training when the iteration count reaches the preset number of iterations, thereby obtaining the trained reconstruction network.
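A minimal sketch of a training loop implementing the two stop conditions described above; the threshold, iteration budget and learning rate are placeholder values, and `loss_fn` is assumed to compute the target loss on a batch.

```python
import itertools
import torch

def train(net, loss_fn, data_loader, loss_threshold=1e-3, max_iters=100_000, lr=1e-4):
    """Train with Adam until the loss error drops to the threshold or the
    preset iteration count is reached, whichever happens first."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for it, batch in enumerate(itertools.cycle(data_loader), start=1):
        loss = loss_fn(net, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() <= loss_threshold or it >= max_iters:
            break
    return net
```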
In this embodiment, a first training image of a first object with a movable limb is obtained, feature extraction is performed on the first training image through a reconstruction network to be trained, and graph convolution processing is performed on the extracted features to obtain point cloud coordinates with different scales, so that three-dimensional parameters corresponding to the object in the image can be accurately generated through graph convolution. And constructing a loss function by combining the point cloud coordinates of different scales and the three-dimensional parameters. The reconstruction network to be trained is trained based on the target loss function, and loss caused by factors in all aspects to the network can be integrated in the training process, so that the loss in all aspects is reduced to the minimum through training, the accuracy of the trained reconstruction network is higher, the generalization capability is stronger, and the prediction of the trained reconstruction network on the three-dimensional parameters of the target object in the two-dimensional image is more accurate. And accurately predicting the three-dimensional parameters of the target object in the two-dimensional image by using the trained reconstruction network, so as to accurately reconstruct a three-dimensional virtual model corresponding to the target object according to the three-dimensional parameters.
In one embodiment, as shown in fig. 9, constructing the target loss function from the point cloud coordinates and the predicted three-dimensional parameters of different scales includes:
step 902, obtaining point cloud labels, and constructing a first loss function according to point cloud coordinates of different scales and the point cloud labels of corresponding scales.
The point cloud labels are preset point cloud coordinates of different scales corresponding to the first training image.
Specifically, the terminal acquires point cloud labels of different scales corresponding to the first training image, calculates an L2 norm according to point cloud coordinates of different scales and the point cloud labels of corresponding scales output by the reconstruction network, and sums the L2 norms of different scales to obtain a first loss function.
For example, the first loss function constructed by the terminal is as follows:
L_graph = ∑_i ||f_i(Φ_i) − d_i(M(θ, β))||²  (1)

wherein Φ_i represents the feature expression of the i-th layer, f_i(Φ_i) represents the point cloud coordinates of the i-th layer after graph convolution processing, M(θ, β) represents the ground-truth point cloud generated from the pose and body shape parameters, and d_i(M(θ, β)) represents the point cloud label of the i-th layer obtained by downsampling M(θ, β).
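A sketch of equation (1): the squared L2 distance between the predicted point cloud at each scale and the ground-truth mesh downsampled to that scale, summed over scales. The precomputed downsampling matrices d_i and the batch averaging are assumptions.

```python
import torch

def graph_loss(pred_point_clouds, gt_mesh, downsample_mats):
    """Equation (1): sum over scales i of ||f_i(phi_i) - d_i(M(theta, beta))||^2.

    pred_point_clouds: list of (B, V_i, 3) predicted point clouds, one per scale.
    gt_mesh:           (B, V, 3) ground-truth vertices M(theta, beta).
    downsample_mats:   list of (V_i, V) matrices d_i mapping the full mesh to each scale.
    """
    loss = 0.0
    for pred, d in zip(pred_point_clouds, downsample_mats):
        target = d @ gt_mesh                         # (V_i, V) @ (B, V, 3) -> (B, V_i, 3)
        loss = loss + ((pred - target) ** 2).sum(dim=(-1, -2)).mean()
    return loss
```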
Step 904, obtaining a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label.
The first three-dimensional label is a preset three-dimensional parameter corresponding to a first object in the first training image.
Specifically, the terminal acquires a first three-dimensional label corresponding to the first training image, determines an L2 norm between a predicted three-dimensional parameter output by the reconstruction network and the corresponding first three-dimensional label, and obtains a second loss function.
For example, the second loss function constructed by the terminal is as follows:
L_smpl = ||θ − θ̂||² + ||β − β̂||²  (2)

wherein θ represents the three-dimensional pose parameters predicted by the reconstruction network and θ̂ represents the three-dimensional pose labels, i.e. the ground-truth pose parameters; β represents the three-dimensional body shape parameters predicted by the reconstruction network and β̂ represents the three-dimensional body shape labels, i.e. the ground-truth body shape parameters.
In this embodiment, the global translation needs to be removed for supervised training of the 3D joint points (i.e., the three-dimensional pose parameters). That is, the 3D joint point coordinates are originally coordinates in the world coordinate system, and removing the global translation refers to subtracting the world coordinates of the pelvic joint point from the world coordinates of each 3D joint point, resulting in coordinates in a coordinate system centered on the pelvis. This makes the training of the network more stable. The training optimizer adopts the Adam algorithm to minimize the loss function until the reconstruction network converges.
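A sketch of the pelvis-centring step and of the parameter loss in equation (2); the pelvis joint index 0 follows the common SMPL joint ordering and is an assumption.

```python
import torch

PELVIS = 0  # assumed index of the pelvis joint in the SMPL joint ordering

def center_on_pelvis(joints_3d):
    """Remove the global translation: subtract the pelvis world coordinates
    from every 3D joint point, giving pelvis-centred coordinates."""
    return joints_3d - joints_3d[:, PELVIS:PELVIS + 1, :]   # joints_3d: (B, J, 3)

def smpl_param_loss(pred_pose, pred_shape, gt_pose, gt_shape):
    """Equation (2): squared L2 error between predicted and ground-truth pose
    (theta, 72-d) and body shape (beta, 10-d) parameters."""
    return ((pred_pose - gt_pose) ** 2).sum(dim=-1).mean() + \
           ((pred_shape - gt_shape) ** 2).sum(dim=-1).mean()
```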
Step 906, constructing an objective loss function according to the first loss function and the second loss function.
Specifically, the terminal may obtain a weight corresponding to the first loss function and a weight corresponding to the second loss function. The terminal multiplies the first loss function and the corresponding weight, multiplies the second loss function and the corresponding weight, and sums the multiplied results to obtain the target loss function.
In this embodiment, a first loss function is constructed according to the point cloud coordinates of different scales and the point cloud labels of the corresponding scales, a first three-dimensional label corresponding to the first training image is obtained, a second loss function is constructed according to the predicted three-dimensional parameters and the first three-dimensional label, and a target loss function is constructed according to the first loss function and the second loss function. The target loss function is thus based on both the point cloud coordinates of different scales and the predicted three-dimensional parameters obtained by prediction from the image, so that the constructed target loss function is more accurate and the reconstruction network obtained by training is more accurate.
In one embodiment, the method further comprises: acquiring a second three-dimensional label, and constructing a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object;
constructing an objective loss function from the first loss function and the second loss function, comprising: and constructing an objective loss function according to the first loss function, the second loss function and the third loss function.
The second three-dimensional tag is a three-dimensional parameter acquired by performing motion capture on the first object through the motion capture device, and the three-dimensional parameter comprises a three-dimensional posture parameter and a three-dimensional body type parameter acquired by performing motion capture on the first object through the motion capture device.
Specifically, during training the reconstruction network further comprises a generative adversarial layer. The terminal acquires the second three-dimensional label, and inputs the second three-dimensional label and the predicted three-dimensional parameters into the generative adversarial layer. The generative adversarial layer discriminates the input predicted three-dimensional parameters and the second three-dimensional label, the discrimination result being true or false. For example, an output of 1 from the generative adversarial layer indicates that the predicted three-dimensional parameters are judged true, and an output of 0 indicates that they are judged false. The convention may also be set as required, with 1 indicating that the predicted three-dimensional parameters are false and 0 indicating that they are true.
In this embodiment, the discrimination result output by the generative adversarial layer for the second three-dimensional label is true. It can be understood that all preset labels are judged true in the discrimination results output by the generative adversarial layer.
The terminal constructs a third loss function from the discrimination result output by the generative adversarial layer for the predicted three-dimensional parameters and the discrimination result for the second three-dimensional label.
Further, the terminal calculates the negative logarithm of the discrimination result corresponding to the predicted three-dimensional parameters and takes its expectation. The terminal also calculates the negative logarithm of the difference between 1 and the discrimination result corresponding to the second three-dimensional label and takes its expectation. The terminal sums the two expectations to obtain the third loss function.
For example, the third loss function constructed by the terminal is as follows:
A logarithmic loss is used for the generative adversarial layer:

L_adv = E[−log D(x_f)] + E[−log(1 − D(x_r))]  (3)

wherein x_r represents the second three-dimensional label and D(x_r) represents the discrimination result of the discriminator in the generative adversarial layer for the second three-dimensional label; x_f represents the predicted three-dimensional parameters output by the reconstruction network and D(x_f) represents the discrimination result of the discriminator for the predicted three-dimensional parameters; E[·] denotes the expectation of the negative logarithm of the discriminator output.
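A sketch of the third loss, following the construction described above; how the discriminator scores are produced and the numerical clamping are assumptions.

```python
import torch

def adversarial_loss(disc_fake, disc_real, eps=1e-7):
    """Sum of the expectation of -log D(x_f) over predicted parameters and the
    expectation of -log(1 - D(x_r)) over motion-capture labels, as described
    in the text. disc_fake / disc_real are discriminator outputs in (0, 1)."""
    term_fake = (-torch.log(disc_fake.clamp(min=eps))).mean()
    term_real = (-torch.log((1.0 - disc_real).clamp(min=eps))).mean()
    return term_fake + term_real
```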
Then, the terminal may obtain a weight corresponding to the first loss function, a weight corresponding to the second loss function, and a weight corresponding to the third loss function. The terminal multiplies the first loss function and the corresponding weight, multiplies the second loss function and the corresponding weight, multiplies the third loss function and the corresponding weight, and sums the 3 multiplied results to obtain the target loss function.
In this embodiment, the reconstruction network further includes a generative adversarial layer that discriminates between the predicted three-dimensional parameters and the second three-dimensional label, so that a loss function can be constructed from the three-dimensional parameters predicted from the image and the three-dimensional data collected directly from a real human body, in order to judge whether the predicted three-dimensional parameters output by the network conform to reality. A loss function is constructed based on the difference between the discrimination result for the predicted three-dimensional parameters and the discrimination result for the second three-dimensional label, and three loss functions are thus constructed from three factors to obtain the target loss function. The target loss function integrates multiple characteristics, so that the accuracy of the reconstruction network obtained by training is higher.
In one embodiment, the method further comprises: generating camera parameters corresponding to the first training image based on the point cloud coordinates of different scales; converting the three-dimensional pose parameters into predicted two-dimensional pose parameters according to the camera parameters; and constructing a fourth loss function according to the predicted two-dimensional pose parameters and the corresponding two-dimensional pose labels;
constructing a target loss function according to point cloud coordinates of different scales and predicted three-dimensional parameters, wherein the method comprises the following steps of: and constructing a target loss function according to the fourth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
The three-dimensional attitude parameter is a three-dimensional joint point coordinate corresponding to the first object.
Specifically, the reconstruction network generates predicted three-dimensional parameters corresponding to a first object in the first training image based on point cloud coordinates of different scales, and generates predicted camera parameters corresponding to the first training image. The predicted three-dimensional parameters include predicted three-dimensional posture parameters. And then, the terminal can map the predicted three-dimensional gesture parameters from the three-dimensional space to the two-dimensional space according to the predicted camera parameters to obtain predicted two-dimensional gesture parameters corresponding to the predicted three-dimensional gesture parameters. The predicted two-dimensional pose parameter refers to the predicted two-dimensional articulation point coordinates of the first object.
The terminal may determine an L2 norm between the predicted two-dimensional pose parameter and the corresponding two-dimensional pose tag, resulting in a fourth loss function.
For example, the fourth loss function constructed by the terminal is as follows:
The two-dimensional pose loss is the loss between the joint points projected with the camera parameters and the ground-truth annotations:

L_j2d = ||Π_c(X_3d) − x̂_2d||²  (4)

wherein Π_c(X_3d) represents the predicted two-dimensional pose parameters obtained by projecting, with the predicted camera parameters, the predicted three-dimensional pose parameters output by the reconstruction network, i.e. the predicted 2D joint points obtained by projecting the predicted 3D joint points, and x̂_2d is the two-dimensional pose label, i.e. the ground-truth 2D joint points.
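A sketch of the reprojection and of equation (4), assuming a weak-perspective camera parameterised by a scale and a 2D translation; the patent only states that the predicted 3D joint points are projected with the predicted camera parameters, so this parameterisation is an assumption.

```python
import torch

def project_weak_perspective(joints_3d, cam):
    """Project 3D joint points to the image plane with a weak-perspective
    camera. cam: (B, 3) = (s, tx, ty), an assumed parameterisation."""
    s = cam[:, :1].unsqueeze(-1)             # (B, 1, 1) scale
    t = cam[:, 1:].unsqueeze(1)              # (B, 1, 2) translation
    return s * joints_3d[..., :2] + t        # (B, J, 2) projected joints

def joint_2d_loss(pred_joints_3d, cam, gt_joints_2d):
    """Equation (4): squared L2 error between the reprojected 2D joint points
    and the ground-truth 2D annotations."""
    pred_2d = project_weak_perspective(pred_joints_3d, cam)
    return ((pred_2d - gt_joints_2d) ** 2).sum(dim=(-1, -2)).mean()
```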
And then, the terminal acquires point cloud labels, and a first loss function is constructed according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales. And the terminal acquires a first three-dimensional label corresponding to the first training image, and constructs a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label.
And constructing an objective loss function according to the first loss function and the second loss function.
Then, the terminal may obtain a weight corresponding to the first loss function, a weight corresponding to the second loss function, and a weight corresponding to the fourth loss function. The terminal multiplies the first loss function and the corresponding weight, multiplies the second loss function and the corresponding weight, multiplies the fourth loss function and the corresponding weight, and sums the 3 multiplied results to obtain the target loss function.
In the embodiment, camera parameters corresponding to the first training image are generated based on point cloud coordinates of different scales; converting the three-dimensional gesture parameters into predicted two-dimensional gesture parameters according to the camera parameters; the method comprises the steps of constructing a fourth loss function according to predicted two-dimensional gesture parameters and corresponding two-dimensional gesture labels, constructing a target loss function according to the fourth loss function, point cloud coordinates of different scales and predicted three-dimensional parameters, and constructing the target loss function based on the three characteristics of the predicted two-dimensional gesture parameters, the point cloud coordinates of different scales and the predicted three-dimensional parameters of an image, so that the constructed target loss function is more accurate, and a reconstruction network obtained through training is more accurate.
In one embodiment, the method further comprises: inputting the second training image into a reconstruction network to be trained to obtain predicted three-dimensional attitude parameters of a second object in the second training image; the first training image and the second training image are images acquired in different environments; acquiring a three-dimensional attitude label corresponding to the second object, and constructing a fifth loss function according to the predicted three-dimensional attitude parameter and the three-dimensional attitude label;
constructing a target loss function according to point cloud coordinates of different scales and predicted three-dimensional parameters, wherein the method comprises the following steps of: and constructing a target loss function according to the fifth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
The first training image is an image collected outdoors, and the second training image is an image collected indoors. The first object in the first training image and the second object in the second training image may be the same object or different objects.
Specifically, the terminal inputs the second training image into a reconstruction network to be trained, and performs feature extraction on the second training image through a feature extraction layer in the reconstruction network to be trained to obtain a feature map of the second training image. Carrying out graph convolution processing on the feature graph of the second training image through a graph convolution layer in the reconstruction network to be trained to obtain point cloud features with different scales; and carrying out graph convolution processing on the point cloud features with different scales through the graph convolution layer to obtain the point cloud coordinates with different scales. The terminal generates predicted three-dimensional attitude parameters corresponding to the second object in the second training image based on the point cloud coordinates of different scales through the graph convolution layer.
And then, the terminal acquires a three-dimensional gesture label corresponding to the second object, determines an L2 norm between the predicted three-dimensional gesture parameter and the three-dimensional gesture label, and obtains a fifth loss function.
For example, the fifth loss function constructed by the terminal is as follows:
The fifth loss function is the loss between the predicted three-dimensional pose parameters output by the network and the three-dimensional pose labels, i.e. the loss between the joint points of the SMPL model and the ground-truth annotations:

L_j3d = ||R_θ(β) − X̂_3d||²  (5)

wherein R_θ(β) represents the predicted three-dimensional pose parameters among the predicted three-dimensional parameters output by the reconstruction network, i.e. the predicted 3D joint point coordinates, and X̂_3d represents the three-dimensional pose label, i.e. the ground-truth 3D joint point coordinates.
And then, the terminal acquires point cloud labels, and a first loss function is constructed according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales. And the terminal acquires a first three-dimensional label corresponding to the first training image, and constructs a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label.
And constructing an objective loss function according to the fifth loss function, the first loss function and the second loss function.
Then, the terminal may obtain a weight corresponding to the first loss function, a weight corresponding to the second loss function, and a weight corresponding to the fifth loss function. The terminal multiplies the first loss function and the corresponding weight, multiplies the second loss function and the corresponding weight, multiplies the fifth loss function and the corresponding weight, and sums the 3 multiplied results to obtain the target loss function.
In this embodiment, a loss function is constructed based on the three-dimensional pose parameters predicted by the reconstruction network and the corresponding three-dimensional pose labels, and this loss is combined with the other losses to construct the target loss function. The losses generated by multiple factors during training of the reconstruction network can thus be integrated, so that the influence of each factor on the predictions of the reconstruction network is minimized and the three-dimensional virtual model reconstructed by the reconstruction network is more accurate.
In one embodiment, the method further comprises: inputting the third training image into a reconstruction network to be trained to obtain corresponding predicted two-dimensional attitude parameters; constructing a fifth loss function according to the predicted two-dimensional attitude parameters and the corresponding two-dimensional attitude labels;
constructing a target loss function based on point cloud coordinates of different scales and predicted three-dimensional parameters, comprising: and constructing a target loss function based on the fifth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
Specifically, the third training image may be an image with a complex background, where the third training image includes a third object.
In one embodiment, inputting the third training image into the reconstruction network to be trained to obtain the corresponding predicted two-dimensional pose parameter, including:
Inputting the third training image into the reconstruction network to be trained, and performing feature extraction on the third training image through the feature extraction layer in the reconstruction network to be trained to obtain a feature map of the third training image; performing graph convolution processing on the feature map of the third training image through the graph convolution layer in the reconstruction network to be trained to obtain point cloud features of different scales; performing regression processing on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales; generating predicted three-dimensional pose parameters and camera parameters based on the point cloud coordinates of different scales through the graph convolution layer; and converting the predicted three-dimensional pose parameters into the corresponding predicted two-dimensional pose parameters through the camera parameters.
In one embodiment, the terminal may construct the target loss function according to the first loss function, the second loss function, the third loss function, the fourth loss function, and the fifth loss function, and weight parameters corresponding to the loss functions.
The objective loss function constructed by the terminal is as follows:
L_total = λ_1·L_j2d + λ_2·L_j3d + λ_3·L_graph + λ_4·L_smpl + λ_5·L_adv  (6)
wherein λ is a weight parameter corresponding to each loss function.
The target loss function is constructed through 5 loss functions, factors affecting network performance in more aspects are integrated, the influence of each aspect is minimized, and the reconstructed network is more accurate.
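A one-line sketch of equation (6); the weight values are placeholders since the patent leaves the lambdas unspecified.

```python
def total_loss(l_j2d, l_j3d, l_graph, l_smpl, l_adv,
               weights=(1.0, 1.0, 1.0, 1.0, 0.1)):
    """Equation (6): weighted sum of the five loss terms."""
    lam1, lam2, lam3, lam4, lam5 = weights
    return lam1 * l_j2d + lam2 * l_j3d + lam3 * l_graph + lam4 * l_smpl + lam5 * l_adv
```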
In one embodiment, the training data sets used in the training process of the reconstruction network include the 3DPW human parameter data set (3D Poses in the Wild, an outdoor 3D human body data set); two-dimensional pose data sets such as PoseTrack and PennAction; publicly available three-dimensional pose data sets such as Human3.6M and MPI-INF-3DHP (collected indoors); and SMPL parameter data sets obtained by Mosh. The SMPL human body parameters include pose parameters and body shape parameters; the pose parameters have 72 dimensions and the body shape parameters have 10 dimensions. The pose parameters are the rotation information of 24 joint points, the rotation of each joint point being represented by a 3-dimensional axis-angle vector, giving 24×3 dimensions in total. SMPL is a human body parameterization based on skinned vertex transformation, which can be converted into a pose, represented by a 72-dimensional vector, and a body shape, represented by a 10-dimensional vector. Mosh (Motion and Shape Capture) is a motion capture technique for capturing human motion and body shape data sets; it relies on motion capture devices to record data of points on the human body surface, from which SMPL parameter data sets are obtained.
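To make the 72-dimensional pose representation concrete, the sketch below splits the pose vector into 24 per-joint axis-angle vectors and converts each one to a rotation matrix with Rodrigues' formula; only the representation itself is taken from the text, the code is illustrative.

```python
import numpy as np

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: convert a 3-d axis-angle vector to a 3x3 rotation matrix."""
    angle = np.linalg.norm(aa)
    if angle < 1e-8:
        return np.eye(3)
    axis = aa / angle
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose_to_rotations(pose_72d):
    """Split the 72-d SMPL pose vector into 24 axis-angle vectors (one per
    joint point) and convert each to a rotation matrix -> (24, 3, 3)."""
    return np.stack([axis_angle_to_matrix(aa) for aa in np.asarray(pose_72d).reshape(24, 3)])
```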
As shown in fig. 10, a frame diagram of three-dimensional virtual model reconstruction of a human body in a two-dimensional image in one embodiment. The terminal acquires a two-dimensional image containing a human body and inputs the two-dimensional image into the reconstruction network. The feature extraction layer in the reconstruction network performs feature extraction on the two-dimensional image, and camera parameter regression is performed based on the extracted features to obtain the camera parameters corresponding to the two-dimensional image. The extracted features are input into the graph convolution layer for graph convolution processing, and regression processing is performed on the point cloud features of each scale to obtain point cloud coordinates of different scales. The graph convolution layer performs human body parameter regression based on the point cloud coordinates of different scales to obtain the three-dimensional parameters corresponding to the human body in the two-dimensional image. The three-dimensional joint points (i.e. the three-dimensional pose parameters) among the three-dimensional parameters are reprojected to obtain the two-dimensional joint points (i.e. the two-dimensional pose parameters). The three-dimensional parameters and the corresponding three-dimensional labels are input into the generative adversarial layer, and the discriminator in the generative adversarial layer discriminates the three-dimensional parameters to obtain a discrimination result, the discrimination result being true or false. A target loss function is constructed according to the difference between the three-dimensional parameters and the corresponding three-dimensional labels, the difference between the point cloud coordinates of different scales and the point cloud labels of corresponding scales, the difference between the three-dimensional joint points and the corresponding labels, and the difference between the two-dimensional joint points and the corresponding labels, and the reconstruction network is trained based on the target loss function.
In one embodiment, a three-dimensional virtual model reconstruction method is provided, comprising:
The server obtains a first training image of a first object, the first object having a movable limb.
And then, the server extracts the features of the first training image through a reconstruction network to be trained, and performs graph convolution processing on the extracted features to obtain point cloud coordinates with different scales.
Then, the server generates predicted three-dimensional parameters and corresponding predicted camera parameters of the first object based on the point cloud coordinates of the different scales.
Then, the server acquires point cloud labels, and a first loss function is constructed according to point cloud coordinates of different scales and the point cloud labels of corresponding scales.
Further, the server acquires a first three-dimensional label corresponding to the first training image, and constructs a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label.
Then, the server acquires a second three-dimensional label, and constructs a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label. The second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object.
Then, the server converts the three-dimensional gesture parameters into predicted two-dimensional gesture parameters according to the camera parameters.
Further, the server constructs a fourth loss function according to the predicted two-dimensional pose parameters and the corresponding two-dimensional pose labels.
And then, the server inputs the second training image into a reconstruction network to be trained to obtain predicted three-dimensional posture parameters of a second object in the second training image. The first training image and the second training image are images acquired in different environments.
And then, the server acquires a three-dimensional gesture label corresponding to the second object, and a fifth loss function is constructed according to the predicted three-dimensional gesture parameter and the three-dimensional gesture label.
The server constructs a target loss function according to the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function and weight parameters corresponding to the loss functions.
Further, the server trains the reconstruction network to be trained based on the target loss function, and when the training stopping condition is met, the trained reconstruction network is obtained. The trained reconstruction network is used for reconstructing the object with the movable limb in the image into a three-dimensional virtual model with the limb shape matched with the object.
The trained reconstruction network is applied to a terminal. The terminal acquires the images containing the target object in each frame of the captured video of the target object, the target object having a movable limb.
And then, the terminal performs feature extraction on each frame of image through a trained feature extraction layer of the reconstructed network to obtain a feature map corresponding to each frame of image.
Further, the terminal carries out graph convolution processing on the feature graphs of each frame of image through a graph convolution layer of the reconstruction network to obtain point cloud features of different scales corresponding to each frame of image.
Then, the terminal performs regression processing on the point cloud features of different scales corresponding to each frame image through the graph convolution layer to obtain the point cloud coordinates of different scales corresponding to each frame image.
And then, the terminal generates three-dimensional parameters and corresponding camera parameters of the target object in each frame image according to the point cloud coordinates of different scales corresponding to each frame image.
Further, the terminal reconstructs a three-dimensional virtual model of the target object in each frame of image based on the three-dimensional parameters of the target object, the three-dimensional virtual model having a limb shape matching the target object in the corresponding image.
Further, the terminal projects the three-dimensional virtual model into a two-dimensional image according to the camera parameters corresponding to the image, and replaces the target object in the corresponding original image in the video with the two-dimensional image to obtain the target video.
In this embodiment, the training process of the reconstruction network places high demands on the graphics card of the device and may be completed on a server. For the three-dimensional parameter data sets, supervised training can be performed with the predicted three-dimensional pose parameters and the corresponding labels. To improve the generalization capability of the reconstruction network, a two-dimensional pose data set and a three-dimensional pose data set are added for semi-supervised training. To improve the quality of the results output by the reconstruction network, a discriminator is constructed to discriminate the data predicted by the network. To improve the accuracy of the network, multi-scale point cloud coordinate supervision is added to the training loss function. To improve the smoothness of the results between video image frames, the single-frame results are smoothed over time. Multiple factors are thus considered during training, so that the influence of each aspect is minimized through training and the accuracy of the reconstruction network is improved.
The two-dimensional image of the target object to be reconstructed is predicted through the trained reconstruction network, so that the three-dimensional parameters of the target object can be accurately obtained, and a three-dimensional virtual model can be accurately constructed according to the three-dimensional parameters.
It should be understood that, although the steps in the flowcharts of fig. 2 to 10 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be performed in other orders. Moreover, at least some of the steps in fig. 2 to 10 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 11, a three-dimensional virtual model reconstruction apparatus is provided, where the apparatus may use a software module or a hardware module, or a combination of both, to form a part of a computer device, and specifically includes: an image acquisition module 1102, a feature extraction module 1104, a generation module 1106, and a reconstruction module 1108, wherein:
an image acquisition module 1102 is configured to acquire an image of a target object, the target object having an active extremity.
The feature extraction module 1104 is configured to perform feature extraction on the image, and perform graph convolution processing on the extracted feature to obtain point cloud coordinates with different scales.
The generating module 1106 is configured to generate three-dimensional parameters of the target object according to the point cloud coordinates of the different scales.
A reconstruction module 1108, configured to reconstruct a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.
In this embodiment, by acquiring an image of a target object, where the target object has a moving limb, extracting features from the image, and performing graph rolling processing on the extracted features, point cloud coordinates with different scales are obtained, and three-dimensional parameters of the target object are generated according to the point cloud coordinates with different scales, so that the three-dimensional parameters of the target object in the image can be accurately generated through graph rolling. Reconstructing a three-dimensional virtual model of the target object based on three-dimensional parameters of the target object, the three-dimensional virtual model having a limb morphology that matches the target object in the image, thereby improving accuracy of the three-dimensional virtual model reconstruction.
In one embodiment, the feature extraction module 1104 is configured to: perform feature extraction on the image through the feature extraction layer of the reconstruction network to obtain a corresponding feature map; perform graph convolution processing on the feature map through the graph convolution layer of the reconstruction network to obtain point cloud features of different scales; and perform regression processing on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales.
In this embodiment, the features of the image are extracted through the trained reconstruction network, so as to obtain the key information of the image. The extracted key information undergoes graph convolution processing through the reconstruction network and is converted into point cloud features of different scales, and regression processing is performed on the point cloud features of different scales through the graph convolution layer to obtain point cloud coordinates of different scales, so that the coordinates of the key feature information at each scale are output accurately.
In one embodiment, the apparatus further comprises: a projection module for: determining camera parameters corresponding to the image based on the point cloud coordinates of different scales; and projecting the three-dimensional virtual model into a two-dimensional image according to camera parameters corresponding to the image.
In this embodiment, camera parameters corresponding to the image are determined based on point cloud coordinates of different scales, and the three-dimensional virtual model is projected into a two-dimensional image according to the camera parameters corresponding to the image, so that the display mode is more attractive and visual. And the superposition degree between the two-dimensional image projected by the three-dimensional virtual model and the image of the target object can be intuitively displayed, so that the visualization of the three-dimensional virtual model is realized.
In one embodiment, the image acquisition module 1102 is further configured to: acquiring an image containing a target object in a video of the target object;
the apparatus further comprises: a projection module, the projection module further configured to: projecting the three-dimensional virtual model into the image to replace the target object in the image; and generating the target video based on the image after each frame in the video replaces the target object.
In this embodiment, by acquiring an image in which each frame in the video of the target object includes the target object, three-dimensional parameters corresponding to the target object in each frame image are output in real time through the reconstruction network. The three-dimensional virtual model corresponding to the target object in each frame of image is obtained based on three-dimensional parameter reconstruction, the three-dimensional virtual model corresponding to each image is projected into the corresponding image to replace the target object in the corresponding image, and a target video is obtained, so that the three-dimensional virtual model can be projected into the application of a human-computer interaction somatosensory game or a short video, and the reality of a three-dimensional virtual reality special effect in the human-computer interaction in the somatosensory game or the short video is enhanced.
In one embodiment, the image acquisition module 1102 is further configured to: acquiring each frame of image containing a target object in a video of the target object;
The generating module 1106 is further configured to: generating a three-dimensional parameter sequence based on the three-dimensional parameters of the target object in each frame of image;
the reconstruction module 1108 is also configured to: and generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence.
In this embodiment, each frame in the video of the target object includes an image of the target object, and a three-dimensional parameter sequence corresponding to the target object in each frame image is generated according to the point cloud coordinates of different scales corresponding to each frame image, so that three-dimensional parameters corresponding to the target object in each frame image can be output in real time through the reconstruction network, and thus, a three-dimensional virtual model of the target object in each frame image is reconstructed in real time, and the efficiency of reconstructing the three-dimensional virtual model is improved.
In one embodiment, the generating module 1106 is further configured to: acquiring corresponding moments of each frame of image in the video to obtain a time sequence; filtering the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence;
the reconstruction module 1108 is also configured to: generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence; the three-dimensional virtual model in the three-dimensional virtual model sequence has a limb morphology that matches the target object in the corresponding image.
In this embodiment, the corresponding time of each frame image in the video is obtained to obtain a time sequence, and the three-dimensional parameter sequence is filtered according to the time sequence to obtain a filtered three-dimensional parameter sequence, so that inter-frame smoothing of the three-dimensional parameters can be realized according to time association. Based on the three-dimensional parameter sequence after filtering, reconstructing a three-dimensional virtual model of a target object in each frame of image, so that a reconstruction network can output a three-dimensional virtual model corresponding to a continuous target object, smooth transition between every two three-dimensional virtual models is realized, and the accuracy and continuity of three-dimensional virtual model reconstruction are improved.
For specific limitations of the three-dimensional virtual model reconstruction device, reference may be made to the above limitation of the three-dimensional virtual model reconstruction method, and no further description is given here. The above-described respective modules in the three-dimensional virtual model reconstruction apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, as shown in fig. 12, a training apparatus for reconstructing a network is provided, where the apparatus may use a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes: a training image acquisition module 1202, an input module 1204, a prediction module 1206, a construction module 1208, and a training module 1210, wherein:
a training image acquisition module 1202 for acquiring a first training image of a first object; the first subject has an active limb.
And the input module 1204 is used for extracting the characteristics of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted characteristics to obtain point cloud coordinates with different scales.
A prediction module 1206 is configured to generate predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales.
A construction module 1208 is configured to construct a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameter.
The training module 1210 is configured to train the reconstruction network to be trained based on the objective loss function, and obtain a trained reconstruction network when a training stop condition is satisfied; the trained reconstruction network is used for reconstructing an object with a movable limb in an image into a three-dimensional virtual model with a limb shape matched with the object.
In this embodiment, a first training image of a first object with a movable limb is obtained, feature extraction is performed on the first training image through a reconstruction network to be trained, and graph convolution processing is performed on the extracted features to obtain point cloud coordinates with different scales, so that three-dimensional parameters corresponding to the object in the image can be accurately generated through graph convolution. And constructing a loss function by combining the point cloud coordinates of different scales and the three-dimensional parameters. The reconstruction network to be trained is trained based on the target loss function, and loss caused by factors in all aspects to the network can be integrated in the training process, so that the loss in all aspects is reduced to the minimum through training, the accuracy of the trained reconstruction network is higher, the generalization capability is stronger, and the prediction of the trained reconstruction network on the three-dimensional parameters of the target object in the two-dimensional image is more accurate. And accurately predicting the three-dimensional parameters of the target object in the two-dimensional image by using the trained reconstruction network, so as to accurately reconstruct a three-dimensional virtual model corresponding to the target object according to the three-dimensional parameters.
In one embodiment, the construction module 1208 is further configured to: acquiring point cloud labels, and constructing a first loss function according to point cloud coordinates of different scales and the point cloud labels of corresponding scales; acquiring a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label; and constructing a target loss function according to the first loss function and the second loss function.
In this embodiment, a first loss function is constructed according to point cloud coordinates of different scales and point cloud labels of corresponding scales, a first three-dimensional label corresponding to a first training image is obtained, a second loss function is constructed according to predicted three-dimensional parameters and the first three-dimensional label, a target loss function is constructed according to the first loss function and the second loss function, and a constructed target loss function can be constructed based on two aspects of the point cloud coordinates of different scales and the predicted three-dimensional parameters obtained by prediction of the image, so that the constructed target loss function is more accurate, and a reconstructed network obtained by training is more accurate.
In one embodiment, the construction module 1208 is further configured to: acquiring a second three-dimensional label, and constructing a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object; and constructing an objective loss function according to the first loss function, the second loss function and the third loss function.
In this embodiment, the reconstruction network further includes a generative adversarial layer that discriminates between the predicted three-dimensional parameters and the second three-dimensional label, so as to determine whether the predicted three-dimensional parameters output by the network are consistent with real motion. A loss function is constructed based on the difference between the discrimination result for the predicted three-dimensional parameters and that for the second three-dimensional label, and the target loss function is obtained from the three loss functions built on these three factors. Because the target loss function integrates these various characteristics, the reconstruction network obtained by training is more accurate.
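One way such a generative adversarial layer and its third loss could be sketched, here in a least-squares GAN style (the discriminator architecture and the specific loss form are assumptions):

```python
import torch.nn as nn

class ParameterDiscriminator(nn.Module):
    """Scores whether a set of 3D parameters looks like real motion-capture data (the second 3D label)."""
    def __init__(self, param_dim=85):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(param_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, params):
        return self.net(params)                              # raw realism score per sample

def adversarial_losses(disc, pred_params, mocap_params):
    """Third loss (reconstruction side) and the discriminator's own loss."""
    third_loss = ((disc(pred_params) - 1.0) ** 2).mean()     # predicted parameters should look real
    disc_loss = (((disc(mocap_params) - 1.0) ** 2).mean()
                 + (disc(pred_params.detach()) ** 2).mean()) # discriminator learns to tell them apart
    return third_loss, disc_loss
```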
In one embodiment, the construction module 1208 is further configured to: generate camera parameters corresponding to the first training image based on the point cloud coordinates of different scales; convert the three-dimensional pose parameters into predicted two-dimensional pose parameters through the camera parameters; construct a fourth loss function according to the predicted two-dimensional pose parameters and the corresponding two-dimensional pose labels; and construct a target loss function according to the fourth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
In this embodiment, camera parameters corresponding to the first training image are generated based on the point cloud coordinates of different scales, the three-dimensional pose parameters are converted into predicted two-dimensional pose parameters according to the camera parameters, a fourth loss function is constructed according to the predicted two-dimensional pose parameters and the corresponding two-dimensional pose labels, and the target loss function is constructed according to the fourth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters. Because the target loss function is built from three characteristics predicted from the image, the two-dimensional pose parameters, the point cloud coordinates of different scales, and the three-dimensional parameters, it is more accurate, and the reconstruction network obtained by training is correspondingly more accurate.
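A sketch of the projection step and the fourth loss, assuming a weak-perspective camera parameterized by a scale and a 2D translation (this particular camera model, and the L1 form of the loss, are assumptions):

```python
import torch.nn.functional as F

def project_to_2d(joints_3d, cam):
    """Project 3D joint coordinates to 2D with weak-perspective camera parameters cam = (s, tx, ty)."""
    scale = cam[:, :1].unsqueeze(-1)           # (B, 1, 1)
    trans = cam[:, 1:].unsqueeze(1)            # (B, 1, 2)
    return scale * joints_3d[..., :2] + trans  # (B, J, 2) predicted 2D pose

def pose_2d_loss(joints_3d, cam, pose_2d_label):
    """Fourth loss: error between the projected 2D pose and the 2D pose label."""
    return F.l1_loss(project_to_2d(joints_3d, cam), pose_2d_label)
```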
In one embodiment, the construction module 1208 is further configured to: input a second training image into the reconstruction network to be trained to obtain predicted three-dimensional pose parameters of a second object in the second training image, where the first training image and the second training image are images acquired in different environments; acquire a three-dimensional pose label corresponding to the second object, and construct a fifth loss function according to the predicted three-dimensional pose parameters and the three-dimensional pose label; and construct a target loss function according to the fifth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
In this embodiment, a fifth loss function is constructed from the three-dimensional pose parameters predicted by the reconstruction network and the corresponding three-dimensional pose labels, and is combined with the other terms to construct the target loss function. The losses produced by multiple factors during training of the reconstruction network are thereby integrated, the influence of each factor on the prediction is minimized, and the three-dimensional virtual model reconstructed by the reconstruction network is more accurate.
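Putting the pieces together, the training module described above might take one optimization step on the weighted sum of the five loss terms roughly as follows (the weights and the choice of optimizer are assumptions):

```python
def training_step(loss_terms, optimizer, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """One gradient step on the target loss built from the five individual loss terms."""
    target = sum(w * loss for w, loss in zip(weights, loss_terms))
    optimizer.zero_grad()
    target.backward()
    optimizer.step()
    return target.item()
```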
For specific limitations on the training apparatus for the reconstruction network, reference may be made to the above limitations on the training method for the reconstruction network, which are not repeated here. Each of the modules in the training apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing training data of the reconstruction network and reconstruction data of the three-dimensional virtual model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method for a reconstruction network and a three-dimensional virtual model reconstruction method.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in fig. 13. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, where the wireless mode may be realized through WIFI, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a training method for a reconstruction network and a three-dimensional virtual model reconstruction method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, keys, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the processes of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and the computer program, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or an external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this specification.
The above examples illustrate only several embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application is to be determined by the appended claims.

Claims (20)

1. A method for reconstructing a three-dimensional virtual model, the method comprising:
acquiring each frame of image containing a target object in a video of the target object, wherein the target object is provided with a movable limb;
extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales;
generating a three-dimensional parameter sequence based on the three-dimensional parameters of the target object in the image of each frame;
acquiring corresponding moments of the images of each frame in the video to obtain a time sequence;
filtering the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence;
generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence; the three-dimensional virtual model in the three-dimensional virtual model sequence has a limb morphology that matches the target object in the corresponding image.
2. The method of claim 1, wherein the extracting features from the image and performing a graph convolution process on the extracted features to obtain point cloud coordinates with different scales includes:
extracting features of the image through a feature extraction layer of the reconstruction network to obtain a corresponding feature map;
carrying out graph convolution processing on the feature map through a graph convolution layer of the reconstruction network to obtain point cloud features with different scales;
and carrying out regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
determining camera parameters corresponding to the image based on the point cloud coordinates of different scales;
and projecting the three-dimensional virtual model into a two-dimensional image according to camera parameters corresponding to the image.
4. The method according to claim 1, wherein the method further comprises:
projecting the three-dimensional virtual model into the corresponding image to replace the target object in the image;
and generating a target video based on the images of the frames in the video after replacing the target object.
5. A training method for reconstructing a network, the method comprising:
acquiring a first training image of a first object; the first object has a movable limb;
extracting features of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
generating predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales;
constructing a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameters;
training the reconstruction network to be trained based on the target loss function, and obtaining a trained reconstruction network when the training stopping condition is met;
the trained reconstruction network is used for generating a three-dimensional virtual model sequence corresponding to a target object, the three-dimensional virtual model sequence is generated according to a filtered three-dimensional parameter sequence, the filtered three-dimensional parameter sequence is obtained by filtering the three-dimensional parameter sequence according to a time sequence, and the time sequence is formed based on the corresponding moment of each frame of image containing the target object in a video of the target object; the three-dimensional parameter sequence is generated based on three-dimensional parameters of the target object in each frame of the image, the three-dimensional parameters are generated according to point cloud coordinates of different scales, the point cloud coordinates of different scales are obtained by extracting features of the image and carrying out graph convolution processing on the extracted features; the three-dimensional virtual model in the three-dimensional virtual model sequence has a limb morphology that matches the target object in the corresponding image.
6. The method of claim 5, wherein said constructing an objective loss function from said point cloud coordinates of different scales and said predicted three-dimensional parameters comprises:
acquiring point cloud labels, and constructing a first loss function according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales;
acquiring a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label;
and constructing an objective loss function according to the first loss function and the second loss function.
7. The method of claim 6, wherein the method further comprises:
acquiring a second three-dimensional label, and constructing a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object;
said constructing a target loss function from said first loss function and said second loss function, comprising:
and constructing an objective loss function according to the first loss function, the second loss function and the third loss function.
8. The method of claim 5, wherein the predicted three-dimensional parameters include three-dimensional pose parameters that are three-dimensional joint point coordinates of the first object; the method further comprises the steps of:
generating camera parameters corresponding to the first training image based on the point cloud coordinates of different scales;
converting the three-dimensional pose parameters into predicted two-dimensional pose parameters according to the camera parameters;
constructing a fourth loss function according to the predicted two-dimensional pose parameters and the corresponding two-dimensional pose labels;
the constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters comprises the following steps:
and constructing a target loss function according to the fourth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
9. The method of claim 5, wherein the method further comprises:
inputting a second training image into the reconstruction network to be trained to obtain predicted three-dimensional pose parameters of a second object in the second training image; the first training image and the second training image are images acquired in different environments;
acquiring a three-dimensional pose label corresponding to the second object, and constructing a fifth loss function according to the predicted three-dimensional pose parameters and the three-dimensional pose label;
the constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters comprises the following steps:
and constructing a target loss function according to the fifth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
10. A three-dimensional virtual model reconstruction apparatus, the apparatus comprising:
the image acquisition module is used for acquiring each frame of image containing a target object in the video of the target object, wherein the target object is provided with a movable limb;
the feature extraction module is used for extracting features of the image, and carrying out graph convolution processing on the extracted features to obtain point cloud coordinates with different scales;
the generation module is used for generating three-dimensional parameters of the target object according to the point cloud coordinates of the different scales; generating a three-dimensional parameter sequence based on the three-dimensional parameters of the target object in the image of each frame; acquiring corresponding moments of the images of each frame in the video to obtain a time sequence; filtering the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence;
the reconstruction module is used for generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence; the three-dimensional virtual model in the three-dimensional virtual model sequence has a limb morphology that matches the target object in the corresponding image.
11. The apparatus of claim 10, wherein the feature extraction module is further configured to perform feature extraction on the image through a feature extraction layer of a reconstruction network to obtain a corresponding feature map; carry out graph convolution processing on the feature map through a graph convolution layer of the reconstruction network to obtain point cloud features with different scales; and carry out regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.
12. The apparatus according to claim 10 or 11, characterized in that the apparatus further comprises:
the projection module is used for determining camera parameters corresponding to the image based on the point cloud coordinates of the different scales; and projecting the three-dimensional virtual model into a two-dimensional image according to camera parameters corresponding to the image.
13. The apparatus of claim 10, wherein the apparatus further comprises:
a projection module, configured to project the three-dimensional virtual model into the corresponding image to replace the target object in the image; and generating a target video based on the images of the frames in the video after replacing the target object.
14. A training apparatus for reconstructing a network, the apparatus comprising:
The training image acquisition module is used for acquiring a first training image of a first object; the first object has a movable limb;
the input module is used for extracting the characteristics of the first training image through a reconstruction network to be trained, and carrying out graph convolution processing on the extracted characteristics to obtain point cloud coordinates with different scales;
the prediction module is used for generating predicted three-dimensional parameters of the first object based on the point cloud coordinates of the different scales;
the construction module is used for constructing a target loss function according to the point cloud coordinates of the different scales and the predicted three-dimensional parameters;
the training module is used for training the reconstruction network to be trained based on the target loss function, and obtaining a trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for generating a three-dimensional virtual model sequence corresponding to a target object, the three-dimensional virtual model sequence is generated according to a filtered three-dimensional parameter sequence, the filtered three-dimensional parameter sequence is obtained by filtering the three-dimensional parameter sequence according to a time sequence, and the time sequence is formed based on the corresponding moment of each frame of image containing the target object in a video of the target object; the three-dimensional parameter sequence is generated based on three-dimensional parameters of the target object in each frame of the image, the three-dimensional parameters are generated according to point cloud coordinates of different scales, the point cloud coordinates of different scales are obtained by extracting features of the image and carrying out graph convolution processing on the extracted features; the three-dimensional virtual model in the three-dimensional virtual model sequence has a limb morphology that matches the target object in the corresponding image.
15. The apparatus of claim 14, wherein the construction module is further configured to obtain a point cloud label, and construct a first loss function according to the point cloud coordinates of different scales and the point cloud label of corresponding scales; acquiring a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label; and constructing an objective loss function according to the first loss function and the second loss function.
16. The apparatus of claim 15, wherein the construction module is further configured to obtain a second three-dimensional label, and construct a third loss function based on the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object; and construct an objective loss function according to the first loss function, the second loss function and the third loss function.
17. The apparatus of claim 14, wherein the predicted three-dimensional parameters comprise three-dimensional pose parameters that are three-dimensional joint point coordinates of the first object; the construction module is further configured to generate camera parameters corresponding to the first training image based on the point cloud coordinates of different scales; convert the three-dimensional pose parameters into predicted two-dimensional pose parameters according to the camera parameters; construct a fourth loss function according to the predicted two-dimensional pose parameters and the corresponding two-dimensional pose labels; and construct a target loss function according to the fourth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
18. The apparatus of claim 14, wherein the construction module is further configured to input a second training image into the reconstruction network to be trained, to obtain predicted three-dimensional pose parameters of a second object in the second training image; the first training image and the second training image are images acquired in different environments; acquire a three-dimensional pose label corresponding to the second object, and construct a fifth loss function according to the predicted three-dimensional pose parameters and the three-dimensional pose label; and construct a target loss function according to the fifth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.
19. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when the computer program is executed.
20. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 9.
CN202010400447.3A 2020-05-13 2020-05-13 Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium Active CN111598998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010400447.3A CN111598998B (en) 2020-05-13 2020-05-13 Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010400447.3A CN111598998B (en) 2020-05-13 2020-05-13 Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111598998A CN111598998A (en) 2020-08-28
CN111598998B true CN111598998B (en) 2023-11-07

Family

ID=72191269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010400447.3A Active CN111598998B (en) 2020-05-13 2020-05-13 Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111598998B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112150608A (en) * 2020-09-07 2020-12-29 鹏城实验室 Three-dimensional face reconstruction method based on graph convolution neural network
CN111815768B (en) * 2020-09-14 2020-12-18 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction method and device
CN112767534B (en) * 2020-12-31 2024-02-09 北京达佳互联信息技术有限公司 Video image processing method, device, electronic equipment and storage medium
WO2022147783A1 (en) * 2021-01-08 2022-07-14 中国科学院深圳先进技术研究院 Three-dimensional reconstruction method and apparatus for brain structure, and terminal device
CN112598790A (en) * 2021-01-08 2021-04-02 中国科学院深圳先进技术研究院 Brain structure three-dimensional reconstruction method and device and terminal equipment
CN112967397A (en) * 2021-02-05 2021-06-15 北京奇艺世纪科技有限公司 Three-dimensional limb modeling method and device, virtual reality equipment and augmented reality equipment
CN113079136B (en) * 2021-03-22 2022-11-15 广州虎牙科技有限公司 Motion capture method, motion capture device, electronic equipment and computer-readable storage medium
CN113763532B (en) * 2021-04-19 2024-01-19 腾讯科技(深圳)有限公司 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object
CN113628322B (en) * 2021-07-26 2023-12-05 阿里巴巴(中国)有限公司 Image processing, AR display and live broadcast method, device and storage medium
CN113706699B (en) * 2021-10-27 2022-02-08 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN114170379A (en) * 2021-11-30 2022-03-11 聚好看科技股份有限公司 Three-dimensional model reconstruction method, device and equipment
CN113952731B (en) * 2021-12-21 2022-03-22 广州优刻谷科技有限公司 Motion sensing game action recognition method and system based on multi-stage joint training
WO2023133675A1 (en) * 2022-01-11 2023-07-20 深圳先进技术研究院 Method and apparatus for reconstructing 3d image on the basis of 2d image, device, and storage medium
CN114913287B (en) * 2022-04-07 2023-08-22 北京拙河科技有限公司 Three-dimensional human body model reconstruction method and system
CN115442519B (en) * 2022-08-08 2023-12-15 珠海普罗米修斯视觉技术有限公司 Video processing method, apparatus and computer readable storage medium
CN116991298B (en) * 2023-09-27 2023-11-28 子亥科技(成都)有限公司 Virtual lens control method based on antagonistic neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017131672A1 (en) * 2016-01-27 2017-08-03 Hewlett Packard Enterprise Development Lp Generating pose frontalized images of objects
CN110798718A (en) * 2019-09-02 2020-02-14 腾讯科技(深圳)有限公司 Video recommendation method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018208791A1 (en) * 2017-05-08 2018-11-15 Aquifi, Inc. Systems and methods for inspection and defect detection using 3-d scanning
US10824862B2 (en) * 2017-11-14 2020-11-03 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017131672A1 (en) * 2016-01-27 2017-08-03 Hewlett Packard Enterprise Development Lp Generating pose frontalized images of objects
CN110798718A (en) * 2019-09-02 2020-02-14 腾讯科技(深圳)有限公司 Video recommendation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images; Nanyang Wang et al.; https://arxiv.org/pdf/1804.01654.pdf; pp. 1-16 *

Also Published As

Publication number Publication date
CN111598998A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
CN109285215B (en) Human body three-dimensional model reconstruction method and device and storage medium
US10679046B1 (en) Machine learning systems and methods of estimating body shape from images
CN109859296B (en) Training method of SMPL parameter prediction model, server and storage medium
CN111243093B (en) Three-dimensional face grid generation method, device, equipment and storage medium
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
WO2022001236A1 (en) Three-dimensional model generation method and apparatus, and computer device and storage medium
CN109684969B (en) Gaze position estimation method, computer device, and storage medium
CN113496507A (en) Human body three-dimensional model reconstruction method
WO2022143645A1 (en) Three-dimensional face reconstruction method and apparatus, device, and storage medium
CN114339409B (en) Video processing method, device, computer equipment and storage medium
US20240046557A1 (en) Method, device, and non-transitory computer-readable storage medium for reconstructing a three-dimensional model
JP2023545200A (en) Parameter estimation model training method, parameter estimation model training apparatus, device, and storage medium
CN113822982B (en) Human body three-dimensional model construction method and device, electronic equipment and storage medium
WO2024007478A1 (en) Three-dimensional human body modeling data collection and reconstruction method and system based on single mobile phone
US11928778B2 (en) Method for human body model reconstruction and reconstruction system
CN113808277A (en) Image processing method and related device
CN114783022B (en) Information processing method, device, computer equipment and storage medium
CN111862278A (en) Animation obtaining method and device, electronic equipment and storage medium
CN113593001A (en) Target object three-dimensional reconstruction method and device, computer equipment and storage medium
CN116631064A (en) 3D human body posture estimation method based on complementary enhancement of key points and grid vertexes
CN109166176B (en) Three-dimensional face image generation method and device
CN114913287B (en) Three-dimensional human body model reconstruction method and system
CN116452715A (en) Dynamic human hand rendering method, device and storage medium
CN108921908B (en) Surface light field acquisition method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40027305

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant