CN111598998A

CN111598998A - Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium

Info

Publication number: CN111598998A
Application number: CN202010400447.3A
Authority: CN
Inventors: 葛志鹏; 曹煊; 葛彦昊; 汪铖杰
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-05-13
Filing date: 2020-05-13
Publication date: 2020-08-28
Anticipated expiration: 2040-05-13
Also published as: CN111598998B

Abstract

The application relates to a three-dimensional virtual model reconstruction method, a three-dimensional virtual model reconstruction device, computer equipment and a storage medium. The method comprises the following steps: acquiring an image of a target object, the target object having a moving limb; extracting the features of the image, and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales; generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales; reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image. By adopting the method, the accuracy of the reconstruction of the three-dimensional virtual model can be improved.

Description

Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for reconstructing a three-dimensional virtual model, a computer device, and a storage medium.

Background

With the development of computer technology, Artificial Intelligence (AI) has emerged, which is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. Artificial intelligence is currently being studied and applied in a number of areas, for example, to achieve reconstruction of three-dimensional models through artificial intelligence. The reconstruction of the three-dimensional model is often applied to the aspects of virtual reality scenes, human body special effects, human body detection and the like.

The traditional three-dimensional virtual model reconstruction method is obtained by matching and aligning irregular point clouds of a depth map with a three-dimensional human body regular grid model. However, the result of matching and aligning in this way depends heavily on the quality of the depth map, and if the resolution of the depth map is low, the reconstructed three-dimensional virtual model is not accurate.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a three-dimensional virtual model reconstruction method, apparatus, computer device, and storage medium capable of improving reconstruction accuracy.

A method of three-dimensional virtual model reconstruction, the method comprising:

acquiring an image of a target object, the target object having a moving limb;

extracting the features of the image, and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales;

generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales;

reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.

A three-dimensional virtual model reconstruction apparatus, the apparatus comprising:

an image acquisition module for acquiring an image of a target object, the target object having a moving limb;

the characteristic extraction module is used for extracting the characteristics of the image and performing graph convolution processing on the extracted characteristics to obtain point cloud coordinates with different scales;

the generating module is used for generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales;

a reconstruction module for reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

acquiring an image of a target object, the target object having a moving limb;

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring an image of a target object, the target object having a moving limb;

According to the three-dimensional virtual model reconstruction method, the three-dimensional virtual model reconstruction device, the computer equipment and the storage medium, the image of the target object is obtained, the target object is provided with movable limbs, the image is subjected to feature extraction, the extracted features are subjected to image convolution processing to obtain point cloud coordinates of different scales, the three-dimensional parameters of the target object are generated according to the point cloud coordinates of different scales, and the three-dimensional parameters of the target object in the image can be accurately generated through image convolution. And reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object, wherein the three-dimensional virtual model has a limb shape matched with the target object in the image, so that the reconstruction accuracy of the three-dimensional virtual model is improved.

A training method of a reconstructed network, the method comprising:

acquiring a first training image of a first object; the first object has a moving limb;

extracting features of the first training image through a reconstruction network to be trained, and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales;

generating a predicted three-dimensional parameter of the first object based on the point cloud coordinates of different scales;

constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters;

training the reconstruction network to be trained based on the target loss function, and obtaining the trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with movable limbs in the image into a three-dimensional virtual model with matched limb forms with the object.

A training apparatus to reconstruct a network, the apparatus comprising:

a training image acquisition module for acquiring a first training image of a first object; the first object has a moving limb;

the input module is used for extracting the characteristics of the first training image through a reconstruction network to be trained and carrying out graph convolution processing on the extracted characteristics to obtain point cloud coordinates with different scales;

a prediction module for generating a predicted three-dimensional parameter of the first object based on the point cloud coordinates of different scales;

the construction module is used for constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters;

the training module is used for training the reconstruction network to be trained based on the target loss function, and obtaining the trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with movable limbs in the image into a three-dimensional virtual model with matched limb forms with the object.

According to the training method, the training device, the computer equipment and the storage medium for the reconstruction network, the first training image of the first object with the movable limbs is obtained, the characteristics of the first training image are extracted through the reconstruction network to be trained, graph convolution processing is carried out on the extracted characteristics, point cloud coordinates of different scales are obtained, predicted three-dimensional parameters of the first object are generated based on the point cloud coordinates of different scales, a target loss function is built according to the point cloud coordinates of different scales and the predicted three-dimensional parameters, the reconstruction network to be trained is trained based on the target loss function, and the trained reconstruction network is obtained when the training stopping condition is met, so that the trained reconstruction network can predict the three-dimensional parameters of the target object in the two-dimensional image more accurately. And accurately predicting the three-dimensional parameters of the target object in the two-dimensional image by using the trained reconstruction network, thereby accurately reconstructing the three-dimensional virtual model corresponding to the target object according to the three-dimensional parameters.

Drawings

FIG. 1 is a diagram of an application environment of a method for reconstructing a three-dimensional virtual model according to an embodiment;

FIG. 2 is a schematic flow chart illustrating a method for reconstructing a three-dimensional virtual model according to an embodiment;

FIG. 3 is a schematic flow chart illustrating steps of extracting features of an image and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales in one embodiment;

FIG. 4 is a schematic diagram of graph convolution in one embodiment;

FIG. 5 is a flowchart illustrating the steps of obtaining an image of a target object in one embodiment;

FIG. 6 is a schematic flow chart of a three-dimensional virtual model reconstruction method according to an embodiment;

FIG. 7(a) is a block diagram of a reconstructed three-dimensional virtual model in an embodiment;

FIG. 7(b) is a flowchart of reconstructing a three-dimensional virtual model corresponding to a target human body in a video in real time according to an embodiment;

FIG. 8 is a schematic flow chart diagram illustrating a training method for reconstructing a network according to an embodiment;

FIG. 9 is a flowchart of the steps for constructing a target loss function based on point cloud coordinates and predicted three-dimensional parameters at different scales in one embodiment;

FIG. 10 is a block diagram of a three-dimensional virtual model reconstruction of a human body in a two-dimensional image according to an embodiment;

FIG. 11 is a block diagram showing an example of a three-dimensional virtual model reconstructing apparatus;

FIG. 12 is a block diagram of a training apparatus for reconstructing a network according to an embodiment;

FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The three-dimensional virtual model reconstruction method provided by the application can be applied to the application environment shown in fig. 1. The terminal 102 acquires an image of a target object having a moving limb. The terminal sends the image to the server 104, the server 104 extracts the features of the image through a reconstruction network, and performs graph convolution processing on the extracted features to obtain point cloud coordinates of different scales. The server 104 generates three-dimensional parameters of the target object according to the point cloud coordinates of different scales through the reconstruction network. Then, the server 104 returns the three-dimensional parameters of the target object to the terminal 102. The terminal 102 reconstructs a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.

The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal 102 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication, and the application is not limited thereto.

In one embodiment, the three-dimensional virtual model reconstruction method can be applied to human body three-dimensional virtual model reconstruction, and comprises the following steps:

the terminal obtains a human body training image of a first object, and performs feature extraction and graph convolution layer processing on the human body training image to obtain point cloud coordinates of different scales in the human body training image. The terminal can generate three-dimensional parameters and camera parameters corresponding to the first object in the human body training image based on the point cloud coordinates of different scales. And the terminal takes the point cloud coordinates with different scales as point cloud labels with different scales and takes the three-dimensional parameters as a first three-dimensional human body label corresponding to the first object in the human body training image. The three-dimensional parameters comprise three-dimensional human body posture parameters and three-dimensional human body shape parameters, and the terminal takes the three-dimensional human body posture parameters as three-dimensional human body posture labels corresponding to the human body training images.

The terminal converts the three-dimensional human body posture parameters corresponding to the first object in the human body training image into two-dimensional human body posture parameters through the camera parameters, and the two-dimensional human body posture parameters are used as two-dimensional human body posture labels corresponding to the human body training image.

The terminal may acquire three-dimensional parameters of the first object through the motion capture device, the three-dimensional parameters also including three-dimensional human body posture parameters and three-dimensional human body shape parameters. And the terminal takes the three-dimensional parameters as a second three-dimensional human body label corresponding to the first object.

The terminal inputs the human body training image into a reconstruction network to be trained, and the training step of the reconstruction network comprises the following steps:

and the terminal extracts the characteristics of the human body training image through a characteristic extraction layer of the reconstruction network to be trained to obtain a corresponding characteristic diagram.

And the terminal carries out graph convolution processing on the characteristic graph through a graph convolution layer of the reconstructed network to obtain point cloud characteristics with different scales.

And the terminal performs regression processing on the point cloud characteristics of different scales through the graph convolution layer to obtain point cloud coordinates of different scales.

And the terminal generates a predicted three-dimensional parameter of a first object in the human body training image and a predicted camera parameter corresponding to the human body training image based on point cloud coordinates of different scales through the graph convolution layer. The predicted three-dimensional parameters comprise predicted three-dimensional human body posture parameters and predicted three-dimensional human body shape parameters.

And the terminal constructs a first loss function according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales.

And the terminal constructs a second loss function according to the predicted three-dimensional parameters and the first three-dimensional human body label.

And the terminal constructs a third loss function according to the predicted three-dimensional parameters and the second three-dimensional human body label.

And then, the terminal converts the predicted three-dimensional human body posture parameters into predicted two-dimensional human body posture parameters through the predicted camera parameters.

And the terminal constructs a fourth loss function according to the predicted two-dimensional human body posture parameters and the corresponding two-dimensional posture labels.

And then, the terminal acquires a human body training image of the second object and acquires three-dimensional human body posture parameters corresponding to the second object in the human body training image. And taking the three-dimensional human body posture parameter corresponding to the second object in the human body training image as a three-dimensional human body posture label.

And the terminal inputs the human body training image of the second object into a reconstruction network to be trained to obtain the predicted three-dimensional human body posture parameter of the second object in the human body training image. The human body training image of the first object is an image acquired outdoors, the human body training image of the second object is a human body image acquired indoors, and the second object and the first object can be the same object or different objects.

And constructing a fifth loss function on the terminal according to the predicted three-dimensional human body posture parameters and the three-dimensional human body posture labels corresponding to the second object in the human body training image.

And constructing a target loss function on the terminal according to the first loss function, the second loss function, the third loss function, the fourth loss function and the fifth loss function.

And training the reconstruction network to be trained by the terminal based on the target loss function, and obtaining the trained reconstruction network when the training stopping condition is met.

Then, the terminal uses the trained reconstruction network to reconstruct a three-dimensional virtual model corresponding to the human body in the image, and the method comprises the following steps:

the terminal obtains an image containing a target human body, and cuts the image containing the target human body by taking the target human body as a center to obtain a human body image containing the target human body with a preset size.

And the terminal inputs the human body image into the trained reconstruction network, and performs characteristic extraction on the image through a characteristic extraction layer of the reconstruction network to obtain a corresponding characteristic diagram.

And the terminal generates three-dimensional parameters of the target human body according to the point cloud coordinates of different scales.

The terminal reconstructs a three-dimensional virtual model of the target human body based on the three-dimensional parameters of the target human body to obtain a human body three-dimensional virtual model; the three-dimensional virtual model of the human body has a limb shape matched with a target human body in the human body image.

Next, the terminal projects the three-dimensional virtual model onto the virtual-reality motion-sensing game, and displays the three-dimensional virtual model on the virtual-reality motion-sensing game. And the user executes each operation of the virtual reality motion sensing game by controlling the three-dimensional virtual model in the virtual reality motion sensing game.

The trained reconstruction network is used for carrying out graph convolution processing on the human body image, three-dimensional human body parameters corresponding to a target human body in the human body image are quickly and accurately obtained, accurate reconstruction of a three-dimensional virtual model is achieved through the three-dimensional human body parameters, and reconstruction accuracy and reconstruction efficiency of the human body three-dimensional virtual model are improved.

In an embodiment, as shown in fig. 2, a three-dimensional virtual model reconstruction method is provided, which is described by taking the method as an example applied to the terminal in fig. 1, and includes the following steps:

step 202, an image of a target object is acquired, the target object having a moving limb.

Wherein, the target object is a human body or an animal needing to reconstruct a three-dimensional virtual model.

Specifically, the terminal may obtain an original image of a human or animal whose three-dimensional virtual model needs to be reconstructed, and perform preprocessing on the original image to obtain an image of the target object, and so on. The preprocessing may include operations such as cropping, resolution adjustment, image size scaling, brightness adjustment, and/or contrast adjustment. The original image and the image of the target object are two-dimensional images.

In the embodiment, the image can be a color image, and the color image has higher resolution and richer details for a depth map, and can be used for reconstructing a three-dimensional virtual model of a human body more finely.

In this embodiment, the terminal may obtain the corresponding image by directly shooting the target object, or may obtain the image corresponding to the target object from a local device, a network device, or a third device. The acquired image includes the target object.

And 204, extracting the features of the image, and performing graph convolution processing on the extracted features to obtain point cloud coordinates with different scales.

The graph convolution processing refers to performing convolution processing on a graph, and can be realized by a Graph Convolution Network (GCN). The GCN is a neural network that operates on the graph. The Point Cloud is a massive Point set which expresses the target space distribution and the target surface characteristics in the same space reference system, and after the space coordinates of each sampling Point on the surface of the object are obtained, the Point set is obtained and is called as the Point Cloud. In the present embodiment, the point cloud refers to a grid point of the target object surface.

Specifically, the terminal may perform feature extraction on an image of the target object to obtain features corresponding to the image, so as to obtain a feature map. And then, the terminal performs graph convolution processing on the feature map to obtain point cloud features with different scales. And the terminal performs convolution processing with the channel number being 3 on the point cloud characteristics with different scales to obtain point cloud coordinates with different scales. The point cloud coordinates are three-dimensional coordinates.

And step 206, generating three-dimensional parameters of the target object according to the point cloud coordinates of different scales.

Wherein the three-dimensional parameter is a Skinned Multi-person linear parameter (SMPL, Skinned Multi-PersonLinearModel). The SMPL parameters contain 6890 body surface points, and 24 joint points.

Specifically, the terminal can perform down-sampling and full-connection processing on point cloud coordinates of different scales to obtain three-dimensional parameters of the target object.

In this embodiment, when the target object is a human body, a three-dimensional model of the human body as a whole can be reconstructed by covering the human body with the multi-person linear parameters.

Step 208, reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.

Specifically, the three-dimensional parameters include three-dimensional pose parameters and three-dimensional body type parameters. The three-dimensional posture parameter is the joint point coordinate of the target object, and the three-dimensional body type parameter is the characteristic point coordinate of the surface of the target object. After the terminal obtains the three-dimensional posture parameters and the three-dimensional body type parameters corresponding to the target object in the image, a model is constructed in a three-dimensional space according to the three-dimensional coordinates corresponding to the three-dimensional posture parameters and the three-dimensional coordinates corresponding to the three-dimensional body type parameters, and therefore a three-dimensional virtual model is obtained.

In this embodiment, the three-dimensional virtual model can be applied to virtual reality motion sensing games, virtual fitting, virtual hair try and video special effect production, but is not limited thereto.

In the three-dimensional virtual model reconstruction method, the image of the target object is obtained, the target object has movable limbs, the image is subjected to feature extraction, the extracted features are subjected to graph convolution processing to obtain point cloud coordinates of different scales, the three-dimensional parameters of the target object are generated according to the point cloud coordinates of different scales, and the three-dimensional parameters of the target object in the image can be accurately generated through graph convolution. And reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object, wherein the three-dimensional virtual model has a limb shape matched with the target object in the image, so that the reconstruction accuracy of the three-dimensional virtual model is improved.

In one embodiment, as shown in fig. 3, the extracting features of the image and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales includes:

and step 302, performing feature extraction on the image through a feature extraction layer of the reconstruction network to obtain a corresponding feature map.

Specifically, the trained reconstruction network includes a feature extraction layer and a graph convolution layer. And the terminal inputs the image containing the target object into a feature extraction layer in the trained reconstruction network, and performs feature extraction on the image through the feature extraction layer to obtain a feature map corresponding to the image.

And 304, carrying out graph convolution processing on the feature graph through a graph convolution layer of the reconstructed network to obtain point cloud features with different scales.

Specifically, the terminal outputs the feature map corresponding to the image to a map convolution layer of a reconstruction network, and obtains feature expression output by each layer through processing of each layer of the map convolution layer. The feature expression output by each layer is the point cloud features with different scales.

In this embodiment, the feature extraction layer may be a ResNet50 network. The graph convolutional layer may be a GCN network.

The terminal acquires feature points and edge sets in the feature map through the map convolutional layer, and constructs an undirected graph according to the adjacent point set of each feature point and the connecting edge set between each feature point and each adjacent point. An undirected graph is composed of nodes, which can be feature points in a feature graph, and edges. And then, the terminal acquires the representation data of each node in the undirected graph through the graph convolution layer, calculates the distance between any two nodes, and takes the distance between any two nodes as the weight of the edge between the two nodes. And generating an adjacency matrix according to the weights. The graph volume layer carries out graph volume operation on the undirected graph through an activation function, and the graph volume operation comprises the following steps: and the terminal calculates the feature expressions of different scales according to the activation function, the characterization data of the nodes and the weight. The feature expressions of different scales are point cloud features of different scales.

For example, the terminal may calculate point cloud features of different scales corresponding to the image of the target object by the following formula (1):

where i and j represent nodes in an undirected graph,

for the feature expression of node i at the current level,

is the characteristic expression of the node i at the l-th layer; c. C_ijIs a normalization factor; n is a radical of_iIs a neighbor of node iGathering;

representing the weight of node j. σ denotes an activation function, which may be sigmoid or tanh.

And step 306, performing regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.

Specifically, after point cloud features of different scales are obtained through calculation, three-dimensional point cloud coordinates of different scales are obtained through convolution processing with the number of feature channels of the graph convolution layer being 3. The characteristic channels are x, y and z.

For example, the terminal may calculate point cloud features of different scales by the following formula:

wherein p is a point cloud three-dimensional coordinate of a scale, A is an adjacent matrix of nodes, A is a real symmetric matrix of N × N, W is a weight matrix,

is a scale of point cloud features.

In this embodiment, after the point cloud coordinates of different scales are obtained through calculation, the terminal can perform downsampling processing and full-link processing on the point cloud coordinates of different scales, and output three-dimensional parameters of the target object through the full-link layer.

It will be appreciated that the re-establishment network may be applied on a terminal or on a server. The server may be a cloud server. When the reconstruction network is applied to the cloud server, the terminal sends the image of the target object to the cloud server. And the cloud server outputs the three-dimensional parameters of the target object through the reconstruction network and returns the three-dimensional parameters to the terminal. The reconstruction network is applied to the cloud server, so that the storage space of the terminal can be saved.

Fig. 4 is a schematic diagram of graph convolution in an embodiment, and shows an undirected graph constructed from feature points and characterization data in a feature graph. And reconstructing a graph convolution layer in the network to obtain point cloud characteristics with different scales based on each node in the undirected graph and the characterization data of the nodes.

In this embodiment, the features of the image are extracted through the trained reconstruction network to obtain the key information of the image. And carrying out graph convolution processing on the extracted key information through a reconstruction network, converting the key information into point cloud features with different scales, carrying out regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales so as to accurately output the coordinates of the key feature information of each scale. And the characteristics are extracted and the multi-scale point cloud coordinates are output through a reconstruction network, so that the efficiency is high and the accuracy is high.

In one embodiment, the method further comprises: determining camera parameters corresponding to the image based on point cloud coordinates of different scales; and projecting the three-dimensional virtual model into a two-dimensional image according to the camera parameters corresponding to the image.

The camera parameters refer to parameters for establishing a geometric model of camera imaging. The camera parameters can be generally classified into an external reference (camera extrinsic matrix) and an internal reference (camera intrinsic matrix). The external parameters determine the position and orientation of the camera in a certain three-dimensional space, from which it can be determined how a real-world point (i.e. world coordinates) has undergone rotation and translation and then falls onto another real-world point (i.e. camera coordinates). The internal reference refers to parameters inside the camera, and how the real world point is converted into a pixel point through the lens of the camera, pinhole imaging and electronic conversion after the action of the external reference can be known according to the internal reference. For example, taking a human body as an example, the camera parameters may include a rotation matrix R corresponding to the orientation of the human body, and a translation matrix t mapped by the human body to the two-dimensional image coordinates. Further, a scaling factor may also be included. Wherein, the proportionality coefficient is an internal parameter, and the rotation matrix R and the translation matrix t are external parameters.

Specifically, the terminal carries out graph convolution processing on the feature graph through a graph convolution layer of the reconstructed network to obtain point cloud features of different scales. And performing regression processing on the point cloud characteristics of different scales through the graph convolution layer to obtain point cloud coordinates of different scales. And the terminal generates three-dimensional parameters of the target object and camera parameters corresponding to the image based on point cloud coordinates of different scales through the image convolution layer. And then, the terminal projects the three-dimensional virtual model from the three-dimensional space to the two-dimensional space according to the camera parameters to obtain a two-dimensional image.

In the embodiment, the terminal renders the three-dimensional virtual model into a two-dimensional image through the camera parameters. And the terminal performs inverse transformation on the two-dimensional image according to the cutting scaling information of the image of the target object to obtain a rendered two-dimensional image with the same size as the original image. The original image is an image before the image of the target object is cropped.

In the embodiment, the camera parameters corresponding to the image are determined based on the point cloud coordinates of different scales, and the three-dimensional virtual model is projected into the two-dimensional image according to the camera parameters corresponding to the image, so that the display mode is more attractive and visual. Moreover, the degree of coincidence between the two-dimensional image projected by the three-dimensional virtual model and the image of the target object can be visually displayed, and the three-dimensional virtual model can be visualized.

In one embodiment, acquiring an image of a target object comprises: acquiring an image containing a target object in a video of the target object;

the method further comprises the following steps: projecting a three-dimensional virtual model into the image to replace the target object in the image; and generating a target video based on the image of each frame in the video after replacing the target object.

Specifically, the terminal acquires a video corresponding to a target object of which the three-dimensional virtual model needs to be reconstructed, and acquires an image of the target object included in the video. And the terminal inputs the image containing the target object into a reconstruction network and outputs the three-dimensional parameters corresponding to the target object in the image through the reconstruction network. And the terminal reconstructs a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object in the image. Then, the terminal projects the three-dimensional virtual model of the target object into the image to replace the target object in the image, so as to obtain an image after replacing the target object.

And carrying out the same processing on each frame of image containing the target object in the video to obtain an image of each frame of image replacing the target object. And the terminal generates a target video according to the image of each frame in the video after replacing the target object.

Further, the terminal acquires a video corresponding to a target object needing to reconstruct the three-dimensional virtual model, and acquires each frame of image containing the target object in the video. And the terminal inputs each frame of image containing the target object into a reconstruction network and outputs the three-dimensional parameters corresponding to the target object in each frame of image through the reconstruction network. Then, for the three-dimensional parameters of the target object in each frame of image, the terminal reconstructs the three-dimensional virtual model based on the three-dimensional parameters to obtain the three-dimensional virtual model corresponding to the target object in each frame of image. And the terminal projects the three-dimensional virtual model to the corresponding image, and the three-dimensional virtual model is used for replacing a target object in the corresponding image to obtain the image containing the three-dimensional virtual model in each frame. Then, the terminal can replace the image of the corresponding target object in the video with the image of each frame containing the three-dimensional virtual model to obtain the target video.

In this embodiment, after the terminal obtains the three-dimensional virtual model corresponding to the target object in each frame of image, the three-dimensional virtual model is projected to be a corresponding two-dimensional image through the camera parameters. And then, the terminal replaces each corresponding frame image in the video by each two-dimensional image to obtain the target video.

In this embodiment, each frame in the video of the target object includes an image of the target object, and the three-dimensional parameter corresponding to the target object in each frame of the image is output in real time through the reconstruction network. And reconstructing based on the three-dimensional parameters to obtain a three-dimensional virtual model corresponding to the target object in each frame of image, projecting the three-dimensional virtual model corresponding to each image into the corresponding image to replace the target object in the corresponding image to obtain a target video, so that the three-dimensional virtual model can be projected into the application of a human-computer interaction somatosensory game or a short video, and the reality of a human-computer interaction in the somatosensory game or a three-dimensional virtual reality special effect in the short video is enhanced.

In one embodiment, as shown in FIG. 5, acquiring an image of a target object includes:

step 502, each frame of image containing the target object in the video of the target object is obtained.

Reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb shape matched with a target object in the image, and comprises:

step 504, a three-dimensional parameter sequence is generated based on the three-dimensional parameters of the target object in each frame of image.

Specifically, the terminal acquires a video corresponding to a target object of which the three-dimensional virtual model needs to be reconstructed, and acquires each frame of image containing the target object in the video to obtain an image sequence. And the terminal inputs the image sequence into a reconstruction network, and the reconstruction network performs feature extraction on each frame of image in the image sequence. And performing image convolution processing on the extracted features through a reconstruction network to obtain point cloud coordinates of different scales corresponding to each frame of image. And generating three-dimensional parameters respectively corresponding to the target object of each frame of image in the image sequence based on the point cloud coordinates of each frame of image respectively corresponding to different scales to obtain a three-dimensional parameter sequence corresponding to the target object.

Step 506, generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence; the three-dimensional virtual models in the sequence of three-dimensional virtual models have limb morphology matching the target object in the corresponding image.

Specifically, the terminal reconstructs to obtain each three-dimensional virtual model based on each three-dimensional parameter in the three-dimensional parameter sequence, and generates a three-dimensional virtual model sequence according to the sequencing order of each three-dimensional parameter in the three-dimensional parameter sequence. The three-dimensional virtual models in the three-dimensional virtual model sequence have limb shapes matched with the target objects in the corresponding images in the video.

Further, for each three-dimensional parameter in the three-dimensional parameter sequence, reconstructing each three-dimensional virtual model of the target object according to each three-dimensional parameter, thereby obtaining a three-dimensional virtual model sequence. Each three-dimensional virtual model in the sequence of three-dimensional virtual models has a limb morphology that matches a corresponding target object in the sequence of images.

In this embodiment, each three-dimensional virtual model is sequentially generated according to each three-dimensional parameter in the three-dimensional parameter sequence, so as to obtain a three-dimensional virtual model sequence corresponding to the three-dimensional parameter sequence.

In this embodiment, each frame of image including the target object in the video of the target object is obtained, and the three-dimensional parameters corresponding to the target object in each frame of image are output in real time through the reconstruction network, so that the three-dimensional virtual model of the target object in each frame of image is reconstructed in real time, a three-dimensional virtual model sequence is obtained, and the efficiency of reconstructing the three-dimensional virtual model is improved.

In one embodiment, as shown in fig. 6, after generating the three-dimensional parameter sequence corresponding to the target object in each frame image, the method further includes:

step 602, acquiring a corresponding time of each frame of image in the video to obtain a time sequence.

Specifically, the reconstruction network further includes a filtering layer. And the terminal acquires corresponding moments of each frame of image containing the target object in the video through a filtering layer in the reconstruction network, and sequences the moments according to the sequence of the moments to obtain a time sequence.

And step 604, filtering the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence.

Specifically, the terminal performs filtering processing on the three-dimensional parameter sequence through the filtering layer based on the time sequence, so as to realize smoothing of inter-frame transition and obtain the filtered three-dimensional parameter sequence.

Further, the filtering layer takes the three-dimensional parameters in the three-dimensional parameter sequence as the current three-dimensional parameters in sequence, obtains the previous three-dimensional parameters of the current three-dimensional parameters, and obtains the time corresponding to the current three-dimensional parameters and the time corresponding to the previous three-dimensional parameters. And performing filtering processing on the current three-dimensional parameter based on the previous three-dimensional parameter, the time corresponding to the previous three-dimensional parameter and the time corresponding to the current three-dimensional parameter. And the current three-dimensional parameter and the previous three-dimensional parameter after the filtering treatment realize transition smoothing in time. And obtaining the filtered three-dimensional parameter sequence according to the same processing mode. The filtering process may be bilateral filtering, gaussian filtering, conditional filtering, straight-through filtering, or random sampling consistency filtering, but is not limited thereto.

In this embodiment, the filter layer in the reconstruction network may be a noise reduction self-encoder, which implements a process of down-sampling and up-sampling. The network structure of the filter layer is that the input of the training process of the RNN structure filter layer of seq2seq is a three-dimensional parameter time sequence added with noise, and the output is the three-dimensional parameter time sequence after denoising. And the trained filtering layer receives the three-dimensional parameter time sequence corresponding to the discontinuous images of the plurality of frames and predicts and outputs the continuous three-dimensional parameter time sequence.

The generating of the three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence includes:

step 606, generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence.

Specifically, the terminal reconstructs a three-dimensional virtual model based on the filtered three-dimensional parameter sequence, and generates a three-dimensional virtual model sequence according to the sequence of the three-dimensional parameters in the three-dimensional parameter sequence. Further, for each three-dimensional parameter in the filtered three-dimensional parameter sequence, a corresponding three-dimensional virtual model is obtained according to each three-dimensional parameter through reconstruction, and therefore a three-dimensional virtual model sequence is obtained.

In this embodiment, the corresponding time of each frame of image in the video is obtained to obtain a time sequence, and the three-dimensional parameter sequence is filtered according to the time sequence to obtain a filtered three-dimensional parameter sequence, so that inter-frame smoothing of the three-dimensional parameters can be realized according to time correlation. And reconstructing the three-dimensional virtual model of the target object in each frame of image based on the filtered three-dimensional parameter sequence, so that the reconstruction network can output continuous three-dimensional virtual models corresponding to the target object, and smooth transition between every two three-dimensional virtual models is realized, thereby improving the reconstruction precision and the continuity of the three-dimensional virtual models.

In the present embodiment, when the target object in the color image is not on the front side, for example, when the color image is a side image and a back image of the target object, the inter-frame three-dimensional parameter time smoothing process can smooth the unstable angle parameter. The output result of the current frame image and the output results of the previous and the next frames tend to be consistent, and the stability of human body reconstruction can be improved.

Fig. 7(a) is a frame diagram of a three-dimensional virtual model in one embodiment. The terminal obtains original images of frames in the video, wherein the frames comprise the target human body, and an original image sequence is obtained. And the terminal inputs the image sequence into a reconstruction network, the reconstruction network carries out human body detection on the image sequence, and a target human body in the image is marked through a detection frame. Human body detection can be realized through a lightweight network ResNet-18. Then, the reconstruction network preprocesses each original image in the marked image sequence, namely, cuts out an image with a preset size by taking the target human body as the center of each original image. Then, for each frame of the cut image, the feature extraction layer of the reconstruction network extracts the features of the cut image and outputs the extracted features to the image convolution layer. And carrying out graph convolution processing on the characteristic graph by the graph convolution layer to obtain point cloud coordinates with different scales. The reconstruction network performs downsampling and full-connection processing on point cloud coordinates of different scales so as to output a three-dimensional parameter sequence and camera parameters corresponding to a target human body in an image sequence through a full-connection layer. Next, the three-dimensional parameter sequence is input into the filter layer. And the reconstruction network acquires the corresponding moment of each frame of image in the video to obtain a time sequence. And a filtering layer in the reconstruction network carries out filtering processing on the three-dimensional parameter sequence based on the time sequence to obtain a filtered three-dimensional parameter sequence.

And then, the terminal reconstructs each three-dimensional virtual model corresponding to the target human body through the filtered three-dimensional parameter sequence to obtain a three-dimensional virtual model sequence. Then, the terminal can render the three-dimensional virtual model sequence into a two-dimensional human body sequence according to the camera parameters, and replace the target human body in each cut corresponding image by using the two-dimensional human body sequence. Then, the terminal can perform inverse transformation on the image obtained after the replacement to generate a target image sequence with the same size as the original image sequence, and replace the original image sequence in the video with the target image sequence. Further, the original image of the original image sequence in the video is replaced by the target image in the target image sequence.

Fig. 7(b) is a flowchart of reconstructing a three-dimensional virtual model corresponding to a target human body in a video in real time in an embodiment. And the terminal acquires original images of frames in the video, including the target human body, and sequentially inputs the original images into the reconstruction network. And the reconstruction network sequentially carries out three-dimensional virtual model reconstruction on the target human body in the original image. Specifically, the reconstruction network sequentially uses the input original images as current frame images, performs human body detection on the current frame original images, and marks out a target human body in the images through a detection frame. And then, cutting the scaled image of the labeled current frame original image by taking the target human body as the center. And (4) extracting the characteristics of the cut and scaled image through a reconstruction network and carrying out image convolution processing, and outputting an SMPL parameter and a camera parameter. Then, the reconstruction network performs time smoothing processing (i.e. filtering processing) on the SMPL parameter corresponding to the current frame original image through the time of the current frame original image in the video and the time of the previous frame original image in the video, so as to obtain the filtered SMPL parameter. And the terminal reconstructs the three-dimensional virtual model based on the SMPL parameters to obtain the three-dimensional virtual model corresponding to the target human body in the current frame image. Then, the terminal renders the three-dimensional virtual model into a two-dimensional human body based on the camera parameters, and replaces the target human body in the cut and scaled image. And then, the terminal performs reverse cropping scaling on the replaced image according to the cropping scaling information to obtain an image with the same size as the original image of the current frame. And according to the same processing mode, obtaining three-dimensional virtual models which are output in sequence, thereby obtaining reverse-cutting scaled images which are output in sequence. The reconstruction network is light in weight, single-frame time consumption in the whole reconstruction process is about 20 milliseconds, and real-time reconstruction of the three-dimensional virtual model can be realized.

In an embodiment, as shown in fig. 8, a training method for reconstructing a network is provided, which is described by taking the method as an example applied to the terminal in fig. 1, and includes the following steps:

step 802, obtaining a first training image of a first object; the first object has a moving limb.

In particular, the terminal may acquire a first training image including a first object having a moving limb. For example the first object is a human or an animal. Further, the terminal may obtain the first training image by directly shooting the first object, or may obtain the first training image from a local or network or from a third device.

In this embodiment, the terminal may acquire any image, and screen out an image in which a human body or an animal exists in the any image, thereby obtaining the first training image.

In this example, the terminal may acquire a three-dimensional parameter corresponding to the target object in the first training image. The three-dimensional parameters include three-dimensional pose parameters and three-dimensional body type parameters. Then, the terminal may use the three-dimensional parameter as a three-dimensional label corresponding to the first training image, and use the three-dimensional pose parameter as a three-dimensional pose label. And the terminal can collect point cloud characteristics of the first training image and determine point cloud coordinates of different scales. And then, the terminal takes the point cloud coordinates with different scales as point cloud labels corresponding to the first training image.

In this embodiment, the terminal may acquire the three-dimensional parameters of the target object through the motion capture device. The terminal can set the three-dimensional parameters of the target object acquired by capturing as a label. Further, the terminal can set the corresponding three-dimensional parameters in the first training image as a first three-dimensional label, and set the three-dimensional parameters of the target object obtained by capturing and collecting as a second three-dimensional label.

And 804, extracting features of the first training image through a reconstruction network to be trained, and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales.

Specifically, the reconstruction network to be trained includes a feature extraction layer and a graph convolution layer. And the terminal inputs the first training image into a reconstruction network to be trained. And the feature extraction layer of the reconstruction network to be trained performs feature extraction on the output first training image to obtain a corresponding feature map. And then, carrying out graph convolution processing on the characteristic graph by the graph convolution layer of the reconstruction network to be trained to obtain point cloud coordinates with different scales.

In this embodiment, performing feature extraction on the first training image through a reconstruction network to be trained, and performing graph convolution processing on the extracted features to obtain point cloud coordinates of different scales, including:

extracting the features of the first training image through a feature extraction layer of a reconstruction network to obtain a corresponding feature map; carrying out graph convolution processing on the characteristic graph through a graph convolution layer of the reconstruction network to obtain point cloud characteristics with different scales; and performing regression processing on the point cloud characteristics with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.

Step 806, generating a predicted three-dimensional parameter of the first object based on the point cloud coordinates of different scales.

Specifically, the terminal can perform downsampling and full-connection processing on point cloud coordinates of different scales to obtain a predicted three-dimensional parameter corresponding to the first object.

And 808, constructing a target loss function according to the point cloud coordinates and the predicted three-dimensional parameters of different scales.

Specifically, the terminal acquires point cloud labels capable of acquiring different scales, and determines the difference between point cloud coordinates of different scales and point cloud labels of corresponding scales. The terminal can obtain the three-dimensional label and determine the difference between the three-dimensional label and the predicted three-dimensional parameter. The terminal can construct a loss function according to the difference between the point cloud coordinates and the point cloud labels and the difference between the predicted three-dimensional parameters and the three-dimensional labels.

Step 810, training the reconstruction network to be trained based on the target loss function, and obtaining the trained reconstruction network when the training stopping condition is met; the trained reconstruction network is used for reconstructing an object with movable limbs in an image into a three-dimensional virtual model with limb shapes matched with the object.

Specifically, the terminal trains the reconstruction network to be trained based on the target loss function. And adjusting parameters of the reconstructed network in the training process and continuing training until the reconstructed network meets the training stopping condition, so as to obtain the trained reconstructed network. The trained reconstruction network is used for reconstructing the object with the movable limb in the image into a three-dimensional virtual model with the limb shape matched with the object.

In this embodiment, the training stop condition may be that a loss error of the reconstructed network is less than or equal to a loss threshold, or that the number of iterations of the reconstructed network reaches a preset number of iterations.

For example, the loss error generated in each training is calculated through the target loss function, the parameters of the reconstruction network are adjusted based on the difference between the loss error and the loss threshold value, and the training is continued until the training is stopped under the training stopping condition, so that the trained reconstruction network is obtained.

And the terminal calculates the iteration times of the reconstructed network in the training process, and stops training when the iteration times of the terminal in the training process reach the preset iteration times to obtain the trained reconstructed network.

In this embodiment, a first training image of a first object with a moving limb is obtained, feature extraction is performed on the first training image through a reconstruction network to be trained, graph convolution processing is performed on the extracted features, point cloud coordinates of different scales are obtained, and three-dimensional parameters corresponding to the object in the image can be accurately generated through graph convolution. And constructing a loss function by combining point cloud coordinates and three-dimensional parameters of different scales. The reconstruction network to be trained is trained based on the target loss function, loss caused by factors of all aspects to the network can be integrated in the training process, so that loss of all aspects is reduced to the minimum through training, the trained reconstruction network is higher in precision and higher in generalization capability, and the trained reconstruction network can predict three-dimensional parameters of the target object in the two-dimensional image more accurately. And accurately predicting the three-dimensional parameters of the target object in the two-dimensional image by using the trained reconstruction network, thereby accurately reconstructing the three-dimensional virtual model corresponding to the target object according to the three-dimensional parameters.

In one embodiment, as shown in fig. 9, constructing the target loss function according to the point cloud coordinates and the predicted three-dimensional parameters at different scales includes:

and 902, acquiring point cloud labels, and constructing a first loss function according to point cloud coordinates of different scales and the point cloud labels of corresponding scales.

The point cloud labels are point cloud coordinates with different scales corresponding to the preset first training image.

Specifically, the terminal obtains point cloud labels of different scales corresponding to the first training image, calculates L2 norms according to point cloud coordinates of different scales output by the reconstruction network and the point cloud labels of corresponding scales, and sums the L2 norms of different scales to obtain a first loss function.

For example, the first loss function constructed by the terminal is as follows:

L_graph＝∑_i||f_i(Φ_i)-d_i(M(θ,β))||₂(1)

wherein phi_iRepresenting a characteristic expression of the i-th layer, f_i(Φ_i) Represents the point cloud coordinates of the i-th layer after the graph convolution processing, and M (theta, β) represents the point cloud label of the i-th layer d_i(M (θ, β) represents the point cloud label of the i-th layer that is down-sampled.

And 904, acquiring a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label.

The first three-dimensional label is a preset three-dimensional parameter corresponding to a first object in the first training image.

Specifically, the terminal obtains a first three-dimensional label corresponding to the first training image, determines an L2 norm between a predicted three-dimensional parameter output by the reconstruction network and the corresponding first three-dimensional label, and obtains a second loss function.

For example, the second loss function constructed by the terminal is as follows:

theta represents the three-dimensional attitude parameter predicted by the reconstructed network,

Representing three-dimensional attitude tags, i.e. true three-dimensional parameters β representing reconstructionThree-dimensional body type parameters predicted by network,

And (4) representing a three-dimensional body type label, namely a true body type parameter.

In this embodiment, supervised training for 3D joint points (i.e. three-dimensional pose parameters) requires global translational rotation. That is, the 3D joint point coordinates are originally coordinates in the world coordinate system, and the elimination of the global translation means that the world coordinates of the pelvic joint point are subtracted from the world coordinates of the 3D joint point to generate coordinates in the coordinate system centered on the pelvis. The training of the network is more stable, and the loss function is minimized by the aid of the Adam algorithm in the training optimizer until the reconstructed network is converged.

Step 906, constructing an objective loss function according to the first loss function and the second loss function.

Specifically, the terminal may obtain a weight corresponding to the first loss function, and a weight corresponding to the second loss function. And the terminal multiplies the first loss function by the corresponding weight, multiplies the second loss function by the corresponding weight, and sums up the multiplication results to obtain the target loss function.

In the embodiment, a first loss function is constructed according to point cloud coordinates of different scales and point cloud labels of corresponding scales, a first three-dimensional label corresponding to a first training image is obtained, a second loss function is constructed according to predicted three-dimensional parameters and the first three-dimensional label, a target loss function is constructed according to the first loss function and the second loss function, the constructed target loss function can be constructed on the basis of two aspects of point cloud coordinates of different scales and predicted three-dimensional parameters obtained through image prediction, the constructed target loss function is more accurate, and therefore a reconstructed network obtained through training is more accurate.

In one embodiment, the method further comprises: acquiring a second three-dimensional label, and constructing a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object;

constructing a target loss function from the first loss function and the second loss function, comprising: and constructing a target loss function according to the first loss function, the second loss function and the third loss function.

The second three-dimensional tag is a three-dimensional parameter acquired by motion capture of the first object through the motion capture device, and the three-dimensional parameter comprises a three-dimensional posture parameter and a three-dimensional body type parameter acquired by motion capture of the first object through the motion capture device.

Specifically, the reconstruction network in the training process further comprises a generation countermeasure layer, the terminal obtains a second three-dimensional label, and the second three-dimensional label and the predicted three-dimensional parameters are input to generate the countermeasure layer. And the generation countermeasure layer judges the input predicted three-dimensional parameters and the second three-dimensional label, and the judgment result is true or false. For example, generating the countermeasure network output 1 indicates that the predicted three-dimensional parameter is true, and output 0 indicates that the predicted three-dimensional parameter is false. It can also be set according to the requirement, and 1 is used to represent that the predicted three-dimensional parameter is false, and 0 is used to represent that the predicted three-dimensional parameter is true.

In this embodiment, the second three-dimensional label is true in the determination result of the generation of the countermeasure layer output. It can be understood that all the preset labels are true in generating the discrimination result output by the countermeasure layer.

And the terminal constructs a third loss function according to the judgment result corresponding to the predicted three-dimensional parameter output by the generated countermeasure network and the judgment result corresponding to the second three-dimensional label.

Further, the terminal calculates the negative logarithm of the discrimination result corresponding to the predicted three-dimensional parameter, and calculates the expectation of the negative logarithm. And the terminal calculates the negative logarithm of the difference value of the discrimination results corresponding to the 1 and the second three-dimensional label, and calculates the expectation of the negative logarithm. The terminal sums the two expectations to obtain a third loss function.

For example, the terminal constructs a third loss function as follows:

the generation of the countermeasure network takes the log loss:

wherein x is_rRepresenting a second three-dimensional tag, D (x)_r) To representAnd generating a discrimination result of the discriminator in the countermeasure network on the first three-dimensional label.

The expected value of the negative logarithm of the discriminator output. x is the number of_fRefers to the predicted three-dimensional parameter, D (x), of the reconstructed network output_f) And representing the discrimination result of the discriminator on the predicted three-dimensional parameter.

Next, the terminal may obtain a weight corresponding to the first loss function, a weight corresponding to the second loss function, and a weight corresponding to the third loss function. And the terminal multiplies the first loss function by the corresponding weight, multiplies the second loss function by the corresponding weight, multiplies the third loss function by the corresponding weight, and sums up the 3 multiplication results to obtain the target loss function.

In this embodiment, the network is reconstructed, and the generation countermeasure layer is further included to discriminate the predicted three-dimensional parameter and the second three-dimensional tag, so that a loss function can be constructed for the three-dimensional parameter obtained by the predicted image and the three-dimensional data directly acquired through the real human body, so as to determine whether the predicted three-dimensional parameter output by the network conforms to the real situation. And constructing a loss function based on the difference between the judgment result of the predicted three-dimensional parameter and the judgment result of the second three-dimensional label, and constructing three loss functions through three factors to obtain a target loss function. The target loss function integrates characteristics in multiple aspects, so that the accuracy of the reconstructed network obtained through training is higher.

In one embodiment, the method further comprises: generating camera parameters corresponding to the first training image based on point cloud coordinates of different scales; converting the three-dimensional attitude parameters into predicted two-dimensional attitude parameters according to the camera parameters; constructing a fourth loss function according to the predicted two-dimensional attitude parameters and the corresponding two-dimensional attitude tags;

constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters, wherein the method comprises the following steps: and constructing a target loss function according to the fourth loss function, the point cloud coordinates with different scales and the predicted three-dimensional parameters.

The three-dimensional posture parameter is a three-dimensional joint point coordinate corresponding to the first object.

Specifically, the reconstruction network generates a predicted three-dimensional parameter corresponding to a first object in a first training image based on point cloud coordinates of different scales, and generates a predicted camera parameter corresponding to the first training image. The predicted three-dimensional parameters comprise predicted three-dimensional attitude parameters. And then, the terminal can map the predicted three-dimensional attitude parameters from the three-dimensional space to the two-dimensional space according to the predicted camera parameters to obtain predicted two-dimensional attitude parameters corresponding to the predicted three-dimensional attitude parameters. The predicted two-dimensional pose parameters refer to predicted two-dimensional joint coordinates of the first object.

The terminal can determine an L2 norm between the predicted two-dimensional pose parameter and the corresponding two-dimensional pose label to obtain a fourth loss function.

For example, the fourth loss function constructed by the terminal is as follows:

the two-dimensional attitude loss is the loss of the joint point after being projected by the camera parameters and marked by a truth value:

wherein, П_cAnd the predicted three-dimensional attitude parameters output by the reconstruction network are represented by predicted two-dimensional attitude parameters obtained by projection of predicted camera parameters, namely predicted 2D joint points obtained by projection of predicted 3D joint points.

Is a two-dimensional posture label, namely a 2D joint point truth value.

And then, the terminal acquires point cloud labels, and a first loss function is constructed according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales. And the terminal acquires a first three-dimensional label corresponding to the first training image, and constructs a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label.

And constructing a target loss function according to the first loss function and the second loss function.

Next, the terminal may obtain a weight corresponding to the first loss function, a weight corresponding to the second loss function, and a weight corresponding to the fourth loss function. And the terminal multiplies the first loss function by the corresponding weight, multiplies the second loss function by the corresponding weight, multiplies the fourth loss function by the corresponding weight, and sums up the 3 multiplication results to obtain the target loss function.

In the embodiment, camera parameters corresponding to the first training image are generated based on point cloud coordinates of different scales; converting the three-dimensional attitude parameters into predicted two-dimensional attitude parameters according to the camera parameters; the method comprises the steps of constructing a fourth loss function according to predicted two-dimensional attitude parameters and corresponding two-dimensional attitude labels, constructing a target loss function according to the fourth loss function, point cloud coordinates of different scales and predicted three-dimensional parameters, and constructing the target loss function based on characteristics of the predicted two-dimensional attitude parameters, the point cloud coordinates of different scales and the predicted three-dimensional parameters of images, so that the constructed target loss function is more accurate, and a reconstructed network obtained by training is more accurate.

In one embodiment, the method further comprises: inputting a second training image into a reconstruction network to be trained to obtain a predicted three-dimensional attitude parameter of a second object in the second training image; the first training image and the second training image are images collected in different environments; acquiring a three-dimensional attitude tag corresponding to the second object, and constructing a fifth loss function according to the predicted three-dimensional attitude parameter and the three-dimensional attitude tag;

constructing a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters, wherein the method comprises the following steps: and constructing a target loss function according to the fifth loss function, the point cloud coordinates with different scales and the predicted three-dimensional parameters.

The first training image is an image acquired outdoors, and the second training image is an image acquired indoors. The first object in the first training image and the second object in the second training image may be the same object or may be different objects.

Specifically, the terminal inputs the second training image into the reconstruction network to be trained, and performs feature extraction on the second training image through a feature extraction layer in the reconstruction network to be trained to obtain a feature map of the second training image. Carrying out graph convolution processing on the feature graph of the second training image through a graph convolution layer in the reconstruction network to be trained to obtain point cloud features of different scales; and carrying out graph convolution processing on the point cloud characteristics of different scales through the graph convolution layer to obtain point cloud coordinates of different scales. And the terminal generates a predicted three-dimensional attitude parameter corresponding to a second object in a second training image through the graph convolution layer based on point cloud coordinates of different scales.

Then, the terminal obtains a three-dimensional attitude tag corresponding to the second object, determines an L2 norm between the predicted three-dimensional attitude parameter and the three-dimensional attitude tag, and obtains a fifth loss function.

For example, the fifth loss function constructed by the terminal is as follows:

the fifth loss function is the loss between the predicted three-dimensional pose parameters and the three-dimensional pose labels output by the network:

three-dimensional pose loss is the loss of the joint points and truth labels of the SMPL model:

wherein R is_θ(β) represents predicted three-dimensional posture parameters, i.e., 3D joint coordinates, among the predicted three-dimensional parameters output from the reconstruction network.

Representing three-dimensional pose tags, i.e. the real 3D joint coordinates.

And constructing a target loss function according to the fifth loss function, the first loss function and the second loss function.

Next, the terminal may obtain a weight corresponding to the first loss function, a weight corresponding to the second loss function, and a weight corresponding to the fifth loss function. And the terminal multiplies the first loss function by the corresponding weight, multiplies the second loss function by the corresponding weight, multiplies the fifth loss function by the corresponding weight, and sums up the 3 multiplication results to obtain the target loss function.

In this embodiment, a loss function is constructed based on three-dimensional attitude parameters obtained by predicting a reconstructed network and corresponding three-dimensional attitude tags, and a target loss function is constructed by combining three factors in the three aspects of predicting the three-dimensional parameters, point cloud coordinates of different scales and predicting the three-dimensional attitude parameters, so that losses generated by multiple factors in a training process of the reconstructed network can be integrated, the influence of the factors on predicting the reconstructed network is ensured to be minimum, and the reconstructed three-dimensional virtual model of the reconstructed network is more accurate.

In one embodiment, the method further comprises: inputting the third training image into a reconstruction network to be trained to obtain a corresponding predicted two-dimensional attitude parameter; constructing a fifth loss function according to the predicted two-dimensional attitude parameters and the corresponding two-dimensional attitude tags;

constructing a target loss function based on point cloud coordinates of different scales and predicted three-dimensional parameters, wherein the method comprises the following steps: and constructing a target loss function based on the fifth loss function, the point cloud coordinates with different scales and the predicted three-dimensional parameters.

Specifically, the third training image may be an image with a complex background, and the third training image includes a third object.

In one embodiment, inputting the third training image into a reconstruction network to be trained to obtain a corresponding predicted two-dimensional pose parameter includes:

inputting the third training image into a reconstruction network to be trained, and performing feature extraction on the third training image through a feature extraction layer in the reconstruction network to be trained to obtain a feature map of the third training image; performing regression processing on the feature map of the third training image through a map convolution layer in the reconstruction network to be trained to obtain point cloud features of different scales; performing regression processing on the point cloud characteristics of different scales through the graph convolution layer to obtain point cloud coordinates of different scales; generating predicted three-dimensional attitude parameters and camera parameters based on point cloud coordinates of different scales through the graph convolution layer; and converting the predicted three-dimensional attitude parameters into corresponding predicted two-dimensional attitude parameters through the camera parameters.

In one embodiment, the terminal may construct the target loss function according to the first loss function, the second loss function, the third loss function, the fourth loss function, the fifth loss function, and the weight parameters corresponding to the loss functions.

The terminal constructs an objective loss function as follows:

L_total＝λ₁L_j2d+λ₂L_j3d+λ₃L_graph+λ₄L_smpl+λ₅L_adv(6)

wherein λ is a weight parameter corresponding to each loss function.

The target loss function is constructed through the 5 loss functions, factors influencing network performance in more aspects are integrated, influence in all aspects is reduced to the minimum, and the network reconstruction is more accurate.

In one embodiment, the training dataset used in the training process to reconstruct the network comprises a 3DPW human parameters dataset (3D spots in the world dataset); 2D attitude data sets disclosed by Posetrack, PennyAction and the like, 3D attitude data sets (indoor acquired data sets) disclosed by Human3.6M, MPI _ INF _3DHP and the like, and an SMPL parameter data set acquired by Mosh, wherein SMPL human body parameters comprise attitude parameters and body type parameters, the attitude parameters have 72 dimensions, and the body type parameters have 10 dimensions; the attitude parameter is rotation information of 24 joint points, and the rotation information of each joint point is represented by a 3-dimensional axis angle vector, so that 24 × 3-dimensional parameters are shared. SMPL is a human parameter based on the epidermal transformation, which can be converted into a posture represented by a 72-dimensional vector and a body shape represented by a 10-dimensional vector. Mosh is a Motion And Shape Capture means (Motion And Shape Capture) human Motion And body Shape Capture dataset, relying on Motion Capture devices to Capture data of human surface points, resulting in a SMPL parameter dataset.

Fig. 10 is a frame diagram of a three-dimensional virtual model reconstruction of a human body in a two-dimensional image according to an embodiment. The terminal acquires a two-dimensional image containing a human body and inputs the two-dimensional image into a reconstruction network. And a feature extraction layer in the reconstruction network performs feature extraction on the two-dimensional image, and performs camera parameter regression based on the extracted features to obtain camera parameters corresponding to the two-dimensional image. Inputting the extracted features into a graph convolution layer to perform graph convolution layer processing, and performing regression processing on the point cloud features of each scale to obtain point cloud coordinates of different scales. And performing human body parameter regression on the graph volume layer based on point cloud coordinates of different scales to obtain three-dimensional parameters corresponding to the human body in the two-dimensional image. And (3) carrying out re-projection on the three-dimensional joint points (namely the three-dimensional attitude parameters) in the three-dimensional parameters to obtain two-dimensional joint points (namely the two-dimensional attitude parameters). And inputting the three-dimensional parameters and the corresponding three-dimensional labels into a generated countermeasure layer, and judging the three-dimensional parameters by a discriminator in the generated countermeasure layer to obtain a judgment result, wherein the judgment result is true or false. And constructing a target loss function according to the difference between the three-dimensional parameters and the corresponding three-dimensional labels, the difference between the point cloud coordinates of different scales and the point cloud labels of corresponding scales, the difference between the three-dimensional joint points and the corresponding labels, and the difference between the two-dimensional joint points and the corresponding labels, and training a reconstruction network based on the target loss function.

In one embodiment, there is provided a three-dimensional virtual model reconstruction method, including:

the server obtains a first training image of a first object. The first object has a limb that is active.

And then, the server extracts the features of the first training image through a reconstruction network to be trained, and performs graph convolution processing on the extracted features to obtain point cloud coordinates of different scales.

Next, the server generates predicted three-dimensional parameters and corresponding predicted camera parameters of the first object based on the point cloud coordinates of different scales.

And then, the server acquires point cloud labels, and a first loss function is constructed according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales.

Further, the server obtains a first three-dimensional label corresponding to the first training image, and a second loss function is constructed according to the predicted three-dimensional parameter and the first three-dimensional label.

And then, the server acquires the second three-dimensional label and constructs a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label. The second three-dimensional tag is a three-dimensional parameter acquired by motion capture of the first object.

The server then converts the three-dimensional pose parameters to predicted two-dimensional pose parameters based on the camera parameters.

Further, the server constructs a fourth loss function according to the predicted two-dimensional attitude parameters and the corresponding two-dimensional attitude tags.

And then, the server inputs the second training image into a reconstruction network to be trained to obtain the predicted three-dimensional posture parameter of the second object in the second training image. The first training image and the second training image are images acquired in different environments.

And then, the server acquires a three-dimensional attitude tag corresponding to the second object, and constructs a fifth loss function according to the predicted three-dimensional attitude parameter and the three-dimensional attitude tag.

And the server constructs a target loss function according to the first loss function, the second loss function, the third loss function, the fourth loss function, the fifth loss function and the weight parameters corresponding to the loss functions.

Further, the server trains the reconstruction network to be trained based on the target loss function, and the trained reconstruction network is obtained when the training stopping condition is met. And the trained reconstruction network is used for reconstructing the object with the movable limb in the image into a three-dimensional virtual model with the matched limb shape with the object.

And applying the trained reconstruction network to a terminal, and acquiring images of each frame in the video of the target object, which contain the target object, by the terminal, wherein the target object has movable limbs.

And then, the terminal extracts the features of each frame of image through the trained feature extraction layer of the reconstruction network to obtain a feature map corresponding to each frame of image.

Further, the terminal carries out image convolution processing on the feature map of each frame of image through an image convolution layer of a reconstruction network to obtain point cloud features of different scales corresponding to each frame of image.

And then, the terminal carries out regression processing on the point cloud characteristics of different scales respectively corresponding to each frame of image through the image convolution layer to obtain point cloud coordinates of different scales respectively corresponding to each frame of image.

And then, the terminal generates three-dimensional parameters of the target object in each frame of image and corresponding camera parameters according to the point cloud coordinates of different scales corresponding to each frame of image.

Further, the terminal reconstructs a three-dimensional virtual model of the target object in each frame image based on the three-dimensional parameters of the target object, wherein the three-dimensional virtual model has a limb shape matched with the target object in the corresponding image.

Further, the terminal projects the three-dimensional virtual model into a two-dimensional image according to camera parameters corresponding to the image, and replaces a target object in an object image in the video according to the two-dimensional image to obtain a target video.

In this embodiment, the training process of reconstructing the network has a high requirement on the device graphics card, and can be completed on the server. For the three-dimensional parameter data set, supervised training can be performed by predicting three-dimensional attitude parameters and corresponding labels. In order to improve the generalization capability of the reconstruction network, a two-dimensional attitude data set and a three-dimensional attitude data set are added for semi-supervised training. In order to improve the quality of the output result of the reconstructed network, a structure factorization discriminator discriminates the data of the network prediction. In order to improve the accuracy of the network, multi-scale point cloud coordinate supervision is added into a training loss function. In order to improve the smoothness of the inter-frame result of the video image, the single-frame result is smoothed by time. In the training process, various factors are considered, so that the influence of various aspects is reduced to the minimum through training, and the accuracy of the reconstructed network is improved.

The two-dimensional image of the target object to be reconstructed is predicted through the trained reconstruction network, the three-dimensional parameters of the target object can be accurately obtained, and therefore the three-dimensional virtual model can be accurately constructed according to the three-dimensional parameters.

It should be understood that although the various steps in the flowcharts of fig. 2-10 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least some of the steps in fig. 2-10 may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed in turn or alternately with other steps or at least some of the other steps or stages.

In one embodiment, as shown in fig. 11, there is provided a three-dimensional virtual model reconstruction apparatus, which may be a part of a computer device using a software module or a hardware module, or a combination of the two modules, and specifically includes: an image acquisition module 1102, a feature extraction module 1104, a generation module 1106, and a reconstruction module 1108, wherein:

an image acquisition module 1102 for acquiring an image of a target object, the target object having a moving limb.

And the feature extraction module 1104 is configured to perform feature extraction on the image, and perform graph convolution processing on the extracted features to obtain point cloud coordinates of different scales.

A generating module 1106, configured to generate three-dimensional parameters of the target object according to the point cloud coordinates of different scales.

A reconstruction module 1108 for reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology that matches the target object in the image.

In this embodiment, an image of a target object having a moving limb is obtained, feature extraction is performed on the image, graph convolution processing is performed on the extracted features, point cloud coordinates of different scales are obtained, three-dimensional parameters of the target object are generated according to the point cloud coordinates of different scales, and the three-dimensional parameters of the target object in the image can be accurately generated through graph convolution. And reconstructing a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object, wherein the three-dimensional virtual model has a limb shape matched with the target object in the image, so that the reconstruction accuracy of the three-dimensional virtual model is improved.

In one embodiment, the feature extraction module 1104 is configured to: extracting the features of the image through a feature extraction layer of a reconstruction network to obtain a corresponding feature map; carrying out graph convolution processing on the characteristic graph through a graph convolution layer of a reconstruction network to obtain point cloud characteristics with different scales; and performing regression processing on the point cloud characteristics of different scales through the graph convolution layer to obtain point cloud coordinates of different scales.

In this embodiment, the features of the image are extracted through the trained reconstruction network to obtain the key information of the image. And carrying out graph convolution processing on the extracted key information through a reconstruction network, converting the key information into point cloud features with different scales, carrying out regression processing on the point cloud features with different scales through the graph convolution layer to obtain point cloud coordinates with different scales so as to accurately output the coordinates of the key feature information of each scale.

In one embodiment, the apparatus further comprises: a projection module to: determining camera parameters corresponding to the image based on point cloud coordinates of different scales; and projecting the three-dimensional virtual model into a two-dimensional image according to the camera parameters corresponding to the image.

In one embodiment, the image acquisition module 1102 is further configured to: acquiring an image containing a target object in a video of the target object;

the device also includes: a projection module, the projection module further configured to: projecting the three-dimensional virtual model into the image to replace the target object in the image; and generating a target video based on the image of each frame in the video after replacing the target object.

In one embodiment, the image acquisition module 1102 is further configured to: acquiring each frame of image containing a target object in a video of the target object;

the generation module 1106 is further configured to: generating a three-dimensional parameter sequence based on the three-dimensional parameters of the target object in each frame of image;

the reconstruction module 1108 is further configured to: and generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence.

In this embodiment, each frame in the video of the target object includes an image of the target object, and a three-dimensional parameter sequence corresponding to the target object in each frame image is generated according to point cloud coordinates of different scales corresponding to each frame image, so that three-dimensional parameters corresponding to the target object in each frame image can be output in real time through a reconstruction network, a three-dimensional virtual model of the target object in each frame image is reconstructed in real time, and the efficiency of reconstructing the three-dimensional virtual model is improved.

In one embodiment, the generation module 1106 is further configured to: acquiring corresponding moments of each frame of image in a video to obtain a time sequence; filtering the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence;

the reconstruction module 1108 is further configured to: generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence; the three-dimensional virtual models in the sequence of three-dimensional virtual models have limb morphology matching the target object in the corresponding image.

For specific limitations of the three-dimensional virtual model reconstruction device, reference may be made to the above limitations of the three-dimensional virtual model reconstruction method, which are not described herein again. The modules in the three-dimensional virtual model reconstruction device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, as shown in fig. 12, there is provided a training apparatus for reconstructing a network, which may be a part of a computer device using a software module or a hardware module, or a combination of the two modules, and specifically includes: a training image acquisition module 1202, an input module 1204, a prediction module 1206, a construction module 1208, and a training module 1210, wherein:

a training image acquisition module 1202 for acquiring a first training image of a first object; the first object has a moving limb.

An input module 1204, configured to perform feature extraction on the first training image through a reconstruction network to be trained, and perform graph convolution processing on the extracted features to obtain point cloud coordinates of different scales.

A prediction module 1206 for generating a predicted three-dimensional parameter of the first object based on the point cloud coordinates of the different scales.

A constructing module 1208, configured to construct a target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameter.

A training module 1210, configured to train the reconstructed network to be trained based on the target loss function, and obtain a trained reconstructed network when a training stop condition is met; the trained reconstruction network is used for reconstructing an object with movable limbs in the image into a three-dimensional virtual model with limb shapes matched with the object.

In one embodiment, the building module 1208 is further configured to: acquiring point cloud labels, and constructing a first loss function according to point cloud coordinates of different scales and the point cloud labels of corresponding scales; acquiring a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label; and constructing a target loss function according to the first loss function and the second loss function.

In one embodiment, the building module 1208 is further configured to: acquiring a second three-dimensional label, and constructing a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by capturing the motion of the first object; and constructing a target loss function according to the first loss function, the second loss function and the third loss function.

In this embodiment, reconstructing the network further includes generating a countermeasure layer to discriminate the predicted three-dimensional parameter and the second three-dimensional label, so as to determine whether the predicted three-dimensional parameter output by the network conforms to a real situation. And constructing a loss function based on the difference between the judgment result of the predicted three-dimensional parameter and the judgment result of the second three-dimensional label, and constructing three loss functions through three factors to obtain a target loss function. The target loss function integrates characteristics in multiple aspects, so that the accuracy of the reconstructed network obtained through training is higher.

In one embodiment, the building module 1208 is further configured to: generating camera parameters corresponding to the first training image based on point cloud coordinates of different scales; converting the three-dimensional attitude parameters into predicted two-dimensional attitude parameters through the camera parameters; constructing a fourth loss function according to the predicted two-dimensional attitude parameters and the corresponding two-dimensional attitude tags; and constructing a target loss function according to the fourth loss function, the point cloud coordinates with different scales and the predicted three-dimensional parameters.

In one embodiment, the building module 1208 is further configured to: inputting a second training image into a reconstruction network to be trained to obtain a predicted three-dimensional attitude parameter of a second object in the second training image; the first training image and the second training image are images collected in different environments; acquiring a three-dimensional attitude tag corresponding to the second object, and constructing a fifth loss function according to the predicted three-dimensional attitude parameter and the three-dimensional attitude tag; and constructing a target loss function according to the fifth loss function, the point cloud coordinates with different scales and the predicted three-dimensional parameters.

For specific limitations of the training apparatus for reconstructing the network, reference may be made to the above limitations of the training method for reconstructing the network, which are not described herein again. The modules in the training apparatus for reconstructing a network may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing training data of a reconstruction network and reconstruction data of a three-dimensional virtual model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a training method for reconstructing a network and a three-dimensional virtual model reconstruction method.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a training method for reconstructing a network and a three-dimensional virtual model reconstruction method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of reconstructing a three-dimensional virtual model, the method comprising:

acquiring an image of a target object, the target object having a moving limb;

2. The method of claim 1, wherein the extracting features from the image and performing a graph convolution process on the extracted features to obtain point cloud coordinates of different scales comprises:

extracting the features of the image through a feature extraction layer of a reconstruction network to obtain a corresponding feature map;

carrying out graph convolution processing on the characteristic graph through a graph convolution layer of the reconstruction network to obtain point cloud characteristics with different scales;

and performing regression processing on the point cloud characteristics with different scales through the graph convolution layer to obtain point cloud coordinates with different scales.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

determining camera parameters corresponding to the image based on the point cloud coordinates of different scales;

and projecting the three-dimensional virtual model into a two-dimensional image according to the camera parameters corresponding to the image.

4. The method of claim 1, wherein the acquiring an image of a target object comprises: acquiring an image containing a target object in a video of the target object;

the method further comprises the following steps:

projecting the three-dimensional virtual model into the image to replace the target object in the image;

and generating a target video based on the image of each frame in the video after replacing the target object.

5. The method of claim 1, wherein the acquiring an image of a target object comprises:

acquiring each frame of image containing a target object in a video of the target object;

reconstructing, by the computing device, a three-dimensional virtual model of the target object based on the three-dimensional parameters of the target object; the three-dimensional virtual model has a limb morphology matching the target object in the image, including:

generating a three-dimensional parameter sequence based on the three-dimensional parameters of the target object in each frame of image;

generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence; the three-dimensional virtual models in the three-dimensional virtual model sequence have limb morphology matched with the target object in the corresponding image.

6. The method of claim 5, wherein after the generating of the three-dimensional parameter sequence corresponding to the target object in each frame of the image, the method further comprises:

acquiring the corresponding time of each frame of the image in the video to obtain a time sequence;

filtering the three-dimensional parameter sequence according to the time sequence to obtain a filtered three-dimensional parameter sequence;

generating a three-dimensional virtual model sequence corresponding to the target object according to the three-dimensional parameter sequence, wherein the three-dimensional virtual model sequence comprises the following steps:

and generating a three-dimensional virtual model sequence corresponding to the target object according to the filtered three-dimensional parameter sequence.

7. A training method for reconstructing a network, the method comprising:

8. The method of claim 7, wherein constructing an objective loss function from the point cloud coordinates of different scales and the predicted three-dimensional parameters comprises:

acquiring point cloud labels, and constructing a first loss function according to the point cloud coordinates of different scales and the point cloud labels of corresponding scales;

acquiring a first three-dimensional label corresponding to the first training image, and constructing a second loss function according to the predicted three-dimensional parameter and the first three-dimensional label;

9. The method of claim 8, further comprising:

acquiring a second three-dimensional label, and constructing a third loss function according to the predicted three-dimensional parameter and the second three-dimensional label; the second three-dimensional label is a three-dimensional parameter acquired by performing motion capture on the first object;

the constructing a target loss function according to the first loss function and the second loss function includes:

and constructing a target loss function according to the first loss function, the second loss function and the third loss function.

10. The method of claim 7, further comprising:

generating camera parameters corresponding to the first training image based on the point cloud coordinates of different scales;

converting the three-dimensional attitude parameters into predicted two-dimensional attitude parameters according to the camera parameters;

constructing a fourth loss function according to the predicted two-dimensional attitude parameters and the corresponding two-dimensional attitude tags;

the constructing of the target loss function according to the point cloud coordinates of different scales and the predicted three-dimensional parameters comprises the following steps:

and constructing a target loss function according to the fourth loss function, the point cloud coordinates with different scales and the predicted three-dimensional parameters.

11. The method of claim 7, further comprising:

inputting a second training image into the reconstruction network to be trained to obtain a predicted three-dimensional posture parameter of a second object in the second training image; the first training image and the second training image are images acquired in different environments;

acquiring a three-dimensional attitude tag corresponding to the second object, and constructing a fifth loss function according to the predicted three-dimensional attitude parameter and the three-dimensional attitude tag;

and constructing a target loss function according to the fifth loss function, the point cloud coordinates of different scales and the predicted three-dimensional parameters.

12. An apparatus for reconstructing a three-dimensional virtual model, the apparatus comprising:

13. A training apparatus for reconstructing a network, the apparatus comprising:

14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.

15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.