CN113379896A - Three-dimensional reconstruction method and device, electronic equipment and storage medium - Google Patents

Three-dimensional reconstruction method and device, electronic equipment and storage medium

Info

Publication number
CN113379896A
Authority
CN
China
Prior art keywords
image
parameter
network
target object
training image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110660956.4A
Other languages
Chinese (zh)
Inventor
曹智杰
汪旻
刘文韬
钱晨
马利庄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202110660956.4A
Publication of CN113379896A
Status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The disclosure relates to a three-dimensional reconstruction method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring an image to be processed corresponding to a target object; predicting the image to be processed by utilizing a first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed, wherein the first neural network is initialized in advance based on a second neural network and is obtained by training according to a first image sequence corresponding to the target object, and the second neural network is obtained by training in advance according to a plurality of second image sequences corresponding to a plurality of objects; and generating a first three-dimensional model of the target object in the image to be processed according to the first posture parameter and the first shape parameter.

Description

Three-dimensional reconstruction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a three-dimensional reconstruction method and apparatus, an electronic device, and a storage medium.
Background
Three-dimensional reconstruction refers to recovering a three-dimensional model of a three-dimensional object from a two-dimensional image. The three-dimensional model obtained by three-dimensional reconstruction can be used for computer representation, processing, display and the like. Three-dimensional reconstruction is an important problem in fields such as computer vision and computer graphics, and has wide application in fields such as augmented reality and virtual reality. Improving the accuracy of three-dimensional reconstruction is therefore of great significance.
Disclosure of Invention
The present disclosure provides a three-dimensional reconstruction technique.
According to an aspect of the present disclosure, there is provided a three-dimensional reconstruction method including:
acquiring an image to be processed corresponding to a target object;
predicting the image to be processed by utilizing a first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed, wherein the first neural network is initialized in advance based on a second neural network and is obtained by training according to a first image sequence corresponding to the target object, and the second neural network is obtained by training in advance according to a plurality of second image sequences corresponding to a plurality of objects;
and generating a first three-dimensional model of the target object in the image to be processed according to the first posture parameter and the first shape parameter.
A second neural network (namely a universal network) is obtained through training according to a plurality of second image sequences corresponding to a plurality of objects, then the first neural network is initialized based on the second neural network, and the first neural network is trained according to the first image sequence corresponding to the target object, so that the first neural network (namely an exclusive network) which is more suitable for the target object and can capture the exclusive characteristics of the target object can be obtained. The first neural network obtained through training is adopted to process the image to be processed corresponding to the target object, and the accuracy of the first posture parameter and the first shape parameter corresponding to the target object obtained through prediction can be improved. And generating a three-dimensional model of the target object in the image to be processed based on the first posture parameter and the first shape parameter, so that the accuracy of the generated three-dimensional model can be improved.
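The following Python sketch illustrates this inference flow under stated assumptions: first_neural_network and smpl_model are hypothetical callables standing in for the trained exclusive network and for a parametric body model such as SMPL; the disclosure does not fix any concrete API.

```python
import torch

def reconstruct_target_object(image_to_be_processed: torch.Tensor,
                              first_neural_network,
                              smpl_model):
    """Minimal sketch of the claimed flow: predict parameters, then build the model.

    `first_neural_network` is assumed to return the first posture parameter and
    the first shape parameter of the target object; `smpl_model` is assumed to
    map them to a 3D mesh. Both names are illustrative, not part of the disclosure.
    """
    with torch.no_grad():
        first_posture_param, first_shape_param = first_neural_network(image_to_be_processed)
    # Generate the first three-dimensional model of the target object from the
    # predicted posture and shape parameters (e.g. via an SMPL-style model).
    first_three_dimensional_model = smpl_model(first_posture_param, first_shape_param)
    return first_three_dimensional_model
```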
In a possible implementation manner, the predicting the image to be processed by using the first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed includes:
extracting a first feature of the image to be processed and a second feature of the image to be processed by utilizing a first neural network;
fusing the first feature and the second feature to obtain a third feature;
and determining a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed based on the third feature.
In the implementation mode, the first characteristic and the second characteristic of the image to be processed are extracted through the first neural network, and the first posture parameter and the first shape parameter corresponding to the target object in the image to be processed are determined based on the third characteristic obtained by fusing the first characteristic and the second characteristic, so that the accuracy of the determined first posture parameter and the determined first shape parameter can be improved.
In a possible implementation manner, the first neural network is further configured to predict and obtain a first camera parameter corresponding to the image to be processed;
after the generating of the first three-dimensional model of the target object in the image to be processed, the method further comprises:
and projecting the first three-dimensional model to a two-dimensional plane according to the first camera parameter to obtain a two-dimensional model corresponding to the first three-dimensional model.
According to this implementation, after generating the three-dimensional model of the target object in the image to be processed, the display of the three-dimensional model of the target object on the two-dimensional plane can be achieved.
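As a sketch of this projection step, assuming (as in the detailed description below) that the first camera parameter consists of a zoom/scale parameter and a translation parameter of a weak-perspective camera; the exact camera model is an assumption:

```python
import torch

def project_to_two_dimensional_plane(vertices_3d: torch.Tensor,
                                     scale: float,
                                     translation: torch.Tensor) -> torch.Tensor:
    """Project the vertices of the first three-dimensional model onto a 2D plane.

    vertices_3d: (V, 3) vertices (or keypoints) of the three-dimensional model.
    scale, translation: predicted first camera parameters (zoom and 2D translation).
    Returns (V, 2) coordinates of the corresponding two-dimensional model.
    """
    return scale * vertices_3d[:, :2] + translation
```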
In one possible implementation, before the predicting the image to be processed by using the first neural network, the method further includes:
for any training image in the first image sequence, predicting the training image by using the first neural network to obtain at least one of the following: a second pose parameter corresponding to the target object in the training image, a second shape parameter corresponding to the target object in the training image, and a second camera parameter corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter;
and training the first neural network according to the values of the loss functions corresponding to the training images in the first image sequence.
According to the implementation manner, the first neural network can learn the capability of accurately determining the posture parameters and the shape parameters of the target object on the basis of the general network.
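A minimal training-loop sketch of this adaptation step, assuming a PyTorch-style network and a loss_fn that stands in for the combination of loss terms described below; the optimizer choice and hyper-parameters are assumptions, not part of the disclosure:

```python
import torch

def train_first_neural_network(first_net, first_image_sequence, loss_fn,
                               learning_rate=1e-5, epochs=1):
    """Fine-tune the exclusive network on the first image sequence of the target object."""
    optimizer = torch.optim.Adam(
        [p for p in first_net.parameters() if p.requires_grad], lr=learning_rate)
    for _ in range(epochs):
        for training_image in first_image_sequence:
            # Predict the second posture/shape parameters and the second camera parameter.
            posture, shape, camera = first_net(training_image)
            loss = loss_fn(training_image, posture, shape, camera)
            optimizer.zero_grad()
            loss.backward()   # gradient-descent back-propagation
            optimizer.step()
    return first_net
```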
In one possible implementation manner, the loss function corresponding to the training image includes a first loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
obtaining first coordinate prediction data of the two-dimensional key point set of the target object in the training image according to the second posture parameter, the second shape parameter and the second camera parameter;
performing two-dimensional attitude estimation on the training image to obtain a two-dimensional attitude estimation result corresponding to the target object in the training image, wherein the two-dimensional attitude estimation result comprises a coordinate label of a two-dimensional key point set of the target object in the training image;
and determining a value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label.
In this implementation, the coordinate labels of the two-dimensional key point set of the target object in the training image obtained by the two-dimensional pose estimation are used for supervision, so that a first neural network capable of capturing the exclusive features of the target object can be obtained by training with the unlabeled first image sequence.
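A sketch of this first loss term, assuming the coordinate difference is measured as a squared L2 distance (the disclosure does not fix the distance metric):

```python
import torch

def first_loss(first_coordinate_prediction: torch.Tensor,
               coordinate_label: torch.Tensor) -> torch.Tensor:
    """Re-projection loss against pseudo-labels from 2D pose estimation.

    first_coordinate_prediction: (K, 2) predicted 2D keypoints of the target object.
    coordinate_label: (K, 2) coordinate label from the 2D pose estimation result.
    """
    return torch.mean(torch.sum((first_coordinate_prediction - coordinate_label) ** 2, dim=-1))
```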
In one possible implementation, the obtaining first coordinate prediction data of a two-dimensional keypoint set of the target object in the training image according to the second pose parameter, the second shape parameter, and the second camera parameter includes:
generating a second three-dimensional model of the target object in the training image according to the second posture parameter and the second shape parameter;
obtaining second coordinate prediction data of the three-dimensional key point set of the target object in the training image through regression according to the second three-dimensional model;
and projecting the second coordinate prediction data to a two-dimensional plane according to the second camera parameters to obtain first coordinate prediction data of the two-dimensional key point set of the target object in the training image.
In the implementation mode, the three-dimensional key point set of the target object can be supervised by using the coordinate label of the two-dimensional key point set of the target object in the training image obtained by two-dimensional attitude estimation, so that the accuracy of the first neural network in predicting the attitude parameters and the shape parameters is improved.
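A sketch of this chain, with assumed helper names: smpl_model builds the second three-dimensional model from the parameters, joint_regressor is a (K, V) matrix regressing the three-dimensional keypoint set from the V mesh vertices, and the camera is again an assumed weak-perspective model:

```python
import torch

def predict_keypoints(posture, shape, camera, smpl_model, joint_regressor):
    """From predicted parameters to first coordinate prediction data.

    Returns the projected 2D keypoints (first coordinate prediction data) and
    the regressed 3D keypoints (second coordinate prediction data).
    """
    vertices = smpl_model(posture, shape)                      # (V, 3) second three-dimensional model
    keypoints_3d = joint_regressor @ vertices                  # (K, 3) second coordinate prediction data
    scale, translation = camera                                # second camera parameters
    keypoints_2d = scale * keypoints_3d[:, :2] + translation   # (K, 2) first coordinate prediction data
    return keypoints_2d, keypoints_3d
```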
In one possible implementation manner, the two-dimensional attitude estimation result further includes confidence degrees in one-to-one correspondence with the key points in the coordinate labels;
determining a value of a first loss function corresponding to the training image according to a coordinate difference between the first coordinate prediction data and a corresponding key point in the coordinate label, including:
and determining the value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label and the confidence coefficient.
In this implementation, the value of the first loss function is determined by combining the confidence, which is helpful for improving the training effect of the first neural network, and thus is helpful for further improving the accuracy of the first neural network in predicting the pose parameters and the shape parameters.
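One way to combine the coordinate difference with the confidence is a confidence-weighted sum; the specific weighting scheme below is an assumption:

```python
import torch

def first_loss_with_confidence(first_coordinate_prediction: torch.Tensor,
                               coordinate_label: torch.Tensor,
                               confidence: torch.Tensor) -> torch.Tensor:
    """Confidence-weighted re-projection loss.

    confidence: (K,) per-keypoint confidences from the 2D pose estimation result,
    used to down-weight unreliable pseudo-labels.
    """
    per_keypoint = torch.sum((first_coordinate_prediction - coordinate_label) ** 2, dim=-1)
    return torch.sum(confidence * per_keypoint) / (torch.sum(confidence) + 1e-8)
```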
In one possible implementation manner, the loss function corresponding to the training image includes a second loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
and judging the rationality of the second posture parameter and the second shape parameter pair to obtain a value of a second loss function corresponding to the training image.
According to this implementation, the rationality of the posture parameters and the shape parameters predicted by the first neural network can be improved.
In a possible implementation manner, the performing the rationality judgment on the second posture parameter and the second shape parameter pair to obtain a value of a second loss function corresponding to the training image includes:
and carrying out rationality judgment on the second posture parameter and the second shape parameter by using a pre-trained third neural network to obtain a value of a second loss function corresponding to the training image.
According to this implementation, prior information can be utilized to improve the accuracy of the rationality judgment on the posture parameters and the shape parameters of the target object.
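A sketch of this second loss term; third_neural_network is assumed to output a plausibility score in [0, 1] for a posture/shape pair (for instance an adversarially trained prior), and the squared-error form of the penalty is an assumption:

```python
import torch

def second_loss(second_posture_param: torch.Tensor,
                second_shape_param: torch.Tensor,
                third_neural_network) -> torch.Tensor:
    """Rationality (plausibility) term from a pre-trained prior network."""
    parameter_pair = torch.cat([second_posture_param.flatten(),
                                second_shape_param.flatten()])
    plausibility = third_neural_network(parameter_pair)  # assumed score, 1 = plausible
    # Penalize posture/shape pairs that the prior judges implausible.
    return (1.0 - plausibility) ** 2
```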
In a possible implementation manner, the loss function corresponding to the training image includes a third loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
under the condition that the training image is not the first image in the first image sequence, obtaining a third shape parameter corresponding to an image before the training image in the first image sequence;
and determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter.
In this implementation, by performing the constraint of the shape parameters in time series, the target object can be constrained in the depth direction in the training of the first neural network, thereby facilitating the first neural network to mine geometric features of the target object, such as bone length, body type, and stature.
In a possible implementation manner, the third shape parameter is an average shape parameter corresponding to a previous image of the training image, and the average shape parameter corresponding to a first image in the first image sequence is a shape parameter corresponding to the target object in the image predicted by the first neural network;
determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter includes:
determining an average shape parameter corresponding to the training image according to the weighted sum of the second shape parameter and the third shape parameter;
determining a value of a third loss function corresponding to the training image according to a difference between the second shape parameter and an average shape parameter corresponding to the training image.
According to the implementation mode, the shape parameters of the target object can be restrained for a long time by using the prior information that the shape of the target object is kept unchanged, so that the stability and the accuracy of the shape parameter prediction of the target object by the first neural network are improved.
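A sketch of this third loss term, keeping a running (weighted) average of the shape parameters over the sequence; the momentum weight and the choice to detach the average are assumptions:

```python
import torch

def third_loss(second_shape_param: torch.Tensor,
               third_shape_param: torch.Tensor,
               momentum: float = 0.9):
    """Long-term shape-consistency term.

    third_shape_param: average shape parameter carried over from the previous image
    (for the first image it is simply that image's predicted shape parameter).
    Returns the loss value and the updated average for the next training image.
    """
    average_shape = momentum * third_shape_param + (1.0 - momentum) * second_shape_param
    loss = torch.mean((second_shape_param - average_shape.detach()) ** 2)
    return loss, average_shape
```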
In a possible implementation manner, the loss function corresponding to the training image includes a fourth loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
obtaining third coordinate prediction data of the three-dimensional keypoint set of the target object in an image previous to the training image, if the training image is not the first image in the first image sequence;
and determining a value of a fourth loss function corresponding to the training image according to a difference between second coordinate prediction data and third coordinate prediction data of the three-dimensional key point set of the target object in the training image.
In the implementation mode, the coordinates of the three-dimensional key point set of the target object are subjected to short-term constraint by utilizing the continuity of the three-dimensional target object motion and the camera motion, so that the stability and the accuracy of the shape parameter prediction of the target object by the first neural network are improved.
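A sketch of this fourth loss term, penalizing the frame-to-frame difference of the three-dimensional keypoint set (squared L2 distance is an assumed choice):

```python
import torch

def fourth_loss(second_coordinate_prediction: torch.Tensor,
                third_coordinate_prediction: torch.Tensor) -> torch.Tensor:
    """Short-term smoothness term on the 3D keypoint set.

    second_coordinate_prediction: (K, 3) 3D keypoints for the current training image.
    third_coordinate_prediction:  (K, 3) 3D keypoints for the previous image.
    """
    return torch.mean(torch.sum(
        (second_coordinate_prediction - third_coordinate_prediction) ** 2, dim=-1))
```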
In one possible implementation, the first neural network includes a first sub-network and a second sub-network, the second neural network includes a third sub-network and a fourth sub-network, the network structure of the first sub-network is the same as the network structure of the third sub-network, the initial parameters of the first sub-network are the same as the parameters of the third sub-network, the network structure of the second sub-network is the same as the network structure of the fourth sub-network, and the initial parameters of the second sub-network are the same as the parameters of the fourth sub-network.
According to the implementation mode, the first neural network can be effectively initialized, and the convergence rate of the first neural network can be improved.
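A sketch of this initialization, assuming the second neural network exposes its encoder (third sub-network) and regressor (fourth sub-network) as attributes; the attribute names are assumptions:

```python
import copy

def initialize_first_neural_network(second_neural_network):
    """Copy structure and parameters from the generic network into the exclusive network."""
    # First sub-network: same structure and initial parameters as the third sub-network.
    first_sub_network = copy.deepcopy(second_neural_network.encoder)
    # Second sub-network: same structure and initial parameters as the fourth sub-network.
    second_sub_network = copy.deepcopy(second_neural_network.regressor)
    return first_sub_network, second_sub_network
```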
In a possible implementation, the first neural network further includes the third sub-network and a fusion layer, wherein the fusion layer is connected to the first sub-network, the third sub-network and the second sub-network, respectively, and parameters of the third sub-network are kept fixed during training of the first neural network;
the predicting the image to be processed by using the first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed includes:
extracting a first feature of the image to be processed by using the first sub-network, and extracting a second feature of the image to be processed by using the third sub-network;
fusing the first feature and the second feature through the fusion layer to obtain a third feature;
and predicting the third feature by using the second sub-network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed.
In this implementation, the first neural network further includes a third sub-network (i.e., an encoder of the second neural network) of the general network (i.e., the second neural network), so that the first neural network can learn the exclusive feature of the target object without losing the three-dimensional attitude prior information, thereby further improving the accuracy of predicting the attitude parameter and the shape parameter of the target object by the first neural network.
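A sketch of this extended variant of the first neural network, assuming PyTorch-style sub-modules and the 0.5/0.5 channel-wise weighting mentioned as an example in the detailed description; the attribute names and the output format of the regressor are assumptions:

```python
import copy
import torch.nn as nn

class FirstNeuralNetwork(nn.Module):
    """Exclusive network with a trainable encoder, a frozen generic encoder and a fusion layer."""

    def __init__(self, second_neural_network, weight: float = 0.5):
        super().__init__()
        # First sub-network: trainable copy of the generic encoder.
        self.first_sub_network = copy.deepcopy(second_neural_network.encoder)
        # Third sub-network: generic encoder whose parameters stay fixed during training.
        self.third_sub_network = copy.deepcopy(second_neural_network.encoder)
        for p in self.third_sub_network.parameters():
            p.requires_grad = False
        # Second sub-network: copy of the generic regressor.
        self.second_sub_network = copy.deepcopy(second_neural_network.regressor)
        self.weight = weight

    def forward(self, image):
        first_feature = self.first_sub_network(image)
        second_feature = self.third_sub_network(image)
        # Fusion layer: channel-wise weighted sum of the two features.
        third_feature = self.weight * first_feature + (1.0 - self.weight) * second_feature
        # The regressor predicts posture, shape and camera parameters from the fused feature.
        return self.second_sub_network(third_feature)
```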
According to an aspect of the present disclosure, there is provided a three-dimensional reconstruction apparatus including:
the acquisition module is used for acquiring an image to be processed corresponding to the target object;
the first prediction module is used for predicting the image to be processed by utilizing a first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed, wherein the first neural network is initialized in advance based on a second neural network and is obtained by training according to a first image sequence corresponding to the target object, and the second neural network is obtained by training in advance according to a plurality of second image sequences corresponding to a plurality of objects;
and the generating module is used for generating a first three-dimensional model of the target object in the image to be processed according to the first posture parameter and the first shape parameter.
In one possible implementation, the first prediction module is configured to:
extracting a first feature of the image to be processed and a second feature of the image to be processed by utilizing a first neural network;
fusing the first feature and the second feature to obtain a third feature;
and determining a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed based on the third feature.
In a possible implementation manner, the first neural network is further configured to predict and obtain a first camera parameter corresponding to the image to be processed;
the device comprises:
and the projection module is used for projecting the first three-dimensional model to a two-dimensional plane according to the first camera parameter to obtain a two-dimensional model corresponding to the first three-dimensional model.
In one possible implementation, the apparatus further includes:
a second prediction module, configured to, for any training image in the first image sequence, predict the training image by using the first neural network, so as to obtain at least one of the following: a second pose parameter corresponding to the target object in the training image, a second shape parameter corresponding to the target object in the training image, and a second camera parameter corresponding to the training image;
a determining module, configured to determine a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter;
and the training module is used for training the first neural network according to the values of the loss functions corresponding to the training images in the first image sequence.
In one possible implementation manner, the loss function corresponding to the training image includes a first loss function corresponding to the training image;
the determination module is to:
obtaining first coordinate prediction data of the two-dimensional key point set of the target object in the training image according to the second posture parameter, the second shape parameter and the second camera parameter;
performing two-dimensional attitude estimation on the training image to obtain a two-dimensional attitude estimation result corresponding to the target object in the training image, wherein the two-dimensional attitude estimation result comprises a coordinate label of a two-dimensional key point set of the target object in the training image;
and determining a value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label.
In one possible implementation, the determining module is configured to:
generating a second three-dimensional model of the target object in the training image according to the second posture parameter and the second shape parameter;
obtaining second coordinate prediction data of the three-dimensional key point set of the target object in the training image through regression according to the second three-dimensional model;
and projecting the second coordinate prediction data to a two-dimensional plane according to the second camera parameters to obtain first coordinate prediction data of the two-dimensional key point set of the target object in the training image.
In one possible implementation manner, the two-dimensional attitude estimation result further includes confidence degrees in one-to-one correspondence with the key points in the coordinate labels;
the determination module is to:
and determining the value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label and the confidence coefficient.
In one possible implementation manner, the loss function corresponding to the training image includes a second loss function corresponding to the training image;
the determination module is to:
and judging the rationality of the second posture parameter and the second shape parameter pair to obtain a value of a second loss function corresponding to the training image.
In one possible implementation, the determining module is configured to:
and carrying out rationality judgment on the second posture parameter and the second shape parameter by using a pre-trained third neural network to obtain a value of a second loss function corresponding to the training image.
In a possible implementation manner, the loss function corresponding to the training image includes a third loss function corresponding to the training image;
the determination module is to:
under the condition that the training image is not the first image in the first image sequence, obtaining a third shape parameter corresponding to an image before the training image in the first image sequence;
and determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter.
In a possible implementation manner, the third shape parameter is an average shape parameter corresponding to a previous image of the training image, and the average shape parameter corresponding to a first image in the first image sequence is a shape parameter corresponding to the target object in the image predicted by the first neural network;
the determination module is to:
determining an average shape parameter corresponding to the training image according to the weighted sum of the second shape parameter and the third shape parameter;
determining a value of a third loss function corresponding to the training image according to a difference between the second shape parameter and an average shape parameter corresponding to the training image.
In a possible implementation manner, the loss function corresponding to the training image includes a fourth loss function corresponding to the training image;
the determination module is to:
obtaining third coordinate prediction data of the three-dimensional keypoint set of the target object in an image previous to the training image, if the training image is not the first image in the first image sequence;
and determining a value of a fourth loss function corresponding to the training image according to a difference between second coordinate prediction data and third coordinate prediction data of the three-dimensional key point set of the target object in the training image.
In one possible implementation, the first neural network includes a first sub-network and a second sub-network, the second neural network includes a third sub-network and a fourth sub-network, the network structure of the first sub-network is the same as the network structure of the third sub-network, the initial parameters of the first sub-network are the same as the parameters of the third sub-network, the network structure of the second sub-network is the same as the network structure of the fourth sub-network, and the initial parameters of the second sub-network are the same as the parameters of the fourth sub-network.
In a possible implementation, the first neural network further includes the third sub-network and a fusion layer, wherein the fusion layer is connected to the first sub-network, the third sub-network and the second sub-network, respectively, and parameters of the third sub-network are kept fixed during training of the first neural network;
the first prediction module is to:
extracting a first feature of the image to be processed by using the first sub-network, and extracting a second feature of the image to be processed by using the third sub-network;
fusing the first feature and the second feature through the fusion layer to obtain a third feature;
and predicting the third feature by using the second sub-network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed.
According to an aspect of the present disclosure, there is provided an electronic device including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the present disclosure, a second neural network (i.e., a general network) is obtained by training according to a plurality of second image sequences corresponding to a plurality of objects, then the first neural network is initialized based on the second neural network, and the first neural network is trained according to the first image sequence corresponding to the target object, so that a first neural network (i.e., an exclusive network) that is more suitable for the target object and can capture the exclusive features of the target object can be obtained. The first neural network obtained through training is adopted to process the image to be processed corresponding to the target object, and the accuracy of the first posture parameter and the first shape parameter corresponding to the target object obtained through prediction can be improved. And generating a three-dimensional model of the target object in the image to be processed based on the first posture parameter and the first shape parameter, so that the accuracy of the generated three-dimensional model can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a three-dimensional reconstruction method provided by an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of training a first neural network in a three-dimensional reconstruction method provided by an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a proprietary network Φ_p in an application scenario provided by an embodiment of the present disclosure.
Fig. 4 shows a block diagram of a three-dimensional reconstruction apparatus provided by an embodiment of the present disclosure.
Fig. 5 illustrates a block diagram of an electronic device 800 provided by an embodiment of the disclosure.
Fig. 6 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The human body three-dimensional reconstruction task usually reconstructs the pose parameters and the shape parameters of a target human body from a video or a video stream, and then obtains a three-dimensional model of the target human body from these pose parameters and shape parameters through an SMPL (Skinned Multi-Person Linear) model. In recent years, with the development of neural networks and large-scale human body data sets, many achievements have been made in this field.
However, when the image data corresponding to the target object that needs to be three-dimensionally reconstructed differs significantly from the training data used to train a universal network, it is difficult to obtain an accurate three-dimensional model by using that universal network to three-dimensionally reconstruct the target object. For example, training data collected in a laboratory often differs in distribution from outdoor test data, such as in viewing angles, backgrounds, and lighting conditions. As a result, a neural network that is effective on the training data may have difficulty achieving high accuracy on outdoor test data.
In addition, in the related art, when training a neural network for three-dimensional reconstruction, a large number of image sequences corresponding to a large number of different objects are usually used to train the neural network, so as to obtain a general network. For example, when training a neural network for three-dimensional reconstruction of a human body, a large number of image sequences corresponding to a large number of human bodies of different sizes are usually used to train the neural network, so as to obtain a universal network. The generic network thus trained tends to give pose parameters and shape parameters of the human body close to the mean value, resulting in a low accuracy of the three-dimensional reconstruction of a specific human body.
The embodiment of the disclosure provides a three-dimensional reconstruction method and device, an electronic device and a storage medium, wherein a second neural network (namely a universal network) is obtained by training according to a plurality of second image sequences corresponding to a plurality of objects, a first neural network is initialized based on the second neural network, and the first neural network is trained according to a first image sequence corresponding to a target object, so that the first neural network (namely an exclusive network) which is more suitable for the target object and can capture the exclusive characteristics of the target object can be obtained. The first neural network obtained through training is adopted to process the image to be processed corresponding to the target object, and the accuracy of the first posture parameter and the first shape parameter corresponding to the target object obtained through prediction can be improved. And generating a three-dimensional model of the target object in the image to be processed based on the first posture parameter and the first shape parameter, so that the accuracy of the generated three-dimensional model can be improved.
The three-dimensional reconstruction method provided by the embodiments of the present disclosure is described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a three-dimensional reconstruction method provided by an embodiment of the present disclosure. In one possible implementation, the three-dimensional reconstruction method may be performed by a terminal device or a server or other processing device. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the three-dimensional reconstruction method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 1, the three-dimensional reconstruction method includes steps S11 to S13.
In step S11, a to-be-processed image corresponding to the target object is acquired.
In step S12, the image to be processed is predicted by using a first neural network, so as to obtain a first pose parameter and a first shape parameter corresponding to the target object in the image to be processed, where the first neural network is initialized in advance based on a second neural network and is trained according to a first image sequence corresponding to the target object, and the second neural network is trained in advance according to a plurality of second image sequences corresponding to a plurality of objects.
In step S13, a first three-dimensional model of the target object in the image to be processed is generated according to the first pose parameter and the first shape parameter.
In the embodiments of the present disclosure, the type of the object may be a human body, an animal body, or the like. The second neural network may be a general network trained in advance from a plurality of second image sequences corresponding to a plurality of objects of one type or a plurality of types. For example, the second neural network may be a general network trained in advance based on large-scale training data, and specifically, the second neural network may be a general network trained in advance according to a large number of second image sequences corresponding to a large number of different objects of the same type. Wherein, the second image sequence may represent an image sequence for training the second neural network, and each image in the second image sequence may be respectively used as a training image for training the second neural network. After the training of the second neural network is completed, the parameters of the first neural network may be initialized with the parameters of the second neural network. The network structure of the first neural network may be different from that of the second neural network, or may be the same as that of the second neural network. The first neural network at least comprises a partial network structure of the second neural network, and the same network structure of the first neural network and the second neural network is initialized by adopting the parameters of the second neural network. For example, the first neural network may include the entire network structure of the second neural network, and the first neural network may also include other network structures. After the first neural network is initialized, the first neural network can be trained according to the first image sequence corresponding to the target object, so that the trained first neural network can capture exclusive characteristics of the appearance, body type, motion mode and the like of the target object, and distribution of the target object can be fitted accurately. The first image sequence may represent an image sequence for training the first neural network, and each image in the first image sequence may be used as a training image for training the first neural network.
In the disclosed embodiment, any image sequence includes a plurality of images. For example, a video or a video segment may be considered as a sequence of images. As another example, a video frame may be sampled and the sequence of sampled video frames may be used as a sequence of images. The images and the image sequence in the embodiment of the present disclosure may be acquired by a monocular camera or acquired by a binocular camera. Under the condition of adopting monocular camera for collection, the three-dimensional reconstruction method provided by the embodiment of the disclosure can carry out three-dimensional reconstruction based on monocular vision.
In the disclosed embodiment, the target object may represent an object that needs to be reconstructed three-dimensionally. For example, the target object may be a target human body or the like. The image to be processed may represent an image that requires three-dimensional reconstruction. The image to be processed may be an arbitrary image containing the target object. For example, the image to be processed may be any video frame in a video containing the target object. After the training of the first neural network is completed, the image to be processed can be processed through the first neural network, and a first posture parameter and a first shape parameter corresponding to a target object in the image to be processed are obtained. The first attitude parameter represents an attitude parameter corresponding to a target object in the image to be processed; the first shape parameter represents a shape parameter corresponding to a target object in the image to be processed. In the embodiment of the present disclosure, the posture parameter may represent a parameter for describing a posture of the target object, and the shape parameter may represent a parameter for describing a shape of the target object. For example, the pose parameter may be denoted as θ and the shape parameter may be denoted as β. After obtaining the first pose parameters and the first shape parameters, the first pose parameters and the first shape parameters may be input into the SMPL, via which a first three-dimensional model of the target object in the image to be processed is generated. Wherein the first three-dimensional model represents a three-dimensional model of a target object in the image to be processed. Of course, other functions or models for generating a three-dimensional model of the object from the pose parameters and the shape parameters may be designed in advance, and are not limited herein.
In a possible implementation manner, in a case that three-dimensional reconstruction of a target object needs to be performed on an offline video including the target object, the offline video may be used as a first image sequence, or the offline video may be sampled to obtain the first image sequence. After the first neural network is trained according to the first image sequence, each video frame in the offline video can be respectively used as an image to be processed, so that a first attitude parameter and a first shape parameter corresponding to a target object in each video frame in the offline video can be obtained through prediction of the first neural network, and further, a first three-dimensional model of the target object in each video frame in the offline video can be generated according to the first attitude parameter and the first shape parameter corresponding to the target object in each video frame in the offline video. For example, when an offline video including a target human body needs to be subjected to three-dimensional reconstruction of the target human body, the offline video may be used as the first image sequence, or the offline video may be sampled to obtain the first image sequence. After the first neural network is trained according to the first image sequence, each video frame in the offline video can be respectively used as an image to be processed, so that a first posture parameter and a first shape parameter corresponding to a target human body in each video frame in the offline video can be obtained through prediction of the first neural network, and further, a first three-dimensional model of the target human body in each video frame in the offline video can be generated according to the first posture parameter and the first shape parameter corresponding to the target human body in each video frame in the offline video.
In another possible implementation manner, in a case where three-dimensional reconstruction of a target object needs to be performed on an online video stream containing the target object, a segment of the online video stream may be acquired as a first image sequence. After the first neural network is trained according to the first image sequence, each video frame in the online video stream can be respectively used as an image to be processed, and the online video stream is processed in real time through the first neural network. In specific implementation, the first attitude parameter and the first shape parameter corresponding to the target object in each video frame in the online video stream can be obtained through real-time prediction of the first neural network, and then the first three-dimensional model of the target object in each video frame in the online video stream can be generated in real time according to the first attitude parameter and the first shape parameter corresponding to the target object in each video frame in the online video stream. For example, in a case where a three-dimensional reconstruction of a target human body needs to be performed on an online video stream containing the target human body, a segment of the online video stream may be acquired as a first image sequence. After the first neural network is trained according to the first image sequence, each video frame in the online video stream can be respectively used as an image to be processed, and the online video stream is processed in real time through the first neural network. In specific implementation, the first posture parameter and the first shape parameter corresponding to the target human body in each video frame in the online video stream can be obtained through real-time prediction of the first neural network, and then the first three-dimensional model of the target human body in each video frame in the online video stream can be generated in real time according to the first posture parameter and the first shape parameter corresponding to the target human body in each video frame in the online video stream.
In one possible implementation, the first neural network includes a first sub-network and a second sub-network, the second neural network includes a third sub-network and a fourth sub-network, the network structure of the first sub-network is the same as the network structure of the third sub-network, the initial parameters of the first sub-network are the same as the parameters of the third sub-network, the network structure of the second sub-network is the same as the network structure of the fourth sub-network, and the initial parameters of the second sub-network are the same as the parameters of the fourth sub-network. In this implementation, the third sub-network may employ any network structure that is capable of feature extraction. For example, the third sub-network may employ the network structure of the feature-encoding portion of ResNet-50. In this implementation, the initial parameters of the first sub-network are the same as the parameters of the third sub-network after the training of the second neural network is completed, and the initial parameters of the second sub-network are the same as the parameters of the fourth sub-network after the training of the second neural network is completed. In one example, the first sub-network may be referred to as an encoder of the first neural network, and the second sub-network may be referred to as a regressor or decoder of the first neural network; the third sub-network may be referred to as an encoder of the second neural network, and the fourth sub-network may be referred to as a regressor or decoder of the second neural network. In one example, each of these sub-networks may be denoted by a dedicated symbol (given as formula images in the original publication).
According to the implementation mode, the first neural network can be effectively initialized, and the convergence rate of the first neural network can be improved.
As an example of this implementation, the first neural network further comprises the third sub-network and a fusion layer, wherein the fusion layer is connected to the first sub-network, the third sub-network and the second sub-network, respectively, and parameters of the third sub-network remain fixed during training of the first neural network; the predicting the image to be processed by using the first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed includes: extracting a first feature of the image to be processed by using the first sub-network, and extracting a second feature of the image to be processed by using the third sub-network; fusing the first feature and the second feature through the fusion layer to obtain a third feature; and predicting the third feature by using the second sub-network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed. In this example, the fusion layer may channel-wise weight the first feature output by the first sub-network and the second feature output by the third sub-network to obtain the third feature. For example, the weight corresponding to the first feature and the weight corresponding to the second feature may be 0.5, respectively. Of course, those skilled in the art may also flexibly set the weight corresponding to the first feature and the weight corresponding to the second feature according to the requirements of the actual application scenario, which is not limited herein. In this example, the first neural network further includes a third sub-network (i.e., an encoder of the second neural network) of the general network (i.e., the second neural network), so that the first neural network can learn the proprietary features of the target object without losing the three-dimensional attitude prior information, thereby further improving the accuracy of predicting the attitude parameters and the shape parameters of the target object by the first neural network.
As another example of this implementation, the first neural network may include only the first subnetwork and the second subnetwork, excluding the third subnetwork and the fusion layer. That is, in this example, the network structure of the first neural network may be the same as the second neural network. In this example, the image to be processed may be input into a first sub-network, a fourth feature of the image to be processed is extracted via the first sub-network, the fourth feature is input into a second sub-network, and a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed are predicted via the second sub-network.
The network structure of the second neural network is not limited by the disclosed embodiments. For example, in other possible implementations, the second neural network may include more sub-networks. Accordingly, the first neural network may also comprise more sub-networks.
In a possible implementation manner, the predicting the image to be processed by using the first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed includes: extracting a first feature of the image to be processed and a second feature of the image to be processed by utilizing a first neural network; fusing the first feature and the second feature to obtain a third feature; and determining a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed based on the third feature. In the implementation mode, the first characteristic and the second characteristic of the image to be processed are extracted through the first neural network, and the first posture parameter and the first shape parameter corresponding to the target object in the image to be processed are determined based on the third characteristic obtained by fusing the first characteristic and the second characteristic, so that the accuracy of the determined first posture parameter and the determined first shape parameter can be improved.
In a possible implementation manner, the first neural network is further configured to predict and obtain a first camera parameter corresponding to the image to be processed; after the generating of the first three-dimensional model of the target object in the image to be processed, the method further comprises: and projecting the first three-dimensional model to a two-dimensional plane according to the first camera parameter to obtain a two-dimensional model corresponding to the first three-dimensional model. In this implementation, the first camera parameter represents a camera parameter corresponding to the image to be processed. Therein, the camera parameters may represent parameters for projecting the three-dimensional model to the two-dimensional plane. In one example, the camera parameter may be written as π. In one example, the camera parameters may include a zoom parameter and a pan parameter. In other examples, the camera parameters may further include rotation parameters and the like, which are not limited herein. According to this implementation, after generating the three-dimensional model of the target object in the image to be processed, the display of the three-dimensional model of the target object on the two-dimensional plane can be achieved.
Of course, under the condition that the three-dimensional model of the target object does not need to be displayed on the two-dimensional plane, the first camera parameter corresponding to the image to be processed does not need to be obtained through prediction of the first neural network, and accordingly, the first three-dimensional model does not need to be projected to the two-dimensional plane according to the first camera parameter so as to obtain the two-dimensional model corresponding to the first three-dimensional model.
In one possible implementation, before the predicting the image to be processed by using the first neural network, the method further includes: for any training image in the first image sequence, predicting the training image by using the first neural network to obtain at least one of the following: a second pose parameter corresponding to the target object in the training image, a second shape parameter corresponding to the target object in the training image, and a second camera parameter corresponding to the training image; determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter; and training the first neural network according to the values of the loss functions corresponding to the training images in the first image sequence. In this implementation, the second pose parameter represents a pose parameter corresponding to a target object in the training image; the second shape parameter represents the shape parameter corresponding to the target object in the training image; the second camera parameters represent camera parameters corresponding to the training images. According to this implementation, the values of the loss functions corresponding to the respective training images in the first image sequence can be obtained. In this implementation, the first neural network may be trained based on values of the loss function corresponding to all or a portion of the training images in the first sequence of images. As an example of this implementation, the first neural network may be trained based on values of the loss function corresponding to each training image in the first sequence of images. According to the implementation manner, the first neural network can learn the capability of accurately determining the posture parameters and the shape parameters of the target object on the basis of the general network.
In one example, the first image sequence may be denoted as $V_s = \{I_i\}_{i=1}^{N}$, where $I_i$ represents the i-th training image in the first image sequence and $N$ represents the number of training images in the first image sequence. The first neural network $\Phi_p$ can be updated according to equation 1:

$$\Phi_p \leftarrow \Phi_p - \alpha \, \nabla_{\Phi_p} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i \qquad (1)$$

where $\Phi_p(I_i) = \{\theta_i, \beta_i, \pi_i\}$, $\theta_i$ represents the second pose parameter corresponding to the target object in training image $I_i$, $\beta_i$ represents the second shape parameter corresponding to the target object in training image $I_i$, $\pi_i$ represents the second camera parameter corresponding to training image $I_i$, $\alpha$ represents the learning rate, and $\mathcal{L}_i$ represents the loss function corresponding to training image $I_i$.
Fig. 2 shows a schematic diagram of training a first neural network in a three-dimensional reconstruction method provided by an embodiment of the present disclosure. In the example shown in FIG. 2, for a training image $I_i$ in the first image sequence, the first neural network can be used to predict the training image $I_i$, so as to obtain the second pose parameter $\theta_i$ corresponding to the target object in training image $I_i$, the second shape parameter $\beta_i$ corresponding to the target object in training image $I_i$, and the second camera parameter $\pi_i$ corresponding to training image $I_i$, and the value $\mathcal{L}_i$ of the loss function corresponding to training image $I_i$ may be determined based on the second pose parameter $\theta_i$, the second shape parameter $\beta_i$, and the second camera parameter $\pi_i$. Similarly, a value of the loss function may be determined for each training image in the first image sequence. Based on the values of the loss functions corresponding to the training images in the first image sequence, a gradient descent back propagation method may be employed to train the first neural network.
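As a rough illustration of this training procedure, the following PyTorch-style sketch iterates over the first image sequence, predicts the pose, shape, and camera parameters per training image, and updates the network by gradient descent back propagation. All names (`phi_p`, `loss_fn`, etc.) are hypothetical, and the loss function is assumed to be supplied separately; this is not the patent's code.

```python
# Sketch of the per-image training loop (equation 1 / Fig. 2), under the
# assumption that phi_p is an nn.Module returning (theta, beta, pi) per image.
import torch

def train_first_network(phi_p, first_image_sequence, loss_fn, lr=2e-5, epochs=7):
    # In practice only the trainable sub-networks' parameters would be optimized.
    optimizer = torch.optim.Adam(phi_p.parameters(), lr=lr)
    for _ in range(epochs):
        for i, image in enumerate(first_image_sequence):
            theta, beta, cam = phi_p(image)             # second pose/shape/camera parameters
            loss = loss_fn(i, image, theta, beta, cam)  # value of the loss for this training image
            optimizer.zero_grad()
            loss.backward()                             # gradient descent back propagation
            optimizer.step()
    return phi_p
```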
As an example of this implementation, the loss function corresponding to the training image includes a first loss function corresponding to the training image; determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including: obtaining first coordinate prediction data of the two-dimensional key point set of the target object in the training image according to the second posture parameter, the second shape parameter and the second camera parameter; performing two-dimensional attitude estimation on the training image to obtain a two-dimensional attitude estimation result corresponding to the target object in the training image, wherein the two-dimensional attitude estimation result comprises a coordinate label of a two-dimensional key point set of the target object in the training image; and determining a value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label.
In this example, the first coordinate prediction data may represent coordinate prediction data of the two-dimensional keypoints of the target object in the training image. Two-dimensional pose estimation may refer to estimating the coordinates of the two-dimensional keypoint set of an object in an image. Performing two-dimensional pose estimation on the training image can estimate the coordinates of the two-dimensional keypoint set of the target object in the training image, so as to obtain a pseudo label of the coordinates of the two-dimensional keypoint set of the target object in the training image. In one example, a two-dimensional pose estimation network may be used to perform two-dimensional pose estimation on the target object in the training image, so as to obtain a two-dimensional pose estimation result of the target object in the training image.
In this example, the two-dimensional set of keypoints of the target object may represent a set of two-dimensional keypoints of the target object. In one example, the two-dimensional keypoint set of the target object may include K two-dimensional keypoints, where K is greater than 1. Accordingly, the first coordinate prediction data may include coordinate prediction data for K two-dimensional keypoints of the target object in the training image, and the coordinate labels may include coordinate labels for the K two-dimensional keypoints of the target object in the training image.
In this example, the coordinate labels of the two-dimensional keypoint set of the target object in the training image obtained by the two-dimensional pose estimation are used for supervision, so that a first neural network capable of capturing the exclusive feature of the target object can be obtained by training with the unmarked first image sequence.
In one example, the obtaining first coordinate prediction data of a two-dimensional keypoint set of the target object in the training image according to the second pose parameter, the second shape parameter, and the second camera parameter includes: generating a second three-dimensional model of the target object in the training image according to the second posture parameter and the second shape parameter; obtaining second coordinate prediction data of the three-dimensional key point set of the target object in the training image through regression according to the second three-dimensional model; and projecting the second coordinate prediction data to a two-dimensional plane according to the second camera parameters to obtain first coordinate prediction data of the two-dimensional key point set of the target object in the training image. In this example, after obtaining the second pose parameters and the second shape parameters, the second pose parameters and the second shape parameters may be input into the SMPL, via which a second three-dimensional model of the target object in the training image is generated. Wherein the second three-dimensional model represents a three-dimensional model of a target object in the training image. In one example, a fourth neural network for obtaining three-dimensional key points according to a three-dimensional model regression may be trained in advance, and the second three-dimensional model may be subjected to the regression of the three-dimensional key points via the fourth neural network, so as to obtain second coordinate prediction data of the three-dimensional key point set of the target object in the training image. Wherein the second coordinate prediction data represents coordinate prediction data of a three-dimensional keypoint set of the target object in the training image. Wherein the set of three-dimensional keypoints of the target object represents a set of three-dimensional keypoints of the target object. And projecting the second coordinate prediction data according to the second camera parameters to obtain the first coordinate prediction data. In the above example, the three-dimensional key point set of the target object can be supervised by using the coordinate labels of the two-dimensional key point set of the target object in the training image obtained by two-dimensional posture estimation, thereby being beneficial to improving the accuracy of the first neural network in predicting the posture parameters and the shape parameters.
In other examples, a model or a function for obtaining the second coordinate prediction data of the three-dimensional keypoint set of the target object according to the pose parameter, the shape parameter, and the camera parameter may be designed in advance, or a model or a function for obtaining the first coordinate prediction data of the two-dimensional keypoint set of the target object according to the pose parameter, the shape parameter, and the camera parameter may be designed in advance, which is not limited herein.
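The example in the preceding paragraphs can be sketched as the short pipeline below. Here `smpl`, `joint_regressor`, and `project_to_2d` are assumed callables standing in for the SMPL model, the three-dimensional keypoint regression, and the camera projection respectively; they are placeholders, not the patent's implementation.

```python
# Sketch: second pose/shape parameters -> second 3D model -> second coordinate
# prediction data (3D keypoints) -> first coordinate prediction data (2D keypoints).
def predict_keypoints(smpl, joint_regressor, project_to_2d, theta, beta, cam):
    vertices = smpl(theta, beta)        # second three-dimensional model of the target object
    X = joint_regressor(vertices)       # second coordinate prediction data (3D keypoint set)
    x = project_to_2d(X, *cam)          # first coordinate prediction data (2D keypoint set)
    return x, X
```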
In one example, the two-dimensional pose estimation result further comprises confidence degrees corresponding to the key points in the coordinate labels in a one-to-one mode; determining a value of a first loss function corresponding to the training image according to a coordinate difference between the first coordinate prediction data and a corresponding key point in the coordinate label, including: and determining the value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label and the confidence coefficient. In this example, by performing two-dimensional pose estimation on the training image, the confidence levels corresponding to the key points in the coordinate labels one-to-one can also be obtained. The value of the first loss function is determined by combining the confidence coefficient, so that the training effect of the first neural network is improved, and the accuracy of predicting the attitude parameters and the shape parameters of the first neural network is further improved.
In one example, equation 2 may be used to determine the value $\mathcal{L}^{(1)}_i$ of the first loss function corresponding to training image $I_i$:

$$\mathcal{L}^{(1)}_i = \sum_{k=1}^{K} \omega_k \left\| x_i^{k} - \hat{x}_i^{k} \right\|^2 \qquad (2)$$

where $x_i$ represents the first coordinate prediction data of the two-dimensional keypoint set of the target object in training image $I_i$, $x_i^{k}$ denotes the predicted coordinates of the k-th keypoint in $x_i$, $\hat{x}_i$ represents the coordinate label of the two-dimensional keypoint set of the target object in training image $I_i$, $\hat{x}_i^{k}$ denotes the k-th keypoint in $\hat{x}_i$, $K$ denotes the number of keypoints in the two-dimensional keypoint set of the target object, and $\omega_k$ denotes the confidence corresponding to $\hat{x}_i^{k}$, i.e., the confidence of the k-th keypoint in $\hat{x}_i$.
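A minimal sketch of this confidence-weighted 2D keypoint loss, assuming predicted keypoints of shape (K, 2), pseudo-label keypoints of the same shape, and per-keypoint confidences from the 2D pose estimator; tensor names are illustrative.

```python
# Sketch of the first loss (equation 2): confidence-weighted coordinate differences.
import torch

def first_loss(x_i: torch.Tensor, x_hat_i: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    per_keypoint = ((x_i - x_hat_i) ** 2).sum(dim=-1)  # squared coordinate difference per keypoint
    return (w * per_keypoint).sum()                    # weight by confidence and sum over K keypoints
```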
As another example of this implementation, the first sequence of images may be a sequence of annotated images.
As an example of this implementation, the loss function corresponding to the training image includes a second loss function corresponding to the training image; determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter includes: performing a rationality judgment on the second posture parameter and the second shape parameter to obtain a value of the second loss function corresponding to the training image. In this example, the rationality judgment may be performed on the second posture parameter and the second shape parameter based on prior information, so as to obtain the value of the second loss function corresponding to the training image. The less reasonable the second posture parameter and the second shape parameter are, that is, the lower their rationality is, the larger the value of the second loss function corresponding to the training image is; the more reasonable the second posture parameter and the second shape parameter are, that is, the higher their rationality is, the smaller the value of the second loss function corresponding to the training image is. For example, the value of the second loss function corresponding to training image $I_i$ may be denoted as $\mathcal{L}^{(2)}_i$. According to this example, it is helpful to improve the rationality of the pose parameters and shape parameters predicted by the first neural network.
In an example, performing the rationality judgment on the second posture parameter and the second shape parameter to obtain a value of the second loss function corresponding to the training image includes: performing the rationality judgment on the second posture parameter and the second shape parameter by using a pre-trained third neural network to obtain the value of the second loss function corresponding to the training image. In this example, a third neural network for making a rationality determination on posture parameters and shape parameters may be trained in advance. The third neural network may be trained in advance by using the posture parameters and the shape parameters corresponding to different objects. According to this example, the prior information can be utilized to improve the accuracy of the rationality judgment on the posture parameters and the shape parameters of the target object.
In other examples, the rationality of at least one of the first coordinate prediction data, the second coordinate prediction data, and the second three-dimensional model may be determined to obtain a value of the second loss function corresponding to the training image.
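One way to realize the rationality term with a pre-trained third neural network is sketched below. The least-squares form and the convention that the network outputs a score close to 1 for plausible parameters are assumptions for illustration; the patent does not state a concrete formula for this loss.

```python
# Sketch of a second-loss (rationality/prior) term scored by a pre-trained
# discriminator-style third neural network; form and convention are assumed.
import torch

def second_loss(rationality_net, theta: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    score = rationality_net(theta, beta)   # assumed: close to 1 for plausible (pose, shape) pairs
    return ((score - 1.0) ** 2).mean()     # grows as the parameters look less reasonable
```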
As an example of this implementation, the loss function corresponding to the training image includes a third loss function corresponding to the training image; determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including: under the condition that the training image is not the first image in the first image sequence, obtaining a third shape parameter corresponding to an image before the training image in the first image sequence; and determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter. In this example, a value of a third loss function corresponding to the training image may be determined in conjunction with a third shape parameter corresponding to one or more images preceding the training image in the first sequence of images. And the third shape parameter represents the shape parameter corresponding to the image before the training image in the first image sequence. In this example, the value of the third loss function corresponding to the training image may be positively correlated with the difference between the second shape parameter and the third shape parameter, that is, the larger the difference between the second shape parameter and the third shape parameter, the larger the value of the third loss function corresponding to the training image is, and the smaller the difference between the second shape parameter and the third shape parameter, the smaller the value of the third loss function corresponding to the training image is. In this example, by performing the constraint of the shape parameter in time series, the target object can be constrained in the depth direction in the training of the first neural network, thereby facilitating the first neural network to mine geometric features of the target object, such as bone length, body type, stature, and the like.
In an example, the third shape parameter is an average shape parameter corresponding to a previous image of the training image, and the average shape parameter corresponding to a first image in the first image sequence is a shape parameter corresponding to the target object in the image predicted by the first neural network; determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter includes: determining an average shape parameter corresponding to the training image according to the weighted sum of the second shape parameter and the third shape parameter; determining a value of a third loss function corresponding to the training image according to a difference between the second shape parameter and an average shape parameter corresponding to the training image. In this example, the sum of the weight corresponding to the second shape parameter and the weight corresponding to the third shape parameter may be 1. The average shape parameter corresponding to the training image may represent a shape parameter obtained by performing weighted average on the second shape parameter according to the average shape parameter corresponding to the previous image of the training image. The value of the third loss function corresponding to the training image may be positively correlated with the difference between the second shape parameter and the average shape parameter corresponding to the training image. That is, the larger the difference between the second shape parameter and the average shape parameter corresponding to the training image, the larger the value of the third loss function corresponding to the training image; the smaller the difference between the second shape parameter and the average shape parameter corresponding to the training image, the smaller the value of the third loss function corresponding to the training image.
For example, the first image in the first image sequence may be denoted as $I_1$, the shape parameter corresponding to the target object in image $I_1$ predicted by the first neural network may be denoted as $\beta_1$, and the average shape parameter corresponding to image $I_1$ may be denoted as $\bar{\beta}_1$, where $\bar{\beta}_1 = \beta_1$. When $i > 1$, the average shape parameter $\bar{\beta}_i$ corresponding to the i-th image $I_i$ in the first image sequence can be determined using equation 3:

$$\bar{\beta}_i = \delta \, \bar{\beta}_{i-1} + (1-\delta) \, \beta_i \qquad (3)$$

where $\bar{\beta}_{i-1}$ represents the average shape parameter corresponding to the (i-1)-th image $I_{i-1}$ in the first image sequence, $\delta$ denotes the weight corresponding to $\bar{\beta}_{i-1}$, and $1-\delta$ denotes the weight corresponding to $\beta_i$. For example, $\delta = 0.9$. Of course, those skilled in the art can flexibly set the weight according to the requirements of the actual application scenario, which is not limited herein.

The value of the third loss function corresponding to image $I_i$ in the first image sequence may be determined using equation 4:

$$\mathcal{L}^{(3)}_i = \left\| \beta_i - \bar{\beta}_i \right\|^2 \qquad (4)$$
according to the above example, the shape parameters of the target object can be constrained for a long time by using the prior information that the shape of the target object remains unchanged, thereby contributing to improving the stability and accuracy of the shape parameter prediction of the target object by the first neural network.
In another example, the third shape parameter may also be a shape parameter corresponding to the target object in the previous image predicted by the first neural network.
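The running-average update of equation 3 and the shape-consistency term of equation 4 can be sketched as follows; the tensor names and the choice to treat the previous average as a constant are illustrative assumptions.

```python
# Sketch of the average shape parameter (equation 3) and third loss (equation 4).
import torch

def update_avg_shape(avg_beta_prev: torch.Tensor, beta_i: torch.Tensor, delta: float = 0.9) -> torch.Tensor:
    # Weighted sum of the previous average shape parameter and the current shape parameter
    return delta * avg_beta_prev + (1.0 - delta) * beta_i

def third_loss(beta_i: torch.Tensor, avg_beta_i: torch.Tensor) -> torch.Tensor:
    # Penalize deviation of the current shape parameter from the running average
    return ((beta_i - avg_beta_i.detach()) ** 2).sum()
```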
As an example of this implementation, the loss function corresponding to the training image includes a fourth loss function corresponding to the training image; determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including: obtaining third coordinate prediction data of the three-dimensional keypoint set of the target object in an image previous to the training image, if the training image is not the first image in the first image sequence; and determining a value of a fourth loss function corresponding to the training image according to a difference between second coordinate prediction data and third coordinate prediction data of the three-dimensional key point set of the target object in the training image. Wherein the third coordinate prediction data represents coordinate prediction data of a three-dimensional keypoint set of the target object in a previous image of the training image. In this example, the value of the fourth loss function corresponding to the training image is positively correlated with the difference between the second coordinate prediction data and the third coordinate prediction data. That is, the larger the difference between the second coordinate prediction data and the third coordinate prediction data is, the larger the value of the fourth loss function corresponding to the training image is; the smaller the difference between the second coordinate prediction data and the third coordinate prediction data is, the smaller the value of the fourth loss function corresponding to the training image is.
In one example, equation 5 may be used to determine the value $\mathcal{L}^{(4)}_i$ of the fourth loss function corresponding to training image $I_i$:

$$\mathcal{L}^{(4)}_i = \left\| X_i - X_{i-1} \right\|^2 \qquad (5)$$

where $X_i$ represents the second coordinate prediction data (i.e., the coordinate prediction data of the three-dimensional keypoint set of the target object in training image $I_i$), and $X_{i-1}$ represents the third coordinate prediction data (i.e., the coordinate prediction data of the three-dimensional keypoint set of the target object in training image $I_{i-1}$).
In the above example, a short-term constraint is imposed on the coordinates of the three-dimensional keypoint set of the target object by using the continuity of the motion of the three-dimensional target object and of the camera, thereby helping to improve the stability and accuracy of the first neural network's prediction of the shape parameters of the target object.
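A minimal sketch of this short-term smoothness term, assuming (K, 3) tensors of three-dimensional keypoint coordinates for consecutive training images; names are illustrative.

```python
# Sketch of the fourth loss (equation 5): difference between consecutive 3D keypoint predictions.
import torch

def fourth_loss(X_i: torch.Tensor, X_prev: torch.Tensor) -> torch.Tensor:
    # X_i, X_prev: (K, 3) coordinate prediction data of the 3D keypoint set
    return ((X_i - X_prev) ** 2).sum()
```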
In the foregoing implementation manner, the loss function corresponding to the training image may include at least one of a first loss function, a second loss function, a third loss function, and a fourth loss function corresponding to the training image. As an example of this implementation, the loss function corresponding to the training image may include a first loss function, a second loss function, a third loss function, and a fourth loss function corresponding to the training image. As another example of this implementation, the loss function corresponding to the training image may include a first loss function and a second loss function corresponding to the training image.
In one example, equation 6 may be used to determine the value $\mathcal{L}_i$ of the loss function corresponding to training image $I_i$:

$$\mathcal{L}_i = \mathcal{L}^{(1)}_i + \gamma_1 \mathcal{L}^{(2)}_i + \gamma_2 \mathcal{L}^{(3)}_i + \gamma_3 \mathcal{L}^{(4)}_i \qquad (6)$$

where $\mathcal{L}^{(1)}_i$ represents the value of the first loss function corresponding to training image $I_i$, $\mathcal{L}^{(2)}_i$ represents the value of the second loss function corresponding to training image $I_i$, $\mathcal{L}^{(3)}_i$ represents the value of the third loss function corresponding to training image $I_i$, $\mathcal{L}^{(4)}_i$ represents the value of the fourth loss function corresponding to training image $I_i$, and $\gamma_1$, $\gamma_2$, and $\gamma_3$ represent the weights corresponding to $\mathcal{L}^{(2)}_i$, $\mathcal{L}^{(3)}_i$, and $\mathcal{L}^{(4)}_i$, respectively. For example, $\gamma_1 = 3.2\times10^{-5}$, $\gamma_2 = 3.2\times10^{-3}$, and $\gamma_3 = 1\times10^{-4}$.
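The weighted combination of equation 6 can be sketched as below, using the example weights above; the helper terms mirror the earlier sketches in this section and are assumptions rather than the patent's code.

```python
# Sketch of the per-image total loss (equation 6).
import torch

def total_loss(l1: torch.Tensor, l2: torch.Tensor, l3: torch.Tensor, l4: torch.Tensor,
               gamma1: float = 3.2e-5, gamma2: float = 3.2e-3, gamma3: float = 1e-4) -> torch.Tensor:
    # First (2D keypoint) term plus weighted prior, shape-consistency and 3D-smoothness terms
    return l1 + gamma1 * l2 + gamma2 * l3 + gamma3 * l4
```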
The embodiment of the disclosure can be applied to the technical fields of augmented reality, virtual reality and the like. For example, the embodiment of the disclosure can be applied to application scenes such as virtual anchor, virtual fitting and virtual dance. The three-dimensional reconstruction method provided by the embodiment of the present disclosure is described below through a specific application scenario. In this application scenario, the type of object may be a human body.
First, a general network (i.e., a second neural network) $\Phi_g$ can be trained using a large number of videos corresponding to a large number of different human bodies. The general network $\Phi_g$ comprises a general encoder (i.e., a third sub-network) and a general regressor (i.e., a fourth sub-network). After the training of the general network $\Phi_g$ is completed, a proprietary network (i.e., a first neural network) $\Phi_p$ can be constructed. FIG. 3 illustrates a schematic diagram of the proprietary network $\Phi_p$ in an application scenario provided by an embodiment of the disclosure. The proprietary network $\Phi_p$ comprises the general encoder, a proprietary encoder (i.e., a first sub-network), a fusion layer, and a proprietary regressor (i.e., a second sub-network), wherein the fusion layer is connected to the proprietary encoder, the general encoder, and the proprietary regressor, respectively. The network structure and initialization parameters of the proprietary encoder are the same as those of the general encoder, and the network structure and initialization parameters of the proprietary regressor are the same as those of the general regressor. The proprietary encoder can be used to extract the exclusive feature (i.e., the first feature) $h_p$ of the target human body in the input image, the general encoder can be used to extract the general feature (i.e., the second feature) $h_g$ of the target human body in the input image, and the fusion layer can be used to fuse the exclusive feature and the general feature to obtain a fused feature (i.e., the third feature) $h_f$. During the training of the proprietary network $\Phi_p$, only the parameters of the proprietary encoder and the proprietary regressor are updated, while the parameters of the general encoder and the fusion layer are not updated. The proprietary network $\Phi_p$ may employ a learning rate $\alpha$, for example, $\alpha = 2\times10^{-5}$.
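A schematic sketch of this network structure is given below: a frozen general encoder, a trainable proprietary encoder, a fusion layer, and a trainable regressor. The module names and the concatenation-based fusion are assumptions for illustration only.

```python
# Sketch of the proprietary network: dual encoders, fusion layer, regressor.
import torch
import torch.nn as nn

class ProprietaryNetwork(nn.Module):
    def __init__(self, general_encoder: nn.Module, proprietary_encoder: nn.Module,
                 fusion_layer: nn.Module, regressor: nn.Module):
        super().__init__()
        self.general_encoder = general_encoder          # third sub-network, kept fixed
        self.proprietary_encoder = proprietary_encoder  # first sub-network, updated during training
        self.fusion_layer = fusion_layer                # kept fixed during training
        self.regressor = regressor                      # second sub-network, updated during training
        for p in self.general_encoder.parameters():
            p.requires_grad = False
        for p in self.fusion_layer.parameters():
            p.requires_grad = False

    def forward(self, image: torch.Tensor):
        h_p = self.proprietary_encoder(image)           # exclusive (first) feature
        h_g = self.general_encoder(image)               # general (second) feature
        h_f = self.fusion_layer(torch.cat([h_p, h_g], dim=-1))  # fused (third) feature
        return self.regressor(h_f)                      # pose, shape and camera parameters
```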
The video $V_t$ corresponding to the target human body can be obtained, and a video frame set $V_s$ can be obtained by sampling from the video $V_t$, so that the video frame set $V_s$ is adopted to train the proprietary network $\Phi_p$. The video $V_t$ may be an unannotated video, and accordingly the video frame set $V_s$ has no annotation data either. A 2D (2-dimensional) pose estimation network may be adopted to perform 2D pose estimation on each video frame $I_i$ in the video frame set $V_s$, so as to obtain the pseudo label $\hat{x}_i = f_{2d}(I_i)$ of the 2D keypoints of the target human body in each video frame $I_i$ in the video frame set $V_s$, where $f_{2d}(\cdot)$ represents the 2D pose estimation network.
A video frame $I_i$ is input into the proprietary network $\Phi_p$, and the proprietary network $\Phi_p$ predicts the shape parameter $\beta_i$ of the target human body in video frame $I_i$, the pose parameter $\theta_i$ of the target human body in video frame $I_i$, and the camera parameter $\pi_i$ corresponding to video frame $I_i$. The pose parameter $\theta_i$ and the shape parameter $\beta_i$ are input into the SMPL model to obtain a three-dimensional model of the target human body in video frame $I_i$. Regression is performed on the three-dimensional model of the target human body in video frame $I_i$ to obtain the coordinate prediction data $X_i$ of the three-dimensional keypoint set of the target human body in video frame $I_i$. According to the camera parameter $\pi_i$, the three-dimensional model of the target human body in video frame $I_i$ can be projected onto a two-dimensional plane to obtain the coordinate prediction data $x_i$ of the two-dimensional keypoint set of the target human body in video frame $I_i$.
The proprietary network $\Phi_p$ may be updated using equation 1 above, and the loss function may be determined using equations 2-6 above. The proprietary network $\Phi_p$ can be trained for T epochs, for example, T = 7. The trained proprietary network $\Phi_p$ can be used for three-dimensional reconstruction of the target human body in an offline video and/or an online video stream. The trained proprietary network $\Phi_p$ can capture the exclusive features of the target human body and has good robustness to occlusion.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form a combined embodiment without departing from the logic of the principle, which is limited by the space, and the detailed description of the present disclosure is omitted. Those skilled in the art will appreciate that in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their function and possibly their inherent logic.
In addition, the present disclosure also provides a three-dimensional reconstruction apparatus, an electronic device, a computer-readable storage medium, and a program, which can be used to implement any one of the three-dimensional reconstruction methods provided by the present disclosure, and corresponding technical solutions and technical effects can be referred to in corresponding descriptions of the method sections and are not described in detail again.
Fig. 4 shows a block diagram of a three-dimensional reconstruction apparatus provided by an embodiment of the present disclosure. As shown in fig. 4, the three-dimensional reconstruction apparatus includes:
an obtaining module 41, configured to obtain an image to be processed corresponding to a target object;
a first prediction module 42, configured to predict the image to be processed by using a first neural network, so as to obtain a first pose parameter and a first shape parameter corresponding to the target object in the image to be processed, where the first neural network is initialized in advance based on a second neural network and is trained according to a first image sequence corresponding to the target object, and the second neural network is trained in advance according to a plurality of second image sequences corresponding to a plurality of objects;
a generating module 43, configured to generate a first three-dimensional model of the target object in the image to be processed according to the first pose parameter and the first shape parameter.
In one possible implementation, the first prediction module 42 is configured to:
extracting a first feature of the image to be processed and a second feature of the image to be processed by utilizing a first neural network;
fusing the first feature and the second feature to obtain a third feature;
and determining a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed based on the third feature.
In a possible implementation manner, the first neural network is further configured to predict and obtain a first camera parameter corresponding to the image to be processed;
the device comprises:
and the projection module is used for projecting the first three-dimensional model to a two-dimensional plane according to the first camera parameter to obtain a two-dimensional model corresponding to the first three-dimensional model.
In one possible implementation, the apparatus further includes:
a second prediction module, configured to, for any training image in the first image sequence, predict the training image by using the first neural network, so as to obtain at least one of the following: a second pose parameter corresponding to the target object in the training image, a second shape parameter corresponding to the target object in the training image, and a second camera parameter corresponding to the training image;
a determining module, configured to determine a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter;
and the training module is used for training the first neural network according to the values of the loss functions corresponding to the training images in the first image sequence.
In one possible implementation manner, the loss function corresponding to the training image includes a first loss function corresponding to the training image;
the determination module is to:
obtaining first coordinate prediction data of the two-dimensional key point set of the target object in the training image according to the second posture parameter, the second shape parameter and the second camera parameter;
performing two-dimensional attitude estimation on the training image to obtain a two-dimensional attitude estimation result corresponding to the target object in the training image, wherein the two-dimensional attitude estimation result comprises a coordinate label of a two-dimensional key point set of the target object in the training image;
and determining a value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label.
In one possible implementation, the determining module is configured to:
generating a second three-dimensional model of the target object in the training image according to the second posture parameter and the second shape parameter;
obtaining second coordinate prediction data of the three-dimensional key point set of the target object in the training image through regression according to the second three-dimensional model;
and projecting the second coordinate prediction data to a two-dimensional plane according to the second camera parameters to obtain first coordinate prediction data of the two-dimensional key point set of the target object in the training image.
In one possible implementation manner, the two-dimensional attitude estimation result further includes confidence degrees corresponding to the key points in the coordinate tags one to one;
the determination module is to:
and determining the value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label and the confidence coefficient.
In one possible implementation manner, the loss function corresponding to the training image includes a second loss function corresponding to the training image;
the determination module is to:
and judging the rationality of the second posture parameter and the second shape parameter to obtain a value of a second loss function corresponding to the training image.
In one possible implementation, the determining module is configured to:
and carrying out rationality judgment on the second posture parameter and the second shape parameter by using a pre-trained third neural network to obtain a value of a second loss function corresponding to the training image.
In a possible implementation manner, the loss function corresponding to the training image includes a third loss function corresponding to the training image;
the determination module is to:
under the condition that the training image is not the first image in the first image sequence, obtaining a third shape parameter corresponding to an image before the training image in the first image sequence;
and determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter.
In a possible implementation manner, the third shape parameter is an average shape parameter corresponding to a previous image of the training image, and the average shape parameter corresponding to a first image in the first image sequence is a shape parameter corresponding to the target object in the image predicted by the first neural network;
the determination module is to:
determining an average shape parameter corresponding to the training image according to the weighted sum of the second shape parameter and the third shape parameter;
determining a value of a third loss function corresponding to the training image according to a difference between the second shape parameter and an average shape parameter corresponding to the training image.
In a possible implementation manner, the loss function corresponding to the training image includes a fourth loss function corresponding to the training image;
the determination module is to:
obtaining third coordinate prediction data of the three-dimensional keypoint set of the target object in an image previous to the training image, if the training image is not the first image in the first image sequence;
and determining a value of a fourth loss function corresponding to the training image according to a difference between second coordinate prediction data and third coordinate prediction data of the three-dimensional key point set of the target object in the training image.
In one possible implementation, the first neural network includes a first sub-network and a second sub-network, the second neural network includes a third sub-network and a fourth sub-network, the network structure of the first sub-network is the same as the network structure of the third sub-network, the initial parameters of the first sub-network are the same as the parameters of the third sub-network, the network structure of the second sub-network is the same as the network structure of the fourth sub-network, and the initial parameters of the second sub-network are the same as the parameters of the fourth sub-network.
In a possible implementation, the first neural network further includes the third sub-network and a fusion layer, wherein the fusion layer is connected to the first sub-network, the third sub-network and the second sub-network, respectively, and parameters of the third sub-network are kept fixed during training of the first neural network;
the first prediction module 42 is configured to:
extracting a first feature of the image to be processed by using the first sub-network, and extracting a second feature of the image to be processed by using the third sub-network;
fusing the first feature and the second feature through the fusion layer to obtain a third feature;
and predicting the third feature by using the second sub-network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed.
In the embodiment of the present disclosure, a second neural network (i.e., a general network) is obtained by training according to a plurality of second image sequences corresponding to a plurality of objects, then the first neural network is initialized based on the second neural network, and the first neural network is trained according to the first image sequence corresponding to the target object, so that a first neural network (i.e., an exclusive network) that is more suitable for the target object and can capture the exclusive features of the target object can be obtained. The first neural network obtained through training is adopted to process the image to be processed corresponding to the target object, and the accuracy of the first posture parameter and the first shape parameter corresponding to the target object obtained through prediction can be improved. And generating a three-dimensional model of the target object in the image to be processed based on the first posture parameter and the first shape parameter, so that the accuracy of the generated three-dimensional model can be improved.
In some embodiments, functions or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementations and technical effects thereof may refer to the description of the above method embodiments, which are not described herein again for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-described method. The computer-readable storage medium may be a non-volatile computer-readable storage medium, or may be a volatile computer-readable storage medium.
The embodiment of the present disclosure also provides a computer program, which includes computer readable code, and when the computer readable code runs in an electronic device, a processor in the electronic device executes the computer program to implement the method described above.
The embodiments of the present disclosure also provide a computer program product for storing computer readable instructions, which when executed cause a computer to execute the operations of the three-dimensional reconstruction method provided in any one of the embodiments.
An embodiment of the present disclosure further provides an electronic device, including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the memory-stored executable instructions to perform the above-described method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 5 illustrates a block diagram of an electronic device 800 provided by an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 5, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (Wi-Fi), a second generation mobile communication technology (2G), a third generation mobile communication technology (3G), a fourth generation mobile communication technology (4G)/long term evolution of universal mobile communication technology (LTE), a fifth generation mobile communication technology (5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 6 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 6, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical-user-interface-based operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry may execute the computer-readable program instructions, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (17)

1. A method of three-dimensional reconstruction, comprising:
acquiring an image to be processed corresponding to a target object;
predicting the image to be processed by utilizing a first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed, wherein the first neural network is initialized in advance based on a second neural network and is obtained by training according to a first image sequence corresponding to the target object, and the second neural network is obtained by training in advance according to a plurality of second image sequences corresponding to a plurality of objects;
and generating a first three-dimensional model of the target object in the image to be processed according to the first posture parameter and the first shape parameter.
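For illustration only, the following is a minimal PyTorch sketch of the inference path described in claim 1: a regressor network predicts the first posture parameter and the first shape parameter from the image to be processed, and a parametric model turns them into a mesh. The tiny backbone, the 72-dimensional pose and 10-dimensional shape sizes, and the linear LinearBodyModel stand-in are assumptions made for the example and are not specified by the patent.
```python
import torch
import torch.nn as nn

class PoseShapeRegressor(nn.Module):
    """First neural network: maps an image to pose and shape parameters."""
    def __init__(self, pose_dim=72, shape_dim=10):
        super().__init__()
        self.backbone = nn.Sequential(          # tiny stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pose_head = nn.Linear(64, pose_dim)
        self.shape_head = nn.Linear(64, shape_dim)

    def forward(self, image):
        feat = self.backbone(image)
        return self.pose_head(feat), self.shape_head(feat)

class LinearBodyModel(nn.Module):
    """Placeholder parametric model: vertices as a linear function of (pose, shape).
    A real system would use a statistical body model here; this is only a stand-in."""
    def __init__(self, num_vertices=6890, pose_dim=72, shape_dim=10):
        super().__init__()
        self.basis = nn.Linear(pose_dim + shape_dim, num_vertices * 3)
        self.num_vertices = num_vertices

    def forward(self, pose, shape):
        verts = self.basis(torch.cat([pose, shape], dim=-1))
        return verts.view(-1, self.num_vertices, 3)

# usage: predict parameters for one image and build the first three-dimensional model
net, body = PoseShapeRegressor(), LinearBodyModel()
image = torch.rand(1, 3, 224, 224)          # image to be processed
pose, shape = net(image)                    # first posture / shape parameters
vertices = body(pose, shape)                # first three-dimensional model, shape (B, V, 3)
```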
2. The method according to claim 1, wherein the predicting the image to be processed by using the first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed comprises:
extracting a first feature of the image to be processed and a second feature of the image to be processed by utilizing the first neural network;
fusing the first feature and the second feature to obtain a third feature;
and determining a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed based on the third feature.
3. The method according to claim 1 or 2, wherein the first neural network is further configured to predict a first camera parameter corresponding to the image to be processed;
after the generating of the first three-dimensional model of the target object in the image to be processed, the method further comprises:
and projecting the first three-dimensional model to a two-dimensional plane according to the first camera parameter to obtain a two-dimensional model corresponding to the first three-dimensional model.
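A short sketch of the projection step in claim 3, assuming a weak-perspective camera parameterization (a scale plus an in-plane translation); the patent does not fix a particular camera model, so this is only one plausible reading.
```python
import torch

def weak_perspective_project(points_3d: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
    """points_3d: (B, N, 3) vertices or key points; cam: (B, 3) = [s, tx, ty].
    Returns (B, N, 2) coordinates on the two-dimensional plane."""
    s = cam[:, None, 0:1]              # (B, 1, 1) scale
    t = cam[:, None, 1:3]              # (B, 1, 2) translation in the image plane
    return s * points_3d[..., :2] + t  # drop depth, then scale and shift

# usage
pts = torch.rand(1, 6890, 3)
cam = torch.tensor([[1.1, 0.02, -0.05]])   # first camera parameter (assumed form)
pts_2d = weak_perspective_project(pts, cam)  # two-dimensional model, shape (B, N, 2)
```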
4. The method according to any one of claims 1 to 3, wherein before the predicting the image to be processed by using the first neural network, the method further comprises:
for any training image in the first image sequence, predicting the training image by using the first neural network to obtain at least one of the following: a second pose parameter corresponding to the target object in the training image, a second shape parameter corresponding to the target object in the training image, and a second camera parameter corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter;
and training the first neural network according to the values of the loss functions corresponding to the training images in the first image sequence.
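The per-frame fine-tuning described in claim 4 could take a form such as the following sketch; the optimizer, learning rate, number of steps per frame, and the placeholder total_loss are assumptions, with the individual loss terms sketched after claims 7, 9, 11 and 12 below.
```python
import torch

def total_loss(pose, shape, cam):
    """Placeholder combining the loss terms of claims 5 to 12 (reprojection,
    plausibility, shape consistency, temporal smoothness); a trivial stand-in here."""
    return (pose ** 2).mean() + (shape ** 2).mean() + (cam ** 2).mean()

def finetune_on_sequence(net, frames, lr=1e-5, steps_per_frame=1):
    """net: a regressor returning (pose, shape) for an image batch, e.g. the sketch
    after claim 1; frames: iterable of image tensors from the first image sequence."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for image in frames:                                  # any training image
        for _ in range(steps_per_frame):
            pose, shape = net(image)                      # second pose / shape parameters
            cam = torch.zeros(image.shape[0], 3)          # stand-in for a camera head
            loss = total_loss(pose, shape, cam)           # value of the loss function
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net
```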
5. The method of claim 4, wherein the loss function for the training image comprises a first loss function for the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
obtaining first coordinate prediction data of the two-dimensional key point set of the target object in the training image according to the second pose parameter, the second shape parameter and the second camera parameter;
performing two-dimensional pose estimation on the training image to obtain a two-dimensional pose estimation result corresponding to the target object in the training image, wherein the two-dimensional pose estimation result comprises a coordinate label of a two-dimensional key point set of the target object in the training image;
and determining a value of a first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label.
6. The method of claim 5, wherein obtaining first coordinate prediction data for a two-dimensional keypoint set of the target object in the training image based on the second pose parameter, the second shape parameter, and the second camera parameter comprises:
generating a second three-dimensional model of the target object in the training image according to the second pose parameter and the second shape parameter;
obtaining second coordinate prediction data of the three-dimensional key point set of the target object in the training image through regression according to the second three-dimensional model;
and projecting the second coordinate prediction data to a two-dimensional plane according to the second camera parameters to obtain first coordinate prediction data of the two-dimensional key point set of the target object in the training image.
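One way to realize the regression-and-projection of claim 6 is sketched below, assuming a fixed joint-regressor matrix W that maps mesh vertices to three-dimensional key points, together with the weak-perspective camera from the sketch after claim 3; both are assumptions rather than details given by the patent.
```python
import torch

def regress_keypoints_3d(vertices: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """vertices: (B, V, 3) mesh of the second three-dimensional model; W: (K, V)
    per-keypoint vertex weights. Returns the second coordinate prediction data (B, K, 3)."""
    return torch.einsum('kv,bvc->bkc', W, vertices)

def project_keypoints(kp3d: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
    """Weak-perspective projection to the first coordinate prediction data (B, K, 2)."""
    return cam[:, None, 0:1] * kp3d[..., :2] + cam[:, None, 1:3]

# usage with random stand-ins
verts = torch.rand(1, 6890, 3)
W = torch.rand(24, 6890)
W = W / W.sum(dim=1, keepdim=True)           # each keypoint is a convex combination
kp2d = project_keypoints(regress_keypoints_3d(verts, W), torch.tensor([[1.0, 0.0, 0.0]]))
```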
7. The method of claim 5 or 6, wherein the two-dimensional pose estimation result further comprises confidence levels in one-to-one correspondence with the key points in the coordinate label;
determining a value of a first loss function corresponding to the training image according to a coordinate difference between the first coordinate prediction data and a corresponding key point in the coordinate label, including:
and determining the value of the first loss function corresponding to the training image according to the coordinate difference between the first coordinate prediction data and the corresponding key point in the coordinate label, and the confidence levels.
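A possible form of the first loss function of claims 5 and 7 is a confidence-weighted distance between the predicted and labeled two-dimensional key points, as in the sketch below; the L1 distance and the averaging are assumptions.
```python
import torch

def reprojection_loss(kp2d_pred: torch.Tensor,
                      kp2d_label: torch.Tensor,
                      confidence: torch.Tensor) -> torch.Tensor:
    """kp2d_pred, kp2d_label: (B, K, 2); confidence: (B, K) in [0, 1]."""
    per_kp = (kp2d_pred - kp2d_label).abs().sum(dim=-1)   # (B, K) coordinate difference
    return (confidence * per_kp).mean()                    # confidence-weighted average

# usage
pred = torch.rand(1, 24, 2); label = torch.rand(1, 24, 2); conf = torch.rand(1, 24)
loss1 = reprojection_loss(pred, label, conf)               # value of the first loss function
```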
8. The method of any one of claims 4 to 7, wherein the loss function corresponding to the training image comprises a second loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
and performing a plausibility judgment on the pair of the second pose parameter and the second shape parameter to obtain a value of a second loss function corresponding to the training image.
9. The method according to claim 8, wherein the performing a plausibility judgment on the pair of the second pose parameter and the second shape parameter to obtain a value of a second loss function corresponding to the training image comprises:
and performing the plausibility judgment on the second pose parameter and the second shape parameter by using a pre-trained third neural network to obtain the value of the second loss function corresponding to the training image.
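Claims 8 and 9 describe a plausibility judgment by a pre-trained third neural network; a discriminator-style sketch is given below, where the MLP architecture and the least-squares objective are assumptions made for illustration.
```python
import torch
import torch.nn as nn

class PoseShapeDiscriminator(nn.Module):
    """Third neural network: scores how plausible a (pose, shape) pair looks."""
    def __init__(self, pose_dim=72, shape_dim=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim + shape_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))                    # scalar plausibility score

    def forward(self, pose, shape):
        return self.mlp(torch.cat([pose, shape], dim=-1))

def plausibility_loss(disc: PoseShapeDiscriminator,
                      pose: torch.Tensor, shape: torch.Tensor) -> torch.Tensor:
    # encourage the regressor to produce pairs the frozen discriminator rates as real
    score = disc(pose, shape)
    return ((score - 1.0) ** 2).mean()

# usage: the discriminator stays frozen while the first neural network is trained
disc = PoseShapeDiscriminator().eval()
for p in disc.parameters():
    p.requires_grad_(False)
loss2 = plausibility_loss(disc, torch.rand(1, 72), torch.rand(1, 10))
```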
10. The method of any one of claims 4 to 9, wherein the loss function corresponding to the training image comprises a third loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
in a case where the training image is not the first image in the first image sequence, obtaining a third shape parameter corresponding to the previous image of the training image in the first image sequence;
and determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter.
11. The method according to claim 10, wherein the third shape parameter is an average shape parameter corresponding to the previous image of the training image, and the average shape parameter corresponding to the first image in the first image sequence is the shape parameter, predicted by the first neural network, corresponding to the target object in that image;
determining a value of a third loss function corresponding to the training image according to the second shape parameter and the third shape parameter includes:
determining an average shape parameter corresponding to the training image according to the weighted sum of the second shape parameter and the third shape parameter;
determining a value of a third loss function corresponding to the training image according to a difference between the second shape parameter and an average shape parameter corresponding to the training image.
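Claims 10 and 11 describe a shape-consistency term driven by a running average shape parameter; the sketch below keeps such an average with a weighted sum, where the mixing weight 0.9 and the squared-error penalty are assumptions.
```python
import torch

class ShapeConsistency:
    """Tracks the average shape parameter over the first image sequence and
    yields the third loss function for each frame."""
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.avg_shape = None        # average shape parameter of the previous image

    def __call__(self, shape: torch.Tensor) -> torch.Tensor:
        if self.avg_shape is None:                       # first image in the sequence
            self.avg_shape = shape.detach()
            return shape.new_zeros(())
        # weighted sum of the second (current) and third (previous average) shape parameters
        self.avg_shape = (self.momentum * self.avg_shape
                          + (1.0 - self.momentum) * shape.detach())
        return ((shape - self.avg_shape) ** 2).mean()    # value of the third loss function

# usage across consecutive frames
consistency = ShapeConsistency()
loss3_t0 = consistency(torch.rand(1, 10))   # zero for the first frame
loss3_t1 = consistency(torch.rand(1, 10))
```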
12. The method according to any one of claims 4 to 11, wherein the loss function corresponding to the training image comprises a fourth loss function corresponding to the training image;
determining a value of a loss function corresponding to the training image according to at least one of the second pose parameter, the second shape parameter, and the second camera parameter, including:
in a case where the training image is not the first image in the first image sequence, obtaining third coordinate prediction data of the three-dimensional key point set of the target object in the previous image of the training image;
and determining a value of a fourth loss function corresponding to the training image according to a difference between second coordinate prediction data and third coordinate prediction data of the three-dimensional key point set of the target object in the training image.
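The fourth loss function of claim 12 compares the three-dimensional key points of consecutive frames; a minimal sketch follows, with the squared-error distance chosen only for illustration.
```python
import torch

def temporal_keypoint_loss(kp3d_curr: torch.Tensor, kp3d_prev: torch.Tensor) -> torch.Tensor:
    """kp3d_curr: second coordinate prediction data of the current training image;
    kp3d_prev: third coordinate prediction data of the previous image; both (B, K, 3)."""
    return ((kp3d_curr - kp3d_prev.detach()) ** 2).sum(dim=-1).mean()

# usage
loss4 = temporal_keypoint_loss(torch.rand(1, 24, 3), torch.rand(1, 24, 3))
```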
13. The method of any one of claims 1 to 12, wherein the first neural network comprises a first sub-network and a second sub-network, the second neural network comprises a third sub-network and a fourth sub-network, the network structure of the first sub-network is the same as that of the third sub-network, the initial parameters of the first sub-network are the same as the parameters of the third sub-network, the network structure of the second sub-network is the same as that of the fourth sub-network, and the initial parameters of the second sub-network are the same as the parameters of the fourth sub-network.
14. The method of claim 13, wherein the first neural network further comprises the third sub-network and a fusion layer, wherein the fusion layer is connected to the first sub-network, the third sub-network, and the second sub-network, respectively, and wherein parameters of the third sub-network remain fixed during training of the first neural network;
the predicting the image to be processed by using the first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed includes:
extracting a first feature of the image to be processed by using the first sub-network, and extracting a second feature of the image to be processed by using the third sub-network;
fusing the first feature and the second feature through the fusion layer to obtain a third feature;
and predicting the third feature by using the second sub-network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed.
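Claims 13 and 14 describe a two-branch structure with a trainable first sub-network, a frozen third sub-network, a fusion layer and a second sub-network acting as the prediction head; the sketch below uses concatenation followed by a linear layer as the fusion layer, which, along with the layer sizes, is an assumption rather than a detail given by the patent.
```python
import copy
import torch
import torch.nn as nn

class PersonalizedRegressor(nn.Module):
    """First neural network: trainable copy plus frozen copy of a generic encoder."""
    def __init__(self, generic_encoder: nn.Module, feat_dim=64, pose_dim=72, shape_dim=10):
        super().__init__()
        self.first_subnet = copy.deepcopy(generic_encoder)    # initialized from, then trained
        self.third_subnet = generic_encoder                   # parameters remain fixed
        for p in self.third_subnet.parameters():
            p.requires_grad_(False)
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)       # fusion layer (concatenate + project)
        self.second_subnet = nn.Linear(feat_dim, pose_dim + shape_dim)
        self.pose_dim = pose_dim

    def forward(self, image):
        f1 = self.first_subnet(image)                         # first feature
        with torch.no_grad():
            f2 = self.third_subnet(image)                     # second feature
        f3 = torch.relu(self.fusion(torch.cat([f1, f2], dim=-1)))  # third feature
        out = self.second_subnet(f3)
        return out[:, :self.pose_dim], out[:, self.pose_dim:]  # posture and shape parameters

# usage with a stand-in generic encoder producing 64-dimensional features
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
net = PersonalizedRegressor(encoder)
pose, shape = net(torch.rand(1, 3, 224, 224))
```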
15. A three-dimensional reconstruction apparatus, comprising:
the acquisition module is used for acquiring an image to be processed corresponding to the target object;
the first prediction module is used for predicting the image to be processed by utilizing a first neural network to obtain a first posture parameter and a first shape parameter corresponding to the target object in the image to be processed, wherein the first neural network is initialized in advance based on a second neural network and is obtained by training according to a first image sequence corresponding to the target object, and the second neural network is obtained by training in advance according to a plurality of second image sequences corresponding to a plurality of objects;
and the generating module is used for generating a first three-dimensional model of the target object in the image to be processed according to the first posture parameter and the first shape parameter.
16. An electronic device, comprising:
one or more processors;
a memory for storing executable instructions;
wherein the one or more processors are configured to invoke the executable instructions stored in the memory to perform the method of any one of claims 1 to 14.
17. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 14.
CN202110660956.4A 2021-06-15 2021-06-15 Three-dimensional reconstruction method and device, electronic equipment and storage medium Pending CN113379896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660956.4A CN113379896A (en) 2021-06-15 2021-06-15 Three-dimensional reconstruction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110660956.4A CN113379896A (en) 2021-06-15 2021-06-15 Three-dimensional reconstruction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113379896A true CN113379896A (en) 2021-09-10

Family

ID=77574279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660956.4A Pending CN113379896A (en) 2021-06-15 2021-06-15 Three-dimensional reconstruction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113379896A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677572A (en) * 2022-04-08 2022-06-28 北京百度网讯科技有限公司 Object description parameter generation method and deep learning model training method
WO2023082822A1 (en) * 2021-11-10 2023-05-19 北京地平线信息技术有限公司 Image data processing method and apparatus

Similar Documents

Publication Publication Date Title
CN111462238B (en) Attitude estimation optimization method and device and storage medium
CN111401230B (en) Gesture estimation method and device, electronic equipment and storage medium
CN111783986A (en) Network training method and device and posture prediction method and device
CN109410276B (en) Key point position determining method and device and electronic equipment
CN109446912B (en) Face image processing method and device, electronic equipment and storage medium
CN109840917B (en) Image processing method and device and network training method and device
CN111553864A (en) Image restoration method and device, electronic equipment and storage medium
CN112785672B (en) Image processing method and device, electronic equipment and storage medium
CN111666917A (en) Attitude detection and video processing method and device, electronic equipment and storage medium
CN114445562A (en) Three-dimensional reconstruction method and device, electronic device and storage medium
CN113379896A (en) Three-dimensional reconstruction method and device, electronic equipment and storage medium
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
CN112581358A (en) Training method of image processing model, image processing method and device
CN112991381A (en) Image processing method and device, electronic equipment and storage medium
CN113822798B (en) Method and device for training generation countermeasure network, electronic equipment and storage medium
CN114067087A (en) AR display method and apparatus, electronic device and storage medium
CN111325786B (en) Image processing method and device, electronic equipment and storage medium
CN112837372A (en) Data generation method and device, electronic equipment and storage medium
WO2023051356A1 (en) Virtual object display method and apparatus, and electronic device and storage medium
CN113052874B (en) Target tracking method and device, electronic equipment and storage medium
CN114266305A (en) Object identification method and device, electronic equipment and storage medium
CN115272151A (en) Image processing method, device, equipment and storage medium
CN114549983A (en) Computer vision model training method and device, electronic equipment and storage medium
CN114565962A (en) Face image processing method and device, electronic equipment and storage medium
CN109543544B (en) Cross-spectrum image matching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination