CN115578511A

CN115578511A - Semi-supervised single-view 3D object reconstruction method

Info

Publication number: CN115578511A
Application number: CN202211149378.9A
Authority: CN
Inventors: 邢桢; 吴祖煊; 姜育刚
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2022-09-21
Filing date: 2022-09-21
Publication date: 2023-01-06

Abstract

The invention belongs to the technical field of computer-vision 3D reconstruction, and particularly relates to a semi-supervised single-view 3D object reconstruction method. The method first trains a neural network with a small number of labeled samples, then uses the trained network to generate pseudo labels for the unlabeled samples and uses these pseudo labels to guide training on the unlabeled samples. A discriminator scores the quality of the generated pseudo labels and limits the bias that low-quality pseudo labels introduce into model training. In addition, the invention provides an attention-based prototype shape prior module that serves as a bridge between the image and the 3D shape, reduces the gap between the two modalities, supplies a shape prior to the neural network, and keeps the reconstructed 3D shape natural. Compared with current mainstream methods, when little labeled data is available the method reconstructs a single image in 3D more accurately, and the generated 3D shapes are more realistic and natural.

Description

Semi-supervised single-view 3D object reconstruction method

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a semi-supervised single-view 3D object reconstruction method.

Background

Reconstructing the 3D shape of an object from a single RGB image is of practical importance in many computer vision tasks, including 3D printing, virtual reality and 3D scene understanding. The human visual system stores powerful shape priors, which make it easy for a human to infer the 3D shape of an object from a single image. For computers this remains a great challenge, especially when inferring the shape of an object whose back surface is occluded, and accurately reconstructing the 3D shape of an object from a single RGB image is still an open technical problem.

Although some conventional geometry-based methods, such as Structure from Motion (SfM) [1] and Simultaneous Localization and Mapping (SLAM) [2], are viable solutions, they all rely on large-scale data labeling and on solving for camera parameters.

In recent years, with extensive research on deep learning, deep neural networks have been used with great success to predict 3D shapes from single-view images. Voxel-based studies ([3], [4], [5], [6]) represent the 3D object as voxels, a form well suited to convolutional neural networks and Transformer models. These methods mainly follow an encoder-decoder design: the encoder extracts image features, the decoder regresses the 3D shape of the object from the deep features extracted from the image, and the network is generally trained with a supervised cross-entropy loss. The methods differ in network structure, which has evolved from the early LSTM-based method [7], to later models based on convolutional neural networks [8] [9], and then to Transformer-based models [10] [11].

There are also 3D reconstruction approaches based on generative adversarial networks (GANs), in which the reconstruction network is trained adversarially: the generator reconstructs the 3D shape of the object, and the discriminator constrains the reconstructed object to be realistic and natural. Other approaches include mesh reconstruction based on graph neural networks and methods based on depth maps or silhouette maps. Although these deep-learning-based approaches have achieved considerable success, they still face two challenges: (1) Reconstruction accuracy depends on a large number of labeled fine-grained 3D shapes, and such datasets are time-consuming and costly to obtain. (2) Inferring the shape of an object from a single image is an ill-posed problem, because a single 2D image may admit many plausible 3D shape interpretations.

Disclosure of Invention

In view of the above deficiencies of the prior art, the object of the present invention is to provide a semi-supervised single-view 3D object reconstruction method that reconstructs the 3D shape of an object from a given single RGB image. The invention uses semi-supervised learning to relieve the labeling cost: with only half of the labeled samples it achieves accuracy comparable to full supervision, which addresses the shortage of labeled samples. In addition, neural-network-based methods have difficulty learning a strong shape prior when labeled samples are few. The invention therefore introduces a prototype shape prior module as an intermediate bridge connecting the two modalities, the 2D image and the 3D shape, to compensate for the weak prior and the inaccurate reconstruction caused by insufficient samples. Compared with methods that require an external memory network or depth maps, the prototype shape prior module is lightweight, can easily be transferred to neural networks of different structures, and improves the accuracy and naturalness of the reconstructed object.

The invention provides a semi-supervised single-view 3D object reconstruction method, which is divided into three stages, specifically as follows:

Stage 1: supervised warm-up training with labeled data

A neural network model consisting of a generator and a first discriminator is trained on a small number of labeled samples, wherein the generator is used for generating a 3D shape, and the first discriminator is used for judging the quality of the generated 3D shape; wherein:

the generator comprises an image feature encoding module, a prototype shape prior module and a 3D decoding generation module; the image feature encoding module extracts multi-scale image features from an input RGB image; the prototype shape prior module computes cross attention between the image features output by the image feature encoding module and the prototype shape features to extract prior features; and the 3D decoding generation module takes the image features output by the image feature encoding module and the prior features extracted by the prototype shape prior module as input and outputs the predicted 3D object voxels;

the 3D object is represented as voxels, in the form $I(i, j, k) \in \{0, 1\}$, where $i, j, k$ are the 3D coordinates of a point; $I(i, j, k) = 1$ indicates that the point is a voxel belonging to the object, and otherwise the point is empty;
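As an illustration of this binary voxel representation, the following minimal sketch builds an occupancy grid for a sphere. The function name `sphere_voxels`, the resolution of 32 and the radius are assumptions chosen for the example and do not come from the invention.

```python
import numpy as np

def sphere_voxels(resolution=32, radius=10.0):
    """Return a binary occupancy grid I(i, j, k) in {0, 1} for a centered sphere."""
    # Voxel-center coordinates, shifted so that the grid is centered at the origin.
    coords = np.arange(resolution) - (resolution - 1) / 2.0
    i, j, k = np.meshgrid(coords, coords, coords, indexing="ij")
    # 1 where the voxel belongs to the object, 0 where the position is empty.
    return (i**2 + j**2 + k**2 <= radius**2).astype(np.uint8)

grid = sphere_voxels()
```

Here `grid[i, j, k] == 1` marks a voxel that belongs to the object, which matches the convention stated above.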

the first discriminator is a shape naturalness discrimination module that judges whether the voxel distribution of a 3D object output by the 3D decoding generation module belongs to a labeled sample or a predicted sample, which keeps the distribution of the generated reconstructed objects consistent with that of the labeled samples;

Stage 2: teacher-student co-training, in which pseudo labels are generated for the unlabeled data and the unlabeled data are trained with a consistency loss

The parameters of the generator trained in the first stage are copied to obtain two networks, one serving as the teacher model and the other as the student model. The teacher model generates pseudo labels for the unlabeled samples. The student model is trained on all labeled and unlabeled samples in the second stage: labeled samples use their labels as training targets, and unlabeled samples use the pseudo labels generated by the teacher model as training targets for updating the network parameters. As the student model learns, the parameters of the teacher model are updated by the exponential moving average (EMA) algorithm given by the following formula, which yields the trained teacher model;

$$\theta_t \leftarrow \alpha \theta_t + (1 - \alpha) \theta_s$$

where $\theta_t$ denotes the parameters of the teacher network, $\theta_s$ denotes the parameters of the student network, and $\alpha$ is the momentum coefficient;

the invention uses the EMA algorithm to update the teacher model parameters after every training step, so the teacher model can be regarded as an ensemble of the student models at multiple time steps.
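The EMA update can be sketched as follows. This is a minimal illustration and makes several assumptions: the parameters are held in plain dictionaries of NumPy arrays, the function name `ema_update` is hypothetical, and the momentum value of 0.9 is chosen only for the example (the text does not state a value for α).

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Apply theta_t <- alpha * theta_t + (1 - alpha) * theta_s to every parameter."""
    for name, theta_s in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * theta_s
    return teacher

# Toy parameters (assumed values, for illustration only).
teacher = {"w": np.array([1.0, 2.0])}
student = {"w": np.array([3.0, 4.0])}
ema_update(teacher, student, alpha=0.9)
```

Only the teacher changes in this step; the student parameters are updated separately by gradient descent on the training loss.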

Stage 3: test reconstruction phase

The teacher model trained in the second stage is used as the final inference model; the RGB image to be tested is input into the network, which directly regresses and predicts the 3D shape of the object end to end.

In the first stage, the image feature encoding module uses a ResNet model, and the 3D decoding generation module uses 3D deconvolution layers.

In the first stage, the prototype shape prior module extracts prior features through a multi-head attention model. It first computes attention weights between the image features and the prototype shape features, thereby learning how strongly the image features correspond to different parts of the prototype shapes; it then multiplies these weights by the prototype shape feature values, extracting the features of the attended parts as the prior features of the current image. The prior features are extracted as follows:

$$Q = W_q \cdot \mathrm{Query}, \quad K = W_k \cdot \mathrm{Key}, \quad V = W_v \cdot \mathrm{Value}$$

$$\mathrm{Prior\ feature} = \mathrm{MultiHeadAttention}(Q, K, V)$$

where $W_q$, $W_k$ and $W_v$ are learnable matrix parameters, and $C$ and $D$ are the dimensions of the image features and the prototype features, respectively.
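The cross-attention step can be sketched in NumPy as shown below. This is a simplified illustration under stated assumptions: the projection shapes (`W_q` of shape C × d, `W_k` and `W_v` of shape D × d), the function name `prototype_prior`, the small toy dimensions, and the omission of the output projection that full multi-head attention layers usually apply are all choices made for the example, not details taken from the invention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prototype_prior(img_feat, proto_feat, W_q, W_k, W_v, num_heads=2):
    """img_feat: (N, C) queries; proto_feat: (M, D) keys/values; returns (N, d)."""
    Q = img_feat @ W_q    # (N, d)
    K = proto_feat @ W_k  # (M, d)
    V = proto_feat @ W_v  # (M, d)
    d_head = Q.shape[1] // num_heads
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # Attention weights of each image feature over the M prototypes.
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))  # (N, M)
        heads.append(attn @ V[:, s])                           # weighted prototype values
    return np.concatenate(heads, axis=1)

# Toy sizes (assumed): C = 8, D = 6, hidden size d = 4, 3 image features, 5 prototypes.
rng = np.random.default_rng(0)
img_feat, proto_feat = rng.normal(size=(3, 8)), rng.normal(size=(5, 6))
W_q, W_k, W_v = rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
prior = prototype_prior(img_feat, proto_feat, W_q, W_k, W_v)
```

Each row of `prior` is a mixture of projected prototype features, weighted by how well the prototype keys match the image query.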

In the first stage, prototype shape features are extracted from the labeled samples by an autoencoder, and the KMeans clustering method is then used to generate cluster centers for each category; these cluster centers are called prototype shapes. The autoencoder consists of 3D convolution layers and 3D deconvolution layers.
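The clustering step can be sketched as below for one category. The sketch assumes the autoencoder features have already been extracted into an array; the function names, the farthest-point initialization, the number of iterations and the toy 2D features are assumptions for illustration, since the text does not specify these details.

```python
import numpy as np

def init_centers(features, k):
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centers = [features[0]]
    for _ in range(k - 1):
        # Squared distance of every sample to its nearest already-chosen center.
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    return np.array(centers, dtype=np.float64)

def kmeans(features, k, num_iters=20):
    """Lloyd's algorithm: returns (cluster centers, label of each sample)."""
    centers = init_centers(features, k)
    for _ in range(num_iters):
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers, labels

# Toy encoder features for one category (assumed values): two well-separated groups.
features = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                     [10, 10], [10, 11], [11, 10], [11, 11]], dtype=np.float64)
prototypes, labels = kmeans(features, k=2)
```

In the method described above, each row of `prototypes` would play the role of one prototype shape feature for the category.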

In the first stage, the shape naturalness discrimination module consists mainly of stacked 3D convolution layers. It uses 3D convolutions to extract features of the generated shape and judges the quality of the generated shape with the feature distribution of the labels as the target.

In the first stage, the loss function of the neural network model consists of two parts: a reconstruction loss $L_{rec}$ and an adversarial loss $L_d$. The reconstruction loss $L_{rec}$ is the BCE loss, and the adversarial loss $L_d$ is:

$$L_d = \mathbb{E}_{y_g \sim D_g}[\log D(y_g)] + \mathbb{E}_{y_p \sim D_p}[\log(1 - D(y_p))]$$

where $D_p$ and $D_g$ are the distributions of the predicted 3D shapes and the annotated 3D shapes, $y_p$ and $y_g$ are a predicted sample and a labeled sample respectively, and $D$ is the first discriminator;

the loss function of the final neural network model is:

$$\min_{\theta_f} \max_{\theta_d} \; L_{rec}(\theta_f) + \lambda_d L_d(\theta_f, \theta_d)$$

where $\theta_f$ and $\theta_d$ are the parameters of the generator and the first discriminator respectively, and $\lambda_d$ is the weighting factor of the adversarial loss.
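The two loss terms can be sketched numerically as follows. The sketch assumes that the adversarial loss takes the standard GAN form, log D(y_g) + log(1 − D(y_p)) averaged over a batch, and that the discriminator outputs probabilities in (0, 1). The function names and the value `lambda_d=0.1` are hypothetical and not taken from the invention.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted occupancy probabilities and labels."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def adversarial_loss(d_real, d_fake, eps=1e-7):
    """L_d = mean(log D(y_g)) + mean(log(1 - D(y_p))) (standard GAN form, assumed)."""
    d_real = np.clip(d_real, eps, 1 - eps)
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

def stage1_objectives(pred, target, d_real, d_fake, lambda_d=0.1):
    l_rec = bce(pred, target)
    l_d = adversarial_loss(d_real, d_fake)
    generator_loss = l_rec + lambda_d * l_d  # minimized with respect to the generator
    discriminator_loss = -l_d                # the discriminator maximizes L_d
    return generator_loss, discriminator_loss

# Toy values (assumed): one voxel prediction and one discriminator score per sample type.
g_loss, d_loss = stage1_objectives(
    pred=np.array([0.5]), target=np.array([1.0]),
    d_real=np.array([0.5]), d_fake=np.array([0.5]),
)
```

In an actual training loop the two objectives would be optimized alternately, each with respect to its own set of parameters.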

In the second stage, the loss used when training the student model on labeled samples is the BCE loss.

In the second stage, the pseudo labels generated by the teacher model are input into a second discriminator of the adversarial network, which produces a pseudo label quality score that is used as a weight term; this prevents low-quality pseudo labels from biasing the training of the student model on unlabeled samples. When the student model is trained on unlabeled samples, the unsupervised loss is:

$$L_{unsup} = \frac{1}{n} \sum_{i=1}^{n} score_i \cdot \lVert y_i - \hat{y}_i \rVert_2^2$$

where $\hat{y}_i$ is the pseudo label generated by the teacher network, $y_i$ is the prediction of the student network, $score_i$ is the quality score that the second discriminator assigns to $\hat{y}_i$, and $n$ is the batch size.
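The score-weighted unsupervised loss can be sketched as follows. The sketch assumes that the per-sample L2 term is the squared difference summed over all voxels and that the weighted terms are averaged over the batch; the function name `unsupervised_loss` and the toy values are hypothetical.

```python
import numpy as np

def unsupervised_loss(student_pred, pseudo_label, scores):
    """Mean over the batch of score_i * ||y_i - yhat_i||^2 (squared L2 over voxels)."""
    n = student_pred.shape[0]
    diff = (student_pred - pseudo_label).reshape(n, -1)
    per_sample = (diff ** 2).sum(axis=1)  # squared L2 distance of each sample
    return float(np.mean(scores * per_sample))

# Toy batch (assumed): n = 2 samples on a 2x2x2 grid, with quality scores 1.0 and 0.5.
student_pred = np.zeros((2, 2, 2, 2))
pseudo_label = np.ones((2, 2, 2, 2))
scores = np.array([1.0, 0.5])
loss = unsupervised_loss(student_pred, pseudo_label, scores)
```

A pseudo label with a low score contributes proportionally less to the loss, which is how low-quality pseudo labels are kept from dominating training.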

In the second stage, for each unlabeled sample, the input of the teacher model is a weakly augmented version of the image, and the input of the student model is a strongly augmented version of the image.

The inventors compute the mean IoU (intersection over union) between the 3D shape predicted by the model and the ground-truth 3D shape as the quantitative evaluation metric of the reconstructed shape. The method is compared with the fully supervised method and with semi-supervised image recognition methods extended to this task, and it outperforms both. In addition, the reconstructed objects are visualized and compared qualitatively with those of other methods; the results of the method are visibly more realistic and natural, and closer to the ground-truth annotated shapes. In summary, the beneficial effects of the invention are as follows:

1. By using semi-supervised learning, the method addresses the shortage of labeled samples and the high labeling cost in the 3D reconstruction task. With only a small number of labeled samples, it reconstructs a single image in 3D more accurately and produces more realistic and natural 3D shapes, matching or even exceeding the fully supervised method, which is of great significance for the single-view 3D reconstruction task.

2. The inherent difference between the two modalities, the 2D image and the 3D shape, can make direct regression by a neural network perform poorly. To address this, the invention introduces an attention-based prototype shape prior module. The module effectively supplements shape prior features, is lightweight, and requires no additional information or storage space; it gives the neural network an intermediate bridge for inferring the 3D shape from 2D image features and reduces the gap between the two modalities.

3. To address the problem of evaluating the quality of pseudo labels generated for unlabeled samples in semi-supervised learning, the invention provides a pseudo label quality scorer based on the discriminator of a generative adversarial network, so that samples with low pseudo label quality can be filtered out and do not bias model training. In addition, compared with a conventional loss function that uses only the supervised loss, the proposed loss function makes training more stable and yields better final reconstruction results.

Drawings

Fig. 1 is a flow chart of semi-supervised single view reconstruction.

Fig. 2 is a schematic diagram of the semi-supervised single-view 3D reconstruction method based on prototype shape priors proposed by the present invention. The method comprises two training stages. The first stage trains a generator of 3D shapes and a discriminator of shape quality scores on a small number of labeled samples, and comprises four key modules: an image feature encoding module, a prototype shape prior module, a 3D decoding generation module and a shape naturalness discrimination module. The first three modules together form the generator, and the fourth module is the first discriminator, which is used to construct the adversarial loss. In the second stage, the generator from the first stage is copied into a teacher model and a student model, and the network is trained with different target labels and loss functions for the unlabeled samples and the labeled samples.

Fig. 3 is a diagram of the prototype shape prior network structure in the present invention. The image feature encoding module extracts image features from the 2D image, which serve as the Query; a 3D encoder extracts prototype shape features from the different prototype shapes, which serve as the Key and Value; the shape prior feature of the current image is finally obtained by computing cross attention.

Fig. 4 shows results of the invention on the synthetic dataset ShapeNet. For such noise-free images with clean backgrounds, the method of the present invention produces smoother object surfaces than other methods.

Fig. 5 shows results of the invention on the natural-image dataset Pix3D. For images with complex backgrounds and heavy occlusion, the shape prior helps infer the true shape of the object, and the reconstruction is far better than that of other methods.

Detailed Description

The technical solution of the invention is explained in detail below with reference to the drawings and embodiments.

To reduce the workload of image-to-3D-shape labeling, the invention proposes training the network model with only a small number of labeled samples; that is, the models involved in the invention are trained in a semi-supervised manner (Fig. 1). All 3D objects in the invention use voxels as a unified 3D model representation, in the form $I(i, j, k) \in \{0, 1\}$, where $i, j, k$ are the 3D coordinates of a point; $I(i, j, k) = 1$ indicates that the spatial position of the point is a voxel belonging to the object, and otherwise the position is empty. Voxels are a common representation in 3D reconstruction methods and are well suited to computation and inference with convolutional neural networks.

The invention provides a 3D reconstruction system based on prototype shape priors (Fig. 2). It is built on deep learning and consists mainly of four modules: (1) an image feature encoding module; (2) a prototype shape prior module; (3) a 3D decoding generation module; and (4) a shape naturalness discrimination module. Given a single RGB image, the image feature encoding module first extracts image features; the prototype shape prior module then computes cross attention between the image features and the prototype shape features of the category to extract prior features; finally, the image features and the prior features are concatenated and fed to the 3D decoding generation module, which predicts the 3D shape of the object. In addition, the shape naturalness discrimination module judges the generated 3D shape, which improves the stability and naturalness of the generated shapes.

In the invention, the image feature encoding module and the 3D decoding generation module follow existing single-view 3D reconstruction models based on voxel representation. The image feature encoding module uses a ResNet model, which improves model performance while reducing the number of parameters and the amount of computation. The prior features are fed into the 3D decoding generation module together with the image features; although their dimensionality is lower than that of the image features, they contribute substantially to model accuracy.

The prototype shape prior module (Fig. 3) takes the image features as input, computes weights between the image features and the prototype shape priors of the category, and extracts the required part-level feature information from each prototype as the shape prior features of the image. For this weighting and prior extraction, the image features serve as the Query and the shape features of each prototype serve as the Key and Value, and the prior features are extracted through a multi-head attention model.

The shape naturalness discrimination module consists mainly of stacked 3D convolution layers. It uses 3D convolutions to extract features of the generated shape and judges the quality of the generated shape with the feature distribution of the labels as the target.

In the first stage, the loss function consists of two parts: the binary cross-entropy (BCE) reconstruction loss and an added adversarial loss. The image feature encoding module, the prototype shape prior module and the 3D decoding generation module of the network together act as the generator, and the shape naturalness discrimination module acts as the first discriminator. The adversarial loss is added to the network to make training more stable and to keep the distribution of the generated reconstructed objects consistent with that of the labeled samples. The final first-stage loss function is therefore a weighted sum of the two parts.

The second stage of the invention mainly trains on the unlabeled samples. The specific steps are as follows:

(1) First, the generator trained in the first stage is copied into two networks, a Teacher model and a Student model. The Teacher model generates pseudo labels, and the Student model is the model being trained.

(2) For unlabeled samples, the input of the Teacher model is a weakly augmented version of the image, and the input of the Student model is a strongly augmented version.

(3) The output of the Teacher model serves as a pseudo label that guides the training of the Student model. In the second stage, the parameters of the discriminator module are frozen and do not take part in training; the pseudo label generated by the Teacher model is fed into the discriminator module to produce a quality score for the pseudo label.

(4) The quality score of each pseudo label is used as a weight term in the unsupervised loss function: pseudo labels with higher quality scores receive higher weights and those with lower scores receive lower weights, which prevents low-quality pseudo labels from biasing the model.

(5) The smoother L2 loss is used as the unsupervised loss function, weighted by the pseudo label quality score obtained in the previous step. The supervised loss function is the BCE loss, as in the first stage. The loss of this stage is a weighted sum of the two parts.

In the test stage, the Teacher model obtained from second-stage training is used as the final test model. A test image is input into the model, which generates the corresponding 3D model end to end, predicting the 3D shape of the object.

The technical scheme of the invention is further illustrated by the following embodiments.

Example 1

The model accepts static images of a fixed size of 224 × 224 as input to the visual branch. Images of other sizes are rescaled to this size. The invention uses two different data augmentation schemes: weak data augmentation comprises center cropping, random noise and random background; strong data augmentation comprises random cropping, random noise, random background, random flipping, color flipping, RGB channel conversion and the like.
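The difference between the two augmentation strengths can be sketched as below. This is a partial illustration under assumptions: images are float arrays in [0, 1] of shape (H, W, 3), the noise levels are arbitrary, RGB channel conversion is interpreted as a random channel permutation, and random background and color flipping are omitted because they would need an object mask and further details that the text does not give.

```python
import numpy as np

def center_crop(img, size):
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def weak_augment(img, size, rng, noise_std=0.01):
    """Center crop plus mild Gaussian noise (noise level is an assumed value)."""
    out = center_crop(img, size).astype(np.float64)
    return np.clip(out + rng.normal(0.0, noise_std, out.shape), 0.0, 1.0)

def strong_augment(img, size, rng, noise_std=0.05):
    """Random crop, random horizontal flip, channel permutation, stronger noise."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img[top:top + size, left:left + size].astype(np.float64)
    if rng.random() < 0.5:
        out = out[:, ::-1]              # horizontal flip
    out = out[..., rng.permutation(3)]  # permute the RGB channels
    return np.clip(out + rng.normal(0.0, noise_std, out.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))                    # stand-in for an unlabeled RGB image
teacher_input = weak_augment(image, 224, rng)        # fed to the teacher model
student_input = strong_augment(image, 224, rng)      # fed to the student model
```

Both views come from the same unlabeled image, so the teacher output on the weak view can serve as the target for the student output on the strong view.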

The prototype shapes are extracted by training an autoencoder. First, an autoencoder composed of 3D convolution layers is trained on the small set of labeled samples, and the output of its encoder is used as the feature of a 3D shape. Then features are extracted from all labeled samples with the autoencoder, and the KMeans clustering method is used to generate the cluster centers of each category; these cluster centers are called prototype shapes.

In the prototype shape prior module, the image features serve as the Query, the hidden dimension is set to 2048, and the number of heads of the multi-head attention mechanism is 2. Under the cross-attention mechanism, the shape priors embedded in the prototype shape features are queried with the image features; the output serves as the shape prior feature of the current image and is fed into the decoder together with the image features, which helps the network reconstruct a more realistic and plausible 3D shape efficiently and accurately.

The image feature encoding module and the 3D shape decoder are designed similarly to [9]. The backbone network of the encoder is ResNet-50, followed by three 2D convolution layers with batch normalization layers and a ReLU activation layer. Because 3D reconstruction is a dense prediction task and different stages of ResNet attend to features at different levels, the invention concatenates the outputs of the last three layers of ResNet as the extracted multi-scale image features and then reconstructs the object in 3D from these multi-scale features. For a voxel resolution of 32³, the decoder consists of five 3D deconvolution layers, each followed by a batch normalization layer and a ReLU activation layer; for a voxel resolution of 128³, the decoder consists of seven 3D deconvolution layers with batch normalization layers and ReLU activation layers.

The shape naturalness discrimination module is a discriminator that decides whether the distribution of 3D voxels is real (labeled samples) or fake (predicted samples). For a voxel resolution of 32³, the discriminator consists of four 3D convolution layers with max pooling layers and ReLU activation layers; for a resolution of 128³, it consists of five 3D convolution layers with max pooling layers and ReLU activation layers. These are followed by two linear layers and a Sigmoid activation layer.

Evaluation method. For voxel-based 3D reconstruction models, a reference 3D voxel shape is available, so IoU (intersection over union) is used as the evaluation metric:

$$\mathrm{IoU} = \frac{\sum_{i,j,k} I(\hat{p}_{(i,j,k)} > t)\, I(p_{(i,j,k)})}{\sum_{i,j,k} I\left[ I(\hat{p}_{(i,j,k)} > t) + I(p_{(i,j,k)}) \right]}$$

where $\hat{p}_{(i,j,k)}$ and $p_{(i,j,k)}$ are the predicted value and the ground-truth value of the $(i, j, k)$-th voxel cell, $t$ is a manually defined threshold, and $I$ is an indicator function that is 1 if its condition is met and 0 otherwise.
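The IoU metric for a single predicted grid can be sketched as follows. The threshold value `t=0.3`, the function name `voxel_iou`, and the convention of returning 1.0 when both grids are empty are assumptions of this sketch rather than values stated in the text.

```python
import numpy as np

def voxel_iou(pred_prob, gt, t=0.3):
    """IoU between a thresholded occupancy prediction and a binary ground truth."""
    pred_occ = pred_prob > t  # I(p_hat > t)
    gt_occ = gt > 0.5         # I(p), ground-truth occupancy
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return float(inter) / float(union) if union > 0 else 1.0

# Toy 1x2x2 grid (assumed values): two voxels predicted occupied, two truly occupied,
# with one voxel in common.
pred = np.array([[[0.9, 0.8], [0.1, 0.2]]])
gt = np.array([[[1, 0], [1, 0]]])
iou = voxel_iou(pred, gt, t=0.3)
```

The mean IoU reported for a dataset would then be the average of this value over all test samples.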

Fig. 4 shows results of the invention on the synthetic dataset ShapeNet. For such noise-free images with clean backgrounds, the method of the present invention produces smoother object surfaces than other methods.

Fig. 5 shows results of the invention on the natural-image dataset Pix3D. For images with complex backgrounds and heavy occlusion, the shape prior helps infer the true shape of the object, and the reconstruction is far better than that of other methods.

The experimental results of the method on the synthetic dataset ShapeNet and the natural-image dataset Pix3D are shown in Tables 1 and 2, respectively.

TABLE 1

Method \ labeling proportion	1%	5%	10%	20%
Supervised [8]	41.13	50.32	53.99	58.06
MeanTeacher [12]	43.36 (+2.23)	51.92 (+1.60)	55.93 (+1.94)	58.88 (+0.82)
MixMatch [13]	41.77 (+0.64)	51.23 (+0.91)	52.62 (-1.37)	57.43 (-0.63)
FixMatch [14]	42.44 (+1.31)	51.89 (+1.57)	55.79 (+1.80)	59.63 (+1.57)
Method of the invention	46.99 (+5.86)	55.23 (+4.91)	58.98 (+4.99)	61.64 (+3.58)
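The values in parentheses in Table 1 are consistent with absolute differences from the Supervised [8] row at the same labeling proportion. The short check below recomputes them from the table entries; reading the entries as mean IoU in percent is an interpretation based on the evaluation metric described above.

```python
# Table 1 entries at labeling proportions 1%, 5%, 10% and 20%.
supervised = [41.13, 50.32, 53.99, 58.06]
results = {
    "MeanTeacher": [43.36, 51.92, 55.93, 58.88],
    "MixMatch": [41.77, 51.23, 52.62, 57.43],
    "FixMatch": [42.44, 51.89, 55.79, 59.63],
    "Method of the invention": [46.99, 55.23, 58.98, 61.64],
}

# Gain of each method over the supervised baseline, rounded to two decimals.
gains = {
    name: [round(v - s, 2) for v, s in zip(values, supervised)]
    for name, values in results.items()
}
```

The recomputed gains can be compared directly with the parenthesized values in the table.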

TABLE 2

According to Tables 1 and 2, the model provided by the invention clearly improves the performance of 3D reconstruction and the quality of the reconstructed shapes.

References

[1] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016).

[2] Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robotics (2016).

[3] Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B.: Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In: NeurIPS (2016).

[4] Michalkiewicz, M., Parisot, S., Tsogkas, S., Baktashmotlagh, M., Eriksson, A., Belilovsky, E.: Few-shot single-view 3-d object reconstruction with compositional priors. In: ECCV (2020).

[5] Yang, G., Cui, Y., Belongie, S., Hariharan, B.: Learning single-view 3d reconstruction with limited pose supervision. In: ECCV (2018).

[6] Wu, J., Zhang, C., Zhang, X., Zhang, Z., Freeman, W.T., Tenenbaum, J.B.: Learning shape priors for single-view 3d completion and reconstruction. In: ECCV (2018).

[7] Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In: ECCV (2016).

[8] Xie, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In: ICCV (2019).

[9] Xie, H., Yao, H., Zhang, S., Zhou, S., Sun, W.: Pix2vox++: multi-scale context aware 3d object reconstruction from single and multiple images. IJCV (2020).

[10] Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: CVPR (2019).

[11] Wang, D., Cui, X., Chen, X., Zou, Z., Shi, T., Salcudean, S., Wang, Z.J., Ward, R.: Multi-view 3d reconstruction with transformers. In: ICCV (2021).

[12] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS (2017).

[13] Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: A holistic approach to semi-supervised learning. In: NeurIPS (2019).

[14] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS (2020).

Claims

1. A semi-supervised single-view 3D object reconstruction method is characterized by comprising three stages:

the generator comprises an image feature coding module, a prototype shape prior module and a 3D decoding generation module; the image feature coding module is used for extracting multi-scale image features of an input RGB image, the prototype shape prior module is used for calculating cross attention according to the image features and the prototype shape features output by the image feature coding module to extract prior features, and the 3D decoding generation module is used for generating predicted 3D object voxels by taking the image features output by the image feature coding module and the prior features extracted by the prototype shape prior module as input;

the first discriminator is a natural shape discrimination module and is used for judging whether the 3D object voxel distribution output by the 3D decoding generation module belongs to a labeled sample or a predicted sample, and the consistency of the distribution of the generated reconstructed object sample and the label sample is ensured;

and a second stage: unsupervised data generating pseudo labels for unsupervised training of the teacher-student network co-learning training phase through consistency loss

Copying the neural network model parameters of the generator trained in the first stage to obtain two networks: one as a teacher model and the other as a student model; the teacher model is used for generating pseudo labels for the unlabelled samples, and the student model is responsible for training all samples with labels and without labels in the second stage, wherein the samples with labels are trained in a supervised mode, the labels are used as training targets, the samples without labels are trained in an unsupervised mode, the pseudo labels generated by the teacher model are used as the training targets, and parameters of the teacher model are updated and optimized through an EMA algorithm shown in the following formula along with the learning of the student model, so that the trained teacher model is obtained;

$$
\theta_t \leftarrow \alpha \theta_t + (1 - \alpha) \theta_s
$$

wherein $\theta_t$ denotes the parameters of the teacher network, $\theta_s$ denotes the parameters of the student network, and $\alpha$ is the momentum coefficient;

the third stage: test reconstruction phase

The teacher model trained in the second stage is taken as the final inference model; the RGB image to be tested is input into the network, and the 3D shape of the object is directly regressed in an end-to-end manner.
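The EMA update of the teacher parameters described in the second stage can be sketched as follows. This is a minimal illustration, not the patented implementation: parameters are held in plain dictionaries of float lists, and the function name `ema_update` and the default `alpha=0.99` are assumptions chosen for the example.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return teacher parameters after one EMA step: t <- alpha*t + (1-alpha)*s.

    `teacher` and `student` map parameter names to lists of floats.
    The dictionary layout and the default alpha are illustrative assumptions.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Both networks start as copies of the stage-one generator; after the student
# takes a gradient step, the teacher drifts slowly toward it.
teacher_params = {"w": [1.0, 2.0], "b": [0.0]}
student_params = {"w": [3.0, 0.0], "b": [1.0]}
teacher_params = ema_update(teacher_params, student_params, alpha=0.9)
```

With a momentum coefficient close to 1, the teacher changes slowly and acts as a smoothed average of recent student states, which is why it is the model kept for inference in the third stage.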

2. The semi-supervised single-view 3D object reconstruction method according to claim 1, wherein in the first stage, the image feature coding module adopts a ResNet model; the 3D decoding generation module adopts a 3D deconvolution layer.

3. The semi-supervised single-view 3D object reconstruction method of claim 1, wherein in the first stage, the prototype shape prior module extracts prior features through a multi-head attention model: first, attention weights between the image features and the prototype shape features are calculated, so as to learn the weights of the different parts of the prototype shapes that correspond to the image features; the weights are then multiplied by the prototype shape feature values, and the attended features of the different parts are extracted as the prior features of the current image; the prior features are extracted specifically as follows:

$$
Q = W_q \cdot \mathrm{Query}; \quad K = W_k \cdot \mathrm{Key}; \quad V = W_v \cdot \mathrm{Value}
$$

$$
\mathrm{Prior\ feature} = \mathrm{MultiHeadAttention}(Q, K, V)
$$

wherein $W_q$, $W_k$ and $W_v$ are learnable matrix parameters, and $C$ and $D$ are the dimensions of the image features and the prototype features, respectively.
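The attention step of claim 3 can be illustrated with a single-head, scaled dot-product cross attention written with plain Python lists. Several points are assumptions made for the sketch: the image features serve as the Query and the prototype shape features as both Key and Value (as the claim's description suggests), only one head is shown rather than several, scores are divided by the square root of the projected dimension, and features are stored as rows so the projections are applied on the right. The function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, proto_feats, w_q, w_k, w_v):
    """Single-head cross attention (assumed roles: image = Query, prototypes = Key/Value)."""
    q = matmul(image_feats, w_q)            # (N, d): projected image features
    k = matmul(proto_feats, w_k)            # (M, d): projected prototype keys
    v = matmul(proto_feats, w_v)            # (M, d): projected prototype values
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]    # (d, M): transpose of k
    scores = matmul(q, k_t)                 # (N, M): image-to-prototype similarity
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights      # prior features (N, d), attention weights


# Toy sizes: N=2 image tokens with C=3 channels, M=4 prototypes with D=2 channels.
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
proto_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w_q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # C x d
w_k = [[1.0, 0.0], [0.0, 1.0]]              # D x d
w_v = [[1.0, 0.0], [0.0, 1.0]]              # D x d
prior, weights = cross_attention(image_feats, proto_feats, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the prototypes, so every prior feature is a convex combination of projected prototype features; a multi-head variant would run several such projections in parallel and concatenate the results.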

4. The semi-supervised single-view 3D object reconstruction method according to claim 1, wherein in the first stage, prototype shape features are extracted from the labeled samples by an autoencoder (AutoEncoder), and a KMeans clustering method is then adopted to generate the cluster centers of each category, each cluster center being called a prototype shape; the autoencoder is composed of a 3D convolution layer and a 3D deconvolution layer.
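The clustering step of claim 4 can be sketched with a plain implementation of Lloyd's k-means algorithm. The sketch assumes the autoencoder has already produced one latent code per labeled shape of a given category; the 2-D codes below are made-up stand-ins, and the function name, the initialization from the first `k` points and the iteration cap are illustrative choices.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Hypothetical latent codes for the labeled shapes of one category.
codes = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
prototypes = kmeans(codes, k=2)
```

Running the clustering once per category yields that category's prototype shape features, which then act as the Key and Value inputs of the prior module.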

5. The semi-supervised single-view 3D object reconstruction method of claim 1, wherein in the first stage, the natural shape discrimination module is mainly formed by stacking 3D convolution layers; it uses 3D convolution to extract features of a generated shape and, taking the feature distribution of the labels as the target, judges the quality of the generated shape.

6. The semi-supervised single-view 3D object reconstruction method of claim 1, wherein in the first stage, the loss function of the neural network model consists of two parts: a reconstruction loss $L_{rec}$ and a generative adversarial loss $L_d$; the reconstruction loss $L_{rec}$ is the BCE (Binary Cross Entropy) loss, and the generative adversarial loss $L_d$ is:

wherein $D_p$ and $D_g$ are the distributions of the predicted 3D shapes and the labeled 3D shapes, $y_p$ and $y_g$ are a predicted sample and a labeled sample, respectively, and $D$ is the first discriminator;

the loss function of the final neural network model is:

wherein $\theta_f$ and $\theta_d$ are the parameters of the generator and the first discriminator, respectively, and $\lambda_d$ is the weighting factor of the generative adversarial loss.
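For orientation, a standard generative adversarial objective written in the symbols defined in claim 6 would take the form below. This is an illustrative assumption based on the usual GAN formulation and should not be read as the exact formulas of the claim.

```latex
% Assumed standard GAN form, using the symbols of claim 6
L_d = \mathbb{E}_{y_g \sim D_g}\left[\log D(y_g)\right]
    + \mathbb{E}_{y_p \sim D_p}\left[\log\left(1 - D(y_p)\right)\right]

% Assumed combined objective: the generator minimizes, the first discriminator maximizes
\min_{\theta_f} \max_{\theta_d} \; L_{rec} + \lambda_d L_d
```

Under this form, the discriminator is rewarded for assigning high scores to labeled shapes and low scores to predicted shapes, while the generator is pushed to produce shapes that the discriminator cannot tell apart from labeled ones.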

7. The semi-supervised single-view 3D object reconstruction method of claim 1, wherein in the second stage, the loss when the student model is trained with labeled samples is BCE loss.
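The BCE loss used for labeled samples can be sketched as below, treating a voxel grid as a flat list of occupancy probabilities compared against 0/1 ground-truth labels. The flattening, the function name and the clamping constant `eps` are assumptions made for the example.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted occupancy probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


# A maximally uncertain prediction versus a confident, correct one.
uncertain = bce_loss([0.5, 0.5], [1, 0])
confident = bce_loss([0.99, 0.01], [1, 0])
```

The loss falls as the predicted occupancy probabilities move toward the ground-truth voxel labels, which is the supervised signal the student model receives on labeled samples.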

8. The semi-supervised single-view 3D object reconstruction method according to claim 1, wherein in the second stage, the pseudo labels generated by the teacher model are input into a second discriminator of the adversarial network, which generates a quality score for each pseudo label; the score is used as a weight term to prevent low-quality pseudo labels from biasing the training of the student model on unlabeled samples; when the student model is trained with the unlabeled samples, the unsupervised loss is as follows:

wherein the pseudo label is generated by the teacher network, $y_i$ is the prediction of the student network, $\mathrm{score}_i$ is the score given to the pseudo label by the second discriminator, and $n$ is the batch size.
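The score-weighted consistency idea of claim 8 can be sketched as follows. The choice of per-sample mean squared error as the consistency term is an assumption made for illustration, as are the function name and the convention that scores lie in [0, 1]; the sketch only shows how a discriminator score can scale each sample's contribution before averaging over the batch.

```python
def weighted_unsup_loss(pseudo_labels, student_preds, scores):
    """Batch average of score_i times the mean squared error of sample i.

    pseudo_labels, student_preds: lists of flattened voxel grids (lists of floats).
    scores: one quality score per pseudo label (assumed to lie in [0, 1]).
    """
    n = len(scores)
    total = 0.0
    for pseudo, pred, score in zip(pseudo_labels, student_preds, scores):
        mse = sum((a - b) ** 2 for a, b in zip(pseudo, pred)) / len(pseudo)
        total += score * mse
    return total / n


# The second pseudo label receives score 0.0, so its large error is ignored.
pseudo = [[1.0, 0.0], [1.0, 1.0]]
preds = [[0.5, 0.5], [0.0, 0.0]]
loss = weighted_unsup_loss(pseudo, preds, scores=[1.0, 0.0])
```

A pseudo label judged unreliable therefore contributes little or nothing to the gradient, which limits the bias that a poor teacher prediction could introduce into the student.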

9. The semi-supervised single-view 3D object reconstruction method of claim 1, wherein in the second stage, for the unlabeled samples, the input of the teacher model is the sample image after weak data augmentation, and the input of the student model is the sample image after strong data augmentation.