CN112489218B - Single-view three-dimensional reconstruction system and method based on semi-supervised learning - Google Patents

Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Info

Publication number
CN112489218B
CN112489218B
Authority
CN
China
Prior art keywords
dimensional
image
layer
reconstruction
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011372323.5A
Other languages
Chinese (zh)
Other versions
CN112489218A (en)
Inventor
史金龙
白素琴
葛俊彦
钱强
成昌喜
钱萍
欧镇
田朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN202011372323.5A
Publication of CN112489218A
Application granted
Publication of CN112489218B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-view three-dimensional reconstruction system and method based on semi-supervised learning. The three-dimensional reconstruction network is first trained using an image dataset with pose and instance labels, and four loss functions are designed: the image reconstruction loss function $L_{rec}$, the three-dimensional reconstruction loss function $L_{gt}$, the pose-invariance loss function on the hidden-vector representation $L_{lv}$, and the pose-invariance loss function on the voxel grid $L_{vi}$. The network is then trained using an unlabeled image dataset. The invention adopts training image data with pose and instance labels to keep the three-dimensional shape reconstructed across views consistent and to constrain the similarity between rendered images of the reconstructed three-dimensional shape and the corresponding reference views; the unlabeled image data constrains the rationality of the reconstructed three-dimensional shape; the two image datasets are trained alternately and iteratively during training, so that the three-dimensional shape of an object can be reconstructed from a single view quickly, conveniently and without contact.

Description

Single-view three-dimensional reconstruction system and method based on semi-supervised learning
Technical Field
The invention belongs to the technical field of single-view three-dimensional reconstruction, and particularly relates to a single-view three-dimensional reconstruction system and a method based on semi-supervised learning.
Background Art
The ability to understand the three-dimensional structure of an object from a single image is a hallmark of the human visual system and a key step in visual reasoning and interaction. However, a single image by itself does not contain sufficient information for three-dimensional reconstruction. To reconstruct the three-dimensional shape of an object from a single image, a machine vision system must obtain valid shape prior information; for example, letting the system know that all automobiles have wheels allows the three-dimensional shape of an automobile to be reconstructed more accurately. A key issue is how the machine vision system obtains such prior information.
Current deep learning techniques trained on large-scale data allow a vision system to acquire three-dimensional prior knowledge of the objects to be reconstructed. One approach is to train a deep neural network on a dataset of three-dimensional shapes, but building such a dataset requires three-dimensional modeling experts or three-dimensional scanning tools, and the acquisition process is cumbersome and expensive. Another approach developed in recent years, easier than collecting three-dimensional models, is to train on calibrated multi-view images of the same scene and compare rendered views of the reconstructed three-dimensional shape against reference views for photometric consistency. In practice this is still very expensive: calibrated multi-view images of thousands of objects must be acquired, and annotators must label the camera parameters and the precise instance that each image depicts. Obtaining calibrated multi-view data for thousands of objects is likewise very difficult.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by alternately and iteratively training a three-dimensional reconstruction network on an image dataset with pose and instance labels and on an unlabeled image dataset, combining supervision with pose and category labels with the weak supervision of unlabeled images, and provides a single-view three-dimensional reconstruction system and method based on semi-supervised learning. The method combines the advantages of supervised and unsupervised strategies and greatly reduces dependence on labeled datasets.
In order to solve the technical problems, the invention adopts the following technical scheme.
The invention discloses a single-view three-dimensional reconstruction system based on semi-supervised learning, which comprises: an encoder E, a generator G, a discriminator D and a renderer P;
the encoder E: taking an image as input and generating the corresponding hidden vector representation, wherein its network structure comprises 6 two-dimensional convolution layers (Conv) and 3 fully connected layers (FC); the convolution kernel size of the 6 convolution layers is 5×5, the strides are 1, 2, 1 and 2, and the numbers of output channels (filters) are 128, 256, 512 and 512, respectively; a batch normalization (BN) layer and a ReLU activation function are arranged behind each convolution layer; the outputs of the 3 fully connected layers are 2048, 2048 and 1024 dimensions, respectively, and a BN layer and a ReLU activation function are arranged behind each fully connected layer; the encoder E finally outputs a 1024-dimensional feature vector;
the generator G: taking the hidden vector representation as input and generating a three-dimensional voxel grid, wherein its network structure comprises 1 fully connected layer (FC) and 3 three-dimensional transposed convolution layers (ConvT); the output of the fully connected layer is 256×4×4×4-dimensional, followed by a BN layer and a ReLU activation function; the kernel size of the three-dimensional transposed convolution layers is 5×5×5, the stride is 2, and the channel numbers are 256, 128, 64 and 1, with a BN layer and a ReLU activation function arranged after each three-dimensional transposed convolution layer;
the discriminator D: attempting to distinguish rendered views of the three-dimensional voxel grids output by the generator G from images in the dataset, so as to improve reconstruction quality, wherein its network structure comprises 4 two-dimensional convolution layers and 1 fully connected layer; the convolution kernel size of the 4 convolution layers is 5×5, the strides are all 1, and the numbers of output channels are 256, 512, 1024 and 2048, respectively; a layer normalization (LN) layer and a Leaky-ReLU activation function are arranged behind each convolution layer; the output of the fully connected layer is 1-dimensional, followed by a Sigmoid function; the final output of the discriminator D is the probability that the input image is a generated image;
the renderer P takes a three-dimensional voxel grid and a pose as input and outputs a rendered view of the corresponding viewing angle.
The invention discloses a single-view three-dimensional reconstruction method based on semi-supervised learning, which comprises the following steps:
Step 1, training the three-dimensional reconstruction network using an image dataset with pose and instance labels;
Step 2, designing four loss functions for training the three-dimensional reconstruction network: the image reconstruction loss function $L_{rec}$, the three-dimensional reconstruction loss function $L_{gt}$, the pose-invariance loss function on the hidden-vector representation $L_{lv}$, and the pose-invariance loss function on the voxel grid $L_{vi}$;
Step 3, training the three-dimensional reconstruction network using the unlabeled image dataset;
further, the step 1 includes:
1a. Suppose a pair of images $x_1, x_2$ of a three-dimensional object is acquired from two different poses $p_1$ and $p_2$; $x_1$ and $x_2$ are taken as inputs to the encoder E;
1b. The encoder E maps the two images to the hidden vector space, denoted $E(x_1)$ and $E(x_2)$;
1c. The generator G reconstructs three-dimensional voxel grids from the hidden vectors $E(x_1)$ and $E(x_2)$, denoted $G(E(x_1))$ and $G(E(x_2))$;
1d. The poses $p_1$, $p_2$ and the corresponding three-dimensional voxel grids $G(E(x_1))$, $G(E(x_2))$ are taken as inputs to the renderer P, which outputs rendered views of the corresponding viewing angles.
Further, the four loss functions described in step 2 are designed as follows:
(1) Image reconstruction loss function $L_{rec}$: the rendered view generated by projecting the reconstructed three-dimensional voxel grid under a given camera pose should be consistent with the reference image. Let $(x_1, p_1)$ and $(x_2, p_2)$ be two image/pose pairs sampled from one three-dimensional model; $E(x_1)$ denotes the hidden vector generated by the encoder E from the input image $x_1$; $G(E(x_1))$ denotes the three-dimensional shape reconstructed from the hidden vector $E(x_1)$. The rendered view generated by projecting the reconstructed three-dimensional shape towards camera pose $p_2$ should then be consistent with the input image $x_2$, and similarly for the other view. To express this consistency requirement, the reconstruction loss function $L_{rec}$ is defined as:

$L_{rec} = \|P(G(E(x_2)), p_1) - x_1\|_{1+2} + \|P(G(E(x_1)), p_2) - x_2\|_{1+2}$ (1)

where $\|\cdot\|_{1+2} = \|\cdot\|_1 + \|\cdot\|_2$ is the sum of the $\ell_1$- and $\ell_2$-regularized reconstruction losses;
(2) Three-dimensional reconstruction loss function $L_{gt}$: the three-dimensional voxel grids $G(E(x_1))$ and $G(E(x_2))$ should be consistent with the reference three-dimensional model $V_b$; the three-dimensional reconstruction consistency loss function $L_{gt}$ is defined as:

$L_{gt} = \|G(E(x_1)) - V_b\|_{1+2} + \|G(E(x_2)) - V_b\|_{1+2}$ (2)
(3) Pose-invariance loss function on the hidden-vector representation $L_{lv}$: given two randomly sampled views of an object, the encoder E should be able to ignore the poses of the images and map them to the same hidden vector; the invariance loss function of the hidden vector with respect to the image pose is defined as:

$L_{lv} = \|E(x_1) - E(x_2)\|_2$ (3)
(4) Pose-invariance loss function on the voxel grid $L_{vi}$: the three-dimensional voxel grids reconstructed by the generator G from two different views of the same object should be consistent; the voxel-grid-based pose-invariance loss $L_{vi}$ is defined as:

$L_{vi} = \|G(E(x_1)) - G(E(x_2))\|_{1+2}$ (4)
When training on the image dataset with pose labels, the network attempts to minimize the combined loss:

$L_{\text{semi-supervised}} = L_{rec} + \alpha L_{gt} + \beta L_{lv} + \gamma L_{vi}$ (5)

where $\alpha$, $\beta$ and $\gamma$ are the weights of $L_{gt}$, $L_{lv}$ and $L_{vi}$, respectively; $\alpha = \beta = \gamma = 0.1$ is used.
Further, training the three-dimensional reconstruction network using the unlabeled image dataset described in step 3 includes:
2a. The generator G reconstructs a three-dimensional voxel grid from a given hidden vector;
2b. The renderer P projects the three-dimensional voxel grid from a random viewing angle $p$, generating a projected rendered view;
2c. The losses of the generator G and the discriminator D are updated separately; following the idea of adversarial training, the discriminator D should become unable to distinguish the rendered views from the reference images;
the corresponding loss functions $L_D^{GAN}$ of the discriminator D and $L_G^{GAN}$ of the generator G are designed accordingly.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Training data for single-view three-dimensional reconstruction requires a reference three-dimensional shape or multi-view images of the object. Although it is relatively easy to collect multi-view images of small objects, it is difficult to do so for large scenes; furthermore, three-dimensional labeling requires designing three-dimensional CAD models or performing three-dimensional scanning with dedicated devices such as three-dimensional scanners, which is a huge effort. For some specific application scenarios, a three-dimensional reference shape cannot be obtained at all. Under these conditions, the invention comprehensively utilizes existing three-dimensional shape data, labeled multi-view data and unlabeled image data, greatly reduces the workload of acquiring multi-view images or three-dimensional labels, and effectively realizes single-view three-dimensional reconstruction.
2. Beyond the existing three-dimensional shape data and labeled multi-view data, images without instance or pose labels can also provide partial prior information for the reconstruction result. The invention therefore makes full use of such images to capture the distribution of the visual appearance of objects from the data, thereby adding a rationality constraint to the three-dimensional reconstruction result.
3. Reconstruction of unknown object classes is very difficult because no useful information about the unknown object can be obtained. Pose supervision can help in understanding unknown three-dimensional shapes, so using pose supervision can greatly improve reconstruction performance on unseen object classes.
4. The invention combines supervision with pose and category labels with the weak supervision of unlabeled images, greatly reducing dependence on labeled datasets.
Drawings
FIG. 1 is a method schematic of an embodiment of the present invention.
FIG. 2 is a functional block diagram of training the three-dimensional reconstruction network with an unlabeled image dataset in accordance with an embodiment of the present invention.
Fig. 3 is a schematic diagram of an encoder network architecture in accordance with an embodiment of the present invention.
Fig. 4 is a schematic diagram of a generator network architecture of an embodiment of the present invention.
Fig. 5 is a schematic diagram of the discriminator network architecture of an embodiment of the present invention.
Detailed Description
The invention relates to a single-view three-dimensional reconstruction system and method based on semi-supervised learning. The encoder takes an image as input and generates the corresponding hidden vector representation; the generator takes the hidden vector representation as input to reconstruct a three-dimensional voxel grid; the discriminator distinguishes rendered views of the three-dimensional voxel grid from reference views, improving reconstruction quality; the renderer takes the three-dimensional voxel grid and a viewpoint as input and outputs a rendered view of the corresponding viewing angle. The method adopts training image data with pose and instance labels to keep the three-dimensional shape reconstructed across views consistent and to constrain the similarity between rendered images of the reconstructed three-dimensional shape and the corresponding reference views; the unlabeled image data constrains the rationality of the reconstructed three-dimensional shape; the two image datasets are trained alternately and iteratively during training. The invention alternately and iteratively trains a three-dimensional reconstruction network on images with pose and instance labels and on unlabeled images, combining supervision with pose and category labels with the weak supervision of unlabeled images to form a single-view three-dimensional reconstruction method based on semi-supervised learning. The invention is a technique for quickly, conveniently and contactlessly reconstructing the three-dimensional shape of an object from a single view.
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a method schematic of an embodiment of the present invention. FIG. 2 is a functional block diagram of training the three-dimensional reconstruction network with an unlabeled image dataset in accordance with an embodiment of the present invention.
In order to efficiently use the two types of image datasets to train the three-dimensional reconstruction network, a system embodiment of the present invention includes: the encoder E, the generator G, the discriminator D and the renderer P. The encoder E in fig. 1 is a shared-parameter encoder, i.e. the two encoders shown are actually the same encoder. The generator G in fig. 1 and 2 is likewise a shared-parameter generator, i.e. the three generators shown are one and the same generator. The renderer in fig. 1 and 2 is also the same renderer.
The encoder E is used for taking an image as input and generating the corresponding hidden vector representation. Its network structure is shown in fig. 3, a schematic diagram of the encoder network structure according to an embodiment of the present invention; the encoder consists of 6 two-dimensional convolution layers (Conv) and 3 fully connected layers (FC). The convolution kernel size of the 6 convolution layers is 5×5, the strides are 1, 2, 1 and 2, and the numbers of output channels (filters) are 128, 256, 512 and 512, respectively; a batch normalization (BN) layer and a ReLU activation function follow each convolution layer. The outputs of the 3 fully connected layers are 2048, 2048 and 1024 dimensions, respectively, with a BN layer and a ReLU activation function behind each fully connected layer. The encoder E finally outputs a 1024-dimensional feature vector.
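For concreteness, the encoder described above can be sketched in PyTorch roughly as follows. This is a minimal sketch rather than the patented implementation: the input resolution, the padding, and the way the listed strides and filter counts pair up across all six layers are assumptions, since the text gives only the partial sequences quoted above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of encoder E: 6 Conv2d layers + 3 FC layers -> 1024-d hidden vector."""
    def __init__(self, in_ch=3):  # RGB input is an assumption
        super().__init__()
        layers = []
        # Assumed six-layer extension of the listed "strides 1, 2, 1, 2"
        # and "filters 128, 256, 512, 512" sequences.
        for out_ch, stride in [(128, 1), (128, 2), (256, 1), (256, 2), (512, 1), (512, 2)]:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=stride, padding=2),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True),
        )

    def forward(self, x):               # x: (N, 3, H, W)
        return self.fcs(self.convs(x))  # (N, 1024) hidden vector
```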
The generator G is configured to take the hidden vector representation as input and generate a three-dimensional voxel grid. Its network structure is shown in fig. 4, a schematic diagram of the generator network structure according to an embodiment of the present invention; the generator consists of 1 fully connected layer (FC) and 3 three-dimensional transposed convolution layers (ConvT). The output of the fully connected layer is 256×4×4×4-dimensional, followed by a BN layer and a ReLU activation function; the kernel size of the three-dimensional transposed convolution layers is 5×5×5, the stride is 2, and the channel numbers are 256, 128, 64 and 1, with a BN layer and a ReLU activation function after each three-dimensional transposed convolution layer.
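A matching sketch of the generator, reading the fully connected output as 256×4×4×4 (a 4×4×4 grid with 256 channels) that three stride-2 transposed convolutions upsample to a 32×32×32 voxel grid; this reading and the padding values are assumptions:

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of generator G: 1 FC layer + 3 ConvTranspose3d layers -> voxel grid."""
    def __init__(self, latent_dim=1024):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 256 * 4 * 4 * 4),
            nn.BatchNorm1d(256 * 4 * 4 * 4),
            nn.ReLU(inplace=True),
        )
        layers, in_ch = [], 256
        for out_ch in [128, 64, 1]:  # channel sequence 256 -> 128 -> 64 -> 1 from the text
            layers += [nn.ConvTranspose3d(in_ch, out_ch, kernel_size=5, stride=2,
                                          padding=2, output_padding=1),  # doubles each side
                       nn.BatchNorm3d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.deconvs = nn.Sequential(*layers)

    def forward(self, z):                      # z: (N, 1024) hidden vector
        v = self.fc(z).view(-1, 256, 4, 4, 4)  # 4x4x4 grid with 256 channels
        return self.deconvs(v)                 # (N, 1, 32, 32, 32) voxel grid
```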
The discriminator D attempts to distinguish rendered views of the three-dimensional voxel grids output by the generator G from images in the dataset, so as to improve reconstruction quality. Its network structure is shown in fig. 5, a schematic diagram of the discriminator network structure according to an embodiment of the present invention; the discriminator consists of 4 two-dimensional convolution layers and 1 fully connected layer. The convolution kernel size of the 4 convolution layers is 5×5, the strides are all 1, and the numbers of output channels are 256, 512, 1024 and 2048, respectively; a layer normalization (LN) layer and a Leaky-ReLU activation function follow each convolution layer. The output of the fully connected layer is 1-dimensional, followed by a Sigmoid function. The final output of the discriminator D is the probability that the input image is a generated image.
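A sketch of the discriminator under the same caveats; nn.GroupNorm with a single group stands in for layer normalization over convolutional feature maps, and the single input channel (silhouette-style views) is an assumption:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of discriminator D: 4 Conv2d layers + 1 FC layer -> probability."""
    def __init__(self, in_ch=1):
        super().__init__()
        layers = []
        for out_ch in [256, 512, 1024, 2048]:  # channels and stride 1 per the text
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=1, padding=2),
                       nn.GroupNorm(1, out_ch),          # LN-equivalent for conv maps
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())

    def forward(self, x):
        # probability that x is a generated (rendered) image, per the text
        return self.head(self.convs(x))
```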
The renderer P takes a three-dimensional voxel grid and a pose as input and outputs a rendered view of the corresponding viewing angle.
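The patent does not specify the internals of the renderer P. A minimal differentiable stand-in, which resamples the voxel grid into the camera frame and max-projects it along the depth axis, could look like the following; the (N, 3, 4) affine pose format and the orthographic projection are assumptions:

```python
import torch
import torch.nn.functional as F

def render(voxels, pose):
    """Sketch of a differentiable renderer P (assumed design, not the patent's).

    voxels: (N, 1, D, H, W) occupancy grid from the generator.
    pose:   (N, 3, 4) normalized affine transforms into the camera frame.
    """
    grid = F.affine_grid(pose, voxels.size(), align_corners=False)  # (N, D, H, W, 3)
    rotated = F.grid_sample(voxels, grid, align_corners=False)      # grid in camera frame
    view, _ = rotated.max(dim=2)   # orthographic max-projection over depth
    return view                    # (N, 1, H, W) rendered view
```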
As described above, using training image data with pose and instance labels during three-dimensional reconstruction training keeps the three-dimensional shapes reconstructed across views consistent and constrains the similarity between the rendered images of the reconstructed three-dimensional shape and the corresponding reference views.
As shown in fig. 1, an embodiment method of the present invention includes:
step 1, training a three-dimensional reconstruction network by using an image data set with gestures and instance marks.
Step 2, training a three-dimensional reconstruction network by using a label-free image data set, and designing four loss functions: image reconstruction loss function L rec Three-dimensional reconstruction loss function L gt Pose invariant loss function L represented by hidden vector lv Pose invariant loss function L on voxel grid vi
And 3, training the three-dimensional reconstruction network by using the unmarked image data set.
In step 1, the three-dimensional reconstruction network is trained using the image dataset with pose and instance labels; the principle is as follows:
1a. Suppose a pair of images $x_1, x_2$ of a three-dimensional object is acquired from two different poses $p_1$ and $p_2$; $x_1$ and $x_2$ are taken as inputs to the encoder E.
1b. The encoder E maps the two images to the hidden vector space, denoted $E(x_1)$ and $E(x_2)$.
1c. The generator G reconstructs three-dimensional voxel grids from the hidden vectors $E(x_1)$ and $E(x_2)$, denoted $G(E(x_1))$ and $G(E(x_2))$.
1d. The poses $p_1$, $p_2$ and the corresponding three-dimensional voxel grids $G(E(x_1))$, $G(E(x_2))$ are taken as inputs to the renderer P, which outputs rendered views of the corresponding viewing angles.
The three-dimensional voxel grid reconstructed by the generator G should satisfy two requirements: the reconstruction should be accurate, and the reconstruction result should not be affected by the pose of the input image. This requires the hidden vectors $E(x_1)$ and $E(x_2)$ to be invariant to the camera pose of the input images. To ensure that $E(x_1)$ is pose-invariant, the three-dimensional voxel grid $G(E(x_1))$ predicted from $E(x_1)$ is re-projected to the second viewpoint $p_2$, and the resulting projection image should remain consistent with the second input image $x_2$, and vice versa. To this end, the present invention designs four loss functions, including:
(1) Image reconstruction loss function $L_{rec}$: the rendered view generated by projecting the reconstructed three-dimensional voxel grid under a given camera pose should be consistent with the reference image. Let $(x_1, p_1)$ and $(x_2, p_2)$ be two image/pose pairs sampled from one three-dimensional model; $E(x_1)$ denotes the hidden vector generated by the encoder E from the input image $x_1$; $G(E(x_1))$ denotes the three-dimensional shape reconstructed from the hidden vector $E(x_1)$. The rendered view generated by projecting the reconstructed three-dimensional shape towards camera pose $p_2$ should then be consistent with the input image $x_2$, and similarly for the other view. To express this consistency requirement, the reconstruction loss function $L_{rec}$ is defined as:

$L_{rec} = \|P(G(E(x_2)), p_1) - x_1\|_{1+2} + \|P(G(E(x_1)), p_2) - x_2\|_{1+2}$ (1)

where $\|\cdot\|_{1+2} = \|\cdot\|_1 + \|\cdot\|_2$ is the sum of the $\ell_1$- and $\ell_2$-regularized reconstruction losses.
(2) Three-dimensional reconstruction loss function $L_{gt}$: the three-dimensional voxel grids $G(E(x_1))$ and $G(E(x_2))$ should be consistent with the reference three-dimensional model $V_b$; the three-dimensional reconstruction consistency loss function $L_{gt}$ is defined as:

$L_{gt} = \|G(E(x_1)) - V_b\|_{1+2} + \|G(E(x_2)) - V_b\|_{1+2}$ (2)
(3) Pose-invariance loss function on the hidden-vector representation $L_{lv}$: given two randomly sampled views of an object, the encoder E should be able to ignore the poses of the images and map them to the same hidden vector. The invariance loss function of the hidden vector with respect to the image pose is defined as:

$L_{lv} = \|E(x_1) - E(x_2)\|_2$ (3)
(4) Pose-invariance loss function on the voxel grid $L_{vi}$: the three-dimensional voxel grids reconstructed by the generator G from two different views of the same object should be consistent. The voxel-grid-based pose-invariance loss $L_{vi}$ is defined as:

$L_{vi} = \|G(E(x_1)) - G(E(x_2))\|_{1+2}$ (4)
The broken lines in fig. 1 illustrate the four loss functions described above. When training on the image dataset with pose labels, the network attempts to minimize the combined loss:

$L_{\text{semi-supervised}} = L_{rec} + \alpha L_{gt} + \beta L_{lv} + \gamma L_{vi}$ (5)

where $\alpha$, $\beta$ and $\gamma$ are the weights of $L_{gt}$, $L_{lv}$ and $L_{vi}$, respectively. The present invention uses $\alpha = \beta = \gamma = 0.1$.
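Transcribed directly, equations (1) to (5) amount to the following sketch, assuming the encoder E, generator G and renderer P are callables like the sketches above; summing each norm over the whole batch is an assumption about the reduction:

```python
def norm_1_2(t):
    """The ||.||_{1+2} penalty of equations (1)-(4): sum of the L1 and L2 norms."""
    return t.abs().sum() + t.pow(2).sum().sqrt()

def supervised_loss(E, G, P, x1, p1, x2, p2, Vb, alpha=0.1, beta=0.1, gamma=0.1):
    """Combined loss of equation (5) for a labeled pair (x1, p1), (x2, p2)
    with reference voxel model Vb."""
    z1, z2 = E(x1), E(x2)
    v1, v2 = G(z1), G(z2)
    L_rec = norm_1_2(P(v2, p1) - x1) + norm_1_2(P(v1, p2) - x2)  # eq. (1)
    L_gt  = norm_1_2(v1 - Vb) + norm_1_2(v2 - Vb)                # eq. (2)
    L_lv  = (z1 - z2).pow(2).sum().sqrt()                        # eq. (3)
    L_vi  = norm_1_2(v1 - v2)                                    # eq. (4)
    return L_rec + alpha * L_gt + beta * L_lv + gamma * L_vi     # eq. (5)
```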
FIG. 2 shows a functional block diagram of training the three-dimensional reconstruction network with the unlabeled image dataset. To let the unlabeled image data constrain the rationality of the reconstructed three-dimensional shape, in step 3 the present invention designs an adversarial training method that trains the three-dimensional reconstruction network using the unlabeled image dataset, complementing the four loss functions $L_{rec}$, $L_{gt}$, $L_{lv}$ and $L_{vi}$ designed in step 2. The principle is described as follows:
2a. The generator G reconstructs a three-dimensional voxel grid from a given hidden vector;
2b. The renderer P projects the three-dimensional voxel grid from a random viewing angle $p$, generating a projected rendered view;
2c. The losses of the generator G and the discriminator D are updated separately; following the idea of adversarial training, the discriminator D should become unable to distinguish the rendered views from the reference images. That is: no matter from which camera pose the reconstructed three-dimensional voxel grid is projected, the rendered image should look similar to the images in the dataset.
The loss functions $L_D^{GAN}$ of the discriminator D and $L_G^{GAN}$ of the generator G are updated accordingly.
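The formulas for $L_D^{GAN}$ and $L_G^{GAN}$ appear in the patent as images and are not reproduced in this text, so the sketch below substitutes the standard binary cross-entropy GAN objective, plainly an assumption, to illustrate steps 2a to 2c; D is read as outputting the probability that its input is a generated image:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def adversarial_step(E, G, P, D, x_unlabeled, random_pose, opt_EG, opt_D):
    """One adversarial update on an unlabeled batch (steps 2a-2c), assuming
    dataset images and rendered views share the same channel count."""
    # 2a/2b: reconstruct a voxel grid and render it from a random viewpoint
    fake_view = P(G(E(x_unlabeled)), random_pose)

    # 2c: update D to score rendered views as generated (1), dataset images as not (0)
    opt_D.zero_grad()
    d_real, d_fake = D(x_unlabeled), D(fake_view.detach())
    loss_D = bce(d_real, torch.zeros_like(d_real)) + bce(d_fake, torch.ones_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # then update E and G so the rendered view scores like a dataset image
    opt_EG.zero_grad()
    d_fake = D(fake_view)
    loss_G = bce(d_fake, torch.zeros_like(d_fake))
    loss_G.backward()
    opt_EG.step()
    return loss_D.item(), loss_G.item()
```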
in summary, the invention provides a single-view three-dimensional reconstruction method based on semi-supervised learning. The present invention considers two classes of training data sets: one type is a label-free image dataset, which may be downloaded from the internet, considering a large-scale image set of a class, but without any precise instance or pose labels. Although it is difficult to extract three-dimensional information from these images, distribution information of the visual appearance of the object can be captured from the class; the other is a labeled image dataset with poses and instances, considering other semantic classes of labeled images that do not tell us about nuances of a particular class, but which can describe the general shape of an object. For example, most shapes are compact, smooth, tend to be convex, and so on. The invention provides an effective single-view three-dimensional reconstruction method based on semi-supervision, which can effectively utilize all the information to train a three-dimensional reconstruction network in an alternating iterative mode and combines supervision with gesture and category marks with weak supervision of unmarked images. The technology is suitable for multiple fields of comprehensive ship guarantee, equipment virtual maintenance, interactive electronic technical manuals, movies, animation, virtual reality, augmented reality, industrial manufacturing and the like, can accurately acquire the three-dimensional shape of an object from a single image, and has wide market prospect.

Claims (4)

1. A single-view three-dimensional reconstruction system based on semi-supervised learning, comprising: an encoder E, a generator G, a discriminator D and a renderer P;
the encoder E: taking an image as input and generating the corresponding hidden vector representation, wherein its network structure comprises 6 two-dimensional convolution layers (Conv) and 3 fully connected layers (FC); the convolution kernel size of the 6 convolution layers is 5×5, the strides are 1, 2, 1 and 2, and the numbers of output channels (filters) are 128, 256, 512 and 512, respectively; a batch normalization (BN) layer and a ReLU activation function are arranged behind each convolution layer; the outputs of the 3 fully connected layers are 2048, 2048 and 1024 dimensions, respectively, and a BN layer and a ReLU activation function are arranged behind each fully connected layer; the encoder E finally outputs a 1024-dimensional feature vector;
the generator G: taking the hidden vector representation as input and generating a three-dimensional voxel grid, wherein its network structure comprises 1 fully connected layer (FC) and 3 three-dimensional transposed convolution layers (ConvT); the output of the fully connected layer is 256×4×4×4-dimensional, followed by a BN layer and a ReLU activation function; the kernel size of the three-dimensional transposed convolution layers is 5×5×5, the stride is 2, and the channel numbers are 256, 128, 64 and 1, with a BN layer and a ReLU activation function arranged after each three-dimensional transposed convolution layer;
the discriminator D: attempting to distinguish rendered views of the three-dimensional voxel grids output by the generator G from images in the dataset, so as to improve reconstruction quality, wherein its network structure comprises 4 two-dimensional convolution layers and 1 fully connected layer; the convolution kernel size of the 4 convolution layers is 5×5, the strides are all 1, and the numbers of output channels are 256, 512, 1024 and 2048, respectively; a layer normalization (LN) layer and a Leaky-ReLU activation function are arranged behind each convolution layer; the output of the fully connected layer is 1-dimensional, followed by a Sigmoid function; the final output of the discriminator D is the probability that the input image is a generated image;
the renderer P takes a three-dimensional voxel grid and a pose as input and outputs a rendered view of the corresponding viewing angle.
2. A single view three-dimensional reconstruction method based on semi-supervised learning using the system of claim 1, the method comprising:
step 1, training the three-dimensional reconstruction network using an image dataset with pose and instance labels;
step 2, designing four loss functions for training the three-dimensional reconstruction network: the image reconstruction loss function $L_{rec}$, the three-dimensional reconstruction loss function $L_{gt}$, the pose-invariance loss function on the hidden-vector representation $L_{lv}$, and the pose-invariance loss function on the voxel grid $L_{vi}$;
step 3, training the three-dimensional reconstruction network using the unlabeled image dataset;
the step 1 comprises the following steps:
1a. suppose a pair of images $x_1, x_2$ of a three-dimensional object is acquired from two different poses $p_1$ and $p_2$; $x_1$ and $x_2$ are taken as inputs to the encoder E;
1b. the encoder E maps the two images to the hidden vector space, denoted $E(x_1)$ and $E(x_2)$;
1c. the generator G reconstructs three-dimensional voxel grids from the hidden vectors $E(x_1)$ and $E(x_2)$, denoted $G(E(x_1))$ and $G(E(x_2))$;
1d. the poses $p_1$, $p_2$ and the corresponding three-dimensional voxel grids $G(E(x_1))$, $G(E(x_2))$ are taken as inputs to the renderer P, which outputs rendered views of the corresponding viewing angles.
3. The single-view three-dimensional reconstruction method based on semi-supervised learning as set forth in claim 2, wherein the four loss functions in step 2 are designed as follows:
(1) image reconstruction loss function $L_{rec}$: the rendered view generated by projecting the reconstructed three-dimensional voxel grid under a given camera pose should be consistent with the reference image; let $(x_1, p_1)$ and $(x_2, p_2)$ be two image/pose pairs sampled from one three-dimensional model; $E(x_1)$ denotes the hidden vector generated by the encoder E from the input image $x_1$; $G(E(x_1))$ denotes the three-dimensional shape reconstructed from the hidden vector $E(x_1)$; the rendered view generated by projecting the reconstructed three-dimensional shape towards camera pose $p_2$ should then be consistent with the input image $x_2$, and similarly for the other view; to express this consistency requirement, the reconstruction loss function $L_{rec}$ is defined as:

$L_{rec} = \|P(G(E(x_2)), p_1) - x_1\|_{1+2} + \|P(G(E(x_1)), p_2) - x_2\|_{1+2}$ (1)

where $\|\cdot\|_{1+2} = \|\cdot\|_1 + \|\cdot\|_2$ is the sum of the $\ell_1$- and $\ell_2$-regularized reconstruction losses;
(2) three-dimensional reconstruction loss function $L_{gt}$: the three-dimensional voxel grids $G(E(x_1))$ and $G(E(x_2))$ should be consistent with the reference three-dimensional model $V_b$; the three-dimensional reconstruction consistency loss function $L_{gt}$ is defined as:

$L_{gt} = \|G(E(x_1)) - V_b\|_{1+2} + \|G(E(x_2)) - V_b\|_{1+2}$ (2)
(3) pose-invariance loss function on the hidden-vector representation $L_{lv}$: given two randomly sampled views of an object, the encoder E should be able to ignore the poses of the images and map them to the same hidden vector; the invariance loss function of the hidden vector with respect to the image pose is defined as:

$L_{lv} = \|E(x_1) - E(x_2)\|_2$ (3)
(4) pose-invariance loss function on the voxel grid $L_{vi}$: the three-dimensional voxel grids reconstructed by the generator G from two different views of the same object should be consistent; the voxel-grid-based pose-invariance loss $L_{vi}$ is defined as:

$L_{vi} = \|G(E(x_1)) - G(E(x_2))\|_{1+2}$ (4)
training on the image dataset with pose labels attempts to minimize the combined loss:

$L_{\text{semi-supervised}} = L_{rec} + \alpha L_{gt} + \beta L_{lv} + \gamma L_{vi}$ (5)

wherein $\alpha$, $\beta$ and $\gamma$ are the weights of $L_{gt}$, $L_{lv}$ and $L_{vi}$, respectively; $\alpha = \beta = \gamma = 0.1$.
4. The single-view three-dimensional reconstruction method based on semi-supervised learning as set forth in claim 2, wherein training the three-dimensional reconstruction network with the unlabeled image dataset in step 3 comprises:
2a. the generator G reconstructs a three-dimensional voxel grid from a given hidden vector;
2b. the renderer P projects the three-dimensional voxel grid from a random viewing angle $p$, generating a projected rendered view;
2c. the losses of the generator G and the discriminator D are updated separately; following the idea of adversarial training, the discriminator D becomes unable to distinguish the rendered views from the reference images;
the loss functions $L_D^{GAN}$ of the discriminator D and $L_G^{GAN}$ of the generator G are designed accordingly.
CN202011372323.5A 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning Active CN112489218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011372323.5A CN112489218B (en) 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011372323.5A CN112489218B (en) 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN112489218A (en) 2021-03-12
CN112489218B (en) 2024-03-19

Family

ID=74937363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011372323.5A Active CN112489218B (en) 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN112489218B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822252B (en) * 2021-11-24 2022-04-22 杭州迪英加科技有限公司 Pathological image cell robust detection method under microscope

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
JP2020060879A (en) * 2018-10-05 2020-04-16 オムロン株式会社 Learning device, image generator, method for learning, and learning program
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993825B (en) * 2019-03-11 2023-06-20 北京工业大学 Three-dimensional reconstruction method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020060879A (en) * 2018-10-05 2020-04-16 オムロン株式会社 Learning device, image generator, method for learning, and learning program
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Estimating Single-Image Depth Information Using a Self-Supervised Convolutional Network (利用自监督卷积网络估计单图像深度信息); Sun Yunhan, Shi Jinlong, Sun Zhengxing; Journal of Computer-Aided Design & Computer Graphics, No. 04, 2020-01-13; full text *

Also Published As

Publication number Publication date
CN112489218A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
Li et al. Robust flow-guided neural prediction for sketch-based freeform surface modeling
Schmidt et al. Self-supervised visual descriptor learning for dense correspondence
Zhou et al. Sparse representation for 3D shape estimation: A convex relaxation approach
Wang et al. 3d shape reconstruction from free-hand sketches
CN110852182A (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Jin et al. Robust 3D face modeling and reconstruction from frontal and side images
Chen et al. Autosweep: Recovering 3d editable objects from a single photograph
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN111402403B (en) High-precision three-dimensional face reconstruction method
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
Cao et al. Accurate 3-D reconstruction under IoT environments and its applications to augmented reality
Abdulwahab et al. Adversarial learning for depth and viewpoint estimation from a single image
CN112489218B (en) Single-view three-dimensional reconstruction system and method based on semi-supervised learning
Zeng et al. Self-supervised learning for point cloud data: A survey
Yin et al. [Retracted] Virtual Reconstruction Method of Regional 3D Image Based on Visual Transmission Effect
Bende et al. VISMA: A Machine Learning Approach to Image Manipulation
Liu et al. DGSN: Learning how to segment pedestrians from other datasets for occluded person re-identification
CN116188894A (en) Point cloud pre-training method, system, equipment and medium based on nerve rendering
Dhondse et al. Generative adversarial networks as an advancement in 2D to 3D reconstruction techniques
Chang et al. 3D hand reconstruction with both shape and appearance from an RGB image
Aleksandrova et al. 3D face model reconstructing from its 2D images using neural networks
Zhang Image and Graphics: 8th International Conference, ICIG 2015, Tianjin, China, August 13-16, 2015, Proceedings, Part III
Wang et al. A Survey of Deep Learning-based Hand Pose Estimation
Yang et al. Hallucinating very low-resolution and obscured face images
Tata et al. 3D GANs and Latent Space: A comprehensive survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant