CN112489218B - Single-view three-dimensional reconstruction system and method based on semi-supervised learning - Google Patents

Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Info

Publication number
CN112489218B
CN112489218B
Authority
CN
China
Prior art keywords
dimensional
image
layer
reconstruction
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011372323.5A
Other languages
Chinese (zh)
Other versions
CN112489218A (en)
Inventor
史金龙
白素琴
葛俊彦
钱强
成昌喜
钱萍
欧镇
田朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN202011372323.5A
Publication of CN112489218A
Application granted
Publication of CN112489218B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-view three-dimensional reconstruction system and method based on semi-supervised learning. The three-dimensional reconstruction network is first trained using an image dataset with pose and instance labels, and four loss functions are designed: the image reconstruction loss function $L_{rec}$, the three-dimensional reconstruction loss function $L_{gt}$, the pose-invariance loss function on the hidden-vector representation $L_{lv}$, and the pose-invariance loss function on the voxel grid $L_{vi}$. The network is then trained using an unlabeled image dataset. The invention adopts training image data with pose and instance labels to keep the three-dimensional shape reconstructed across views consistent and to constrain the similarity between rendered images of the reconstructed three-dimensional shape and the corresponding reference views; the unlabeled image data constrains the rationality of the reconstructed three-dimensional shape; the two image datasets are trained alternately and iteratively during training, so that the three-dimensional shape of an object can be reconstructed from a single view quickly, conveniently and without contact.

Description

Single-view three-dimensional reconstruction system and method based on semi-supervised learning
Technical Field
The invention belongs to the technical field of single-view three-dimensional reconstruction, and particularly relates to a single-view three-dimensional reconstruction system and a method based on semi-supervised learning.
Background Art
The ability to understand the three-dimensional structure of an object from a single image is a hallmark of the human visual system and a key step in visual reasoning and interaction. However, a single image by itself does not contain sufficient information for three-dimensional reconstruction. To reconstruct the three-dimensional shape of an object from a single image, a machine vision system must obtain valid shape prior information; for example, letting the system know that all automobiles have wheels allows the three-dimensional shape of an automobile to be reconstructed more accurately. A key issue is how the machine vision system obtains such prior information.
Current deep learning techniques trained on large-scale data allow a vision system to acquire three-dimensional prior knowledge of the objects to be reconstructed. One approach is to train a deep neural network on a dataset of three-dimensional shapes, but building such a dataset requires three-dimensional modeling experts or three-dimensional scanning tools, and the acquisition process is cumbersome and expensive. Another approach developed in recent years, easier than collecting three-dimensional models, is to train on calibrated multi-view images of the same scene and compare rendered views of the reconstructed three-dimensional shape against reference views for photometric consistency. In practice this is still very expensive: calibrated multi-view images of thousands of objects must be acquired, and annotators must label the camera parameters and the precise instance that each image depicts. Obtaining calibrated multi-view data for thousands of objects is likewise very difficult.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by alternately and iteratively training a three-dimensional reconstruction network on an image dataset with pose and instance labels and on an unlabeled image dataset, combining supervision with pose and category labels with the weak supervision of unlabeled images, and provides a single-view three-dimensional reconstruction system and method based on semi-supervised learning. The method combines the advantages of supervised and unsupervised strategies and greatly reduces dependence on labeled datasets.
In order to solve the technical problems, the invention adopts the following technical scheme.
The invention discloses a single-view three-dimensional reconstruction system based on semi-supervised learning, which comprises: an encoder E, a generator G, a discriminator D and a renderer P;
the encoder E: taking an image as input and generating the corresponding hidden vector representation, wherein its network structure comprises 6 two-dimensional convolution layers (Conv) and 3 fully connected layers (FC); the convolution kernel size of the 6 convolution layers is 5×5, the strides are 1, 2, 1 and 2, and the numbers of output channels (filters) are 128, 256, 512 and 512, respectively; a batch normalization (BN) layer and a ReLU activation function are arranged behind each convolution layer; the outputs of the 3 fully connected layers are 2048, 2048 and 1024 dimensions, respectively, and a BN layer and a ReLU activation function are arranged behind each fully connected layer; the encoder E finally outputs a 1024-dimensional feature vector;
the generator G: taking the hidden vector representation as input and generating a three-dimensional voxel grid, wherein its network structure comprises 1 fully connected layer (FC) and 3 three-dimensional transposed convolution layers (ConvT); the output of the fully connected layer is 256×4×4×4-dimensional, followed by a BN layer and a ReLU activation function; the kernel size of the three-dimensional transposed convolution layers is 5×5×5, the stride is 2, and the channel numbers are 256, 128, 64 and 1, with a BN layer and a ReLU activation function arranged after each three-dimensional transposed convolution layer;
the discriminator D: attempting to distinguish rendered views of the three-dimensional voxel grids output by the generator G from images in the dataset, so as to improve reconstruction quality, wherein its network structure comprises 4 two-dimensional convolution layers and 1 fully connected layer; the convolution kernel size of the 4 convolution layers is 5×5, the strides are all 1, and the numbers of output channels are 256, 512, 1024 and 2048, respectively; a layer normalization (LN) layer and a Leaky-ReLU activation function are arranged behind each convolution layer; the output of the fully connected layer is 1-dimensional, followed by a Sigmoid function; the final output of the discriminator D is the probability that the input image is a generated image;
the renderer P takes a three-dimensional voxel grid and a pose as input and outputs a rendered view of the corresponding viewing angle.
The invention discloses a single-view three-dimensional reconstruction method based on semi-supervised learning, which comprises the following steps:
Step 1, training the three-dimensional reconstruction network using an image dataset with pose and instance labels;
Step 2, designing four loss functions for training the three-dimensional reconstruction network: the image reconstruction loss function $L_{rec}$, the three-dimensional reconstruction loss function $L_{gt}$, the pose-invariance loss function on the hidden-vector representation $L_{lv}$, and the pose-invariance loss function on the voxel grid $L_{vi}$;
Step 3, training the three-dimensional reconstruction network using the unlabeled image dataset;
further, the step 1 includes:
1a. Suppose a pair of images $x_1, x_2$ of a three-dimensional object is acquired from two different poses $p_1$ and $p_2$; $x_1$ and $x_2$ are taken as inputs to the encoder E;
1b. The encoder E maps the two images to the hidden vector space, denoted $E(x_1)$ and $E(x_2)$;
1c. The generator G reconstructs three-dimensional voxel grids from the hidden vectors $E(x_1)$ and $E(x_2)$, denoted $G(E(x_1))$ and $G(E(x_2))$;
1d. The poses $p_1$, $p_2$ and the corresponding three-dimensional voxel grids $G(E(x_1))$, $G(E(x_2))$ are taken as inputs to the renderer P, which outputs rendered views of the corresponding viewing angles.
Further, the four loss functions described in step 2 are designed as follows:
(1) Image reconstruction loss function $L_{rec}$: the rendered view generated by projecting the reconstructed three-dimensional voxel grid under a given camera pose should be consistent with the reference image. Let $(x_1, p_1)$ and $(x_2, p_2)$ be two image/pose pairs sampled from one three-dimensional model; $E(x_1)$ denotes the hidden vector generated by the encoder E from the input image $x_1$; $G(E(x_1))$ denotes the three-dimensional shape reconstructed from the hidden vector $E(x_1)$. The rendered view generated by projecting the reconstructed three-dimensional shape towards camera pose $p_2$ should then be consistent with the input image $x_2$, and similarly for the other view. To express this consistency requirement, the reconstruction loss function $L_{rec}$ is defined as:

$L_{rec} = \|P(G(E(x_2)), p_1) - x_1\|_{1+2} + \|P(G(E(x_1)), p_2) - x_2\|_{1+2}$ (1)

where $\|\cdot\|_{1+2} = \|\cdot\|_1 + \|\cdot\|_2$ is the sum of the $\ell_1$- and $\ell_2$-regularized reconstruction losses;
(2) Three-dimensional reconstruction loss function $L_{gt}$: the three-dimensional voxel grids $G(E(x_1))$ and $G(E(x_2))$ should be consistent with the reference three-dimensional model $V_b$; the three-dimensional reconstruction consistency loss function $L_{gt}$ is defined as:

$L_{gt} = \|G(E(x_1)) - V_b\|_{1+2} + \|G(E(x_2)) - V_b\|_{1+2}$ (2)
(3) Pose-invariance loss function on the hidden-vector representation $L_{lv}$: given two randomly sampled views of an object, the encoder E should be able to ignore the poses of the images and map them to the same hidden vector; the invariance loss function of the hidden vector with respect to the image pose is defined as:

$L_{lv} = \|E(x_1) - E(x_2)\|_2$ (3)
(4) Pose-invariance loss function on the voxel grid $L_{vi}$: the three-dimensional voxel grids reconstructed by the generator G from two different views of the same object should be consistent; the voxel-grid-based pose-invariance loss $L_{vi}$ is defined as:

$L_{vi} = \|G(E(x_1)) - G(E(x_2))\|_{1+2}$ (4)
When training on the image dataset with pose labels, the network attempts to minimize the combined loss:

$L_{\text{semi-supervised}} = L_{rec} + \alpha L_{gt} + \beta L_{lv} + \gamma L_{vi}$ (5)

where $\alpha$, $\beta$ and $\gamma$ are the weights of $L_{gt}$, $L_{lv}$ and $L_{vi}$, respectively; $\alpha = \beta = \gamma = 0.1$ is used.
Further, training the three-dimensional reconstruction network using the unlabeled image dataset described in step 3 includes:
2a. The generator G reconstructs a three-dimensional voxel grid from a given hidden vector;
2b. The renderer P projects the three-dimensional voxel grid from a random viewing angle $p$, generating a projected rendered view;
2c. The losses of the generator G and the discriminator D are updated separately; following the idea of adversarial training, the discriminator D should become unable to distinguish the rendered views from the reference images;
the corresponding loss functions $L_D^{GAN}$ of the discriminator D and $L_G^{GAN}$ of the generator G are designed accordingly.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Training data for single-view three-dimensional reconstruction requires a reference three-dimensional shape or multi-view images of the object. Although it is relatively easy to collect multi-view images of small objects, it is difficult to do so for large scenes; furthermore, three-dimensional labeling requires designing three-dimensional CAD models or performing three-dimensional scanning with dedicated devices such as three-dimensional scanners, which is a huge effort. For some specific application scenarios, a three-dimensional reference shape cannot be obtained at all. Under these conditions, the invention comprehensively utilizes existing three-dimensional shape data, labeled multi-view data and unlabeled image data, greatly reduces the workload of acquiring multi-view images or three-dimensional labels, and effectively realizes single-view three-dimensional reconstruction.
2. Beyond the existing three-dimensional shape data and labeled multi-view data, images without instance or pose labels can also provide partial prior information for the reconstruction result. The invention therefore makes full use of such images to capture the distribution of the visual appearance of objects from the data, thereby adding a rationality constraint to the three-dimensional reconstruction result.
3. Reconstruction of unknown object classes is very difficult because no useful information about the unknown object can be obtained. Pose supervision can help in understanding unknown three-dimensional shapes, so using pose supervision can greatly improve reconstruction performance on unseen object classes.
4. The invention combines supervision with pose and category labels with the weak supervision of unlabeled images, greatly reducing dependence on labeled datasets.
Drawings
FIG. 1 is a method schematic of an embodiment of the present invention.
FIG. 2 is a functional block diagram of training the three-dimensional reconstruction network with an unlabeled image dataset in accordance with an embodiment of the present invention.
Fig. 3 is a schematic diagram of an encoder network architecture in accordance with an embodiment of the present invention.
Fig. 4 is a schematic diagram of a generator network architecture of an embodiment of the present invention.
Fig. 5 is a schematic diagram of the discriminator network architecture of an embodiment of the present invention.
Detailed Description
The invention relates to a single-view three-dimensional reconstruction system and method based on semi-supervised learning. The encoder takes an image as input and generates the corresponding hidden vector representation; the generator takes the hidden vector representation as input to reconstruct a three-dimensional voxel grid; the discriminator distinguishes rendered views of the three-dimensional voxel grid from reference views, improving reconstruction quality; the renderer takes the three-dimensional voxel grid and a viewpoint as input and outputs a rendered view of the corresponding viewing angle. The method adopts training image data with pose and instance labels to keep the three-dimensional shape reconstructed across views consistent and to constrain the similarity between rendered images of the reconstructed three-dimensional shape and the corresponding reference views; the unlabeled image data constrains the rationality of the reconstructed three-dimensional shape; the two image datasets are trained alternately and iteratively during training. The invention alternately and iteratively trains a three-dimensional reconstruction network on images with pose and instance labels and on unlabeled images, combining supervision with pose and category labels with the weak supervision of unlabeled images to form a single-view three-dimensional reconstruction method based on semi-supervised learning. The invention is a technique for quickly, conveniently and contactlessly reconstructing the three-dimensional shape of an object from a single view.
The present invention will be described in further detail with reference to the accompanying drawings.
FIG. 1 is a method schematic of an embodiment of the present invention. FIG. 2 is a functional block diagram of training the three-dimensional reconstruction network with an unlabeled image dataset in accordance with an embodiment of the present invention.
In order to efficiently use the two types of image datasets to train the three-dimensional reconstruction network, a system embodiment of the present invention includes: the encoder E, the generator G, the discriminator D and the renderer P. The encoder E in fig. 1 is a shared-parameter encoder, i.e. the two encoders shown are actually the same encoder. The generator G in fig. 1 and 2 is likewise a shared-parameter generator, i.e. the three generators shown are one and the same generator. The renderer in fig. 1 and 2 is also the same renderer.
The encoder E is used for taking an image as input and generating the corresponding hidden vector representation. Its network structure is shown in fig. 3, a schematic diagram of the encoder network structure according to an embodiment of the present invention; the encoder consists of 6 two-dimensional convolution layers (Conv) and 3 fully connected layers (FC). The convolution kernel size of the 6 convolution layers is 5×5, the strides are 1, 2, 1 and 2, and the numbers of output channels (filters) are 128, 256, 512 and 512, respectively; a batch normalization (BN) layer and a ReLU activation function follow each convolution layer. The outputs of the 3 fully connected layers are 2048, 2048 and 1024 dimensions, respectively, with a BN layer and a ReLU activation function behind each fully connected layer. The encoder E finally outputs a 1024-dimensional feature vector.
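For concreteness, the encoder described above can be sketched in PyTorch roughly as follows. This is a minimal sketch rather than the patented implementation: the input resolution, the padding, and the way the listed strides and filter counts pair up across all six layers are assumptions, since the text gives only the partial sequences quoted above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of encoder E: 6 Conv2d layers + 3 FC layers -> 1024-d hidden vector."""
    def __init__(self, in_ch=3):  # RGB input is an assumption
        super().__init__()
        layers = []
        # Assumed six-layer extension of the listed "strides 1, 2, 1, 2"
        # and "filters 128, 256, 512, 512" sequences.
        for out_ch, stride in [(128, 1), (128, 2), (256, 1), (256, 2), (512, 1), (512, 2)]:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=stride, padding=2),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True),
        )

    def forward(self, x):               # x: (N, 3, H, W)
        return self.fcs(self.convs(x))  # (N, 1024) hidden vector
```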
The generator G is configured to take the hidden vector representation as input and generate a three-dimensional voxel grid. Its network structure is shown in fig. 4, a schematic diagram of the generator network structure according to an embodiment of the present invention; the generator consists of 1 fully connected layer (FC) and 3 three-dimensional transposed convolution layers (ConvT). The output of the fully connected layer is 256×4×4×4-dimensional, followed by a BN layer and a ReLU activation function; the kernel size of the three-dimensional transposed convolution layers is 5×5×5, the stride is 2, and the channel numbers are 256, 128, 64 and 1, with a BN layer and a ReLU activation function after each three-dimensional transposed convolution layer.
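A matching sketch of the generator, reading the fully connected output as 256×4×4×4 (a 4×4×4 grid with 256 channels) that three stride-2 transposed convolutions upsample to a 32×32×32 voxel grid; this reading and the padding values are assumptions:

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of generator G: 1 FC layer + 3 ConvTranspose3d layers -> voxel grid."""
    def __init__(self, latent_dim=1024):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(latent_dim, 256 * 4 * 4 * 4),
            nn.BatchNorm1d(256 * 4 * 4 * 4),
            nn.ReLU(inplace=True),
        )
        layers, in_ch = [], 256
        for out_ch in [128, 64, 1]:  # channel sequence 256 -> 128 -> 64 -> 1 from the text
            layers += [nn.ConvTranspose3d(in_ch, out_ch, kernel_size=5, stride=2,
                                          padding=2, output_padding=1),  # doubles each side
                       nn.BatchNorm3d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.deconvs = nn.Sequential(*layers)

    def forward(self, z):                      # z: (N, 1024) hidden vector
        v = self.fc(z).view(-1, 256, 4, 4, 4)  # 4x4x4 grid with 256 channels
        return self.deconvs(v)                 # (N, 1, 32, 32, 32) voxel grid
```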
The discriminator D attempts to distinguish rendered views of the three-dimensional voxel grids output by the generator G from images in the dataset, so as to improve reconstruction quality. Its network structure is shown in fig. 5, a schematic diagram of the discriminator network structure according to an embodiment of the present invention; the discriminator consists of 4 two-dimensional convolution layers and 1 fully connected layer. The convolution kernel size of the 4 convolution layers is 5×5, the strides are all 1, and the numbers of output channels are 256, 512, 1024 and 2048, respectively; a layer normalization (LN) layer and a Leaky-ReLU activation function follow each convolution layer. The output of the fully connected layer is 1-dimensional, followed by a Sigmoid function. The final output of the discriminator D is the probability that the input image is a generated image.
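A sketch of the discriminator under the same caveats; nn.GroupNorm with a single group stands in for layer normalization over convolutional feature maps, and the single input channel (silhouette-style views) is an assumption:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of discriminator D: 4 Conv2d layers + 1 FC layer -> probability."""
    def __init__(self, in_ch=1):
        super().__init__()
        layers = []
        for out_ch in [256, 512, 1024, 2048]:  # channels and stride 1 per the text
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=1, padding=2),
                       nn.GroupNorm(1, out_ch),          # LN-equivalent for conv maps
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())

    def forward(self, x):
        # probability that x is a generated (rendered) image, per the text
        return self.head(self.convs(x))
```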
The renderer P takes a three-dimensional voxel grid and a pose as input and outputs a rendered view of the corresponding viewing angle.
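The patent does not specify the internals of the renderer P. A minimal differentiable stand-in, which resamples the voxel grid into the camera frame and max-projects it along the depth axis, could look like the following; the (N, 3, 4) affine pose format and the orthographic projection are assumptions:

```python
import torch
import torch.nn.functional as F

def render(voxels, pose):
    """Sketch of a differentiable renderer P (assumed design, not the patent's).

    voxels: (N, 1, D, H, W) occupancy grid from the generator.
    pose:   (N, 3, 4) normalized affine transforms into the camera frame.
    """
    grid = F.affine_grid(pose, voxels.size(), align_corners=False)  # (N, D, H, W, 3)
    rotated = F.grid_sample(voxels, grid, align_corners=False)      # grid in camera frame
    view, _ = rotated.max(dim=2)   # orthographic max-projection over depth
    return view                    # (N, 1, H, W) rendered view
```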
As described above, using training image data with pose and instance labels during three-dimensional reconstruction training keeps the three-dimensional shapes reconstructed across views consistent and constrains the similarity between the rendered images of the reconstructed three-dimensional shape and the corresponding reference views.
As shown in fig. 1, an embodiment method of the present invention includes:
step 1, training a three-dimensional reconstruction network by using an image data set with gestures and instance marks.
Step 2, training a three-dimensional reconstruction network by using a label-free image data set, and designing four loss functions: image reconstruction loss function L rec Three-dimensional reconstruction loss function L gt Pose invariant loss function L represented by hidden vector lv Pose invariant loss function L on voxel grid vi
And 3, training the three-dimensional reconstruction network by using the unmarked image data set.
In step 1, the three-dimensional reconstruction network is trained using the image dataset with pose and instance labels; the principle is as follows:
1a. Suppose a pair of images $x_1, x_2$ of a three-dimensional object is acquired from two different poses $p_1$ and $p_2$; $x_1$ and $x_2$ are taken as inputs to the encoder E.
1b. The encoder E maps the two images to the hidden vector space, denoted $E(x_1)$ and $E(x_2)$.
1c. The generator G reconstructs three-dimensional voxel grids from the hidden vectors $E(x_1)$ and $E(x_2)$, denoted $G(E(x_1))$ and $G(E(x_2))$.
1d. The poses $p_1$, $p_2$ and the corresponding three-dimensional voxel grids $G(E(x_1))$, $G(E(x_2))$ are taken as inputs to the renderer P, which outputs rendered views of the corresponding viewing angles.
The three-dimensional voxel grid reconstructed by the generator G should satisfy two requirements: the reconstruction should be accurate, and the reconstruction result should not be affected by the pose of the input image. This requires the hidden vectors $E(x_1)$ and $E(x_2)$ to be invariant to the camera pose of the input images. To ensure that $E(x_1)$ is pose-invariant, the three-dimensional voxel grid $G(E(x_1))$ predicted from $E(x_1)$ is re-projected to the second viewpoint $p_2$, and the resulting projection image should remain consistent with the second input image $x_2$, and vice versa. To this end, the present invention designs four loss functions, including:
(1) Image reconstruction loss function $L_{rec}$: the rendered view generated by projecting the reconstructed three-dimensional voxel grid under a given camera pose should be consistent with the reference image. Let $(x_1, p_1)$ and $(x_2, p_2)$ be two image/pose pairs sampled from one three-dimensional model; $E(x_1)$ denotes the hidden vector generated by the encoder E from the input image $x_1$; $G(E(x_1))$ denotes the three-dimensional shape reconstructed from the hidden vector $E(x_1)$. The rendered view generated by projecting the reconstructed three-dimensional shape towards camera pose $p_2$ should then be consistent with the input image $x_2$, and similarly for the other view. To express this consistency requirement, the reconstruction loss function $L_{rec}$ is defined as:

$L_{rec} = \|P(G(E(x_2)), p_1) - x_1\|_{1+2} + \|P(G(E(x_1)), p_2) - x_2\|_{1+2}$ (1)

where $\|\cdot\|_{1+2} = \|\cdot\|_1 + \|\cdot\|_2$ is the sum of the $\ell_1$- and $\ell_2$-regularized reconstruction losses.
(2) Three-dimensional reconstruction loss function $L_{gt}$: the three-dimensional voxel grids $G(E(x_1))$ and $G(E(x_2))$ should be consistent with the reference three-dimensional model $V_b$; the three-dimensional reconstruction consistency loss function $L_{gt}$ is defined as:

$L_{gt} = \|G(E(x_1)) - V_b\|_{1+2} + \|G(E(x_2)) - V_b\|_{1+2}$ (2)
(3) Pose-invariance loss function on the hidden-vector representation $L_{lv}$: given two randomly sampled views of an object, the encoder E should be able to ignore the poses of the images and map them to the same hidden vector. The invariance loss function of the hidden vector with respect to the image pose is defined as:

$L_{lv} = \|E(x_1) - E(x_2)\|_2$ (3)
(4) Pose-invariance loss function on the voxel grid $L_{vi}$: the three-dimensional voxel grids reconstructed by the generator G from two different views of the same object should be consistent. The voxel-grid-based pose-invariance loss $L_{vi}$ is defined as:

$L_{vi} = \|G(E(x_1)) - G(E(x_2))\|_{1+2}$ (4)
The broken lines in fig. 1 illustrate the four loss functions described above. When training on the image dataset with pose labels, the network attempts to minimize the combined loss:

$L_{\text{semi-supervised}} = L_{rec} + \alpha L_{gt} + \beta L_{lv} + \gamma L_{vi}$ (5)

where $\alpha$, $\beta$ and $\gamma$ are the weights of $L_{gt}$, $L_{lv}$ and $L_{vi}$, respectively. The present invention uses $\alpha = \beta = \gamma = 0.1$.
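Transcribed directly, equations (1) to (5) amount to the following sketch, assuming the encoder E, generator G and renderer P are callables like the sketches above; summing each norm over the whole batch is an assumption about the reduction:

```python
def norm_1_2(t):
    """The ||.||_{1+2} penalty of equations (1)-(4): sum of the L1 and L2 norms."""
    return t.abs().sum() + t.pow(2).sum().sqrt()

def supervised_loss(E, G, P, x1, p1, x2, p2, Vb, alpha=0.1, beta=0.1, gamma=0.1):
    """Combined loss of equation (5) for a labeled pair (x1, p1), (x2, p2)
    with reference voxel model Vb."""
    z1, z2 = E(x1), E(x2)
    v1, v2 = G(z1), G(z2)
    L_rec = norm_1_2(P(v2, p1) - x1) + norm_1_2(P(v1, p2) - x2)  # eq. (1)
    L_gt  = norm_1_2(v1 - Vb) + norm_1_2(v2 - Vb)                # eq. (2)
    L_lv  = (z1 - z2).pow(2).sum().sqrt()                        # eq. (3)
    L_vi  = norm_1_2(v1 - v2)                                    # eq. (4)
    return L_rec + alpha * L_gt + beta * L_lv + gamma * L_vi     # eq. (5)
```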
FIG. 2 shows a functional block diagram of training the three-dimensional reconstruction network with the unlabeled image dataset. To let the unlabeled image data constrain the rationality of the reconstructed three-dimensional shape, in step 3 the present invention designs an adversarial training method that trains the three-dimensional reconstruction network using the unlabeled image dataset, complementing the four loss functions $L_{rec}$, $L_{gt}$, $L_{lv}$ and $L_{vi}$ designed in step 2. The principle is described as follows:
2a. The generator G reconstructs a three-dimensional voxel grid from a given hidden vector;
2b. The renderer P projects the three-dimensional voxel grid from a random viewing angle $p$, generating a projected rendered view;
2c. The losses of the generator G and the discriminator D are updated separately; following the idea of adversarial training, the discriminator D should become unable to distinguish the rendered views from the reference images. That is: no matter from which camera pose the reconstructed three-dimensional voxel grid is projected, the rendered image should look similar to the images in the dataset.
The loss functions $L_D^{GAN}$ of the discriminator D and $L_G^{GAN}$ of the generator G are updated accordingly.
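The formulas for $L_D^{GAN}$ and $L_G^{GAN}$ appear in the patent as images and are not reproduced in this text, so the sketch below substitutes the standard binary cross-entropy GAN objective, plainly an assumption, to illustrate steps 2a to 2c; D is read as outputting the probability that its input is a generated image:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def adversarial_step(E, G, P, D, x_unlabeled, random_pose, opt_EG, opt_D):
    """One adversarial update on an unlabeled batch (steps 2a-2c), assuming
    dataset images and rendered views share the same channel count."""
    # 2a/2b: reconstruct a voxel grid and render it from a random viewpoint
    fake_view = P(G(E(x_unlabeled)), random_pose)

    # 2c: update D to score rendered views as generated (1), dataset images as not (0)
    opt_D.zero_grad()
    d_real, d_fake = D(x_unlabeled), D(fake_view.detach())
    loss_D = bce(d_real, torch.zeros_like(d_real)) + bce(d_fake, torch.ones_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # then update E and G so the rendered view scores like a dataset image
    opt_EG.zero_grad()
    d_fake = D(fake_view)
    loss_G = bce(d_fake, torch.zeros_like(d_fake))
    loss_G.backward()
    opt_EG.step()
    return loss_D.item(), loss_G.item()
```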
in summary, the invention provides a single-view three-dimensional reconstruction method based on semi-supervised learning. The present invention considers two classes of training data sets: one type is a label-free image dataset, which may be downloaded from the internet, considering a large-scale image set of a class, but without any precise instance or pose labels. Although it is difficult to extract three-dimensional information from these images, distribution information of the visual appearance of the object can be captured from the class; the other is a labeled image dataset with poses and instances, considering other semantic classes of labeled images that do not tell us about nuances of a particular class, but which can describe the general shape of an object. For example, most shapes are compact, smooth, tend to be convex, and so on. The invention provides an effective single-view three-dimensional reconstruction method based on semi-supervision, which can effectively utilize all the information to train a three-dimensional reconstruction network in an alternating iterative mode and combines supervision with gesture and category marks with weak supervision of unmarked images. The technology is suitable for multiple fields of comprehensive ship guarantee, equipment virtual maintenance, interactive electronic technical manuals, movies, animation, virtual reality, augmented reality, industrial manufacturing and the like, can accurately acquire the three-dimensional shape of an object from a single image, and has wide market prospect.

Claims (4)

1. A single-view three-dimensional reconstruction system based on semi-supervised learning, comprising: an encoder E, a generator G, a discriminator D and a renderer P;
the encoder E: taking an image as input and generating the corresponding hidden vector representation, wherein its network structure comprises 6 two-dimensional convolution layers (Conv) and 3 fully connected layers (FC); the convolution kernel size of the 6 convolution layers is 5×5, the strides are 1, 2, 1 and 2, and the numbers of output channels (filters) are 128, 256, 512 and 512, respectively; a batch normalization (BN) layer and a ReLU activation function are arranged behind each convolution layer; the outputs of the 3 fully connected layers are 2048, 2048 and 1024 dimensions, respectively, and a BN layer and a ReLU activation function are arranged behind each fully connected layer; the encoder E finally outputs a 1024-dimensional feature vector;
the generator G: taking the hidden vector representation as input and generating a three-dimensional voxel grid, wherein its network structure comprises 1 fully connected layer (FC) and 3 three-dimensional transposed convolution layers (ConvT); the output of the fully connected layer is 256×4×4×4-dimensional, followed by a BN layer and a ReLU activation function; the kernel size of the three-dimensional transposed convolution layers is 5×5×5, the stride is 2, and the channel numbers are 256, 128, 64 and 1, with a BN layer and a ReLU activation function arranged after each three-dimensional transposed convolution layer;
the discriminator D: attempting to distinguish rendered views of the three-dimensional voxel grids output by the generator G from images in the dataset, so as to improve reconstruction quality, wherein its network structure comprises 4 two-dimensional convolution layers and 1 fully connected layer; the convolution kernel size of the 4 convolution layers is 5×5, the strides are all 1, and the numbers of output channels are 256, 512, 1024 and 2048, respectively; a layer normalization (LN) layer and a Leaky-ReLU activation function are arranged behind each convolution layer; the output of the fully connected layer is 1-dimensional, followed by a Sigmoid function; the final output of the discriminator D is the probability that the input image is a generated image;
the renderer P takes a three-dimensional voxel grid and a pose as input and outputs a rendered view of the corresponding viewing angle.
2. A single view three-dimensional reconstruction method based on semi-supervised learning using the system of claim 1, the method comprising:
step 1, training the three-dimensional reconstruction network using an image dataset with pose and instance labels;
step 2, designing four loss functions for training the three-dimensional reconstruction network: the image reconstruction loss function $L_{rec}$, the three-dimensional reconstruction loss function $L_{gt}$, the pose-invariance loss function on the hidden-vector representation $L_{lv}$, and the pose-invariance loss function on the voxel grid $L_{vi}$;
step 3, training the three-dimensional reconstruction network using the unlabeled image dataset;
the step 1 comprises the following steps:
1a. suppose a pair of images $x_1, x_2$ of a three-dimensional object is acquired from two different poses $p_1$ and $p_2$; $x_1$ and $x_2$ are taken as inputs to the encoder E;
1b. the encoder E maps the two images to the hidden vector space, denoted $E(x_1)$ and $E(x_2)$;
1c. the generator G reconstructs three-dimensional voxel grids from the hidden vectors $E(x_1)$ and $E(x_2)$, denoted $G(E(x_1))$ and $G(E(x_2))$;
1d. the poses $p_1$, $p_2$ and the corresponding three-dimensional voxel grids $G(E(x_1))$, $G(E(x_2))$ are taken as inputs to the renderer P, which outputs rendered views of the corresponding viewing angles.
3. The single-view three-dimensional reconstruction method based on semi-supervised learning as set forth in claim 2, wherein the four loss functions in step 2 are designed as follows:
(1) image reconstruction loss function $L_{rec}$: the rendered view generated by projecting the reconstructed three-dimensional voxel grid under a given camera pose should be consistent with the reference image; let $(x_1, p_1)$ and $(x_2, p_2)$ be two image/pose pairs sampled from one three-dimensional model; $E(x_1)$ denotes the hidden vector generated by the encoder E from the input image $x_1$; $G(E(x_1))$ denotes the three-dimensional shape reconstructed from the hidden vector $E(x_1)$; the rendered view generated by projecting the reconstructed three-dimensional shape towards camera pose $p_2$ should then be consistent with the input image $x_2$, and similarly for the other view; to express this consistency requirement, the reconstruction loss function $L_{rec}$ is defined as:

$L_{rec} = \|P(G(E(x_2)), p_1) - x_1\|_{1+2} + \|P(G(E(x_1)), p_2) - x_2\|_{1+2}$ (1)

where $\|\cdot\|_{1+2} = \|\cdot\|_1 + \|\cdot\|_2$ is the sum of the $\ell_1$- and $\ell_2$-regularized reconstruction losses;
(2) three-dimensional reconstruction loss function $L_{gt}$: the three-dimensional voxel grids $G(E(x_1))$ and $G(E(x_2))$ should be consistent with the reference three-dimensional model $V_b$; the three-dimensional reconstruction consistency loss function $L_{gt}$ is defined as:

$L_{gt} = \|G(E(x_1)) - V_b\|_{1+2} + \|G(E(x_2)) - V_b\|_{1+2}$ (2)
(3) pose-invariance loss function on the hidden-vector representation $L_{lv}$: given two randomly sampled views of an object, the encoder E should be able to ignore the poses of the images and map them to the same hidden vector; the invariance loss function of the hidden vector with respect to the image pose is defined as:

$L_{lv} = \|E(x_1) - E(x_2)\|_2$ (3)
(4) pose-invariance loss function on the voxel grid $L_{vi}$: the three-dimensional voxel grids reconstructed by the generator G from two different views of the same object should be consistent; the voxel-grid-based pose-invariance loss $L_{vi}$ is defined as:

$L_{vi} = \|G(E(x_1)) - G(E(x_2))\|_{1+2}$ (4)
training on the image dataset with pose labels attempts to minimize the combined loss:

$L_{\text{semi-supervised}} = L_{rec} + \alpha L_{gt} + \beta L_{lv} + \gamma L_{vi}$ (5)

wherein $\alpha$, $\beta$ and $\gamma$ are the weights of $L_{gt}$, $L_{lv}$ and $L_{vi}$, respectively; $\alpha = \beta = \gamma = 0.1$.
4. The single-view three-dimensional reconstruction method based on semi-supervised learning as set forth in claim 2, wherein training the three-dimensional reconstruction network with the unlabeled image dataset in step 3 comprises:
2a. the generator G reconstructs a three-dimensional voxel grid from a given hidden vector;
2b. the renderer P projects the three-dimensional voxel grid from a random viewing angle $p$, generating a projected rendered view;
2c. the losses of the generator G and the discriminator D are updated separately; following the idea of adversarial training, the discriminator D becomes unable to distinguish the rendered views from the reference images;
the loss functions $L_D^{GAN}$ of the discriminator D and $L_G^{GAN}$ of the generator G are designed accordingly.
CN202011372323.5A 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning Active CN112489218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011372323.5A CN112489218B (en) 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011372323.5A CN112489218B (en) 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN112489218A (en) 2021-03-12
CN112489218B (en) 2024-03-19

Family

ID=74937363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011372323.5A Active CN112489218B (en) 2020-11-30 2020-11-30 Single-view three-dimensional reconstruction system and method based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN112489218B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822252B (en) * 2021-11-24 2022-04-22 杭州迪英加科技有限公司 Pathological image cell robust detection method under microscope

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
JP2020060879A (en) * 2018-10-05 2020-04-16 オムロン株式会社 Learning device, image generator, method for learning, and learning program
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993825B (en) * 2019-03-11 2023-06-20 北京工业大学 Three-dimensional reconstruction method based on deep learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020060879A (en) * 2018-10-05 2020-04-16 オムロン株式会社 Learning device, image generator, method for learning, and learning program
CN110110745A (en) * 2019-03-29 2019-08-09 上海海事大学 Based on the semi-supervised x-ray image automatic marking for generating confrontation network
CN111243066A (en) * 2020-01-09 2020-06-05 浙江大学 Facial expression migration method based on self-supervision learning and confrontation generation mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Estimating Single-Image Depth Information Using a Self-Supervised Convolutional Network (利用自监督卷积网络估计单图像深度信息); Sun Yunhan, Shi Jinlong, Sun Zhengxing; Journal of Computer-Aided Design & Computer Graphics, No. 04, 2020-01-13; full text *

Also Published As

Publication number Publication date
CN112489218A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
Li et al. Robust flow-guided neural prediction for sketch-based freeform surface modeling
Schmidt et al. Self-supervised visual descriptor learning for dense correspondence
Zhou et al. Sparse representation for 3D shape estimation: A convex relaxation approach
Wang et al. 3d shape reconstruction from free-hand sketches
CN110852182A (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Jin et al. Robust 3D face modeling and reconstruction from frontal and side images
Chen et al. Autosweep: Recovering 3d editable objects from a single photograph
CN115661246A (en) Attitude estimation method based on self-supervision learning
CN111402403B (en) High-precision three-dimensional face reconstruction method
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
Cao et al. Accurate 3-D reconstruction under IoT environments and its applications to augmented reality
Abdulwahab et al. Adversarial learning for depth and viewpoint estimation from a single image
CN112489218B (en) Single-view three-dimensional reconstruction system and method based on semi-supervised learning
Zeng et al. Self-supervised learning for point cloud data: A survey
Yin et al. [Retracted] Virtual Reconstruction Method of Regional 3D Image Based on Visual Transmission Effect
Bende et al. VISMA: A Machine Learning Approach to Image Manipulation
Liu et al. DGSN: Learning how to segment pedestrians from other datasets for occluded person re-identification
CN116188894A (en) Point cloud pre-training method, system, equipment and medium based on nerve rendering
Dhondse et al. Generative adversarial networks as an advancement in 2D to 3D reconstruction techniques
Chang et al. 3D hand reconstruction with both shape and appearance from an RGB image
Aleksandrova et al. 3D face model reconstructing from its 2D images using neural networks
Zhang Image and Graphics: 8th International Conference, ICIG 2015, Tianjin, China, August 13-16, 2015, Proceedings, Part III
Wang et al. A Survey of Deep Learning-based Hand Pose Estimation
Yang et al. Hallucinating very low-resolution and obscured face images
Tata et al. 3D GANs and Latent Space: A comprehensive survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant