CN113393582A - Three-dimensional object reconstruction algorithm based on deep learning - Google Patents

Three-dimensional object reconstruction algorithm based on deep learning

Info

Publication number
CN113393582A
CN113393582A (Application CN202110563571.6A)
Authority
CN
China
Prior art keywords
dimensional
voxel
occupation
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110563571.6A
Other languages
Chinese (zh)
Inventor
贾海涛
刘欣月
张诗涵
李玉琳
邹新雷
任利
许文波
罗俊海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110563571.6A priority Critical patent/CN113393582A/en
Publication of CN113393582A publication Critical patent/CN113393582A/en
Pending legal-status Critical Current

Classifications

    • G06T17/20 — Three dimensional [3D] modelling: finite element generation, e.g. wire-frame surface description, tesselation
    • G06N3/047 — Neural networks, architecture: probabilistic or stochastic networks
    • G06N3/048 — Neural networks, architecture: activation functions
    • G06N3/08 — Neural networks: learning methods
    • G06T19/00 — Manipulating 3D models or images for computer graphics
    • G06T2200/08 — Indexing scheme involving all processing steps from image acquisition to 3D model generation


Abstract

The invention discloses a three-dimensional object reconstruction algorithm based on deep learning, which comprises the following steps: inputting a plurality of two-dimensional images of an object obtained from arbitrary angles; preprocessing the two-dimensional images; establishing a convolutional neural network; inputting the two-dimensional images into the network as training data for training; and inputting the two-dimensional images to be tested into the trained convolutional neural network model, which outputs the three-dimensional reconstruction result. In the invention, the convolutional neural network model comprises an encoder, a decoder and a multi-view feature combination module. The encoder takes the multi-view two-dimensional images as input and outputs two-dimensional feature vectors, which are then converted into three-dimensional information; this three-dimensional information is fed into the decoder to obtain the predicted three-dimensional voxel occupancy of each single image; the multi-view feature combination module then produces the final predicted voxel occupancy. In the testing stage, accuracy is computed by comparing the binary 0-1 occupancy produced by the hierarchical prediction strategy against the ground-truth occupancy.

Description

Three-dimensional object reconstruction algorithm based on deep learning
Technical Field
The invention belongs to the field of computer vision and deep learning, and particularly relates to a three-dimensional object reconstruction algorithm based on deep learning.
Background
In recent years, with the advent of public datasets of three-dimensional objects, the complete and accurate reconstruction of three-dimensional geometry from images has become a research hotspot in fields such as computer vision and industrial manufacturing. For example, AR and VR applications emerging in the 5G era rely on three-dimensional reconstruction to convey a realistic sense of scenes transmitted in real time; in industrial manufacturing, robotic-arm grasping and obstacle avoidance, as well as path planning and obstacle avoidance for autonomous vehicles, all make full use of three-dimensional reconstruction technology. In addition, one can obtain more information from a three-dimensional model than from a two-dimensional image. Three-dimensional object reconstruction has therefore become increasingly important.
On the other hand, with the development of computer hardware and artificial intelligence technology, reconstructing three-dimensional models with deep learning tools has become a major trend in current research. Deep-learning-based three-dimensional object reconstruction can recover the three-dimensional geometry of an object from a single image or multiple images without complex and precise camera calibration procedures.
At present, most deep-learning-based three-dimensional reconstruction algorithms share a common problem: when we see an object from a single viewpoint, it is difficult to infer its overall shape because the object occludes itself. In terms of two-dimensional images, a single image contains limited information, from which an accurate and complete three-dimensional model cannot be deduced. To solve this problem, researchers have proposed reconstructing the three-dimensional model from multiple images of the same object taken from different viewpoints, jointly considering the information contained in all of them.
Current multi-view three-dimensional reconstruction algorithms feed the features of each image into an LSTM and exploit its memory to fuse the contained information. Although the reconstruction quality may improve to some extent as the number of views increases, this approach still has a drawback: because of the temporal ordering inherent in the LSTM structure, the order of the input images affects the final reconstruction result, which clearly contradicts the original intent of the network design.
Therefore, the invention designs a deep-learning-based three-dimensional object reconstruction algorithm, referred to as the 3D FONet (3D Reconstruction from Object Network). It is a convolutional neural network for multi-view three-dimensional reconstruction that maps from two-dimensional images to three-dimensional geometry. 3D FONet can be trained and tested without any additional input such as object class labels or pose information, and its reconstruction result does not change when the order of the input images changes.
Disclosure of Invention
The invention provides a three-dimensional object reconstruction algorithm based on deep learning that can be trained and tested without any additional input such as object class labels or pose information, and whose reconstruction result does not change when the order of the input images changes. The invention improves the structures of the encoder and the decoder, and applies a hierarchical prediction strategy to improve the three-dimensional reconstruction quality. See the description below for details.
The solution of the invention for solving the technical problem is as follows:
a three-dimensional object reconstruction algorithm based on deep learning, the three-dimensional object reconstruction algorithm comprising the steps of:
step 1, inputting a plurality of two-dimensional images of an object obtained from arbitrary angles;
step 2, establishing a convolutional neural network model;
step 3, inputting the two-dimensional image in the step 1 as training data into the convolutional neural network established in the step 2 for training;
and step 4, inputting the two-dimensional images to be tested into the convolutional neural network model trained in step 3, which outputs the three-dimensional reconstruction result.
The convolutional neural network model in step 2 comprises an encoder, a decoder and a multi-view feature combination module. The encoder is a ResNet50 network with SE-Block embedded; its convolutional layers are regularized using BatchNorm, with the ReLU function as activation. The decoder consists of three-dimensional deconvolution layers, BatchNorm layers, three-dimensional unpooling layers and ReLU activations. The encoder takes the multi-view two-dimensional images as input and outputs two-dimensional feature vectors, which must be converted into three-dimensional information; the decoder takes the three-dimensional information obtained by converting the encoder's output vectors and produces the predicted three-dimensional voxel occupancy of a single image; the multi-view feature combination module takes the predicted three-dimensional voxel occupancy of each two-dimensional image and outputs the final predicted voxel occupancy.
As a further improvement of the above technical solution, step 3 specifically comprises the following steps (a training-step sketch follows this list):
3.1 randomly initializing the convolutional neural network model;
3.2 inputting each image independently into the two-dimensional encoder for feature extraction;
3.3 converting the two-dimensional feature vectors extracted in step 3.2 into three-dimensional information;
3.4 inputting the three-dimensional information obtained in step 3.3 into the three-dimensional decoder for decoding, generating three-dimensional probability voxels and obtaining the predicted voxel occupancy;
3.5 combining the predicted voxel occupancies obtained for each image in step 3.4 in the multi-view feature combination module to obtain the final predicted voxel occupancy;
3.6 optimizing the parameters of the model step by step via a cross-entropy loss function.
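To make steps 3.1 to 3.6 concrete, the following is a minimal PyTorch sketch of one training step. It is an illustration under stated assumptions, not the patent's implementation: the encoder, decoder and combine arguments stand in for the modules described above, and reshaping the 4 × 4 × 2048 image feature into a 4 × 4 × 4 × 512 volume is one plausible reading of the two-dimensional-to-three-dimensional conversion in step 3.3.

    import torch
    import torch.nn.functional as F

    def train_step(encoder, decoder, combine, images, gt_voxels, optimizer):
        # images: (n_views, 3, 256, 256); gt_voxels: (32, 32, 32), values in {0, 1}
        per_view = []
        for img in images:                       # step 3.2: encode each view independently
            feat = encoder(img.unsqueeze(0))     # (1, 2048, 4, 4) two-dimensional features
            vol = feat.reshape(1, 512, 4, 4, 4)  # step 3.3: convert to three-dimensional information
            probs = decoder(vol)                 # step 3.4: (1, 2, 32, 32, 32) probability voxels
            per_view.append(probs[0, 1])         # occupied-class probability grid
        O = combine(torch.stack(per_view))       # step 3.5: multi-view feature combination
        loss = F.binary_cross_entropy(           # step 3.6: sum of voxel cross entropies
            O.clamp(1e-7, 1 - 1e-7), gt_voxels.float(), reduction="sum")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()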
As a further improvement of the above technical solution, step 4 specifically comprises the following steps:
4.1 inputting the set of two-dimensional images to be tested into the convolutional neural network trained in steps 3.2 to 3.6 to obtain the predicted voxel probability O;
4.2 computing the accuracy by comparing the binary 0-1 occupancy produced by the hierarchical prediction strategy with the ground-truth occupancy.
The invention has the beneficial effects that: the invention trains the convolutional neural network model on multiple two-dimensional images and then uses the trained model to reconstruct the two-dimensional images to be tested. The encoder is a ResNet50 network with SE-Block embedded; its convolutional layers are regularized using BatchNorm, with the ReLU function as activation. The decoder consists of three-dimensional deconvolution layers, BatchNorm layers, three-dimensional unpooling layers and ReLU activations. The encoder takes the multi-view two-dimensional images as input and outputs two-dimensional feature vectors, which must be converted into three-dimensional information; the decoder takes this three-dimensional information and produces the predicted three-dimensional voxel occupancy of a single image; the multi-view feature combination module takes the predicted voxel occupancy of each two-dimensional image and outputs the final predicted voxel occupancy.
The invention has the beneficial effects that: the invention incorporates a multi-view feature combination module into the model to combine the features of multiple input images. From everyday experience, when we see an object we can infer its approximate shape by observing its appearance; because the object occludes itself, we can only roughly guess its shape from experience but cannot determine it. The most straightforward way to know the shape is to walk around the object. Inspired by this, after an image is input, the designed convolutional neural network produces the current predicted voxel occupancy probability O_t, where t denotes the t-th image. For that image, the directly observable part of the object receives relatively certain probabilities, while for the parts hidden by self-occlusion the network tries to predict the probability that each voxel grid is occupied from prior knowledge. The more views are input, the more of the object becomes visible, the more voxel grids have a determined occupancy, and the reconstruction quality of the model improves continuously.
The invention has the beneficial effects that: the invention improves on traditional multi-view three-dimensional reconstruction algorithms by adding a hierarchical prediction strategy to the model, which judges from outside to inside whether each voxel grid is occupied according to the final predicted voxel occupancy probability O, dynamically adjusting the voxel-grid threshold according to the occupancy of the outer-layer voxel grids. If few outer-layer voxel grids are occupied, a smaller threshold is selected, and vice versa. As prediction proceeds deeper into the object, the results become progressively less biased. The introduction of the hierarchical prediction strategy improves the reconstruction of the thinner parts of the object.
Drawings
In order to illustrate the technical solution in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Clearly, the described figures cover only some embodiments of the invention rather than all of them, and a person skilled in the art can derive other designs and figures from them without inventive effort.
FIG. 1 is a schematic diagram of a convolutional neural network model structure of the present invention;
FIG. 2 is a flow chart of a three-dimensional reconstruction algorithm of the present invention;
FIG. 3 is a block diagram of an encoder network according to the present invention;
FIG. 4 shows the distribution of model counts per category in the ShapeNet dataset of the present invention;
fig. 5 is a comparison of several algorithm reconstruction results after inputting different numbers of multi-view images.
Detailed Description
The conception, specific structure and technical effects of the present invention are described clearly and completely below with reference to the embodiments and drawings, so that the reader can fully understand the objects, features and effects of the invention. Obviously, the described embodiments are only a part of the embodiments of the invention rather than all of them; all other embodiments that a person skilled in the art obtains from them without inventive effort fall within the protection scope of the invention.
Example 1
Referring to fig. 1 and fig. 2, the invention discloses a three-dimensional object reconstruction algorithm based on deep learning, comprising the following steps:
step 1, inputting a plurality of two-dimensional images of an object obtained from arbitrary angles;
step 2, establishing a convolutional neural network model;
step 3, inputting the two-dimensional image in the step 1 as training data into the convolutional neural network established in the step 2 for training;
and step 4, inputting the two-dimensional images to be tested into the convolutional neural network model trained in step 3, which outputs the three-dimensional reconstruction result.
The convolutional neural network model in step 2 comprises an encoder, a decoder and a multi-view feature combination module. The encoder mines the three-dimensional spatial structure of the object by extracting two-dimensional image features; it is a ResNet50 network with SE-Block embedded, its convolutional layers are regularized using BatchNorm, and the ReLU function is selected as activation. The decoder consists of three-dimensional deconvolution layers, BatchNorm layers, three-dimensional unpooling layers and ReLU activations. The encoder takes the multi-view two-dimensional images as input and outputs two-dimensional feature vectors, which must be converted into three-dimensional information; the decoder takes this three-dimensional information and produces the predicted three-dimensional voxel occupancy of a single image; the multi-view feature combination module takes the predicted voxel occupancy of each two-dimensional image and outputs the final predicted voxel occupancy.
Example 2
The scheme of Example 1 is described in detail below with reference to specific calculation formulas and examples:
the number of the structural bodies of the convolutional neural network model in the step 2 is 3, and the structural bodies are respectively as follows: encoder, decoder, multiview feature combination module. The encoder network designed by the invention is based on a ResNet network, and is added with SE-Block so that the model has a simple attention mechanism. The ReLU activation function was chosen for each convolutional layer and regularized using BatchNorm. The SE-Block module can improve the expression capability of the features at a small cost, and only different weights need to be distributed to different channels. The network structure of the encoder is shown in fig. 3.
In a specific embodiment of the present invention, the encoder network embeds an SE-Block module, which first applies global average pooling to the W × H × C feature map to obtain a 1 × 1 × C feature map that perceives the global information of the image; this operation is called Squeeze. The Excitation operation follows: two fully connected layers whose result is limited to between 0 and 1 by a Sigmoid function. The final values represent the weight of each channel, realizing the attention mechanism.
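As an illustration only, a minimal PyTorch sketch of such an SE-Block follows; the reduction ratio of 16 is the common SE-Net default and is an assumption, since the patent does not specify it.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        # Squeeze: global average pooling of the W x H x C feature map to 1 x 1 x C.
        # Excitation: two fully connected layers with a Sigmoid limiting the result
        # to (0, 1); the outputs are the per-channel attention weights.
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            w = x.mean(dim=(2, 3))           # Squeeze
            w = self.fc(w).view(b, c, 1, 1)  # Excitation
            return x * w                     # channel-wise reweighting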
In the embodiment of the invention, the encoder improves on the ResNet50 network. The input size of the encoder is 256 × 256 × 3 and its output is 4 × 4 × 2048. The encoder of the present invention is modified as follows:
1. SE-Block is added to each residual module, giving the model a simple attention mechanism. The network configurations of ResNet50 and SE-ResNet50 are shown in FIG. 3;
2. the input size is changed from the 224 × 224 of the standard ResNet50 network to 256 × 256;
3. the final fully connected layer of ResNet50 is removed;
4. the encoder output size is changed to 4 × 4.
In a specific embodiment of the present invention, the decoder is configured to decode the three-dimensional features into a three-dimensional volume. Similar to the encoder, the decoder of the present invention also comprises five residual blocks, each composed of a Conv3d layer, a BN layer, 3D unpooling and ReLU, except that the last block consists of a convolutional layer only. The 4 × 4 × 2048 feature obtained from the encoder is converted into a three-dimensional 4 × 4 × 4 × 512 volume and input into the decoder; Conv3d and unpooling processing yield a resolution of 32 × 32 × 32 × 2, and finally softmax normalization produces the prediction probability O_i of the three-dimensional volume.
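A hedged PyTorch sketch of such a decoder follows. Only the layer types (three-dimensional convolutions, BatchNorm, ReLU, a final convolution and softmax) and the 32 × 32 × 32 × 2 output follow the text; the channel widths and the use of three stride-2 transposed convolutions (4³ → 8³ → 16³ → 32³) in place of the patent's unpooling layers are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class VoxelDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            def up(c_in, c_out):  # one upsampling block: deconv + BN + ReLU
                return nn.Sequential(
                    nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                    nn.BatchNorm3d(c_out),
                    nn.ReLU(inplace=True),
                )
            self.net = nn.Sequential(
                up(512, 128),  # 4^3  -> 8^3
                up(128, 32),   # 8^3  -> 16^3
                up(32, 8),     # 16^3 -> 32^3
                nn.Conv3d(8, 2, kernel_size=3, padding=1),  # two classes: empty / occupied
            )

        def forward(self, v: torch.Tensor) -> torch.Tensor:
            # v: (B, 512, 4, 4, 4) -> (B, 2, 32, 32, 32) per-voxel class probabilities
            return torch.softmax(self.net(v), dim=1)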
Further, in the embodiment of the present invention, the cross-entropy loss function commonly used in classification tasks is selected to optimize the training of the network model. Assuming that the prediction follows a binomial distribution, the loss function is computed as equation (1):

L_i = -[ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]    (1)

where y_i = 1 denotes a positive sample, y_i = 0 denotes a negative sample, and p_i denotes the probability that sample i is predicted to be positive.
The network designed by the invention requires the sum of the voxel cross entropies, given by equation (2):

L = -Σ_{i,j,k} [ GT_{i,j,k} log(O_{i,j,k}) + (1 - GT_{i,j,k}) log(1 - O_{i,j,k}) ]    (2)

where GT_{i,j,k} is the value of the ground-truth voxel grid at coordinate (i, j, k): GT_{i,j,k} = 1 indicates the voxel grid is occupied and GT_{i,j,k} = 0 indicates it is empty; O_{i,j,k} is the predicted probability of the voxel grid at that coordinate in the final predicted voxel occupancy.
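A direct PyTorch transcription of equation (2) could look as follows; the clamping epsilon is a numerical-stability detail not stated in the patent.

    import torch

    def voxel_cross_entropy(O: torch.Tensor, GT: torch.Tensor) -> torch.Tensor:
        # O:  predicted occupancy probabilities, shape (32, 32, 32), values in (0, 1)
        # GT: ground-truth occupancy, same shape, values in {0, 1}
        eps = 1e-7
        O = O.clamp(eps, 1.0 - eps)
        return -(GT * torch.log(O) + (1.0 - GT) * torch.log(1.0 - O)).sum()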
Further, in an embodiment of the present invention, the multi-view feature combination module is configured to combine the features of multiple input images. In the algorithm structure designed by the invention, each input image yields a predicted voxel occupancy probability O_t, and this probability is biased towards the visible part of the object. Each input image is taken from a different angle, so as two-dimensional images are continuously input, the fused visible portion of the object grows. As the model fuses more and more information, the occupancy of the voxel grid becomes more and more definite, and the reconstruction performance of the model improves continuously.
More concretely, the final predicted voxel occupancy probability O is determined by equation (3):

O_{i,j,k} = max_{t=1,…,n} O^t_{i,j,k}    (3)

where O_{i,j,k} is the probability of the voxel grid at (i, j, k), O^t_{i,j,k} is the probability of that voxel grid obtained from the t-th image, and the final predicted voxel occupancy probability O takes the maximum over the n predicted voxel occupancy probabilities obtained from all n input images.
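Equation (3) amounts to an element-wise maximum over the per-view probability grids, as in this one-line PyTorch sketch:

    import torch

    def combine_views(per_view_occupancy: torch.Tensor) -> torch.Tensor:
        # per_view_occupancy: (n, 32, 32, 32), one probability grid per input image
        return per_view_occupancy.max(dim=0).values  # voxel-wise max over the n views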
Further, in a specific embodiment of the present invention, the hierarchical prediction strategy judges from outside to inside whether each voxel grid is occupied according to the final predicted voxel occupancy probability O, and dynamically adjusts the voxel-grid threshold according to the occupancy of the outer-layer voxel grids: if few outer-layer voxel grids are occupied, a smaller threshold is selected, and vice versa.
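The patent does not state the exact threshold-update rule, so the following PyTorch sketch is only one plausible realization: shells are indexed by their distance from the grid boundary, and the linear update together with the base and span constants are assumptions made for illustration.

    import torch

    def hierarchical_predict(O: torch.Tensor, base: float = 0.5, span: float = 0.2) -> torch.Tensor:
        # O: fused occupancy probabilities, shape (n, n, n); returns boolean occupancy.
        n = O.shape[0]
        idx = torch.arange(n)
        i, j, k = torch.meshgrid(idx, idx, idx, indexing="ij")
        # shell depth 0 is the outermost layer of the voxel grid
        depth = torch.minimum(torch.minimum(torch.minimum(i, n - 1 - i),
                                            torch.minimum(j, n - 1 - j)),
                              torch.minimum(k, n - 1 - k))
        occ = torch.zeros_like(O, dtype=torch.bool)
        threshold = base
        for s in range(n // 2):                      # judge shells from outside to inside
            shell = depth == s
            occ[shell] = O[shell] >= threshold
            ratio = occ[shell].float().mean()        # fraction of this shell occupied
            threshold = base + span * (ratio - 0.5)  # sparse shell -> smaller threshold
        return occ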
In summary, the present invention provides a novel three-dimensional object reconstruction algorithm that can be trained and tested without any additional input such as object class labels or pose information, and whose reconstruction result does not change when the order of the input images changes.
Example 3
The following experiments were performed to verify the feasibility of the protocols of examples 1 and 2, as described in detail below:
1) experimental data set
The invention selects a subset of the ShapeNet dataset, the ShapeNetCore dataset, for training and testing. ShapeNetCore is a subset of the full ShapeNet dataset, containing 55 common object categories and approximately 51,300 three-dimensional models. The invention selects the 13 categories with more than 1,000 models each, 43,783 three-dimensional models in total. The selected subset categories are: airplane (plane), chair, car, table, sofa (couch), stool (bench), cabinet, display (monitor), table lamp (lamp), speaker, rifle, telephone and boat (vessel).
Each model is rendered as 256 × 256 resolution images taken from 12 different angles and saved together with its ground-truth voxel occupancy at a resolution of 32 × 32 × 32. This dataset is hereinafter referred to as the ShapeNet dataset. FIG. 4 shows the number of models in each category of the dataset.
2) Evaluation criteria
The present invention uses the voxel-based Intersection-over-Union ratio IoU to quantitatively evaluate reconstruction performance. IoU is given by equation (4):

IoU = |Prediction ∩ GroundTruth| / |Prediction ∪ GroundTruth|    (4)

where Prediction denotes the final predicted voxel occupancy and GroundTruth denotes the ground-truth voxel occupancy. IoU approaches 0 when the model predicts voxel grids incorrectly (for example, predicting voxels whose ground-truth value is 0 as 1, or voxels whose ground-truth value is 1 as 0), and approaches 1 when the prediction agrees with the ground truth.
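Over boolean occupancy grids, equation (4) reduces to a few tensor operations; a PyTorch sketch:

    import torch

    def voxel_iou(prediction: torch.Tensor, ground_truth: torch.Tensor) -> float:
        # prediction, ground_truth: boolean (or 0/1) occupancy grids of equal shape
        pred, gt = prediction.bool(), ground_truth.bool()
        intersection = (pred & gt).sum().item()
        union = (pred | gt).sum().item()
        return intersection / union if union > 0 else 1.0  # both empty counts as a match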
3) Comparison method
The method was compared experimentally with two methods:
3D-R2N2: proposed by Choy et al., it uses a deep convolutional neural network to learn, in a purely end-to-end manner from a large amount of training data, the mapping from two-dimensional images of objects to their three-dimensional geometry.
3D-FHNet: proposed by Lu et al., its network uses a feature combination method that treats every input image equally together with a hierarchical prediction strategy that improves the reconstruction of the thinner parts of the object, enabling the model to perform three-dimensional reconstruction from any number of images.
4) Network parameter configuration
During training, the Adam (adaptive moment estimation) optimization method is used for network training. The parameters of the Adam optimizer are set as follows: learning rate 0.0001, updated every 10 epochs; weight decay 1e-5; beta_1 = 0.9; beta_2 = 0.999; batch size 8. In addition, epsilon is set to 10e-8 to prevent division by zero during computation.
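In PyTorch, this configuration would look roughly as follows. The StepLR decay factor gamma is an assumption, since the patent states only that the learning rate is updated every 10 epochs, and the placeholder module stands in for the 3D FONet model defined elsewhere.

    import torch

    model = torch.nn.Linear(4, 2)  # placeholder standing in for the 3D FONet network
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=1e-4,             # learning rate 0.0001
        betas=(0.9, 0.999),  # beta_1, beta_2
        eps=10e-8,           # epsilon, prevents division by zero
        weight_decay=1e-5,
    )
    # learning rate updated every 10 epochs; batch size 8 is set in the DataLoader
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)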
5) Results of the experiment
FIG. 5 illustrates the effect of different numbers of views on the three-dimensional reconstruction results.
As can be seen from FIG. 5, for the 3D FONet algorithm proposed by the present invention, the quality of three-dimensional object reconstruction improves as the number of views increases. Compared with the 3D-FHNet baseline, the proposed 3D FONet is slightly weaker in single-view reconstruction; when 3 views, 6 views and 12 views are input, however, the reconstruction performance of 3D FONet exceeds that of 3D-FHNet, which can be attributed to the encoder and decoder modules designed by the invention. Compared with the 3D-R2N2 algorithm, for the same number of input views the three-dimensional reconstruction of the proposed 3D FONet algorithm is better than that of 3D-R2N2.
In summary, the experimental process, the experimental data and the experimental results in the embodiments of the present invention verify the feasibility of the schemes in embodiments 1 and 2, and the three-dimensional object reconstruction algorithm provided in the embodiments of the present invention has a good three-dimensional reconstruction capability for a two-dimensional image.

Claims (5)

1. A three-dimensional object reconstruction algorithm based on deep learning is characterized by comprising the following steps:
step 1: inputting a plurality of two-dimensional images of an object obtained from an arbitrary angle;
step 2: establishing a convolutional neural network model;
and step 3: inputting the two-dimensional image in the step 1 as training data into the convolutional neural network established in the step 2 for training;
step 4: inputting the two-dimensional images to be tested into the convolutional neural network model trained in step 3, which outputs the three-dimensional reconstruction result.
2. The method of claim 1, wherein: the convolutional neural network model in step 2 comprises three structures: the encoder, the decoder and the multi-view feature combination module. The encoder is a ResNet50 network with SE-Block embedded; its convolutional layers are regularized using BatchNorm, with the ReLU function as activation. The decoder consists of three-dimensional deconvolution layers, BatchNorm layers, three-dimensional unpooling layers and ReLU activations. The encoder takes the multi-view two-dimensional images as input and outputs two-dimensional feature vectors, which must be converted into three-dimensional information; the decoder takes this three-dimensional information and produces the predicted three-dimensional voxel occupancy of a single image; the multi-view feature combination module takes the predicted voxel occupancy of each two-dimensional image and outputs the final predicted voxel occupancy.
3. The method according to claim 2, wherein step 3 comprises the following steps:
3.1 randomly initializing the convolutional neural network model;
3.2 inputting each image independently into the two-dimensional encoder for feature extraction;
3.3 converting the two-dimensional feature vector F_i extracted in step 3.2 into three-dimensional information V_i;
3.4 inputting the three-dimensional information V_i obtained in step 3.3 into the three-dimensional decoder for decoding, generating three-dimensional probability voxels and obtaining the predicted voxel occupancy O_i;
3.5 combining the predicted voxel occupancies O_i obtained for each image in step 3.4 in the multi-view feature combination module to obtain the final predicted voxel occupancy O;
3.6 optimizing the parameters of the model step by step via a cross-entropy loss function.
4. The method of claim 3, wherein: the step 4 specifically comprises the following steps:
4.1 inputting the set of two-dimensional images to be tested into the convolutional neural network trained in steps 3.2 to 3.6 to obtain the predicted voxel probability O;
4.2 computing the accuracy by comparing the binary 0-1 occupancy produced by the hierarchical prediction strategy with the ground-truth occupancy.
5. The method of claim 4, wherein: the hierarchical prediction strategy of step 4.2 judges from outside to inside whether each voxel grid is occupied according to the final predicted voxel occupancy probability O from the multi-view feature combination module, and dynamically adjusts the voxel-grid threshold according to the occupancy of the outer-layer voxel grids.
CN202110563571.6A 2021-05-24 2021-05-24 Three-dimensional object reconstruction algorithm based on deep learning Pending CN113393582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110563571.6A CN113393582A (en) 2021-05-24 2021-05-24 Three-dimensional object reconstruction algorithm based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110563571.6A CN113393582A (en) 2021-05-24 2021-05-24 Three-dimensional object reconstruction algorithm based on deep learning

Publications (1)

Publication Number Publication Date
CN113393582A (en) 2021-09-14

Family

ID=77619004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110563571.6A Pending CN113393582A (en) 2021-05-24 2021-05-24 Three-dimensional object reconstruction algorithm based on deep learning

Country Status (1)

Country Link
CN (1) CN113393582A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116773534A (en) * 2023-08-15 2023-09-19 宁德思客琦智能装备有限公司 Detection method and device, electronic equipment and computer readable medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680158A (en) * 2017-11-01 2018-02-09 长沙学院 A kind of three-dimensional facial reconstruction method based on convolutional neural networks model
US20190303724A1 (en) * 2018-03-30 2019-10-03 Tobii Ab Neural Network Training For Three Dimensional (3D) Gaze Prediction With Calibration Parameters
CN110390638A (en) * 2019-07-22 2019-10-29 北京工商大学 A kind of high-resolution three-dimension voxel model method for reconstructing
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN111652966A (en) * 2020-05-11 2020-09-11 北京航空航天大学 Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
CN111968235A (en) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment
CN112734911A (en) * 2021-01-07 2021-04-30 北京联合大学 Single image three-dimensional face reconstruction method and system based on convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680158A (en) * 2017-11-01 2018-02-09 长沙学院 A kind of three-dimensional facial reconstruction method based on convolutional neural networks model
US20190303724A1 (en) * 2018-03-30 2019-10-03 Tobii Ab Neural Network Training For Three Dimensional (3D) Gaze Prediction With Calibration Parameters
CN110390638A (en) * 2019-07-22 2019-10-29 北京工商大学 A kind of high-resolution three-dimension voxel model method for reconstructing
CN110543581A (en) * 2019-09-09 2019-12-06 山东省计算中心(国家超级计算济南中心) Multi-view three-dimensional model retrieval method based on non-local graph convolution network
CN111652966A (en) * 2020-05-11 2020-09-11 北京航空航天大学 Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
CN111968235A (en) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment
CN112734911A (en) * 2021-01-07 2021-04-30 北京联合大学 Single image three-dimensional face reconstruction method and system based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李欣冉 (LI Xinran): "Research on Key Technologies of Three-Dimensional Moving Target Reconstruction Based on Deep Convolutional Neural Networks", China Excellent Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116773534A (en) * 2023-08-15 2023-09-19 宁德思客琦智能装备有限公司 Detection method and device, electronic equipment and computer readable medium
CN116773534B (en) * 2023-08-15 2024-03-05 宁德思客琦智能装备有限公司 Detection method and device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
CN110544297B (en) Three-dimensional model reconstruction method for single image
CN110659727B (en) Sketch-based image generation method
CN111259945B (en) Binocular parallax estimation method introducing attention map
CN110427799B (en) Human hand depth image data enhancement method based on generation of countermeasure network
CN110570522B (en) Multi-view three-dimensional reconstruction method
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN112084934B (en) Behavior recognition method based on bone data double-channel depth separable convolution
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN110070595A (en) A kind of single image 3D object reconstruction method based on deep learning
CN110335344A (en) Three-dimensional rebuilding method based on 2D-3D attention mechanism neural network model
CN113112607B (en) Method and device for generating three-dimensional grid model sequence with any frame rate
CN110956655B (en) Dense depth estimation method based on monocular image
CN111861906A (en) Pavement crack image virtual augmentation model establishment and image virtual augmentation method
CN112614070B (en) defogNet-based single image defogging method
CN110852935A (en) Image processing method for human face image changing with age
CN112634438A (en) Single-frame depth image three-dimensional model reconstruction method and device based on countermeasure network
CN115546442A (en) Multi-view stereo matching reconstruction method and system based on perception consistency loss
CN112489198A (en) Three-dimensional reconstruction system and method based on counterstudy
CN112509021A (en) Parallax optimization method based on attention mechanism
CN115861418A (en) Single-view attitude estimation method and system based on multi-mode input and attention mechanism
CN113393582A (en) Three-dimensional object reconstruction algorithm based on deep learning
Wang et al. DepthNet Nano: A highly compact self-normalizing neural network for monocular depth estimation
CN115860113B (en) Training method and related device for self-countermeasure neural network model
CN115527052A (en) Multi-view clustering method based on contrast prediction
CN113808006B (en) Method and device for reconstructing three-dimensional grid model based on two-dimensional image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210914