CN112862949A - Object 3D shape reconstruction method based on multiple views - Google Patents

Object 3D shape reconstruction method based on multiple views

Info

Publication number
CN112862949A
CN112862949A (application CN202110065500.3A)
Authority
CN
China
Prior art keywords
module
vertex
graph
shape
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110065500.3A
Other languages
Chinese (zh)
Other versions
CN112862949B (en)
Inventor
童超
陈荣山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202110065500.3A priority Critical patent/CN112862949B/en
Publication of CN112862949A publication Critical patent/CN112862949A/en
Application granted granted Critical
Publication of CN112862949B publication Critical patent/CN112862949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

A multi-view based object 3D shape reconstruction method is provided. The provided multi-view three-dimensional shape reconstruction model builds on the basic structure of Pixel2Mesh and improves it in three respects: adding a ConvLSTM layer, adding a Graph unpooling layer, and designing a Smooth loss function. Experiments show that the improved model achieves higher reconstruction accuracy than P2M. With this model, the ground-truth object mesh models, rendered images and camera parameters in the ShapeNet dataset are first preprocessed to construct training data, a multi-view three-dimensional reconstruction model is then trained, and finally the shape of the object shown in the images is reconstructed by the model.

Description

Object 3D shape reconstruction method based on multiple views
Technical Field
The invention provides a multi-view-based object 3D shape reconstruction model, belonging to the technical fields of image data processing (G06T) and three-dimensional reconstruction (G06T17).
Background
One of the main goals of three-dimensional reconstruction is to recover the three-dimensional structure of an object from two-dimensional images. In recent years, with the development of industries such as virtual reality, 3D printing, autonomous driving, intelligent healthcare and film and television production, the demand for three-dimensional models has grown explosively. Traditional manual modeling can hardly meet this demand, and accurate, efficient three-dimensional reconstruction methods have become the key to solving this problem.
In the field of three-dimensional reconstruction, conventional image-based reconstruction algorithms recover a three-dimensional model from images or video mostly by feature matching or model fitting. However, owing to the ambiguity and sparsity of two-dimensional image features, such methods are often severely limited: they cannot adapt to reconstruction tasks in diverse scenes and struggle to reconstruct three-dimensional models accurately and robustly.
In recent years, with the rapid development of deep learning, more and more researchers have turned to data-driven three-dimensional reconstruction algorithms. For example, Choy et al. (https://arxiv.org/pdf/1604.00449.pdf) built a voxel generation network with an Encoder-3DLSTM-Decoder structure; Fan et al. (https://openaccess.thecvf.com/content_cvpr_2017/papers/Fan_A_Point_Set_CVPR_2017_paper.pdf) addressed the loss-function problem in training point-cloud networks and built a deep network for generating point-cloud models on that basis. Generating three-dimensional models from RGB images with deep learning has indeed achieved great success, but methods based on voxel or point-cloud representations have problems that cannot be ignored. A voxel-represented object requires far more computation and memory than a two-dimensional image, so, limited by computation and memory, the resolution is mostly no higher than 32 × 32 × 32 and the reconstruction accuracy cannot meet the requirements. Points in a point-cloud model lack connectivity, so surface information is missing and the reconstructed surface looks uneven.
To address the shortcomings of voxel and point-cloud representations, Wang et al. (https://openaccess.thecvf.com/content_ECCV_2018/papers/Nanyang_Wang_Pixel2Mesh_Generating_3D_ECCV_2018_paper.pdf) proposed Pixel2Mesh, a deep learning method based on graph convolutional neural networks that can generate a mesh representation with rich surface detail end to end: a fixed ellipsoid is initialized and then gradually deformed according to the image information to approximate the target geometry. However, because a single image is an ill-posed source of shape, this single-view approach usually produces 3D mesh shapes that look reasonable only from the input viewpoint and poor from other viewpoints.
ShapeNet (https://www.shapenet.org/) is currently one of the most authoritative datasets in the field of three-dimensional reconstruction; it contains 55 common object categories and about 51,300 three-dimensional models. Thanks to Choy's work, each model in 13 of these categories comes with 24 rendered images from different viewpoints together with the camera parameters of each rendering, so the multi-view information of the dataset can be used to improve on Pixel2Mesh.
Disclosure of Invention
The aim of the invention is to study, based on the theory and methods of deep learning, a novel, high-precision model for reconstructing the three-dimensional shape of an object under multiple views. Using rendered images of an object from multiple viewpoints together with the camera parameters, the model can reconstruct the corresponding 3D mesh shape, and it outperforms the current most advanced three-dimensional reconstruction models.
The invention designs a multi-view-based object three-dimensional shape reconstruction model. Based on the basic structure of Pixel2Mesh, it improves the three-dimensional reconstruction model in three respects: adding a ConvLSTM layer, adding a Graph unpooling layer, and designing a Smooth loss function. Experiments show that the improved model achieves higher reconstruction accuracy than P2M.
With this model, the ground-truth object mesh models, rendered images and camera parameters in the ShapeNet dataset are first preprocessed to construct training data, a multi-view three-dimensional reconstruction model is then trained, and finally the shape of the object shown in the images is reconstructed by the model.
The invention comprises the following steps:
step 1, data preparation
1. For the 13 classes of 3D mesh models in the ShapeNet dataset that have corresponding rendered pictures, 16834 three-dimensional points are sampled from the surface of each model by average (uniform) sampling and used as the sample labels.
2. An ellipsoid mesh model is used as the deformation template; the ellipsoid contains 156 vertices and 308 triangular faces. It is placed 0.8 m directly in front of the camera and, with that position as its center, its three axis radii are 0.2 m, 0.2 m and 0.4 m respectively.
Step 2, training a three-dimensional reconstruction model:
1. A three-dimensional reconstruction model for multi-view objects is constructed. It is divided into a feature extraction part and a template deformation part, and a joint training strategy is adopted, i.e. the feature extraction part and the template deformation part are trained simultaneously.
1) The first part is the feature extraction part, used to extract features of the input images. Its network backbone uses the VGG-16 architecture (https://arxiv.org/pdf/1409.1556.pdf) and extracts image features with a CNN. Three images (denoted graph A, B and C) are fed in at the network INPUT layer and encoded by 18 convolutional layers in sequence; features are output at the 8th, 11th, 14th and 18th layers, giving feature maps of size 56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256 and 7 × 7 × 512 for each input image. Feature maps of the same size are then processed by a ConvLSTM layer (https://papers.nips.cc/paper/2015/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf); for example, the three 56 × 56 × 64 feature maps obtained from the 3 input images are processed together to obtain three fused 56 × 56 × 64 pixel feature maps. After the ConvLSTM processing there are therefore 3 sets of fused pixel feature maps of sizes [56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256, 7 × 7 × 512]; for ease of distinction, the fused features obtained from graph A are called fused pixel feature map 1, those from graph B fused pixel feature map 2, and those from graph C fused pixel feature map 3. The specific processing procedure is described in detail below with reference to FIG. 1, FIG. 9 and FIG. 10 of the specification.
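As an illustration of the multi-view fusion just described, the following sketch feeds three views through a small stand-in CNN and fuses the same-size feature maps with a Keras ConvLSTM2D layer. The tiny backbone, variable names and everything except the quoted 56 × 56 × 64 size are illustrative assumptions, not the patented network.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Toy stand-in for one VGG-16 stage: the real network has 18 conv layers and
# taps features after layers 8, 11, 14 and 18 (56x56x64 ... 7x7x512).
def tiny_backbone():
    inp = layers.Input(shape=(224, 224, 3))
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPool2D(2)(x)                      # 112 x 112
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    f56 = layers.MaxPool2D(2)(x)                    # 56 x 56 x 64 feature tap
    return tf.keras.Model(inp, f56)

backbone = tiny_backbone()

# Three rendered views of the same object (batch of 1).
views = [tf.random.uniform((1, 224, 224, 3)) for _ in range(3)]
feats = [backbone(v) for v in views]                # three 56x56x64 maps

# Stack the per-view maps along a "time" axis and fuse them with ConvLSTM2D.
# return_sequences=True keeps one fused 56x56x64 map per view, matching the
# "fused pixel feature map 1/2/3" naming in the text.
seq = tf.stack(feats, axis=1)                       # (1, 3, 56, 56, 64)
fuse = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)
fused = fuse(seq)                                   # (1, 3, 56, 56, 64)
fused_map_A, fused_map_B, fused_map_C = tf.unstack(fused, axis=1)
```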
2) The second part is the template deformation part, used to deform the ellipsoid template to reconstruct the 3D shape of the object in the input images. Its input is the ellipsoid structure <point, edge, face> together with the fused pixel feature maps extracted by the first part, and its output is the predicted 3D shape. The network backbone adopts the G-ResNet (graph residual neural network) architecture (see https://openaccess.thecvf.com/content_ECCV_2018/papers/Nanyang_Wang_Pixel2Mesh_Generating_3D_ECCV_2018_paper.pdf) and, by function, can be divided into four kinds of modules: deformation modules, graph pooling modules, graph unpooling modules and projection modules.
The deformation module updates vertex features and predicts the vertex coordinate positions <x, y, z> of the 3D shape for its input, using a graph convolution network architecture. The input of a deformation module is graph-structured data of size N × 963, where N is the number of graph vertices and 963 is the feature vector dimension of each vertex; the structure of the input graph is the predefined mesh template structure. A deformation module comprises 14 graph convolution layers; after these 14 layers it finally outputs an N × 3 deformation result, where N is the number of vertices of the module's input graph and 3 indicates the 3 components of the coordinates <x, y, z>.
The template deformation part contains several deformation modules; different deformation modules are used at different stages of the three-dimensional reconstruction model, and N takes a different value for each of them. FIG. 8 shows a schematic diagram of the processing of a deformation module; the "other operations" in FIG. 8 include the operations performed by the projection module, the graph pooling module, the graph unpooling module, and so on.
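The following is a minimal sketch of one graph convolution layer of the kind a deformation module stacks 14 times, mapping N × 963 vertex features through N × 256 hidden layers down to N × 3 coordinates. The propagation rule (neighbor aggregation over a sparse adjacency plus a self term) is a common Pixel2Mesh-style choice assumed here, not necessarily the patent's exact layer.

```python
import tensorflow as tf

class GraphConv(tf.keras.layers.Layer):
    """One graph convolution layer: y = act(A_hat @ x @ W_nbr + x @ W_self)."""
    def __init__(self, in_dim, out_dim, activation=tf.nn.relu):
        super().__init__()
        self.activation = activation
        init = tf.keras.initializers.GlorotUniform()
        self.w_nbr = tf.Variable(init((in_dim, out_dim)), name="w_nbr")
        self.w_self = tf.Variable(init((in_dim, out_dim)), name="w_self")

    def call(self, x, adj):
        # x: (N, F) vertex features; adj: sparse (N, N) normalized adjacency.
        nbr = tf.sparse.sparse_dense_matmul(adj, x)
        y = nbr @ self.w_nbr + x @ self.w_self
        return self.activation(y) if self.activation is not None else y

# Tiny 4-vertex ring as a stand-in for the 156/628/2466-vertex template graphs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
idx = edges + [(j, i) for i, j in edges]
adj = tf.sparse.reorder(tf.sparse.SparseTensor(
    indices=idx, values=[0.5] * len(idx), dense_shape=(4, 4)))

x = tf.random.normal((4, 963))                      # N x 963 per-vertex features
h = GraphConv(963, 256)(x, adj)                     # hidden layers work at N x 256
xyz = GraphConv(256, 3, activation=None)(h, adj)    # last layer outputs N x 3 coords
```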
The graph unpooling module structurally increases the number of vertices of the ellipsoid template and enriches the surface details of the final predicted shape. Specifically, on a triangular face of the mesh template, the midpoints of the triangle's three edges are taken as new vertices and connected to each other to form new edges, so that one triangular face is subdivided into four small triangular faces. The implementation principle of the graph unpooling module is shown in FIG. 3, and its processing can be seen in FIG. 7.
The graph pooling module structurally reduces the number of vertices of the ellipsoid template and removes noise generated in the prediction process while preserving the overall effect of the predicted shape. Specifically, it records the indices of the new and old points generated by the previous graph unpooling module and performs an aggregation operation according to these indices during pooling, using mean, maximum or minimum aggregation, or keeping the original values, so that four triangular faces are merged back into one. Its implementation principle is shown in FIG. 3, and its processing can be seen in FIG. 7.
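A NumPy sketch of the two operations described above follows. Mean aggregation for pooling, the function names and the parent-index bookkeeping are illustrative choices consistent with the description, not the patent's exact implementation.

```python
import numpy as np

def graph_unpool(verts, faces):
    """Split each triangle into 4 by inserting edge midpoints (cf. FIG. 3).
    Also returns, for every new vertex, the indices of its two parent
    vertices -- the index bookkeeping the graph pooling module relies on."""
    edge_to_new, parents, new_verts = {}, [], [v for v in verts]

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in edge_to_new:
            edge_to_new[key] = len(new_verts)
            new_verts.append((verts[i] + verts[j]) / 2.0)
            parents.append(key)
        return edge_to_new[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.array(new_verts), np.array(new_faces), parents

def graph_pool(feats, n_old, parents, mode="mean"):
    """Drop the vertices added by the matching unpooling step and (for 'mean')
    fold their features back into the two parent vertices."""
    pooled = feats[:n_old].astype(float)
    if mode == "mean":
        counts = np.ones(n_old)
        for k, (i, j) in enumerate(parents):
            pooled[i] += feats[n_old + k]; counts[i] += 1
            pooled[j] += feats[n_old + k]; counts[j] += 1
        pooled /= counts[:, None]
    return pooled

# One triangle becomes four; 3 vertices become 6.
v = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]])
f = np.array([[0, 1, 2]])
v2, f2, parents = graph_unpool(v, f)
pooled = graph_pool(np.random.rand(len(v2), 8), len(v), parents)
print(v2.shape, f2.shape, pooled.shape)          # (6, 3) (4, 3) (3, 8)
```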
The projection module projects the points of the 3D shape to the corresponding 2D coordinates in the fused pixel feature maps and extracts the pixel features of the 3D shape's vertices. It takes as input N three-dimensional point coordinates <x, y, z> and the fused pixel feature maps output by the feature extraction part, and outputs an N × 963 vector, where N is the number of input three-dimensional points and 963 is the feature vector dimension of each 3D point; this vector is then fed into a deformation module for updating. Referring to FIG. 6, the processing projects each vertex of the 3D shape to the corresponding 2D coordinate in the fused pixel feature map and then extracts the pixel feature F at that coordinate. The fused pixel feature maps consist of image features of sizes, for example, 56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256 and 7 × 7 × 512.
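A sketch of the projection-and-sampling step under an assumed pinhole model follows; the focal length, the image-to-feature-map rescaling and the function name are illustrative, since the patent does not spell out its projection formula here.

```python
import numpy as np

def project_and_sample(points_cam, feat_map, focal=248.0):
    """Project camera-frame 3D points onto a 224 x 224 image plane and sample
    the fused pixel feature map bilinearly at the projected 2D coordinates.
    `focal` and the rescaling are illustrative constants, not the patent's."""
    h, w, _ = feat_map.shape
    x = focal * points_cam[:, 0] / -points_cam[:, 2] + 112.0
    y = focal * points_cam[:, 1] / -points_cam[:, 2] + 112.0
    u = np.clip(x * w / 224.0, 0, w - 1.001)
    v = np.clip(y * h / 224.0, 0, h - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    # Bilinear blend of the four surrounding feature-map cells.
    f00, f01 = feat_map[v0, u0], feat_map[v0, u0 + 1]
    f10, f11 = feat_map[v0 + 1, u0], feat_map[v0 + 1, u0 + 1]
    top = f00 * (1 - du) + f01 * du
    bot = f10 * (1 - du) + f11 * du
    return top * (1 - dv) + bot * dv                 # (N, C) per-vertex pixel features

# Two template vertices roughly 0.8 m in front of the camera.
pts = np.array([[0.10, 0.00, -0.80], [-0.05, 0.08, -0.85]])
feats56 = project_and_sample(pts, np.random.rand(56, 56, 64))
print(feats56.shape)                                 # (2, 64)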
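```

Concatenating the samples taken from the 56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256 and 7 × 7 × 512 maps gives 64 + 128 + 256 + 512 = 960 channels; together with the 3D coordinate <x, y, z> this would account for the 963-dimensional vertex feature mentioned above (an inference, not stated explicitly in the text).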
The template deformation part comprises 7 deformation modules, 5 graph unpooling modules, 2 graph pooling modules and 7 projection modules. The ellipsoid template is input through a projection module and then passes in sequence through deformation module 1, a projection module, graph unpooling module 1, deformation module 2, a projection module, graph unpooling module 2, deformation module 3, a projection module, graph pooling module 1, deformation module 4, a projection module, graph unpooling module 3, deformation module 5, a projection module, graph pooling module 2, deformation module 6, a projection module, graph unpooling module 4 and deformation module 7 to obtain the final predicted shape (2466 vertices and 4928 faces). The framework of the template deformation part is shown in FIG. 2, and the specific processing procedure in FIG. 10.
2. The loss-function weight parameters are set. The loss function of the reconstruction model consists of 5 parts: CD (Chamfer distance) loss, Normal loss, Laplacian loss, Edge length loss and Smooth loss; the weight parameters are set to 1, 1.6e-4, 0.3, 0.1 and 1.6e-5 in that order.
The loss function is applied in the deformation modules. CD (Chamfer distance) loss, Normal loss, Laplacian loss and Edge length loss are calculated in all deformation modules; the Smooth loss is calculated once the number of generated 3D shape vertices reaches 2466, i.e. it is additionally calculated in deformation modules 3, 5 and 7 of FIG. 2.
3. The training parameters are set, including the model's optimization method, learning rate and maximum number of iterations; for example, Adam optimization is used, the number of iterations is set to 50, with a learning rate of 3e-5 for 30 iterations and 1e-5 for 20 iterations (a configuration sketch follows step 6 below).
4. For example, a 3D model is selected from the ShapeNet dataset to obtain its sampled point data and 3 corresponding images; together with the camera parameters these are input into the three-dimensional reconstruction model according to the embodiment of the application, and the loss is computed by forward propagation.
5. The loss is back-propagated to update the weights; the initial weight values are preset values.
6. The model is trained for 30 iterations at a learning rate of 3e-5 and then for another 20 iterations at 1e-5 to obtain the trained neural network model, and the most recently updated model parameters are saved.
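A minimal sketch of the training configuration named in steps 2, 3 and 6 (loss weights, Adam, 30 iterations at 3e-5 followed by 20 at 1e-5); the wiring is illustrative.

```python
import tensorflow as tf

# Loss weights and schedule as described in steps 2-6 above.
loss_weights = {"cd": 1.0, "normal": 1.6e-4, "laplacian": 0.3,
                "edge": 0.1, "smooth": 1.6e-5}

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)

for iteration in range(50):
    # 3e-5 for the first 30 iterations, 1e-5 for the remaining 20.
    optimizer.learning_rate.assign(3e-5 if iteration < 30 else 1e-5)
    # ... one pass over the ShapeNet training pairs goes here ...
```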
On the basis of a thorough analysis of the problems of the Pixel2Mesh model, the invention provides an improved three-dimensional reconstruction model in three respects: adding a ConvLSTM layer, adding a Graph unpooling layer and designing a Smooth loss function. Experiments show that the improved model achieves higher reconstruction accuracy than P2M, both on the CD metric and in the appearance of the reconstructed shapes.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram of a three-dimensional reconstruction model according to the present invention;
FIG. 3 is a schematic diagram of the implementation of graph pooling and graph unpooling;
FIG. 4 is a Smooth Loss calculation diagram;
FIG. 5 compares the reconstruction results with those of other advanced three-dimensional reconstruction models;
FIG. 6 is a schematic view of the processing of the projection module;
FIG. 7 is a schematic view of the processing of the Graph unpooling and Graph pooling modules;
FIG. 8 is a schematic diagram of a deformation module processing procedure;
FIG. 9 is a schematic diagram of the processing procedure of the image feature extraction module;
FIG. 10 is a data flow framework diagram of a three-dimensional reconstruction model during processing according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
According to an embodiment of the application, a three-dimensional reconstruction model for generating the three-dimensional structure of an object from two-dimensional images is provided. The three-dimensional reconstruction model, also referred to as a three-dimensional shape reconstruction network, is implemented, for example, as a computer program; when the computer program is executed by a processor, the three-dimensional reconstruction model according to the embodiment of the application is instantiated.
Referring to fig. 1, when the method is used for three-dimensional reconstruction, the specific processing steps are as follows:
1. data preparation
The datasets used to train the three-dimensional reconstruction model provided by the invention are the open-source ShapeNet 3D model dataset and the rendered-image dataset ShapeNetRendering provided by Choy et al. There are more than 43,000 3D model samples, and each sample corresponds to 24 rendered pictures and camera parameters from different viewpoints (during training, 3 pictures and their corresponding camera parameters are randomly selected as input). The 3D models are expressed as mesh structures (points, edges and faces). So that the loss can be quantified during training, the trimesh library for Python is used to sample the mesh surface evenly to obtain 16834 sample points (x, y, z); the surface normal vectors (m, n, k) are computed from the positional relations of the sample points and normalized, and the data type is floating point.
Because the method provided by the invention is a template-deformation-based three-dimensional reconstruction model, a 3D mesh shape must be initialized as the deformation template. As an example, an ellipsoid containing 156 vertices and 308 faces is used as the template; its projection parameters are configured so that it is placed 0.8 m directly in front of the camera and, centered there, its three axis radii are 0.2 m, 0.2 m and 0.4 m. The ellipsoid is generated with the MeshLab software. For an input 3D model, 3 images are randomly selected from the 24 images corresponding to that model in the ShapeNetRendering dataset; the Python library opencv is used to read the image data and divide it by 255.0 for normalization, a resize function unifies the image size to 224 × 224 × 3, and the camera parameters <azimuth, elevation, in-plane rotation, distance, field of view> corresponding to each image are selected for the subsequent projection and geometric transformation operations.
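A sketch of this preprocessing using trimesh and OpenCV as named above; the file paths and metadata layout (following Choy et al.'s released renderings) are placeholders and assumptions.

```python
import cv2
import numpy as np
import trimesh

# Ground truth: evenly sample points on the mesh surface and keep the face
# normals of the sampled faces as (x, y, z, m, n, k) labels.
mesh = trimesh.load("model.obj")                          # placeholder path
points, face_idx = trimesh.sample.sample_surface_even(mesh, 16834)  # may return slightly fewer
labels = np.hstack([points, mesh.face_normals[face_idx]]).astype(np.float32)

# Inputs: pick 3 of the 24 rendered views plus their camera parameters
# (<azimuth, elevation, in-plane rotation, distance, field of view>).
meta = np.loadtxt("rendering/rendering_metadata.txt")     # assumed layout of Choy's release
view_ids = np.random.choice(24, size=3, replace=False)
images, cams = [], []
for i in view_ids:
    img = cv2.imread(f"rendering/{i:02d}.png")            # placeholder file names
    img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
    images.append(img)
    cams.append(meta[i])
```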
2. Building a three-dimensional reconstruction model:
The three-dimensional reconstruction model provided by the invention is built with the TensorFlow + Keras deep learning framework, as shown in FIG. 2. The three-dimensional reconstruction model according to the embodiment of the application consists of an image feature extraction part (upper part of FIG. 2) and a template deformation part (lower part of FIG. 2); the bottom of FIG. 2 is a schematic diagram of the results of the template deformation process. The network backbone of the image feature extraction part is based on the VGG-16 architecture; this part extracts image features with a CNN and fuses them with ConvLSTM, and its CNN and ConvLSTM weight parameters need to be trained.
The input of the image feature extraction part is the 3 images obtained in the data preparation step. Starting from the network INPUT layer, each image is encoded by 18 convolutional layers in sequence, and the extracted features are output at the 8th, 11th, 14th and 18th layers, giving feature maps of size 56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256 and 7 × 7 × 512. To fuse the feature information of the multiple images, a ConvLSTM layer is designed for each feature-map size, and the feature maps of the same size from the different images are fused.
For example, the three 56 × 56 × 64 feature maps obtained from the 3 input images are processed to obtain three fused 56 × 56 × 64 pixel feature maps. After the ConvLSTM processing there are therefore 3 sets of fused pixel feature maps of sizes [56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256, 7 × 7 × 512]; for ease of distinction, the fused features obtained from graph A are called fused pixel feature map 1, those from graph B fused pixel feature map 2, and those from graph C fused pixel feature map 3.
The template deformation part deforms the ellipsoid template so as to reconstruct the 3D shape of the object in the input images. It obtains the target shape by continuously deforming the initial ellipsoid template; the bottom of FIG. 2 shows the intermediate results of this process. Only the deformation modules have parameters to train; the projection, Graph unpooling and Graph pooling modules have no trainable parameters. The template deformation part takes as input the ellipsoid structure <point, edge, face> and the fused image features extracted by the image feature extraction part, and outputs the reconstructed three-dimensional shape. By function it can be divided into four kinds of modules: deformation modules, graph pooling modules, graph unpooling modules and projection modules.
The deformation module updates vertex features and predicts the vertex coordinate positions <x, y, z> of the 3D shape; it consists of 14 graph convolution layers. The input data dimension of a deformation module is N × 963, where N is the number of ellipsoid vertices, and the output dimension is N × 3, i.e. the coordinates <x, y, z> of N three-dimensional points. The template deformation part contains 7 deformation modules.
N takes different values for different deformation modules. In the invention N ∈ {156, 628, 2466}; the value of N for each deformation module is also labeled in FIG. 2, e.g. N is 156 for deformation module 1 and 628 for deformation module 2.
The graph unpooling module structurally increases the number of vertices of the ellipsoid template, or of its input, to enrich the surface details of the final predicted shape, as shown in FIG. 3. By generating new points on the edges and establishing new connections between adjacent new points, the numbers of vertices and faces of the ellipsoid are structurally increased.
The graph pooling module structurally prunes the number of vertices of the ellipsoid template, or of its input, to remove noise generated by the prediction process while preserving the overall effect of the predicted shape, as shown in FIG. 3. By performing the inverse operation of the graph unpooling module, the numbers of vertices and faces of the ellipsoid template are structurally reduced, and the vertex features are updated to the local mean, maximum, minimum or original values.
The projection module projects the points of the 3D shape to the corresponding 2D coordinates in the image feature maps and extracts the pixel features of these points. First a spatial transformation is applied to the point coordinates using the camera parameters, then the vertices of the deformation template (i.e. the ellipsoid) are projected onto the fused pixel feature map with a projection formula, and the pixel features corresponding to each vertex are extracted from the fused pixel feature map by bilinear interpolation; the projection operation is repeated for the fused pixel feature maps generated from the different images. As shown in FIG. 2, the first 3 projection modules use fused pixel feature map 1, the 4th and 5th projection modules use fused pixel feature map 2, and the 6th and 7th projection modules use fused pixel feature map 3.
As shown in FIG. 2, the template deformation network includes 7 deformation modules, 5 graph unpooling modules, 2 graph pooling modules and 7 projection modules. The ellipsoid template is input through a projection module, and the final predicted shape (2466 vertices and 4928 faces) is obtained by passing sequentially through deformation module 1, a projection module, graph unpooling module 1, deformation module 2, a projection module, graph unpooling module 2, deformation module 3, a projection module, graph pooling module 1, deformation module 4, a projection module, graph unpooling module 3, deformation module 5, a projection module, graph pooling module 2, deformation module 6, a projection module, graph unpooling module 4 and deformation module 7.
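The module ordering above can be summarized as the following schematic forward pass; the module objects are placeholders passed in as arguments, and the vertex counts in the comments repeat the figures quoted in the text.

```python
def deform_template(ellipsoid, fmap, cams, deform, unpool, pool, project):
    """Schematic ordering of the template deformation stage (compare FIG. 2).
    `deform`, `unpool`, `pool` and `project` are placeholder module lists/functions;
    `fmap[0..2]` are the fused pixel feature maps of graphs A, B and C."""
    x = deform[0](project(ellipsoid, fmap[0], cams))        # 156 vertices
    x = deform[1](unpool[0](project(x, fmap[0], cams)))     # -> 628 vertices
    x = deform[2](unpool[1](project(x, fmap[0], cams)))     # -> 2466 vertices
    x = deform[3](pool[0](project(x, fmap[1], cams)))       # -> 628 vertices
    x = deform[4](unpool[2](project(x, fmap[1], cams)))     # -> 2466 vertices
    x = deform[5](pool[1](project(x, fmap[2], cams)))       # -> 628 vertices
    x = deform[6](unpool[3](project(x, fmap[2], cams)))     # -> 2466 vertices, 4928 faces
    return x
```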
3. Loss function design
The model loss function consists of 5 parts: CD (Chamfer distance) loss, Normal loss, Laplacian loss, Edge length loss and Smooth loss, with weight parameters set to 1, 1.6e-4, 0.3, 0.1 and 1.6e-5 respectively.
The CD loss constrains the vertices by comparing the vertex distance error between the predicted mesh and the ground-truth mesh, where p is the 3D coordinate of a vertex in the predicted mesh and q is the 3D coordinate of a vertex in the ground-truth mesh; S1 and S2 are the vertex sets of the predicted mesh model and the ground-truth mesh model, respectively.
L_{cd} = \sum_{p \in S_1} \min_{q \in S_2} \| p - q \|_2^2 + \sum_{q \in S_2} \min_{p \in S_1} \| p - q \|_2^2
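A direct TensorFlow evaluation of this CD term (dense O(N·M) pairwise distances, which is fine at these problem sizes); toy tensors stand in for the real predicted vertices and sampled ground-truth points.

```python
import tensorflow as tf

def chamfer_distance(pred, gt):
    """Squared distance from each predicted vertex to its nearest ground-truth
    point, plus the symmetric term, as in the CD formula above."""
    diff = pred[:, None, :] - gt[None, :, :]           # (N, M, 3)
    dist2 = tf.reduce_sum(tf.square(diff), axis=-1)    # (N, M) squared distances
    return (tf.reduce_sum(tf.reduce_min(dist2, axis=1)) +
            tf.reduce_sum(tf.reduce_min(dist2, axis=0)))

# Toy sizes; the real model compares up to 2466 predicted vertices with the
# 16834 sampled ground-truth points.
loss = chamfer_distance(tf.random.normal((628, 3)), tf.random.normal((1024, 3)))
```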
The Normal loss constrains the surfaces by calculating the normal vector error between the predicted mesh and the ground-truth mesh, where p is a vertex in the predicted mesh, k is a neighbor node of p, q is a vertex in the ground-truth mesh, and n_q is the normal vector corresponding to point q.
L_{normal} = \sum_{p} \sum_{k \in N(p)} \left\| \left\langle p - k,\ n_q \right\rangle \right\|_2^2, \quad q = \arg\min_{q \in S_2} \| p - q \|_2^2
The Laplacian loss controls the smoothness of the model by comparing the similarity between vertices within a local range of the predicted mesh; this is consistent with the goal of the Smooth loss proposed later. Here N(p) is the set of neighbor points of vertex p, δ_p is the Laplacian coordinate of vertex p obtained from the 3D coordinates of p and its neighbor nodes k, and δ'_p is the updated Laplacian coordinate of vertex p.
\delta_p = p - \frac{1}{|N(p)|} \sum_{k \in N(p)} k
L_{lap} = \sum_{p} \| \delta'_p - \delta_p \|_2^2
The Edge length loss prevents the generation of excessively long edges by accumulating the sum of squared edge lengths, where p and k are vertices connected to each other.
L_{edge} = \sum_{p} \sum_{k \in N(p)} \| p - k \|_2^2
The Smooth loss constrains the normal vectors of the predicted mesh shape to achieve better smoothness by computing the error of the normal vectors between adjacent faces. FIG. 4 shows the calculation principle of the Smooth loss: f(a, b, c) is the normal vector of face A1, where f is a normal-vector calculation function and a, b, c, d are vertices on the faces; |<f(a, b, c), (a - d)>| is the absolute value of the inner product between the normal vector of face A1 and the vector a - d of face A2, and is used to measure the error. In addition, to avoid a negative impact on faces such as A3 and B1 in FIG. 4, a threshold α is set: when the error exceeds α the term is set to 0, and the parameter α is set to 0.00436.
L_{smooth} = \sum \left( |\langle f(a,b,c), (a-d) \rangle| \le \alpha \ ?\ |\langle f(a,b,c), (a-d) \rangle| : 0 \right)
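A NumPy sketch of this Smooth loss follows; treating f as the unit face normal and iterating over shared edges are assumptions consistent with, but not spelled out by, the description.

```python
import numpy as np

def smooth_loss(verts, faces, alpha=0.00436):
    """Accumulate |<f(a,b,c), a - d>| over pairs of faces sharing an edge,
    dropping terms whose error exceeds alpha, as described above."""
    edge_faces = {}
    for fi, (i, j, k) in enumerate(faces):
        for e in ((i, j), (j, k), (k, i)):
            edge_faces.setdefault(tuple(sorted(e)), []).append(fi)
    total = 0.0
    for (i, j), adj in edge_faces.items():
        if len(adj) != 2:
            continue                                  # boundary edge, no neighbor face
        tri1, tri2 = faces[adj[0]], faces[adj[1]]
        a, b, c = verts[tri1[0]], verts[tri1[1]], verts[tri1[2]]
        n = np.cross(b - a, c - a)
        n = n / (np.linalg.norm(n) + 1e-12)           # f(a, b, c): unit normal of face 1
        d = verts[[v for v in tri2 if v not in (i, j)][0]]  # vertex of face 2 off the shared edge
        err = abs(np.dot(n, a - d))
        total += err if err <= alpha else 0.0
    return total

# Two nearly coplanar triangles sharing an edge.
v = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0.001]])
f = np.array([[0, 1, 2], [1, 3, 2]])
print(smooth_loss(v, f))
```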
4. Training network parameters of the three-dimensional reconstruction model:
The model adopts a joint training strategy: the feature extraction part and the template deformation part are trained simultaneously, where the feature extraction part trains the CNN and ConvLSTM parameters and the template deformation part trains the G-ResNet network parameters in the deformation modules. CD (Chamfer distance) loss, Normal loss, Laplacian loss and Edge length loss are calculated in all deformation modules; when the number of generated 3D shape vertices reaches 2466, the Smooth loss is calculated in addition, i.e. in deformation modules 3, 5 and 7 (see FIG. 2). The training data are propagated forward to compute the loss, and the loss is back-propagated to update the model parameters. The model is trained for 30 iterations at a learning rate of 3e-5 and then for another 20 iterations at 1e-5 to obtain the trained neural network model, and the most recently updated model parameters are saved. After training is complete, the model parameters are saved to a file.
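A sketch of one joint training step; `model`, its returned mesh dictionaries and the `smooth_term` helper are assumed interfaces, and `chamfer_distance` refers to the earlier sketch.

```python
import tensorflow as tf

# Loss weights from the loss-function design above.
W = {"cd": 1.0, "normal": 1.6e-4, "lap": 0.3, "edge": 0.1, "smooth": 1.6e-5}

def train_step(model, optimizer, images, cameras, gt_points):
    """One joint update of the CNN + ConvLSTM feature extractor and the G-ResNet
    deformation modules. `model` is assumed to return one predicted mesh per
    deformation module as a dict with 'verts' and 'faces'."""
    with tf.GradientTape() as tape:
        meshes = model([images, cameras], training=True)
        loss = 0.0
        for m in meshes:
            loss += W["cd"] * chamfer_distance(m["verts"], gt_points)
            # + W["normal"] * ... + W["lap"] * ... + W["edge"] * ...  (remaining terms)
            if m["verts"].shape[0] == 2466:            # deformation modules 3, 5 and 7
                loss += W["smooth"] * smooth_term(m["verts"], m["faces"])  # assumed helper
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```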
5. Evaluation of the model of the invention on a data set
We evaluated the proposed three-dimensional reconstruction model on the ShapeNet dataset. The evaluation metrics are CD and F-score; the actual reconstruction results of the model are also evaluated visually, and the test-set split follows the dataset split strategy of Choy's work. We compared the three-dimensional reconstruction model according to the embodiment of the application with the previous works PSGN, P2M, P2M++, MVP2M and OccNet. Compared with the unimproved reconstruction model P2M, the improved model reduces the CD loss by 48% and improves the F-score accuracy by 11.96%, and in some respects it even outperforms the state-of-the-art OccNet model. We also compared the actual performance of these models on the reconstructed shapes, as in FIG. 5; our method performs better than the compared methods in most cases. In FIG. 5, the reconstruction results of PSG, P2M, MVP2M, P2M++, OccNet, Ours (the invention) and the ground-truth shape GT are shown from left to right.
The work and results of PSG, P2M, P2M + +, MVP2M, and OccNet can be obtained from the following links:
PSG:https://openaccess.thecvf.com/content_cvpr_2017/papers/Fan_A_Point_Set_CVPR_2017_paper.pdf
P2M:https://openaccess.thecvf.com/content_ECCV_2018/papers/Nanyang_Wang_Pixel2Mesh_Generating_3D_ECCV_2018_paper.pdf
MVP2M、P2M++:https://openaccess.thecvf.com/content_ICCV_2019/papers/Wen_Pixel2Mesh_Multi-View_3D_Mesh_Generation_via_Deformation_ICCV_2019_paper.pdf
OccNet:https://openaccess.thecvf.com/content_CVPR_2019/papers/Mescheder_Occupancy_Networks_Learning_3D_Reconstruction_in_Function_Space_CVPR_2019_paper.pdf
FIG. 10 is a data flow framework diagram of the three-dimensional reconstruction model during processing according to an embodiment of the present application; the numbers in the boxes indicate the size/dimension of the main data before the next operation is performed.
The processing of the three-dimensional reconstruction model is divided into two main parts: the image feature extraction part and the template deformation part.
In the image feature extraction part, graphs A, B and C are each input into the convolutional layers of the CNN to extract features, and a feature pyramid consisting of feature maps of [56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256, 7 × 7 × 512] is constructed for each graph. Feature fusion is then performed with ConvLSTM using the feature pyramids of graphs A, B and C, giving a fused pixel feature map for each of graphs A, B and C (in FIG. 10 these are referred to as the fused feature of graph A, the fused feature of graph B and the fused feature of graph C).
In the template deformation part, the input is the 156 vertex coordinates <x, y, z> (i.e. 156 × 3) of the predefined deformation template (the initial ellipsoid). Features are then extracted for each vertex from the fused features of graph A by the projection operation of projection module 1 (see also FIG. 2), giving 156 × 963 vertex features. This vertex feature vector is input into deformation module 1 for coordinate updating, giving the deformation result of this stage, i.e. 156 new coordinates <x, y, z>. After a deformation module, a Graph unpooling module and/or a Graph pooling module is used to change the number of vertices. Referring to FIG. 2 and FIG. 10, projection module 1 extracts features from the initial ellipsoid template and the fused features of graph A; projection module 2 from deformation module 1 and the fused features of graph A; projection module 3 from deformation module 2 and the fused features of graph A; projection module 4 from deformation module 3 and the fused features of graph B; projection module 5 from deformation module 4 and the fused features of graph B; projection module 6 from deformation module 5 and the fused features of graph C; and projection module 7 from deformation module 6 and the fused features of graph C.
Taking Graph unpooling module 1 as an example, its input is 156 × 963 vertex features and its output is 628 × 963 vertex features, so 156 vertices are expanded into 628. Taking Graph pooling module 1 as an example, its input is 2466 × 963 vertex features and its output is 628 × 963 vertex features, so 2466 vertices are reduced to 628. The combined operations of the deformation, projection, Graph unpooling and Graph pooling modules finally produce a 2466 × 3 deformation result.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (7)

1. A multi-view based object 3D shape reconstruction method, comprising the steps of:
step one, data preparation
Acquiring one or more 3D mesh models from the ShapeNet dataset, sampling points on the surface of each 3D mesh model according to a set threshold, and constructing 3D point cloud data corresponding to the acquired 3D mesh models, wherein the acquired 3D mesh models further have corresponding sample images;
acquiring an ellipsoid grid model as a deformation template, wherein the ellipsoid grid model comprises an ellipsoid, and the ellipsoid comprises 156 vertexes and 308 triangular surfaces; the parameters of the ellipsoid include that the ellipsoid is placed at a position 0.8m away from the right front of the camera, and the three-axis radius of the ellipsoid is 0.2m, 0.2m and 0.4m respectively by taking the position as the center of a circle;
step two, performing combined training on the feature extraction network and the template deformation network
Sending the 3D point cloud data obtained in step one, the sample images corresponding to the 3D point cloud data and the camera parameters into a 3D shape reconstruction network for training, wherein the 3D point cloud data are used to calculate the loss function;
the 3D shape reconstruction network includes: the device comprises a feature extraction module and a template deformation module; the feature extraction module comprises 18 convolution layers and convLSTM layers;
in the second step, the sample image is input into 18 convolutional layers of the feature extraction module, and feature output is extracted from the 8 th, 11 th, 14 th and 18 th convolutional layers; the ConvLSTM layer fuses the feature outputs extracted from the 8 th, 11 th, 14 th and 18 th convolutional layers into a fused pixel feature map Img;
the template deformation module comprises: 7 deformation modules, 5 graph unpooling modules, 2 graph popping modules, and 7 projection modules;
in the second step, the first projection module fuses the pixel feature map Img according to the deformation template, the camera parameters and the first sample image output by the feature extraction module1Obtaining the apex feature P of 156 h 963 dimension1The first deformation module utilizes the vertex feature P1Generating the first 3D shape M with 156 vertices1(ii) a The second projection module is according to the 3D shape M1The camera parameters and the fused pixel feature map Img of the first sample image1Projecting again to obtain a new vertex feature P with 156 x 963 dimensions2Then using Graph unpooling module to locate the vertex feature P2Increasing the number of vertexes to obtain the vertex characteristic P of 628 multiplied by 9633The second morphing module utilizes the vertex feature P3Generating a second 3D shape M with 628 vertices2(ii) a The third projection module is according to the 3D shape M2The camera parameters and the fused pixel feature map Img of the first sample image1Projecting again to obtain new 628 ANG 963DVertex feature P of degree4Then using Graph unpooling module to locate the vertex feature P4Increasing the number of vertices to obtain the vertex feature P of 2466H 9635The third morphing module uses the feature to generate a third 3D shape M having 2466 vertices3(ii) a The fourth projection module is according to the 3D shape M3The camera parameters and the fused pixel feature map Img of the second sample image output by the feature extraction module2Projecting again to obtain new apex feature P of 2466 x 963 dimension6Then using Graph posing module to locate the vertex feature P6The upper reduction of the number of vertexes results in the vertex feature P of 628 Bi 9637The fourth deformation module utilizes the vertex feature P7Generate a fourth 3D shape M with 628 vertices4(ii) a The fifth projection module is according to the 3D shape M4The camera parameters and the fused pixel feature map Img of the second sample image2Projecting again to obtain new vertex feature P of 628 x 963 dimension8Then using Graph unpooling module to locate the vertex feature P8Increasing the number of vertices to obtain the vertex feature P of 2466H 9639The fifth deformation module utilizes the vertex feature P9Generating a fifth 3D shape M with 2466 vertices5(ii) a The sixth projection module is according to the 3D shape M5The camera parameters and a fused pixel feature map Img of a third sample image output by the feature extraction module3Projecting again to obtain new apex feature P of 2466 x 963 dimension10Then using Graph posing module to locate the vertex feature P10The upper reduction of the number of vertexes results in the vertex feature P of 628 Bi 96311The sixth deformation module utilizes the vertex feature P11Generate a sixth 3D shape M with 628 vertices6(ii) a The seventh projection module is according to the 3D shape M6The camera parameters and the fused pixel feature map Img of the third sample image3Projecting again to obtain new vertex feature P of 628 x 963 dimension12Then using Graph unpooling module to locate the vertex feature P12Increasing the number of vertices to obtain the vertex feature P of 2466H 96313The seventh morphing module uses the vertex featureSign P13Generating a seventh 3D shape M with 2466 vertices7And combining the 3D shape M7Output as a final result;
the second step further comprises: respectively calculating CD loss, Normal loss, Laplacian loss and edgelength loss with the 3D point cloud data acquired in the step one by using the 3D shapes output by the first, second, third, fourth, fifth, sixth and seventh deformation modules, and using the CD loss, the Normal loss, the Laplacian loss and the edgelength loss to supervise the training of the 3D shape reconstruction network; and further calculating Smooth loss of the 3D shape output by the third, fifth and seventh deformation modules for restraining the surface smoothness of the 2466 vertex 3D shapes generated by the Smooth loss calculation module;
the output of the seventh deformation module is used as the 3D shape of the object output by the 3D shape reconstruction network;
and thirdly, reconstructing the 3D shape of the object by using the trained 3D shape reconstruction network.
2. The method of claim 1, wherein
The sample images corresponding to the acquired 3D mesh models comprise a first sample image, a second sample image and a third sample image.
3. The method of claim 2, wherein
The feature extraction module is used for extracting features of each input sample image and fusing the extracted features to obtain each fused pixel feature map required by the projection module;
the deformation module is used to update the vertex features and predict the vertex coordinate position < x, y, z > of the 3D shape to generate an intermediate 3D shape or a final 3D shape;
the Graph unpooling module is used for structurally increasing the number of vertexes so as to enrich the surface details of the final predicted shape;
the Graph Pooling module is used for structurally deleting the number of vertexes so as to remove noise generated in the prediction process on the premise of keeping the overall effect of the prediction shape;
the projection module is used for projecting the three-dimensional points in the 3D shape into corresponding 2D coordinates in the fusion pixel feature map so as to extract the pixel features of the three-dimensional points.
4. The method of claim 3, wherein
The feature extraction module comprises 18 convolutional layers and 4 ConvLSTM layers;
Wherein the kernel size of the 1st and 2nd convolutional layers is 3 × 16, that of the 3rd to 5th layers is 3 × 32, that of the 6th to 8th layers is 3 × 64, that of the 9th to 11th layers is 3 × 128, that of the 12th to 14th layers is 3 × 256, and that of the 15th to 18th layers is 3 × 512, and each convolutional layer is followed by a ReLU layer; the kernel sizes of the 4 ConvLSTM layers are 3 × 64, 3 × 128, 3 × 256 and 3 × 512 respectively, each followed by a ReLU layer;
the template deformation module comprises 7 deformation modules, wherein the 1 st deformation module processes graph structure data with 156 vertexes, and the graph structure is an initial ellipsoid template structure; the 2 nd, 4 th and 6 th deformation modules respectively process the graph structure data with 628 vertices; the 3 rd, 5 th and 7 th deformation modules process the graph structure data with 2466 vertexes;
each deformation module consists of 14 layers of graph convolution layers, the input data dimension of the 1 st layer of graph convolution layer is N multiplied by 963, and the output data dimension is N multiplied by 256; the input and output data dimensions of the 2 nd to 13 th layer of graph convolution layers are both N multiplied by 256, the input data dimension of the 14 th layer of graph convolution layer is N multiplied by 256, the output data dimension is N multiplied by 3, and N is the number of vertexes;
the projection module projects the three-dimensional point < x, y, z > to the fusion pixel characteristic graph plane by using a projection formula, and extracts a pixel value by using bilinear interpolation;
the graph unpoliving module structurally increases the number of vertexes and surfaces of an ellipsoid by generating new points on the edge and establishing a new connection relation between adjacent new points;
the graph posing module structurally deletes the vertex and the surface number of the ellipsoid template by executing the inverse operation of the graph posing module, and updates the vertex characteristics to the local average value, the maximum value, the minimum value or the original value.
5. The method of claim 4, wherein
The 18 convolutional layers of the feature extraction module adopt the VGG-16 architecture; the 3 input sample images each have size 224 × 224 × 3, features are extracted through the 18 convolutional layers, and the feature maps extracted at the 8th, 11th, 14th and 18th layers are output, giving feature maps of size 56 × 56 × 64, 28 × 28 × 128, 14 × 14 × 256 and 7 × 7 × 512 for each picture;
the deformation module adopts a graph residual error neural network architecture, is composed of 14 layers of graph convolution layers, establishes jump connection according to the sequence of 1, 3, 5, 7, 9, 11 and 13, and adds the output of the current layer and the output of the previous layer to obtain the final output result of the layer.
6. The method of claim 5, wherein
The three-dimensional reconstruction model extracts image features with the feature extraction module and generates the 3D shape with the template deformation module; the template deformation module comprises 7 deformation modules, 5 graph unpooling modules, 2 graph pooling modules and 7 projection modules;
an ellipsoid serving as a deformation template is input from a first projection module, and sequentially passes through a first deformation module, a second projection module, a first graph unpooling module, a second deformation module, a third projection module, a second graph unpooling module, a third deformation module, a fourth projection module, a first graph popping module, a fourth deformation module, a fifth projection module, a third graph unpooling module, a fifth deformation module, a sixth projection module, a second graph popping module, a sixth deformation module, a seventh projection module, a fourth graph unpooling module and a seventh deformation module to obtain a final predicted shape, wherein the final predicted shape comprises 2466 vertexes and 4928 surfaces;
calculating CD loss, Normal loss, Laplacian loss and edgelength loss by the 3D shape M output by the first, second, third, fourth, fifth, sixth and seventh deformation modules and the 3D point cloud data acquired in the step one, and using the CD loss, the Normal loss, the Laplacian loss and the edgelength loss to supervise the training of the 3D shape reconstruction network; and the output of the third, fifth, and seventh morphing modules also calculates the Smooth loss for constraining the surface smoothness of the 2466 vertex 3D shapes it generates;
the output of the seventh deformation module is the object 3D shape output by the 3D shape reconstruction network.
7. The method of claim 5, wherein
The loss function is:
The CD loss Lcd constrains the vertices by comparing the vertex distance error between the predicted mesh and the ground-truth mesh, wherein p is the 3D coordinate of a vertex in the predicted mesh and q is the 3D coordinate of a vertex in the ground-truth mesh; S1 and S2 are the vertex set of the predicted mesh and the vertex set of the ground-truth mesh model, respectively;
L_{cd} = \sum_{p \in S_1} \min_{q \in S_2} \| p - q \|_2^2 + \sum_{q \in S_2} \min_{p \in S_1} \| p - q \|_2^2
The Normal loss Lnormal constrains the surfaces by calculating the normal vector error between the predicted mesh and the ground-truth mesh, wherein p is a vertex in the predicted mesh, k is a neighbor node of p, q is a vertex in the ground-truth mesh, and n_q is the normal vector corresponding to point q,
L_{normal} = \sum_{p} \sum_{k \in N(p)} \left\| \left\langle p - k,\ n_q \right\rangle \right\|_2^2, \quad q = \arg\min_{q \in S_2} \| p - q \|_2^2
The Laplacian loss controls the smoothness of the model by comparing the similarity between vertices within a local range of the predicted mesh, wherein N(p) is the set of neighbor points of vertex p, δ_p is the Laplacian coordinate of vertex p obtained from the 3D coordinates of p and its neighbor nodes k, and δ'_p is the updated Laplacian coordinate of vertex p,
\delta_p = p - \frac{1}{|N(p)|} \sum_{k \in N(p)} k
L_{lap} = \sum_{p} \| \delta'_p - \delta_p \|_2^2
The Edge length loss L_{edgelength} prevents the generation of excessively long edges by accumulating the sum of squared edge lengths, wherein p and k are vertices connected to each other,
L_{edgelength} = \sum_{p} \sum_{k \in N(p)} \| p - k \|_2^2
The Smooth loss Lsmooth constrains the normal vectors of the predicted mesh shape by calculating the error of the normal vectors between adjacent faces, wherein f(a, b, c) is the normal vector of face A1, f is a normal-vector calculation function, a, b, c, d are vertices on the faces, and |<f(a, b, c), (a - d)>| is the absolute value of the inner product between the normal vector of face A1 and the vector a - d of face A2; when the error exceeds the threshold α the term is set to 0, with the parameter α set to 0.00436;
L_{smooth} = \sum \left( |\langle f(a,b,c), (a-d) \rangle| \le \alpha \ ?\ |\langle f(a,b,c), (a-d) \rangle| : 0 \right).
CN202110065500.3A 2021-01-18 2021-01-18 Object 3D shape reconstruction method based on multiple views Active CN112862949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110065500.3A CN112862949B (en) 2021-01-18 2021-01-18 Object 3D shape reconstruction method based on multiple views

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110065500.3A CN112862949B (en) 2021-01-18 2021-01-18 Object 3D shape reconstruction method based on multiple views

Publications (2)

Publication Number Publication Date
CN112862949A true CN112862949A (en) 2021-05-28
CN112862949B CN112862949B (en) 2022-08-19

Family

ID=76006824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110065500.3A Active CN112862949B (en) 2021-01-18 2021-01-18 Object 3D shape reconstruction method based on multiple views

Country Status (1)

Country Link
CN (1) CN112862949B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591369A (en) * 2021-06-30 2021-11-02 国网福建省电力有限公司信息通信分公司 Single-frame-view three-dimensional model point cloud reconstruction method based on prior constraint and storage device
CN113610711A (en) * 2021-08-02 2021-11-05 南京信息工程大学 Single-image-guided three-dimensional surface reconstruction method and device
CN113808275A (en) * 2021-09-24 2021-12-17 南京信息工程大学 Single-image three-dimensional reconstruction method based on GCN and topology modification

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110126A (en) * 2007-06-19 2008-01-23 北京大学 Method for re-establishing three-dimensional model gridding
CN108399649A (en) * 2018-03-05 2018-08-14 中科视拓(北京)科技有限公司 A kind of single picture three-dimensional facial reconstruction method based on cascade Recurrent networks
CN110021069A (en) * 2019-04-15 2019-07-16 武汉大学 A kind of method for reconstructing three-dimensional model based on grid deformation
CN110378947A (en) * 2019-07-02 2019-10-25 北京字节跳动网络技术有限公司 3D model reconstruction method, device and electronic equipment
US10796476B1 (en) * 2018-08-31 2020-10-06 Amazon Technologies, Inc. Self-supervised bootstrap for single image 3-D reconstruction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110126A (en) * 2007-06-19 2008-01-23 北京大学 Method for re-establishing three-dimensional model gridding
CN108399649A (en) * 2018-03-05 2018-08-14 中科视拓(北京)科技有限公司 A kind of single picture three-dimensional facial reconstruction method based on cascade Recurrent networks
US10796476B1 (en) * 2018-08-31 2020-10-06 Amazon Technologies, Inc. Self-supervised bootstrap for single image 3-D reconstruction
CN110021069A (en) * 2019-04-15 2019-07-16 武汉大学 A kind of method for reconstructing three-dimensional model based on grid deformation
CN110378947A (en) * 2019-07-02 2019-10-25 北京字节跳动网络技术有限公司 3D model reconstruction method, device and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591369A (en) * 2021-06-30 2021-11-02 国网福建省电力有限公司信息通信分公司 Single-frame-view three-dimensional model point cloud reconstruction method based on prior constraint and storage device
CN113591369B (en) * 2021-06-30 2023-06-09 国网福建省电力有限公司信息通信分公司 Single-frame view three-dimensional model point cloud reconstruction method and storage device based on priori constraint
CN113610711A (en) * 2021-08-02 2021-11-05 南京信息工程大学 Single-image-guided three-dimensional surface reconstruction method and device
CN113610711B (en) * 2021-08-02 2023-05-23 南京信息工程大学 Single-image-guided three-dimensional surface reconstruction method and device
CN113808275A (en) * 2021-09-24 2021-12-17 南京信息工程大学 Single-image three-dimensional reconstruction method based on GCN and topology modification
CN113808275B (en) * 2021-09-24 2023-10-13 南京信息工程大学 Single image three-dimensional reconstruction method based on GCN and topology modification

Also Published As

Publication number Publication date
CN112862949B (en) 2022-08-19

Similar Documents

Publication Publication Date Title
Mescheder et al. Occupancy networks: Learning 3d reconstruction in function space
CN109410307B (en) Scene point cloud semantic segmentation method
Smith et al. Geometrics: Exploiting geometric structure for graph-encoded objects
CN112862949B (en) Object 3D shape reconstruction method based on multiple views
Deng et al. Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors
Kar et al. Learning a multi-view stereo machine
CN108038906B (en) Three-dimensional quadrilateral mesh model reconstruction method based on image
Wang et al. Neuris: Neural reconstruction of indoor scenes using normal priors
WO2022100379A1 (en) Object attitude estimation method and system based on image and three-dimensional model, and medium
CN108804094A (en) Learn autocoder
Venkatesh et al. Deep implicit surface point prediction networks
Gurumurthy et al. High fidelity semantic shape completion for point clouds using latent optimization
CN105930382A (en) Method for searching for 3D model with 2D pictures
CN113345082B (en) Characteristic pyramid multi-view three-dimensional reconstruction method and system
CN113077554A (en) Three-dimensional structured model reconstruction method based on any visual angle picture
Song et al. Deep novel view synthesis from colored 3d point clouds
Li et al. Multi-attribute regression network for face reconstruction
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
Guo et al. Line-based 3d building abstraction and polygonal surface reconstruction from images
Huang et al. A bayesian approach to multi-view 4d modeling
Cheng et al. GaussianPro: 3D Gaussian Splatting with Progressive Propagation
CN113593001A (en) Target object three-dimensional reconstruction method and device, computer equipment and storage medium
CN116758219A (en) Region-aware multi-view stereo matching three-dimensional reconstruction method based on neural network
CN110544309A (en) Real-time sparse editing method and system based on large-scale grid model representation
CN113808006B (en) Method and device for reconstructing three-dimensional grid model based on two-dimensional image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant