CN114140601A - Three-dimensional grid reconstruction method and system based on single image under deep learning framework - Google Patents
- Publication number: CN114140601A (application CN202111522113.4A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
- G06T17/205—Re-meshing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/001—Model-based coding, e.g. wire frame
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
Abstract
The invention discloses a three-dimensional mesh reconstruction method and system based on a single image under a deep learning framework. The invention uses a VGG16 network as the image feature extraction network to obtain feature maps at three scales, and completes the full reconstruction task with a mesh deformation network that operates on the extracted features and an initial ellipsoid mesh model; the mesh deformation network comprises graph convolution modules, which fuse local feature information, and Transformer coding modules, which supplement global feature information. The method effectively improves the accuracy of single-image three-dimensional mesh reconstruction.
Description
Technical Field
The invention belongs to the field of deep learning and image three-dimensional mesh reconstruction, and particularly relates to a three-dimensional mesh reconstruction method and system based on a single image under a deep learning framework.
Background
Three-dimensional modeling is an important computer-aided design technology and plays a fundamental role in industries such as building engineering, industrial design and film animation. Nowadays, as industrialization accelerates in various fields, the demand for efficient three-dimensional modeling keeps growing. Compared with traditional manual modeling, three-dimensional reconstruction from a single image offers high reconstruction efficiency at low cost, so accurate three-dimensional model reconstruction from a single image is of increasing interest to researchers. Meanwhile, the continued development of deep learning and its great success in computer vision in recent years have made research combining deep learning with single-image three-dimensional reconstruction a hotspot.
Three-dimensional models reconstructed from images can be represented in several structural forms, including voxels, point clouds and meshes. Early image-based three-dimensional reconstruction research was mainly carried out on voxel and point cloud representations, but voxel-based reconstructed models have limited geometric representation capability and suffer from inherent quantization errors, while point-cloud-based reconstructed models lack a complete surface representation. A mesh is an irregular model structure comprising vertices, edges and faces; it is lightweight and rich in shape detail. Reconstruction based on the mesh representation additionally provides the local neighborhood relations of the model and a continuous representation of the object surface, and the vertex coordinates allow more accurate error calculation. Model reconstruction based on mesh representations is therefore becoming a research hotspot.
However, the vertices of a mesh structure are unordered and the number of vertices adjacent to each point on the topological surface is inconsistent, so convolutional neural networks cannot directly process such non-Euclidean data, which makes mesh-based reconstruction challenging.
Disclosure of Invention
The invention aims to provide a three-dimensional mesh reconstruction method and system based on a single image under a deep learning framework.
A three-dimensional image reconstruction method based on deep learning specifically comprises the following steps:
acquiring a two-dimensional image and corresponding three-dimensional model data;
secondly, preprocessing the two-dimensional image and the three-dimensional model data;
step three, building and training a neural network;
3-1, building a complete reconstruction network
The complete reconstruction network comprises an image feature extraction network and a mesh deformation network; the image feature extraction network takes a single two-dimensional image as input and produces feature maps at three scales; the mesh deformation network takes as input an initial ellipsoid mesh model and the three scale feature maps output by the image feature extraction network, and performs mesh deformation on the initial ellipsoid mesh model to complete the reconstruction;
3-2 model training
Inputting the image data preprocessed in step two into the complete reconstruction network of step 3-1 for training; during training, the loss value between the reconstructed model output by the complete reconstruction network and the target model is calculated with a fusion loss function, the loss value is then back-propagated, and the network iteratively updates its parameters to continuously reduce the loss value, completing the training;
and step four, inputting a single two-dimensional image to be reconstructed according to the network model built and trained in the step three, and finally outputting a reconstructed three-dimensional model.
Another object of the present invention is to provide a deep learning-based three-dimensional image reconstruction system, which includes:
the data acquisition module is used for acquiring a two-dimensional image and a corresponding three-dimensional model;
the data preprocessing module is used for adjusting the format and the size of the two-dimensional image and the corresponding three-dimensional model;
and the complete reconstruction network module is used for training the complete reconstruction network by using the data processed by the data preprocessing module and then reconstructing a corresponding three-dimensional model of the two-dimensional image to be reconstructed by using the complete reconstruction network.
The invention has the following beneficial effects:
1. Oriented to the single-image three-dimensional mesh reconstruction task, the invention introduces graph convolution modules to effectively process non-Euclidean mesh model data and completes three-dimensional mesh reconstruction based on local mesh information.
2. Oriented to the single-image three-dimensional mesh reconstruction task, the invention cascades a Transformer coding module after each graph convolution module to supplement global information and improve reconstruction accuracy.
Drawings
Fig. 1 is a model structure diagram of a complete reconstruction network in the present invention.
Fig. 2 is a model configuration diagram of an image feature extraction network in the present invention.
Fig. 3 is a model structure diagram of a mesh deformation network in the present invention.
Detailed Description
The invention is further described with reference to the following specific embodiments.
A three-dimensional image reconstruction method based on deep learning comprises the following steps:
Step one, acquiring a two-dimensional image and corresponding three-dimensional model data. The data source is the ShapeNet public dataset.
And step two, preprocessing the two-dimensional image and three-dimensional model data. The preprocessing comprises scaling, cropping and normalizing the two-dimensional images, and pre-sampling the three-dimensional models with the Poisson disk algorithm.
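The two-dimensional half of this preprocessing can be sketched as follows; the target resolution (224 × 224), nearest-neighbour scaling and [0, 1] normalization are illustrative assumptions, since the exact parameters are not stated here:

```python
import numpy as np

def preprocess_image(img, size=224):
    """Scale an RGB image (H, W, 3) to size x size with nearest-neighbour
    sampling, then normalize pixel values to [0, 1].  Resolution and
    normalization scheme are assumptions, not taken from the patent."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size        # source row for each target row
    cols = np.arange(size) * w // size        # source column for each target column
    resized = img[rows][:, cols]              # nearest-neighbour resampling
    return resized.astype(np.float32) / 255.0
```

The three-dimensional half (Poisson disk pre-sampling of the models) is a standard mesh-processing step and is not sketched here.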
And step three, building and training a neural network.
3-1, building a complete reconstruction network
The complete reconstruction network shown in fig. 1 comprises an image feature extraction network and a mesh deformation network. The image feature extraction network takes a single two-dimensional image as input and produces feature maps at three scales. The mesh deformation network takes as input an initial ellipsoid mesh model and the three scale feature maps output by the image feature extraction network, and deforms the initial ellipsoid mesh to complete the reconstruction: it repeatedly samples vertex features of the input mesh from the three scale feature maps, deforms the mesh with several graph convolution modules and Transformer coding modules according to the sampling results, expands the number of mesh vertices with several vertex unpooling modules, and finally outputs the reconstructed three-dimensional model. The invention processes the mesh model with cascaded graph convolution and Transformer coding modules, which effectively combines local and global information to complete the reconstruction.
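As a rough illustration of what one graph convolution unit does on such non-Euclidean data, the sketch below uses the common mean-aggregation form; the patent does not specify its exact propagation rule, so the rule and dimensions here are assumptions:

```python
import numpy as np

def graph_conv(X, A, W):
    """One graph convolution unit (sketch): aggregate each vertex's
    neighbours plus itself by mean, then apply a shared linear map and a
    ReLU.  X is (N, d_in) vertex features, A the (N, N) adjacency matrix,
    W a (d_in, d_out) weight matrix."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    H = (A_hat @ X) / deg                      # mean over neighbourhood
    return np.maximum(H @ W, 0.0)              # ReLU
```

The key point is that the aggregation is defined by the mesh adjacency rather than a fixed image grid, which is what lets the module operate on meshes whose vertices have varying numbers of neighbours.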
3-2 model training
The image data preprocessed in step two is input into the complete reconstruction network of step 3-1 for training. During training, the loss value between the reconstructed model output by the complete reconstruction network and the target model is calculated with a fusion loss function; the loss value is then back-propagated, and the network iteratively updates its parameters to continuously reduce the loss value, completing the training.
And step four, inputting a single two-dimensional image to be reconstructed according to the network model built and trained in the step three, and finally outputting a reconstructed three-dimensional model.
In step 3-1, the image feature extraction network shown in fig. 2 consists of the 13 serially connected two-dimensional convolution units of the existing VGG16 network. The first two-dimensional convolution unit serves as the input of the feature extraction network and receives the preprocessed two-dimensional image data; the output of the seventh two-dimensional convolution unit provides the shallow feature map; the output of the tenth two-dimensional convolution unit provides the middle-layer feature map; the output of the thirteenth two-dimensional convolution unit provides the deep feature map.
the shallow feature map, the middle feature map and the deep feature output by the feature extraction network are superposed to form three scale feature maps which are used as the input of the mesh deformation network;
the mesh deformation network as shown in fig. 3 includes a first deformation unit, a second deformation unit, and a third deformation unit connected in series in sequence;
The first deformation unit comprises a first vertex feature sampling module, a first graph convolution module, a first vertex unpooling module and a first Transformer coding module; the first graph convolution module consists of the first to fifth graph convolution units connected in series. The first input of the first vertex feature sampling module serves as the first input of the mesh deformation network and receives the initial ellipsoid mesh model; its second input serves as the second input of the mesh deformation network and receives the three scale feature maps output by the feature extraction network; its first output is connected to the input of the first graph convolution module, and its second output is connected to the input of the first Transformer coding module. The outputs of the first graph convolution module and the first Transformer coding module are added and fed to the first vertex unpooling module, whose output serves as the output of the first deformation unit and is connected to the second input of the second deformation unit.
The second deformation unit comprises a second vertex feature sampling module, a second graph convolution module, a second vertex unpooling module and a second Transformer coding module; the second graph convolution module consists of the sixth to tenth graph convolution units connected in series. The first input of the second vertex feature sampling module serves as the third input of the mesh deformation network and receives the three scale feature maps output by the feature extraction network; its second input is connected to the output of the first deformation unit; its first output is connected to the input of the second graph convolution module, and its second output is connected to the input of the second Transformer coding module. The outputs of the second graph convolution module and the second Transformer coding module are added and fed to the second vertex unpooling module, whose output serves as the output of the second deformation unit and is connected to the second input of the third deformation unit.
The third deformation unit comprises a third vertex feature sampling module, a third graph convolution module and a third Transformer coding module; the third graph convolution module consists of the eleventh to fifteenth graph convolution units connected in series. The first input of the third vertex feature sampling module serves as the fourth input of the mesh deformation network and receives the three scale feature maps output by the feature extraction network; its second input is connected to the output of the second deformation unit; its first output is connected to the input of the third graph convolution module, and its second output is connected to the input of the third Transformer coding module. The outputs of the third graph convolution module and the third Transformer coding module are added and output as the reconstructed three-dimensional model of the mesh deformation network.
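The vertex unpooling step between deformation units can be illustrated with edge-midpoint subdivision, the scheme used by comparable mesh-deformation networks; the text does not detail the patent's own unpooling rule, so this is an assumption:

```python
import numpy as np

def vertex_unpool(V, edges):
    """Vertex unpooling sketch: add the midpoint of every edge as a new
    vertex, increasing the mesh resolution before the next deformation
    unit.  V is (N, 3) vertex coordinates, edges an (E, 2) index array.
    (Face re-triangulation is omitted for brevity.)"""
    midpoints = (V[edges[:, 0]] + V[edges[:, 1]]) / 2.0
    return np.vstack([V, midpoints])
```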
The first Transformer coding module comprises a first linear layer, a position coding layer, a multi-head attention layer, a first normalization layer, a forward propagation layer, a second normalization layer and a second linear layer connected in series in sequence; the second Transformer coding module comprises a third linear layer, a position coding layer, a multi-head attention layer, a third normalization layer, a forward propagation layer, a fourth normalization layer and a fourth linear layer connected in series in sequence; the third Transformer coding module comprises a fifth linear layer, a position coding layer, a multi-head attention layer, a fifth normalization layer, a forward propagation layer, a sixth normalization layer and a sixth linear layer connected in series in sequence.
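The multi-head attention layer at the heart of each Transformer coding module is what supplies global information: every vertex attends to every other vertex, regardless of mesh adjacency. A single-head numpy sketch (the head count and dimensions are assumptions, shown single-headed for brevity):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over vertex features X (N, d).
    Each row of the output is a weighted mix of ALL vertices' values,
    which gives the coding module its global receptive field."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # (N, N) similarities
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V
```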
In step 3-2, model training is performed with the fusion loss function of formula (6); the fusion loss is the weighted sum of the chamfer distance loss, the side length loss, the normal vector loss and the Laplace loss.
The chamfer distance loss is shown in formula (1):
$$\mathrm{loss}_{cd}=\frac{1}{|P|}\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert_2^2+\frac{1}{|Q|}\sum_{q\in Q}\min_{p\in P}\lVert q-p\rVert_2^2 \qquad (1)$$
wherein P and Q are the two compared point sets, here the vertex sets of the reconstructed three-dimensional model and the real three-dimensional model respectively; p and q denote individual vertices in the corresponding point set; |P| denotes the number of elements of point set P. For each point set, the squared Euclidean distance from every point to its nearest point in the other set is computed; the two resulting average nearest distances are then added.
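A minimal numpy sketch of formula (1) as described above, using brute-force pairwise distances:

```python
import numpy as np

def chamfer_loss(P, Q):
    """Symmetric chamfer distance between vertex sets P (n, 3) and Q (m, 3):
    mean squared distance from each point to its nearest neighbour in the
    other set, summed over both directions, matching formula (1)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # (n, m) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```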
The side length loss is shown in formula (2):
$$\mathrm{loss}_{edg}=\sum_{p}\sum_{k\in N(p)}\lVert p-k\rVert_2^2 \qquad (2)$$
where p is any vertex coordinate and k is any vertex coordinate in the set of adjacent vertices N(p) of vertex p; this loss suppresses overly long edges in the mesh model.
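Formula (2) can be sketched directly, with the mesh edges given as an index array (a representation chosen here for illustration):

```python
import numpy as np

def edge_loss(V, edges):
    """Side-length loss of formula (2): sum of squared lengths of all mesh
    edges.  V is (N, 3) vertex coordinates, edges an (E, 2) array of
    vertex-index pairs."""
    diff = V[edges[:, 0]] - V[edges[:, 1]]
    return (diff ** 2).sum()
```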
The normal vector loss is shown in formula (3):
$$\mathrm{loss}_{n}=\sum_{p}\sum_{k\in N(p)}\left|\langle p-k,\; n_q\rangle\right|^2 \qquad (3)$$
wherein p denotes a vertex of the reconstructed three-dimensional model and q denotes the vertex of the real model closest to p, i.e. q is regarded as the approximate point of p; n_q denotes the normal vector of vertex q, which stands in for the normal at p since p and q are approximate points; k denotes an adjacent vertex of p, so that the vector difference p − k is the required vertex edge vector; ⟨·,·⟩ denotes the inner product between vectors.
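Formula (3) can be sketched as follows; the nearest-vertex matching that produces q for each p is assumed to have been done already, so the matched ground-truth normals are passed in directly:

```python
import numpy as np

def normal_loss(V, edges, nearest_normals):
    """Normal vector loss of formula (3): for each edge (p, k) the edge
    vector p - k should be perpendicular to the ground-truth normal n_q
    of the vertex q matched to p.  nearest_normals[i] is the (unit)
    normal already matched to vertex i."""
    edge_vec = V[edges[:, 0]] - V[edges[:, 1]]   # p - k for each edge
    n = nearest_normals[edges[:, 0]]             # n_q matched to p
    inner = (edge_vec * n).sum(axis=1)           # <p - k, n_q>
    return (inner ** 2).sum()
```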
The Laplace loss is shown in formula (4):
$$\mathrm{loss}_{lap}=\sum_{p}\lVert l'_p-l_p\rVert_2^2 \qquad (4)$$
wherein l_p denotes the Laplacian coordinate of the current mesh vertex and l'_p denotes the Laplacian coordinate of the mesh vertex before deformation by each deformation unit; the Laplacian coordinate is computed as shown in formula (5):
$$l_p=p-\frac{1}{|N(p)|}\sum_{k\in N(p)}k \qquad (5)$$
wherein p is the vertex coordinate, l_p is the Laplacian coordinate of the vertex, N(p) denotes the set of vertices adjacent to vertex p, and k is the coordinate of an adjacent vertex in that set; the Laplacian coordinate reflects the relative position of each vertex within its local mesh neighborhood.
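Formulas (4) and (5) can be sketched together; the neighbor sets N(p) are passed as a dictionary, a representation chosen here for illustration:

```python
import numpy as np

def laplacian_coords(V, neighbors):
    """Laplacian coordinate of formula (5): each vertex minus the mean of
    its adjacent vertices.  neighbors maps a vertex index to the list of
    indices in N(p)."""
    L = np.empty_like(V)
    for i, nbrs in neighbors.items():
        L[i] = V[i] - V[list(nbrs)].mean(axis=0)
    return L

def laplace_loss(V_before, V_after, neighbors):
    """Laplace loss of formula (4): squared change of the Laplacian
    coordinates across one deformation unit, discouraging the deformation
    from destroying local surface detail."""
    d = laplacian_coords(V_after, neighbors) - laplacian_coords(V_before, neighbors)
    return (d ** 2).sum()
```

Note that a rigid translation of the whole mesh leaves every Laplacian coordinate unchanged, so it incurs no Laplace loss; only relative local displacements are penalized.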
The fusion loss combines the above four losses as follows:
$$\mathrm{loss}=\mathrm{loss}_{cd}+\alpha\,\mathrm{loss}_{edg}+\beta\,\mathrm{loss}_{n}+\gamma\,\mathrm{loss}_{lap} \qquad (6)$$
wherein α is the loss weight of loss_edg, β is the loss weight of loss_n, and γ is the loss weight of loss_lap; the invention takes α = 0.1, β = 0.3 and γ = 0.0001.
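Formula (6) with the stated weights is then a simple weighted sum:

```python
def fusion_loss(loss_cd, loss_edg, loss_n, loss_lap,
                alpha=0.1, beta=0.3, gamma=0.0001):
    """Fusion loss of formula (6), using the weights stated in the text:
    alpha = 0.1, beta = 0.3, gamma = 0.0001."""
    return loss_cd + alpha * loss_edg + beta * loss_n + gamma * loss_lap
```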
To verify the effectiveness of the invention, it is compared with several existing reconstruction methods such as 3D-R2N2, PSGN and Pixel2Mesh. In the experiment, 13 categories of three-dimensional models are selected from the ShapeNet dataset for training and testing, and the CD (chamfer distance) and F (F-score) values are used as evaluation indexes; a lower CD value and a higher F value both indicate a better reconstruction effect.
The experimental results are shown in tables 1 and 2, and the network model of the invention achieves excellent reconstruction effect compared with the existing model under two evaluation indexes.
Table 1 different network reconstruction results CD value comparison
TABLE 2 comparison of F-values for different network reconstruction results
Claims (9)
1. A three-dimensional image reconstruction method based on deep learning is characterized by comprising the following steps:
acquiring a two-dimensional image and corresponding three-dimensional model data;
secondly, preprocessing the two-dimensional image and the three-dimensional model data;
step three, building and training a neural network;
3-1, building a complete reconstruction network
The complete reconstruction network comprises an image feature extraction network and a mesh deformation network; the image feature extraction network takes a single two-dimensional image as input and produces feature maps at three scales; the mesh deformation network takes as input an initial ellipsoid mesh model and the three scale feature maps output by the image feature extraction network, and performs mesh deformation on the initial ellipsoid mesh model to complete the reconstruction;
3-2 model training
Inputting the image data preprocessed in step two into the complete reconstruction network of step 3-1 for training; during training, the loss value between the reconstructed model output by the complete reconstruction network and the target model is calculated with a fusion loss function, the loss value is then back-propagated, and the network iteratively updates its parameters to continuously reduce the loss value, completing the training;
and step four, inputting a single two-dimensional image to be reconstructed according to the network model built and trained in the step three, and finally outputting a reconstructed three-dimensional model.
2. The method as claimed in claim 1, wherein the preprocessing in step two comprises scaling, cropping and normalizing the two-dimensional image, and pre-sampling the three-dimensional model with the Poisson disk algorithm.
3. The deep learning-based image three-dimensional reconstruction method as claimed in claim 1, wherein the image feature extraction network comprises the 13 serially connected two-dimensional convolution units of a VGG16 network; the first two-dimensional convolution unit serves as the input of the feature extraction network and receives the preprocessed two-dimensional image data; the output of the seventh two-dimensional convolution unit provides the shallow feature map; the output of the tenth two-dimensional convolution unit provides the middle-layer feature map; and the output of the thirteenth two-dimensional convolution unit provides the deep feature map.
4. The method as claimed in claim 1 or 3, wherein the shallow feature map, the middle-layer feature map and the deep feature map output by the feature extraction network are superimposed to form the three scale feature maps used as the input of the mesh deformation network.
5. The method as claimed in claim 1, wherein the mesh deformation network includes a first deformation unit, a second deformation unit, and a third deformation unit connected in series in sequence;
the first deformation unit comprises a first vertex feature sampling module, a first graph convolution module, a first vertex unpooling module and a first Transformer coding module; the first input of the first vertex feature sampling module serves as the first input of the mesh deformation network and receives the initial ellipsoid mesh model; its second input serves as the second input of the mesh deformation network and receives the three scale feature maps output by the feature extraction network; its first output is connected to the input of the first graph convolution module and its second output to the input of the first Transformer coding module; the outputs of the first graph convolution module and the first Transformer coding module are added and fed to the first vertex unpooling module, whose output serves as the output of the first deformation unit and is connected to the second input of the second deformation unit;
the second deformation unit comprises a second vertex feature sampling module, a second graph convolution module, a second vertex unpooling module and a second Transformer coding module; the first input of the second vertex feature sampling module serves as the third input of the mesh deformation network and receives the three scale feature maps output by the feature extraction network; its second input is connected to the output of the first deformation unit; its first output is connected to the input of the second graph convolution module and its second output to the input of the second Transformer coding module; the outputs of the second graph convolution module and the second Transformer coding module are added and fed to the second vertex unpooling module, whose output serves as the output of the second deformation unit and is connected to the second input of the third deformation unit;
the third deformation unit comprises a third vertex feature sampling module, a third graph convolution module and a third Transformer coding module; the first input of the third vertex feature sampling module serves as the fourth input of the mesh deformation network and receives the three scale feature maps output by the feature extraction network; its second input is connected to the output of the second deformation unit; its first output is connected to the input of the third graph convolution module and its second output to the input of the third Transformer coding module; and the outputs of the third graph convolution module and the third Transformer coding module are added and output as the reconstructed three-dimensional model of the mesh deformation network.
6. The method as claimed in claim 5, wherein the first graph convolution module comprises the first to fifth graph convolution units connected in series; the second graph convolution module comprises the sixth to tenth graph convolution units connected in series; and the third graph convolution module comprises the eleventh to fifteenth graph convolution units connected in series.
7. The method for three-dimensional image reconstruction based on deep learning of claim 5, wherein the first Transformer coding module comprises a first linear layer, a position coding layer, a multi-head attention layer, a first layer normalization layer, a forward propagation layer, a second layer normalization layer and a second linear layer connected in series in sequence; the second Transformer coding module comprises a third linear layer, a position coding layer, a multi-head attention layer, a third layer normalization layer, a forward propagation layer, a fourth layer normalization layer and a fourth linear layer connected in series in sequence; and the third Transformer coding module comprises a fifth linear layer, a position coding layer, a multi-head attention layer, a fifth layer normalization layer, a forward propagation layer, a sixth layer normalization layer and a sixth linear layer connected in series in sequence.
8. The three-dimensional image reconstruction method based on deep learning as claimed in claim 1, wherein the fusion loss function in the model training of step 3-2 is shown in formula (6), and the fusion loss is formed by adding chamfer distance loss, side length loss, normal vector loss and laplace loss;
the chamfer distance loss is expressed by the formula (1):
p and Q are vertex sets of the reconstructed three-dimensional model and the real three-dimensional model; p and q represent unit vertexes in the corresponding point set; | P | represents the number of elements of the point set P;
side length loss formula (2):
where k is any vertex coordinate in the contiguous set of vertices N (p) for vertex p;
see formula (3) for normal vector loss:
wherein n isqThe normal phasor of the vertex q is represented,<,>representing the inner product calculation between vectors;
laplace loss formula (4):
wherein lpDenotes the Laplacian coordinate, l 'of the current mesh vertex p'pCalculating the Laplace coordinate of the grid vertex p before deformation of each deformation unit in a Laplace coordinate calculation mode shown in a formula (5);
the fusion loss is as follows:

loss = loss_cd + α·loss_edg + β·loss_n + γ·loss_lap   (6)

where α is the loss weight corresponding to loss_edg, β is the loss weight corresponding to loss_n, and γ is the loss weight corresponding to loss_lap.
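To make the regularization terms concrete, here is a minimal NumPy sketch of the side length loss, the Laplacian coordinates, and the weighted fusion. The weight values are illustrative assumptions (the patent leaves α, β, γ unspecified), and the normal vector term is omitted for brevity since it requires ground-truth normals; `cd` stands in for a precomputed chamfer distance.

```python
import numpy as np

def laplacian_coords(V, neighbors):
    # l_p = p - mean of its neighbours, per formula-(5)-style Laplacian coordinates
    return np.stack([V[p] - V[nb].mean(0) for p, nb in enumerate(neighbors)])

def fused_loss(V, V_before, neighbors, cd, alpha=0.1, gamma=0.3):
    """V, V_before: (n, 3) vertex arrays; neighbors: list of neighbour index lists."""
    # Side length term: each edge appears twice (p->k and k->p), matching the double sum
    edg = sum(((V[p] - V[k]) ** 2).sum()
              for p, nb in enumerate(neighbors) for k in nb)
    # Laplacian term: keep local shape close to the pre-deformation mesh
    lap = ((laplacian_coords(V, neighbors)
            - laplacian_coords(V_before, neighbors)) ** 2).sum()
    # Weighted fusion (normal vector term omitted in this sketch)
    return cd + alpha * edg + gamma * lap
```

With `V == V_before` the Laplacian term vanishes and only the edge regularizer and chamfer term remain.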
9. An image three-dimensional reconstruction system based on deep learning is characterized by comprising:
the data acquisition module is used for acquiring a two-dimensional image and a corresponding three-dimensional model;
the data preprocessing module is used for adjusting the format and the size of the two-dimensional image and the corresponding three-dimensional model;
and the complete reconstruction network module is used for training the complete reconstruction network on the data processed by the data preprocessing module, and then reconstructing the three-dimensional model corresponding to a two-dimensional image to be reconstructed by using the trained complete reconstruction network.
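The three claimed system modules form a simple acquire → preprocess → reconstruct pipeline. A minimal Python sketch of that module wiring follows; the 224×224 target size and the placeholder resize are assumptions for illustration only, not values taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Preprocessor:
    """Data preprocessing module: adjusts image format and size."""
    size: tuple = (224, 224)  # assumed target resolution

    def __call__(self, image):
        # Placeholder resize + scaling to [0, 1]; a real module would
        # interpolate properly and also normalize the 3D model mesh
        return np.resize(image, self.size + (3,)) / 255.0

class ReconstructionPipeline:
    """Mirrors the three claimed modules: acquisition, preprocessing, network."""
    def __init__(self, acquire, preprocess, network):
        self.acquire = acquire        # data acquisition module
        self.preprocess = preprocess  # data preprocessing module
        self.network = network        # complete reconstruction network module
    def run(self):
        return self.network(self.preprocess(self.acquire()))
```

Here `network` would wrap the trained complete reconstruction network and return mesh vertices and faces.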
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111522113.4A CN114140601A (en) | 2021-12-13 | 2021-12-13 | Three-dimensional grid reconstruction method and system based on single image under deep learning framework |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114140601A true CN114140601A (en) | 2022-03-04 |
Family
ID=80382542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111522113.4A Pending CN114140601A (en) | 2021-12-13 | 2021-12-13 | Three-dimensional grid reconstruction method and system based on single image under deep learning framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114140601A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110570522A (en) * | 2019-08-22 | 2019-12-13 | 天津大学 | Multi-view three-dimensional reconstruction method |
CN112767539A (en) * | 2021-01-12 | 2021-05-07 | 杭州师范大学 | Image three-dimensional reconstruction method and system based on deep learning |
CN113077554A (en) * | 2021-04-08 | 2021-07-06 | 华南理工大学 | Three-dimensional structured model reconstruction method based on any visual angle picture |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024001311A1 (en) * | 2022-06-27 | 2024-01-04 | 京东科技信息技术有限公司 | Method, apparatus and system for training feature extraction network of three-dimensional mesh model |
CN115619950A (en) * | 2022-10-13 | 2023-01-17 | 中国地质大学(武汉) | Three-dimensional geological modeling method based on deep learning |
CN115619950B (en) * | 2022-10-13 | 2024-01-19 | 中国地质大学(武汉) | Three-dimensional geological modeling method based on deep learning |
CN117994470A (en) * | 2024-04-07 | 2024-05-07 | 之江实验室 | Multi-mode hierarchical self-adaptive digital grid reconstruction method and device |
CN117994470B (en) * | 2024-04-07 | 2024-06-07 | 之江实验室 | Multi-mode hierarchical self-adaptive digital grid reconstruction method and device |
CN118470254A (en) * | 2024-07-15 | 2024-08-09 | 湖南大学 | Three-dimensional grid reconstruction method based on self-adaptive template |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||