CN113658322A - Visual transformer-based three-dimensional voxel reconstruction method - Google Patents

Visual transformer-based three-dimensional voxel reconstruction method

Info

Publication number
CN113658322A
CN113658322A (application CN202110876128.4A)
Authority
CN
China
Prior art keywords
voxel
dimensional
image
reconstruction
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110876128.4A
Other languages
Chinese (zh)
Inventor
石振锋
郭帅君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202110876128.4A
Publication of CN113658322A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

A visual transformer-based three-dimensional voxel reconstruction method, relating to the field of three-dimensional voxel reconstruction. When key information about an object is missing, or part of the object surface is self-occluded, the image feature points cannot be extracted and three-dimensional voxel reconstruction from the images fails. The visual transformer-based three-dimensional voxel reconstruction method comprises the following steps: inputting image information and extracting image features of different dimensions with an encoding layer based on a visual transformer module; decoding the image features through three-dimensional transposed convolution to obtain coarse voxel information; designing a three-dimensional visual transformer structure to reconstruct voxels and improve the precision of the voxel information, or learning corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer to obtain the reconstructed voxels. The method can rapidly recover the voxels of an object under both single-view and multi-view conditions, thereby reflecting the overall structure of the original object.

Description

Visual transformer-based three-dimensional voxel reconstruction method
Technical Field
A visual transformer-based three-dimensional voxel reconstruction method, relating to the field of three-dimensional voxel reconstruction.
Background
Objects in the real world are three-dimensional, and observing them from a three-dimensional point of view allows their structure and properties to be analysed better: for example, whether the interior of a car is spacious enough for people to sit comfortably, or whether a school bag has compartments that students can use to store and organise different books. For computer vision, however, such analysis is difficult, because objects are generally represented in a computer as two-dimensional images, a representation that loses a great deal of information compared with the three-dimensional object. In some applications it is therefore necessary to recover the three-dimensional structure of an object by technical means.
At present, many methods achieve three-dimensional reconstruction by using feature points of image sequences together with the relations among the images. However, missing key information creates difficulties for the reconstruction process: how to recover the invisible parts is a problem that must be considered, and solving it requires techniques that infer the three-dimensional structure from the image; some methods require the camera to be calibrated in advance and are therefore unsuitable for certain scenes; and when part of the object surface is self-occluded, the image feature points may not be extractable, causing the reconstruction to fail. How to reconstruct three-dimensional voxels when information is missing has therefore become an urgent problem.
Disclosure of Invention
The present method addresses the problems of existing three-dimensional voxel reconstruction methods: some require specific techniques to guess the three-dimensional structure of the image; some require the camera to be calibrated in advance and are therefore unsuitable for certain scenes; and when part of the object surface is self-occluded, the image feature points may not be extractable, causing reconstruction to fail.
A visual transformer-based three-dimensional voxel reconstruction method, in which the object is reconstructed from a single view, comprising:
using the single-view image to be reconstructed as the initial value of the input single-view image in the neural network;
inputting the single-view image initial value into an encoding layer based on a visual transformer module and extracting features of the image at different dimensions;
decoding the extracted image features through three-dimensional transposed convolution to obtain coarse voxel information;
and designing a three-dimensional visual transformer structure to reconstruct voxels, improving the precision of the coarse voxel information and obtaining the final voxel model of the single-view three-dimensional reconstruction (a sketch of the decoding step follows this list).
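To make the decoding step above concrete, the following is a minimal PyTorch sketch of turning an encoder feature vector into a coarse 32 × 32 × 32 voxel grid with three-dimensional transposed convolutions. The channel counts, feature dimension and layer arrangement are illustrative assumptions, not the configuration used in the patent.

```python
# Hypothetical sketch of the transposed-convolution decoder described above;
# all sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class CoarseVoxelDecoder(nn.Module):
    """Decode an image feature vector into a coarse 32x32x32 voxel grid
    with 3D transposed convolutions (the decoding step in the list above)."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256 * 2 * 2 * 2)   # seed 2^3 volume
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1),  # 4^3
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),   # 8^3
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),    # 16^3
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),     # 32^3
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.fc(feat).view(-1, 256, 2, 2, 2)
        return torch.sigmoid(self.deconv(x))   # coarse occupancy probabilities

# usage: a (batch, feat_dim) feature from the encoder
coarse = CoarseVoxelDecoder()(torch.randn(2, 768))
print(coarse.shape)  # torch.Size([2, 1, 32, 32, 32])
```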
Inputting the single-view image initial value into the encoding layer based on the visual transformer module and extracting features of the image at different dimensions is carried out as follows:
the image is converted into image blocks;
the input of the encoding layer is x ∈ R^(H×W×C), where H is the length dimension of the input information, W the width dimension, C the feature dimension to be extracted, and R denotes a real-valued matrix space;
image blocks are extracted in sliding-window order with a window of size l×l, so each image block x_p has size l×l×C; with a sliding stride of s, the total number of image blocks is E = ((H-l)/s) × ((W-l)/s);
each image block is flattened into x'_p ∈ R^(D×1), and stacking the flattened blocks gives x' ∈ R^(E×D);
a self-attention mechanism is applied to each image block:
fully connected layers produce the keys x_k, queries x_q and values x_v of the image blocks in the self-attention mechanism:
x_k = x'W_k, W_k ∈ R^(D×D'), x_k ∈ R^(E×D')
x_q = x'W_q, W_q ∈ R^(D×D'), x_q ∈ R^(E×D')
x_v = x'W_v, W_v ∈ R^(D×D'), x_v ∈ R^(E×D')
where D' is the new feature dimension and W_k, W_q and W_v are the weight matrices of the keys, queries and values;
the similarity between the queries and the keys is computed by matrix dot product to obtain the weight matrix x_w (equation shown only as an image in the original publication);
the weight matrix is normalized with the Softmax function (equation shown only as an image in the original publication), where x_w^(i,j) denotes the element in row i, column j;
the encoding layer based on the visual transformer module learns the features x_att in the image by multiplying the normalized weight matrix with the values:
x_att = x_w · x_v
The multi-layer perceptron of the neural network consists of several fully connected layers and a dropout layer.
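The following is a hedged sketch of the patch-based self-attention just described: l×l blocks are extracted with stride s, flattened, projected to keys, queries and values, and combined with a softmax-normalized dot-product weight matrix. The window size, stride, feature width and the 1/sqrt(D') scaling are assumptions for illustration; the patent shows its exact weight formula only as an image.

```python
# Illustrative sketch of the patch-based self-attention described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelfAttention(nn.Module):
    """Extract l x l image blocks with stride s, flatten them, and apply
    single-head self-attention (x_k, x_q, x_v as in the text above)."""
    def __init__(self, channels=3, l=16, s=16, d_new=64):
        super().__init__()
        self.l, self.s = l, s
        d = l * l * channels                    # flattened block dimension D
        self.w_k = nn.Linear(d, d_new, bias=False)
        self.w_q = nn.Linear(d, d_new, bias=False)
        self.w_v = nn.Linear(d, d_new, bias=False)

    def forward(self, img: torch.Tensor) -> torch.Tensor:          # img: (B, C, H, W)
        blocks = F.unfold(img, kernel_size=self.l, stride=self.s)  # (B, D, E)
        x = blocks.transpose(1, 2)                                  # (B, E, D)
        x_k, x_q, x_v = self.w_k(x), self.w_q(x), self.w_v(x)
        # scaled dot product (the 1/sqrt(D') scaling is an assumption)
        x_w = torch.softmax(x_q @ x_k.transpose(1, 2) / x_k.shape[-1] ** 0.5, dim=-1)
        return x_w @ x_v                                            # x_att: (B, E, D')

att = PatchSelfAttention()(torch.randn(1, 3, 224, 224))
print(att.shape)   # torch.Size([1, 196, 64])
```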
The process of designing the three-dimensional visual transformer structure to reconstruct voxels and improving the precision of the coarse voxel information to obtain the final voxel model of the single-view three-dimensional reconstruction is as follows:
the input of the encoding layer is x ∈ R^(C×H×W×L), where L is the height dimension of the input information;
voxel blocks are extracted with a three-dimensional sliding window and stacked along the feature dimension:
the sliding window is l×l×l, and each extracted voxel block has size x_p ∈ R^(D×1) with D = l×l×l×C;
arranging the voxel blocks in order into a matrix gives the feature information matrix of the three-dimensional voxel blocks, x' ∈ R^(N×D), where N = ((H-l)/s) × ((W-l)/s) × ((L-l)/s) is the number of three-dimensional voxel blocks;
the feature matrix of the three-dimensional transformer is obtained;
keys, queries and values are computed for the three-dimensional voxel blocks, and the feature output is finally obtained;
a voxel model of size 32 × 32 × 32 is output.
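Below is a minimal sketch of this refinement stage under the assumption of non-overlapping l×l×l blocks (stride s = l), a small feature width and a simple per-block occupancy head; these choices are illustrative, not the patent's actual design.

```python
# Hypothetical sketch of refining a coarse voxel feature volume with
# 3D block self-attention; all sizes are assumptions.
import torch
import torch.nn as nn

class VoxelBlockAttention(nn.Module):
    """Cut a (B, C, H, W, L) voxel feature volume into l^3 blocks, run
    self-attention over the blocks, and predict refined occupancy."""
    def __init__(self, channels=8, l=4, d_new=64):
        super().__init__()
        self.l = l
        d = channels * l ** 3                       # D = l*l*l*C
        self.w_q = nn.Linear(d, d_new, bias=False)
        self.w_k = nn.Linear(d, d_new, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)      # keep D so blocks fold back
        self.head = nn.Linear(d, l ** 3)            # per-block occupancy logits

    def forward(self, vol):                         # vol: (B, C, 32, 32, 32)
        B, C, H, W, L = vol.shape
        n, l = H // self.l, self.l
        # non-overlapping l x l x l blocks -> (B, N, D)
        x = (vol.reshape(B, C, n, l, n, l, n, l)
                .permute(0, 2, 4, 6, 1, 3, 5, 7)
                .reshape(B, n ** 3, -1))
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        w = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        x = w @ v
        logits = self.head(x)                       # (B, N, l^3)
        out = (logits.reshape(B, n, n, n, l, l, l)
                     .permute(0, 1, 4, 2, 5, 3, 6)
                     .reshape(B, H, W, L))
        return torch.sigmoid(out)                   # refined 32^3 probabilities

print(VoxelBlockAttention()(torch.randn(1, 8, 32, 32, 32)).shape)  # (1, 32, 32, 32)
```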
A visual transformer-based three-dimensional voxel reconstruction method, in which the object is reconstructed from multiple views, the method comprising:
inputting image information; learning features in the images with the network encoding layer; performing feature decoding with three-dimensional transposed convolution and generating coarse voxel information; learning the corresponding voxel weights from the image information, fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer to obtain the probability values of the fused voxels, and generating the voxel model.
Learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer is carried out as follows:
the set of three-dimensional features learned from the different images is S = {x_1, x_2, …, x_n}, where x_n ∈ R^(C×H×W×L) is the feature learned from the n-th image, H is the length dimension of the input information, W the width dimension, L the height dimension and C the feature dimension;
three-dimensional convolutional layers learn the attention scores C = {c_1, c_2, …, c_n}, where c_n = g(x_n), c_n ∈ R^(H×W×L) and g denotes several three-dimensional convolution operations;
the attention scores are normalized into the attention weights S = {s_1, s_2, …, s_n} with the Softmax function (equation shown only as an image in the original publication), where s_i^(h,w,l) denotes the value of the i-th weight at position (h, w, l);
the voxel probabilities generated for each image by single-view three-dimensional reconstruction are V = {v_1, v_2, …, v_n}, where v_n ∈ R^(H×W×L);
the weights are multiplied element-wise with the corresponding voxels and the results are summed to complete the fusion, giving the fully fused voxel y (equation shown only as an image in the original publication), where × denotes element-wise multiplication.
The probability values of the fused voxels are then obtained and the voxel model is generated; the voxel probability is given by an equation shown only as an image in the original publication, where (h, w, l) denotes the voxel position.
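As an illustration of this fusion step, the sketch below uses a small stack of 3D convolutions as g(·) to score each view, a softmax across views to turn the scores into weights, and a weighted sum to fuse the per-view voxel probabilities. The convolution widths are assumptions; only the overall structure follows the description above.

```python
# Hedged sketch of attention-based multi-view voxel fusion.
import torch
import torch.nn as nn

class ViewFusion(nn.Module):
    def __init__(self, channels=8):
        super().__init__()
        # g(.) in the text: a small stack of 3D convolutions producing one score map
        self.score = nn.Sequential(
            nn.Conv3d(channels, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats, voxels):
        # feats:  (B, n_views, C, H, W, L) learned 3D features per view
        # voxels: (B, n_views, H, W, L) per-view voxel probabilities
        B, n, C, H, W, L = feats.shape
        c = self.score(feats.reshape(B * n, C, H, W, L)).reshape(B, n, H, W, L)
        s = torch.softmax(c, dim=1)            # attention weights across views
        return (s * voxels).sum(dim=1)         # fused voxel probabilities (B, H, W, L)

fusion = ViewFusion()
out = fusion(torch.randn(2, 5, 8, 32, 32, 32), torch.rand(2, 5, 32, 32, 32))
print(out.shape)   # torch.Size([2, 32, 32, 32]); works for any number of views
```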
The visual transformer-based three-dimensional voxel reconstruction method further comprises designing a loss function, which is a binary cross-entropy loss (shown only as an equation image in the original publication), where y_(i,j,k) denotes the true value of the voxel block at (i, j, k), taking the value 1 or 0, and p_(i,j,k) denotes the predicted voxel probability; the smaller the loss, the closer the prediction is to the true value.
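A minimal sketch of such a per-voxel binary cross-entropy loss is shown below; averaging over all (i, j, k) positions is an assumption, since the patent gives the exact formula only as an image.

```python
# Hedged sketch of the voxel-wise binary cross-entropy loss.
import torch

def voxel_bce_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred: predicted occupancy probabilities in (0, 1); target: 0/1 ground truth."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)
    return -(target * pred.log() + (1 - target) * (1 - pred).log()).mean()

loss = voxel_bce_loss(torch.rand(1, 32, 32, 32),
                      torch.randint(0, 2, (1, 32, 32, 32)).float())
print(loss.item())
```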
The invention has the advantages that:
the three-dimensional reconstruction voxel method does not need to utilize a specific technology to guess the three-dimensional structure of the image; the camera does not need to be calibrated in advance, and the method is suitable for all scenes.
When a part of the surface of an object is shielded by self, the existing voxel reconstruction method can cause reconstruction failure due to the fact that feature points of an image technology cannot be extracted.
The method can infer the unknown structure of the object under the condition of information loss, namely a single view, and the whole structure of the object can be reduced through the voxels generated by the model. For a multi-view three-dimensional reconstruction voxel model, a probability value of different image reconstruction voxels is fused by using an attention module based on three-dimensional convolution, the model can process any number of image inputs in parallel, and the reconstruction result is independent of the sequence of the image inputs.
The method can be applied to the three-dimensional reconstruction voxel models of all scenes.
Drawings
FIG. 1 shows the overall structure of the single-view three-dimensional voxel reconstruction model.
Fig. 2 to 4 are images from each stage of single-view three-dimensional voxel reconstruction in the first embodiment; from left to right: the input image, the target voxels and the voxels reconstructed by the model.
Fig. 5 to 7 are images from each stage of multi-view three-dimensional voxel reconstruction in the second embodiment; the first three images are the model inputs, followed by the target voxels and the voxels reconstructed by the model.
Detailed Description
Several embodiments of the present invention are described with reference to the accompanying drawings.
First embodiment. This embodiment is described with reference to Fig. 1 to Fig. 4. In this visual transformer-based three-dimensional voxel reconstruction method the object is reconstructed from a single view, and the method comprises:
using the single-view image to be reconstructed as the initial value of the input single-view image in the neural network;
inputting the single-view image initial value into an encoding layer based on a visual transformer module and extracting features of the image at different dimensions;
decoding the extracted image features through three-dimensional transposed convolution to obtain coarse voxel information;
and designing a three-dimensional visual transformer structure to reconstruct voxels, improving the precision of the coarse voxel information and obtaining the final voxel model of the single-view three-dimensional reconstruction.
Inputting the single-view image initial value into the encoding layer based on the visual transformer module and extracting features of the image at different dimensions is carried out as follows:
the image is converted into image blocks;
the input of the encoding layer is x ∈ R^(H×W×C), where H is the length dimension of the input information, W the width dimension, C the feature dimension to be extracted, and R denotes a real-valued matrix space;
image blocks are extracted in sliding-window order with a window of size l×l, so each image block x_p has size l×l×C; with a sliding stride of s, the total number of image blocks is E = ((H-l)/s) × ((W-l)/s);
each image block is flattened into x'_p ∈ R^(D×1), and stacking the flattened blocks gives x' ∈ R^(E×D);
a self-attention mechanism is applied to each image block:
fully connected layers produce the keys x_k, queries x_q and values x_v of the image blocks in the self-attention mechanism:
x_k = x'W_k, W_k ∈ R^(D×D'), x_k ∈ R^(E×D')
x_q = x'W_q, W_q ∈ R^(D×D'), x_q ∈ R^(E×D')
x_v = x'W_v, W_v ∈ R^(D×D'), x_v ∈ R^(E×D')
where D' is the new feature dimension and W_k, W_q and W_v are the weight matrices of the keys, queries and values;
the similarity between the queries and the keys is computed by matrix dot product to obtain the weight matrix x_w (equation shown only as an image in the original publication);
the weight matrix is normalized with the Softmax function (equation shown only as an image in the original publication), where x_w^(i,j) denotes the element in row i, column j;
the encoding layer based on the visual transformer module learns the features x_att in the image by multiplying the normalized weight matrix with the values:
x_att = x_w · x_v
The visual transformer structure not only considers the relation between each pixel in the image and the surrounding pixels, but also learns the key features within the image blocks by applying the self-attention mechanism to each block.
The process of designing the three-dimensional visual transformer structure to reconstruct voxels and improving the precision of the coarse voxel information to obtain the final voxel model of the single-view three-dimensional reconstruction is as follows:
the input of the encoding layer is x ∈ R^(C×H×W×L), where L is the height dimension of the input information;
voxel blocks are extracted with a three-dimensional sliding window and stacked along the feature dimension:
the sliding window is l×l×l, and each extracted voxel block has size x_p ∈ R^(D×1) with D = l×l×l×C;
arranging the voxel blocks in order into a matrix gives the feature information matrix of the three-dimensional voxel blocks, x' ∈ R^(N×D), where N = ((H-l)/s) × ((W-l)/s) × ((L-l)/s) is the number of three-dimensional voxel blocks;
the feature matrix of the three-dimensional transformer is obtained;
keys, queries and values are computed for the three-dimensional voxel blocks, and the feature output is finally obtained;
a voxel probability volume of size 32 × 32 × 32 is output, and the voxel model is generated at the same time.
The three-dimensional voxel reconstruction method further uses the overlap degree IoU (intersection over union) as an evaluation index (formula shown only as an image in the original publication), where p_(i,j,k) denotes the predicted voxel value at (i, j, k), gt_(i,j,k) denotes the true voxel value at (i, j, k), t denotes the voxel threshold and I(·) denotes the indicator function; a higher IoU value indicates a better reconstruction result.
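For reference, the sketch below follows the common voxel-IoU definition (the thresholded prediction intersected and united with the binary ground truth); since the patent gives its formula only as an image, treat this as an assumed, standard form.

```python
# Hedged sketch of voxel IoU evaluation with threshold t.
import torch

def voxel_iou(pred: torch.Tensor, gt: torch.Tensor, t: float = 0.4) -> float:
    """pred: predicted occupancy probabilities; gt: binary ground-truth voxels."""
    p = pred > t
    g = gt > 0.5
    intersection = (p & g).sum().item()
    union = (p | g).sum().item()
    return intersection / union if union > 0 else 1.0

iou = voxel_iou(torch.rand(32, 32, 32), torch.randint(0, 2, (32, 32, 32)).float())
print(f"IoU = {iou:.3f}")
```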
Specifically, Table 1 compares the overlap IoU of the single-view three-dimensional voxel reconstruction model with other models on the test set. The 43,736 three-dimensional models from the 13 categories provided by 3D-R2N2 are used as the data set, of which 32,700 models form the training set and the rest the test set. 3D-R2N2 also provides 24 renderings of each model from random viewpoints, and these renderings are used as the network input, giving the experimental results of the single-view three-dimensional voxel reconstruction model. The voxel threshold t is set to 0.4, and a higher IoU value indicates a better reconstruction. It can be seen that the IoU differs considerably across categories. The car and cabinet categories reconstruct well because the object structures within these categories are relatively similar, which makes it easy for the network to estimate their structure; the lamp and display categories reconstruct less well, partly because more of these objects is invisible and partly because their structures vary more, so the model cannot estimate their structure reliably.
Table 1 also compares the visual transformer-based three-dimensional voxel reconstruction model with other models in the single-view case; the images obtained at each stage of single-view reconstruction of an airplane, a car and a rifle with the method of the invention are shown in Fig. 2, Fig. 3 and Fig. 4 respectively. OGN uses an octree model to represent the target voxels, which can express higher-resolution 3D output within limited memory, but its representation is more complex and network training becomes difficult as the resolution increases. The figures show that the model reconstructs the airplane and car examples well.
Table 1. Overlap IoU of the single-view visual transformer-based voxel reconstruction model compared with other models (table shown only as an image in the original publication)
Second embodiment. This embodiment is described with reference to Fig. 5 to Fig. 7. In this visual transformer-based three-dimensional voxel reconstruction method the object is reconstructed from multiple views, and the method comprises:
inputting image information; learning features in the images with the network encoding layer; performing feature decoding with three-dimensional transposed convolution and generating coarse voxel information; learning the corresponding voxel weights from the image information, fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer to obtain the probability values of the fused voxels, and generating the voxel model.
Learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer is carried out as follows:
the set of three-dimensional features learned from the different images is S = {x_1, x_2, …, x_n}, where x_n ∈ R^(C×H×W×L) is the feature learned from the n-th image, H is the length dimension of the input information, W the width dimension, L the height dimension and C the feature dimension;
three-dimensional convolutional layers learn the attention scores C = {c_1, c_2, …, c_n}, where c_n = g(x_n), c_n ∈ R^(H×W×L) and g denotes several three-dimensional convolution operations;
the attention scores are normalized into the attention weights S = {s_1, s_2, …, s_n} with the Softmax function (equation shown only as an image in the original publication), where s_i^(h,w,l) denotes the value of the i-th weight at position (h, w, l);
the voxel probabilities generated for each image by single-view three-dimensional reconstruction are V = {v_1, v_2, …, v_n}, where v_n ∈ R^(H×W×L);
the weights are multiplied element-wise with the corresponding voxels and the results are summed to complete the fusion, giving the fully fused voxel y (equation shown only as an image in the original publication), where × denotes element-wise multiplication.
The weights are generated from the features; following the idea of the attention mechanism, a larger weight is given to the regions where an image reconstructs well, so that the voxel probability values generated from the different images can be fused better.
The probability values of the fused voxels are then obtained and the voxel model is generated; the voxel probability is given by an equation shown only as an image in the original publication, where (h, w, l) denotes the voxel position.
The parameters of the convolutional layers that extract image features and of the three-dimensional convolutional layers that generate the attention weights are shared, so when the number of input images differs the method can adaptively use the shared parameters for voxel fusion. Specifically, Table 2 compares the reconstruction overlap IoU of the model for different numbers of input images; the images obtained at each stage of multi-view reconstruction of a lamp, a mobile phone and an airplane with the method of the invention are shown in Fig. 5 to Fig. 7. The performance of the model on the test set keeps improving as the number of views increases, showing that the model uses the features of multiple images to reconstruct the voxels. Compared with the single-view result, the multi-view reconstruction result is clearly better; for the chair category, for example, adding one image raises the overlap IoU by 0.3, a pronounced improvement.
Table 2. Overlap IoU of the visual transformer-based voxel reconstruction model in the multi-view case (table shown only as an image in the original publication)
Table 3 compares the multi-view reconstruction results of the model with those of other methods. The results show that the model reconstructs the voxel model of an object well for different numbers of input images. The model is also faster at reconstruction than the 3D-R2N2 model and achieves better results with fewer parameters. Moreover, the multi-view model can adaptively complete the reconstruction task with any number of images and is more flexible than the 3D-R2N2 model.
Table 3. IoU of the visual transformer-based voxel model compared with other models under multi-view reconstruction (table shown only as images in the original publication)

Claims (9)

1. A visual transformer-based three-dimensional voxel reconstruction method, wherein the object is reconstructed from a single view, the method comprising:
using the single view to be reconstructed as the initial value of the input single-view image in the neural network;
inputting the single-view image initial value into an encoding layer based on a visual transformer module and extracting features of the image at different dimensions;
decoding the extracted image features through three-dimensional transposed convolution to obtain coarse voxel information;
and designing a three-dimensional visual transformer structure to reconstruct voxels, improving the precision of the coarse voxel information and obtaining the final voxel model of the single-view three-dimensional reconstruction.
2. The method according to claim 1, wherein inputting the single-view image initial value into the encoding layer based on the visual transformer module and extracting features of the image at different dimensions comprises:
converting the image into image blocks;
the input of the encoding layer is x ∈ R^(H×W×C), where H is the length dimension of the input information, W the width dimension, C the feature dimension to be extracted, and R denotes a real-valued matrix space;
extracting image blocks in sliding-window order with a window of size l×l, so that each image block x_p has size l×l×C; with a sliding stride of s, the total number of image blocks is E = ((H-l)/s) × ((W-l)/s);
flattening each image block into x'_p ∈ R^(D×1) and stacking the flattened blocks to give x' ∈ R^(E×D);
applying a self-attention mechanism to each image block:
fully connected layers produce the keys x_k, queries x_q and values x_v of the image blocks in the self-attention mechanism:
x_k = x'W_k, W_k ∈ R^(D×D'), x_k ∈ R^(E×D')
x_q = x'W_q, W_q ∈ R^(D×D'), x_q ∈ R^(E×D')
x_v = x'W_v, W_v ∈ R^(D×D'), x_v ∈ R^(E×D')
where D' is the new feature dimension and W_k, W_q and W_v are the weight matrices of the keys, queries and values;
the similarity between the queries and the keys is computed by matrix dot product to obtain the weight matrix x_w (equation shown only as an image in the original publication);
the weight matrix is normalized with the Softmax function (equation shown only as an image in the original publication), where x_w^(i,j) denotes the element in row i, column j;
the encoding layer based on the visual transformer module learns the features x_att in the image by multiplying the normalized weight matrix with the values:
x_att = x_w · x_v
the multi-layer perceptron of the neural network consists of several fully connected layers and a dropout layer.
3. The method according to claim 1, wherein the process of designing the three-dimensional visual transformer structure to reconstruct voxels and improving the precision of the coarse voxel information to obtain the final voxel model of the single-view three-dimensional reconstruction is as follows:
the input of the encoding layer is x ∈ R^(C×H×W×L), where L is the height dimension of the input information;
voxel blocks are extracted with a three-dimensional sliding window and stacked along the feature dimension:
the sliding window is l×l×l, and each extracted voxel block has size x_p ∈ R^(D×1) with D = l×l×l×C;
arranging the voxel blocks in order into a matrix gives the feature information matrix of the three-dimensional voxel blocks, x' ∈ R^(N×D), where N = ((H-l)/s) × ((W-l)/s) × ((L-l)/s) is the number of three-dimensional voxel blocks;
the feature matrix of the three-dimensional transformer is obtained;
keys, queries and values are computed for the three-dimensional voxel blocks, and the feature output is finally obtained;
a voxel probability volume of size 32 × 32 × 32 is output, and the voxel model is generated at the same time.
4. A visual transformer-based three-dimensional voxel reconstruction method, wherein the object is reconstructed from multiple views, the method comprising: inputting image information; learning features in the images with the network encoding layer; performing feature decoding with three-dimensional transposed convolution and generating coarse voxel information; learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer; and obtaining the probability values of the fused voxels and generating the voxel model.
5. The method according to claim 4, wherein learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer comprises:
the set of three-dimensional features learned from the different images is S = {x_1, x_2, …, x_n}, where x_n ∈ R^(C×H×W×L) is the feature learned from the n-th image, H is the length dimension of the input information, W the width dimension, L the height dimension and C the feature dimension;
three-dimensional convolutional layers learn the attention scores C = {c_1, c_2, …, c_n}, where c_n = g(x_n), c_n ∈ R^(H×W×L) and g denotes several three-dimensional convolution operations;
the attention scores are normalized into the attention weights S = {s_1, s_2, …, s_n} with the Softmax function (equation shown only as an image in the original publication), where s_i^(h,w,l) denotes the value of the i-th weight at position (h, w, l);
the voxel probabilities generated for each image by single-view three-dimensional reconstruction are V = {v_1, v_2, …, v_n}, where v_n ∈ R^(H×W×L);
the weights are multiplied element-wise with the corresponding voxels and the results are summed to complete the fusion, giving the fully fused voxel y (equation shown only as an image in the original publication), where × denotes element-wise multiplication.
6. The method according to claim 4, wherein the probability values of the fused voxels are obtained and the voxel model is generated, the voxel probability being given by an equation shown only as an image in the original publication, where (h, w, l) denotes the voxel position.
7. The visual transformer-based three-dimensional voxel reconstruction method according to claim 1 or claim 4, further comprising designing a loss function, which is a binary cross-entropy loss (shown only as an equation image in the original publication), where y_(i,j,k) denotes the true value of the voxel block at (i, j, k), taking the value 1 or 0, and p_(i,j,k) denotes the predicted voxel probability; the smaller the loss, the closer the prediction is to the true value.
8. The visual transformer-based three-dimensional voxel reconstruction method according to claim 1 or claim 4, further comprising using the overlap degree IoU as an evaluation index (formula shown only as an image in the original publication), where p_(i,j,k) denotes the predicted voxel value at (i, j, k), gt_(i,j,k) denotes the true voxel value at (i, j, k), t denotes the voxel threshold and I(·) denotes the indicator function; a higher IoU value indicates a better reconstruction result.
9. A computer device, characterized by: comprising a memory in which a computer program is stored and a processor which, when running the computer program stored in the memory, executes the visual transformer-based three-dimensional voxel reconstruction method according to any of claims 1-8.
CN202110876128.4A 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method Pending CN113658322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110876128.4A CN113658322A (en) 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110876128.4A CN113658322A (en) 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method

Publications (1)

Publication Number Publication Date
CN113658322A true CN113658322A (en) 2021-11-16

Family

ID=78478195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110876128.4A Pending CN113658322A (en) 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method

Country Status (1)

Country Link
CN (1) CN113658322A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119838A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Voxel model and image generation method, equipment and storage medium
WO2024078049A1 (en) * 2022-10-10 2024-04-18 Shanghaitech University System and method for near real-time and unsupervised coordinate projection network for computed tomography images reconstruction
CN115619950A (en) * 2022-10-13 2023-01-17 中国地质大学(武汉) Three-dimensional geological modeling method based on deep learning
CN115619950B (en) * 2022-10-13 2024-01-19 中国地质大学(武汉) Three-dimensional geological modeling method based on deep learning
CN117496075A (en) * 2024-01-02 2024-02-02 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium
CN117496075B (en) * 2024-01-02 2024-03-22 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113658322A (en) Visual transform-based three-dimensional voxel reconstruction method
Qiu et al. Geometric back-projection network for point cloud classification
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Jam et al. A comprehensive review of past and present image inpainting methods
CN110659727B (en) Sketch-based image generation method
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN112396645A (en) Monocular image depth estimation method and system based on convolution residual learning
Bahri et al. Robust Kronecker component analysis
CN114418030A (en) Image classification method, and training method and device of image classification model
CN115526891B (en) Training method and related device for defect data set generation model
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
Bogacz et al. Period classification of 3D cuneiform tablets with geometric neural networks
CN115546032A (en) Single-frame image super-resolution method based on feature fusion and attention mechanism
CN116630183A (en) Text image restoration method based on generated type countermeasure network
CN113538662B (en) Single-view three-dimensional object reconstruction method and device based on RGB data
CN113706404B (en) Depression angle face image correction method and system based on self-attention mechanism
Zhou et al. Personalized and occupational-aware age progression by generative adversarial networks
CN113096239B (en) Three-dimensional point cloud reconstruction method based on deep learning
Hoffman et al. Probnerf: Uncertainty-aware inference of 3d shapes from 2d images
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article
CN115424310A (en) Weak label learning method for expression separation task in human face rehearsal
Rivera et al. Trilateral convolutional neural network for 3D shape reconstruction of objects from a single depth view
Liu et al. Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination