CN113658322A - Visual transformer-based three-dimensional voxel reconstruction method - Google Patents

Visual transformer-based three-dimensional voxel reconstruction method

Info

Publication number
CN113658322A
CN113658322A (application CN202110876128.4A)
Authority
CN
China
Prior art keywords
voxel
dimensional
image
reconstruction
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110876128.4A
Other languages
Chinese (zh)
Inventor
石振锋
郭帅君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202110876128.4A
Publication of CN113658322A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

A visual transformer-based three-dimensional voxel reconstruction method, relating to the field of three-dimensional voxel reconstruction. When key information about an object is missing, or part of the object surface is self-occluded, the image feature points cannot be extracted and three-dimensional voxel reconstruction from the images fails. The visual transformer-based three-dimensional voxel reconstruction method comprises the following steps: inputting image information and extracting image features of different dimensions with an encoding layer based on a visual transformer module; decoding the image features through three-dimensional transposed convolution to obtain coarse voxel information; designing a three-dimensional visual transformer structure to reconstruct voxels and improve the precision of the voxel information, or learning corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer to obtain the reconstructed voxels. The method can rapidly recover the voxels of an object under both single-view and multi-view conditions, thereby reflecting the overall structure of the original object.

Description

Visual transformer-based three-dimensional voxel reconstruction method
Technical Field
A visual transformer-based three-dimensional voxel reconstruction method, relating to the field of three-dimensional voxel reconstruction.
Background
Objects in the real world are three-dimensional, and observing them from a three-dimensional point of view allows their structure and properties to be analysed better: for example, whether the interior of a car is spacious enough for people to sit comfortably, or whether a school bag has compartments that students can use to store and organise different books. For computer vision, however, such analysis is difficult, because objects are generally represented in a computer as two-dimensional images, a representation that loses a great deal of information compared with the three-dimensional object. In some applications it is therefore necessary to recover the three-dimensional structure of an object by technical means.
At present, many methods achieve three-dimensional reconstruction by using feature points of image sequences together with the relations among the images. However, missing key information creates difficulties for the reconstruction process: how to recover the invisible parts is a problem that must be considered, and solving it requires techniques that infer the three-dimensional structure from the image; some methods require the camera to be calibrated in advance and are therefore unsuitable for certain scenes; and when part of the object surface is self-occluded, the image feature points may not be extractable, causing the reconstruction to fail. How to reconstruct three-dimensional voxels when information is missing has therefore become an urgent problem.
Disclosure of Invention
The present method addresses the problems of existing three-dimensional voxel reconstruction methods: some require specific techniques to guess the three-dimensional structure of the image; some require the camera to be calibrated in advance and are therefore unsuitable for certain scenes; and when part of the object surface is self-occluded, the image feature points may not be extractable, causing reconstruction to fail.
A visual transformer-based three-dimensional voxel reconstruction method, in which the object is reconstructed from a single view, comprising:
using the single-view image to be reconstructed as the initial value of the input single-view image in the neural network;
inputting the single-view image initial value into an encoding layer based on a visual transformer module and extracting features of the image at different dimensions;
decoding the extracted image features through three-dimensional transposed convolution to obtain coarse voxel information;
and designing a three-dimensional visual transformer structure to reconstruct voxels, improving the precision of the coarse voxel information and obtaining the final voxel model of the single-view three-dimensional reconstruction (a sketch of the decoding step follows this list).
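To make the decoding step above concrete, the following is a minimal PyTorch sketch of turning an encoder feature vector into a coarse 32 × 32 × 32 voxel grid with three-dimensional transposed convolutions. The channel counts, feature dimension and layer arrangement are illustrative assumptions, not the configuration used in the patent.

```python
# Hypothetical sketch of the transposed-convolution decoder described above;
# all sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class CoarseVoxelDecoder(nn.Module):
    """Decode an image feature vector into a coarse 32x32x32 voxel grid
    with 3D transposed convolutions (the decoding step in the list above)."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 256 * 2 * 2 * 2)   # seed 2^3 volume
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1),  # 4^3
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),   # 8^3
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),    # 16^3
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),     # 32^3
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.fc(feat).view(-1, 256, 2, 2, 2)
        return torch.sigmoid(self.deconv(x))   # coarse occupancy probabilities

# usage: a (batch, feat_dim) feature from the encoder
coarse = CoarseVoxelDecoder()(torch.randn(2, 768))
print(coarse.shape)  # torch.Size([2, 1, 32, 32, 32])
```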
Inputting the single-view image initial value into the encoding layer based on the visual transformer module and extracting features of the image at different dimensions is carried out as follows:
the image is converted into image blocks;
the input of the encoding layer is x ∈ R^(H×W×C), where H is the length dimension of the input information, W the width dimension, C the feature dimension to be extracted, and R denotes a real-valued matrix space;
image blocks are extracted in sliding-window order with a window of size l×l, so each image block x_p has size l×l×C; with a sliding stride of s, the total number of image blocks is E = ((H-l)/s) × ((W-l)/s);
each image block is flattened into x'_p ∈ R^(D×1), and stacking the flattened blocks gives x' ∈ R^(E×D);
a self-attention mechanism is applied to each image block:
fully connected layers produce the keys x_k, queries x_q and values x_v of the image blocks in the self-attention mechanism:
x_k = x'W_k, W_k ∈ R^(D×D'), x_k ∈ R^(E×D')
x_q = x'W_q, W_q ∈ R^(D×D'), x_q ∈ R^(E×D')
x_v = x'W_v, W_v ∈ R^(D×D'), x_v ∈ R^(E×D')
where D' is the new feature dimension and W_k, W_q and W_v are the weight matrices of the keys, queries and values;
the similarity between the queries and the keys is computed by matrix dot product to obtain the weight matrix x_w (equation shown only as an image in the original publication);
the weight matrix is normalized with the Softmax function (equation shown only as an image in the original publication), where x_w^(i,j) denotes the element in row i, column j;
the encoding layer based on the visual transformer module learns the features x_att in the image by multiplying the normalized weight matrix with the values:
x_att = x_w · x_v
The multi-layer perceptron of the neural network consists of several fully connected layers and a dropout layer.
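The following is a hedged sketch of the patch-based self-attention just described: l×l blocks are extracted with stride s, flattened, projected to keys, queries and values, and combined with a softmax-normalized dot-product weight matrix. The window size, stride, feature width and the 1/sqrt(D') scaling are assumptions for illustration; the patent shows its exact weight formula only as an image.

```python
# Illustrative sketch of the patch-based self-attention described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelfAttention(nn.Module):
    """Extract l x l image blocks with stride s, flatten them, and apply
    single-head self-attention (x_k, x_q, x_v as in the text above)."""
    def __init__(self, channels=3, l=16, s=16, d_new=64):
        super().__init__()
        self.l, self.s = l, s
        d = l * l * channels                    # flattened block dimension D
        self.w_k = nn.Linear(d, d_new, bias=False)
        self.w_q = nn.Linear(d, d_new, bias=False)
        self.w_v = nn.Linear(d, d_new, bias=False)

    def forward(self, img: torch.Tensor) -> torch.Tensor:          # img: (B, C, H, W)
        blocks = F.unfold(img, kernel_size=self.l, stride=self.s)  # (B, D, E)
        x = blocks.transpose(1, 2)                                  # (B, E, D)
        x_k, x_q, x_v = self.w_k(x), self.w_q(x), self.w_v(x)
        # scaled dot product (the 1/sqrt(D') scaling is an assumption)
        x_w = torch.softmax(x_q @ x_k.transpose(1, 2) / x_k.shape[-1] ** 0.5, dim=-1)
        return x_w @ x_v                                            # x_att: (B, E, D')

att = PatchSelfAttention()(torch.randn(1, 3, 224, 224))
print(att.shape)   # torch.Size([1, 196, 64])
```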
The process of designing the three-dimensional visual transformer structure to reconstruct voxels and improving the precision of the coarse voxel information to obtain the final voxel model of the single-view three-dimensional reconstruction is as follows:
the input of the encoding layer is x ∈ R^(C×H×W×L), where L is the height dimension of the input information;
voxel blocks are extracted with a three-dimensional sliding window and stacked along the feature dimension:
the sliding window is l×l×l, and each extracted voxel block has size x_p ∈ R^(D×1) with D = l×l×l×C;
arranging the voxel blocks in order into a matrix gives the feature information matrix of the three-dimensional voxel blocks, x' ∈ R^(N×D), where N = ((H-l)/s) × ((W-l)/s) × ((L-l)/s) is the number of three-dimensional voxel blocks;
the feature matrix of the three-dimensional transformer is obtained;
keys, queries and values are computed for the three-dimensional voxel blocks, and the feature output is finally obtained;
a voxel model of size 32 × 32 × 32 is output.
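Below is a minimal sketch of this refinement stage under the assumption of non-overlapping l×l×l blocks (stride s = l), a small feature width and a simple per-block occupancy head; these choices are illustrative, not the patent's actual design.

```python
# Hypothetical sketch of refining a coarse voxel feature volume with
# 3D block self-attention; all sizes are assumptions.
import torch
import torch.nn as nn

class VoxelBlockAttention(nn.Module):
    """Cut a (B, C, H, W, L) voxel feature volume into l^3 blocks, run
    self-attention over the blocks, and predict refined occupancy."""
    def __init__(self, channels=8, l=4, d_new=64):
        super().__init__()
        self.l = l
        d = channels * l ** 3                       # D = l*l*l*C
        self.w_q = nn.Linear(d, d_new, bias=False)
        self.w_k = nn.Linear(d, d_new, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)      # keep D so blocks fold back
        self.head = nn.Linear(d, l ** 3)            # per-block occupancy logits

    def forward(self, vol):                         # vol: (B, C, 32, 32, 32)
        B, C, H, W, L = vol.shape
        n, l = H // self.l, self.l
        # non-overlapping l x l x l blocks -> (B, N, D)
        x = (vol.reshape(B, C, n, l, n, l, n, l)
                .permute(0, 2, 4, 6, 1, 3, 5, 7)
                .reshape(B, n ** 3, -1))
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        w = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        x = w @ v
        logits = self.head(x)                       # (B, N, l^3)
        out = (logits.reshape(B, n, n, n, l, l, l)
                     .permute(0, 1, 4, 2, 5, 3, 6)
                     .reshape(B, H, W, L))
        return torch.sigmoid(out)                   # refined 32^3 probabilities

print(VoxelBlockAttention()(torch.randn(1, 8, 32, 32, 32)).shape)  # (1, 32, 32, 32)
```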
A visual transformer-based three-dimensional voxel reconstruction method, in which the object is reconstructed from multiple views, the method comprising:
inputting image information; learning features in the images with the network encoding layer; performing feature decoding with three-dimensional transposed convolution and generating coarse voxel information; learning the corresponding voxel weights from the image information, fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer to obtain the probability values of the fused voxels, and generating the voxel model.
Learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer is carried out as follows:
the set of three-dimensional features learned from the different images is S = {x_1, x_2, …, x_n}, where x_n ∈ R^(C×H×W×L) is the feature learned from the n-th image, H is the length dimension of the input information, W the width dimension, L the height dimension and C the feature dimension;
three-dimensional convolutional layers learn the attention scores C = {c_1, c_2, …, c_n}, where c_n = g(x_n), c_n ∈ R^(H×W×L) and g denotes several three-dimensional convolution operations;
the attention scores are normalized into the attention weights S = {s_1, s_2, …, s_n} with the Softmax function (equation shown only as an image in the original publication), where s_i^(h,w,l) denotes the value of the i-th weight at position (h, w, l);
the voxel probabilities generated for each image by single-view three-dimensional reconstruction are V = {v_1, v_2, …, v_n}, where v_n ∈ R^(H×W×L);
the weights are multiplied element-wise with the corresponding voxels and the results are summed to complete the fusion, giving the fully fused voxel y (equation shown only as an image in the original publication), where × denotes element-wise multiplication.
The probability values of the fused voxels are then obtained and the voxel model is generated; the voxel probability is given by an equation shown only as an image in the original publication, where (h, w, l) denotes the voxel position.
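As an illustration of this fusion step, the sketch below uses a small stack of 3D convolutions as g(·) to score each view, a softmax across views to turn the scores into weights, and a weighted sum to fuse the per-view voxel probabilities. The convolution widths are assumptions; only the overall structure follows the description above.

```python
# Hedged sketch of attention-based multi-view voxel fusion.
import torch
import torch.nn as nn

class ViewFusion(nn.Module):
    def __init__(self, channels=8):
        super().__init__()
        # g(.) in the text: a small stack of 3D convolutions producing one score map
        self.score = nn.Sequential(
            nn.Conv3d(channels, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats, voxels):
        # feats:  (B, n_views, C, H, W, L) learned 3D features per view
        # voxels: (B, n_views, H, W, L) per-view voxel probabilities
        B, n, C, H, W, L = feats.shape
        c = self.score(feats.reshape(B * n, C, H, W, L)).reshape(B, n, H, W, L)
        s = torch.softmax(c, dim=1)            # attention weights across views
        return (s * voxels).sum(dim=1)         # fused voxel probabilities (B, H, W, L)

fusion = ViewFusion()
out = fusion(torch.randn(2, 5, 8, 32, 32, 32), torch.rand(2, 5, 32, 32, 32))
print(out.shape)   # torch.Size([2, 32, 32, 32]); works for any number of views
```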
The visual transformer-based three-dimensional voxel reconstruction method further comprises designing a loss function, which is a binary cross-entropy loss (shown only as an equation image in the original publication), where y_(i,j,k) denotes the true value of the voxel block at (i, j, k), taking the value 1 or 0, and p_(i,j,k) denotes the predicted voxel probability; the smaller the loss, the closer the prediction is to the true value.
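A minimal sketch of such a per-voxel binary cross-entropy loss is shown below; averaging over all (i, j, k) positions is an assumption, since the patent gives the exact formula only as an image.

```python
# Hedged sketch of the voxel-wise binary cross-entropy loss.
import torch

def voxel_bce_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred: predicted occupancy probabilities in (0, 1); target: 0/1 ground truth."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)
    return -(target * pred.log() + (1 - target) * (1 - pred).log()).mean()

loss = voxel_bce_loss(torch.rand(1, 32, 32, 32),
                      torch.randint(0, 2, (1, 32, 32, 32)).float())
print(loss.item())
```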
The invention has the advantages that:
the three-dimensional reconstruction voxel method does not need to utilize a specific technology to guess the three-dimensional structure of the image; the camera does not need to be calibrated in advance, and the method is suitable for all scenes.
When a part of the surface of an object is shielded by self, the existing voxel reconstruction method can cause reconstruction failure due to the fact that feature points of an image technology cannot be extracted.
The method can infer the unknown structure of the object under the condition of information loss, namely a single view, and the whole structure of the object can be reduced through the voxels generated by the model. For a multi-view three-dimensional reconstruction voxel model, a probability value of different image reconstruction voxels is fused by using an attention module based on three-dimensional convolution, the model can process any number of image inputs in parallel, and the reconstruction result is independent of the sequence of the image inputs.
The method can be applied to the three-dimensional reconstruction voxel models of all scenes.
Drawings
FIG. 1 shows the overall structure of the single-view three-dimensional voxel reconstruction model.
Fig. 2 to 4 are images from each stage of single-view three-dimensional voxel reconstruction in the first embodiment; from left to right: the input image, the target voxels and the voxels reconstructed by the model.
Fig. 5 to 7 are images from each stage of multi-view three-dimensional voxel reconstruction in the second embodiment; the first three images are the model inputs, followed by the target voxels and the voxels reconstructed by the model.
Detailed Description
Several embodiments of the present invention are described with reference to the accompanying drawings.
First embodiment. This embodiment is described with reference to Fig. 1 to Fig. 4. In this visual transformer-based three-dimensional voxel reconstruction method the object is reconstructed from a single view, and the method comprises:
using the single-view image to be reconstructed as the initial value of the input single-view image in the neural network;
inputting the single-view image initial value into an encoding layer based on a visual transformer module and extracting features of the image at different dimensions;
decoding the extracted image features through three-dimensional transposed convolution to obtain coarse voxel information;
and designing a three-dimensional visual transformer structure to reconstruct voxels, improving the precision of the coarse voxel information and obtaining the final voxel model of the single-view three-dimensional reconstruction.
Inputting the single-view image initial value into the encoding layer based on the visual transformer module and extracting features of the image at different dimensions is carried out as follows:
the image is converted into image blocks;
the input of the encoding layer is x ∈ R^(H×W×C), where H is the length dimension of the input information, W the width dimension, C the feature dimension to be extracted, and R denotes a real-valued matrix space;
image blocks are extracted in sliding-window order with a window of size l×l, so each image block x_p has size l×l×C; with a sliding stride of s, the total number of image blocks is E = ((H-l)/s) × ((W-l)/s);
each image block is flattened into x'_p ∈ R^(D×1), and stacking the flattened blocks gives x' ∈ R^(E×D);
a self-attention mechanism is applied to each image block:
fully connected layers produce the keys x_k, queries x_q and values x_v of the image blocks in the self-attention mechanism:
x_k = x'W_k, W_k ∈ R^(D×D'), x_k ∈ R^(E×D')
x_q = x'W_q, W_q ∈ R^(D×D'), x_q ∈ R^(E×D')
x_v = x'W_v, W_v ∈ R^(D×D'), x_v ∈ R^(E×D')
where D' is the new feature dimension and W_k, W_q and W_v are the weight matrices of the keys, queries and values;
the similarity between the queries and the keys is computed by matrix dot product to obtain the weight matrix x_w (equation shown only as an image in the original publication);
the weight matrix is normalized with the Softmax function (equation shown only as an image in the original publication), where x_w^(i,j) denotes the element in row i, column j;
the encoding layer based on the visual transformer module learns the features x_att in the image by multiplying the normalized weight matrix with the values:
x_att = x_w · x_v
The visual transformer structure not only considers the relation between each pixel in the image and the surrounding pixels, but also learns the key features within the image blocks by applying the self-attention mechanism to each block.
The process of designing the three-dimensional visual transformer structure to reconstruct voxels and improving the precision of the coarse voxel information to obtain the final voxel model of the single-view three-dimensional reconstruction is as follows:
the input of the encoding layer is x ∈ R^(C×H×W×L), where L is the height dimension of the input information;
voxel blocks are extracted with a three-dimensional sliding window and stacked along the feature dimension:
the sliding window is l×l×l, and each extracted voxel block has size x_p ∈ R^(D×1) with D = l×l×l×C;
arranging the voxel blocks in order into a matrix gives the feature information matrix of the three-dimensional voxel blocks, x' ∈ R^(N×D), where N = ((H-l)/s) × ((W-l)/s) × ((L-l)/s) is the number of three-dimensional voxel blocks;
the feature matrix of the three-dimensional transformer is obtained;
keys, queries and values are computed for the three-dimensional voxel blocks, and the feature output is finally obtained;
a voxel probability volume of size 32 × 32 × 32 is output, and the voxel model is generated at the same time.
The three-dimensional voxel reconstruction method further uses the overlap degree IoU (intersection over union) as an evaluation index (formula shown only as an image in the original publication), where p_(i,j,k) denotes the predicted voxel value at (i, j, k), gt_(i,j,k) denotes the true voxel value at (i, j, k), t denotes the voxel threshold and I(·) denotes the indicator function; a higher IoU value indicates a better reconstruction result.
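For reference, the sketch below follows the common voxel-IoU definition (the thresholded prediction intersected and united with the binary ground truth); since the patent gives its formula only as an image, treat this as an assumed, standard form.

```python
# Hedged sketch of voxel IoU evaluation with threshold t.
import torch

def voxel_iou(pred: torch.Tensor, gt: torch.Tensor, t: float = 0.4) -> float:
    """pred: predicted occupancy probabilities; gt: binary ground-truth voxels."""
    p = pred > t
    g = gt > 0.5
    intersection = (p & g).sum().item()
    union = (p | g).sum().item()
    return intersection / union if union > 0 else 1.0

iou = voxel_iou(torch.rand(32, 32, 32), torch.randint(0, 2, (32, 32, 32)).float())
print(f"IoU = {iou:.3f}")
```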
Specifically, Table 1 compares the overlap IoU of the single-view three-dimensional voxel reconstruction model with other models on the test set. The 43,736 three-dimensional models from the 13 categories provided by 3D-R2N2 are used as the data set, of which 32,700 models form the training set and the rest the test set. 3D-R2N2 also provides 24 renderings of each model from random viewpoints, and these renderings are used as the network input, giving the experimental results of the single-view three-dimensional voxel reconstruction model. The voxel threshold t is set to 0.4, and a higher IoU value indicates a better reconstruction. It can be seen that the IoU differs considerably across categories. The car and cabinet categories reconstruct well because the object structures within these categories are relatively similar, which makes it easy for the network to estimate their structure; the lamp and display categories reconstruct less well, partly because more of these objects is invisible and partly because their structures vary more, so the model cannot estimate their structure reliably.
Table 1 also compares the visual transformer-based three-dimensional voxel reconstruction model with other models in the single-view case; the images obtained at each stage of single-view reconstruction of an airplane, a car and a rifle with the method of the invention are shown in Fig. 2, Fig. 3 and Fig. 4 respectively. OGN uses an octree model to represent the target voxels, which can express higher-resolution 3D output within limited memory, but its representation is more complex and network training becomes difficult as the resolution increases. The figures show that the model reconstructs the airplane and car examples well.
Table 1. Overlap IoU of the single-view visual transformer-based voxel reconstruction model compared with other models (table shown only as an image in the original publication)
Second embodiment. This embodiment is described with reference to Fig. 5 to Fig. 7. In this visual transformer-based three-dimensional voxel reconstruction method the object is reconstructed from multiple views, and the method comprises:
inputting image information; learning features in the images with the network encoding layer; performing feature decoding with three-dimensional transposed convolution and generating coarse voxel information; learning the corresponding voxel weights from the image information, fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer to obtain the probability values of the fused voxels, and generating the voxel model.
Learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer is carried out as follows:
the set of three-dimensional features learned from the different images is S = {x_1, x_2, …, x_n}, where x_n ∈ R^(C×H×W×L) is the feature learned from the n-th image, H is the length dimension of the input information, W the width dimension, L the height dimension and C the feature dimension;
three-dimensional convolutional layers learn the attention scores C = {c_1, c_2, …, c_n}, where c_n = g(x_n), c_n ∈ R^(H×W×L) and g denotes several three-dimensional convolution operations;
the attention scores are normalized into the attention weights S = {s_1, s_2, …, s_n} with the Softmax function (equation shown only as an image in the original publication), where s_i^(h,w,l) denotes the value of the i-th weight at position (h, w, l);
the voxel probabilities generated for each image by single-view three-dimensional reconstruction are V = {v_1, v_2, …, v_n}, where v_n ∈ R^(H×W×L);
the weights are multiplied element-wise with the corresponding voxels and the results are summed to complete the fusion, giving the fully fused voxel y (equation shown only as an image in the original publication), where × denotes element-wise multiplication.
The weights are generated from the features; following the idea of the attention mechanism, a larger weight is given to the regions where an image reconstructs well, so that the voxel probability values generated from the different images can be fused better.
The probability values of the fused voxels are then obtained and the voxel model is generated; the voxel probability is given by an equation shown only as an image in the original publication, where (h, w, l) denotes the voxel position.
The parameters of the convolutional layers that extract image features and of the three-dimensional convolutional layers that generate the attention weights are shared, so when the number of input images differs the method can adaptively use the shared parameters for voxel fusion. Specifically, Table 2 compares the reconstruction overlap IoU of the model for different numbers of input images; the images obtained at each stage of multi-view reconstruction of a lamp, a mobile phone and an airplane with the method of the invention are shown in Fig. 5 to Fig. 7. The performance of the model on the test set keeps improving as the number of views increases, showing that the model uses the features of multiple images to reconstruct the voxels. Compared with the single-view result, the multi-view reconstruction result is clearly better; for the chair category, for example, adding one image raises the overlap IoU by 0.3, a pronounced improvement.
Table 2. Overlap IoU of the visual transformer-based voxel reconstruction model in the multi-view case (table shown only as an image in the original publication)
Table 3 compares the multi-view reconstruction results of the model with those of other methods. The results show that the model reconstructs the voxel model of an object well for different numbers of input images. The model is also faster at reconstruction than the 3D-R2N2 model and achieves better results with fewer parameters. Moreover, the multi-view model can adaptively complete the reconstruction task with any number of images and is more flexible than the 3D-R2N2 model.
Table 3. IoU of the visual transformer-based voxel model compared with other models under multi-view reconstruction (table shown only as images in the original publication)

Claims (9)

1. A visual transformer-based three-dimensional voxel reconstruction method, wherein the object is reconstructed from a single view, the method comprising:
using the single view to be reconstructed as the initial value of the input single-view image in the neural network;
inputting the single-view image initial value into an encoding layer based on a visual transformer module and extracting features of the image at different dimensions;
decoding the extracted image features through three-dimensional transposed convolution to obtain coarse voxel information;
and designing a three-dimensional visual transformer structure to reconstruct voxels, improving the precision of the coarse voxel information and obtaining the final voxel model of the single-view three-dimensional reconstruction.
2. The method according to claim 1, wherein inputting the single-view image initial value into the encoding layer based on the visual transformer module and extracting features of the image at different dimensions comprises:
converting the image into image blocks;
the input of the encoding layer is x ∈ R^(H×W×C), where H is the length dimension of the input information, W the width dimension, C the feature dimension to be extracted, and R denotes a real-valued matrix space;
extracting image blocks in sliding-window order with a window of size l×l, so that each image block x_p has size l×l×C; with a sliding stride of s, the total number of image blocks is E = ((H-l)/s) × ((W-l)/s);
flattening each image block into x'_p ∈ R^(D×1) and stacking the flattened blocks to give x' ∈ R^(E×D);
applying a self-attention mechanism to each image block:
fully connected layers produce the keys x_k, queries x_q and values x_v of the image blocks in the self-attention mechanism:
x_k = x'W_k, W_k ∈ R^(D×D'), x_k ∈ R^(E×D')
x_q = x'W_q, W_q ∈ R^(D×D'), x_q ∈ R^(E×D')
x_v = x'W_v, W_v ∈ R^(D×D'), x_v ∈ R^(E×D')
where D' is the new feature dimension and W_k, W_q and W_v are the weight matrices of the keys, queries and values;
the similarity between the queries and the keys is computed by matrix dot product to obtain the weight matrix x_w (equation shown only as an image in the original publication);
the weight matrix is normalized with the Softmax function (equation shown only as an image in the original publication), where x_w^(i,j) denotes the element in row i, column j;
the encoding layer based on the visual transformer module learns the features x_att in the image by multiplying the normalized weight matrix with the values:
x_att = x_w · x_v
the multi-layer perceptron of the neural network consists of several fully connected layers and a dropout layer.
3. The method according to claim 1, wherein the process of designing the three-dimensional visual transformer structure to reconstruct voxels and improving the precision of the coarse voxel information to obtain the final voxel model of the single-view three-dimensional reconstruction is as follows:
the input of the encoding layer is x ∈ R^(C×H×W×L), where L is the height dimension of the input information;
voxel blocks are extracted with a three-dimensional sliding window and stacked along the feature dimension:
the sliding window is l×l×l, and each extracted voxel block has size x_p ∈ R^(D×1) with D = l×l×l×C;
arranging the voxel blocks in order into a matrix gives the feature information matrix of the three-dimensional voxel blocks, x' ∈ R^(N×D), where N = ((H-l)/s) × ((W-l)/s) × ((L-l)/s) is the number of three-dimensional voxel blocks;
the feature matrix of the three-dimensional transformer is obtained;
keys, queries and values are computed for the three-dimensional voxel blocks, and the feature output is finally obtained;
a voxel probability volume of size 32 × 32 × 32 is output, and the voxel model is generated at the same time.
4. A visual transformer-based three-dimensional voxel reconstruction method, wherein the object is reconstructed from multiple views, the method comprising: inputting image information; learning features in the images with the network encoding layer; performing feature decoding with three-dimensional transposed convolution and generating coarse voxel information; learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer; and obtaining the probability values of the fused voxels and generating the voxel model.
5. The method according to claim 4, wherein learning the corresponding voxel weights from the image information and fusing the voxels with the network output layer according to the three-dimensional convolution of the weight-generating layer comprises:
the set of three-dimensional features learned from the different images is S = {x_1, x_2, …, x_n}, where x_n ∈ R^(C×H×W×L) is the feature learned from the n-th image, H is the length dimension of the input information, W the width dimension, L the height dimension and C the feature dimension;
three-dimensional convolutional layers learn the attention scores C = {c_1, c_2, …, c_n}, where c_n = g(x_n), c_n ∈ R^(H×W×L) and g denotes several three-dimensional convolution operations;
the attention scores are normalized into the attention weights S = {s_1, s_2, …, s_n} with the Softmax function (equation shown only as an image in the original publication), where s_i^(h,w,l) denotes the value of the i-th weight at position (h, w, l);
the voxel probabilities generated for each image by single-view three-dimensional reconstruction are V = {v_1, v_2, …, v_n}, where v_n ∈ R^(H×W×L);
the weights are multiplied element-wise with the corresponding voxels and the results are summed to complete the fusion, giving the fully fused voxel y (equation shown only as an image in the original publication), where × denotes element-wise multiplication.
6. The method according to claim 4, wherein the probability values of the fused voxels are obtained and the voxel model is generated, the voxel probability being given by an equation shown only as an image in the original publication, where (h, w, l) denotes the voxel position.
7. The visual transformer-based three-dimensional voxel reconstruction method according to claim 1 or claim 4, further comprising designing a loss function, which is a binary cross-entropy loss (shown only as an equation image in the original publication), where y_(i,j,k) denotes the true value of the voxel block at (i, j, k), taking the value 1 or 0, and p_(i,j,k) denotes the predicted voxel probability; the smaller the loss, the closer the prediction is to the true value.
8. The visual transformer-based three-dimensional voxel reconstruction method according to claim 1 or claim 4, further comprising using the overlap degree IoU as an evaluation index (formula shown only as an image in the original publication), where p_(i,j,k) denotes the predicted voxel value at (i, j, k), gt_(i,j,k) denotes the true voxel value at (i, j, k), t denotes the voxel threshold and I(·) denotes the indicator function; a higher IoU value indicates a better reconstruction result.
9. A computer device, characterized by: comprising a memory in which a computer program is stored and a processor which, when running the computer program stored in the memory, executes the visual transformer-based three-dimensional voxel reconstruction method according to any of claims 1-8.
CN202110876128.4A 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method Pending CN113658322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110876128.4A CN113658322A (en) 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110876128.4A CN113658322A (en) 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method

Publications (1)

Publication Number Publication Date
CN113658322A true CN113658322A (en) 2021-11-16

Family

ID=78478195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110876128.4A Pending CN113658322A (en) 2021-07-30 2021-07-30 Visual transform-based three-dimensional voxel reconstruction method

Country Status (1)

Country Link
CN (1) CN113658322A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114119838A (en) * 2022-01-24 2022-03-01 阿里巴巴(中国)有限公司 Voxel model and image generation method, equipment and storage medium
WO2024078049A1 (en) * 2022-10-10 2024-04-18 Shanghaitech University System and method for near real-time and unsupervised coordinate projection network for computed tomography images reconstruction
CN115619950A (en) * 2022-10-13 2023-01-17 中国地质大学(武汉) Three-dimensional geological modeling method based on deep learning
CN115619950B (en) * 2022-10-13 2024-01-19 中国地质大学(武汉) Three-dimensional geological modeling method based on deep learning
CN117496075A (en) * 2024-01-02 2024-02-02 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium
CN117496075B (en) * 2024-01-02 2024-03-22 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113658322A (en) Visual transform-based three-dimensional voxel reconstruction method
Qiu et al. Geometric back-projection network for point cloud classification
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Jam et al. A comprehensive review of past and present image inpainting methods
CN110659727B (en) Sketch-based image generation method
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN112396645A (en) Monocular image depth estimation method and system based on convolution residual learning
Bahri et al. Robust Kronecker component analysis
CN114418030A (en) Image classification method, and training method and device of image classification model
CN115526891B (en) Training method and related device for defect data set generation model
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
Bogacz et al. Period classification of 3D cuneiform tablets with geometric neural networks
CN115546032A (en) Single-frame image super-resolution method based on feature fusion and attention mechanism
CN116630183A (en) Text image restoration method based on generated type countermeasure network
CN113538662B (en) Single-view three-dimensional object reconstruction method and device based on RGB data
CN113706404B (en) Depression angle face image correction method and system based on self-attention mechanism
Zhou et al. Personalized and occupational-aware age progression by generative adversarial networks
CN113096239B (en) Three-dimensional point cloud reconstruction method based on deep learning
Hoffman et al. Probnerf: Uncertainty-aware inference of 3d shapes from 2d images
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article
CN115424310A (en) Weak label learning method for expression separation task in human face rehearsal
Rivera et al. Trilateral convolutional neural network for 3D shape reconstruction of objects from a single depth view
Liu et al. Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination