CN115731365A - Grid model reconstruction method, system, device and medium based on two-dimensional image - Google Patents

Grid model reconstruction method, system, device and medium based on two-dimensional image

Info

Publication number
CN115731365A
CN115731365A CN202211463350.2A CN202211463350A CN115731365A
Authority
CN
China
Prior art keywords
depth
model
dimensional
dimensional image
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211463350.2A
Other languages
Chinese (zh)
Inventor
柯建生
王兵
戴振军
陈学斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Pole 3d Information Technology Co ltd
Original Assignee
Guangzhou Pole 3d Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Pole 3d Information Technology Co ltd filed Critical Guangzhou Pole 3d Information Technology Co ltd
Priority to CN202211463350.2A priority Critical patent/CN115731365A/en
Publication of CN115731365A publication Critical patent/CN115731365A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a method, a system, a device and a medium for reconstructing a mesh model from a two-dimensional image, wherein the method comprises the following steps: acquiring a two-dimensional image of the surface of a target object to be reconstructed; performing bin depth prediction on the image depth interval of the two-dimensional image through a self-attention model to obtain a linear combination of depth bins, each depth bin representing a depth value interval; describing the depth values of the pixels in the two-dimensional image according to the linear combination, constructing a depth map, and constructing a model point cloud of the target object; constructing spatial voxels from the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image to obtain a signed distance field; and extracting the model surface through iso-surface extraction according to the signed distance field, and restoring the three-dimensional mesh model of the target object from the model surface. The method achieves a more accurate and reliable three-dimensional reconstruction effect with shorter reconstruction time, so that three-dimensional reconstruction can be completed quickly and with high quality, and the method can be widely applied in the technical field of computer vision.

Description

Grid model reconstruction method, system, device and medium based on two-dimensional image
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a system, a device and a medium for reconstructing a grid model based on a two-dimensional image.
Background
The three-dimensional model reconstruction method based on the deep learning and the surface reconstruction achieves the aim of restoring a three-dimensional model from an image by training a depth estimation neural network and solving a differential equation based on physics, and is widely applied to CG (computer graphics), 3DCV (three-dimensional computer vision) and AR (augmented reality). For example, in augmented reality, three-dimensional reconstruction needs to be performed accurately, consistently, and in real-time in order to achieve realistic and immersive interaction between the AR effect and the surrounding physical scene.
In the field of computer vision, three-dimensional models are typically represented by meshes (Mesh). A mesh is the collection of vertices, edges and faces that make up a three-dimensional object, and large models are typically constructed from smaller interconnected planar elements (usually triangles or quadrilaterals). In a polygonal mesh, each vertex holds x, y, z coordinate information, and each face records its vertices and the connections between them. For two-dimensional images, reconstructing a three-dimensional model directly from an image is a relatively challenging problem, since each pixel of the image carries only a viewing direction and a color.
In the related technical solutions, the three-dimensional model reconstruction framework divides the calculation flow of model reconstruction into two steps: first estimating a corresponding depth map from a color image and reconstructing a point cloud model using the estimated depth map, and then further reconstructing a surface. Depth estimation refers to finding a mapping from the color of each pixel in an image to its corresponding depth. After the depth information is acquired, the inverse projection transformation from the two-dimensional image space to the three-dimensional Point Cloud can be completed through the camera intrinsic matrix (Camera Intrinsic Matrix) that originally generated the image. In terms of depth estimation, however, the related technical solutions often use Multi-View Stereo methods for depth map estimation. This type of method relies on multiple pictures of a single object from different perspectives for recovery, and many of the estimated depth values are missing. Methods based on convolutional neural networks, in turn, cannot capture global context interaction information, and the predicted depth values are not accurate enough. In the aspect of surface reconstruction, the Poisson reconstruction method is time-consuming because it requires iterative solution, and therefore cannot be used for real-time home display.
Disclosure of Invention
In view of the above, in order to at least partially solve one of the above technical problems or disadvantages, an embodiment of the present invention provides a mesh model reconstruction method based on a two-dimensional image, which fully utilizes global context interaction information to obtain more accurate depth values; the technical scheme of the application also provides a system, a device and a medium corresponding to the method.
On one hand, the technical scheme of the application provides a mesh model reconstruction method based on a two-dimensional image, which comprises the following steps:
acquiring a two-dimensional image of the surface of a target object to be reconstructed;
performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
describing the depth value of pixels in the two-dimensional image according to the linear combination, constructing to obtain a depth map, and constructing a model point cloud of a target object according to the depth map;
constructing spatial voxels from the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between each spatial voxel and the nearest target object surface and a first weight used when updating the spatial voxel, to obtain a signed distance field;
and extracting a model surface through an iso-surface according to the signed distance field, and restoring according to the model surface to obtain the three-dimensional grid model of the target object.
In a possible embodiment of the solution of the present application, the training process of the self-attention model includes:
training through first training data to obtain an encoder in the self-attention model;
constructing and obtaining a decoder in the self-attention model according to a feature up-sampling module;
and constructing the self-attention model according to the encoder, the decoder and the self-attention module.
In a possible embodiment of the present application, performing bin depth prediction on an image depth interval of the two-dimensional image through a trained self-attention model to obtain a linear combination of depth bins, includes:
inputting the two-dimensional image into the encoder for encoding, and inputting an encoding result into the decoder for decoding to obtain the image characteristics of the two-dimensional image;
performing global attention calculation according to the image characteristics to determine the width vector of a depth box corresponding to the two-dimensional image; the width vector characterizes a resolution and local pixel level information of the two-dimensional image;
and performing convolution operation on the width vector and the image characteristics to obtain a range attention characteristic diagram, and determining the linear combination of the depth box according to the range attention characteristic diagram.
In a possible embodiment of the present disclosure, the determining, by performing global attention calculation according to the image features, a width vector of a depth bin corresponding to the two-dimensional image includes:
inputting the image features to an encoding convolution module, and obtaining a first tensor of the image features according to the kernel size, stride and number of output channels of the encoding convolution module;
flattening the first tensor according to the effective sequence length of the self-attention module to obtain a second tensor;
and performing activation operation output through an activation function in a multilayer perceptron according to the second tensor to obtain a first vector, and performing normalization processing on the first vector to obtain a width vector of the depth box.
In a possible embodiment of the present disclosure, the convolving the width vector with the image feature to obtain a range attention feature map, and determining the linear combination of the depth bins according to the range attention feature map includes:
inputting the range attention feature map into a convolution kernel for convolution operation, and performing classification prediction on the result of the convolution operation to obtain a classification prediction score value;
calculating a first probability of the center of each depth bin from the width vector of the depth bins, and determining a linear combination of the depth bins according to the first probability and the score value; the linear combination is used to describe the depth value of the pixel.
In a possible embodiment of the present disclosure, the constructing a spatial voxel according to the model point cloud by using a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field includes:
determining a first position point of the space voxel in a world coordinate system, and determining a first mapping point of the first position point in a camera coordinate system according to a camera pose matrix corresponding to depth data in the depth map;
back-projecting the first mapping point according to the camera intrinsic matrix of the camera coordinate system to obtain a second position point in the depth map;
determining a second distance between the first mapping point and the origin of the camera coordinate system, calculating a directed distance field according to the second distance and the depth value of the second position point, and determining the first distance according to the directed distance field;
calculating the first weight according to the angle between the projection ray of the first position point and the surface normal vector, and the second distance;
determining a signed distance field for the spatial voxel based on the first distance of the spatial voxel in the current frame and the first weight.
In a possible embodiment of the solution of the present application, the extracting a model surface from the signed distance field by iso-surface, and obtaining a three-dimensional mesh model of the object from the model surface by restoration, comprises:
sampling a floating-point tensor corresponding to the signed distance field at a third position point of the three-dimensional mesh model;
inputting the floating point tensor to a trained three-dimensional convolution network, and performing operation output through an activation function in the three-dimensional convolution network to obtain a three-dimensional grid model;
the training process of the three-dimensional convolution network comprises the following steps:
constructing to obtain second training data according to narrow-band image data around the surface of the object in the historical data;
and inputting the second training data into the three-dimensional convolution network to output to obtain a prediction result, constructing a loss function through a binary mask of the prediction result, and adjusting parameters of the three-dimensional convolution network according to the loss function.
On the other hand, the technical solution of the present application further provides a mesh model reconstruction system based on a two-dimensional image, the system including:
the first unit is used for acquiring a two-dimensional image of the surface of a target object to be reconstructed;
the second unit is used for performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
the third unit is used for describing the depth value of the pixel in the two-dimensional image according to the linear combination, constructing a depth map and constructing a model point cloud of a target object according to the depth map;
a fourth unit, configured to construct a spatial voxel according to the model point cloud by using a truncated signed distance function, and perform frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field;
and a fifth unit for extracting a model surface from the signed distance field through an iso-surface, and obtaining a three-dimensional mesh model of the target object by restoration according to the model surface.
On the other hand, the technical scheme of the application also provides a mesh model reconstruction device based on a two-dimensional image, and the device comprises at least one processor and at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the mesh model reconstruction method based on a two-dimensional image described in the first aspect.
On the other hand, the present technical solution also provides a storage medium, in which a processor-executable program is stored, and when the processor-executable program is executed by a processor, the processor-executable program is configured to perform the mesh model reconstruction method based on two-dimensional images according to any one of the first aspect.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
The technical scheme provides a novel depth estimation and surface reconstruction method for the three-dimensional reconstruction part of an augmented reality home display system. The method can accurately restore depth information from a color image through a self-attention-based depth estimation model, then computes a latent implicit surface from the depth information using a truncation-based signed distance function so as to construct spatial voxels, and finally extracts the surface using a marching-cubes-style method, thereby achieving a more accurate and reliable three-dimensional reconstruction effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating steps of a mesh model reconstruction method based on a two-dimensional image according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a spatial voxel constructed according to the present disclosure;
fig. 3 is a schematic diagram of a neural network model constructed in the surface extraction process according to the technical scheme of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. For the step numbers in the following embodiments, they are set for convenience of illustration only, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
As noted in the background section, the related three-dimensional reconstruction technique generally includes two parts, depth estimation and surface reconstruction, the main purpose of depth estimation is to estimate a depth map from an image and then to restore a point cloud, and the purpose of surface reconstruction is to restore a continuous mesh surface representation from a discrete point cloud.
In some related technologies, depth estimation methods include the conventional Multi-View Stereo method and methods based on convolutional neural networks. The traditional method usually requires a multi-view image set as input, models the relationship between pixel points across the views, and restores partial three-dimensional structure information through pixel matching. The model based on a convolutional neural network is an end-to-end modeling method: the feature representation of a color image is extracted through a multi-layer convolutional neural network, and the corresponding depth map is then estimated through gradient descent and back-propagation.
In addition, surface reconstruction typically uses the Poisson reconstruction method. Poisson reconstruction first adopts an adaptive spatial grid-division method (the depth of the grid is adjusted according to the density of the point cloud) and defines an octree according to the positions of the sample point set; under the assumption of uniform sampling, the indicator function is approximated by trilinear interpolation over a distance field; a Laplacian matrix is then solved iteratively; and finally the iso-surface is extracted to reconstruct the surface.
Aiming at the defects that the Multi-View Stereo method suffers from missing depth values, that convolutional-neural-network-based methods cannot capture global context interaction information and predict depth values that are not accurate enough, and that the Poisson reconstruction method requires iterative solution and is therefore time-consuming, the technical scheme of the application provides an accurate and rapid model reconstruction method for an augmented reality home display framework that reconstructs a three-dimensional model from images, and can carry out the image-point cloud-surface three-dimensional reconstruction processing from a two-dimensional image.
In a first aspect, as shown in fig. 1, the present application provides a mesh model reconstruction method based on a two-dimensional image, and the method includes steps S100 to S500:
s100, acquiring a two-dimensional image of the surface of a target object to be reconstructed;
specifically, in the embodiment, two-dimensional images of the surface of the object to be reconstructed are obtained by shooting at different angles and viewing angles of the object.
S200, performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
Specifically, in the embodiment, to improve the accuracy of depth estimation, the depth estimation problem can be converted into a classification task. The embodiment adopts an adaptive binning strategy to divide the depth interval D = (d_min, d_max) into bins. The number of bins N is fixed for a given dataset, determined by the characteristics of the different datasets, or set manually to a reasonable range. Then, the width b of each bin is calculated from the input image using an adaptive method. However, discretizing the depth interval D into bins and assigning each pixel to a single bin would result in depth discretization artifacts. To solve this problem, the embodiment predicts the final depth as a linear combination of the bin centers, enabling the model to estimate smoothly varying depth values.
S300, describing the depth value of the pixels in the two-dimensional image according to the linear combination, constructing to obtain a depth map, and constructing a model point cloud of a target object according to the depth map;
Specifically, in the embodiment, the depth map corresponding to the two-dimensional image is obtained by integrating the depth values of each pixel in the two-dimensional image, and the depth map is then converted into a model point cloud: the embodiment converts the image coordinate system into the world coordinate system to obtain the point cloud corresponding to the depth map, with the camera intrinsics used for the shooting in step S100 as the constraint condition for the transformation.
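To make this back-projection concrete, the following is a minimal NumPy sketch of lifting a depth map into a world-coordinate point cloud; the function name and the intrinsic parameters fx, fy, cx, cy and the pose matrix cam_to_world are illustrative placeholders, not values taken from the patent.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, cam_to_world=None):
    """Back-project a depth map (H, W) into a 3D point cloud.

    depth        : (H, W) array of per-pixel depth values
    fx, fy       : focal lengths of the camera intrinsic matrix
    cx, cy       : principal point of the camera intrinsic matrix
    cam_to_world : optional (4, 4) camera pose matrix; identity if omitted
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))           # pixel grid
    z = depth
    x = (u - cx) * z / fx                                     # inverse projection
    y = (v - cy) * z / fy
    points_cam = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    if cam_to_world is None:
        return points_cam
    # homogeneous transform from camera to world coordinates
    points_h = np.concatenate([points_cam, np.ones((points_cam.shape[0], 1))], axis=1)
    return (points_h @ cam_to_world.T)[:, :3]
```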
S400, constructing a space voxel according to the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the space voxel and the nearest target surface and a first weight updated by the space voxel to obtain a signed distance field;
Specifically, after obtaining the depth map and reconstructing the model point cloud, spatial voxels are constructed from the point cloud using a truncated signed distance function. In the embodiment shown in fig. 2, a large volume is first created to contain the three-dimensional model to be reconstructed; this volume completely encloses the model and is composed of many small voxels.
Where each voxel corresponds to a point in space, this point is evaluated by two quantities:
1. the distance of this voxel to the nearest surface (which may be referred to as zero crossing), denoted TSDF (x) in the example, i.e. the signed distance voxel;
2. the weight at voxel update is denoted by w in the embodiment.
In the embodiment, the TSDF values and the weight values of all voxels in the current frame may be obtained, if the current frame is the first frame, the first frame is the fusion result, otherwise, the current frame and the previous fusion result need to be fused. The new image frames may be merged into the merged frame one by one. Finally, embodiments can obtain signed distance fields with better detail and higher accuracy, and can be input to the next step of extracting the surface.
S500, extracting a model surface through an isosurface according to the signed distance field, and restoring to obtain a three-dimensional grid model of the target object according to the model surface;
specifically, after the signed distance field is obtained, the embodiment recovers the three-dimensional mesh model by extracting the model surface using iso-surface extraction. The embodiment first proposes a data-driven mesh reconstruction method (NDC) based on dual contours, in which a neural network is used to predict vertex positions, which eliminates the need for gradients in the input and takes into account the contextual information inherent in the training data.
So far, the embodiment completes the reconstruction of a three-dimensional mesh model by starting from a monocular color image, performing depth estimation and spatial voxel construction and finally using a three-dimensional convolution surface extraction method; the embodiment utilizes the self-attention model to achieve the purposes of high accuracy and high-speed depth estimation from the color image, and simultaneously utilizes a method based on truncated signed distance function and three-dimensional convolution surface extraction to carry out rapid mesh model reconstruction processing.
In some possible embodiments, the training process of the self-attention model may include steps S201 to S303:
s201, training through first training data to obtain an encoder in the self-attention model;
s202, constructing and obtaining a decoder in the self-attention model according to a feature up-sampling module;
s203, constructing and obtaining the self-attention model according to the encoder, the decoder and the self-attention module.
Specifically, in terms of model architecture, most existing models adopt the encoder-attention-decoder paradigm. In practice, however, using the self-attention model at a higher resolution helps to improve the accuracy of the model estimation. Based on this, an encoder-decoder-self-attention architecture is used in some possible implementations to accomplish this task. The model uses an EfficientNet-B5 trained on ImageNet as the encoder and a standard feature up-sampling module as the decoder; the input image passes through the encoder and decoder to produce a decoded feature map x_d of size h × w × C_d, which is then passed to the self-attention module for computation, where h denotes the height of the image, w denotes the width of the image, and C_d is the number of channels of the intermediate feature.
In some possible embodiment manners, the step S200 of performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins may include steps S210 to S230:
s210, inputting the two-dimensional image to the encoder for encoding, and inputting an encoding result to the decoder for decoding to obtain the image characteristics of the two-dimensional image;
Specifically, in an embodiment, an image is input into the encoder and decoder trained in the previous steps to obtain the decoded features, on which the self-attention module then performs its calculation.
S220, performing global attention calculation according to the image characteristics to determine the width vector of a depth box corresponding to the two-dimensional image;
Specifically, in an embodiment, when estimating the bins, estimating the sub-intervals of the depth range D in which the depth values of a given image are more likely to fall requires combining both local structural information and global distribution information. Thus, the embodiment proposes to use global attention to calculate the bin-width vector b for each input image.
More specifically, step S220 in the embodiment may further include steps S221 to S223:
S221, inputting the image features to the encoding convolution module, and obtaining a first tensor of the image features according to the kernel size, stride and number of output channels of the encoding convolution module;
s222, flattening the first tensor according to the effective sequence length of the self-attention module to obtain a second tensor;
and S223, performing activation operation output through an activation function in the multilayer perceptron according to the second tensor to obtain a first vector, and performing normalization processing on the first vector to obtain a width vector of the depth box.
Specifically, in the embodiment, the decoded features are passed through an encoding convolution module with a kernel size of p × p, a stride of p, and E output channels, giving a tensor of size h/p × w/p × E. The embodiment further flattens this tensor into a sequence of length S = hw/p², which serves as the effective sequence length of the self-attention module. The embodiment inputs this tensor into the self-attention module, which after processing outputs a series of encoded results of size S × E. The embodiment uses a multi-layer perceptron to further encode the first output embedding. The multi-layer perceptron uses the ReLU activation function and outputs an N-dimensional vector b'. Finally, the embodiment normalizes the vector b' so that its components sum to 1, obtaining the bin-width vector b:

$$b_i = \frac{b_i' + \epsilon}{\sum_{j=1}^{N}\left(b_j' + \epsilon\right)}$$

where ε = 10⁻³. This small positive ε ensures that the width of each bin is strictly positive. Normalization introduces competition between the bin widths, forcing the network to focus on the sub-intervals of D that are more relevant to the depths actually present in the image.
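For illustration only, the bin-width computation described above can be sketched in PyTorch roughly as follows; the module sizes (p, E, N), the use of nn.TransformerEncoder in place of the patent's self-attention module, and the class name BinWidthHead are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class BinWidthHead(nn.Module):
    """Sketch of the bin-width branch: patch conv -> self-attention -> MLP -> normalize."""
    def __init__(self, c_d=128, e_dim=128, p=16, n_bins=256, eps=1e-3):
        super().__init__()
        self.eps = eps
        # encoding convolution module: kernel p x p, stride p, E output channels
        self.patch_conv = nn.Conv2d(c_d, e_dim, kernel_size=p, stride=p)
        layer = nn.TransformerEncoderLayer(d_model=e_dim, nhead=4, batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=4)
        # multi-layer perceptron with ReLU activations producing the N-dimensional vector b'
        self.mlp = nn.Sequential(nn.Linear(e_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins), nn.ReLU())

    def forward(self, x_d):                       # x_d: (B, C_d, h, w) decoded features
        t = self.patch_conv(x_d)                  # (B, E, h/p, w/p)
        t = t.flatten(2).transpose(1, 2)          # (B, S, E) with S = hw / p^2
        enc = self.attention(t)                   # self-attention over the sequence
        b_prime = self.mlp(enc[:, 0, :])          # first output embedding -> b' >= 0
        b_prime = b_prime + self.eps              # keep every width strictly positive
        return b_prime / b_prime.sum(dim=1, keepdim=True)   # widths sum to 1
```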
S230, performing convolution operation on the width vector and the image features to obtain a range attention feature map, and determining a linear combination of the depth box according to the range attention feature map;
in particular, high resolution and local pixel level information can be represented simultaneously by features from the attention module, and more global information can be effectively contained. The embodiment concatenates this output from the attention module with the decoded features and then convolves them with a set of 1 × 1 convolution kernels to obtain the range attention feature map R. This is equivalent to calculating a dot product attention weight between the pixel features considered as "keys" and the self attention output as "query".
More specifically, step S230 in the embodiment may further include steps S231-S232:
s231, inputting the range attention feature map into a convolution kernel for convolution operation, and performing classification prediction on the result of the convolution operation to obtain a classification prediction score value;
s232, calculating a first probability of the center position of the depth box according to the width vector of the depth box, and determining a linear combination of the depth box according to the first probability and the fraction value; the linear combination is used for describing the depth value of the pixel;
In an embodiment, in the hybrid regression module, the range attention feature map R is input into a 1 × 1 convolution module again, and the Softmax scores p_k of the N channels, k = 1, …, N, are obtained after one Softmax operation. Then, the centers c(b) = {c(b_1), c(b_2), …, c(b_N)} of the N depth bins are calculated from the bin-width vector b:

$$c(b_i) = d_{min} + (d_{max} - d_{min})\left(\frac{b_i}{2} + \sum_{j=1}^{i-1} b_j\right)$$

Finally, the depth value of each pixel is obtained as a linear combination of the bin centers weighted by the Softmax scores at this pixel position:

$$\tilde{d} = \sum_{k=1}^{N} c(b_k)\, p_k$$
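A small sketch of this hybrid regression step, written directly from the two formulas above; the tensor shapes and the function name depth_from_bins are illustrative assumptions.

```python
import torch

def depth_from_bins(b, softmax_scores, d_min, d_max):
    """Combine bin widths and per-pixel Softmax scores into a depth map.

    b              : (B, N) normalized bin widths, summing to 1 per image
    softmax_scores : (B, N, H, W) per-pixel Softmax scores over the N bins
    d_min, d_max   : bounds of the depth interval D
    """
    # bin centers: c(b_i) = d_min + (d_max - d_min) * (b_i / 2 + sum_{j<i} b_j)
    edges = torch.cumsum(b, dim=1)
    centers = d_min + (d_max - d_min) * (edges - b / 2)       # (B, N)
    centers = centers.view(centers.shape[0], centers.shape[1], 1, 1)
    # depth = sum_k c(b_k) * p_k at every pixel
    return (centers * softmax_scores).sum(dim=1)              # (B, H, W)
```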
in some possible embodiments, in an embodiment, the step S400 of constructing a spatial voxel from the model point cloud by using a truncated signed distance function, and performing frame fusion on the two-dimensional image to obtain a signed distance field according to a first distance between the spatial voxel and a nearest surface of the object and an updated first weight of the spatial voxel may include steps S410-S450:
s410, determining a first position point of the space voxel in a world coordinate system, and determining a first mapping point of the first position point in a camera coordinate system according to a camera pose matrix corresponding to depth data in the depth map;
in particular, in an embodiment, a traversal is first performed for the constructed spatial voxels. Taking a three-dimensional position point p of a voxel in a world coordinate system as an example; that is, the position point p is marked as a first position point, and in the embodiment, the mapping point v of the point p in the world coordinate system in the camera coordinate system can be obtained from the camera pose matrix of the depth data, which is the first mapping point.
S420, performing back projection on the first mapping point according to the camera internal reference matrix of the camera coordinate system to obtain a second position point in the depth map;
In the embodiment, the corresponding pixel point x in the depth image is solved by back-projecting the point v through the camera intrinsic matrix, obtaining the second position point; the depth value of the pixel point x is Value(x).
S430, determining a second distance between the first mapping point and the origin of the camera coordinate system, calculating a directed distance field according to the second distance and the depth value of the second position point, and determining the first distance according to the directed distance field;
Specifically, in the embodiment, the depth value of the pixel point x is Value(x) and the distance from the point v to the origin of the camera coordinate system is Distance(v); the SDF value of p is then SDF(p) = Value(x) - Distance(v). A truncation distance u is introduced to reduce performance consumption; within the truncation distance u, the TSDF is calculated as

$$TSDF(p) = \frac{SDF(p)}{u}$$

otherwise, TSDF(p) = 1 if SDF(p) > 0 and TSDF(p) = -1 if SDF(p) < 0.
S440, calculating to obtain the first weight according to the projection light of the first position point, the included angle of the surface normal vector and the second distance;
Specifically, in the embodiment, the calculation formula of the first weight W(p) is

$$W(p) = \frac{\cos\theta}{Distance(v)}$$

where θ is the angle between the projection ray and the surface normal vector. The TSDF values and weight values of all voxels of the current frame can be calculated through step S440.
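Steps S410 to S440 can be sketched for a single voxel as follows; the intrinsic matrix K, the world-to-camera pose and the cosine approximation used for the weight are assumptions for illustration, since the patent computes the angle against the surface normal.

```python
import numpy as np

def voxel_tsdf_and_weight(p_world, depth_map, K, world_to_cam, trunc_u):
    """Return (TSDF(p), W(p)) for one voxel centre p in world coordinates."""
    # first mapping point: voxel centre in camera coordinates
    v = (world_to_cam @ np.append(p_world, 1.0))[:3]
    if v[2] <= 0:                                    # behind the camera
        return None
    # second position point: project into the depth image with the intrinsics
    u_px = K @ (v / v[2])
    x, y = int(round(u_px[0])), int(round(u_px[1]))
    if not (0 <= y < depth_map.shape[0] and 0 <= x < depth_map.shape[1]):
        return None
    value_x = depth_map[y, x]                        # Value(x)
    distance_v = np.linalg.norm(v)                   # Distance(v), the second distance
    sdf = value_x - distance_v                       # SDF(p) = Value(x) - Distance(v)
    tsdf = np.clip(sdf / trunc_u, -1.0, 1.0)         # truncate to [-1, 1]
    # weight cos(theta) / Distance(v); here theta is approximated by the angle of the
    # viewing ray to the optical axis instead of the true surface normal
    cos_theta = v[2] / distance_v
    weight = cos_theta / distance_v
    return tsdf, weight
```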
S450, according to the first distance of the space voxel in the current frameAnd the first weight determines a signed distance field of the spatial voxel; if the current frame is the first frame, the first frame is the fusion result, otherwise, the current frame and the previous fusion result are required to be fused. Example TSDF Fuse(p) Fused TSDF values, W, as voxel p Fuse(p) To fuse weight values, TSDF Cur (p) TSDF value, W, for the current frame of voxel p Cur And (p) is the weight value of the current image frame. Embodiments may pass TSDF Cur (p) updating TSDF Fuse(p) . Wherein TSDF Fuse(p) The following calculation formula is satisfied:
Figure BDA0003956282500000103
W(p)=W Fuse (p)+W Cur (p)
In particular embodiments, the new image frame can be merged into the fused result through TSDF_Fuse(p) and W(p). Finally, the embodiment obtains a signed distance field with better detail and higher accuracy, which can be input to the next step of extracting the surface.
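The frame-fusion update of step S450 amounts to a running weighted average, sketched below under the assumption that per-frame TSDF and weight volumes have already been computed as above.

```python
import numpy as np

def fuse_frame(tsdf_fuse, w_fuse, tsdf_cur, w_cur):
    """Running weighted average of TSDF volumes over frames.

    All arguments are arrays of the same shape (one entry per voxel).
    Voxels not observed in the current frame should carry w_cur = 0.
    """
    w_new = w_fuse + w_cur
    tsdf_new = np.where(
        w_new > 0,
        (w_fuse * tsdf_fuse + w_cur * tsdf_cur) / np.maximum(w_new, 1e-8),
        tsdf_fuse,                                   # keep the old value if nothing observed
    )
    return tsdf_new, w_new
```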
In some possible embodiments, in step S500, extracting a model surface from the signed distance field by iso-surface, and obtaining a three-dimensional mesh model of the object by restoring from the model surface, may include steps S510-S520:
S510, sampling a floating-point tensor corresponding to the signed distance field at a third position point of the three-dimensional mesh model;
s520, inputting the floating point tensor to the trained three-dimensional convolution network, and performing operation output through an activation function in the three-dimensional convolution network to obtain the three-dimensional grid model;
in particular, in embodiments, a dual-contour-based data-driven mesh reconstruction method (NDC) is proposed, which uses neural networks to predict vertex positions, which eliminates the need for gradients in the input, and takes into account the contextual information inherent in the training data. The formula for NDC is as follows:
$$V = f_\theta(I)$$

where I represents the input signed distance field and θ represents the learnable parameters. The model f_v of the embodiment first samples the distance field representation Φ at the grid vertices X as a floating-point tensor of shape |X|; S denotes the grid vertex signs, V denotes the predicted mesh vertices, G denotes the discretized grid, and F denotes the bi-directional surface, which is created only when a lattice edge connects lattice vertices of opposite signs. The embodiment then uses a three-dimensional convolutional network to process this tensor. The three-dimensional convolutional network has 6 layers in total; the convolution kernel size of the first 3 layers is 3³ and that of the last 3 layers is 1³, giving a total receptive field of 7³. The embodiment employs hidden layers with a small width of 64 channels to improve the computational efficiency of the network. Sigmoid is used as the activation function on the output layer and Leaky ReLU elsewhere.
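A rough PyTorch sketch of such a 6-layer three-dimensional convolutional network is given below; the input and output channel counts and the padding scheme are assumptions, not the exact network of the embodiment.

```python
import torch
import torch.nn as nn

class VertexNet3D(nn.Module):
    """6-layer 3D CNN: three 3x3x3 layers then three 1x1x1 layers (receptive field 7^3)."""
    def __init__(self, in_channels=1, hidden=64, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, hidden, kernel_size=3, padding=1), nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),      nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),      nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=1),                 nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=1),                 nn.LeakyReLU(),
            nn.Conv3d(hidden, out_channels, kernel_size=1),
        )

    def forward(self, phi):                 # phi: (B, 1, D, H, W) sampled distance field
        # Sigmoid on the output layer keeps each predicted vertex offset inside its cell
        return torch.sigmoid(self.net(phi))
```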
The training process of the three-dimensional convolution network comprises steps S511-S512:
s511, constructing to obtain second training data according to image data of narrow bands around the object surface in the historical data;
s512, inputting the second training data into the three-dimensional convolution network to output to obtain a prediction result, constructing a loss function through a binary mask of the prediction result, and adjusting parameters of the three-dimensional convolution network according to the loss function;
Specifically, in the embodiment, the neural network is structured as shown in fig. 3. During model training, the embodiment uses the input data to supervise the predictions of the network in a narrow band around the input surface; the embodiment uses binary masks M_S and M_V to evaluate whether a location lies within the narrow band (1 if it does, 0 otherwise), because the surface can only lie near the zero crossing of the signed distance field. The network is trained with an L2 loss against the ground-truth vertices:

$$L_V = \left\| M_V \odot \left( V - V_{gt} \right) \right\|_2^2$$

where ⊙ denotes the Hadamard product and V_gt represents the ground-truth mesh vertices in the training set; in this way a neural network mapping from the signed distance field to the mesh surface can be trained.
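As an illustration, the masked vertex loss above can be written as the short function below; the tensor shapes and the name mask_v for the binary mask M_V are assumptions.

```python
import torch

def masked_vertex_loss(v_pred, v_gt, mask_v):
    """L2 loss on predicted vertex positions, restricted to the narrow band M_V."""
    diff = mask_v * (v_pred - v_gt)      # Hadamard product with the binary mask
    # the loss above is the squared L2 norm; dividing by mask_v.sum() would
    # additionally average it over the narrow band
    return (diff ** 2).sum()
```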
On the other hand, the technical solution of the present application further provides a mesh model reconstruction system based on a two-dimensional image, the system including:
the first unit is used for acquiring a two-dimensional image of the surface of a target object to be reconstructed;
the second unit is used for performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
the third unit is used for describing the depth value of the pixel in the two-dimensional image according to the linear combination, constructing a depth map and constructing a model point cloud of a target object according to the depth map;
a fourth unit, configured to construct a spatial voxel according to the model point cloud by using a truncated signed distance function, and perform frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field;
and a fifth unit for extracting a model surface from the signed distance field through an iso-surface, and obtaining a three-dimensional mesh model of the target object by restoration according to the model surface.
On the other hand, the technical solution of the present application further provides a mesh model reconstruction device based on a two-dimensional image, the device including: at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the two-dimensional image based mesh model reconstruction method according to the first aspect.
An embodiment of the present invention further provides a storage medium, where a corresponding execution program is stored, and the program is executed by a processor, so as to implement the mesh model reconstruction method based on a two-dimensional image in the first aspect.
From the above specific implementation process, it can be concluded that, compared with the prior art, the technical solution provided by the present invention has the following advantages or benefits:
Compared with the traditional method, the depth estimation in the first part of the technical scheme of the application introduces a global receptive field and can better aggregate context information and non-local features, so that the depth estimation result is more accurate than that of existing methods. The second and third parts of the invention introduce a truncation-based signed distance function and a novel three-dimensional convolutional surface-extraction method, which have the advantage of quickly reconstructing the iso-surface from the point cloud.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The grid model reconstruction method based on the two-dimensional image is characterized by comprising the following steps of:
acquiring a two-dimensional image of the surface of a target object to be reconstructed;
performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
describing the depth value of pixels in the two-dimensional image according to the linear combination, constructing to obtain a depth map, and constructing a model point cloud of a target object according to the depth map;
constructing a space voxel according to the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the space voxel and the nearest target object surface and a first weight updated by the space voxel to obtain a signed distance field;
and extracting a model surface through an iso-surface according to the signed distance field, and restoring according to the model surface to obtain the three-dimensional grid model of the target object.
2. The two-dimensional image-based mesh model reconstruction method according to claim 1, wherein the training process of the self-attention model comprises:
training through first training data to obtain an encoder in the self-attention model;
constructing and obtaining a decoder in the self-attention model according to a feature up-sampling module;
and constructing the self-attention model according to the encoder, the decoder and the self-attention module.
3. The mesh model reconstruction method based on two-dimensional image according to claim 2, wherein the step of performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins comprises:
inputting the two-dimensional image into the encoder for encoding, and inputting an encoding result into the decoder for decoding to obtain the image characteristics of the two-dimensional image;
performing global attention calculation according to the image characteristics to determine the width vector of a depth box corresponding to the two-dimensional image; and performing convolution operation on the width vector and the image characteristics to obtain a range attention characteristic diagram, and determining the linear combination of the depth box according to the range attention characteristic diagram.
4. The method for reconstructing a mesh model based on a two-dimensional image according to claim 3, wherein the determining the width vector of the depth box corresponding to the two-dimensional image by performing the global attention calculation according to the image features comprises:
inputting the image features to an encoding convolution module, and obtaining a first tensor of the image features according to the kernel size, stride and number of output channels of the encoding convolution module;
flattening the first tensor according to the effective sequence length of the self-attention module to obtain a second tensor;
and performing activation operation output through an activation function in a multilayer perceptron according to the second tensor to obtain a first vector, and performing normalization processing on the first vector to obtain a width vector of the depth box.
5. The two-dimensional image-based mesh model reconstruction method according to claim 3, wherein the convolving the width vectors with the image features to obtain a range attention feature map, and determining the linear combination of the depth bins according to the range attention feature map comprises:
inputting the range attention feature map into a convolution kernel for convolution operation, and performing classification prediction on the result of the convolution operation to obtain a classification prediction score value;
calculating a first probability of the center position of each depth bin according to the width vector of the depth bins, and determining a linear combination of the depth bins according to the first probability and the score value; the linear combination is used to describe the depth value of the pixel.
6. The method of claim 1, wherein constructing a spatial voxel from the model point cloud by the truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest surface of the object and an updated first weight of the spatial voxel to obtain a signed distance field comprises:
determining a first position point of the space voxel in a world coordinate system, and determining a first mapping point of the first position point in a camera coordinate system according to a camera pose matrix corresponding to depth data in the depth map;
back-projecting the first mapping point according to the camera intrinsic matrix of the camera coordinate system to obtain a second position point in the depth map;
determining a second distance between the first mapping point and the origin of the camera coordinate system, calculating a directed distance field according to the second distance and the depth value of the second position point, and determining the first distance according to the directed distance field;
calculating to obtain the first weight according to the projection light of the first position point, the included angle of the surface normal vector and the second distance;
a signed distance field for the spatial voxel is determined based on the first distance of the spatial voxel in the current frame and the first weight.
7. The method of claim 1, wherein the extracting a model surface from the signed distance field by iso-surface and reconstructing the three-dimensional mesh model of the object from the model surface comprises:
sampling a floating-point tensor corresponding to the signed distance field at a third position point of the three-dimensional mesh model;
inputting the floating point tensor to a trained three-dimensional convolution network, and performing operation output through an activation function in the three-dimensional convolution network to obtain a three-dimensional grid model;
the training process of the three-dimensional convolution network comprises the following steps:
constructing to obtain second training data according to narrow-band image data around the surface of the object in the historical data;
and inputting the second training data into the three-dimensional convolution network to output to obtain a prediction result, constructing a loss function through a binary mask of the prediction result, and adjusting parameters of the three-dimensional convolution network according to the loss function.
8. A mesh model reconstruction system based on two-dimensional images is characterized by comprising:
the first unit is used for acquiring a two-dimensional image of the surface of a target object to be reconstructed;
the second unit is used for performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
the third unit is used for describing the depth value of the pixel in the two-dimensional image according to the linear combination, constructing a depth map and constructing a model point cloud of a target object according to the depth map;
a fourth unit, configured to construct a spatial voxel according to the model point cloud by using a truncated signed distance function, and perform frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field;
and a fifth unit for extracting a model surface from the signed distance field through an iso-surface, and obtaining a three-dimensional mesh model of the target object by restoration according to the model surface.
9. A mesh model reconstruction apparatus based on two-dimensional images, the apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the method for two-dimensional image based mesh model reconstruction according to any one of claims 1-7.
10. A storage medium having stored therein a processor-executable program, which when executed by a processor is adapted to execute the method of two-dimensional image based mesh model reconstruction according to any one of claims 1-7.
CN202211463350.2A 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image Pending CN115731365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211463350.2A CN115731365A (en) 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211463350.2A CN115731365A (en) 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image

Publications (1)

Publication Number Publication Date
CN115731365A true CN115731365A (en) 2023-03-03

Family

ID=85297145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211463350.2A Pending CN115731365A (en) 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image

Country Status (1)

Country Link
CN (1) CN115731365A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223387A (en) * 2019-05-17 2019-09-10 武汉奥贝赛维数码科技有限公司 A kind of reconstructing three-dimensional model technology based on deep learning
CN112288011A (en) * 2020-10-30 2021-01-29 闽江学院 Image matching method based on self-attention deep neural network
WO2022165722A1 (en) * 2021-02-04 2022-08-11 华为技术有限公司 Monocular depth estimation method, apparatus and device
CN113222033A (en) * 2021-05-19 2021-08-06 北京数研科技发展有限公司 Monocular image estimation method based on multi-classification regression model and self-attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BHAT S F, ALHASHIM I AND WONKA P: "AdaBins: Depth Estimation Using Adaptive Bins", IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
CHEN Z Q, TAGLIASACCHI A, FUNKHOUSER T, ET AL: "Neural Dual Contouring", ACM Trans. Graph. *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797726A (en) * 2023-05-20 2023-09-22 北京大学 Organ three-dimensional reconstruction method, device, electronic equipment and storage medium
CN116797726B (en) * 2023-05-20 2024-05-07 北京大学 Organ three-dimensional reconstruction method, device, electronic equipment and storage medium
CN116740681A (en) * 2023-08-10 2023-09-12 小米汽车科技有限公司 Target detection method, device, vehicle and storage medium
CN116740681B (en) * 2023-08-10 2023-11-21 小米汽车科技有限公司 Target detection method, device, vehicle and storage medium
CN117496075A (en) * 2024-01-02 2024-02-02 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium
CN117496075B (en) * 2024-01-02 2024-03-22 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
Insafutdinov et al. Unsupervised learning of shape and pose with differentiable point clouds
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN115731365A (en) Grid model reconstruction method, system, device and medium based on two-dimensional image
CN111986307A (en) 3D object reconstruction using photometric grid representation
Zhang et al. Critical regularizations for neural surface reconstruction in the wild
CN114782634B (en) Monocular image dressing human body reconstruction method and system based on surface hidden function
CN113313828B (en) Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition
JP7294788B2 (en) Classification of 2D images according to the type of 3D placement
US20220327730A1 (en) Method for training neural network, system for training neural network, and neural network
CN114742966A (en) Three-dimensional scene reconstruction method and device based on image
JP2022036023A (en) Variation auto encoder for outputting 3d model
CN115272437A (en) Image depth estimation method and device based on global and local features
KR20210058638A (en) Apparatus and method for image processing
Huang et al. A bayesian approach to multi-view 4d modeling
Chen et al. Multi-view Pixel2Mesh++: 3D reconstruction via Pixel2Mesh with more images
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction
CN113989441A (en) Three-dimensional cartoon model automatic generation method and system based on single face image
CN117274446A (en) Scene video processing method, device, equipment and storage medium
Ehret et al. Regularization of NeRFs using differential geometry
CN115841546A (en) Scene structure associated subway station multi-view vector simulation rendering method and system
Hu et al. 3D map reconstruction using a monocular camera for smart cities
US20230145498A1 (en) Image reprojection and multi-image inpainting based on geometric depth parameters
EP3958167B1 (en) A method for training a neural network to deliver the viewpoints of objects using unlabeled pairs of images, and the corresponding system
Ye et al. Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement
CN113808275A (en) Single-image three-dimensional reconstruction method based on GCN and topology modification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination