CN115731365A - Grid model reconstruction method, system, device and medium based on two-dimensional image - Google Patents

Grid model reconstruction method, system, device and medium based on two-dimensional image

Info

Publication number
CN115731365A
CN115731365A CN202211463350.2A CN202211463350A CN115731365A
Authority
CN
China
Prior art keywords
depth
model
dimensional
dimensional image
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211463350.2A
Other languages
Chinese (zh)
Inventor
柯建生
王兵
戴振军
陈学斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Pole 3d Information Technology Co ltd
Original Assignee
Guangzhou Pole 3d Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Pole 3d Information Technology Co ltd filed Critical Guangzhou Pole 3d Information Technology Co ltd
Priority to CN202211463350.2A priority Critical patent/CN115731365A/en
Publication of CN115731365A publication Critical patent/CN115731365A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a method, a system, a device and a medium for reconstructing a mesh model from a two-dimensional image, wherein the method comprises the following steps: acquiring a two-dimensional image of the surface of a target object to be reconstructed; performing bin depth prediction on the image depth interval of the two-dimensional image through a self-attention model to obtain a linear combination of depth bins, each depth bin representing a depth value interval; describing the depth values of the pixels in the two-dimensional image according to the linear combination, constructing a depth map, and constructing a model point cloud of the target object; constructing spatial voxels from the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image to obtain a signed distance field; and extracting the model surface through iso-surface extraction according to the signed distance field, and restoring the three-dimensional mesh model of the target object from the model surface. The method achieves a more accurate and reliable three-dimensional reconstruction effect with shorter reconstruction time, so that three-dimensional reconstruction can be completed quickly and with high quality, and the method can be widely applied in the technical field of computer vision.

Description

Grid model reconstruction method, system, device and medium based on two-dimensional image
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a system, a device and a medium for reconstructing a grid model based on a two-dimensional image.
Background
The three-dimensional model reconstruction method based on the deep learning and the surface reconstruction achieves the aim of restoring a three-dimensional model from an image by training a depth estimation neural network and solving a differential equation based on physics, and is widely applied to CG (computer graphics), 3DCV (three-dimensional computer vision) and AR (augmented reality). For example, in augmented reality, three-dimensional reconstruction needs to be performed accurately, consistently, and in real-time in order to achieve realistic and immersive interaction between the AR effect and the surrounding physical scene.
In the field of computer vision, three-dimensional models are typically represented by meshes (Mesh). A mesh is the collection of vertices, edges and faces that make up a three-dimensional object, and large models are typically constructed from smaller interconnected planar elements (usually triangles or quadrilaterals). In a polygonal mesh, each vertex holds x, y, z coordinate information, and each face records its vertices and the connections between them. For two-dimensional images, reconstructing a three-dimensional model directly from an image is a relatively challenging problem, since each pixel of the image carries only a viewing direction and a color.
In the related technical solutions, the three-dimensional model reconstruction framework divides the calculation flow of model reconstruction into two steps: first estimating a corresponding depth map from a color image and reconstructing a point cloud model using the estimated depth map, and then further reconstructing a surface. Depth estimation refers to finding a mapping from the color of each pixel in an image to its corresponding depth. After the depth information is acquired, the inverse projection transformation from the two-dimensional image space to the three-dimensional Point Cloud can be completed through the camera intrinsic matrix (Camera Intrinsic Matrix) that originally generated the image. In terms of depth estimation, however, the related technical solutions often use Multi-View Stereo methods for depth map estimation. This type of method relies on multiple pictures of a single object from different perspectives for recovery, and many of the estimated depth values are missing. Methods based on convolutional neural networks, in turn, cannot capture global context interaction information, and the predicted depth values are not accurate enough. In the aspect of surface reconstruction, the Poisson reconstruction method is time-consuming because it requires iterative solution, and therefore cannot be used for real-time home display.
Disclosure of Invention
In view of the above, in order to at least partially solve one of the above technical problems or disadvantages, an embodiment of the present invention provides a mesh model reconstruction method based on a two-dimensional image, which fully utilizes global context interaction information to obtain more accurate depth values; the technical scheme of the application also provides a system, a device and a medium corresponding to the method.
On one hand, the technical scheme of the application provides a mesh model reconstruction method based on a two-dimensional image, which comprises the following steps:
acquiring a two-dimensional image of the surface of a target object to be reconstructed;
performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
describing the depth value of pixels in the two-dimensional image according to the linear combination, constructing to obtain a depth map, and constructing a model point cloud of a target object according to the depth map;
constructing spatial voxels from the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between each spatial voxel and the nearest target object surface and a first weight used when updating the spatial voxel, to obtain a signed distance field;
and extracting a model surface through an iso-surface according to the signed distance field, and restoring according to the model surface to obtain the three-dimensional grid model of the target object.
In a possible embodiment of the solution of the present application, the training process of the self-attention model includes:
training through first training data to obtain an encoder in the self-attention model;
constructing and obtaining a decoder in the self-attention model according to a feature up-sampling module;
and constructing the self-attention model according to the encoder, the decoder and the self-attention module.
In a possible embodiment of the present application, performing bin depth prediction on an image depth interval of the two-dimensional image through a trained self-attention model to obtain a linear combination of depth bins, includes:
inputting the two-dimensional image into the encoder for encoding, and inputting an encoding result into the decoder for decoding to obtain the image characteristics of the two-dimensional image;
performing global attention calculation according to the image characteristics to determine the width vector of a depth box corresponding to the two-dimensional image; the width vector characterizes a resolution and local pixel level information of the two-dimensional image;
and performing convolution operation on the width vector and the image characteristics to obtain a range attention characteristic diagram, and determining the linear combination of the depth box according to the range attention characteristic diagram.
In a possible embodiment of the present disclosure, the determining, by performing global attention calculation according to the image features, a width vector of a depth bin corresponding to the two-dimensional image includes:
inputting the image features to an encoding convolution module, and obtaining a first tensor of the image features according to the kernel size, stride and number of output channels of the encoding convolution module;
flattening the first tensor according to the effective sequence length of the self-attention module to obtain a second tensor;
and performing activation operation output through an activation function in a multilayer perceptron according to the second tensor to obtain a first vector, and performing normalization processing on the first vector to obtain a width vector of the depth box.
In a possible embodiment of the present disclosure, the convolving the width vector with the image feature to obtain a range attention feature map, and determining the linear combination of the depth bins according to the range attention feature map includes:
inputting the range attention feature map into a convolution kernel for convolution operation, and performing classification prediction on the result of the convolution operation to obtain a classification prediction score value;
calculating a first probability of the center of each depth bin from the width vector of the depth bins, and determining a linear combination of the depth bins according to the first probability and the score value; the linear combination is used to describe the depth value of the pixel.
In a possible embodiment of the present disclosure, the constructing a spatial voxel according to the model point cloud by using a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field includes:
determining a first position point of the space voxel in a world coordinate system, and determining a first mapping point of the first position point in a camera coordinate system according to a camera pose matrix corresponding to depth data in the depth map;
back-projecting the first mapping point according to the camera intrinsic matrix of the camera coordinate system to obtain a second position point in the depth map;
determining a second distance between the first mapping point and the origin of the camera coordinate system, calculating a directed distance field according to the second distance and the depth value of the second position point, and determining the first distance according to the directed distance field;
calculating the first weight according to the angle between the projection ray of the first position point and the surface normal vector, and the second distance;
determining a signed distance field for the spatial voxel based on the first distance of the spatial voxel in the current frame and the first weight.
In a possible embodiment of the solution of the present application, the extracting a model surface from the signed distance field by iso-surface, and obtaining a three-dimensional mesh model of the object from the model surface by restoration, comprises:
sampling a floating-point tensor corresponding to the signed distance field at a third position point of the three-dimensional mesh model;
inputting the floating point tensor to a trained three-dimensional convolution network, and performing operation output through an activation function in the three-dimensional convolution network to obtain a three-dimensional grid model;
the training process of the three-dimensional convolution network comprises the following steps:
constructing to obtain second training data according to narrow-band image data around the surface of the object in the historical data;
and inputting the second training data into the three-dimensional convolution network to output to obtain a prediction result, constructing a loss function through a binary mask of the prediction result, and adjusting parameters of the three-dimensional convolution network according to the loss function.
On the other hand, the technical solution of the present application further provides a mesh model reconstruction system based on a two-dimensional image, the system including:
the first unit is used for acquiring a two-dimensional image of the surface of a target object to be reconstructed;
the second unit is used for performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
the third unit is used for describing the depth value of the pixel in the two-dimensional image according to the linear combination, constructing a depth map and constructing a model point cloud of a target object according to the depth map;
a fourth unit, configured to construct a spatial voxel according to the model point cloud by using a truncated signed distance function, and perform frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field;
and a fifth unit for extracting a model surface from the signed distance field through an iso-surface, and obtaining a three-dimensional mesh model of the target object by restoration according to the model surface.
On the other hand, the technical scheme of the application also provides a mesh model reconstruction device based on a two-dimensional image, and the device comprises at least one processor and at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the mesh model reconstruction method based on a two-dimensional image described in the first aspect.
On the other hand, the present technical solution also provides a storage medium, in which a processor-executable program is stored, and when the processor-executable program is executed by a processor, the processor-executable program is configured to perform the mesh model reconstruction method based on two-dimensional images according to any one of the first aspect.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
The technical scheme provides a novel depth estimation and surface reconstruction method for the three-dimensional reconstruction part of an augmented reality home display system. The method can accurately restore depth information from a color image through a self-attention-based depth estimation model, then computes a latent implicit surface from the depth information using a truncation-based signed distance function so as to construct spatial voxels, and finally extracts the surface using a marching-cubes-style method, thereby achieving a more accurate and reliable three-dimensional reconstruction effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating steps of a mesh model reconstruction method based on a two-dimensional image according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a spatial voxel constructed according to the present disclosure;
fig. 3 is a schematic diagram of a neural network model constructed in the surface extraction process according to the technical scheme of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. For the step numbers in the following embodiments, they are set for convenience of illustration only, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
As noted in the background section, the related three-dimensional reconstruction technique generally includes two parts, depth estimation and surface reconstruction, the main purpose of depth estimation is to estimate a depth map from an image and then to restore a point cloud, and the purpose of surface reconstruction is to restore a continuous mesh surface representation from a discrete point cloud.
In some related technologies, depth estimation methods include the conventional Multi-View Stereo method and methods based on convolutional neural networks. The traditional method usually requires a multi-view image set as input, models the relationship between pixel points across the views, and restores partial three-dimensional structure information through pixel matching. The model based on a convolutional neural network is an end-to-end modeling method: the feature representation of a color image is extracted through a multi-layer convolutional neural network, and the corresponding depth map is then estimated through gradient descent and back-propagation.
In addition, surface reconstruction typically uses the Poisson reconstruction method. Poisson reconstruction first adopts an adaptive spatial grid-division method (the depth of the grid is adjusted according to the density of the point cloud) and defines an octree according to the positions of the sample point set; under the assumption of uniform sampling, the indicator function is approximated by trilinear interpolation over a distance field; a Laplacian matrix is then solved iteratively; and finally the iso-surface is extracted to reconstruct the surface.
Aiming at the defects that the Multi-View Stereo method suffers from missing depth values, that convolutional-neural-network-based methods cannot capture global context interaction information and predict depth values that are not accurate enough, and that the Poisson reconstruction method requires iterative solution and is therefore time-consuming, the technical scheme of the application provides an accurate and rapid model reconstruction method for an augmented reality home display framework that reconstructs a three-dimensional model from images, and can carry out the image-point cloud-surface three-dimensional reconstruction processing from a two-dimensional image.
In a first aspect, as shown in fig. 1, the present application provides a mesh model reconstruction method based on a two-dimensional image, and the method includes steps S100 to S500:
s100, acquiring a two-dimensional image of the surface of a target object to be reconstructed;
specifically, in the embodiment, two-dimensional images of the surface of the object to be reconstructed are obtained by shooting at different angles and viewing angles of the object.
S200, performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
Specifically, in the embodiment, to improve the accuracy of depth estimation, the depth estimation problem can be converted into a classification task. The embodiment adopts an adaptive binning strategy to divide the depth interval D = (d_min, d_max) into bins. The number of bins N is fixed for a given dataset, determined by the characteristics of the different datasets, or set manually to a reasonable range. Then, the width b of each bin is calculated from the input image using an adaptive method. However, discretizing the depth interval D into bins and assigning each pixel to a single bin would result in depth discretization artifacts. To solve this problem, the embodiment predicts the final depth as a linear combination of the bin centers, enabling the model to estimate smoothly varying depth values.
S300, describing the depth value of the pixels in the two-dimensional image according to the linear combination, constructing to obtain a depth map, and constructing a model point cloud of a target object according to the depth map;
Specifically, in the embodiment, the depth map corresponding to the two-dimensional image is obtained by integrating the depth values of each pixel in the two-dimensional image, and the depth map is then converted into a model point cloud: the embodiment converts the image coordinate system into the world coordinate system to obtain the point cloud corresponding to the depth map, with the camera intrinsics used for the shooting in step S100 as the constraint condition for the transformation.
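To make this back-projection concrete, the following is a minimal NumPy sketch of lifting a depth map into a world-coordinate point cloud; the function name and the intrinsic parameters fx, fy, cx, cy and the pose matrix cam_to_world are illustrative placeholders, not values taken from the patent.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, cam_to_world=None):
    """Back-project a depth map (H, W) into a 3D point cloud.

    depth        : (H, W) array of per-pixel depth values
    fx, fy       : focal lengths of the camera intrinsic matrix
    cx, cy       : principal point of the camera intrinsic matrix
    cam_to_world : optional (4, 4) camera pose matrix; identity if omitted
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))           # pixel grid
    z = depth
    x = (u - cx) * z / fx                                     # inverse projection
    y = (v - cy) * z / fy
    points_cam = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    if cam_to_world is None:
        return points_cam
    # homogeneous transform from camera to world coordinates
    points_h = np.concatenate([points_cam, np.ones((points_cam.shape[0], 1))], axis=1)
    return (points_h @ cam_to_world.T)[:, :3]
```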
S400, constructing a space voxel according to the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the space voxel and the nearest target surface and a first weight updated by the space voxel to obtain a signed distance field;
Specifically, after obtaining the depth map and reconstructing the model point cloud, spatial voxels are constructed from the point cloud using a truncated signed distance function. In the embodiment shown in fig. 2, a large volume is first created to contain the three-dimensional model to be reconstructed; this volume completely encloses the model and is composed of many small voxels.
Where each voxel corresponds to a point in space, this point is evaluated by two quantities:
1. the distance of this voxel to the nearest surface (which may be referred to as zero crossing), denoted TSDF (x) in the example, i.e. the signed distance voxel;
2. the weight at voxel update is denoted by w in the embodiment.
In the embodiment, the TSDF values and the weight values of all voxels in the current frame may be obtained, if the current frame is the first frame, the first frame is the fusion result, otherwise, the current frame and the previous fusion result need to be fused. The new image frames may be merged into the merged frame one by one. Finally, embodiments can obtain signed distance fields with better detail and higher accuracy, and can be input to the next step of extracting the surface.
S500, extracting a model surface through an isosurface according to the signed distance field, and restoring to obtain a three-dimensional grid model of the target object according to the model surface;
specifically, after the signed distance field is obtained, the embodiment recovers the three-dimensional mesh model by extracting the model surface using iso-surface extraction. The embodiment first proposes a data-driven mesh reconstruction method (NDC) based on dual contours, in which a neural network is used to predict vertex positions, which eliminates the need for gradients in the input and takes into account the contextual information inherent in the training data.
So far, the embodiment completes the reconstruction of a three-dimensional mesh model by starting from a monocular color image, performing depth estimation and spatial voxel construction and finally using a three-dimensional convolution surface extraction method; the embodiment utilizes the self-attention model to achieve the purposes of high accuracy and high-speed depth estimation from the color image, and simultaneously utilizes a method based on truncated signed distance function and three-dimensional convolution surface extraction to carry out rapid mesh model reconstruction processing.
In some possible embodiments, the training process of the self-attention model may include steps S201 to S303:
s201, training through first training data to obtain an encoder in the self-attention model;
s202, constructing and obtaining a decoder in the self-attention model according to a feature up-sampling module;
s203, constructing and obtaining the self-attention model according to the encoder, the decoder and the self-attention module.
Specifically, in terms of model architecture, most existing models adopt the encoder-attention-decoder paradigm. In practice, however, using the self-attention model at a higher resolution helps to improve the accuracy of the model estimation. Based on this, an encoder-decoder-self-attention architecture is used in some possible implementations to accomplish this task. The model uses an EfficientNet-B5 trained on ImageNet as the encoder and a standard feature up-sampling module as the decoder; the input image passes through the encoder and decoder to produce a decoded feature map x_d of size h × w × C_d, which is then passed to the self-attention module for computation, where h denotes the height of the image, w denotes the width of the image, and C_d is the number of channels of the intermediate feature.
In some possible embodiment manners, the step S200 of performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins may include steps S210 to S230:
s210, inputting the two-dimensional image to the encoder for encoding, and inputting an encoding result to the decoder for decoding to obtain the image characteristics of the two-dimensional image;
Specifically, in an embodiment, an image is input into the encoder and decoder trained in the previous steps to obtain the decoded features, on which the self-attention module then performs its calculation.
S220, performing global attention calculation according to the image characteristics to determine the width vector of a depth box corresponding to the two-dimensional image;
Specifically, in an embodiment, when estimating the bins, estimating the sub-intervals of the depth range D in which the depth values of a given image are more likely to fall requires combining both local structural information and global distribution information. Thus, the embodiment proposes to use global attention to calculate the bin-width vector b for each input image.
More specifically, step S220 in the embodiment may further include steps S221 to S223:
S221, inputting the image features to the encoding convolution module, and obtaining a first tensor of the image features according to the kernel size, stride and number of output channels of the encoding convolution module;
s222, flattening the first tensor according to the effective sequence length of the self-attention module to obtain a second tensor;
and S223, performing activation operation output through an activation function in the multilayer perceptron according to the second tensor to obtain a first vector, and performing normalization processing on the first vector to obtain a width vector of the depth box.
Specifically, in the embodiment, the decoded features are passed through an encoding convolution module with a kernel size of p × p, a stride of p, and E output channels, giving a tensor of size h/p × w/p × E. The embodiment further flattens this tensor into a sequence of length S = hw/p², which serves as the effective sequence length of the self-attention module. The embodiment inputs this tensor into the self-attention module, which after processing outputs a series of encoded results of size S × E. The embodiment uses a multi-layer perceptron to further encode the first output embedding. The multi-layer perceptron uses the ReLU activation function and outputs an N-dimensional vector b'. Finally, the embodiment normalizes the vector b' so that its components sum to 1, obtaining the bin-width vector b:

$$b_i = \frac{b_i' + \epsilon}{\sum_{j=1}^{N}\left(b_j' + \epsilon\right)}$$

where ε = 10⁻³. This small positive ε ensures that the width of each bin is strictly positive. Normalization introduces competition between the bin widths, forcing the network to focus on the sub-intervals of D that are more relevant to the depths actually present in the image.
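For illustration only, the bin-width computation described above can be sketched in PyTorch roughly as follows; the module sizes (p, E, N), the use of nn.TransformerEncoder in place of the patent's self-attention module, and the class name BinWidthHead are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class BinWidthHead(nn.Module):
    """Sketch of the bin-width branch: patch conv -> self-attention -> MLP -> normalize."""
    def __init__(self, c_d=128, e_dim=128, p=16, n_bins=256, eps=1e-3):
        super().__init__()
        self.eps = eps
        # encoding convolution module: kernel p x p, stride p, E output channels
        self.patch_conv = nn.Conv2d(c_d, e_dim, kernel_size=p, stride=p)
        layer = nn.TransformerEncoderLayer(d_model=e_dim, nhead=4, batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=4)
        # multi-layer perceptron with ReLU activations producing the N-dimensional vector b'
        self.mlp = nn.Sequential(nn.Linear(e_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins), nn.ReLU())

    def forward(self, x_d):                       # x_d: (B, C_d, h, w) decoded features
        t = self.patch_conv(x_d)                  # (B, E, h/p, w/p)
        t = t.flatten(2).transpose(1, 2)          # (B, S, E) with S = hw / p^2
        enc = self.attention(t)                   # self-attention over the sequence
        b_prime = self.mlp(enc[:, 0, :])          # first output embedding -> b' >= 0
        b_prime = b_prime + self.eps              # keep every width strictly positive
        return b_prime / b_prime.sum(dim=1, keepdim=True)   # widths sum to 1
```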
S230, performing convolution operation on the width vector and the image features to obtain a range attention feature map, and determining a linear combination of the depth box according to the range attention feature map;
in particular, high resolution and local pixel level information can be represented simultaneously by features from the attention module, and more global information can be effectively contained. The embodiment concatenates this output from the attention module with the decoded features and then convolves them with a set of 1 × 1 convolution kernels to obtain the range attention feature map R. This is equivalent to calculating a dot product attention weight between the pixel features considered as "keys" and the self attention output as "query".
More specifically, step S230 in the embodiment may further include steps S231-S232:
s231, inputting the range attention feature map into a convolution kernel for convolution operation, and performing classification prediction on the result of the convolution operation to obtain a classification prediction score value;
s232, calculating a first probability of the center position of the depth box according to the width vector of the depth box, and determining a linear combination of the depth box according to the first probability and the fraction value; the linear combination is used for describing the depth value of the pixel;
In an embodiment, in the hybrid regression module, the range attention feature map R is input into a 1 × 1 convolution module again, and the Softmax scores p_k of the N channels, k = 1, …, N, are obtained after one Softmax operation. Then, the centers c(b) = {c(b_1), c(b_2), …, c(b_N)} of the N depth bins are calculated from the bin-width vector b:

$$c(b_i) = d_{min} + (d_{max} - d_{min})\left(\frac{b_i}{2} + \sum_{j=1}^{i-1} b_j\right)$$

Finally, the depth value of each pixel is obtained as a linear combination of the bin centers weighted by the Softmax scores at this pixel position:

$$\tilde{d} = \sum_{k=1}^{N} c(b_k)\, p_k$$
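A small sketch of this hybrid regression step, written directly from the two formulas above; the tensor shapes and the function name depth_from_bins are illustrative assumptions.

```python
import torch

def depth_from_bins(b, softmax_scores, d_min, d_max):
    """Combine bin widths and per-pixel Softmax scores into a depth map.

    b              : (B, N) normalized bin widths, summing to 1 per image
    softmax_scores : (B, N, H, W) per-pixel Softmax scores over the N bins
    d_min, d_max   : bounds of the depth interval D
    """
    # bin centers: c(b_i) = d_min + (d_max - d_min) * (b_i / 2 + sum_{j<i} b_j)
    edges = torch.cumsum(b, dim=1)
    centers = d_min + (d_max - d_min) * (edges - b / 2)       # (B, N)
    centers = centers.view(centers.shape[0], centers.shape[1], 1, 1)
    # depth = sum_k c(b_k) * p_k at every pixel
    return (centers * softmax_scores).sum(dim=1)              # (B, H, W)
```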
in some possible embodiments, in an embodiment, the step S400 of constructing a spatial voxel from the model point cloud by using a truncated signed distance function, and performing frame fusion on the two-dimensional image to obtain a signed distance field according to a first distance between the spatial voxel and a nearest surface of the object and an updated first weight of the spatial voxel may include steps S410-S450:
s410, determining a first position point of the space voxel in a world coordinate system, and determining a first mapping point of the first position point in a camera coordinate system according to a camera pose matrix corresponding to depth data in the depth map;
in particular, in an embodiment, a traversal is first performed for the constructed spatial voxels. Taking a three-dimensional position point p of a voxel in a world coordinate system as an example; that is, the position point p is marked as a first position point, and in the embodiment, the mapping point v of the point p in the world coordinate system in the camera coordinate system can be obtained from the camera pose matrix of the depth data, which is the first mapping point.
S420, performing back projection on the first mapping point according to the camera internal reference matrix of the camera coordinate system to obtain a second position point in the depth map;
In the embodiment, the corresponding pixel point x in the depth image is solved by back-projecting the point v through the camera intrinsic matrix, obtaining the second position point; the depth value of the pixel point x is Value(x).
S430, determining a second distance between the first mapping point and the origin of the camera coordinate system, calculating a directed distance field according to the second distance and the depth value of the second position point, and determining the first distance according to the directed distance field;
Specifically, in the embodiment, the depth value of the pixel point x is Value(x) and the distance from the point v to the origin of the camera coordinate system is Distance(v); the SDF value of p is then SDF(p) = Value(x) - Distance(v). A truncation distance u is introduced to reduce performance consumption; within the truncation distance u, the TSDF is calculated as

$$TSDF(p) = \frac{SDF(p)}{u}$$

otherwise, TSDF(p) = 1 if SDF(p) > 0 and TSDF(p) = -1 if SDF(p) < 0.
S440, calculating to obtain the first weight according to the projection light of the first position point, the included angle of the surface normal vector and the second distance;
Specifically, in the embodiment, the calculation formula of the first weight W(p) is

$$W(p) = \frac{\cos\theta}{Distance(v)}$$

where θ is the angle between the projection ray and the surface normal vector. The TSDF values and weight values of all voxels of the current frame can be calculated through step S440.
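Steps S410 to S440 can be sketched for a single voxel as follows; the intrinsic matrix K, the world-to-camera pose and the cosine approximation used for the weight are assumptions for illustration, since the patent computes the angle against the surface normal.

```python
import numpy as np

def voxel_tsdf_and_weight(p_world, depth_map, K, world_to_cam, trunc_u):
    """Return (TSDF(p), W(p)) for one voxel centre p in world coordinates."""
    # first mapping point: voxel centre in camera coordinates
    v = (world_to_cam @ np.append(p_world, 1.0))[:3]
    if v[2] <= 0:                                    # behind the camera
        return None
    # second position point: project into the depth image with the intrinsics
    u_px = K @ (v / v[2])
    x, y = int(round(u_px[0])), int(round(u_px[1]))
    if not (0 <= y < depth_map.shape[0] and 0 <= x < depth_map.shape[1]):
        return None
    value_x = depth_map[y, x]                        # Value(x)
    distance_v = np.linalg.norm(v)                   # Distance(v), the second distance
    sdf = value_x - distance_v                       # SDF(p) = Value(x) - Distance(v)
    tsdf = np.clip(sdf / trunc_u, -1.0, 1.0)         # truncate to [-1, 1]
    # weight cos(theta) / Distance(v); here theta is approximated by the angle of the
    # viewing ray to the optical axis instead of the true surface normal
    cos_theta = v[2] / distance_v
    weight = cos_theta / distance_v
    return tsdf, weight
```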
S450, according to the first distance of the space voxel in the current frameAnd the first weight determines a signed distance field of the spatial voxel; if the current frame is the first frame, the first frame is the fusion result, otherwise, the current frame and the previous fusion result are required to be fused. Example TSDF Fuse(p) Fused TSDF values, W, as voxel p Fuse(p) To fuse weight values, TSDF Cur (p) TSDF value, W, for the current frame of voxel p Cur And (p) is the weight value of the current image frame. Embodiments may pass TSDF Cur (p) updating TSDF Fuse(p) . Wherein TSDF Fuse(p) The following calculation formula is satisfied:
Figure BDA0003956282500000103
W(p)=W Fuse (p)+W Cur (p)
In particular embodiments, the new image frame can be merged into the fused result through TSDF_Fuse(p) and W(p). Finally, the embodiment obtains a signed distance field with better detail and higher accuracy, which can be input to the next step of extracting the surface.
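The frame-fusion update of step S450 amounts to a running weighted average, sketched below under the assumption that per-frame TSDF and weight volumes have already been computed as above.

```python
import numpy as np

def fuse_frame(tsdf_fuse, w_fuse, tsdf_cur, w_cur):
    """Running weighted average of TSDF volumes over frames.

    All arguments are arrays of the same shape (one entry per voxel).
    Voxels not observed in the current frame should carry w_cur = 0.
    """
    w_new = w_fuse + w_cur
    tsdf_new = np.where(
        w_new > 0,
        (w_fuse * tsdf_fuse + w_cur * tsdf_cur) / np.maximum(w_new, 1e-8),
        tsdf_fuse,                                   # keep the old value if nothing observed
    )
    return tsdf_new, w_new
```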
In some possible embodiments, in step S500, extracting a model surface from the signed distance field by iso-surface, and obtaining a three-dimensional mesh model of the object by restoring from the model surface, may include steps S510-S520:
S510, sampling a floating-point tensor corresponding to the signed distance field at a third position point of the three-dimensional mesh model;
s520, inputting the floating point tensor to the trained three-dimensional convolution network, and performing operation output through an activation function in the three-dimensional convolution network to obtain the three-dimensional grid model;
in particular, in embodiments, a dual-contour-based data-driven mesh reconstruction method (NDC) is proposed, which uses neural networks to predict vertex positions, which eliminates the need for gradients in the input, and takes into account the contextual information inherent in the training data. The formula for NDC is as follows:
$$V = f_\theta(I)$$

where I represents the input signed distance field and θ represents the learnable parameters. The model f_v of the embodiment first samples the distance field representation Φ at the grid vertices X as a floating-point tensor of shape |X|; S denotes the grid vertex signs, V denotes the predicted mesh vertices, G denotes the discretized grid, and F denotes the bi-directional surface, which is created only when a lattice edge connects lattice vertices of opposite signs. The embodiment then uses a three-dimensional convolutional network to process this tensor. The three-dimensional convolutional network has 6 layers in total; the convolution kernel size of the first 3 layers is 3³ and that of the last 3 layers is 1³, giving a total receptive field of 7³. The embodiment employs hidden layers with a small width of 64 channels to improve the computational efficiency of the network. Sigmoid is used as the activation function on the output layer and Leaky ReLU elsewhere.
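A rough PyTorch sketch of such a 6-layer three-dimensional convolutional network is given below; the input and output channel counts and the padding scheme are assumptions, not the exact network of the embodiment.

```python
import torch
import torch.nn as nn

class VertexNet3D(nn.Module):
    """6-layer 3D CNN: three 3x3x3 layers then three 1x1x1 layers (receptive field 7^3)."""
    def __init__(self, in_channels=1, hidden=64, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, hidden, kernel_size=3, padding=1), nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),      nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),      nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=1),                 nn.LeakyReLU(),
            nn.Conv3d(hidden, hidden, kernel_size=1),                 nn.LeakyReLU(),
            nn.Conv3d(hidden, out_channels, kernel_size=1),
        )

    def forward(self, phi):                 # phi: (B, 1, D, H, W) sampled distance field
        # Sigmoid on the output layer keeps each predicted vertex offset inside its cell
        return torch.sigmoid(self.net(phi))
```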
The training process of the three-dimensional convolution network comprises steps S511-S512:
s511, constructing to obtain second training data according to image data of narrow bands around the object surface in the historical data;
s512, inputting the second training data into the three-dimensional convolution network to output to obtain a prediction result, constructing a loss function through a binary mask of the prediction result, and adjusting parameters of the three-dimensional convolution network according to the loss function;
Specifically, in the embodiment, the neural network is structured as shown in fig. 3. During model training, the embodiment uses the input data to supervise the predictions of the network in a narrow band around the input surface; the embodiment uses binary masks M_S and M_V to evaluate whether a location lies within the narrow band (1 if it does, 0 otherwise), because the surface can only lie near the zero crossing of the signed distance field. The network is trained with an L2 loss against the ground-truth vertices:

$$L_V = \left\| M_V \odot \left( V - V_{gt} \right) \right\|_2^2$$

where ⊙ denotes the Hadamard product and V_gt represents the ground-truth mesh vertices in the training set; in this way a neural network mapping from the signed distance field to the mesh surface can be trained.
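As an illustration, the masked vertex loss above can be written as the short function below; the tensor shapes and the name mask_v for the binary mask M_V are assumptions.

```python
import torch

def masked_vertex_loss(v_pred, v_gt, mask_v):
    """L2 loss on predicted vertex positions, restricted to the narrow band M_V."""
    diff = mask_v * (v_pred - v_gt)      # Hadamard product with the binary mask
    # the loss above is the squared L2 norm; dividing by mask_v.sum() would
    # additionally average it over the narrow band
    return (diff ** 2).sum()
```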
On the other hand, the technical solution of the present application further provides a mesh model reconstruction system based on a two-dimensional image, the system including:
the first unit is used for acquiring a two-dimensional image of the surface of a target object to be reconstructed;
the second unit is used for performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
the third unit is used for describing the depth value of the pixel in the two-dimensional image according to the linear combination, constructing a depth map and constructing a model point cloud of a target object according to the depth map;
a fourth unit, configured to construct a spatial voxel according to the model point cloud by using a truncated signed distance function, and perform frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field;
and a fifth unit for extracting a model surface from the signed distance field through an iso-surface, and obtaining a three-dimensional mesh model of the target object by restoration according to the model surface.
On the other hand, the technical solution of the present application further provides a mesh model reconstruction device based on a two-dimensional image, the device including: at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the two-dimensional image based mesh model reconstruction method according to the first aspect.
An embodiment of the present invention further provides a storage medium, where a corresponding execution program is stored, and the program is executed by a processor, so as to implement the mesh model reconstruction method based on a two-dimensional image in the first aspect.
From the above specific implementation process, it can be concluded that, compared with the prior art, the technical solution provided by the present invention has the following advantages or benefits:
Compared with the traditional method, the depth estimation in the first part of the technical scheme of the application introduces a global receptive field and can better aggregate context information and non-local features, so that the depth estimation result is more accurate than that of existing methods. The second and third parts of the invention introduce a truncation-based signed distance function and a novel three-dimensional convolutional surface-extraction method, which have the advantage of quickly reconstructing the iso-surface from the point cloud.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The grid model reconstruction method based on the two-dimensional image is characterized by comprising the following steps of:
acquiring a two-dimensional image of the surface of a target object to be reconstructed;
performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
describing the depth value of pixels in the two-dimensional image according to the linear combination, constructing to obtain a depth map, and constructing a model point cloud of a target object according to the depth map;
constructing a space voxel according to the model point cloud through a truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the space voxel and the nearest target object surface and a first weight updated by the space voxel to obtain a signed distance field;
and extracting a model surface through an iso-surface according to the signed distance field, and restoring according to the model surface to obtain the three-dimensional grid model of the target object.
2. The two-dimensional image-based mesh model reconstruction method according to claim 1, wherein the training process of the self-attention model comprises:
training through first training data to obtain an encoder in the self-attention model;
constructing and obtaining a decoder in the self-attention model according to a feature up-sampling module;
and constructing the self-attention model according to the encoder, the decoder and the self-attention module.
3. The mesh model reconstruction method based on two-dimensional image according to claim 2, wherein the step of performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins comprises:
inputting the two-dimensional image into the encoder for encoding, and inputting an encoding result into the decoder for decoding to obtain the image characteristics of the two-dimensional image;
performing global attention calculation according to the image characteristics to determine the width vector of a depth box corresponding to the two-dimensional image; and performing convolution operation on the width vector and the image characteristics to obtain a range attention characteristic diagram, and determining the linear combination of the depth box according to the range attention characteristic diagram.
4. The method for reconstructing a mesh model based on a two-dimensional image according to claim 3, wherein the determining the width vector of the depth box corresponding to the two-dimensional image by performing the global attention calculation according to the image features comprises:
inputting the image features to an encoding convolution module, and obtaining a first tensor of the image features according to the kernel size, stride and number of output channels of the encoding convolution module;
flattening the first tensor according to the effective sequence length of the self-attention module to obtain a second tensor;
and performing activation operation output through an activation function in a multilayer perceptron according to the second tensor to obtain a first vector, and performing normalization processing on the first vector to obtain a width vector of the depth box.
5. The two-dimensional image-based mesh model reconstruction method according to claim 3, wherein the convolving the width vectors with the image features to obtain a range attention feature map, and determining the linear combination of the depth bins according to the range attention feature map comprises:
inputting the range attention feature map into a convolution kernel for convolution operation, and performing classification prediction on the result of the convolution operation to obtain a classification prediction score value;
calculating a first probability of the center position of each depth bin according to the width vector of the depth bins, and determining a linear combination of the depth bins according to the first probability and the score value; the linear combination is used to describe the depth value of the pixel.
6. The method of claim 1, wherein constructing a spatial voxel from the model point cloud by the truncated signed distance function, and performing frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest surface of the object and an updated first weight of the spatial voxel to obtain a signed distance field comprises:
determining a first position point of the space voxel in a world coordinate system, and determining a first mapping point of the first position point in a camera coordinate system according to a camera pose matrix corresponding to depth data in the depth map;
back-projecting the first mapping point according to the camera intrinsic matrix of the camera coordinate system to obtain a second position point in the depth map;
determining a second distance between the first mapping point and the origin of the camera coordinate system, calculating a directed distance field according to the second distance and the depth value of the second position point, and determining the first distance according to the directed distance field;
calculating to obtain the first weight according to the projection light of the first position point, the included angle of the surface normal vector and the second distance;
a signed distance field for the spatial voxel is determined based on the first distance of the spatial voxel in the current frame and the first weight.
7. The method of claim 1, wherein the extracting a model surface from the signed distance field by iso-surface and reconstructing the three-dimensional mesh model of the object from the model surface comprises:
sampling a floating-point tensor corresponding to the signed distance field at a third position point of the three-dimensional mesh model;
inputting the floating point tensor to a trained three-dimensional convolution network, and performing operation output through an activation function in the three-dimensional convolution network to obtain a three-dimensional grid model;
the training process of the three-dimensional convolution network comprises the following steps:
constructing to obtain second training data according to narrow-band image data around the surface of the object in the historical data;
and inputting the second training data into the three-dimensional convolution network to output to obtain a prediction result, constructing a loss function through a binary mask of the prediction result, and adjusting parameters of the three-dimensional convolution network according to the loss function.
8. A mesh model reconstruction system based on two-dimensional images is characterized by comprising:
the first unit is used for acquiring a two-dimensional image of the surface of a target object to be reconstructed;
the second unit is used for performing bin depth prediction on the image depth interval of the two-dimensional image through the trained self-attention model to obtain a linear combination of depth bins; each depth bin is used for representing a depth value interval;
the third unit is used for describing the depth value of the pixel in the two-dimensional image according to the linear combination, constructing a depth map and constructing a model point cloud of a target object according to the depth map;
a fourth unit, configured to construct a spatial voxel according to the model point cloud by using a truncated signed distance function, and perform frame fusion on the two-dimensional image according to a first distance between the spatial voxel and a nearest target surface and a first weight of the spatial voxel update to obtain a signed distance field;
and a fifth unit for extracting a model surface from the signed distance field through an iso-surface, and obtaining a three-dimensional mesh model of the target object by restoration according to the model surface.
9. A mesh model reconstruction apparatus based on two-dimensional images, the apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the method for two-dimensional image based mesh model reconstruction according to any one of claims 1-7.
10. A storage medium having stored therein a processor-executable program, which when executed by a processor is adapted to execute the method of two-dimensional image based mesh model reconstruction according to any one of claims 1-7.
CN202211463350.2A 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image Pending CN115731365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211463350.2A CN115731365A (en) 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211463350.2A CN115731365A (en) 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image

Publications (1)

Publication Number Publication Date
CN115731365A true CN115731365A (en) 2023-03-03

Family

ID=85297145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211463350.2A Pending CN115731365A (en) 2022-11-22 2022-11-22 Grid model reconstruction method, system, device and medium based on two-dimensional image

Country Status (1)

Country Link
CN (1) CN115731365A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223387A (en) * 2019-05-17 2019-09-10 武汉奥贝赛维数码科技有限公司 A kind of reconstructing three-dimensional model technology based on deep learning
CN112288011A (en) * 2020-10-30 2021-01-29 闽江学院 Image matching method based on self-attention deep neural network
WO2022165722A1 (en) * 2021-02-04 2022-08-11 华为技术有限公司 Monocular depth estimation method, apparatus and device
CN113222033A (en) * 2021-05-19 2021-08-06 北京数研科技发展有限公司 Monocular image estimation method based on multi-classification regression model and self-attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BHAT S F, ALHASHIM I AND WONKA P: "AdaBins: Depth Estimation Using Adaptive Bins", IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) *
CHEN Z Q, TAGLIASACCHI A, FUNKHOUSER T, ET AL: "Neural Dual Contouring", ACM Trans. Graph. *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797726A (en) * 2023-05-20 2023-09-22 北京大学 Organ three-dimensional reconstruction method, device, electronic equipment and storage medium
CN116797726B (en) * 2023-05-20 2024-05-07 北京大学 Organ three-dimensional reconstruction method, device, electronic equipment and storage medium
CN116740681A (en) * 2023-08-10 2023-09-12 小米汽车科技有限公司 Target detection method, device, vehicle and storage medium
CN116740681B (en) * 2023-08-10 2023-11-21 小米汽车科技有限公司 Target detection method, device, vehicle and storage medium
CN117496075A (en) * 2024-01-02 2024-02-02 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium
CN117496075B (en) * 2024-01-02 2024-03-22 中南大学 Single-view three-dimensional reconstruction method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
Insafutdinov et al. Unsupervised learning of shape and pose with differentiable point clouds
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN115731365A (en) Grid model reconstruction method, system, device and medium based on two-dimensional image
CN111986307A (en) 3D object reconstruction using photometric grid representation
Zhang et al. Critical regularizations for neural surface reconstruction in the wild
CN114782634B (en) Monocular image dressing human body reconstruction method and system based on surface hidden function
CN113313828B (en) Three-dimensional reconstruction method and system based on single-picture intrinsic image decomposition
JP7294788B2 (en) Classification of 2D images according to the type of 3D placement
US20220327730A1 (en) Method for training neural network, system for training neural network, and neural network
CN114742966A (en) Three-dimensional scene reconstruction method and device based on image
JP2022036023A (en) Variation auto encoder for outputting 3d model
CN115272437A (en) Image depth estimation method and device based on global and local features
KR20210058638A (en) Apparatus and method for image processing
Huang et al. A bayesian approach to multi-view 4d modeling
Chen et al. Multi-view Pixel2Mesh++: 3D reconstruction via Pixel2Mesh with more images
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction
CN113989441A (en) Three-dimensional cartoon model automatic generation method and system based on single face image
CN117274446A (en) Scene video processing method, device, equipment and storage medium
Ehret et al. Regularization of NeRFs using differential geometry
CN115841546A (en) Scene structure associated subway station multi-view vector simulation rendering method and system
Hu et al. 3D map reconstruction using a monocular camera for smart cities
US20230145498A1 (en) Image reprojection and multi-image inpainting based on geometric depth parameters
EP3958167B1 (en) A method for training a neural network to deliver the viewpoints of objects using unlabeled pairs of images, and the corresponding system
Ye et al. Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement
CN113808275A (en) Single-image three-dimensional reconstruction method based on GCN and topology modification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination