WO2022236802A1

WO2022236802A1 - Method and apparatus for reconstructing object model, and terminal device and storage medium

Info

Publication number: WO2022236802A1
Application number: PCT/CN2021/093783
Authority: WO
Inventors: 王磊; 钟宏亮; 林佩珍; 程俊
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2022-11-17

Abstract

The present application relates to the technical field of image processing, and provided are a method and apparatus for reconstructing an object model, and a terminal device and a storage medium. The method comprises: firstly, acquiring an original image including a target object and a certain preset grid template, extracting a feature vector of the original image, and then fusing the feature vector and position coordinates of each vertex in the grid template, so as to obtain a feature matrix; then using a decoding network to process the feature matrix, introducing an attention mechanism during decoding, so as to take a positional correlation between the vertices of an object into consideration, so as to obtain target position coordinates of each vertex after decoding; and finally, according to the obtained target position coordinates of each vertex and previously acquired connection relationship data between the vertices, performing reconstruction to obtain a three-dimensional model corresponding to the target object. By means of the method, the occurrence of unnatural bulges or recesses on the surface of the obtained, by means of reconstruction, three-dimensional model of an object can be avoided, thereby improving the reconstruction effect of the three-dimensional model.

Description

Object model reconstruction method, device, terminal equipment and storage medium

technical field

The present application relates to the technical field of image processing, and in particular to a reconstruction method, device, terminal device and storage medium of an object model.

Background technique

The 3D model reconstruction technology of indoor scenes has great application value in the fields of virtual reality and human-computer interaction. At present, a monocular 3D object model reconstruction method based on deep learning is usually used, which generally uses an end-to-end encoder-decoder structure operation model. However, when predicting the position distribution of a certain vertex on the surface of the object at the decoder side, usually only the global information of the object image and the feature information of the vertex are considered, which will make the surface of the reconstructed 3D model of the object appear unnaturally convex or The sag phenomenon leads to poor reconstruction effect of the 3D model.

technical problem

In view of this, the embodiment of the present application provides a reconstruction method, device, terminal device and storage medium of an object model, which can avoid unnatural protrusions or depressions on the surface of the reconstructed three-dimensional model of the object, and improve the accuracy of the three-dimensional model. Rebuild effect.

technical solution

The first aspect of the embodiments of the present application provides a method for reconstructing an object model, including:

Obtaining a preset grid template and an original image containing the target object, the grid template including the initial position coordinates of each vertex of the original three-dimensional model and connection relationship data between the various vertices;

Inputting the original image into a pre-built encoding network for processing, outputting an initial feature vector corresponding to the original image, the encoding network being a neural network for extracting image features;

Fusing the initial feature vector and the initial position coordinates of each vertex to obtain a first feature matrix, the first feature matrix includes target feature vectors corresponding to each of the vertices;

The first feature matrix is input into a pre-built decoding network for processing, and the second feature matrix is output. The second feature matrix includes the target position coordinates corresponding to each of the vertices. The decoding network is composed of a fully connected layer and a The neural network of the attention mechanism layer, the attention mechanism layer is used to fuse the targets corresponding to each of the vertices respectively according to the correlation between each of the vertices and the vertices for each of the vertices A feature vector is obtained to obtain a fused target feature vector corresponding to the vertex, and the fused target feature vector is used to determine the target position coordinates corresponding to the vertex;

A target three-dimensional model corresponding to the target object is reconstructed according to the target position coordinates corresponding to each of the vertices and the connection relationship data between the vertices.

In the embodiment of the present application, the original image containing the target object and a preset grid template are first obtained, the feature vector of the original image is extracted, and then the feature vector is fused with the position coordinates of each vertex in the grid template, Get the feature matrix. Then, the feature matrix is processed by the decoding network, and the attention mechanism is introduced during decoding to consider the position correlation between each vertex of the object, and the target position coordinates of each vertex after decoding are obtained. Finally, according to the acquired target position coordinates of each vertex and the previously acquired connection relationship data between each vertex, the 3D model corresponding to the target object is reconstructed. The above process performs the fusion of feature vectors according to the correlation of the position coordinates between the vertices of the object, which can consider the mutual influence relationship between the vertices of the object, so as to avoid unnatural protrusions or depressions on the surface of the reconstructed 3D model of the object phenomenon, and improve the reconstruction effect of the 3D model.

In one embodiment of the present application, before the fusion of the initial feature vector and the initial position coordinates of the respective vertices, it may further include:

Acquiring a category vector corresponding to the target object, where the category vector is used to represent the object category to which the target object belongs;

splicing the category vector and the initial feature vector to obtain a spliced feature vector;

The fusion of the initial feature vector and the initial position coordinates of each vertex may specifically be:

The spliced feature vectors are fused with the initial position coordinates of each vertex.

In one embodiment of the present application, the merging of the initial feature vector and the initial position coordinates of each vertex to obtain the first feature matrix may include:

Expressing the initial position coordinates of each vertex as a matrix of dimension N*3, N being the quantity of each vertex;

Splicing the initial feature vector and the matrix of dimension N*3 in the second dimension to obtain the first feature matrix of dimension N*(3+X), where X is the number of elements of the initial feature vector.

In one embodiment of the present application, the decoding network includes a plurality of cascaded decoding modules, each of which includes a fully connected layer, an attention mechanism layer, and a batch normalization layer in turn, and the The first feature matrix is input to the pre-built decoding network for processing, and the second feature matrix is output, which may include:

Input the first feature matrix into the fully connected layer of the first decoding module of the decoding network for processing, and output the first intermediate matrix;

Input the first intermediate matrix into the attention mechanism layer of the first decoding module for processing, and output the second intermediate matrix;

splicing the second intermediate matrix and the first intermediate matrix to obtain a third intermediate matrix;

Inputting the third intermediate matrix into the batch normalization layer of the first decoding module for processing to obtain a fourth intermediate matrix;

Input the fourth intermediate matrix into the second decoding module of the decoding network, and continue to use the same processing method as that of the first decoding module until the output of the last decoding module of the decoding network is obtained. Second feature matrix.

Further, the first intermediate matrix includes target feature vectors corresponding to each of the vertices, the first intermediate matrix is input to the attention mechanism layer of the first decoding module for processing, and the second intermediate matrix is output matrix, which can include:

For each of the vertices, the correlation weights between each of the vertices and the vertices are calculated according to the trainable weight matrix, and then the target feature vectors corresponding to each of the vertices are respectively corresponding to The correlation weights are weighted and summed to obtain the fused target feature vector corresponding to the vertex, and the second intermediate matrix is a matrix composed of the fused target feature vectors corresponding to each of the vertices.

In one embodiment of the present application, after the target three-dimensional model corresponding to the target object is reconstructed, it may further include:

calculating the size of all dihedral angles of the target three-dimensional model according to the position coordinates of each vertex of the target three-dimensional model;

Calculate the smoothing loss according to the size of all dihedral angles;

Optimizing and updating parameters of the decoding network based on the smoothing loss.

Further, the smoothing loss is calculated according to the sizes of all the dihedral angles, which may specifically be:

The smoothing loss is calculated by the following formula:

Wherein, L _smooth represents the smoothing loss, θ _{i, j} represents the dihedral angle between any two planes i, j of the target 3D model, and F represents all planes of the target 3D model.

The second aspect of the embodiment of the present application provides an object model reconstruction device, including:

A data acquisition module, configured to acquire a preset grid template and an original image containing the target object, the grid template including the initial position coordinates of each vertex of the original three-dimensional model and the connection relationship data between the various vertices;

A feature encoding module, configured to input the original image into a pre-built encoding network for processing, and output an initial feature vector corresponding to the original image, and the encoding network is a neural network for extracting image features;

A vector fusion module, configured to fuse the initial feature vector and the initial position coordinates of each vertex to obtain a first feature matrix, the first feature matrix includes target feature vectors corresponding to each of the vertices;

A feature decoding module, configured to input the first feature matrix into a pre-built decoding network for processing, and output a second feature matrix, the second feature matrix includes target position coordinates corresponding to each of the vertices, and the decoding network It is a neural network comprising a fully connected layer and an attention mechanism layer, and the attention mechanism layer is used to fuse each of the vertices according to the correlation between each of the vertices and the vertices for each of the vertices. The target feature vector corresponding to the vertex respectively, obtains the fused target feature vector corresponding to the vertex, and the fused target feature vector is used to determine the target position coordinates corresponding to the vertex;

The model reconstruction module is used to reconstruct the target three-dimensional model corresponding to the target object according to the target position coordinates corresponding to each of the vertices and the connection relationship data between the vertices.

The third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and operable on the processor, when the processor executes the computer program The object model reconstruction method provided in the first aspect of the embodiment of the present application is implemented.

The fourth aspect of the embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, it implements the method provided in the first aspect of the embodiment of the present application. The reconstruction method of the object model.

A fifth aspect of the embodiments of the present application provides a computer program product, which, when the computer program product is run on a terminal device, causes the terminal device to execute the object model reconstruction method described in the first aspect of the embodiments of the present application.

It can be understood that, for the beneficial effects of the above-mentioned second aspect to the fifth aspect, reference can be made to the relevant description in the above-mentioned first aspect, and details will not be repeated here.

Description of drawings

FIG. 1 is a flow chart of a method for reconstructing an object model provided in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an encoding network provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a residual module provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a decoding network provided by an embodiment of the present application;

Fig. 5 is a schematic diagram of the processing of the attention mechanism layer provided by the embodiment of the present application;

Fig. 6 is a schematic diagram of the operation of the object model reconstruction method provided by the embodiment of the present application;

Fig. 7 is a schematic diagram of the processing effect of the object model reconstruction method provided by the embodiment of the present application;

Fig. 8 is a comparison diagram of the 3D model reconstruction results obtained by the present application and the Total3D original model in the prior art;

FIG. 9 is a structural diagram of an object model reconstruction device provided in an embodiment of the present application;

FIG. 10 is a schematic diagram of a terminal device provided by an embodiment of the present application.

Embodiments of the present invention

In the following description, for the purpose of illustration rather than limitation, specific details such as specific system structures and technologies are presented, so as to thoroughly understand the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. In addition, in the description of the specification and appended claims of the present application, the terms "first", "second", "third" and so on are only used to distinguish descriptions, and should not be understood as indicating or implying relative importance.

The present application proposes a reconstruction method, device, terminal equipment and storage medium of an object model, which can avoid unnatural protrusions or depressions on the surface of the reconstructed three-dimensional model of the object, and improve the reconstruction effect of the three-dimensional model. It should be understood that various method embodiments of the present application are executed by various types of terminal devices or servers, such as mobile phones, tablet computers, notebook computers, desktop computers, and wearable devices.

Please refer to FIG. 1, which shows a method for reconstructing an object model provided by an embodiment of the present application, including:

101. Obtain a preset grid template and an original image containing a target object;

First, grab one of the preset grid templates. The grid template includes the initial position coordinates of each vertex of the original three-dimensional model and connection relationship data between the various vertexes. For example, the grid template can be a Mesh file, which stores the vertex positions and the connection relationship between vertices of the original 3D model. The original 3D model can be a model of various shapes such as a sphere, a cube, and a cuboid. To make the distribution of each vertex position relatively uniform, it is generally recommended to use the original 3D model in the shape of a sphere. Assuming that the original 3D model has N vertices, the grid template includes the 3D position coordinates of each of the N vertices and the connection relationship data between the N vertices, and the N vertices can be determined according to the connection relationship data How are they connected, so that the corresponding 3D model can be obtained.

In addition, it is also necessary to obtain an original image containing a target object, which is any type of object whose corresponding 3D model needs to be reconstructed, such as a sofa, a table, or a bed. Specifically, the original image may be an RGB image or a grayscale image of the target object.

102. Input the original image into a pre-built encoding network for processing, and output an initial feature vector corresponding to the original image;

After the original image is acquired, the original image is input into a pre-built encoding network for processing to obtain the feature vector corresponding to the original image. Among them, the encoding network is a neural network for extracting image features. Generally, images are processed through convolutional layers, pooling layers, and fully connected layers to extract image features and obtain corresponding feature vectors. This application does not The type and structure of the network are defined.

A schematic diagram of the structure of a coding network provided in the embodiment of the present application is shown in Figure 2. The input original image with a dimension of 224*224*3 passes through several convolutional layers, ReLU activation function layers, and maximum pooling layers of the coding network. After processing with the network layer such as the fully connected layer, the feature data of 1*1*1024 is finally obtained, which can be regarded as a vector of 1024 elements, that is, the initial feature vector corresponding to the original image of 224*224*3 . In addition, in order to avoid the gradient explosion or gradient disappearance of image features caused by an overly deep model structure, multiple stacked residual modules can also be added to the encoding network structure shown in Figure 2, where the structure of each residual module The schematic diagram is shown in Figure 3. The input feature map is processed by two 3*3 convolution blocks with edge padding. After extracting local features, features are integrated and screened through the pooling layer to reduce the dimension of image features. The output of each residual module is added to its original input to form a new data transmission path, which endows the residual network with the ability of identity mapping. In practical applications, the residual network model ResNet-18 and its pre-trained weights provided by the PyTorch framework can be used as the encoding network.

103. Fusing the initial feature vector and the initial position coordinates of each vertex to obtain a first feature matrix;

After the initial feature vector is obtained, the feature vector is fused with the initial position coordinates of each vertex in the grid template to obtain a first feature matrix, and the first feature matrix includes target feature vectors corresponding to each of the vertices. The initial position coordinates (x, y, z) of a vertex can be regarded as a vector of 3 elements, so the vector of 3 elements and the initial feature vector can be fused in a splicing manner to obtain a new vector, that is, the target feature vector. Each target feature vector corresponding to each different vertex may form a matrix, that is, the first feature matrix.

(1) the initial position coordinates of each vertex are represented as a matrix of dimension N*3, and N is the quantity of each vertex;

(2) Splicing the initial feature vector and the matrix of dimension N*3 on the second dimension to obtain the first feature matrix of dimension N*(3+X), where X is the matrix of the initial feature vector number of elements.

Suppose there are a total of N vertices, and the initial position coordinates of each vertex are expressed as a vector of 3 elements, then the initial position coordinates of N vertices can be expressed as a matrix of N*3, and the number of elements of the initial feature vector is assumed to be X , then an N*(X+3) matrix will be obtained after splicing in the second dimension as the first feature matrix.

(1) Acquiring a category vector corresponding to the target object, where the category vector is used to represent the object category to which the target object belongs;

(2) Concatenate the category vector and the initial feature vector to obtain a concatenated feature vector.

In order to improve the versatility of this application and make it compatible with the three-dimensional model reconstruction processing of multiple different types of objects, before the initial feature vector and the initial position coordinates of each vertex are fused, a certain type vector can also be combined with the The initial feature vectors are concatenated, and then the concatenated vectors are fused with the initial position coordinates.

Specifically, each object category corresponds to a unique category vector, so the category vector can be in the form of one-hot encoding. For example, if there are images of 4 types of objects in the data set to be processed, which are tables, chairs, computers, and airplanes, you can pre-set the category vector corresponding to the table as (0, 0, 0, 1), and the category corresponding to the chair The vector is (0, 0, 1, 0), the category vector corresponding to the computer is (0, 1, 0, 0), and the category vector corresponding to the aircraft is (1, 0, 0, 0). If the current processed original image The target object in is a table, then the category vector (0, 0, 0, 1) corresponding to the table is obtained at this time, and spliced with the initial feature vector.

An example to illustrate the specific splicing method is as follows: Assume that there are 2562 vertices in total, and the initial position coordinates of each vertex are expressed as a vector of 3 elements, then the initial position coordinates of 2562 vertices can be expressed as a 2562*3 matrix. The number of elements of the initial feature vector is 1024, and the number of elements of the category vector is 9. First, the initial feature vector and the category vector are spliced to obtain a new feature vector with the number of elements of 1033, and then the new feature vector is combined with the 2562* The second dimension of the matrix of 3 is concatenated to obtain a 2562*1036 matrix as the first feature matrix. Each 1*1036 vector in the first feature matrix is a semantic vector corresponding to each model vertex.

104. Input the first feature matrix into a pre-built decoding network for processing, and output a second feature matrix;

After the first feature matrix is obtained, it is input into a pre-built decoding network for processing to obtain the second feature matrix, which contains the converted target position coordinates corresponding to each vertex. Wherein, the decoding network is a neural network including a fully connected layer and an attention mechanism layer, and the attention mechanism layer is used for each of the vertices according to the correlation between each of the vertices and the vertices. The target feature vectors corresponding to each of the vertices are fused to obtain a fused target feature vector corresponding to the vertex, and the fused target feature vector is used to determine the target position coordinates corresponding to the vertex. Ordinary decoding networks usually use a multi-layer stacked fully connected network to predict the offset of the vertex coordinates of the grid template, and obtain the converted target position coordinates. However, this method can only consider the global information of the image and the information of a single target point when predicting, and does not consider the related points of the target point, especially the mutual influence between local adjacent points, which will easily lead to the reconstruction of the 3D model Unnatural bumps or depressions appear on the surface. To address this problem, this application adds an attention mechanism layer to the decoding network to capture the positional interaction between different vertices of the same object.

(1) input the fully connected layer of the first decoding module of the decoding network with the first feature matrix for processing, and output the first intermediate matrix;

(2) the first intermediate matrix is input to the attention mechanism layer of the first decoding module for processing, and the second intermediate matrix is output;

(3) splicing the second intermediate matrix and the first intermediate matrix to obtain a third intermediate matrix;

(4) the batch normalization layer of described first decoding module input described 3rd intermediate matrix is processed, obtains the 4th intermediate matrix;

(5) Input the fourth intermediate matrix into the second decoding module of the decoding network, continue to use the same processing method as the first decoding module, until the output of the last decoding module of the decoding network is obtained The second characteristic matrix of .

As shown in FIG. 4 , it is a schematic structural diagram of a decoding network provided by an embodiment of the present application. The decoding network includes multiple stacked decoding modules, where each decoding module is sequentially composed of a fully connected layer, an attention mechanism layer and a batch normalization layer. The fully connected layer can be realized by 1*1 convolution to predict the coordinate offset of a single vertex, and then filter and extract several vertices most relevant to the current vertex (usually local adjacent vertices) through the attention mechanism layer The coordinate information of the data is spliced with the original output, and then processed by the batch normalization layer (that is, the Batch Normalization layer, also known as the batch reduction layer), so that the data conforms to the Gaussian distribution, and then put into the subsequent network.

As shown in Figure 5, it is a schematic diagram of the processing of the attention mechanism layer adopted in this application. After inputting the first feature matrix into the fully connected layer of the first decoding module for processing, the first intermediate matrix I∈R ^N*C is obtained, where N represents the number of vertices, and C represents the number of elements of the target feature vector corresponding to each vertex number. After inputting the first intermediate matrix I into the attention mechanism layer for processing, the second intermediate matrix A∈R ^N*C is obtained, and then the two matrices are spliced in the second dimension to obtain the third intermediate matrix O∈R ^N*2C . Next, the third intermediate matrix O is input to the batch normalization layer for processing, and then connected to the next decoding module to perform the same processing, and so on, and finally output the second feature matrix. This process can be called point-to-point attention mechanism processing.

After inputting the first intermediate matrix I into the attention mechanism layer, the specific processing method is: for a certain vertex P, a trainable weight matrix is used to calculate the N-1 vertices (excluding vertex P) and vertex The correlation weight between P, and then the target feature vectors corresponding to the N-1 vertices are weighted and summed according to their corresponding correlation weights, and the fused target feature vector corresponding to the vertex P is obtained. In this process, The dimensionality of the feature vector is unchanged (dimension is still C). After obtaining the fused target feature vectors corresponding to the N vertices in the same way as the vertex P, N fused target feature vectors will be obtained, which form the second intermediate matrix A∈R ^N*C .

When calculating the correlation weight, the following formula (1.1) can be used:

Among them, e _{i, j} represent the correlation weight between any two vertices i and j in the Nth vertex, p _i represents the target feature vector corresponding to vertex i, p _j represents the target feature vector corresponding to vertex j, W is a trainable weight matrix, the initial value of the weight matrix can be manually set, and then the value of the weight matrix is iteratively updated during the training process of the decoding network. Assuming that both p _i and p _j are vectors of 1*1036, the weight matrix W is a matrix of 1036*1036, so that the calculated correlation weight will be a value, indicating the correlation between vertices i and j.

In addition, the following formula (1.2) can also be used to process the obtained correlation weights corresponding to each vertex, so as to ensure that the sum of the respective correlation weights for a certain vertex is 1:

a _i ＝softmax(e _i ) (1.2)

Among them, a _i represents e _i after softmax reduction, and e _i is a vector obtained by concatenating e _{i and j} according to the jth dimension, which represents the correlation weight between all other vertices except vertex i and vertex i.

The fused target feature vector corresponding to vertex i can be expressed by the following formula (1.3):

Among them, A _i represents the fused target feature vector corresponding to vertex i, and a _i,j represents the reduced correlation weight between vertex j and vertex i.

Assuming that the first feature matrix is a matrix of 2562*1036, after inputting the matrix into the decoding network, each stacked decoding module in the decoding network will gradually perform a dimension reduction operation on the matrix (realized by a fully connected layer), and finally obtain a 2562 The result matrix of *3 represents the converted three-dimensional position coordinates corresponding to 2562 vertices respectively.

105. Reconstruct and obtain a target three-dimensional model corresponding to the target object according to the target position coordinates corresponding to each of the vertices and the connection relationship data between the vertices.

Finally, according to the target position coordinates corresponding to each vertex, the position of each vertex in the reconstructed 3D model can be determined, and then combined with the connection relationship data between each vertex contained in the grid template, a new 3D model can be constructed. model, as the target three-dimensional model corresponding to the target object.

In one embodiment of the present application, after reconstruction obtains the target three-dimensional model corresponding to the target object, it may also include:

(1) calculating the size of all dihedral angles that the target three-dimensional model has according to the position coordinates of each vertex of the target three-dimensional model;

(2) calculating the smoothing loss according to the size of all dihedral angles;

(3) Optimizing and updating parameters of the decoding network based on the smoothing loss.

After the target three-dimensional model is constructed, the size of each dihedral angle of the target three-dimensional model can be easily calculated because the coordinates of each vertex and the connection relationship between the vertices are known. Then, the smoothing loss can be calculated according to the size of all dihedral angles, and the smoothing loss is used as the objective function to optimize and update the parameters of the decoding network.

The smoothing loss is calculated by the following formula (1.4):

Wherein, L _smooth represents the smoothing loss, θ _i,j represents the dihedral angle between any two planes of the target 3D model, and F represents all planes of the target 3D model. In the process of using the grid template to fit the target 3D model, the connection relationship between the vertices is unchanged, so each dihedral angle can be conveniently calculated according to the coordinates of each vertex, and then the smoothing loss can be calculated by formula (1.4) .

Since the surface of the artificial object in the indoor scene is usually smooth, and the 3D model reconstruction of the coordinates of a single vertex, due to the generalization of the neural network and other reasons, there is often a lot of noise when reconstructing the surface of the object model. resulting in uneven surfaces. In order to solve this problem, the embodiment of the present application introduces a smoothing loss to train the neural network and constrains the smoothness of the surface of the object, which can make the surface of the reconstructed 3D model smoother and improve the effect of model reconstruction.

As shown in FIG. 6 , it is a schematic diagram of the operation of the object model reconstruction method provided by the embodiment of the present application. First, obtain a picture of the target object, use the encoding network to process the picture, and obtain the corresponding feature vector; then, splice the feature vector with the category vector corresponding to the target object, and combine it with the vertices in the grid template The coordinates are spliced; then, the spliced feature matrix is input into the decoding network, which is composed of stacked decoding modules. Each decoding module includes a fully connected layer, an attention mechanism layer, and a batch normalization layer in sequence. The attention mechanism is used to obtain the target position coordinates of each vertex after conversion; finally, according to the target position coordinates corresponding to each vertex and the connection relationship data between each vertex, the 3D model corresponding to the target object is reconstructed . In addition, the smoothing loss can be calculated according to each dihedral angle in the reconstructed 3D model, and the decoding network can be optimized and trained according to the smoothing loss, so as to improve the smoothness of the surface of the obtained 3D model.

As shown in FIG. 7 , it is a schematic diagram of the processing effect of the object model reconstruction method proposed in this application. Among them, the five 3D models at the top of Figure 7 are reconstructed three-dimensional models not obtained by using the inter-point attention mechanism, and the five three-dimensional models at the bottom of Figure 7 are corresponding reconstructed three-dimensional models obtained by using the inter-point attention mechanism. It can be seen that there are many unnatural protrusions and depressions in the five 3D models at the top of Figure 7 (see the dotted line box in the figure), but these protrusions do not exist in the five 3D models at the bottom of Figure 7 and depressions, the reconstruction of the 3D model is better.

In order to verify the 3D model reconstruction effect of this application, a 3D model reconstruction test is now carried out using the same data set as the original Total3D model in the prior art. The input of the model is a spherical grid template with 2562 vertices and a 224*224 input picture. Table 1 below shows the calculation model of this application and the original Total3D model and AtlasNet model in the prior art in the Pix3D data set In the above, a total of 9 categories of 3D model reconstruction accuracy comparisons on real scene indoor objects. Among them, the bevel angle distance reflects the position deviation between the vertices of the reconstructed object model and the true value, and the normal vector distance reflects the normal vector deviation between the reconstructed object surface and the true value surface. According to the comparison of the three-dimensional model reconstruction indicators shown in Table 1, it can be known that the calculation model proposed by this application can obtain smaller position deviation and normal vector deviation compared with the original Total3D model and AtlasNet model in the prior art, that is, it can effectively improve The reconstruction effect of the 3D model.

Table 1

Fig. 8 is a comparison diagram of the 3D model reconstruction results obtained by the original Total3D model in the present application and the prior art, wherein the left column is the input image, the middle column is the 3D model reconstruction result obtained by using the Total3D original model, and the right side A column of is the reconstruction result of the 3D model obtained by this application. It can be seen that a more accurate and smooth three-dimensional object model can be generated by using the calculation model proposed in this application.

It should be understood that the sequence numbers of the steps in the above embodiments do not mean the order of execution, and the execution order of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application .

The above mainly describes a method for reconstructing an object model, and a device for reconstructing an object model will be described below.

Please refer to FIG. 9, an embodiment of an object model reconstruction device in the embodiment of the present application includes:

A data acquisition module 801, configured to acquire a preset grid template and an original image containing the target object, the grid template including the initial position coordinates of each vertex of the original 3D model and the connection relationship data between the various vertices;

A feature encoding module 802, configured to input the original image into a pre-built encoding network for processing, and output an initial feature vector corresponding to the original image, and the encoding network is a neural network for extracting image features;

A vector fusion module 803, configured to fuse the initial feature vectors and the initial position coordinates of the respective vertices to obtain a first feature matrix, the first feature matrix including target feature vectors corresponding to each of the vertices;

A feature decoding module 804, configured to input the first feature matrix into a pre-built decoding network for processing, and output a second feature matrix, the second feature matrix includes target position coordinates corresponding to each of the vertices, and the decoding The network is a neural network including a fully connected layer and an attention mechanism layer, and the attention mechanism layer is used to fuse each vertex according to the correlation between each vertex and the vertex for each vertex in the vertex. The target feature vectors corresponding to the vertices respectively are obtained to obtain the fused target feature vectors corresponding to the vertexes, and the fused target feature vectors are used to determine the target position coordinates corresponding to the vertexes;

The model reconstruction module 805 is configured to reconstruct the target three-dimensional model corresponding to the target object according to the target position coordinates corresponding to each of the vertices and the connection relationship data between the vertices.

In one embodiment of the present application, the reconstruction device of the object model may also include:

A category vector acquisition module, configured to acquire a category vector corresponding to the target object, where the category vector is used to represent the object category to which the target object belongs;

A vector splicing module, configured to splice the category vector and the initial feature vector to obtain a spliced feature vector;

The vector fusion module can specifically be used for:

In one embodiment of the present application, the vector fusion module may include:

A matrix representation unit, configured to represent the initial position coordinates of each vertex as a matrix of dimension N*3, where N is the number of each vertex;

A vector splicing unit, configured to splice the initial feature vector and the matrix of dimension N*3 on the second dimension to obtain the first feature matrix of dimension N*(3+X), where X is the initial The number of elements in the eigenvector.

In one embodiment of the present application, the decoding network includes a plurality of cascaded decoding modules, each of which includes a fully connected layer, an attention mechanism layer, and a batch normalization layer in turn, and the feature decoding module Can include:

A first processing unit, configured to input the first feature matrix into the fully connected layer of the first decoding module of the decoding network for processing, and output a first intermediate matrix;

The second processing unit is configured to input the first intermediate matrix into the attention mechanism layer of the first decoding module for processing, and output a second intermediate matrix;

A third processing unit, configured to splice the second intermediate matrix and the first intermediate matrix to obtain a third intermediate matrix;

A fourth processing unit, configured to input the third intermediate matrix into the batch normalization layer of the first decoding module for processing to obtain a fourth intermediate matrix;

The fifth processing unit is configured to input the fourth intermediate matrix into the second decoding module of the decoding network, and continue to use the same processing method as that of the first decoding module until the final result obtained by the decoding network is obtained. The second feature matrix output by a decoding module.

Further, the first intermediate matrix includes target feature vectors corresponding to each of the vertices, and the second processing unit may specifically be used for:

A dihedral angle calculation module, configured to calculate the size of all dihedral angles of the target three-dimensional model according to the position coordinates of each vertex of the target three-dimensional model;

A smoothing loss calculation module, configured to calculate the smoothing loss according to the size of all dihedral angles;

A network parameter optimization module, configured to optimize and update parameters of the decoding network based on the smoothing loss.

Further, the smoothing loss calculation module is specifically used for:

The smoothing loss is calculated by the following formula:

The embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, any method for reconstructing an object model as shown in FIG. 1 is implemented.

The embodiment of the present application also provides a computer program product, which, when the computer program product is run on a terminal device, enables the terminal device to implement any method for reconstructing an object model as shown in FIG. 1 .

Fig. 10 is a schematic diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 10 , the terminal device 9 of this embodiment includes: a processor 90 , a memory 91 , and a computer program 92 stored in the memory 91 and operable on the processor 90 . When the processor 90 executes the computer program 92, it implements the steps in the embodiments of the reconstruction method of each object model mentioned above, such as steps 101 to 105 shown in FIG. 1 . Alternatively, when the processor 90 executes the computer program 92, it realizes the functions of the modules/units in the above-mentioned device embodiments, for example, the functions of the modules 801 to 805 shown in FIG. 9 .

The computer program 92 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 91 and executed by the processor 90 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 92 in the terminal device 9 .

The so-called processor 90 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The storage 91 may be an internal storage unit of the terminal device 9 , such as a hard disk or memory of the terminal device 9 . The memory 91 can also be an external storage device of the terminal device 9, such as a plug-in hard disk equipped on the terminal device 9, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Further, the memory 91 may also include both an internal storage unit of the terminal device 9 and an external storage device. The memory 91 is used to store the computer program and other programs and data required by the terminal device. The memory 91 can also be used to temporarily store data that has been output or will be output.

Those skilled in the art can clearly understand that for the convenience and brevity of description, only the division of the above-mentioned functional units and modules is used for illustration. In practical applications, the above-mentioned functions can be assigned to different functional units, Completion of modules means that the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit, and the above-mentioned integrated units may adopt hardware It can also be implemented in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, and details will not be repeated here.

Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the above-described system, device and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the above-mentioned embodiments, the descriptions of each embodiment have their own emphases, and for parts that are not detailed or recorded in a certain embodiment, refer to the relevant descriptions of other embodiments.

Those skilled in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.

In the embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the system embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be Incorporation may either be integrated into another system, or some features may be omitted, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments in the present application can also be completed by instructing related hardware through computer programs. The computer programs can be stored in a computer-readable storage medium, and the computer When the program is executed by the processor, the steps in the above-mentioned various method embodiments can be realized. . Wherein, the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunication signal and software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, computer-readable media Excluding electrical carrier signals and telecommunication signals.

The above-described embodiments are only used to illustrate the technical solutions of the present application, rather than to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: it can still implement the foregoing embodiments Modifications to the technical solutions described in the examples, or equivalent replacements for some of the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the various embodiments of the application, and should be included in the Within the protection scope of this application.

Claims

A method for reconstructing an object model, characterized in that it comprises:

Obtaining a preset grid template and an original image containing the target object, the grid template including the initial position coordinates of each vertex of the original three-dimensional model and connection relationship data between the various vertices;

Inputting the original image into a pre-built encoding network for processing, outputting an initial feature vector corresponding to the original image, the encoding network being a neural network for extracting image features;

Fusing the initial feature vector and the initial position coordinates of each vertex to obtain a first feature matrix, the first feature matrix includes target feature vectors corresponding to each of the vertices;

The first feature matrix is input into a pre-built decoding network for processing, and the second feature matrix is output. The second feature matrix includes the target position coordinates corresponding to each of the vertices. The decoding network is composed of a fully connected layer and a The neural network of the attention mechanism layer, the attention mechanism layer is used to fuse the targets corresponding to each of the vertices respectively according to the correlation between each of the vertices and the vertices for each of the vertices A feature vector is obtained to obtain a fused target feature vector corresponding to the vertex, and the fused target feature vector is used to determine the target position coordinates corresponding to the vertex;

A target three-dimensional model corresponding to the target object is reconstructed according to the target position coordinates corresponding to each of the vertices and the connection relationship data between the vertices.
The method according to claim 1, further comprising:

Acquiring a category vector corresponding to the target object, where the category vector is used to represent the object category to which the target object belongs;

splicing the category vector and the initial feature vector to obtain a spliced feature vector;

The fusion of the initial feature vector and the initial position coordinates of each vertex is specifically:

The spliced feature vectors are fused with the initial position coordinates of each vertex.
The method according to claim 1, wherein said initial feature vector and said initial position coordinates of each vertex are fused to obtain a first feature matrix, comprising:

Expressing the initial position coordinates of each vertex as a matrix of dimension N*3, N being the quantity of each vertex;

Splicing the initial feature vector and the matrix of dimension N*3 in the second dimension to obtain the first feature matrix of dimension N*(3+X), where X is the number of elements of the initial feature vector.
The method according to claim 1, wherein the decoding network comprises a plurality of cascaded decoding modules, and each of the decoding modules comprises a fully connected layer, an attention mechanism layer and a batch normalization layer in turn, so The first feature matrix is input into a pre-built decoding network for processing, and the second feature matrix is output, including:

Input the first feature matrix into the fully connected layer of the first decoding module of the decoding network for processing, and output the first intermediate matrix;

Input the first intermediate matrix into the attention mechanism layer of the first decoding module for processing, and output the second intermediate matrix;

splicing the second intermediate matrix and the first intermediate matrix to obtain a third intermediate matrix;

Inputting the third intermediate matrix into the batch normalization layer of the first decoding module for processing to obtain a fourth intermediate matrix;

Input the fourth intermediate matrix into the second decoding module of the decoding network, and continue to use the same processing method as that of the first decoding module until the output of the last decoding module of the decoding network is obtained. Second feature matrix.
The method according to claim 4, wherein the first intermediate matrix includes target feature vectors corresponding to each of the vertices, and the first intermediate matrix is input to the attention of the first decoding module. The force mechanism layer processes and outputs the second intermediate matrix, including:

For each of the vertices, the correlation weights between each of the vertices and the vertices are calculated according to the trainable weight matrix, and then the target feature vectors corresponding to each of the vertices are respectively corresponding to The correlation weights are weighted and summed to obtain the fused target feature vector corresponding to the vertex, and the second intermediate matrix is a matrix composed of the fused target feature vectors corresponding to each of the vertices.
The method according to any one of claims 1 to 5, characterized in that, after reconstructing the target three-dimensional model corresponding to the target object, further comprising:

calculating the size of all dihedral angles of the target three-dimensional model according to the position coordinates of each vertex of the target three-dimensional model;

Calculate the smoothing loss according to the size of all dihedral angles;

Optimizing and updating parameters of the decoding network based on the smoothing loss.
The method according to claim 6, wherein the smoothing loss is calculated according to the size of all the dihedral angles, specifically:

The smoothing loss is calculated by the following formula:

Wherein, L smooth represents the smoothing loss, θ i, j represents the dihedral angle between any two planes i, j of the target 3D model, and F represents all planes of the target 3D model.
A device for reconstructing an object model, characterized in that it comprises:

A data acquisition module, configured to acquire a preset grid template and an original image containing the target object, the grid template including the initial position coordinates of each vertex of the original three-dimensional model and the connection relationship data between the various vertices;

A feature encoding module, configured to input the original image into a pre-built encoding network for processing, and output an initial feature vector corresponding to the original image, and the encoding network is a neural network for extracting image features;

A vector fusion module, configured to fuse the initial feature vector and the initial position coordinates of each vertex to obtain a first feature matrix, the first feature matrix includes target feature vectors corresponding to each of the vertices;

A feature decoding module, configured to input the first feature matrix into a pre-built decoding network for processing, and output a second feature matrix, the second feature matrix includes target position coordinates corresponding to each of the vertices, and the decoding network It is a neural network comprising a fully connected layer and an attention mechanism layer, and the attention mechanism layer is used to fuse each of the vertices according to the correlation between each of the vertices and the vertices for each of the vertices. The target feature vector corresponding to the vertex respectively, obtains the fused target feature vector corresponding to the vertex, and the fused target feature vector is used to determine the target position coordinates corresponding to the vertex;

The model reconstruction module is used to reconstruct the target three-dimensional model corresponding to the target object according to the target position coordinates corresponding to each of the vertices and the connection relationship data between the vertices.
A terminal device, comprising a memory, a processor, and a computer program stored in the memory and operable on the processor, characterized in that, when the processor executes the computer program, the following claims 1 to 1 are implemented. The reconstruction method of the object model described in any one of 7.
A computer-readable storage medium, the computer-readable storage medium stores a computer program, characterized in that, when the computer program is executed by a processor, the object model according to any one of claims 1 to 7 is realized rebuild method.