WO2023019478A1 - Three-dimensional reconstruction method and apparatus, electronic device, and readable storage medium - Google Patents


Info

Publication number
WO2023019478A1
Authority
WO
WIPO (PCT)
Prior art keywords: reconstructed, functional module, graph, output, human body
Application number
PCT/CN2021/113308
Other languages: French (fr), Chinese (zh)
Inventors: 王磊, 刘薰裕, 马晓亮, 刘宝玉, 程俊
Original Assignee: 深圳先进技术研究院, 中国科学院深圳理工大学(筹)
Application filed by 深圳先进技术研究院 and 中国科学院深圳理工大学(筹); priority to PCT/CN2021/113308; published as WO2023019478A1.


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/08 - Learning methods
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics

Definitions

  • the present application belongs to the technical field of image processing, and in particular relates to a three-dimensional reconstruction method, a three-dimensional reconstruction device, electronic equipment, and a computer-readable storage medium.
  • the 3D reconstruction of human body parts has always been a hot topic in computer vision, and has been widely used in the fields of virtual reality (VR) and augmented reality (AR).
  • conventional 3D reconstruction techniques rely on complex and expensive equipment, such as 3D scanners, multi-view cameras or inertial sensors.
  • although 3D reconstruction techniques based on a single image have been developed, these techniques still suffer from problems such as unstable reconstruction effects.
  • the present application provides a three-dimensional reconstruction method, a three-dimensional reconstruction device, an electronic device and a computer-readable storage medium, which can solve the problem of unstable reconstruction effect existing in the existing three-dimensional reconstruction technology.
  • the present application provides a three-dimensional reconstruction method, including:
  • the present application provides a three-dimensional reconstruction device, including:
  • the extraction module is used to perform feature extraction on the image of the object to be reconstructed to obtain a feature vector, wherein the above-mentioned feature vector is used to represent the shape feature information of the above-mentioned object to be reconstructed;
  • a generation module configured to generate a feature map according to the above-mentioned feature vector and a preset template for the above-mentioned object to be reconstructed, and the above-mentioned preset template is used to represent the three-dimensional structure information of the above-mentioned object to be reconstructed;
  • the reconstruction module is configured to input the above-mentioned feature map into the trained graph convolutional neural network to obtain the three-dimensional reconstruction result of the above-mentioned object to be reconstructed.
  • the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • when the processor executes the computer program, the steps of the method in the above-mentioned first aspect are implemented.
  • the present application provides a computer-readable storage medium, wherein the above-mentioned computer-readable storage medium stores a computer program, and when the above-mentioned computer program is executed by one or more processors, the steps of the method in the above-mentioned first aspect are implemented.
  • the present application provides a computer program product, which, when the computer program product is run on an electronic device, causes the electronic device to execute the three-dimensional reconstruction method proposed in the first aspect.
  • the beneficial effect of the present application compared with the prior art is: for the image of the object to be reconstructed, the present application first performs feature extraction on the image to obtain a feature vector characterizing the shape feature information of the object to be reconstructed, then combines the feature vector with a preset template for the object to be reconstructed to generate a feature map, and finally inputs the feature map into the trained graph convolutional neural network to obtain the 3D reconstruction result of the object to be reconstructed.
  • the finally generated feature map not only contains the shape features of the object to be reconstructed, but also obtains the three-dimensional structure information of the object to be reconstructed displayed by the preset template,
  • the trained graph convolutional neural network can better process the feature map, and the obtained 3D reconstruction results will be more accurate, ensuring the stability of 3D reconstruction.
  • Fig. 1 is a schematic diagram of the implementation flow of the three-dimensional reconstruction method provided by the embodiment of the present application;
  • FIG. 2 is an example diagram of a preset template when the object to be reconstructed is a human body provided by the embodiment of the present application;
  • Fig. 3 is a schematic structural diagram of the first functional module of the graph convolutional neural network provided by the embodiment of the present application;
  • Fig. 4 is a schematic structural diagram of the i-th functional module of the graph convolutional neural network provided by the embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of the Nth functional module of the graph convolutional neural network provided by the embodiment of the present application.
  • Fig. 6 is an example diagram of the overall structure of the graph convolutional neural network provided by the embodiment of the present application.
  • FIG. 7 is an example diagram of the working framework of the three-dimensional reconstruction method provided by the embodiment of the present application.
  • FIG. 8 is a structural block diagram of a three-dimensional reconstruction device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the embodiment of the present application proposes a 3D reconstruction method, a 3D reconstruction device, an electronic device, and a computer-readable storage medium. After the feature vector of the object to be reconstructed is extracted from the image, it can be combined with a preset template representing the three-dimensional structure information of the object to be reconstructed to obtain a feature map. The feature map thus contains, on top of the shape features of the object to be reconstructed, the three-dimensional structure information displayed by the preset template; it can therefore be better processed by the trained graph convolutional neural network, yielding more accurate 3D reconstruction results and ensuring the stability of 3D reconstruction.
  • specific examples will be used below to illustrate.
  • Step 101: perform feature extraction on an image of an object to be reconstructed to obtain a feature vector.
  • the electronic device may photograph the object to be reconstructed with its own camera to obtain the image of the object to be reconstructed; alternatively, a third-party device equipped with a camera may photograph the object and transmit the resulting image to the electronic device in a wireless or wired manner. The way in which the image of the object to be reconstructed is acquired is not limited here.
  • after the electronic device obtains the image of the object to be reconstructed, it can perform feature extraction on the image to obtain a feature vector.
  • the feature extraction operation here is mainly aimed at the shape feature information of the object to be reconstructed; that is, the obtained feature vector is actually used to characterize the shape feature information of the object to be reconstructed.
  • the shape feature information includes contour features describing the boundary shape of the object to be reconstructed and/or region features describing the internal shape of the object to be reconstructed, etc., which are not limited herein.
  • before feature extraction, the electronic device can first preprocess the image, for example with segmentation and size-adjustment operations; step 101 may then include the following sub-steps:
  • A1: Segment the image based on the object to be reconstructed to obtain a partial image. Specifically, the electronic device may first perform frame detection on the image, that is, recognize the frame (bounding box) of the object to be reconstructed in the image.
  • the frame is usually a rectangular frame; of course, it can also be a frame of other preset shapes, which is not limited here.
  • the image area within this bounding box can be regarded as a region of interest. Segmenting the image based on the frame separates the region of interest from the rest of the image, removing to a certain extent the large amount of redundant and noise information contained in the background, and yielding a partial image of the object to be reconstructed, with the object located as close to the center of the partial image as possible.
  • the electronic device may use the human body two-dimensional key point detection technology OpenPose to perform border detection on the image.
  • A2: Adjust the size of the partial image to a preset size.
  • the electronic device can unify the size of the partial image: if the size of the partial image is inconsistent with the preset size, the partial image is scaled until the size of the partial image is consistent with the preset size .
  • the preset size may be: 224 ⁇ 224, and the unit is pixel.
  • the number of channels of the partial image is usually 3, which is used to represent three channels of red, green and blue (Red Green Blue, RGB). That is, the final size of the partial image is: 224 ⁇ 224 ⁇ 3.
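As a concrete illustration of sub-steps A1 and A2, the crop-and-resize preprocessing can be sketched as follows. This is a minimal sketch: the bounding box is assumed to be already detected (e.g. by OpenPose), and nearest-neighbour interpolation stands in for whatever resampling an actual implementation uses.

```python
import numpy as np

def preprocess(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the detected bounding box and resize to 224x224x3 (nearest neighbour)."""
    x0, y0, x1, y1 = box                  # bounding box of the object to be reconstructed
    crop = image[y0:y1, x0:x1]            # region of interest
    h, w = crop.shape[:2]
    ys = np.arange(224) * h // 224        # nearest-neighbour row indices
    xs = np.arange(224) * w // 224        # nearest-neighbour column indices
    return crop[ys][:, xs]                # 224 x 224 x 3

img = np.zeros((480, 640, 3), dtype=np.uint8)     # placeholder camera frame
out = preprocess(img, (100, 50, 300, 450))        # hypothetical detected box
print(out.shape)  # (224, 224, 3)
```

The 224×224×3 output matches the preset size stated above and can be fed directly to the encoder.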
  • A3: Perform feature extraction on the resized partial image with an encoder built from a convolutional neural network (CNN) to obtain the feature vector.
  • the electronic device can pre-train the convolutional neural network for classification tasks on a given data set in advance, and the pre-training process can refer to the current general training process for neural networks, which will not be repeated here.
  • the classification layer in the convolutional neural network is removed, and the feature extraction layer before the classification layer is retained, and the retained result constitutes the encoder.
  • the convolutional neural network may be ResNet50
  • the given data set may be the ImageNet data set
  • the output of the encoder of the convolutional neural network is a 2048-dimensional feature vector.
  • Step 102: generate a feature map according to the feature vector and the preset template for the object to be reconstructed.
  • the preset template is used to characterize the three-dimensional structure information of the object to be reconstructed, specifically the three-dimensional structure information of the object to be reconstructed in a specified pose.
  • for different objects to be reconstructed, different preset templates are set: when the object to be reconstructed is a human body, the preset template is a human body mesh; when the object to be reconstructed is a hand, the preset template is a hand mesh.
  • specifically, the electronic device can express the preset template as a graph structure and combine it with the feature vector to generate the feature map.
  • taking the human body as an example, step 102 may include:
  • the human body mesh map adopted by the electronic device for the human body may be a standard template defined by the SMPL (Skinned Multi-Person Linear) model.
  • the human body mesh diagram represents the three-dimensional mesh of the human body under a T-pose (T-Pose).
  • in order to reduce the computational complexity of subsequent graph convolution operations, the electronic device can perform 4× downsampling on the standard template defined by the SMPL model and use the downsampling result as the preset template (that is, the human body mesh). The number of vertices of the human body mesh obtained after 4× downsampling is reduced to 1723. Afterwards, it is only necessary to perform 4× upsampling on the 3D reconstruction result output by the graph convolutional neural network to obtain the final 3D reconstruction result and complete the reconstruction task.
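The 4× down/up-sampling described above can be sketched with precomputed sampling matrices. This is an assumption-laden sketch: real implementations derive D and U once from mesh simplification of the SMPL template, whereas random row-normalized matrices stand in for them here; only the 6890/1723 vertex counts come from the text.

```python
import numpy as np

V_FULL, V_DOWN = 6890, 1723   # SMPL vertex count and its ~4x downsampled count

rng = np.random.default_rng(0)
# Placeholder sampling matrices (assumption): in practice they are precomputed
# once from mesh simplification of the SMPL template.
D = rng.random((V_DOWN, V_FULL)); D /= D.sum(axis=1, keepdims=True)  # downsample
U = rng.random((V_FULL, V_DOWN)); U /= U.sum(axis=1, keepdims=True)  # upsample

template = rng.standard_normal((V_FULL, 3))   # placeholder T-pose vertices
template_down = D @ template                  # preset template: 1723 x 3
reconstructed = U @ template_down             # back to full resolution: 6890 x 3
print(template_down.shape, reconstructed.shape)
```

The same U matrix is what turns the network's 1723-vertex output back into a full 6890-vertex mesh at the end of the pipeline.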
  • M_h(V_h, A_h), V_h ∈ R^{1723×3}, A_h ∈ {0,1}^{1723×1723}, where:
  • M_h represents the vertex information matrix of the downsampled human body mesh;
  • V_h represents the vertex set of the downsampled human body mesh;
  • V_h ∈ R^{1723×3} indicates that the three-dimensional coordinates of the 1723 vertices contained in the vertex set are all real numbers;
  • A_h represents the adjacency matrix of the downsampled human body mesh; the meaning of the adjacency matrix has been explained above and will not be repeated here;
  • A_h ∈ {0,1}^{1723×1723} indicates that the values of the elements in the adjacency matrix are 0 or 1.
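An adjacency matrix of this kind can be built directly from the mesh's edge list; a toy sketch (the four edges here are illustrative, not the SMPL connectivity):

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]   # toy mesh connectivity (assumption)
n = 4                                      # toy vertex count
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1                  # undirected mesh: symmetric 0/1 matrix
print(A.sum())  # 8: each of the 4 edges is counted twice
```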
  • assuming that the 2048-dimensional feature vector obtained in step 101 is f ∈ R^{2048} and the vertex information matrix of the 4× downsampled human body mesh obtained in step B1 is M_h ∈ R^{1723×3}, fusing the two yields the feature map, which is the input of the subsequent graph convolutional neural network and can be expressed as F_in ∈ R^{1723×2051}.
  • the above operation can be understood as: splicing the 2048-dimensional feature vector to each vertex.
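The splicing operation above amounts to tiling the same image feature vector onto every template vertex; a minimal numpy sketch:

```python
import numpy as np

f = np.random.randn(2048)          # image feature vector from the encoder
M_h = np.random.randn(1723, 3)     # downsampled template vertex coordinates

# Splice the 2048-dim feature vector onto each of the 1723 vertices:
F_in = np.concatenate([M_h, np.tile(f, (1723, 1))], axis=1)
print(F_in.shape)  # (1723, 2051)
```

The 2051 columns are the 3 vertex coordinates plus the 2048 shared image features, matching F_in ∈ R^{1723×2051} above.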
  • the process of generating the feature map is similar to steps B1 and B2, except that the preset template has been changed.
  • the hand mesh adopted by the electronic device for the hand may be a standard template defined by the MANO (hand Model with Articulated and Non-rigid defOrmations) model. It should be noted that since the hand mesh contains a relatively small number of vertices (just over 700), there is no need to perform downsampling operations on it.
  • the feature map actually combines the vertex position information of the preset template and the shape feature information of the part to be reconstructed represented in the image.
  • Step 103: input the feature map into the trained graph convolutional neural network to obtain the 3D reconstruction result of the object to be reconstructed.
  • the graph convolutional neural network will use the feature map as input, and finally output the transformed grid vertex position information of the object to be reconstructed as the 3D reconstruction result.
  • the following is an introduction to the overall structure of the graph convolutional neural network:
  • the graph convolutional neural network includes N functional modules in series; wherein, the input of the first functional module is the input of the graph convolutional neural network, and the output of the Nth functional module is the output of the graph convolutional neural network, and N is an integer greater than 2.
  • the first functional module is mainly used to receive the input of the graph convolutional neural network
  • the i-th functional module is mainly used for data calculation and transmission operations
  • the N-th functional module is mainly used to output the finally predicted mesh vertex position information of the object to be reconstructed, where i is an integer greater than 1 and less than N.
  • each functional module includes the following three basic units, namely: convolution unit, normalization unit and activation function unit.
  • Fig. 3 shows the schematic structure of the first functional module.
  • the first functional module includes at least three specified structures (only three are shown in Figure 3), connected in series, where each specified structure consists of a convolution unit, a normalization unit and an activation function unit connected in sequence. Among the at least three specified structures: the input of the first specified structure is the input of the first functional module (that is, the input of the graph convolutional neural network); the residual between the output of the last specified structure and the input of the first functional module (that is, the input of the graph convolutional neural network) is the output of the first functional module.
  • FIG. 4 shows a structural diagram of the i-th functional module.
  • the i-th functional module includes at least two specified structures (only two are shown in Figure 4), connected in series; each specified structure is the same as in the first functional module and is not described again here.
  • the input of the first specified structure is the output of the i-1th functional module
  • the residual between the output of the last specified structure and the output of the i-1th functional module is the output of the i-th functional module.
  • FIG. 5 shows a schematic structural diagram of the Nth functional module.
  • the Nth functional module includes a convolution unit, a normalization unit, an activation function unit and a second convolution unit connected in series; the input of the first convolution unit is the output of the N-1th functional module, and the residual between the output of the second convolution unit and the output of the N-1th functional module is the output of the Nth functional module. It can be understood that, since the function of the Nth functional module is to output the finally predicted mesh vertex positions of the object to be reconstructed, the data output by its second convolution unit does not require further normalization or activation processing.
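The module structure just described can be sketched as follows. This is only a shape-level sketch: a dense layer stands in for the graph convolution unit, and the feature widths are chosen so the residual connection is valid, since the text does not fix them.

```python
import numpy as np

def specified_structure(x, W):
    """One specified structure: convolution -> normalization -> activation."""
    h = x @ W                                           # stand-in for the convolution unit
    h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)   # normalization unit
    return np.maximum(h, 0.0)                           # activation function unit (ReLU)

def functional_module(x, weights):
    """Chain of specified structures with a residual back to the module input."""
    h = x
    for W in weights:
        h = specified_structure(h, W)
    return h + x   # "residual between the last output and the module input"

rng = np.random.default_rng(1)
x = rng.standard_normal((1723, 64))                 # toy per-vertex features
# Three structures; the last maps back to the input width so the residual is valid
weights = [rng.standard_normal((64, 128)) * 0.05,
           rng.standard_normal((128, 128)) * 0.05,
           rng.standard_normal((128, 64)) * 0.05]
print(functional_module(x, weights).shape)  # (1723, 64)
```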
  • Figure 6 shows an example of the overall structure of a graph convolutional neural network including 4 functional modules.
  • the parameters f_in and f_out can be used to represent the change in feature dimension of each functional module during the graph convolution operation.
  • for example, the first functional module includes 3 convolution units, and the change of its feature dimension can be expressed as (f_in, f_out1, f_out2, f_out3), where f_in is the feature dimension of the initial input and f_out1, f_out2 and f_out3 are the feature dimensions output by the three convolution units respectively. Since the normalization unit and the activation function unit do not change the feature dimension, the feature dimension input to the second convolution unit equals the feature dimension output by the first convolution unit, and so on.
  • as noted above, when the object to be reconstructed is a human body, the obtained feature map is F_in ∈ R^{1723×2051}.
  • after the feature map passes through the graph convolutional neural network, the final output of the network, F_out ∈ R^{1723×3}, is obtained.
  • after 4× upsampling of F_out, M_out ∈ R^{6890×3} is obtained, that is, the final 3D reconstruction result for the human body.
  • the convolution unit may be a Chebyshev convolution unit, and the Chebyshev convolution unit specifically uses Chebyshev polynomials to construct a Chebyshev convolution algorithm.
  • the processing speed of the graph convolutional neural network can be accelerated to a certain extent, and the efficiency of 3D reconstruction can be improved.
  • the Chebyshev convolution can be expressed as F_out = Σ_{k=0}^{K−1} T_k(L̃) F_in Θ_k, where:
  • F_in ∈ R^{N×f_in} represents the input features;
  • F_out ∈ R^{N×f_out} represents the output features;
  • K represents the use of K-order Chebyshev polynomials, T_k(·) being the k-th Chebyshev polynomial;
  • Θ_k ∈ R^{f_in×f_out} represents the feature change matrix, whose parameters are the values the graph convolutional neural network needs to learn.
  • L̃ represents the scaled Laplacian matrix of the preset template. When the object to be reconstructed is a human body and the preset template is the downsampling result of the standard template defined by the SMPL model, N is the number of vertices after downsampling, namely 1723.
  • the scaled Laplacian matrix is specifically: L̃ = 2L_p/λ_max − I, where:
  • I is an identity matrix
  • ⁇ max is the largest eigenvalue of the L p matrix.
  • L is an intermediate parameter, which has no actual physical meaning.
  • W is the parameter that the graph convolutional neural network needs to learn.
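The K-order Chebyshev convolution with the scaled Laplacian can be sketched in numpy as follows. This is a sketch: dense matrices are used for clarity where real implementations use sparse operations, and the 4-node toy graph is illustrative; L_p is taken to be the normalized graph Laplacian, an assumption consistent with the definitions above.

```python
import numpy as np

def scaled_laplacian(A):
    """L_tilde = 2 * L_p / lambda_max - I, with L_p = I - D^(-1/2) A D^(-1/2)."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_p = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    lam_max = np.linalg.eigvalsh(L_p).max()
    return 2.0 * L_p / lam_max - np.eye(len(A))

def cheb_conv(F_in, A, thetas):
    """F_out = sum_k T_k(L_tilde) @ F_in @ Theta_k via the Chebyshev recurrence."""
    L = scaled_laplacian(A)
    T_prev, T_curr = np.eye(len(A)), L                  # T_0 = I, T_1 = L_tilde
    out = F_in @ thetas[0]                              # k = 0 term
    for k in range(1, len(thetas)):
        out = out + T_curr @ F_in @ thetas[k]
        T_prev, T_curr = T_curr, 2.0 * L @ T_curr - T_prev  # T_{k+1} = 2 L T_k - T_{k-1}
    return out

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)               # toy 4-vertex graph
F_in = rng.standard_normal((4, 5))                      # N=4 vertices, f_in=5
thetas = [rng.standard_normal((5, 2)) * 0.1 for _ in range(3)]  # K=3, f_out=2
print(cheb_conv(F_in, A, thetas).shape)  # (4, 2)
```

The recurrence avoids ever computing matrix powers of L̃ explicitly, which is the efficiency gain the text attributes to the Chebyshev formulation.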
  • for example, the graph convolutional neural network can be trained using two datasets, Human3.6M and MSCOCO. Since these two datasets do not store a real human body mesh for each training sample, but only the positions of the real human body's 3D joints, a high-precision real human body mesh must be fitted in advance from the real human 3D joint positions of each training sample. The fitted real human body mesh can then be used as a strong label in the training process of the graph convolutional neural network. That is to say, the real human body mesh mentioned here is actually a high-precision result fitted from the real human body's 3D joints.
  • the training process of this graph convolutional neural network is basically the same as that of a general neural network, except that a new loss function is used so that the 3D reconstruction results output by the trained model are smoother and more complete, making it more practical.
  • specifically, the loss function is: L = λ_a·L_v + λ_b·L_j + λ_c·L_n + λ_d·L_e, where:
  • ⁇ a , ⁇ b , ⁇ c and ⁇ d are hyperparameters.
  • Lv denotes the mesh loss, which is used to describe the positional difference between the real body mesh and the predicted body mesh.
  • M * represent the position of each vertex of the real human body mesh
  • M represent the position of each vertex of the predicted human body mesh
  • using the L1 loss, the mesh loss L_v is expressed as: L_v = ‖M − M*‖_1.
  • L_j represents the 3D joint loss, which is used to describe the position difference between the real human 3D joints and the predicted human 3D joints; it can likewise be expressed with the L1 loss as L_j = ‖JM − J^{3D*}‖_1, where:
  • J^{3D*} represents the positions of the real human body's 3D joints;
  • JM represents the predicted positions of the human body's 3D joints;
  • J ∈ R^{v×N} represents the matrix that extracts joints from the human body mesh;
  • M represents the position of each vertex of the predicted human body mesh.
  • L_n represents the surface normal loss, which is used to describe the angle difference between the normal vectors of the triangular faces of the real human mesh and those of the predicted human mesh; it can be expressed as L_n = Σ_f Σ_{(i,j)⊂f} |⟨(m_i − m_j)/‖m_i − m_j‖, n_f*⟩|, where:
  • f represents a triangular face of the predicted human body mesh;
  • n_f* represents the unit normal vector of the triangular face corresponding to f in the real human body mesh;
  • m_i and m_j represent the coordinates of two vertices in f.
  • L_e represents the surface edge loss, which is used to describe the length difference between the edge lengths of the triangular faces of the real human mesh and those of the predicted human mesh, where:
  • f represents a triangular face of the predicted human body mesh;
  • m_i and m_j represent the coordinates of two vertices in f;
  • m_i* and m_j* represent the vertex coordinates corresponding to m_i and m_j in the real human body mesh;
  • the surface edge loss L_e is expressed as: L_e = Σ_f Σ_{(i,j)⊂f} | ‖m_i − m_j‖ − ‖m_i* − m_j*‖ |.
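The four loss terms can be sketched directly from their definitions. This is a sketch under stated assumptions: L1 norms are used throughout as described for the mesh loss, the normal and edge losses sum over the three edges of each triangle, and the single-triangle mesh and the joint matrix J are toy stand-ins.

```python
import numpy as np

def mesh_loss(M, M_star):
    """L_v: L1 distance between predicted and real mesh vertices."""
    return np.abs(M - M_star).sum()

def joint_loss(M, J, J3D_star):
    """L_j: joints regressed from the predicted mesh (JM) vs. real 3D joints."""
    return np.abs(J @ M - J3D_star).sum()

def edge_loss(M, M_star, faces):
    """L_e: difference between predicted and real triangle edge lengths."""
    total = 0.0
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            total += abs(np.linalg.norm(M[i] - M[j])
                         - np.linalg.norm(M_star[i] - M_star[j]))
    return total

def normal_loss(M, M_star, faces):
    """L_n: predicted edges should be perpendicular to the real face normals."""
    total = 0.0
    for a, b, c in faces:
        n = np.cross(M_star[b] - M_star[a], M_star[c] - M_star[a])
        n /= np.linalg.norm(n)                       # unit normal of the real face
        for i, j in ((a, b), (b, c), (c, a)):
            e = (M[i] - M[j]) / np.linalg.norm(M[i] - M[j])
            total += abs(e @ n)
    return total

# Toy single-triangle "mesh": a perfect prediction drives every term to ~0.
M_star = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = [(0, 1, 2)]
J = np.array([[0.5, 0.5, 0.0]])                      # toy joint-extraction matrix
total = (mesh_loss(M_star, M_star) + joint_loss(M_star, J, J @ M_star)
         + normal_loss(M_star, M_star, faces) + edge_loss(M_star, M_star, faces))
print(round(total, 6))  # 0.0
```

The full training loss would weight these four terms with the hyperparameters λ_a, λ_b, λ_c and λ_d as given above.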
  • FIG. 7 shows a working frame example of the three-dimensional reconstruction method in the embodiment of the present application by taking the human body as an example to be reconstructed.
  • the working framework consists of two parts, a convolutional neural network-based encoder and a graph convolutional neural network-based human 3D vertex regressor.
  • the original image of the human body is obtained.
  • the original image is used as the initial input, and the partial image is obtained after preprocessing.
  • the partial image is encoded into a set of feature vectors through an encoder based on a convolutional neural network.
  • the set of feature vectors is fused and spliced with the mesh vertex position information in the preset human body mesh to form a feature map, which serves as the input of the graph convolutional neural network; finally, the graph convolutional neural network regresses a new set of mesh vertex positions that conforms to the two-dimensional observation of the human body in the original image, completing the task of three-dimensional reconstruction of the human body.
  • to sum up, feature extraction is first performed on the image of the object to be reconstructed to obtain the feature vector used to characterize the shape feature information of the object to be reconstructed; the feature vector is then combined with the preset template for the object to be reconstructed to generate a feature map, and finally the feature map is input into the trained graph convolutional neural network to obtain the 3D reconstruction result of the object to be reconstructed.
  • the finally generated feature map not only contains the shape features of the object to be reconstructed, but also obtains the three-dimensional structure information of the object to be reconstructed displayed by the preset template,
  • the trained graph convolutional neural network can better process the feature map, and the obtained 3D reconstruction results will be more accurate, ensuring the stability of 3D reconstruction.
  • an embodiment of the present application further provides a three-dimensional reconstruction device.
  • the three-dimensional reconstruction device 800 includes:
  • the extraction module 801 is configured to perform feature extraction on the image of the object to be reconstructed to obtain a feature vector, wherein the feature vector is used to represent the shape feature information of the object to be reconstructed;
  • the generation module 802 is configured to generate a feature map according to the above-mentioned feature vector and a preset template for the above-mentioned object to be reconstructed, and the above-mentioned preset template is used to represent the three-dimensional structure information of the above-mentioned object to be reconstructed;
  • the reconstruction module 803 is configured to input the above-mentioned feature map into the trained graph convolutional neural network to obtain the three-dimensional reconstruction result of the above-mentioned object to be reconstructed.
  • the graph convolutional neural network includes N functional modules connected in series;
  • the input of the first above-mentioned functional module is the input of the above-mentioned graph convolutional neural network
  • the output of the Nth above-mentioned functional module is the output of the above-mentioned graph convolutional neural network
  • the above-mentioned N is an integer greater than 2;
  • the above functional modules include a convolution unit, a normalization unit and an activation function unit.
  • the first above-mentioned functional module includes at least three specified structures, and the above-mentioned at least three specified structures are connected in sequence, and the above-mentioned specified structure includes the above-mentioned convolution unit, the above-mentioned normalization unit and the above-mentioned activation function unit connected in sequence;
  • the input of the first specified structure is the input of the first above-mentioned function module
  • the residual between the output of the last specified structure and the input of the first above-mentioned function module is the first The output of the above function modules.
  • the i-th above-mentioned functional module includes at least two specified structures, and the above-mentioned at least two specified structures are connected in series, and the above-mentioned specified structure includes the above-mentioned convolution unit, the above-mentioned normalization unit and the above-mentioned activation function unit connected in series, and the above-mentioned i is an integer greater than 1 and less than N;
  • the input of the first specified structure is the output of the i-1th above-mentioned functional module
  • the residual between the output of the last specified structure and the output of the i-1th above-mentioned functional module is the output of the i-th above-mentioned functional module.
  • the Nth above-mentioned functional module includes the above-mentioned convolution unit, the above-mentioned normalization unit, the above-mentioned activation function unit, and the above-mentioned convolution unit connected in series;
  • the input of the first above-mentioned convolution unit is the output of the N-1th above-mentioned functional module;
  • the residual between the output of the second above-mentioned convolution unit and the output of the N-1th above-mentioned functional module is the output of the Nth above-mentioned functional module.
  • the above convolution unit is a Chebyshev convolution unit.
  • the above-mentioned object to be reconstructed is a human body
  • the above-mentioned preset template is a human body grid map
  • the above-mentioned generating module 802 includes:
  • a construction unit configured to construct a graph structure in a preset format based on the above-mentioned human body mesh graph, where the graph structure includes the vertex information of the above-mentioned human body mesh graph;
  • the splicing unit is configured to fuse and splice the above-mentioned feature vectors and the above-mentioned graph structure to obtain the above-mentioned feature map.
  • in some embodiments, the above three-dimensional reconstruction device 800 also includes a training module configured to train the above graph convolutional neural network with a loss function combining a mesh loss, a three-dimensional joint loss, a surface normal loss and a surface edge loss, wherein:
  • the above mesh loss is used to describe the position difference between the real human body mesh and the predicted human body mesh;
  • the above three-dimensional joint loss is used to describe the position difference between the real human three-dimensional joints and the predicted human three-dimensional joints;
  • the surface normal loss described above is used to describe the angular difference between the normal vectors of the triangular faces of the real human mesh and the normal vectors of the triangular faces of the predicted human mesh;
  • the surface edge loss described above is used to describe the difference in length between the edge lengths of the triangular faces of the real human mesh and the edge lengths of the triangular faces of the predicted human mesh.
  • the above extraction module 801 includes:
  • a segmentation unit configured to segment the image based on the object to be reconstructed to obtain a partial image
  • an adjustment unit configured to adjust the size of the partial image to a preset size
  • the extraction unit is configured to perform feature extraction on the above-mentioned partial image after size adjustment by using an encoder based on a convolutional neural network, to obtain the above-mentioned feature vector.
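The segment-then-resize preprocessing performed by the extraction module can be sketched as follows; nearest-neighbour resampling is used purely for brevity (a real pipeline would typically use bilinear interpolation, and the bounding box would come from a detector such as OpenPose rather than being hard-coded):

```python
import numpy as np

def crop_and_resize(img, box, size=224):
    """Cut the bounding box (region of interest) out of the image and
    rescale it to size x size (nearest-neighbour, for brevity)."""
    x0, y0, x1, y1 = box
    crop = img[y0:y1, x0:x1]
    rows = np.linspace(0, crop.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size).round().astype(int)
    return crop[np.ix_(rows, cols)]

img = np.zeros((480, 640, 3), dtype=np.uint8)      # dummy RGB frame
patch = crop_and_resize(img, box=(100, 50, 420, 430))
print(patch.shape)  # (224, 224, 3), matching the preset 224x224x3 input size
```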
  • feature extraction is first performed on the image of the object to be reconstructed to obtain a feature vector used to characterize the shape feature information of the object to be reconstructed; the feature vector is then combined with the preset template for the object to be reconstructed to generate a feature map; finally, the feature map is input into the trained graph convolutional neural network to obtain the 3D reconstruction result of the object to be reconstructed.
  • the finally generated feature map not only contains the shape features of the object to be reconstructed, but also carries the three-dimensional structure information of the object to be reconstructed conveyed by the preset template;
  • as a result, the trained graph convolutional neural network can better process the feature map, and the obtained 3D reconstruction result is more accurate, ensuring the stability of 3D reconstruction.
  • an embodiment of the present application further provides an electronic device.
  • the electronic device 9 in the embodiment of the present application includes: a memory 901, one or more processors 902 (only one is shown in Fig. 9), and a computer program stored on the memory 901 and operable on the processor 902.
  • the memory 901 is used to store software programs and units;
  • the processor 902 executes various functional applications and data processing by running the software programs and units stored in the memory 901.
  • the processor 902 implements the following steps by running the above-mentioned computer program stored in the memory 901:
  • the above-mentioned graph convolutional neural network includes N series-connected functional modules
  • the input of the first above-mentioned functional module is the input of the above-mentioned graph convolutional neural network
  • the output of the Nth above-mentioned functional module is the output of the above-mentioned graph convolutional neural network
  • the above-mentioned N is an integer greater than 2;
  • the above functional modules include a convolution unit, a normalization unit and an activation function unit.
  • the first above-mentioned functional module includes at least three specified structures connected in sequence, and each specified structure includes the above-mentioned convolution unit, the above-mentioned normalization unit, and the above-mentioned activation function unit connected in series;
  • the input of the first specified structure is the input of the first above-mentioned functional module;
  • the residual between the output of the last specified structure and the input of the first above-mentioned functional module is the output of the first above-mentioned functional module.
  • the i-th functional module includes at least two specified structures connected in sequence, and each specified structure includes the above-mentioned convolution unit, the above-mentioned normalization unit, and the above-mentioned activation function unit connected in series, where i is an integer greater than 1 and less than N;
  • the input of the first specified structure is the output of the i-1th above-mentioned functional module
  • the residual between the output of the last specified structure and the output of the i-1th above-mentioned functional module is the output of the i-th above-mentioned functional module.
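The residual "specified structure" layout described above (convolution, then normalization, then activation, with a skip connection around the whole module) can be sketched as follows. This is a minimal NumPy illustration in which a simplified one-hop graph convolution stands in for the Chebyshev unit; all names and sizes are assumptions for demonstration:

```python
import numpy as np

def specified_structure(X, A_hat, W):
    """One 'specified structure': convolution -> normalization -> activation."""
    h = A_hat @ X @ W                                   # simplified graph convolution
    h = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)   # per-feature normalization
    return np.maximum(h, 0.0)                           # ReLU activation

def functional_module(X, A_hat, weights):
    """Two or more specified structures in series, plus a residual
    (skip) connection from the module input to the module output."""
    h = X
    for W in weights:
        h = specified_structure(h, A_hat, W)
    return h + X                                        # residual connection

# toy graph: 6 nodes, 4 features per node, two specified structures
rng = np.random.default_rng(1)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)                 # self-loops
A_hat = A / A.sum(axis=1, keepdims=True) # row-normalized adjacency
X = rng.standard_normal((6, 4))
Ws = [rng.standard_normal((4, 4)) for _ in range(2)]
Y = functional_module(X, A_hat, Ws)
print(Y.shape)  # (6, 4)
```

The residual connection requires the module's input and output feature widths to match, which is why the weight matrices here are square.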
  • the Nth functional module includes the above-mentioned convolution unit, the above-mentioned normalization unit, the above-mentioned activation function unit, and a second above-mentioned convolution unit connected in series;
  • the input of the first above-mentioned convolution unit is the output of the (N-1)th above-mentioned functional module;
  • the residual between the output of the second above-mentioned convolution unit and the output of the (N-1)th above-mentioned functional module is the output of the Nth above-mentioned functional module.
  • the above convolution unit is a Chebyshev convolution unit.
  • the above-mentioned object to be reconstructed is a human body, and the preset template is a human body mesh graph;
  • generating a feature map according to the above-mentioned feature vector and the preset template of the object to be reconstructed includes:
  • constructing a graph structure in a preset format based on the above-mentioned human body mesh graph, where the graph structure includes the vertex information of the above-mentioned human body mesh graph;
  • fusing and splicing the above-mentioned feature vector with the above-mentioned graph structure to obtain the above-mentioned feature map.
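One plausible reading of "fusing and splicing" the feature vector with the graph structure is to tile the global image feature over every template vertex and concatenate it with the vertex coordinates. The sketch below assumes the 2048-dimensional encoder output and 6890-vertex human body template mentioned elsewhere in this document; the exact fusion scheme is an assumption:

```python
import numpy as np

def build_feature_map(feat_vec, template_verts):
    """Attach the global image feature vector to every template vertex:
    node features = [vertex xyz | image feature]."""
    tiled = np.tile(feat_vec, (template_verts.shape[0], 1))
    return np.concatenate([template_verts, tiled], axis=1)

feat = np.zeros(2048)        # encoder output (2048-d, per the ResNet50 example)
verts = np.zeros((6890, 3))  # human body template vertices (SMPL has 6890)
fmap = build_feature_map(feat, verts)
print(fmap.shape)  # (6890, 2051): one feature row per graph vertex
```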
  • the processor 902 further implements the following steps when running the above computer program stored in the memory 901:
  • the total loss of the above-mentioned graph convolutional neural network is calculated based on mesh loss, three-dimensional joint loss, surface normal loss, and surface edge loss;
  • the above mesh loss is used to describe the position difference between the real human body mesh and the predicted human body mesh;
  • the above three-dimensional joint loss is used to describe the position difference between the real human three-dimensional joints and the predicted human three-dimensional joints;
  • the surface normal loss described above is used to describe the angular difference between the normal vectors of the triangular faces of the real human mesh and the normal vectors of the triangular faces of the predicted human mesh;
  • the surface edge loss described above is used to describe the difference in length between the edge lengths of the triangular faces of the real human mesh and the edge lengths of the triangular faces of the predicted human mesh.
  • performing feature extraction on the image of the object to be reconstructed to obtain a feature vector includes:
  • segmenting the image based on the object to be reconstructed to obtain a partial image;
  • adjusting the size of the partial image to a preset size;
  • performing feature extraction on the above-mentioned partial image after size adjustment by using an encoder based on a convolutional neural network, to obtain the above-mentioned feature vector.
  • the above-mentioned processor 902 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory 901 may include read-only memory and random-access memory, and provides instructions and data to the processor 902 . Part or all of the memory 901 may also include non-volatile random access memory. For example, the memory 901 may also store information of device categories.
  • feature extraction is first performed on the image of the object to be reconstructed to obtain a feature vector used to characterize the shape feature information of the object to be reconstructed; the feature vector is then combined with the preset template for the object to be reconstructed to generate a feature map; finally, the feature map is input into the trained graph convolutional neural network to obtain the 3D reconstruction result of the object to be reconstructed.
  • the finally generated feature map not only contains the shape features of the object to be reconstructed, but also carries the three-dimensional structure information of the object to be reconstructed conveyed by the preset template;
  • as a result, the trained graph convolutional neural network can better process the feature map, and the obtained 3D reconstruction result is more accurate, ensuring the stability of 3D reconstruction.
  • the disclosed devices and methods can be implemented in other ways.
  • the system embodiments described above are only illustrative.
  • the division of the above-mentioned modules or units is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • when the above integrated units are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the present application realizes all or part of the processes in the methods of the above-mentioned embodiments, which can also be completed by instructing the associated hardware through a computer program.
  • the above-mentioned computer program can be stored in a computer-readable storage medium, and when the computer program is executed by the processor, the steps in the above-mentioned various method embodiments can be realized.
  • the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code, executable file or some intermediate form.
  • the above-mentioned computer-readable storage medium may include: any entity or device capable of carrying the above-mentioned computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer-readable memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, etc.
  • the content contained in the above-mentioned computer-readable storage media can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction.
  • in some jurisdictions, for example, computer-readable storage media do not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Hardware Design (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the technical field of image processing, and in particular, discloses a three-dimensional reconstruction method, a three-dimensional reconstruction apparatus, an electronic device, and a computer readable storage medium. The three-dimensional reconstruction method comprises: performing feature extraction on an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used for representing shape feature information of said object; generating a feature map according to the feature vector and a preset template for said object, the preset template being used for representing three-dimensional structure information of said object; and inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of said object. By means of the solution of the present application, the stability of three-dimensional reconstruction can be improved.

Description

A three-dimensional reconstruction method, device, electronic equipment and readable storage medium

Technical Field

The present application belongs to the technical field of image processing, and in particular relates to a three-dimensional reconstruction method, a three-dimensional reconstruction device, electronic equipment, and a computer-readable storage medium.

Background Art
The three-dimensional reconstruction of human body parts has long been a hot topic in computer vision, with wide applications in the fields of virtual reality (VR) and augmented reality (AR).
Technical Problem

Traditional 3D reconstruction techniques rely on relatively complex and expensive equipment, such as 3D scanners, multi-view cameras or inertial sensors. Although 3D reconstruction techniques based on a single image have since been developed, they still suffer from problems such as unstable reconstruction results.

Technical Solution

The present application provides a three-dimensional reconstruction method, a three-dimensional reconstruction device, an electronic device and a computer-readable storage medium, which can solve the problem of unstable reconstruction results in existing three-dimensional reconstruction techniques.
In a first aspect, the present application provides a three-dimensional reconstruction method, including:
performing feature extraction on an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used to represent the shape feature information of the object to be reconstructed;
generating a feature map according to the feature vector and a preset template for the object to be reconstructed, where the preset template is used to represent the three-dimensional structure information of the object to be reconstructed;
inputting the feature map into a trained graph convolutional network (GCN) to obtain a three-dimensional reconstruction result of the object to be reconstructed.
In a second aspect, the present application provides a three-dimensional reconstruction device, including:
an extraction module, configured to perform feature extraction on an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used to represent the shape feature information of the object to be reconstructed;
a generation module, configured to generate a feature map according to the feature vector and a preset template for the object to be reconstructed, where the preset template is used to represent the three-dimensional structure information of the object to be reconstructed;
a reconstruction module, configured to input the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of the method of the first aspect when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the method of the first aspect when executed by one or more processors.

In a fifth aspect, the present application provides a computer program product which, when run on an electronic device, causes the electronic device to execute the three-dimensional reconstruction method proposed in the first aspect.
Beneficial Effects

Compared with the prior art, the present application has the following beneficial effects: for an image of an object to be reconstructed, the present application first performs feature extraction on the image to obtain a feature vector characterizing the shape feature information of the object to be reconstructed, then combines the feature vector with a preset template for the object to be reconstructed to generate a feature map, and finally inputs the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed. In this process, through the combination of the feature vector and the preset template, the generated feature map not only contains the shape features of the object to be reconstructed, but also carries the three-dimensional structure information of the object to be reconstructed conveyed by the preset template. As a result, the trained graph convolutional neural network can better process the feature map, the obtained three-dimensional reconstruction result is more accurate, and the stability of three-dimensional reconstruction is guaranteed.
Brief Description of the Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of the implementation of the three-dimensional reconstruction method provided by an embodiment of the present application;
Fig. 2 is an example diagram of the preset template when the object to be reconstructed is a human body, provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of the first functional module of the graph convolutional neural network provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of the i-th functional module of the graph convolutional neural network provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of the Nth functional module of the graph convolutional neural network provided by an embodiment of the present application;
Fig. 6 is an example diagram of the overall structure of the graph convolutional neural network provided by an embodiment of the present application;
Fig. 7 is an example diagram of the working framework of the three-dimensional reconstruction method provided by an embodiment of the present application;
Fig. 8 is a structural block diagram of the three-dimensional reconstruction device provided by an embodiment of the present application;
Fig. 9 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.
Embodiments of the Present Invention

In the following description, specific details such as particular system structures and technologies are presented for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

At present, existing three-dimensional reconstruction techniques still suffer from unstable reconstruction results. To solve this problem, the embodiments of the present application propose a three-dimensional reconstruction method, a three-dimensional reconstruction device, an electronic device and a computer-readable storage medium. After the feature vector of the object to be reconstructed is extracted from the image, the feature vector is combined with a preset template representing the three-dimensional structure information of the object to be reconstructed to obtain a feature map, so that the feature map contains not only the shape features of the object to be reconstructed but also the three-dimensional structure information conveyed by the preset template. Such a feature map can be better processed by the trained graph convolutional neural network, yielding more accurate three-dimensional reconstruction results and guaranteeing the stability of three-dimensional reconstruction. The technical solution proposed by the present application is illustrated below through specific embodiments.

The three-dimensional reconstruction method proposed in the embodiments of the present application is described below. Referring to Fig. 1, the implementation process of the three-dimensional reconstruction method is detailed as follows:
Step 101: performing feature extraction on an image of an object to be reconstructed to obtain a feature vector.

In the embodiment of the present application, the electronic device may photograph the object to be reconstructed with its own camera to obtain an image of the object to be reconstructed; alternatively, a third-party device equipped with a camera may photograph the object to be reconstructed and, after obtaining the image, transmit it to the electronic device in a wireless or wired manner so that the electronic device obtains the image of the object to be reconstructed. The way in which the image of the object to be reconstructed is acquired is not limited here.

After obtaining the image of the object to be reconstructed, the electronic device can perform feature extraction on the image to obtain a feature vector. Considering that the three-dimensional reconstruction result is mainly intended to restore the pose of the object to be reconstructed, and that the pose is most closely related to the shape features of the object, the feature extraction here is mainly aimed at the shape feature information of the object to be reconstructed; that is, the obtained feature vector is actually used to characterize the shape feature information of the object to be reconstructed. It can be understood that the shape feature information includes contour features describing the boundary shape of the object to be reconstructed and/or region features describing its internal shape, which is not limited here.
In some embodiments, in addition to the image information of the object to be reconstructed, the image may also contain some redundant information and noise. To prevent redundant information and noise from affecting the accuracy of subsequent feature extraction, the electronic device may first preprocess the image, for example by segmentation and size-adjustment operations; step 101 may then include:

A1. Segmenting the image based on the object to be reconstructed to obtain a partial image.

The electronic device may first perform bounding-box detection on the image for the object to be reconstructed, that is, recognize the bounding box of the object to be reconstructed in the image. The bounding box is usually rectangular; of course, it can also have another preset shape, which is not limited here. The image area within the bounding box can be regarded as a region of interest. By segmenting the image based on this bounding box, the region of interest is cut out of the image, which removes, to a certain extent, the large amount of redundant information and noise contained in the image background and yields a partial image mainly containing the object to be reconstructed, with the object located as close as possible to the center of the partial image.

As an example only, when the object to be reconstructed is a human body, the electronic device may use the human-body two-dimensional keypoint detection technique OpenPose to perform bounding-box detection on the image.
A2. Adjusting the size of the partial image to a preset size.

To facilitate the subsequent generation of the feature map, the electronic device may unify the size of the partial image: if the size of the partial image differs from the preset size, the partial image is scaled until it matches the preset size. As an example only, the preset size may be 224×224 pixels. The number of channels of the partial image is usually 3, representing the red, green and blue (RGB) channels; that is, the final size of the partial image is 224×224×3.
A3. Performing feature extraction on the resized partial image with an encoder based on a convolutional neural network (CNN) to obtain the feature vector.

The electronic device may pretrain the convolutional neural network on a classification task over a given data set in advance; the pretraining process can follow the current general training procedure for neural networks and is not repeated here. After pretraining is completed, the classification layer of the convolutional neural network is removed and the feature extraction layers before it are retained; the retained part constitutes the encoder.

As an example only, the convolutional neural network may be ResNet50, the given data set may be the ImageNet data set, and the output of the encoder of the convolutional neural network is a 2048-dimensional feature vector.
Step 102: generating a feature map according to the feature vector and a preset template for the object to be reconstructed.

In the embodiment of the present application, the preset template is used to represent the three-dimensional structure information of the object to be reconstructed, specifically the three-dimensional structure information of the object in a specified pose. Different preset templates are set for different types of objects to be reconstructed. As an example only, when the object to be reconstructed is a human body, the preset template is a human body mesh graph; when the object to be reconstructed is a hand, the preset template is a hand mesh graph.

In some embodiments, considering that the graph convolutional neural network is a neural network structure proposed to better handle graph data structures, the electronic device may express the preset template as a graph structure and combine it with the feature vector to generate the feature map, so as to improve the efficiency of three-dimensional reconstruction by the graph convolutional neural network. Since the preset template is usually a mesh graph, it can be regarded as an undirected graph composed of a vertex set and an edge set, expressed as G = (V, E), where G denotes the mesh graph, V the vertex set, and E the edge set. Moreover, since the mesh is stitched together from triangular faces, the mesh graph can also be regarded as an undirected graph composed of a vertex set and a triangular face set, expressed as M = (V, F), where M denotes the vertex information matrix of the mesh graph, V the vertex set, and F the triangular face set; each triangular face in F is a triangle formed by three vertices in V. It can be understood that the edge-set information of the mesh graph is in fact contained in the triangular face set.
In one application scenario, the object to be reconstructed is a human body and the preset template is a human body mesh; step 101 may then include:

B1. Constructing a graph structure in a preset format based on the human body mesh.
As an example only, the human body mesh adopted by the electronic device may be the standard template defined by the SMPL (Skinned Multi-Person Linear) model. As shown in FIG. 2, this human body mesh represents the three-dimensional human mesh in the T-pose.

Using the general-format graph structure described above, the electronic device can first express the human body mesh as M_smpl = (V, F), where V denotes the vertex set of the human body mesh, containing 6890 vertices in total, and F denotes its triangular-face set, each triangular face being formed by three vertices. Converting this general-format graph structure M_smpl = (V, F) yields the preset-format graph structure M_smpl = (V, A), where V still denotes the vertex set and A denotes the adjacency matrix of the human body mesh, A ∈ {0,1}^(6890×6890), meaning that each element of the adjacency matrix takes the value 0 or 1; specifically, if the i-th vertex and the j-th vertex are connected, then (A)_ij = 1, otherwise (A)_ij = 0. By expressing the preset template of the object to be reconstructed through this preset-format graph structure, the electronic device provides the basis for the subsequent use of the graph convolutional neural network to predict the three-dimensional coordinates of the human body mesh.
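The conversion from the face-set form M = (V, F) to the preset-format adjacency form M = (V, A) can be sketched as follows. This is a minimal NumPy illustration; the function name and the two-triangle toy mesh are hypothetical and not part of the application:

```python
import numpy as np

def adjacency_from_faces(num_vertices, faces):
    """Build a binary adjacency matrix A in {0,1}^(N x N) from a list of
    triangular faces, each face being a triple of vertex indices.
    Two vertices are adjacent iff they share an edge of some triangle."""
    A = np.zeros((num_vertices, num_vertices), dtype=np.int8)
    for i, j, k in faces:
        for a, b in ((i, j), (j, k), (k, i)):
            A[a, b] = 1
            A[b, a] = 1  # undirected graph: A must be symmetric
    return A

# Toy mesh: two triangles sharing the edge (1, 2)
faces = [(0, 1, 2), (1, 2, 3)]
A = adjacency_from_faces(4, faces)
```

For the SMPL template the same construction would be applied to all 6890 vertices (or 1723 after downsampling), typically with a sparse matrix in practice.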
In some embodiments, in order to reduce the computational complexity of the subsequent graph convolution operations, the electronic device can downsample the standard template defined by the SMPL model by a factor of 4 and use the downsampled result as the preset template (that is, the human body mesh). It can be understood that the number of vertices of the human body mesh obtained after 4× downsampling is reduced to 1723. Subsequently, the three-dimensional reconstruction result output by the graph convolutional neural network only needs to be upsampled by a factor of 4 to obtain the final three-dimensional reconstruction result and complete the reconstruction task.

For the downsampled human body mesh, the resulting preset-format graph structure is expressed as follows:

M_h = (V_h, A_h),  V_h ∈ R^(1723×3),  A_h ∈ {0,1}^(1723×1723)

where M_h denotes the vertex information matrix of the downsampled human body mesh; V_h denotes its vertex set, with V_h ∈ R^(1723×3) indicating that the coordinate values of the three-dimensional coordinates of the 1723 vertices in this set are all real numbers; and A_h denotes the adjacency matrix of the downsampled human body mesh, whose meaning has been explained above and is not repeated here, with A_h ∈ {0,1}^(1723×1723) indicating that each element of the adjacency matrix takes the value 0 or 1.
B2. Fusing and concatenating the feature vector with the graph structure to obtain the feature graph.

The 2048-dimensional feature vector obtained in step 101 is f ∈ R^2048, and the vertex information matrix of the 4× downsampled human body mesh obtained in step B1 is M_h ∈ R^(1723×3). Fusing and concatenating the two yields the feature graph, which serves as the input of the subsequent graph convolutional neural network and can be expressed as F_in ∈ R^(1723×2051). The above operation can be understood as concatenating the 2048-dimensional feature vector onto every vertex.
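A minimal sketch of this fusion/concatenation step, assuming a NumPy representation (the variable names are illustrative; in the application V_h would come from the downsampled SMPL template and f from the image encoder):

```python
import numpy as np

# Shapes taken from the application: 1723 template vertices with
# 3-D coordinates, plus a 2048-d image feature vector from the encoder.
V_h = np.zeros((1723, 3))   # per-vertex coordinates of the preset template
f = np.zeros(2048)          # single feature vector for the whole image

# Replicate the same feature vector onto every vertex, then concatenate
# with the per-vertex coordinates: (1723, 3) ++ (1723, 2048) -> (1723, 2051)
F_in = np.concatenate([V_h, np.tile(f, (V_h.shape[0], 1))], axis=1)
```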
In another application scenario, the object to be reconstructed is a hand and the preset template is a hand mesh; the generation of its feature graph is similar to steps B1 and B2, except that the preset template is changed. As an example only, the hand mesh adopted by the electronic device may be the standard template defined by the MANO (hand Model with Articulated and Non-rigid defOrmations) model. It should be noted that, since the hand mesh contains a relatively small number of vertices (usually some 700), there is no need to downsample it.

In summary, the feature graph in fact combines the vertex position information of the preset template with the shape feature information, as represented in the image, of the part to be reconstructed.
Step 103: inputting the feature graph into the trained graph convolutional neural network to obtain the three-dimensional reconstruction result of the object to be reconstructed.

In this embodiment of the present application, the graph convolutional neural network takes the feature graph as input and finally outputs the transformed mesh vertex position information of the object to be reconstructed as the three-dimensional reconstruction result. The overall structure of the graph convolutional neural network is introduced below:
The graph convolutional neural network includes N functional modules connected in series, where the input of the 1st functional module is the input of the graph convolutional neural network, the output of the N-th functional module is the output of the graph convolutional neural network, and N is an integer greater than 2. It can be understood that the 1st functional module mainly receives the input of the graph convolutional neural network, the i-th functional module mainly performs data computation and transfer, and the N-th functional module mainly outputs the finally predicted mesh vertex position information of the object to be reconstructed, where i is an integer greater than 1 and less than N.

Specifically, each functional module includes the following three kinds of basic units: a convolution unit, a normalization unit, and an activation function unit. The specific structure of each functional module is introduced below:
Please refer to FIG. 3, which shows the structure of the 1st functional module. The 1st functional module includes at least three specified structures (only three are shown in FIG. 3) connected in series, each specified structure consisting of a convolution unit, a normalization unit, and an activation function unit connected in series. Among these at least three specified structures: the input of the first specified structure is the input of the 1st functional module (that is, the input of the graph convolutional neural network), and the residual between the output of the last specified structure and the input of the 1st functional module (that is, the input of the graph convolutional neural network) is the output of the 1st functional module.
Please refer to FIG. 4, which shows the structure of the i-th functional module. The i-th functional module includes at least two specified structures (only two are shown in FIG. 4) connected in series, each specified structure being identical to the specified structure of the 1st functional module and therefore not described again here. Among these at least two specified structures: the input of the first specified structure is the output of the (i-1)-th functional module, and the residual between the output of the last specified structure and the output of the (i-1)-th functional module is the output of the i-th functional module.
Please refer to FIG. 5, which shows the structure of the N-th functional module. The N-th functional module includes a convolution unit, a normalization unit, an activation function unit, and a convolution unit connected in series, where the input of the first convolution unit is the output of the (N-1)-th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)-th functional module is the output of the N-th functional module. It can be understood that, since the role of the N-th functional module is to output the finally predicted mesh vertex position information of the object to be reconstructed, the data output by its second convolution unit no longer needs normalization or activation processing.
Please refer to FIG. 6, which gives an example of the overall structure of a graph convolutional neural network including 4 functional modules. For ease of understanding, the parameters f_in and f_out can be used to express how the feature dimension changes as each functional module performs its graph convolution operations. Taking FIG. 6 as an example, the 1st functional module includes 3 convolution units, so the change in its feature dimension can be expressed in the form (f_in, f_out1, f_out2, f_out3), where f_in is the feature dimension of the initial input and f_out1, f_out2, and f_out3 are the feature dimensions output by the three convolution units, respectively. Since the normalization unit and the activation function unit do not change the feature dimension, the feature dimension input to the second convolution unit equals the feature dimension output by the first convolution unit, and so on. Still taking FIG. 6 as an example, the 2nd, 3rd, and 4th functional modules each include 2 convolution units, so the change in their feature dimensions can be expressed in the form (f_in, f_out1, f_out2); the meanings of the parameters are as described above and are not repeated here.

As an example only, when the object to be reconstructed is a human body, the obtained feature graph is F_in ∈ R^(1723×2051). After passing through the graph convolutional neural network, the final network output F_out ∈ R^(1723×3) is obtained. After upsampling, M_out ∈ R^(6890×3) is obtained, that is, the final three-dimensional reconstruction result of the human body.
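The data flow of a functional module built from serial "specified structures" with a residual connection can be sketched as follows. This is a deliberately simplified NumPy illustration, not the application's implementation: the graph convolution unit is reduced to a fixed propagation matrix times a learnable linear map, the normalization unit to plain feature standardization, and the module is assumed to preserve the feature dimension so the residual addition is shape-compatible:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def specified_structure(F, A_hat, theta):
    """One 'specified structure': a (simplified) graph convolution,
    then normalization, then a ReLU activation."""
    H = A_hat @ F @ theta                       # convolution unit (simplified)
    H = (H - H.mean(0)) / (H.std(0) + 1e-5)     # normalization unit
    return relu(H)                              # activation function unit

def functional_module(F, A_hat, thetas):
    """Specified structures in series, plus a residual connection
    from the module input to the module output."""
    H = F
    for theta in thetas:
        H = specified_structure(H, A_hat, theta)
    return H + F                                # residual of output and input

rng = np.random.default_rng(0)
N, d = 6, 8                                     # toy graph: 6 vertices, 8-d features
A_hat = np.eye(N)                               # trivial graph operator for the sketch
F = rng.standard_normal((N, d))
thetas = [rng.standard_normal((d, d)) for _ in range(2)]
out = functional_module(F, A_hat, thetas)
```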
In one example, the convolution unit may be a Chebyshev convolution unit, which uses Chebyshev polynomials to construct the Chebyshev convolution algorithm. The Chebyshev convolution unit can, to a certain extent, speed up the processing of the graph convolutional neural network and improve the efficiency of three-dimensional reconstruction.

The Chebyshev polynomials are defined as follows:

T_0(x) = 1;  T_1(x) = x;  T_{n+1}(x) = 2x·T_n(x) − T_{n−1}(x)

Based on these Chebyshev polynomials, the Chebyshev convolution algorithm is obtained, expressed as follows:

F_out = Σ_{k=0}^{K−1} T_k(L̃) F_in θ_k

where F_in ∈ R^(N×f_in) denotes the input features;

F_out ∈ R^(N×f_out) denotes the output features;

K indicates that Chebyshev polynomials of order K are used; in this embodiment of the present application, every Chebyshev convolution unit of the graph convolutional neural network takes K = 3;

θ_k ∈ R^(f_in×f_out) denotes the feature transformation matrix, whose entries are the values that the graph convolutional neural network needs to learn.
L̃ ∈ R^(N×N) denotes the scaled Laplacian matrix of the preset template. When the object to be reconstructed is a human body and the preset template is the downsampled result of the standard template defined by the SMPL model, N is the number of vertices after downsampling, i.e. 1723. The scaled Laplacian matrix is specifically:

L̃ = 2L_p / λ_max − I

L_p = I − D_h^(−1/2) A_h D_h^(−1/2)

where I is an identity matrix, D_h is the diagonal degree matrix with (D_h)_ii = Σ_j (A_h)_ij, and λ_max is the largest eigenvalue of the matrix L_p.
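The construction of the scaled Laplacian from the adjacency matrix can be sketched as follows (a NumPy illustration on a hypothetical 4-vertex cycle graph; the rescaling maps the eigenvalues of L_p into [−1, 1], which is what the Chebyshev recurrence requires):

```python
import numpy as np

def scaled_laplacian(A):
    """Normalized graph Laplacian L_p = I - D^(-1/2) A D^(-1/2),
    rescaled to L~ = 2 L_p / lambda_max - I so that its spectrum
    lies in [-1, 1]."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d, 1.0) ** -0.5   # guard isolated vertices
    L_p = np.eye(A.shape[0]) - (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L_p).max()        # largest eigenvalue of L_p
    return 2.0 * L_p / lam_max - np.eye(A.shape[0])

# Toy 4-vertex cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L_tilde = scaled_laplacian(A)
```

For the 1723-vertex template, λ_max would be computed once offline and the operator stored as a sparse matrix.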
For ease of understanding, taking K = 3 as an example, the Chebyshev convolution algorithm given above can be expanded as follows:

F_out = F_in θ_0 + L̃ F_in θ_1 + (2L̃² − I) F_in θ_2

L = [F_in, L̃ F_in, (2L̃² − I) F_in]

F_out = L W

where L is an intermediate quantity without actual physical meaning, and W ∈ R^(3f_in×f_out) (the vertical stack of θ_0, θ_1, and θ_2) is the parameter that the graph convolutional neural network needs to learn.
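A minimal sketch of the Chebyshev convolution using the recurrence T_{k+1}(x) = 2x·T_k(x) − T_{k−1}(x) (NumPy, with hypothetical toy dimensions; a real implementation for the 1723-vertex graph would use sparse matrices):

```python
import numpy as np

def cheb_conv(F_in, L_tilde, thetas):
    """Chebyshev graph convolution with K = len(thetas) terms:
    F_out = sum_k T_k(L~) @ F_in @ theta_k, where T_k(L~) @ F_in is
    computed via the Chebyshev recurrence instead of forming T_k(L~)."""
    Tx = [F_in, L_tilde @ F_in]                      # T_0 F_in, T_1 F_in
    for _ in range(2, len(thetas)):
        Tx.append(2.0 * L_tilde @ Tx[-1] - Tx[-2])   # T_{k+1} = 2 L~ T_k - T_{k-1}
    return sum(T @ th for T, th in zip(Tx, thetas))

rng = np.random.default_rng(1)
N, f_in, f_out = 5, 4, 3
L_tilde = np.zeros((N, N))                           # toy graph operator
F_in = rng.standard_normal((N, f_in))
thetas = [rng.standard_normal((f_in, f_out)) for _ in range(3)]  # K = 3
F_out = cheb_conv(F_in, L_tilde, thetas)
```

With L̃ = 0 the three terms reduce to F_in θ_0, 0, and −F_in θ_2, which makes the recurrence easy to check by hand.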
In some embodiments, when the object to be reconstructed is a human body, the graph convolutional neural network can be trained using the Human3.6M and MSCOCO datasets. Specifically, since these two datasets do not store the ground-truth human mesh of each training sample but only the position information of the ground-truth three-dimensional human joints, a high-precision ground-truth human mesh needs to be fitted in advance from the position information of the ground-truth three-dimensional joints of each training sample; this ground-truth mesh can then be used as a strong label in the training process of the graph convolutional neural network. That is, the ground-truth human mesh mentioned here is in fact a high-precision result fitted from the ground-truth three-dimensional human joints. It can be understood that the training process of this graph convolutional neural network is essentially the same as that of an ordinary neural network, except that a new loss function is used, which makes the three-dimensional reconstruction results output by the trained model smoother and more complete, and thus more practical. The loss function is:

loss = λ_a L_v + λ_b L_j + λ_c L_n + λ_d L_e

where λ_a, λ_b, λ_c, and λ_d are all hyperparameters.
L_v denotes the mesh loss, which describes the positional difference between the ground-truth human mesh and the predicted human mesh. Let M* denote the positions of the vertices of the ground-truth human mesh and M the positions of the vertices of the predicted human mesh; using the L1 loss, the mesh loss L_v is expressed as:

L_v = ||M − M*||_1
L_j denotes the three-dimensional joint loss, which describes the positional difference between the ground-truth and predicted three-dimensional human joints. Let J^3D* denote the positions of the ground-truth three-dimensional human joints, J ∈ R^(v×N) the matrix that extracts joints from the human mesh, and M the positions of the vertices of the predicted human mesh, so that JM is the predicted joint positions; using the L1 loss, the three-dimensional joint loss is expressed as:

L_j = ||JM − J^3D*||_1
L_n denotes the surface normal loss, which describes the angular difference between the normal vectors of the triangular faces of the ground-truth human mesh and those of the predicted human mesh. Let f denote a triangular face of the predicted human mesh, n_f* the unit normal vector of the corresponding triangular face in the ground-truth human mesh, and m_i and m_j the coordinates of two vertices of f; the surface normal loss L_n is then expressed as:

L_n = Σ_f Σ_{(i,j)⊂f} | ⟨ (m_i − m_j) / ||m_i − m_j||, n_f* ⟩ |
L_e denotes the surface edge loss, which describes the difference between the edge lengths of the triangular faces of the ground-truth human mesh and those of the predicted human mesh. Let f denote a triangular face of the predicted human mesh, m_i and m_j the coordinates of two vertices of f, and m_i* and m_j* the corresponding vertex coordinates in the ground-truth human mesh; the surface edge loss L_e is then expressed as:

L_e = Σ_f Σ_{(i,j)⊂f} | ||m_i − m_j|| − ||m_i* − m_j*|| |
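The vertex, joint, and edge loss terms above can be sketched as follows (a NumPy illustration with a hypothetical three-vertex toy mesh; the surface normal loss is omitted for brevity, and all variable names are illustrative):

```python
import numpy as np

def mesh_losses(M_pred, M_gt, J_reg, J_gt, edges):
    """Sketch of the L_v, L_j and L_e loss terms.
    M_pred/M_gt: (N, 3) predicted and ground-truth vertex positions;
    J_reg: (v, N) joint regression matrix J; J_gt: (v, 3) ground-truth
    3-D joints; edges: list of (i, j) vertex-index pairs of the faces."""
    L_v = np.abs(M_pred - M_gt).sum()              # ||M - M*||_1
    L_j = np.abs(J_reg @ M_pred - J_gt).sum()      # ||JM - J^3D*||_1
    L_e = sum(abs(np.linalg.norm(M_pred[i] - M_pred[j])
                  - np.linalg.norm(M_gt[i] - M_gt[j]))
              for i, j in edges)                   # edge-length difference
    return L_v, L_j, L_e

# Toy example: 3 vertices forming one triangle, 1 joint at the centroid
M_gt = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
M_pred = M_gt.copy()                               # perfect prediction
J_reg = np.full((1, 3), 1.0 / 3.0)                 # joint = mean of vertices
J_gt = M_gt.mean(axis=0, keepdims=True)
losses = mesh_losses(M_pred, M_gt, J_reg, J_gt, edges=[(0, 1), (1, 2), (2, 0)])
```

With a perfect prediction all three terms vanish, which is a convenient sanity check before weighting them by λ_a, λ_b, λ_d.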
For ease of understanding, please refer to FIG. 7, which, taking the human body as the part to be reconstructed, gives an example of the working framework of the three-dimensional reconstruction method in the embodiment of the present application. The framework consists of two parts: an encoder based on a convolutional neural network, and a three-dimensional human vertex regressor based on a graph convolutional neural network. After a person is photographed, the original image of the human body is obtained and serves as the initial input; preprocessing yields a partial image, which the convolutional-neural-network-based encoder encodes into a set of feature vectors. These feature vectors are fused and concatenated with the mesh vertex position information of the preset human body mesh to form a feature graph, which is input to the graph convolutional neural network. Finally, the graph convolutional neural network regresses a new set of mesh vertex position information that conforms to the two-dimensional observation of the human body in the original image, completing the three-dimensional reconstruction of the human body.
It can be seen from the above that, in the embodiment of the present application, feature extraction is first performed on the image of the object to be reconstructed to obtain a feature vector characterizing its shape feature information; this feature vector is then combined with the preset template for the object to be reconstructed to generate a feature graph; finally, the feature graph is input into the trained graph convolutional neural network to obtain the three-dimensional reconstruction result of the object to be reconstructed. In this process, by combining the feature vector with the preset template, the generated feature graph contains not only the shape features of the object to be reconstructed but also the three-dimensional structure information of that object presented by the preset template. As a result, the trained graph convolutional neural network can process this feature graph better, the obtained three-dimensional reconstruction result is more accurate, and the stability of the three-dimensional reconstruction is guaranteed.
Corresponding to the three-dimensional reconstruction method provided above, an embodiment of the present application further provides a three-dimensional reconstruction apparatus. As shown in FIG. 8, the three-dimensional reconstruction apparatus 800 includes:

an extraction module 801, configured to perform feature extraction on an image of an object to be reconstructed to obtain a feature vector, where the feature vector characterizes the shape feature information of the object to be reconstructed;

a generation module 802, configured to generate a feature graph according to the feature vector and a preset template for the object to be reconstructed, where the preset template characterizes the three-dimensional structure information of the object to be reconstructed; and

a reconstruction module 803, configured to input the feature graph into a trained graph convolutional neural network to obtain the three-dimensional reconstruction result of the object to be reconstructed.
Optionally, the graph convolutional neural network includes N functional modules connected in series, where the input of the 1st functional module is the input of the graph convolutional neural network, the output of the N-th functional module is the output of the graph convolutional neural network, and N is an integer greater than 2; each functional module includes a convolution unit, a normalization unit, and an activation function unit.
Optionally, the 1st functional module includes at least three specified structures connected in series, each specified structure including a convolution unit, a normalization unit, and an activation function unit connected in series; among the at least three specified structures, the input of the first specified structure is the input of the 1st functional module, and the residual between the output of the last specified structure and the input of the 1st functional module is the output of the 1st functional module.
Optionally, the i-th functional module includes at least two specified structures connected in series, each specified structure including a convolution unit, a normalization unit, and an activation function unit connected in series, where i is an integer greater than 1 and less than N; among the at least two specified structures, the input of the first specified structure is the output of the (i-1)-th functional module, and the residual between the output of the last specified structure and the output of the (i-1)-th functional module is the output of the i-th functional module.
Optionally, the N-th functional module includes a convolution unit, a normalization unit, an activation function unit, and a convolution unit connected in series; in the N-th functional module, the input of the first convolution unit is the output of the (N-1)-th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)-th functional module is the output of the N-th functional module.
Optionally, the convolution unit is a Chebyshev convolution unit.
Optionally, when the object to be reconstructed is a human body, the preset template is a human body mesh, and the generation module 802 includes:

a construction unit, configured to construct a graph structure in a preset format based on the human body mesh, where the graph structure contains the vertex information of the human body mesh; and

a concatenation unit, configured to fuse and concatenate the feature vector with the graph structure to obtain the feature graph.
Optionally, the three-dimensional reconstruction apparatus 800 further includes:

a calculation module, configured to calculate, during the training of the graph convolutional neural network, the total loss of the graph convolutional neural network based on a mesh loss, a three-dimensional joint loss, a surface normal loss, and a surface edge loss;

where the mesh loss describes the positional difference between the ground-truth human mesh and the predicted human mesh;

the three-dimensional joint loss describes the positional difference between the ground-truth and predicted three-dimensional human joints;

the surface normal loss describes the angular difference between the normal vectors of the triangular faces of the ground-truth human mesh and those of the predicted human mesh; and

the surface edge loss describes the difference between the edge lengths of the triangular faces of the ground-truth human mesh and those of the predicted human mesh.
Optionally, the extraction module 801 includes:

a segmentation unit, configured to segment the image based on the object to be reconstructed to obtain a partial image;

an adjustment unit, configured to resize the partial image to a preset size; and

an extraction unit, configured to perform feature extraction on the resized partial image through an encoder employing a convolutional neural network to obtain the feature vector.
It can be seen from the above that, in the embodiment of the present application, feature extraction is first performed on the image of the object to be reconstructed to obtain a feature vector characterizing its shape feature information; this feature vector is then combined with the preset template for the object to be reconstructed to generate a feature graph; finally, the feature graph is input into the trained graph convolutional neural network to obtain the three-dimensional reconstruction result of the object to be reconstructed. In this process, by combining the feature vector with the preset template, the generated feature graph contains not only the shape features of the object to be reconstructed but also the three-dimensional structure information of that object presented by the preset template. As a result, the trained graph convolutional neural network can process this feature graph better, the obtained three-dimensional reconstruction result is more accurate, and the stability of the three-dimensional reconstruction is guaranteed.
对应于上文所提供的三维重建方法,本申请实施例还提供了一种电子设备。请参阅图9,本申请实施例中的电子设备9包括:存储器901,一个或多个处理器902(图9中仅示出一个)及存储在存储器901上并可在处理器上运行的计算机程序。其中:存储器901用于存储软件程序以及单元,处理器902通过运行存储在存储器901的软件程序以及单元,从而执行各种功能应用以及诊断,以获取上述预设事件对应的资源。具体地,处理器902通过运行存储在存储器901的上述计算机程序时实现以下步骤:Corresponding to the three-dimensional reconstruction method provided above, an embodiment of the present application further provides an electronic device. Referring to Fig. 9, the electronic device 9 in the embodiment of the present application includes: a memory 901, one or more processors 902 (only one is shown in Fig. 9 ) and a computer stored on the memory 901 and operable on the processor program. Wherein: the memory 901 is used to store software programs and units, and the processor 902 executes various functional applications and diagnoses by running the software programs and units stored in the memory 901 to obtain resources corresponding to the above preset events. Specifically, the processor 902 implements the following steps by running the above-mentioned computer program stored in the memory 901:
performing feature extraction on an image of an object to be reconstructed to obtain a feature vector, where the feature vector is used to characterize the shape feature information of the object to be reconstructed;
generating a feature map according to the feature vector and a preset template for the object to be reconstructed, where the preset template is used to characterize the three-dimensional structure information of the object to be reconstructed; and
inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
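The three steps just enumerated can be sketched end to end. The following is an illustrative outline only, not the implementation of the present application: the encoder and the graph convolutional network are stand-in functions, and all dimensions (a 3-channel image, a 6-vertex toy template, a projection back to 3-D coordinates) are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image):
    """Stand-in encoder: collapse the image to a fixed-length feature vector."""
    return image.mean(axis=(0, 1))             # toy pooling -> shape (C,)

def build_feature_map(feat_vec, template_vertices):
    """Attach the global feature vector to every template vertex."""
    v = template_vertices.shape[0]
    tiled = np.tile(feat_vec, (v, 1))          # (V, C)
    return np.concatenate([template_vertices, tiled], axis=1)  # (V, 3 + C)

def graph_cnn(feature_map):
    """Stand-in trained GCN: project per-vertex features back to 3-D coordinates."""
    w = rng.standard_normal((feature_map.shape[1], 3)) * 0.01
    return feature_map @ w                     # (V, 3) reconstructed vertices

image = rng.random((224, 224, 3))              # input image of the object
template = rng.random((6, 3))                  # toy preset template: 6 vertices
fmap = build_feature_map(extract_features(image), template)
mesh = graph_cnn(fmap)
print(fmap.shape, mesh.shape)                  # (6, 6) (6, 3)
```

The feature map keeps one row per template vertex, so the network's output can be read directly as the reconstructed vertex positions.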
Assuming the above is a first possible implementation, then in a second possible implementation provided on the basis of the first, the graph convolutional neural network includes N functional modules connected in series;
the input of the 1st functional module is the input of the graph convolutional neural network, the output of the Nth functional module is the output of the graph convolutional neural network, and N is an integer greater than 2;
each functional module includes a convolution unit, a normalization unit, and an activation function unit.
In a third possible implementation provided on the basis of the second, the 1st functional module includes at least three specified structures connected in series, where each specified structure consists of the convolution unit, the normalization unit, and the activation function unit connected in series;
among the at least three specified structures, the input of the first specified structure is the input of the 1st functional module, and the residual between the output of the last specified structure and the input of the 1st functional module is the output of the 1st functional module.
In a fourth possible implementation provided on the basis of the second, the i-th functional module includes at least two specified structures connected in series, where each specified structure consists of the convolution unit, the normalization unit, and the activation function unit connected in series, and i is an integer greater than 1 and less than N;
among the at least two specified structures, the input of the first specified structure is the output of the (i-1)-th functional module, and the residual between the output of the last specified structure and the output of the (i-1)-th functional module is the output of the i-th functional module.
In a fifth possible implementation provided on the basis of the second, the Nth functional module includes, connected in series in sequence, the convolution unit, the normalization unit, the activation function unit, and another convolution unit;
in the Nth functional module, the input of the first convolution unit is the output of the (N-1)-th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)-th functional module is the output of the Nth functional module.
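The module structure described in the second to fifth implementations can be sketched as follows. This is a minimal illustration under two stated assumptions: a plain linear map stands in for the (graph) convolution unit, and the "residual" is read as a standard skip connection that adds the module input back to the output of the last specified structure; the patent itself only names the residual.

```python
import numpy as np

def specified_structure(x, w):
    """One 'specified structure': a convolution unit (stand-in: linear map),
    a normalization unit, and an activation function unit, in series."""
    y = x @ w                                   # stand-in for a graph convolution
    y = (y - y.mean()) / (y.std() + 1e-6)       # normalization unit
    return np.maximum(y, 0.0)                   # activation function unit (ReLU)

def functional_module(x, weights):
    """A functional module: k specified structures in series, with the module
    input added back at the end (the assumed reading of 'residual')."""
    y = x
    for w in weights:
        y = specified_structure(y, w)
    return y + x                                # residual with the module input

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 8))                 # 6 vertices, 8 features each
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]  # 1st module: >= 3 structures
out = functional_module(x, weights)
print(out.shape)                                # (6, 8)
```

With this reading, a middle module would pass two or more such structures, and the skip connection lets gradients reach the early modules of the N-module chain directly.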
In a sixth possible implementation provided on the basis of any one of the second to fifth possible implementations, the convolution unit is a Chebyshev convolution unit.
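A Chebyshev convolution unit can be sketched as below, following the standard spectral formulation of Defferrard et al. (2016). Whether the present application uses exactly this formulation, and with which filter order, is not specified here, so the order-3 filter and the approximation lambda_max ≈ 2 are assumptions.

```python
import numpy as np

def chebyshev_conv(x, adj, weights):
    """Chebyshev graph convolution: y = sum_k T_k(L_scaled) @ x @ W_k,
    with T_0 = I, T_1 = L_scaled, T_k = 2 L_scaled T_{k-1} - T_{k-2}.
    lambda_max ~= 2 is assumed, so L_scaled = L - I for the normalized
    Laplacian L."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt  # normalized Laplacian
    lap_scaled = lap - np.eye(adj.shape[0])
    t_prev, t_curr = np.eye(adj.shape[0]), lap_scaled
    out = t_prev @ x @ weights[0] + t_curr @ x @ weights[1]
    for w in weights[2:]:
        t_prev, t_curr = t_curr, 2.0 * lap_scaled @ t_curr - t_prev
        out += t_curr @ x @ w
    return out

rng = np.random.default_rng(2)
adj = np.array([[0, 1, 1, 0],                   # toy 4-vertex graph
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = rng.standard_normal((4, 5))                 # 5 input features per vertex
weights = [rng.standard_normal((5, 7)) * 0.1 for _ in range(3)]  # order-3 filter
y = chebyshev_conv(x, adj, weights)
print(y.shape)                                  # (4, 7)
```

An order-K filter aggregates information from vertices up to K hops away without ever forming the graph's eigendecomposition, which is what makes this unit practical on a dense human-body mesh.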
In a seventh possible implementation provided on the basis of the first, when the object to be reconstructed is a human body, the preset template is a human body mesh; and generating the feature map according to the feature vector and the preset template for the object to be reconstructed includes:
constructing a graph structure in a preset format based on the human body mesh, the graph structure containing the vertex information of the human body mesh; and
fusing and concatenating the feature vector with the graph structure to obtain the feature map.
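Constructing the graph structure and fusing it with the feature vector can be illustrated as follows. A real system would use a full human-body template (for example an SMPL-style mesh with thousands of vertices); the 4-vertex tetrahedron and the 8-dimensional feature vector below are toy stand-ins.

```python
import numpy as np

def mesh_to_graph(vertices, faces):
    """Build a graph structure from a body mesh: vertex coordinates plus an
    adjacency matrix derived from the triangular faces."""
    n = vertices.shape[0]
    adj = np.zeros((n, n))
    for a, b, c in faces:                       # each triangle contributes 3 edges
        for i, j in ((a, b), (b, c), (c, a)):
            adj[i, j] = adj[j, i] = 1.0
    return vertices, adj

def fuse(feature_vector, graph_vertices):
    """Fuse-and-concatenate: tile the image feature vector onto every vertex."""
    tiled = np.tile(feature_vector, (graph_vertices.shape[0], 1))
    return np.concatenate([graph_vertices, tiled], axis=1)

verts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])  # toy mesh
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
graph_verts, adj = mesh_to_graph(verts, faces)
feature_map = fuse(np.arange(8.0), graph_verts)  # assumed 8-dim feature vector
print(feature_map.shape, int(adj.sum()))         # (4, 11) 12
```

The adjacency matrix is what the graph convolutions later operate on, while the concatenated per-vertex rows carry both the template's 3-D structure and the image's shape features.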
In an eighth possible implementation provided on the basis of the seventh, the processor 902 further implements the following steps when running the computer program stored in the memory 901:
during training of the graph convolutional neural network, calculating the total loss of the graph convolutional neural network based on a mesh loss, a three-dimensional joint loss, a surface normal loss, and a surface edge loss;
where the mesh loss is used to describe the positional difference between the ground-truth human mesh and the predicted human mesh;
the three-dimensional joint loss is used to describe the positional difference between the ground-truth three-dimensional human joints and the predicted three-dimensional human joints;
the surface normal loss is used to describe the angular difference between the normal vectors of the triangular faces of the ground-truth human mesh and the normal vectors of the triangular faces of the predicted human mesh; and
the surface edge loss is used to describe the difference in length between the edges of the triangular faces of the ground-truth human mesh and the edges of the triangular faces of the predicted human mesh.
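The four loss terms can be sketched as below. The concrete distance functions (L1 here), the cosine form of the angular term, and the equal weighting of the four terms are all assumptions made for illustration; the text above only names the losses, not their exact formulas.

```python
import numpy as np

def face_normals(verts, faces):
    """Unit normal of each triangular face."""
    a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def edge_lengths(verts, faces):
    """Lengths of the three edges of each triangular face."""
    a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    return np.stack([np.linalg.norm(b - a, axis=1),
                     np.linalg.norm(c - b, axis=1),
                     np.linalg.norm(a - c, axis=1)], axis=1)

def total_loss(pred_verts, gt_verts, pred_joints, gt_joints, faces):
    mesh_loss = np.abs(pred_verts - gt_verts).mean()          # vertex positions
    joint_loss = np.abs(pred_joints - gt_joints).mean()       # 3-D joints
    normal_loss = (1.0 - (face_normals(pred_verts, faces) *
                          face_normals(gt_verts, faces)).sum(axis=1)).mean()
    edge_loss = np.abs(edge_lengths(pred_verts, faces) -
                       edge_lengths(gt_verts, faces)).mean()
    return mesh_loss + joint_loss + normal_loss + edge_loss   # equal weights assumed

faces = np.array([[0, 1, 2], [0, 2, 3]])
gt = np.array([[0., 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
joints = np.array([[0.5, 0.5, 0.0]])
loss = total_loss(gt, gt, joints, joints, faces)
print(loss)                                                   # 0.0 for a perfect prediction
```

The normal and edge terms act on derived surface quantities rather than raw coordinates, which is what lets them penalize a mesh that hits the right vertex positions on average but has crumpled or stretched faces.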
In a ninth possible implementation provided on the basis of the first, performing feature extraction on the image of the object to be reconstructed to obtain the feature vector includes:
segmenting the image based on the object to be reconstructed to obtain a partial image;
resizing the partial image to a preset size; and
performing feature extraction on the resized partial image with an encoder employing a convolutional neural network to obtain the feature vector.
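The three preprocessing steps (segment, resize, encode) can be sketched as follows. The bounding-box crop, the nearest-neighbour resize, and the channel-averaging "encoder" are all simplifications standing in for a segmentation model, an interpolating resize, and a CNN backbone; the 224 × 224 preset size is also an assumption.

```python
import numpy as np

def crop_to_object(image, bbox):
    """Segment out the object to be reconstructed via its bounding box
    (a full segmentation mask could be used instead)."""
    x0, y0, x1, y1 = bbox
    return image[y0:y1, x0:x1]

def resize(image, size):
    """Nearest-neighbour resize to the preset size (illustrative; a real
    pipeline would use bilinear interpolation from an image library)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def encode(image):
    """Stand-in CNN encoder: real systems pool the activations of e.g. a
    ResNet backbone into the feature vector."""
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

rng = np.random.default_rng(3)
frame = rng.random((480, 640, 3))
local = crop_to_object(frame, (100, 50, 420, 450))   # assumed person bounding box
local = resize(local, 224)                            # preset size 224 x 224
feat = encode(local)
print(local.shape, feat.shape)                        # (224, 224, 3) (3,)
```

Fixing the input size before encoding keeps the feature vector's length constant regardless of how large the person appears in the original frame.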
It should be understood that in the embodiments of the present application, the processor 902 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 901 may include a read-only memory and a random-access memory, and provides instructions and data to the processor 902. Part or all of the memory 901 may also include a non-volatile random-access memory. For example, the memory 901 may also store information on device categories.
As can be seen from the above, in the embodiments of the present application, feature extraction is first performed on an image of an object to be reconstructed to obtain a feature vector characterizing the shape feature information of the object; the feature vector is then combined with a preset template for the object to generate a feature map; finally, the feature map is input into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object. In this process, because the feature vector is combined with the preset template, the resulting feature map contains not only the shape features of the object to be reconstructed but also the three-dimensional structure information of the object conveyed by the preset template. The trained graph convolutional neural network can therefore process the feature map more effectively, the three-dimensional reconstruction result obtained is more accurate, and the stability of the three-dimensional reconstruction is ensured.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is merely illustrative. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the above apparatus may be divided into different functional units or modules to accomplish all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are merely for ease of distinguishing them from one another and are not intended to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed or described in a given embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of external-device software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as exceeding the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into the modules or units described above is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer-readable memory, a read-only memory (ROM), a random-access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable storage medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A three-dimensional reconstruction method, comprising:
    performing feature extraction on an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used to characterize shape feature information of the object to be reconstructed;
    generating a feature map according to the feature vector and a preset template for the object to be reconstructed, wherein the preset template is used to characterize three-dimensional structure information of the object to be reconstructed; and
    inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
  2. The three-dimensional reconstruction method according to claim 1, wherein the graph convolutional neural network comprises N functional modules connected in series;
    wherein an input of the 1st functional module is an input of the graph convolutional neural network, an output of the Nth functional module is an output of the graph convolutional neural network, and N is an integer greater than 2;
    and wherein each functional module comprises a convolution unit, a normalization unit, and an activation function unit.
  3. The three-dimensional reconstruction method according to claim 2, wherein the 1st functional module comprises at least three specified structures connected in series, each specified structure comprising the convolution unit, the normalization unit, and the activation function unit connected in series;
    among the at least three specified structures, an input of the first specified structure is the input of the 1st functional module, and a residual between an output of the last specified structure and the input of the 1st functional module is an output of the 1st functional module.
  4. The three-dimensional reconstruction method according to claim 2, wherein the i-th functional module comprises at least two specified structures connected in series, each specified structure comprising the convolution unit, the normalization unit, and the activation function unit connected in series, wherein i is an integer greater than 1 and less than N;
    among the at least two specified structures, an input of the first specified structure is an output of the (i-1)-th functional module, and a residual between an output of the last specified structure and the output of the (i-1)-th functional module is an output of the i-th functional module.
  5. The three-dimensional reconstruction method according to claim 2, wherein the Nth functional module comprises, connected in series in sequence, the convolution unit, the normalization unit, the activation function unit, and another convolution unit;
    in the Nth functional module, an input of the first convolution unit is an output of the (N-1)-th functional module, and a residual between an output of the second convolution unit and the output of the (N-1)-th functional module is an output of the Nth functional module.
  6. The three-dimensional reconstruction method according to any one of claims 2 to 5, wherein the convolution unit is a Chebyshev convolution unit.
  7. The three-dimensional reconstruction method according to claim 1, wherein, when the object to be reconstructed is a human body, the preset template is a human body mesh; and generating the feature map according to the feature vector and the preset template for the object to be reconstructed comprises:
    constructing a graph structure in a preset format based on the human body mesh, the graph structure containing vertex information of the human body mesh; and
    fusing and concatenating the feature vector with the graph structure to obtain the feature map.
  8. The three-dimensional reconstruction method according to claim 7, wherein, during training of the graph convolutional neural network, a total loss of the graph convolutional neural network is calculated based on a mesh loss, a three-dimensional joint loss, a surface normal loss, and a surface edge loss;
    wherein the mesh loss is used to describe a positional difference between a ground-truth human mesh and a predicted human mesh;
    the three-dimensional joint loss is used to describe a positional difference between ground-truth three-dimensional human joints and predicted three-dimensional human joints;
    the surface normal loss is used to describe an angular difference between normal vectors of triangular faces of the ground-truth human mesh and normal vectors of triangular faces of the predicted human mesh; and
    the surface edge loss is used to describe a difference in length between edges of the triangular faces of the ground-truth human mesh and edges of the triangular faces of the predicted human mesh.
  9. The three-dimensional reconstruction method according to claim 1, wherein performing feature extraction on the image of the object to be reconstructed to obtain the feature vector comprises:
    segmenting the image based on the object to be reconstructed to obtain a partial image;
    resizing the partial image to a preset size; and
    performing feature extraction on the resized partial image with an encoder employing a convolutional neural network to obtain the feature vector.
  10. A three-dimensional reconstruction apparatus, comprising:
    an extraction module configured to perform feature extraction on an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used to characterize shape feature information of the object to be reconstructed;
    a generation module configured to generate a feature map according to the feature vector and a preset template for the object to be reconstructed, wherein the preset template is used to characterize three-dimensional structure information of the object to be reconstructed; and
    a reconstruction module configured to input the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
  11. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    performing feature extraction on an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used to characterize shape feature information of the object to be reconstructed;
    generating a feature map according to the feature vector and a preset template for the object to be reconstructed, wherein the preset template is used to characterize three-dimensional structure information of the object to be reconstructed; and
    inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
  12. The electronic device according to claim 11, wherein the graph convolutional neural network comprises N functional modules connected in series;
    wherein an input of the 1st functional module is an input of the graph convolutional neural network, an output of the Nth functional module is an output of the graph convolutional neural network, and N is an integer greater than 2;
    and wherein each functional module comprises a convolution unit, a normalization unit, and an activation function unit.
  13. The electronic device according to claim 12, wherein the 1st functional module comprises at least three specified structures connected in series, each specified structure comprising the convolution unit, the normalization unit, and the activation function unit connected in series;
    among the at least three specified structures, an input of the first specified structure is the input of the 1st functional module, and a residual between an output of the last specified structure and the input of the 1st functional module is an output of the 1st functional module.
  14. The electronic device according to claim 12, wherein the i-th functional module comprises at least two specified structures connected in series, each specified structure comprising the convolution unit, the normalization unit, and the activation function unit connected in series, wherein i is an integer greater than 1 and less than N;
    among the at least two specified structures, an input of the first specified structure is an output of the (i-1)-th functional module, and a residual between an output of the last specified structure and the output of the (i-1)-th functional module is an output of the i-th functional module.
  15. The electronic device according to claim 12, wherein the Nth functional module comprises, connected in series in sequence, the convolution unit, the normalization unit, the activation function unit, and another convolution unit;
    in the Nth functional module, an input of the first convolution unit is an output of the (N-1)-th functional module, and a residual between an output of the second convolution unit and the output of the (N-1)-th functional module is an output of the Nth functional module.
  16. The electronic device according to any one of claims 12 to 15, wherein the convolution unit is a Chebyshev convolution unit.
  17. The electronic device according to claim 11, wherein, when the object to be reconstructed is a human body, the preset template is a human body mesh; and generating the feature map according to the feature vector and the preset template for the object to be reconstructed comprises:
    constructing a graph structure in a preset format based on the human body mesh, the graph structure containing vertex information of the human body mesh; and
    fusing and concatenating the feature vector with the graph structure to obtain the feature map.
  18. The electronic device according to claim 17, wherein during training of the graph convolutional neural network, a total loss of the graph convolutional neural network is calculated based on a mesh loss, a three-dimensional joint loss, a surface normal loss, and a surface edge loss;
    wherein the mesh loss describes the positional difference between the real human body mesh and the predicted human body mesh;
    the three-dimensional joint loss describes the positional difference between the real three-dimensional human joints and the predicted three-dimensional human joints;
    the surface normal loss describes the angular difference between the normal vectors of the triangular faces of the real human body mesh and those of the predicted human body mesh; and
    the surface edge loss describes the difference in length between the edges of the triangular faces of the real human body mesh and those of the predicted human body mesh.
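The four loss terms can be sketched as follows. The concrete choices here are assumptions, not the claimed formulas: the mesh and joint losses are taken as L1 distances, the normal loss as one minus the cosine of the angle between corresponding face normals, and the weights `w` are purely illustrative:

```python
import numpy as np

def face_normals(verts, faces):
    # Unit normal of each triangular face; verts: (V, 3), faces: (F, 3).
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def edge_lengths(verts, faces):
    # Lengths of the three edges of each triangular face.
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    return np.stack([np.linalg.norm(v1 - v0, axis=1),
                     np.linalg.norm(v2 - v1, axis=1),
                     np.linalg.norm(v0 - v2, axis=1)], axis=1)

def total_loss(pred_v, gt_v, pred_j, gt_j, faces, w=(1.0, 1.0, 0.1, 20.0)):
    mesh = np.abs(pred_v - gt_v).mean()        # mesh (vertex position) loss
    joint = np.abs(pred_j - gt_j).mean()       # 3D joint loss
    normal = (1.0 - (face_normals(pred_v, faces)
                     * face_normals(gt_v, faces)).sum(axis=1)).mean()
    edge = np.abs(edge_lengths(pred_v, faces)
                  - edge_lengths(gt_v, faces)).mean()
    return w[0] * mesh + w[1] * joint + w[2] * normal + w[3] * edge
```

When the prediction matches the ground truth exactly, every term, and hence the total, is zero.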
  19. The electronic device according to claim 11, wherein performing feature extraction on the image of the object to be reconstructed to obtain the feature vector comprises:
    segmenting the image based on the object to be reconstructed to obtain a partial image;
    resizing the partial image to a preset size; and
    performing feature extraction on the resized partial image by an encoder employing a convolutional neural network to obtain the feature vector.
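The segment-resize-encode pipeline can be sketched as below. A bounding-box crop stands in for the segmentation step, nearest-neighbour interpolation stands in for the resize, the 224x224 preset size is an assumption, and `encoder` is a placeholder for the CNN encoder rather than any specific network:

```python
import numpy as np

def crop_to_subject(image, bbox):
    # Stand-in for segmentation: keep only the bounding box
    # (x0, y0, x1, y1) around the object to be reconstructed.
    x0, y0, x1, y1 = bbox
    return image[y0:y1, x0:x1]

def resize_nearest(image, out_h, out_w):
    # Nearest-neighbour resize to the preset input size.
    h, w = image.shape[:2]
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return image[rows][:, cols]

def extract_feature_vector(image, bbox, encoder, size=(224, 224)):
    # Crop, resize, then run the CNN encoder to get the feature vector.
    patch = resize_nearest(crop_to_subject(image, bbox), *size)
    return encoder(patch)
```

Cropping first means the encoder's fixed-size input is spent entirely on the subject rather than on background pixels.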
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the three-dimensional reconstruction method according to any one of claims 1 to 9.
PCT/CN2021/113308 2021-08-18 2021-08-18 Three-dimensional reconstruction method and apparatus, electronic device, and readable storage medium WO2023019478A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/113308 WO2023019478A1 (en) 2021-08-18 2021-08-18 Three-dimensional reconstruction method and apparatus, electronic device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023019478A1 true WO2023019478A1 (en) 2023-02-23

Family

ID=85239320

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/113308 WO2023019478A1 (en) 2021-08-18 2021-08-18 Three-dimensional reconstruction method and apparatus, electronic device, and readable storage medium

Country Status (1)

Country Link
WO (1) WO2023019478A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110428493A (en) * 2019-07-12 2019-11-08 清华大学 Single image human body three-dimensional method for reconstructing and system based on grid deformation
CN110458957A (en) * 2019-07-31 2019-11-15 浙江工业大学 A kind of three-dimensional image model construction method neural network based and device
US20200126297A1 (en) * 2018-10-17 2020-04-23 Midea Group Co., Ltd. System and method for generating acupuncture points on reconstructed 3d human body model for physical therapy
CN111369681A (en) * 2020-03-02 2020-07-03 腾讯科技(深圳)有限公司 Three-dimensional model reconstruction method, device, equipment and storage medium

Legal Events

Code Title Description
121  Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21953725; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)