CN113781659A - Three-dimensional reconstruction method and device, electronic equipment and readable storage medium


Info

Publication number
CN113781659A
Authority
CN
China
Prior art keywords
reconstructed
human body
dimensional reconstruction
functional module
graph
Legal status
Pending
Application number
CN202110951222.1A
Other languages
Chinese (zh)
Inventor
王磊
刘薰裕
马晓亮
刘宝玉
程俊
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Shenzhen Technology University
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Shenzhen Technology University
Application filed by Shenzhen Institute of Advanced Technology of CAS, Shenzhen Technology University filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110951222.1A
Publication of CN113781659A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 — Manipulating 3D models or images for computer graphics
    • G06T 19/006 — Mixed reality
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T 2200/04 — Indexing scheme for image data processing or generation involving 3D image data


Abstract

The application relates to the technical field of image processing, and particularly discloses a three-dimensional reconstruction method, a three-dimensional reconstruction apparatus, an electronic device and a computer-readable storage medium. The three-dimensional reconstruction method comprises the following steps: extracting features of an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used for representing shape feature information of the object to be reconstructed; generating a feature map according to the feature vector and a preset template for the object to be reconstructed, wherein the preset template is used for representing three-dimensional structure information of the object to be reconstructed; and inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed. Through this scheme, the stability of three-dimensional reconstruction can be improved.

Description

Three-dimensional reconstruction method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a three-dimensional reconstruction method, a three-dimensional reconstruction device, an electronic apparatus, and a computer-readable storage medium.
Background
Three-dimensional reconstruction of human body parts has long been a hot problem in computer vision and is widely used in the fields of Virtual Reality (VR) and Augmented Reality (AR). Conventional three-dimensional reconstruction techniques rely on relatively complex and expensive equipment, such as three-dimensional scanners, multi-view cameras or inertial sensors. Although three-dimensional reconstruction techniques based on a single image have since been developed, these techniques still suffer from problems such as an unstable reconstruction effect.
Disclosure of Invention
The application provides a three-dimensional reconstruction method, a three-dimensional reconstruction device, an electronic device and a computer-readable storage medium, which can solve the problem of unstable reconstruction effect of the existing three-dimensional reconstruction technology.
In a first aspect, the present application provides a three-dimensional reconstruction method, including:
extracting features of an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used for representing shape feature information of the object to be reconstructed;
generating a feature map according to the feature vector and a preset template for the object to be reconstructed, wherein the preset template is used for representing three-dimensional structure information of the object to be reconstructed;
inputting the feature map into a trained graph convolutional neural network (GCN) to obtain a three-dimensional reconstruction result of the object to be reconstructed.
In a second aspect, the present application provides a three-dimensional reconstruction apparatus, comprising:
the extraction module is used for extracting the characteristics of the image of the object to be reconstructed to obtain a characteristic vector, wherein the characteristic vector is used for representing the shape characteristic information of the object to be reconstructed;
a generating module, configured to generate a feature map according to the feature vector and a preset template for the object to be reconstructed, where the preset template is used to represent three-dimensional structure information of the object to be reconstructed;
and the reconstruction module is used for inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by one or more processors, performs the steps of the method of the first aspect as described above.
Compared with the prior art, the application has the following beneficial effects: for an image of an object to be reconstructed, feature extraction is performed on the image to obtain a feature vector representing the shape feature information of the object to be reconstructed; the feature vector is then combined with a preset template for the object to be reconstructed to generate a feature map; finally, the feature map is input into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed. Because the feature vector is combined with the preset template, the finally generated feature map carries, in addition to the shape features of the object to be reconstructed, the three-dimensional structure information conveyed by the preset template. The trained graph convolutional neural network can therefore process the feature map better, the obtained three-dimensional reconstruction result is more accurate, and the stability of three-dimensional reconstruction is guaranteed.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flow chart of an implementation of a three-dimensional reconstruction method provided in an embodiment of the present application;
fig. 2 is an exemplary diagram of a preset template when an object to be reconstructed is a human body according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of the 1st functional module of a graph convolutional neural network provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an ith functional module of a graph convolution neural network provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an nth functional module of a graph convolution neural network provided in an embodiment of the present application;
fig. 6 is a diagram illustrating an example of the overall structure of a graph convolutional neural network provided in an embodiment of the present application;
fig. 7 is a diagram of an example working framework of a three-dimensional reconstruction method provided in an embodiment of the present application;
fig. 8 is a block diagram of a three-dimensional reconstruction apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
At present, the existing three-dimensional reconstruction technology still has the problem of an unstable reconstruction effect. To solve this problem, embodiments of the present application provide a three-dimensional reconstruction method, a three-dimensional reconstruction apparatus, an electronic device and a computer-readable storage medium. After the feature vector of the object to be reconstructed is extracted from the image, the feature vector is combined with a preset template representing the three-dimensional structure information of the object to be reconstructed to obtain a feature map, so that the feature map contains, in addition to the shape features of the object to be reconstructed, the three-dimensional structure information conveyed by the preset template. The trained graph convolutional neural network can therefore process the feature map better, obtain a more accurate three-dimensional reconstruction result, and ensure the stability of three-dimensional reconstruction. In order to explain the technical solution proposed in the present application, the following description is given by way of specific examples.
The three-dimensional reconstruction method proposed in the embodiments of the present application is explained below. Referring to fig. 1, the implementation flow of the three-dimensional reconstruction method is detailed as follows:
step 101, extracting features of an image of an object to be reconstructed to obtain a feature vector.
In the embodiment of the application, the electronic equipment can shoot the object to be reconstructed through the camera carried by the electronic equipment, so that the electronic equipment can obtain the image of the object to be reconstructed; alternatively, the object to be reconstructed may be photographed by a third-party device equipped with a camera, an image of the object to be reconstructed is obtained, and then the image is transmitted to the electronic device in a wireless or wired manner, so that the electronic device obtains the image of the object to be reconstructed, where the obtaining manner of the image of the object to be reconstructed is not limited here.
After the electronic equipment obtains the image of the object to be reconstructed, the electronic equipment can extract the features of the image to obtain the feature vector. Considering that the three-dimensional reconstruction result obtained by the three-dimensional reconstruction operation is mainly used for restoring the posture of the object to be reconstructed, and the posture is most relevant to the shape characteristic of the object to be reconstructed, therefore, the operation of characteristic extraction mainly aims at the shape characteristic information of the object to be reconstructed; that is, the obtained feature vector is actually used to characterize the shape feature information of the object to be reconstructed. It is to be understood that the shape feature information includes a contour feature describing a boundary shape of the object to be reconstructed and/or a region feature describing an internal shape of the object to be reconstructed, and the like, and is not limited herein.
In some embodiments, in addition to the image information of the object to be reconstructed, there may be some redundant information and noise information in the image of the object to be reconstructed. In order to avoid that the redundant information and the noise information influence the accuracy of subsequent feature extraction, the electronic equipment can firstly carry out preprocessing on the image, such as segmentation operation, size adjustment operation and the like; this step 101 may comprise:
and A1, segmenting the image based on the object to be reconstructed to obtain a local image.
The electronic device may perform frame detection on the image based on the object to be reconstructed, that is, identify a frame of the object to be reconstructed from the image. The frame is generally a rectangular frame; of course, other predetermined shapes of the frame are also possible, and are not limited herein. The image area within the bounding box can be considered as a region of interest. The image to be processed is segmented based on the frame, namely the region of interest can be segmented from the image, so that a large amount of redundant information and noise information contained in the background of the image can be removed to a certain extent, a local image mainly containing an object to be reconstructed is obtained, and the object to be reconstructed can be located at the center of the local image as far as possible.
For example only, when the object to be reconstructed is a human body, the electronic device may perform border detection on the image by using the human-body two-dimensional keypoint detection technique OpenPose.
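For illustration only, the following is a minimal sketch of step A1, assuming the two-dimensional keypoints have already been obtained from an off-the-shelf detector such as OpenPose; the keypoint array layout and the 10% margin are illustrative assumptions, not requirements of the present application:

```python
import numpy as np

def crop_to_person(image: np.ndarray, keypoints_2d: np.ndarray,
                   margin: float = 0.1) -> np.ndarray:
    """image: (H, W, 3); keypoints_2d: (num_keypoints, 2) array of (x, y).
    Returns the region of interest around the detected person."""
    x_min, y_min = keypoints_2d.min(axis=0)
    x_max, y_max = keypoints_2d.max(axis=0)
    w, h = x_max - x_min, y_max - y_min
    # Enlarge the frame slightly so limbs are not clipped at the border.
    x0 = max(int(x_min - margin * w), 0)
    y0 = max(int(y_min - margin * h), 0)
    x1 = min(int(x_max + margin * w), image.shape[1])
    y1 = min(int(y_max + margin * h), image.shape[0])
    return image[y0:y1, x0:x1]
```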
And A2, adjusting the size of the local image to a preset size.
In order to facilitate the generation of the subsequent feature map, the electronic device may unify the sizes of the partial images: and if the size of the local image is not consistent with the preset size, carrying out scaling processing on the local image until the size of the local image is consistent with the preset size. For example only, the preset size may be: 224 × 224, in pixels. The number of channels of the partial image is usually 3, which represents three channels of Red, Green and Blue (RGB). That is, the size of the final partial image is: 224 × 224 × 3.
A3, performing feature extraction on the local image after size adjustment by using an encoder of a Convolutional Neural Network (CNN) to obtain a feature vector.
The electronic device may pre-train the convolutional neural network in advance on a given data set for a classification task, and the pre-training process may refer to a current general training process for the neural network, which is not described herein again. After the pre-training is finished, the classification layer in the convolutional neural network is removed, the feature extraction layer before the classification layer is reserved, and the reserved result forms an encoder.
For example only, the convolutional neural network may be ResNet50, the given data set may be an ImageNet data set, and the output of the encoder of the convolutional neural network is a feature vector of 2048 dimensions.
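For illustration only, a minimal sketch of steps A2-A3 is given below, assuming PyTorch/torchvision and the ResNet50 mentioned above; the classification layer is removed and the remaining feature-extraction layers form the encoder producing the 2048-dimensional feature vector:

```python
import torch
import torchvision

# ResNet50 pre-trained on ImageNet for a classification task; dropping the
# final fc (classification) layer leaves the feature-extraction encoder.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
encoder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

def extract_feature(local_image: torch.Tensor) -> torch.Tensor:
    """local_image: (1, 3, 224, 224) float tensor; returns a (1, 2048) vector."""
    with torch.no_grad():
        return encoder(local_image).flatten(1)
```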
And 102, generating a feature map according to the feature vector and a preset template aiming at the object to be reconstructed.
In the embodiment of the application, the preset template is used for representing three-dimensional structure information of an object to be reconstructed, specifically three-dimensional structure information of the object to be reconstructed in a specified posture. Different preset templates are set according to different types of objects to be reconstructed. For example only, when the object to be reconstructed is a human body, the preset template is a human body grid map; and when the object to be reconstructed is a hand, the preset template is a hand grid map.
In some embodiments, considering that the graph convolutional neural network is a neural network structure proposed for better processing of graph data structures, to improve the efficiency of three-dimensional reconstruction by the graph convolutional neural network model, the electronic device may express the preset template as a graph structure and combine it with the feature vector to generate the feature map. Since the preset template is usually a grid graph, it can be regarded as an undirected graph composed of a vertex set and an edge set, and can be represented as G = (V, E), where G represents the grid graph, V represents the vertex set and E represents the edge set. Since the mesh in the mesh map is formed by splicing triangular faces, the mesh map can also be regarded as being composed of a vertex set and a triangular-face set, and can be represented as M = (V, F), where M represents the vertex information matrix of the grid graph, V represents the vertex set and F represents the triangular-face set, each triangular face in F being a triangle composed of three vertices in V. It will be appreciated that the information of the edge set of the grid graph is actually contained in the triangular-face set.
In an application scenario, if the object to be reconstructed is a human body and the preset template is a human body grid map, step 102 may include:
and B1, constructing a graph structure with a preset format based on the human body grid graph.
For example only, the human body mesh map adopted by the electronic device for the human body may be the standard template defined by the SMPL (Skinned Multi-Person Linear) model. As shown in fig. 2, the human body grid map represents a three-dimensional mesh of the human body in T-pose.
With the graph structure in the general format shown above, the electronic device may first represent the human body mesh as M_smpl = (V, F), where V represents the vertex set of the human mesh map, with 6890 vertices in total, and F represents the triangular-face set of the human mesh map, each triangular face being composed of three vertices. The graph structure M_smpl = (V, F) in the general format is then converted to obtain the graph-structure expression M_smpl = (V, A) in the preset format, where V still represents the vertex set, and A ∈ {0,1}^{6890×6890} represents the adjacency matrix of the human body grid map, whose elements take the value 0 or 1; specifically, (A)_ij = 1 if the i-th vertex is connected to the j-th vertex, and (A)_ij = 0 otherwise. By expressing the preset template of the object to be reconstructed through the graph structure in the preset format, the electronic device provides a basis for subsequently predicting the three-dimensional coordinates of the human mesh with the graph convolutional neural network.
In some embodiments, to reduce the computational complexity of subsequent graph convolution operations, the electronic device may downsample the standard template defined by the SMPL model by a factor of 4, and use the result of the downsampling as a preset template (i.e., a human mesh graph). It can be appreciated that the number of vertices of the resulting 4-fold downsampled human mesh map is reduced to 1723. And subsequently, the final three-dimensional reconstruction result can be obtained by only performing up-sampling on the three-dimensional reconstruction result output by the graph convolution neural network by 4 times, and the reconstruction task is completed.
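For illustration only, the 4x down-sampling and the matching up-sampling can be sketched as the application of precomputed sparse sampling matrices, such as those produced by mesh-simplification methods; the matrices D and U themselves are assumptions, not provided by the present application:

```python
import torch

def downsample(vertices: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """vertices: (6890, 3) SMPL template; D: sparse (1723, 6890) matrix.
    Returns the (1723, 3) down-sampled template."""
    return torch.sparse.mm(D, vertices)

def upsample(vertices: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """vertices: (1723, 3) network output; U: sparse (6890, 1723) matrix.
    Returns the (6890, 3) final three-dimensional reconstruction result."""
    return torch.sparse.mm(U, vertices)
```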
For the down-sampled human body mesh map, the finally obtained graph structure with the preset format is represented as:

M_h = (V_h, A_h), V_h ∈ R^{1723×3}, A_h ∈ {0,1}^{1723×1723}

where M_h represents the vertex information matrix of the down-sampled human mesh map; V_h represents the vertex set of the down-sampled human mesh map, V_h ∈ R^{1723×3} indicating that the coordinate values of the three-dimensional coordinates of its 1723 vertices are all real numbers; A_h represents the adjacency matrix of the down-sampled human body grid map, whose meaning has been described above and is not repeated here; and A_h ∈ {0,1}^{1723×1723} indicates that each element of the adjacency matrix is 0 or 1.
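For illustration only, the adjacency matrix A_h can be built from the triangular-face set of the (down-sampled) mesh as sketched below; the (n_faces, 3) integer array layout of `faces` is an assumed input format:

```python
import numpy as np

def adjacency_from_faces(faces: np.ndarray, num_vertices: int = 1723) -> np.ndarray:
    """Each triangle (i, j, k) contributes the edges (i, j), (j, k) and (i, k),
    confirming that the edge information is contained in the face set."""
    A = np.zeros((num_vertices, num_vertices), dtype=np.uint8)
    for i, j, k in faces:
        A[i, j] = A[j, i] = 1
        A[j, k] = A[k, j] = 1
        A[i, k] = A[k, i] = 1
    return A
```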
And B2, carrying out fusion splicing on the feature vector and the graph structure to obtain a feature graph.
The 2048-dimensional feature vector obtained in step 101 is f ∈ R^{2048}, and the vertex information matrix of the 4x down-sampled human mesh map obtained in step B1 is M_h ∈ R^{1723×3}. Fusing and splicing the two yields the feature map, which is the input of the subsequent graph convolutional neural network and can be expressed as F_in ∈ R^{1723×2051}. The above operation can be understood as: the 2048-dimensional feature vector is stitched to each vertex.
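For illustration only, a minimal sketch of step B2 in PyTorch: the 2048-dimensional image feature is tiled and concatenated to the 3-dimensional coordinates of each of the 1723 template vertices, giving the (1723, 2051) feature map F_in described above:

```python
import torch

def build_feature_map(f: torch.Tensor, template_vertices: torch.Tensor) -> torch.Tensor:
    """f: (2048,) image feature; template_vertices: (1723, 3) matrix M_h.
    Returns F_in of shape (1723, 2051)."""
    tiled = f.unsqueeze(0).expand(template_vertices.shape[0], -1)  # (1723, 2048)
    return torch.cat([template_vertices, tiled], dim=1)            # (1723, 2051)
```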
In another application scenario, where the object to be reconstructed is a hand and the preset template is a hand grid map, the generation flow of the feature map is similar to steps B1 and B2, only with a different preset template. For example only, the hand grid map adopted by the electronic device for the hand may be the standard template defined by the MANO (hand Model with Articulated and Non-rigid defOrmations) model. It should be noted that, since the hand grid contains relatively few vertices (778 in the MANO template), there is no need to down-sample the hand grid.
In summary, the feature map actually combines the vertex position information of the preset template and the shape feature information of the portion to be reconstructed represented in the image.
And 103, inputting the feature map into the trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
In the embodiment of the application, the graph convolution neural network takes the feature graph as input, and finally outputs the transformed mesh vertex position information of the object to be reconstructed as a three-dimensional reconstruction result. The general structure of the graph convolution neural network is described below:
the graph convolution neural network comprises N functional modules which are connected in series; the input of the 1 st functional module is the input of the convolutional neural network, the output of the Nth functional module is the output of the convolutional neural network, and N is an integer greater than 2. It can be understood that the 1 st functional module is mainly used for receiving the input of the graph convolution neural network, the ith functional module is mainly used for performing data calculation and transmission operations, and the nth functional module is mainly used for outputting mesh vertex position information of a finally predicted object to be reconstructed, wherein i is an integer greater than 1 and smaller than N.
Specifically, each functional module includes the following three basic units: convolution unit, normalization unit and activation function unit. The following describes specific structures of the functional modules:
referring to fig. 3, fig. 3 shows a structural schematic diagram of the 1 st functional module. For the 1 st functional module, the functional module includes at least three specified structures (only 3 are shown in fig. 3), and the at least three specified structures are connected in series in sequence, and the specified structure includes a convolution unit, a normalization unit and an activation function unit which are connected in series in sequence; in the at least three specified structures: the input of the first specified structure is the input of the 1 st functional module (namely the input of the graph convolution neural network); the residual between the output of the last specified structure and the input of the 1 st functional block (i.e., the input of the graph convolutional neural network) is the output of the 1 st functional block.
Referring to fig. 4, fig. 4 shows a structural schematic diagram of the ith functional module. The ith functional module includes at least two specified structures (only 2 are shown in fig. 4), the at least two specified structures are connected in series in sequence, and the specified structures are the same as those of the 1st functional module and are not described here again. In the at least two specified structures: the input of the first specified structure is the output of the (i-1)th functional module, and the residual between the output of the last specified structure and the output of the (i-1)th functional module is the output of the ith functional module.
Referring to fig. 5, fig. 5 shows a structural schematic diagram of the Nth functional module. The Nth functional module comprises a convolution unit, a normalization unit, an activation function unit and a convolution unit which are sequentially connected in series; the input of the first convolution unit is the output of the (N-1)th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)th functional module is the output of the Nth functional module. It can be understood that, since the Nth functional module is used to output the finally predicted mesh vertex position information of the object to be reconstructed, the data output by the second convolution unit of the Nth functional module does not need to undergo the normalization process and the activation function process.
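For illustration only, the following PyTorch sketch shows the specified structure and a middle (ith) functional module with its residual connection; it assumes the ChebConv unit sketched after the Chebyshev formulas below (Python resolves the name at call time, so defining ChebConv before instantiation suffices), and the normalization and activation types are assumptions, since the text only specifies "a normalization unit" and "an activation function unit". The 1st and Nth modules follow the same pattern with a different number of units, as described above.

```python
import torch
import torch.nn as nn

class SpecifiedStructure(nn.Module):
    """Convolution unit -> normalization unit -> activation function unit."""
    def __init__(self, in_dim: int, out_dim: int, L_tilde: torch.Tensor):
        super().__init__()
        self.conv = ChebConv(in_dim, out_dim, K=3, L_tilde=L_tilde)
        self.norm = nn.LayerNorm(out_dim)   # assumed normalization type
        self.act = nn.ReLU()                # assumed activation type

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_vertices, in_dim) -> (num_vertices, out_dim)
        return self.act(self.norm(self.conv(x)))

class MiddleFunctionalModule(nn.Module):
    """The ith module (1 < i < N): two specified structures in series, with a
    residual connection from the module input to the last structure's output."""
    def __init__(self, dim: int, L_tilde: torch.Tensor):
        super().__init__()
        self.body = nn.Sequential(SpecifiedStructure(dim, dim, L_tilde),
                                  SpecifiedStructure(dim, dim, L_tilde))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + x
```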
Referring to fig. 6, fig. 6 shows an example of the overall structure of a graph convolutional neural network including 4 functional modules. For ease of understanding, the parameters f_in and f_out may be used to represent the change of the feature dimension during the graph convolution operation of each functional module. Taking fig. 6 as an example, if the 1st functional module includes 3 convolution units, the change of the feature dimension can be represented in the form (f_in, f_out1, f_out2, f_out3), where f_in is the feature dimension of the initial input, and f_out1, f_out2 and f_out3 are the feature dimensions output by the three convolution units respectively; since the normalization unit and the activation function unit do not change the feature dimension, the feature dimension input to the second convolution unit is the same as that output by the first convolution unit, and so on. Still taking fig. 6 as an example, if each of the 2nd, 3rd and 4th functional modules includes 2 convolution units, the change of the feature dimension can be represented in the form (f_in, f_out1, f_out2), with the meaning of the parameters as above, which is not repeated here.
By way of example only, in the case where the object to be reconstructed is a human body, the resulting feature map is F_in ∈ R^{1723×2051}. Passing the feature map through the graph convolutional neural network yields its final output F_out ∈ R^{1723×3}. After up-sampling, M_out ∈ R^{6890×3} is obtained, i.e. the final three-dimensional reconstruction result for the human body.
In one example, the convolution unit may be a Chebyshev convolution unit, i.e. a unit that constructs its graph convolution using Chebyshev polynomials. The Chebyshev convolution unit can speed up the processing of the graph convolutional neural network to a certain extent and improve the efficiency of three-dimensional reconstruction.

The Chebyshev polynomials are defined as follows:

T_0(x) = 1;  T_1(x) = x;  T_{n+1}(x) = 2x·T_n(x) − T_{n−1}(x)
Based on the Chebyshev polynomials, the Chebyshev convolution algorithm can be obtained, which is expressed as follows:

F_out = Σ_{k=0}^{K−1} T_k(L̃) · F_in · θ_k

wherein:

F_in ∈ R^{N×f_in} represents the input features.

F_out ∈ R^{N×f_out} represents the output features.

K denotes the use of a Chebyshev polynomial of order K; in the embodiment of the present application, each Chebyshev convolution unit of the graph convolutional neural network takes K = 3.

θ_k ∈ R^{f_in×f_out} represents the feature transformation matrix, whose parameters are the values that the graph convolutional neural network needs to learn.

L̃ represents the scaled Laplacian matrix of the preset template. When the object to be reconstructed is a human body and the adopted preset template is the down-sampling result of the standard template defined by the SMPL model, N is the number of vertices after down-sampling, namely 1723. The scaled Laplacian matrix is specifically:

L̃ = 2·L_p/λ_max − I

L_p = I − D_h^{−1/2} · A_h · D_h^{−1/2}

wherein I is the identity matrix, D_h with (D_h)_ii = Σ_j (A_h)_ij is the diagonal degree matrix, and λ_max is the maximum eigenvalue of the matrix L_p.
For ease of understanding, the Chebyshev convolution algorithm given above is expanded below taking K = 3 as an example:

X_0 = F_in

X_1 = L̃ · F_in

X_2 = 2·L̃·X_1 − X_0

F_out = [X_0, X_1, X_2] · W,  W = [θ_0, θ_1, θ_2]^T ∈ R^{3f_in×f_out}

wherein X_0, X_1 and X_2 are intermediate quantities with no actual physical meaning, and W is the parameter that the graph convolutional neural network needs to learn.
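For illustration only, a minimal PyTorch sketch of the scaled Laplacian and the K = 3 Chebyshev convolution unit following the formulas above; a dense adjacency matrix is assumed for simplicity (a sparse representation would be used in practice):

```python
import torch
import torch.nn as nn

def scaled_laplacian(A: torch.Tensor) -> torch.Tensor:
    """L~ = 2*L_p/lambda_max - I, with L_p = I - D^(-1/2) A D^(-1/2)."""
    n = A.shape[0]
    d = A.float().sum(dim=1).clamp(min=1.0)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    L_p = torch.eye(n) - D_inv_sqrt @ A.float() @ D_inv_sqrt
    lam_max = torch.linalg.eigvalsh(L_p).max()
    return 2.0 * L_p / lam_max - torch.eye(n)

class ChebConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, K: int, L_tilde: torch.Tensor):
        super().__init__()
        assert K >= 2
        self.K = K
        self.register_buffer("L_tilde", L_tilde)
        # W stacks theta_0 ... theta_{K-1}; it is what the network learns.
        self.W = nn.Parameter(torch.randn(K * in_dim, out_dim) * 0.01)

    def forward(self, F_in: torch.Tensor) -> torch.Tensor:
        # Chebyshev recursion: X_0 = F_in, X_1 = L~ F_in, X_k = 2 L~ X_{k-1} - X_{k-2}.
        X = [F_in, self.L_tilde @ F_in]
        for _ in range(2, self.K):
            X.append(2.0 * self.L_tilde @ X[-1] - X[-2])
        return torch.cat(X, dim=1) @ self.W  # (N, K*f_in) @ (K*f_in, f_out)
```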
In some embodiments, where the object to be reconstructed is a human body, the graph convolutional neural network may be trained using both the Human3.6M and MSCOCO datasets. Specifically, since these two datasets do not store a real human mesh for each training sample but only the position information of the real human three-dimensional joints, a high-precision real human mesh needs to be fitted in advance from the real three-dimensional joint positions of each training sample; this mesh can then be used as a strong label in the training process of the graph convolutional neural network. That is, the real human mesh referred to here is actually a high-precision result fitted from the real human three-dimensional joints. It can be understood that the training process of the graph convolutional neural network is basically the same as that of a general neural network, except that a new loss function is adopted, so that the three-dimensional reconstruction result output by the trained model is smoother, more complete and more practical. The loss function is:
loss = λ_a·L_v + λ_b·L_j + λ_c·L_n + λ_d·L_e

wherein λ_a, λ_b, λ_c and λ_d are all hyper-parameters.
Wherein L isvA mesh loss is represented, which is used to describe a difference in position between the real human mesh and the predicted human mesh. The positions of the vertices of the real human body mesh are represented by M, the positions of the vertices of the predicted human body mesh are represented by M, and L is lost by using L1 lossvIs represented as follows:
Lv=||M-M*||1
wherein L_j represents the three-dimensional joint loss, which is used to describe the position difference between the real human three-dimensional joints and the predicted human three-dimensional joints. With J3D* denoting the positions of the real human three-dimensional joints, J ∈ R^{v×N} denoting the matrix that extracts the v joints from the N-vertex human mesh, and M denoting the vertex positions of the predicted human mesh (so that J·M gives the predicted joint positions), and using the L1 loss, the three-dimensional joint loss is expressed as follows:

L_j = ||J·M − J3D*||_1
wherein L_n represents the surface normal loss, which is used to describe the angular difference between the normal vectors of the triangular faces of the real human mesh and those of the predicted human mesh. With f denoting a triangular face of the predicted human mesh, n_f* denoting the unit normal vector of the triangular face corresponding to f in the real human mesh, and m_i and m_j denoting two vertex coordinates in f, the surface normal loss L_n is expressed as follows:

L_n = Σ_f Σ_{{i,j}⊂f} | ⟨ (m_i − m_j) / ||m_i − m_j||_2 , n_f* ⟩ |
wherein L_e represents the surface edge loss, which is used to describe the length difference between the side lengths of the triangular faces of the real human mesh and those of the predicted human mesh. With f denoting a triangular face of the predicted human mesh, m_i and m_j denoting two vertex coordinates in f, and m_i* and m_j* denoting the corresponding vertex coordinates in the real human mesh, the surface edge loss L_e is expressed as follows:

L_e = Σ_f Σ_{{i,j}⊂f} | ||m_i − m_j||_2 − ||m_i* − m_j*||_2 |
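For illustration only, the four losses can be sketched in PyTorch as below, following the formulas above; `faces` as an (n_faces, 3) index tensor, `J_reg` as the joint-extraction matrix, and the hyper-parameter values are illustrative assumptions (means are used instead of plain sums for numerical convenience):

```python
import torch

def edges_of(faces: torch.Tensor) -> torch.Tensor:
    """All three edges {i, j} of every triangular face, shape (3*n_faces, 2)."""
    return torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [0, 2]]], dim=0)

def total_loss(M, M_star, J_reg, J3D_star, faces, n_star,
               lams=(1.0, 1.0, 0.1, 20.0)):
    # Mesh loss L_v: L1 distance between predicted and real vertex positions.
    L_v = (M - M_star).abs().mean()
    # Three-dimensional joint loss L_j.
    L_j = (J_reg @ M - J3D_star).abs().mean()
    e = edges_of(faces)                          # (3*n_faces, 2)
    d = M[e[:, 0]] - M[e[:, 1]]                  # predicted edge vectors
    d_star = M_star[e[:, 0]] - M_star[e[:, 1]]   # real edge vectors
    n = n_star.repeat(3, 1)                      # real unit normal of each edge's face
    # Surface normal loss L_n: predicted edges should be orthogonal to real normals.
    L_n = (d / d.norm(dim=1, keepdim=True) * n).sum(dim=1).abs().mean()
    # Surface edge loss L_e: predicted edge lengths should match the real mesh.
    L_e = (d.norm(dim=1) - d_star.norm(dim=1)).abs().mean()
    la, lb, lc, ld = lams
    return la * L_v + lb * L_j + lc * L_n + ld * L_e
```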
for easy understanding, please refer to fig. 7, where fig. 7 takes a portion to be reconstructed as a human body as an example, and an example of a working framework of the three-dimensional reconstruction method in the embodiment of the present application is given. The working frame consists of two parts, namely an encoder based on a convolution neural network and a human body three-dimensional vertex regressor based on a graph convolution neural network. The method comprises the steps of obtaining an original image of a human body after shooting a certain person, using the original image as an initial input, obtaining a local image after preprocessing, coding the local image into a group of feature vectors through a convolutional neural network-based coder, fusing and splicing the group of feature vectors and grid vertex position information in a preset human body grid map to form a feature map as an input of a graph convolution neural network, and finally enabling the graph convolution neural network to regress a group of new grid vertex position information to be in accordance with two-dimensional observation of the original image on the human body so as to complete a three-dimensional reconstruction task of the human body.
As can be seen from the above, in the embodiment of the present application, feature extraction is performed on an image of an object to be reconstructed to obtain a feature vector representing the shape feature information of the object to be reconstructed; the feature vector is then combined with a preset template for the object to be reconstructed to generate a feature map; finally, the feature map is input into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed. Because the feature vector is combined with the preset template, the finally generated feature map carries, in addition to the shape features of the object to be reconstructed, the three-dimensional structure information conveyed by the preset template. The trained graph convolutional neural network can therefore process the feature map better, the obtained three-dimensional reconstruction result is more accurate, and the stability of three-dimensional reconstruction is guaranteed.
Corresponding to the three-dimensional reconstruction method provided above, the embodiment of the present application further provides a three-dimensional reconstruction device. As shown in fig. 8, the three-dimensional reconstruction apparatus 800 includes:
an extraction module 801, configured to perform feature extraction on an image of an object to be reconstructed to obtain a feature vector, where the feature vector is used to represent shape feature information of the object to be reconstructed;
a generating module 802, configured to generate a feature map according to the feature vector and a preset template for the object to be reconstructed, where the preset template is used to represent three-dimensional structure information of the object to be reconstructed;
and the reconstruction module 803 is configured to input the feature map into a trained graph convolution neural network, so as to obtain a three-dimensional reconstruction result of the object to be reconstructed.
Optionally, the graph convolutional neural network includes N functional modules connected in series;
wherein the input of the 1st functional module is the input of the graph convolutional neural network, the output of the Nth functional module is the output of the graph convolutional neural network, and N is an integer greater than 2;
the function module comprises a convolution unit, a normalization unit and an activation function unit.
Optionally, the 1st functional module includes at least three specified structures, the at least three specified structures are sequentially connected in series, and each specified structure includes the convolution unit, the normalization unit and the activation function unit, which are sequentially connected in series;

in the at least three specified structures: the input of the first specified structure is the input of the 1st functional module, and the residual between the output of the last specified structure and the input of the 1st functional module is the output of the 1st functional module.
Optionally, the ith functional module includes at least two specified structures, the at least two specified structures are sequentially connected in series, each specified structure includes the convolution unit, the normalization unit and the activation function unit, which are sequentially connected in series, and i is an integer greater than 1 and less than N;

in the at least two specified structures: the input of the first specified structure is the output of the (i-1)th functional module, and the residual between the output of the last specified structure and the output of the (i-1)th functional module is the output of the ith functional module.
Optionally, the Nth functional module includes the convolution unit, the normalization unit, the activation function unit and the convolution unit connected in series in sequence;

in the Nth functional module: the input of the first convolution unit is the output of the (N-1)th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)th functional module is the output of the Nth functional module.
Optionally, the convolution unit is a chebyshev convolution unit.
Optionally, when the object to be reconstructed is a human body, the preset template is a human body grid map; the generating module 802 includes:
a construction unit, configured to construct a graph structure in a preset format based on the human body mesh map, where the graph structure includes vertex information of the human body mesh map;
and the splicing unit is used for fusing and splicing the feature vector and the graph structure to obtain the feature graph.
Optionally, the three-dimensional reconstruction apparatus 800 further includes:
the calculation module is used for calculating the total loss of the graph convolution neural network based on grid loss, three-dimensional joint loss, surface normal loss and surface edge loss in the training process of the graph convolution neural network;
wherein, the grid loss is used for describing the position difference between the real human body grid and the predicted human body grid;
the three-dimensional joint loss is used for describing the position difference between the real three-dimensional joints of the human body and the predicted three-dimensional joints of the human body;
the surface normal loss is used for describing an angle difference between a normal vector of a triangular surface of a real human body mesh and a normal vector of a triangular surface of a predicted human body mesh;
the above surface edge loss is used to describe the length difference between the side lengths of the triangular faces of the real human body mesh and the side lengths of the triangular faces of the predicted human body mesh.
Optionally, the extracting module 801 includes:
the segmentation unit is used for segmenting the image based on the object to be reconstructed to obtain a local image;
an adjusting unit, configured to adjust the size of the local image to a preset size;
and the extracting unit is used for extracting the features of the local image after the size adjustment through an encoder adopting a convolutional neural network to obtain the feature vector.
As can be seen from the above, in the embodiment of the present application, feature extraction is performed on an image of an object to be reconstructed to obtain a feature vector representing the shape feature information of the object to be reconstructed; the feature vector is then combined with a preset template for the object to be reconstructed to generate a feature map; finally, the feature map is input into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed. Because the feature vector is combined with the preset template, the finally generated feature map carries, in addition to the shape features of the object to be reconstructed, the three-dimensional structure information conveyed by the preset template. The trained graph convolutional neural network can therefore process the feature map better, the obtained three-dimensional reconstruction result is more accurate, and the stability of three-dimensional reconstruction is guaranteed.
Corresponding to the three-dimensional reconstruction method provided above, an embodiment of the present application further provides an electronic device. Referring to fig. 9, the electronic device 9 in the embodiment of the present application includes: a memory 901, one or more processors 902 (only one is shown in fig. 9), and a computer program stored in the memory 901 and executable on the processors. The memory 901 is used for storing software programs and units, and the processor 902 executes various functional applications and data processing by running the software programs and units stored in the memory 901, so as to acquire resources corresponding to the preset events. Specifically, the processor 902 implements the following steps by executing the above computer program stored in the memory 901:
extracting features of an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used for representing shape feature information of the object to be reconstructed;
generating a feature map according to the feature vector and a preset template for the object to be reconstructed, wherein the preset template is used for representing three-dimensional structure information of the object to be reconstructed;
and inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
Assuming that the above is the first possible implementation manner, in a second possible implementation manner provided on the basis of the first possible implementation manner, the graph convolution neural network includes N functional modules connected in series;
wherein the input of the 1st functional module is the input of the graph convolutional neural network, the output of the Nth functional module is the output of the graph convolutional neural network, and N is an integer greater than 2;
the function module comprises a convolution unit, a normalization unit and an activation function unit.
In a third possible implementation manner based on the second possible implementation manner, the 1st functional module includes at least three specified structures, the at least three specified structures are sequentially connected in series, and each specified structure includes the convolution unit, the normalization unit and the activation function unit, which are sequentially connected in series;

in the at least three specified structures: the input of the first specified structure is the input of the 1st functional module, and the residual between the output of the last specified structure and the input of the 1st functional module is the output of the 1st functional module.
In a fourth possible implementation manner provided on the basis of the second possible implementation manner, the ith functional module includes at least two specified structures, the at least two specified structures are sequentially connected in series, each specified structure includes the convolution unit, the normalization unit and the activation function unit, which are sequentially connected in series, and i is an integer greater than 1 and less than N;

in the at least two specified structures: the input of the first specified structure is the output of the (i-1)th functional module, and the residual between the output of the last specified structure and the output of the (i-1)th functional module is the output of the ith functional module.
In a fifth possible implementation manner based on the second possible implementation manner, the Nth functional module includes the convolution unit, the normalization unit, the activation function unit and the convolution unit connected in series in sequence;

in the Nth functional module: the input of the first convolution unit is the output of the (N-1)th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)th functional module is the output of the Nth functional module.
In a sixth possible implementation form, which is based on the second possible implementation form, or the third possible implementation form, or the fourth possible implementation form, or the fifth possible implementation form, the convolution unit is a chebyshev convolution unit.
In a seventh possible implementation manner provided on the basis of the first possible implementation manner, when the object to be reconstructed is a human body, the preset template is a human body grid map; generating a feature map according to the feature vector and a preset template for the object to be reconstructed, including:
constructing a graph structure with a preset format based on the human body grid graph, wherein the graph structure comprises vertex information of the human body grid graph;
and performing fusion splicing on the feature vector and the graph structure to obtain the feature graph.
In an eighth possible implementation provided on the basis of the seventh possible implementation, the processor 902 further implements the following steps when executing the computer program stored in the memory 901:
calculating the total loss of the graph convolution neural network based on grid loss, three-dimensional joint loss, surface normal loss and surface edge loss in the training process of the graph convolution neural network;
wherein, the grid loss is used for describing the position difference between the real human body grid and the predicted human body grid;
the three-dimensional joint loss is used for describing the position difference between the real three-dimensional joints of the human body and the predicted three-dimensional joints of the human body;
the surface normal loss is used for describing an angle difference between a normal vector of a triangular surface of a real human body mesh and a normal vector of a triangular surface of a predicted human body mesh;
the above surface edge loss is used to describe the length difference between the side lengths of the triangular faces of the real human body mesh and the side lengths of the triangular faces of the predicted human body mesh.
In a ninth possible implementation manner provided based on the first possible implementation manner, the extracting features of the image of the object to be reconstructed to obtain a feature vector includes:
segmenting the image based on the object to be reconstructed to obtain a local image;
adjusting the size of the local image to a preset size;
and performing feature extraction on the local image after the size adjustment by adopting an encoder of a convolutional neural network to obtain the feature vector.
It should be understood that in the embodiments of the present Application, the Processor 902 may be a Central Processing Unit (CPU), and the Processor may be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 901 may include both read-only memory and random access memory, and provides instructions and data to processor 902. Some or all of memory 901 may also include non-volatile random access memory. For example, the memory 901 may also store device class information.
As can be seen from the above, in the embodiment of the present application, feature extraction is performed on an image of an object to be reconstructed to obtain a feature vector representing the shape feature information of the object to be reconstructed; the feature vector is then combined with a preset template for the object to be reconstructed to generate a feature map; finally, the feature map is input into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed. Because the feature vector is combined with the preset template, the finally generated feature map carries, in addition to the shape features of the object to be reconstructed, the three-dimensional structure information conveyed by the preset template. The trained graph convolutional neural network can therefore process the feature map better, the obtained three-dimensional reconstruction result is more accurate, and the stability of three-dimensional reconstruction is guaranteed.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of external device software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules or units is only one logical functional division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, all or part of the flow in the methods of the embodiments described above may be realized by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program includes computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the contents of the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable storage media do not include electrical carrier signals or telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application, and are intended to be included within the protection scope of the present application.

Claims (11)

1. A method of three-dimensional reconstruction, comprising:
extracting features of an image of an object to be reconstructed to obtain a feature vector, wherein the feature vector is used for representing shape feature information of the object to be reconstructed;
generating a feature map according to the feature vector and a preset template aiming at the object to be reconstructed, wherein the preset template is used for representing three-dimensional structure information of the object to be reconstructed;
and inputting the feature map into a trained graph convolutional neural network to obtain a three-dimensional reconstruction result of the object to be reconstructed.
2. The three-dimensional reconstruction method according to claim 1, wherein the graph convolutional neural network comprises N functional modules connected in series;
wherein the input of the 1st functional module is the input of the graph convolutional neural network, the output of the Nth functional module is the output of the graph convolutional neural network, and N is an integer greater than 2;
each functional module comprises a convolution unit, a normalization unit and an activation function unit.
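As an illustration of this topology, the following minimal PyTorch sketch chains several such modules in series; nn.Linear stands in for the graph convolution unit, and the GroupNorm/ReLU choices and the layer widths are illustrative assumptions rather than details from the claim.

    import torch.nn as nn

    class FunctionalModule(nn.Module):
        """One functional module: convolution unit -> normalization unit ->
        activation function unit, connected in series."""

        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.conv = nn.Linear(in_dim, out_dim)  # stand-in for a graph convolution
            self.norm = nn.GroupNorm(1, out_dim)    # normalization unit
            self.act = nn.ReLU()                    # activation function unit

        def forward(self, x):                       # x: (V, in_dim) node features
            return self.act(self.norm(self.conv(x)))

    # N = 3 functional modules connected in series (the claim requires N > 2);
    # the widths 67 -> 128 -> 64 -> 3 are arbitrary example values.
    gcn = nn.Sequential(FunctionalModule(67, 128),
                        FunctionalModule(128, 64),
                        FunctionalModule(64, 3))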
3. The three-dimensional reconstruction method according to claim 2, wherein the 1st functional module comprises at least three specified structures connected in series, each specified structure comprising the convolution unit, the normalization unit and the activation function unit connected in series;
in the at least three specified structures: the input of the first specified structure is the input of the 1st functional module, and the residual between the output of the last specified structure and the input of the 1st functional module is the output of the 1st functional module.
4. The three-dimensional reconstruction method according to claim 2, wherein the ith functional module comprises at least two specified structures connected in series, each specified structure comprising the convolution unit, the normalization unit and the activation function unit connected in series, and i is an integer greater than 1 and less than N;
in the at least two specified structures: the input of the first specified structure is the output of the (i-1)th functional module, and the residual between the output of the last specified structure and the output of the (i-1)th functional module is the output of the ith functional module.
5. The three-dimensional reconstruction method according to claim 2, wherein the Nth functional module comprises the convolution unit, the normalization unit, the activation function unit and a second convolution unit connected in series;
in the Nth functional module: the input of the first convolution unit is the output of the (N-1)th functional module, and the residual between the output of the second convolution unit and the output of the (N-1)th functional module is the output of the Nth functional module.
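Read together, claims 3 to 5 describe a residual wiring of the functional modules. The sketch below keeps the illustrative layer choices from the previous sketch, interprets "the residual between A and B is the output" as a standard skip connection (a sum), and keeps all feature dimensions equal so that the skip is well-defined; all of these readings are assumptions on top of the translated claim language.

    import torch.nn as nn

    def specified_structure(dim):
        """One specified structure: convolution -> normalization -> activation."""
        return nn.Sequential(nn.Linear(dim, dim), nn.GroupNorm(1, dim), nn.ReLU())

    class FirstModule(nn.Module):
        """Claim 3: at least three specified structures plus a skip connection."""
        def __init__(self, dim, depth=3):
            super().__init__()
            self.body = nn.Sequential(*(specified_structure(dim) for _ in range(depth)))

        def forward(self, x):
            return self.body(x) + x  # 'residual' read as a skip connection

    class MiddleModule(nn.Module):
        """Claim 4: at least two specified structures plus a skip connection."""
        def __init__(self, dim, depth=2):
            super().__init__()
            self.body = nn.Sequential(*(specified_structure(dim) for _ in range(depth)))

        def forward(self, x):
            return self.body(x) + x

    class LastModule(nn.Module):
        """Claim 5: convolution -> normalization -> activation -> convolution,
        plus a skip connection from the previous module's output."""
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(dim, dim), nn.GroupNorm(1, dim),
                                      nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.body(x) + x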
6. The three-dimensional reconstruction method according to any one of claims 2 to 5, wherein the convolution unit is a Chebyshev convolution unit.
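A "Chebyshev convolution unit" ordinarily refers to the spectral graph convolution of Defferrard et al. (2016), which would replace the placeholder nn.Linear used above. A minimal sketch of that layer follows; the weight initialization and the polynomial order K are assumptions.

    import torch
    import torch.nn as nn

    class ChebConv(nn.Module):
        """Chebyshev graph convolution: y = sum_k T_k(L~) X W_k, where T_k are
        Chebyshev polynomials of the rescaled graph Laplacian L~."""

        def __init__(self, in_dim, out_dim, K=3):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(K, in_dim, out_dim) * 0.01)

        def forward(self, x, laplacian):
            # x: (V, in_dim); laplacian: (V, V), rescaled so its eigenvalues
            # lie in [-1, 1] (e.g. 2L/lambda_max - I).
            tx = [x, laplacian @ x]                         # T_0(L~)x and T_1(L~)x
            for _ in range(2, self.weight.shape[0]):
                tx.append(2 * laplacian @ tx[-1] - tx[-2])  # Chebyshev recurrence
            return sum(t @ w for t, w in zip(tx, self.weight))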
7. The three-dimensional reconstruction method according to claim 1, wherein when the object to be reconstructed is a human body, the preset template is a human body mesh graph, and generating the feature map according to the feature vector and the preset template for the object to be reconstructed comprises:
constructing a graph structure in a preset format based on the human body mesh graph, wherein the graph structure comprises vertex information of the human body mesh graph;
and fusing and splicing the feature vector with the graph structure to obtain the feature map.
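A minimal sketch of this construction follows, assuming the graph structure is an adjacency matrix built from the mesh triangles and that "fusing and splicing" means tiling the feature vector and concatenating it with the vertex information; both choices, and all names, are assumptions.

    import torch

    def build_feature_map(template_vertices, faces, feature):
        """template_vertices : (V, 3) human body mesh vertex positions
        faces             : (F, 3) long tensor of triangle vertex indices
        feature           : (C,) shape feature vector extracted from the image
        """
        num_vertices = template_vertices.shape[0]

        # Graph structure: a symmetric adjacency matrix built from the triangles.
        adjacency = torch.zeros(num_vertices, num_vertices)
        for a, b in ((0, 1), (1, 2), (2, 0)):
            adjacency[faces[:, a], faces[:, b]] = 1.0
            adjacency[faces[:, b], faces[:, a]] = 1.0

        # Fusion splicing: tile the feature vector across vertices and
        # concatenate it with the vertex information.
        tiled = feature.unsqueeze(0).expand(num_vertices, -1)       # (V, C)
        feature_map = torch.cat([template_vertices, tiled], dim=1)  # (V, 3 + C)
        return feature_map, adjacency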
8. The three-dimensional reconstruction method according to claim 7, wherein during training of the graph convolutional neural network, a total loss of the graph convolutional neural network is calculated based on a mesh loss, a three-dimensional joint loss, a surface normal loss and a surface edge loss;
wherein the mesh loss is used to describe a position difference between a real human body mesh and a predicted human body mesh;
the three-dimensional joint loss is used to describe a position difference between real three-dimensional joints of the human body and predicted three-dimensional joints of the human body;
the surface normal loss is used to describe an angle difference between the normal vectors of the triangular faces of the real human body mesh and those of the predicted human body mesh;
the surface edge loss is used to describe a length difference between the edge lengths of the triangular faces of the real human body mesh and those of the predicted human body mesh.
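One plausible realization of these four terms is sketched below; the claim names the terms but not their formulas, so the L1 distances, the cosine form of the normal loss, the joint regressor and the weights are all assumptions.

    import torch
    import torch.nn.functional as F

    def total_loss(pred_v, gt_v, faces, joint_reg, gt_joints,
                   weights=(1.0, 1.0, 0.1, 0.1)):
        """pred_v, gt_v : (V, 3) predicted / real human body mesh vertices
        faces        : (F, 3) triangle vertex indices
        joint_reg    : (J, V) matrix regressing 3D joints from mesh vertices
        gt_joints    : (J, 3) real three-dimensional joints
        """
        # Mesh loss: per-vertex position difference.
        l_mesh = F.l1_loss(pred_v, gt_v)

        # Three-dimensional joint loss: difference between regressed and
        # real joint positions.
        l_joint = F.l1_loss(joint_reg @ pred_v, gt_joints)

        # Surface normal loss: angle difference between triangle normals,
        # written as 1 - cosine similarity.
        def normals(v):
            e1 = v[faces[:, 1]] - v[faces[:, 0]]
            e2 = v[faces[:, 2]] - v[faces[:, 0]]
            return F.normalize(torch.cross(e1, e2, dim=1), dim=1)
        l_normal = (1 - (normals(pred_v) * normals(gt_v)).sum(dim=1)).mean()

        # Surface edge loss: difference between triangle edge lengths.
        def edge_lengths(v):
            return torch.stack([(v[faces[:, i]] - v[faces[:, j]]).norm(dim=1)
                                for i, j in ((0, 1), (1, 2), (2, 0))])
        l_edge = F.l1_loss(edge_lengths(pred_v), edge_lengths(gt_v))

        w = weights
        return w[0] * l_mesh + w[1] * l_joint + w[2] * l_normal + w[3] * l_edge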
9. The three-dimensional reconstruction method of claim 1, wherein said extracting features from the image of the object to be reconstructed to obtain a feature vector comprises:
segmenting the image based on the object to be reconstructed to obtain a local image;
adjusting the size of the local image to a preset size;
and performing feature extraction on the resized local image by using an encoder of a convolutional neural network to obtain the feature vector.
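A minimal sketch of these three steps follows; the rectangular crop used as the segmentation step, the 224x224 preset size and the ResNet-50 backbone are assumptions chosen for illustration.

    import torch
    import torch.nn.functional as F
    import torchvision

    def extract_feature(image, box, size=224):
        """image : (3, H, W) float tensor
        box   : (x0, y0, x1, y1) integer pixel region containing the object
        """
        # Step 1: segment the object to be reconstructed to get a local image.
        x0, y0, x1, y1 = box
        local = image[:, y0:y1, x0:x1]

        # Step 2: resize the local image to the preset size.
        local = F.interpolate(local.unsqueeze(0), size=(size, size),
                              mode='bilinear', align_corners=False)

        # Step 3: encode with a CNN whose classifier head has been removed.
        backbone = torchvision.models.resnet50(weights=None)
        encoder = torch.nn.Sequential(*list(backbone.children())[:-1])
        with torch.no_grad():
            feature = encoder(local).flatten(1)  # (1, 2048) feature vector
        return feature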
10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 9 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 9.
CN202110951222.1A 2021-08-18 2021-08-18 Three-dimensional reconstruction method and device, electronic equipment and readable storage medium Pending CN113781659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951222.1A CN113781659A (en) 2021-08-18 2021-08-18 Three-dimensional reconstruction method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113781659A true CN113781659A (en) 2021-12-10

Family

ID=78838232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951222.1A Pending CN113781659A (en) 2021-08-18 2021-08-18 Three-dimensional reconstruction method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113781659A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797726A (en) * 2023-05-20 2023-09-22 北京大学 Organ three-dimensional reconstruction method, device, electronic equipment and storage medium
CN116797726B (en) * 2023-05-20 2024-05-07 北京大学 Organ three-dimensional reconstruction method, device, electronic equipment and storage medium
CN116363320A (en) * 2023-06-01 2023-06-30 摩尔线程智能科技(北京)有限责任公司 Training of reconstruction model and three-dimensional model reconstruction method, device, equipment and medium
CN116363320B (en) * 2023-06-01 2023-08-25 摩尔线程智能科技(北京)有限责任公司 Training of reconstruction model and three-dimensional model reconstruction method, device, equipment and medium
CN116993926A (en) * 2023-09-26 2023-11-03 北京渲光科技有限公司 Single-view human body three-dimensional reconstruction method
CN116993926B (en) * 2023-09-26 2024-01-16 北京渲光科技有限公司 Single-view human body three-dimensional reconstruction method
CN117496092A (en) * 2023-12-29 2024-02-02 先临三维科技股份有限公司 Three-dimensional scanning reconstruction method, device, equipment and storage medium
CN117496092B (en) * 2023-12-29 2024-04-19 先临三维科技股份有限公司 Three-dimensional scanning reconstruction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination