CN111243096A - Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network - Google Patents

Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network

Info

Publication number
CN111243096A
Authority
CN
China
Prior art keywords
mesh
convolution
encoder
downsampling
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010039201.8A
Other languages
Chinese (zh)
Inventor
李坤
刘幸子
袁存款
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010039201.8A priority Critical patent/CN111243096A/en
Publication of CN111243096A publication Critical patent/CN111243096A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 Indexing scheme for image data processing or generation, in general
    • G06T 2200/04 Indexing scheme for image data processing or generation, in general involving 3D image data

Abstract

The invention belongs to the fields of computer vision and computer graphics. It aims to effectively represent three-dimensional face data as high-dimensional features and to reconstruct faces from those features, using an autoencoder that combines mesh convolution with efficient network up-sampling and down-sampling. The method is mainly applied to three-dimensional face representation and reconstruction.

Description

Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network
Technical Field
The invention belongs to the fields of computer vision and computer graphics, and in particular relates to a method for representing and reconstructing a 3D face using deep learning.
Background
Three-dimensional face reconstruction has long been one of the classic tasks in computer vision research. Three-dimensional face models are applied in many different fields, such as cultural heritage protection, architecture, and medicine. The difficulty of three-dimensional face reconstruction lies in the following aspects. First, people can freely make exaggerated expressions that cause significant facial deformation; second, different faces usually differ greatly in shape owing to factors such as age, gender, and ethnicity. Such nonlinear deformations are difficult to represent and reconstruct efficiently.
Most conventional three-dimensional face reconstruction techniques use linear transformations (Tewari, A., Zollhöfer, M., Kim, H., Garrido, P., Bernard, F., Pérez, P., Theobalt, C.: MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: International Conference on Computer Vision (2017)) or higher-order tensor generalizations (Brunton, A., Bolkart, T., Wuhrer, S.: Multilinear wavelets: A statistical shape space for human faces. In: European Conference on Computer Vision, pp. 297-312 (2014)). These models are linear in nature and have difficulty capturing the nonlinear deformations caused by extreme expressions, so the reconstructed results often lack realism. The recent CoMA model (Ranjan, A., Bolkart, T., Sanyal, S., et al.: Generating 3D Faces Using Convolutional Mesh Autoencoders, 2018) introduces a spectral-convolution mesh autoencoder composed of network up-sampling and down-sampling layers; compared with traditional methods, it improves the running speed to a certain extent and greatly improves the accuracy of the reconstruction results. In summary, representing and reconstructing three-dimensional faces efficiently and accurately is important, and this technology can provide technical support for applications such as face recognition and identity verification.
Disclosure of Invention
To enable effective high-dimensional feature representation of three-dimensional face data and efficient reconstruction from those high-dimensional features,
the invention adopts the following technical scheme: an edge-constrained spectral convolutional neural network is designed to learn three-dimensional face representation and reconstruction, the hidden space of the autoencoder is expanded to 128 dimensions, and a three-dimensional face reconstruction method based on the edge-constrained spectral convolutional neural network, B-Edge, is proposed. An autoencoder augmented with edge constraints and batch normalization (Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, JMLR.org (2015)) can combine mesh convolution with efficient network up-sampling and down-sampling. The invention is realized by the following technical scheme:
a three-dimensional face representation and reconstruction method based on an edge-constrained spectrum convolution neural network comprises the steps of inputting a Mesh of a face grid into a variational encoder based on a graph convolution structure, wherein the variational encoder comprises an encoder and a decoder, obtaining high-dimensional features coded by the Mesh through the encoder, and analyzing the high-dimensional features into original Mesh by using the decoder, so that three-dimensional face representation and reconstruction are realized.
First, the mesh is sampled using the down-sampling and up-sampling algorithms, and the sampled data are processed by the encoder. The encoder consists of four Chebyshev convolution filters based on Chebyshev polynomials, each followed by a biased linear rectification function (ReLU); the convolution filters adopt the spectral convolution algorithm. Normal-distribution variational processing is applied to the data processed by the encoder to generate new face model data. Next, the new face model data are processed by the decoder, which consists of fully connected layers mapped from R^8 to the R^128 hidden space through a vector transformation, while the mesh is reconstructed by up-sampling; a biased ReLU is added after each convolution. Finally, an edge constraint and a mean square error (MSE) constraint are applied to the output result. The decoder first uses a fully connected layer to map the 128-dimensional features to the mesh space, followed by four graph convolution layers; each graph convolution layer uses batch normalization and a ReLU activation function, except the last layer, which uses neither.
The mesh data are processed using a dynamic filtering convolution, and the spectral convolution is defined as

y_j = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L) x_i

Formula analysis: the input x ∈ R^{n×F_in} has F_in features; the input surface mesh has F_in = 3 features corresponding to its 3D vertex positions. y_j computes the j-th feature of y ∈ R^{n×F_out}. g denotes the filter, L is the graph Laplacian, and θ_{i,j} ∈ R^K are trainable parameters; each convolution layer has F_in × F_out Chebyshev coefficient vectors.
In the mesh down-sampling and up-sampling algorithms, a hierarchical multi-scale representation of the mesh is used. This representation allows the convolution kernels to capture local context in the shallow layers and global context in the deeper layers of the network. A mesh sampling operator is introduced to define the down-sampling and up-sampling of mesh features in the neural network.
The features of a mesh with n vertices are represented by an n × F tensor, where F is the feature dimension of each vertex; applying convolutions to the mesh produces features with different dimensions. The mesh sampling operations define a new topology at each layer while maintaining context at the neighboring vertices. First, a transformation matrix Q_d ∈ {0,1}^{n×m} is used to down-sample a mesh with m vertices within the network, and Q_u ∈ R^{m×n} is used for up-sampling, where m > n. Down-sampling is obtained by iteratively contracting vertex pairs while maintaining the surface approximation error using quadric matrices. For any p, the vertices with Q_d(p, q) = 1 are retained and restored during up-sampling, while the vertices with Q_d(p, q) = 0 are discarded during down-sampling and mapped onto the down-sampled mesh surface using barycentric coordinates. The down-sampling algorithm iteratively contracts vertex pairs using quadric matrices, and up-sampling is the inverse of down-sampling: the vertices v_q discarded during down-sampling are added back to the down-sampled mesh. Let ṽ_q denote the projection of v_q onto the closest triangle (i, j, k) of the down-sampled mesh, and compute the barycentric coordinates

ṽ_q = w_i v_i + w_j v_j + w_k v_k, with v_i, v_j, v_k ∈ V_d and w_i + w_j + w_k = 1;

the weights are then updated as Q_u(q, i) = w_i, Q_u(q, j) = w_j, and Q_u(q, k) = w_k.
Compared with the prior art, the invention has the following technical characteristics and effects:
Working directly on three-dimensional meshes, and compared with traditional reconstruction methods, the proposed method has the following characteristics:
1. The model can accurately represent a three-dimensional face in a low-dimensional hidden space, and performs better than the commonly used 3DMM model and the recent CoMA model (Ranjan, A., Bolkart, T., Sanyal, S., et al.: Generating 3D Faces Using Convolutional Mesh Autoencoders, 2018).
2. The spectral convolution autoencoder uses few parameters, its hidden space has 128 dimensions, and its reconstruction accuracy is high.
3. The method adds batch normalization and an edge constraint to the network, which improves reconstruction accuracy and reduces error.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic diagram of a network structure model according to an embodiment of the present invention.
FIG. 2 shows a comparison of reconstruction results on the CoMA dataset (Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black, "Generating 3D faces using convolutional mesh autoencoders," in ECCV. Springer, 2018, pp. 725-741) with other methods. From top to bottom are the reconstruction results of the method of Anurag et al., the reconstruction results estimated by the present invention, the error visualization of the method of Anurag et al., and the error visualization of the present invention.
Fig. 3 is a schematic diagram of the variational generation results of the present invention; all faces shown are meshes generated randomly by the network.
Detailed Description
In order to effectively express the high-dimensional features of 3D face data and to reconstruct from these high-dimensional features, the invention adopts the technical scheme of designing a variational autoencoder neural network to learn the high-dimensional representation and reconstruction of 3D faces. The generative capability of the variational autoencoder is also exploited to generate more varieties of 3D face data. Specifically, the method essentially comprises the following steps:
1) Design of the variational autoencoder. The variational autoencoder mainly comprises an encoder and a decoder. A variational autoencoder framework based on a graph convolution structure is designed: as long as a mesh is input, the high-dimensional features encoding the mesh are obtained through the encoder (the first component), and the decoder (the second component) then decodes these high-dimensional features back into the original mesh.
The deep convolutional network comprises three parts, namely the spectral convolution algorithm, the mesh down-sampling and up-sampling algorithms, and the network structure, described as follows:
1-1) Spectral convolution algorithm. Conventional convolutional neural networks cannot handle irregular graph data such as meshes, so a dynamic filtering convolution is used to process the mesh data. It learns a mapping from a neighborhood to filter weights and takes the intrinsic properties of the mesh into account. The spectral convolution is defined as

y_j = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L) x_i

Formula analysis: the input x ∈ R^{n×F_in} has F_in features; the input surface mesh has F_in = 3 features corresponding to its 3D vertex positions. y_j computes the j-th feature of y ∈ R^{n×F_out}. g denotes the filter and L is the graph Laplacian operator. Moreover, θ_{i,j} ∈ R^K are trainable parameters, and each convolution layer has F_in × F_out Chebyshev coefficient vectors.
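As a concrete illustration of the formula above, the following NumPy sketch implements a K-order Chebyshev spectral convolution. It is a minimal sketch: the function name, the toy graph, and the Laplacian rescaling are assumptions made for the example, not the invention's actual implementation.

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_conv(x, L, theta):
    """Spectral convolution y_j = sum_i g_{theta_{i,j}}(L) x_i, with the filter g
    expanded in K Chebyshev polynomials of the rescaled graph Laplacian L.

    x:     (n, F_in) vertex features; F_in = 3 when the input is raw 3D vertex positions
    L:     (n, n) sparse Laplacian, rescaled so its eigenvalues lie roughly in [-1, 1]
    theta: (K, F_in, F_out) trainable Chebyshev coefficients
    """
    K, F_in, F_out = theta.shape
    n = x.shape[0]
    # Chebyshev recurrence: T_0(L)x = x, T_1(L)x = Lx, T_k = 2*L*T_{k-1} - T_{k-2}
    Tx = [x, L @ x]
    for k in range(2, K):
        Tx.append(2 * (L @ Tx[k - 1]) - Tx[k - 2])
    # Sum the filtered features over polynomial orders: y = sum_k T_k(L) x theta_k
    y = np.zeros((n, F_out))
    for k in range(K):
        y += Tx[k] @ theta[k]          # (n, F_in) @ (F_in, F_out)
    return y

# Toy usage on a hypothetical 4-vertex graph, only to illustrate the shapes:
A = sp.csr_matrix(np.array([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]))
d = np.asarray(A.sum(axis=1)).ravel()
L_norm = sp.eye(4) - sp.diags(1.0 / np.sqrt(d)) @ A @ sp.diags(1.0 / np.sqrt(d))
y = chebyshev_conv(np.random.randn(4, 3), L_norm - sp.eye(4), np.random.randn(6, 3, 16))
print(y.shape)   # (4, 16)
```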
1-2) Mesh down-sampling and up-sampling algorithms. A stable sampling algorithm improves the feature extraction capability of the network and thus effectively improves the performance of the whole network. Unlike conventional image sampling algorithms, in order to capture both global and local context, a hierarchical multi-scale representation of the mesh is used here; it allows the convolution kernels to capture local context in the shallow layers and global context in the deeper layers of the network. To realize this representation, a mesh sampling operator is introduced, which defines the down-sampling and up-sampling of mesh features in the neural network. The features of a mesh with n vertices can be represented by an n × F tensor, where F is the feature dimension of each vertex (e.g., a 3D mesh is represented with F = 3), so applying convolutions to the mesh can produce features with different dimensions. The mesh sampling operations define a new topology at each layer and maintain context at the neighboring vertices. In the network, a transformation matrix Q_d ∈ {0,1}^{n×m} is first used to down-sample a mesh with m vertices within the network, and Q_u ∈ R^{m×n} is used for up-sampling (where m > n). Down-sampling is obtained by iteratively contracting vertex pairs while maintaining the surface approximation error using quadric matrices. For any p, the vertices with Q_d(p, q) = 1 are retained and restored during up-sampling; the vertices with Q_d(p, q) = 0 are discarded during down-sampling and mapped onto the down-sampled mesh surface using barycentric coordinates. The down-sampling algorithm iteratively contracts vertex pairs using quadric matrices. Up-sampling is the inverse of down-sampling: the vertices v_q discarded during down-sampling are added back to the down-sampled mesh. Let ṽ_q denote the projection of v_q onto the closest triangle (i, j, k) of the down-sampled mesh, and compute the barycentric coordinates

ṽ_q = w_i v_i + w_j v_j + w_k v_k, with v_i, v_j, v_k ∈ V_d and w_i + w_j + w_k = 1;

the weights are then updated as Q_u(q, i) = w_i, Q_u(q, j) = w_j, and Q_u(q, k) = w_k.
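The sampling step can be sketched as follows: Q_d and Q_u act on vertex features as plain matrix products, and the barycentric weights of a discarded vertex fill one row of Q_u. The function names and the least-squares solve are illustrative assumptions; the quadric-error contraction that actually chooses which vertices to keep is not shown.

```python
import numpy as np

def downsample_features(V, Qd):
    """Down-sample vertex features V (m, F) to the coarse mesh: Qd is the binary (n, m) matrix."""
    return Qd @ V

def upsample_features(Vd, Qu):
    """Up-sample coarse features Vd (n, F) back to the fine mesh with Qu (m, n)."""
    return Qu @ Vd

def barycentric_weights(vq, tri, Vd):
    """Barycentric coordinates of a discarded vertex vq with respect to its closest
    triangle (i, j, k) on the down-sampled mesh; they become one row of Qu:
    Qu(q, i) = wi, Qu(q, j) = wj, Qu(q, k) = wk."""
    i, j, k = tri
    vi, vj, vk = Vd[i], Vd[j], Vd[k]
    # Solve vq ~ wi*vi + wj*vj + wk*vk subject to wi + wj + wk = 1 (least squares).
    A = np.stack([vi - vk, vj - vk], axis=1)               # (3, 2)
    wi, wj = np.linalg.lstsq(A, vq - vk, rcond=None)[0]
    return wi, wj, 1.0 - wi - wj
```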
1-3) Network structure. First, the mesh is sampled using the down-sampling and up-sampling algorithms, and the sampled data are processed by the encoder. The encoder consists of four Chebyshev convolution filters based on Chebyshev polynomials; each convolution filter is followed by a biased ReLU, the convolution filters adopt the spectral convolution algorithm, and each down-sampling layer reduces the number of mesh vertices by a factor of approximately 4. Normal-distribution variational processing is applied to the data processed by the encoder to generate new face model data. Next, the data are processed by the decoder, which consists of fully connected layers mapped from R^8 to the R^128 hidden space through a vector transformation, while the mesh is reconstructed by up-sampling. Each convolution is followed by a biased ReLU, as in the encoder network, and each up-sampling layer increases the number of mesh vertices by a factor of approximately 4. Finally, an edge constraint and an MSE (mean square error) constraint are applied to the output result. The encoder's per-layer feature dimensions are 1256 × 16, 314 × 32, 79 × 64, and 20 × 128, and its last layer maps the features to a 128-dimensional latent space. The decoder first uses a fully connected layer to map the 128-dimensional features back to the mesh space, followed by four graph convolution layers, each using batch normalization and a ReLU activation function; the whole decoder is effectively the inverse of the encoder. The output feature dimensions of its layers are 79 × 128, 314 × 64, 1256 × 32, and 5023 × 16. Note that the last layer of the network uses neither batch normalization nor an activation function. The 128-dimensional features generated by the encoder undergo variational processing against Gaussian-distributed data, so that the data generated by the encoder lie as close as possible to the Gaussian distribution space.
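The data flow of this section can be summarized in the structural sketch below, reusing the chebyshev_conv helper sketched earlier. All weight shapes, the Chebyshev order, bias terms, and batch normalization are omitted or assumed; the sketch only mirrors the described sequence of four convolution/down-sampling stages, the 128-dimensional latent space, and four up-sampling/convolution stages.

```python
import numpy as np

LATENT_DIM = 128   # hidden-space dimension stated in the description

def encode(x, laplacians, Qd_list, thetas, W_mu, W_std):
    """Encoder sketch: four Chebyshev convolutions (16, 32, 64, 128 channels), each
    followed by a ReLU and a ~4x down-sampling (5023 -> 1256 -> 314 -> 79 -> 20
    vertices), then a fully connected layer giving the 128-d latent mean and std."""
    for L, Qd, theta in zip(laplacians, Qd_list, thetas):
        x = np.maximum(chebyshev_conv(x, L, theta), 0.0)   # graph convolution + ReLU
        x = Qd @ x                                         # roughly 4x fewer vertices
    h = x.reshape(-1)                                      # 20 x 128 -> 2560
    return h @ W_mu, h @ W_std                             # latent mean and std

def decode(z, W_fc, laplacians, Qu_list, thetas):
    """Decoder sketch: a fully connected layer maps the latent code back onto the
    coarsest mesh, followed by four up-sampling + graph-convolution stages; the last
    stage uses no activation (batch normalization is omitted throughout)."""
    x = (z @ W_fc).reshape(-1, LATENT_DIM)                 # back to the 20-vertex mesh
    for idx, (L, Qu, theta) in enumerate(zip(laplacians, Qu_list, thetas)):
        x = Qu @ x                                         # roughly 4x more vertices
        x = chebyshev_conv(x, L, theta)
        if idx < len(thetas) - 1:                          # last layer: no activation
            x = np.maximum(x, 0.0)
    return x   # a final mapping back to 3-D vertex coordinates would follow here
```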
2) Loss function. To avoid extreme cases such as local model collapse, local overlap, and excessive distortion, the shape of the mesh is regularized by applying an edge constraint while minimizing the mean square error (MSE); the following sum of squared differences over the vertices is minimized:

Σ_{(i,j)} || e_ij − e'_ij ||^2

Formula analysis: V_i and V_j denote any two vertices of the mesh, and e_ij = V_i − V_j denotes the corresponding edge on the training (reconstruction) result; v'_i and v'_j denote the corresponding ground-truth vertices, and e'_ij = v'_i − v'_j denotes the corresponding edge on the ground truth.
3) Training. Although a three-dimensional face can be sampled from the convolutional mesh autoencoder, the distribution of the hidden space is unknown, so sampling requires encoding a mesh into that space. In order to sample effectively from a Gaussian distribution and generate random face data, variational processing is applied to the output of the encoder, and the result is used as the input of the decoder. The variational experiment mainly adds a KL (relative entropy) divergence loss to the loss function, so that the probability distribution of the 128 numbers output by the encoder stays as close as possible to a Gaussian distribution; the trained network can then be used on its own to generate random faces directly by sampling from the Gaussian distribution. A self-growing weight design is used when training the variational model: the weight of the KL variational loss increases continuously as the number of training iterations grows, so that the best training effect can be achieved. It should be noted that the output of the encoder is two sets of 128-dimensional numbers, denoted mean and std respectively, which are used to build the KL divergence loss and to train the entire network. Here the encoder is denoted by E and the decoder by D. The dataset is first encoded into the hidden space to obtain the features z = E(F); each component of the latent vector is then varied to obtain a new latent vector z̃, which is transformed into a reconstructed mesh by the decoder, F̃ = D(z̃).
The dimension of the latent space is set to 128, the training meshes have 5023 vertices, and the 1-ring neighborhood vertex set is used. The whole dataset is trained in the network for 200 epochs, and the Adam optimization algorithm is used as the optimizer. The network is trained with the TensorFlow deep learning framework (an open-source software library that uses dataflow graphs for numerical computation), running on an NVIDIA GTX 1080Ti graphics card.
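The variational pieces of the training procedure can be sketched as below; the linear ramp of the KL weight, the function names, and the use of NumPy instead of the TensorFlow training code are assumptions made for illustration only.

```python
import numpy as np

def kl_divergence(mean, std):
    """KL( N(mean, std^2) || N(0, I) ), summed over the 128 latent dimensions."""
    return -0.5 * np.sum(1.0 + np.log(std ** 2 + 1e-8) - mean ** 2 - std ** 2)

def kl_weight(epoch, total_epochs=200, max_weight=1.0):
    """Self-growing KL weight: grows with the number of completed epochs
    (a linear ramp is assumed; the exact schedule is not specified)."""
    return max_weight * min(1.0, epoch / float(total_epochs))

def sample_latent(mean, std):
    """Reparameterization: z = mean + std * eps with eps ~ N(0, I), keeping the
    sampling step differentiable when implemented in an autodiff framework."""
    return mean + std * np.random.randn(*np.shape(mean))

# Conceptual per-sample objective, using the sketches above:
#   mean, std = encode(F, ...);  z = sample_latent(mean, std)
#   loss = reconstruction_loss(decode(z, ...), F, edges) + kl_weight(epoch) * kl_divergence(mean, std)
```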

Claims (5)

1. A three-dimensional face representation and reconstruction method based on an edge-constrained spectral convolutional neural network, characterized in that a face mesh (Mesh) is input into a variational autoencoder based on a graph convolution structure, the variational autoencoder comprising an encoder and a decoder; the high-dimensional features encoding the mesh are obtained through the encoder, and the decoder then decodes the high-dimensional features back into the original mesh, thereby realizing three-dimensional face representation and reconstruction.
2. The three-dimensional face representation and reconstruction method based on the edge-constrained spectral convolutional neural network according to claim 1, characterized in that the mesh is first sampled using the down-sampling and up-sampling algorithms, and the sampled data are processed by the encoder; the encoder consists of four Chebyshev convolution filters based on Chebyshev polynomials, each followed by a biased linear rectification function (ReLU), and the convolution filters adopt the spectral convolution algorithm; normal-distribution variational processing is applied to the data processed by the encoder to generate new face model data; next, the new face model data are processed by the decoder, which consists of fully connected layers mapped from R^8 to the R^128 hidden space through a vector transformation, while the mesh is reconstructed by up-sampling, with a biased ReLU added after each convolution; finally, an edge constraint and a mean square error (MSE) constraint are applied to the output result, wherein the decoder first uses a fully connected layer to map the 128-dimensional features to the mesh space, followed by four graph convolution layers, each using batch normalization and a ReLU activation function, except the last layer, which uses neither.
3. The three-dimensional face representation and reconstruction method based on the edge-constrained spectral convolutional neural network according to claim 1 or 2, characterized in that a dynamic filtering convolution is used to process the mesh data, and the spectral convolution is defined as

y_j = Σ_{i=1}^{F_in} g_{θ_{i,j}}(L) x_i

wherein the input x ∈ R^{n×F_in} has F_in features, the input surface mesh has F_in = 3 features corresponding to its 3D vertex positions, y_j computes the j-th feature of y ∈ R^{n×F_out}, g denotes the filter, L is the graph Laplacian, and θ_{i,j} ∈ R^K are trainable parameters, each convolution layer having F_in × F_out Chebyshev coefficient vectors.
4. The method as claimed in claim 1, characterized in that a hierarchical multi-scale representation of the mesh is used in the mesh down-sampling and up-sampling algorithms, which allows the convolution kernels to capture local context in the shallow layers and global context in the deeper layers of the network, and a mesh sampling operator is introduced, which is used to define the down-sampling and up-sampling of mesh features in the neural network.
5. The three-dimensional face representation and reconstruction method based on the edge-constrained spectral convolutional neural network according to claim 1 or 4, characterized in that the features of a mesh with n vertices are represented by an n × F tensor, where F is the feature dimension of each vertex, and applying convolutions to the mesh produces features with different dimensions; the mesh sampling operations define a new topology at each layer and maintain context at the neighboring vertices; first, a transformation matrix Q_d ∈ {0,1}^{n×m} is used to down-sample a mesh with m vertices within the network, and Q_u ∈ R^{m×n} is used for up-sampling, where m > n; down-sampling is obtained by iteratively contracting vertex pairs while maintaining the surface approximation error using quadric matrices; for any p, the vertices with Q_d(p, q) = 1 are retained and restored during up-sampling, while the vertices with Q_d(p, q) = 0 are discarded during down-sampling and mapped onto the down-sampled mesh surface using barycentric coordinates; the down-sampling algorithm iteratively contracts vertex pairs using quadric matrices, and up-sampling is the inverse of down-sampling: the vertices v_q discarded during down-sampling are added back to the down-sampled mesh; let ṽ_q denote the projection of v_q onto the closest triangle (i, j, k) of the down-sampled mesh, and the barycentric coordinates are computed as ṽ_q = w_i v_i + w_j v_j + w_k v_k, with v_i, v_j, v_k ∈ V_d and w_i + w_j + w_k = 1; the weights are then updated as Q_u(q, i) = w_i, Q_u(q, j) = w_j, and Q_u(q, k) = w_k.
CN202010039201.8A 2020-01-14 2020-01-14 Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network Pending CN111243096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010039201.8A CN111243096A (en) 2020-01-14 2020-01-14 Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010039201.8A CN111243096A (en) 2020-01-14 2020-01-14 Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network

Publications (1)

Publication Number Publication Date
CN111243096A true CN111243096A (en) 2020-06-05

Family

ID=70871112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010039201.8A Pending CN111243096A (en) 2020-01-14 2020-01-14 Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network

Country Status (1)

Country Link
CN (1) CN111243096A (en)

Citations (4)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190312898A1 (en) * 2018-04-10 2019-10-10 Cisco Technology, Inc. SPATIO-TEMPORAL ANOMALY DETECTION IN COMPUTER NETWORKS USING GRAPH CONVOLUTIONAL RECURRENT NEURAL NETWORKS (GCRNNs)
CN109977232A (en) * 2019-03-06 2019-07-05 中南大学 A kind of figure neural network visual analysis method for leading figure based on power
CN110288697A (en) * 2019-06-24 2019-09-27 天津大学 3D face representation and method for reconstructing based on multiple dimensioned figure convolutional neural networks
CN110443892A (en) * 2019-07-25 2019-11-12 北京大学 A kind of three-dimensional grid model generation method and device based on single image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANURAG RANJAN ET AL.: "Generating 3D faces using Convolutional Mesh Autoencoders", arXiv:1807.10267v3 [cs.CV] *
PEDRO CASTRO ET AL.: "Accurate 6D Object Pose Estimation by Pose Conditioned Mesh Reconstruction", arXiv:1910.10653v1 [cs.CV] *

Similar Documents

Publication Publication Date Title
Roveri et al. Pointpronets: Consolidation of point clouds with convolutional neural networks
Jiang et al. Single image super-resolution via locally regularized anchored neighborhood regression and nonlocal means
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
Kim et al. Iterative kernel principal component analysis for image modeling
Pistilli et al. Learning robust graph-convolutional representations for point cloud denoising
CN113450396B (en) Three-dimensional/two-dimensional image registration method and device based on bone characteristics
CN112634149B (en) Point cloud denoising method based on graph convolution network
Wang et al. Patch diffusion: Faster and more data-efficient training of diffusion models
Remil et al. Surface reconstruction with data-driven exemplar priors
CN110827408B (en) Real-time three-dimensional reconstruction method based on depth sensor
CN116739899A (en) Image super-resolution reconstruction method based on SAUGAN network
Shi et al. Exploiting multi-scale parallel self-attention and local variation via dual-branch transformer-cnn structure for face super-resolution
Liu et al. Edge-guided depth image super-resolution based on KSVD
CN116188272B (en) Two-stage depth network image super-resolution reconstruction method suitable for multiple fuzzy cores
Yu et al. MagConv: Mask-guided convolution for image inpainting
Xu et al. Depth map super-resolution via joint local gradient and nonlocal structural regularizations
Zhao et al. NormalNet: Learning-based normal filtering for mesh denoising
Shao et al. SRWGANTV: image super-resolution through wasserstein generative adversarial networks with total variational regularization
CN111243096A (en) Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network
CN115471611A (en) Method for improving visual effect of 3DMM face model
Yang et al. Depth image upsampling based on guided filter with low gradient minimization
CN114494576A (en) Rapid high-precision multi-view face three-dimensional reconstruction method based on implicit function
Viriyavisuthisakul et al. A regularization-based generative adversarial network for single image super-resolution
Huang et al. MESR: Multistage enhancement network for image super-resolution
CN112990215B (en) Image denoising method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200605

WD01 Invention patent application deemed withdrawn after publication