CN110288697A - 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks - Google Patents

3D face representation and reconstruction method based on multi-scale graph convolutional neural networks

Info

Publication number
CN110288697A
CN110288697A
Authority
CN
China
Prior art keywords
sampling
mesh
vertex
variation
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910551003.7A
Other languages
Chinese (zh)
Inventor
李坤
袁存款
杨敬钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910551003.7A priority Critical patent/CN110288697A/en
Publication of CN110288697A publication Critical patent/CN110288697A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the fields of computer vision and computer graphics. To achieve an effective high-dimensional feature representation of 3D face data, together with the ability to reconstruct faces from that representation, the technical solution adopted by the invention is a 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks: a variational generative network based on deep graph convolution is constructed, and a variational autoencoder neural network is used to learn a high-dimensional representation of 3D faces and to reconstruct them; the generative capacity of the variational autoencoder is further used to generate a wider variety of 3D face data; the variational decoder then decodes the high-dimensional features into a facial triangle mesh model (Mesh), realizing 3D face reconstruction. The invention is mainly applied to scenarios such as 3D face reconstruction and face recognition.

Description

3D face representation and reconstruction method based on multi-scale graph convolutional neural networks
Technical field
The invention belongs to the fields of computer vision and computer graphics, and in particular relates to representing and reconstructing 3D faces with deep learning methods.
Background art
Faces play a key role in identity recognition, information transfer and emotional expression. Effective representation and reconstruction of a specific face is extremely important for creating personalized face avatars, 3D printing and facial animation, and has wide applications in film, computer games, augmented reality (AR) and virtual reality (VR). However, because of factors such as age, gender and ethnicity, face shapes vary greatly and expression deformations are significant. It is therefore difficult to represent and reconstruct such non-linear deformations effectively.
Conventional method rebuilds 3D face using the method based on fusion using laser scanner or depth camera (R.A.Newcombe et al.,“KinectFusion:Real-time dense surface mapping and Tracking, " in Proc.IEEE International Symposium on Mixed and Augmented Reality, 2011, pp.127-136.), but they cannot achieve animation, editor and generation.In order to solve this problem, many Work proposes parametrization faceform (Volker Blanz and Thomas Vetter, " A morphable model For the synthesis of 3D faces, " in CGIT, 1999, pp.187-194.) and mixing shape (John P Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Frederic H Pighin, and Zhigang Deng,“Practice and theory of blendshape facial models,”in Eurographics(State Of the Art Reports), 2014.) indicate face shape and expression, and there is several methods that using these models from sweeping Face shape (Pei-Lun Hsieh, Chongyang Ma, Jihun Yu, and have successfully been rebuild in the deep grid retouched Hao Li, " Unconstrained realtime facial performance capture, " in CVPR, 2015, pp.1675–1683.).However, being generally in the shape of using the reconstruction of linear expression smooth without details abundant.To sum up, It is important for efficiently and accurately indicate and rebuild to 3D face, will be the directions such as face recognition, authentication Technical support is provided.
Summary of the invention
To overcome the deficiencies of the prior art and achieve an effective high-dimensional feature representation of 3D face data, together with the ability to reconstruct faces from that representation, the technical solution adopted by the invention is a 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks: a variational generative network based on deep graph convolution is constructed, and a variational autoencoder neural network is used to learn a high-dimensional representation of 3D faces and to reconstruct them; the generative capacity of the variational autoencoder is further used to generate a wider variety of 3D face data; the variational decoder then decodes the high-dimensional features into a facial triangle mesh model (Mesh), realizing 3D face reconstruction.
The deep-convolution variational generative network uses a graph convolution method and a mesh sampling algorithm. Each layer of the encoder consists of a graph convolution, batch normalization, a ReLU activation function and mesh down-sampling; each layer of the decoder consists of mesh up-sampling, a graph convolution, batch normalization and a ReLU activation function.
The variational generative network is a deep convolutional network comprising three parts: the graph convolution method, the Mesh down-sampling algorithm and the network structure, specifically as follows:
1) Graph convolution method
Mesh data is processed with a dynamic filter convolution layer that learns a mapping from a vertex neighborhood to filter weights and takes the intrinsic features of the mesh into account. Specifically, the input of the network layer is a feature vector x_i associated with each vertex i ∈ {1, ..., n}, and the output is also a vector y_i:
y_i = b + Σ_{j∈N_i} Σ_{m=1}^{M} q_m(x_i, x_j) W_m x_j, where q_m(x_i, x_j) ∝ exp(t_m^T (x_j − x_i) + c_m) and Σ_{m=1}^{M} q_m(x_i, x_j) = 1
Formula explanation: N_i is the set of neighbor vertices of vertex i in the Mesh, and q_m(x_i, x_j) are the positive edge weights in the algorithm; normalizing the M weights so that they sum to 1 makes the weights translation-invariant in feature space. When the 3D coordinates of the original space are used as the input features of the shape mesh, the translation-invariant features give better training results. b, W_m, t_m and c_m are all trainable parameters, and M is a manually set hyperparameter;
2) Mesh down-sampling algorithm
A permutation matrix P_d ∈ {0,1}^{k×n} is used to perform fast down-sampling, down-sampling a Mesh with n vertices to k vertices (k < n). P_d(p, q) indicates whether the q-th vertex is kept during down-sampling: it is kept if the value is 1 and discarded if it is 0. The down-sampling algorithm iteratively contracts vertex pairs using quadric matrices. Up-sampling is the inverse of down-sampling: a Mesh with k vertices is up-sampled to n vertices (k < n) using an up-sampling permutation matrix P_u ∈ R^{n×k}. The up-sampling process adds the vertices v_q discarded during down-sampling back into the down-sampled mesh, i.e. v_q is mapped onto the closest triangle (h, i, j) of the down-sampled mesh, and its barycentric coordinates are computed so that v = w_h v_h + w_i v_i + w_j v_j, with v_h, v_i, v_j ∈ V_d (the down-sampled vertex set) and w_h + w_i + w_j = 1. The weights in P_u are set as P_u(q, h) = w_h, P_u(q, i) = w_i, P_u(q, j) = w_j;
3) Network structure
The network is divided into an encoder part and a decoder part. The encoder consists of 6 graph convolutions with feature counts set to (16, 32, 64, 96, 128, 256); every layer uses batch normalization and a ReLU activation function, and every convolution layer is followed by down-sampling with factors [2, 2, 2, 4, 4, 4] respectively. The output feature dimensions of the encoder layers are 2512 × 16, 1256 × 32, 628 × 64, 157 × 96, 40 × 128 and 10 × 256, and the last layer maps the features to a 128-dimensional latent space. The decoder first uses a fully connected layer to map the 128-dimensional features back to the Mesh space, followed by 6 graph convolution layers, each using batch normalization and a ReLU activation function, with up-sampling factors [4, 4, 4, 2, 2, 2]; the whole decoder is equivalent to the inverse of the encoder. The output feature dimensions of its layers are 40 × 128, 157 × 96, 628 × 64, 1256 × 32, 2512 × 16 and 5023 × 3. The 128-dimensional features produced by the encoder are compared against a Gaussian distribution through a Kullback-Leibler divergence loss function, so that the code distribution produced by the encoder approaches a Gaussian distribution as closely as possible.
Compared with the prior art, the technical features and effects of the present invention are:
First, our invention works directly on the 3D Mesh. Compared with traditional reconstruction methods, the method of the present invention mainly has the following characteristics:
1. We propose a new graph convolutional variational autoencoder that provides a hierarchical, multi-scale representation of face Meshes. Our model relies on the vertex connectivity of the mesh for convolution, and can also generate hierarchical mesh representations by effectively sampling the mesh vertices.
2. Our variational autoencoder is easy to train using raw Mesh data, without a complicated data embedding procedure, and its reconstruction accuracy is very high.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the network structure of the embodiment of the present invention.
Fig. 2 compares the reconstruction results of the embodiment of the present invention with other methods on the Coma dataset (Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black, "Generating 3D faces using convolutional mesh autoencoders," in ECCV. Springer, 2018, pp. 725-741.). From top to bottom are: the ground-truth input Mesh, the reconstruction result of Anurag et al. (Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black, "Generating 3D faces using convolutional mesh autoencoders," in ECCV. Springer, 2018, pp. 725-741.), the reconstruction result estimated by the present invention, the error visualization of the method of Anurag et al., and the error visualization of the present invention.
Fig. 3 illustrates the generation capability of the variational model proposed by the invention; all faces in the figure are Meshes randomly generated by the network.
Specific embodiment
In order to obtain an effective high-dimensional feature representation of 3D face data and the ability to reconstruct faces effectively from the high-dimensional features, the technical solution adopted by the present invention is to design a variational autoencoder neural network that learns the high-dimensional representation of 3D faces and reconstructs them. The generative capacity of the variational autoencoder is further used to generate a wider variety of 3D face data. Specifically, our method mainly comprises the following steps:
1) Design of the variational generative network. A variational autoencoder framework based on a graph convolutional structure is designed: given a facial triangle mesh model (Mesh) as input, the encoder produces the high-dimensional features encoding the Mesh, and the decoder can decode these high-dimensional features back into the original facial Mesh.
2) The whole network includes the graph convolution method and the mesh down-sampling algorithm. As shown in Fig. 1, each G block in the encoder consists of a graph convolution, batch normalization, a ReLU activation function and mesh down-sampling; each G block in the decoder consists of mesh up-sampling, a graph convolution, batch normalization and a ReLU activation function.
The variational generative network is a deep convolutional network that mainly consists of three parts: the graph convolution method, the Mesh down-sampling algorithm and the network structure, described in the following parts:
2-1) Graph convolution method. Traditional convolutional neural networks cannot handle irregular graph data such as a Mesh, so we process the mesh data with a dynamic filter convolution layer. It learns a mapping from a vertex neighborhood to filter weights and takes the intrinsic features of the mesh into account. Specifically, the input of the network layer is a feature vector x_i associated with each vertex i ∈ {1, ..., n}, and the output is also a vector y_i:
y_i = b + Σ_{j∈N_i} Σ_{m=1}^{M} q_m(x_i, x_j) W_m x_j, where q_m(x_i, x_j) ∝ exp(t_m^T (x_j − x_i) + c_m) and Σ_{m=1}^{M} q_m(x_i, x_j) = 1
Formula explanation: N_i is the set of neighbor vertices of vertex i in the Mesh, and q_m(x_i, x_j) are the positive edge weights in the algorithm; normalizing the M weights so that they sum to 1 makes the weights translation-invariant in feature space. When the 3D coordinates of the original space are used as the input features of the shape mesh, the translation-invariant features give better training results. b, W_m, t_m and c_m are all trainable parameters, and M is a manually set hyperparameter.
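To make the layer concrete, the following NumPy sketch shows one possible implementation of this kind of dynamic-filter graph convolution under the assumptions stated above (the weight function q_m uses the feature difference x_j − x_i followed by a softmax over the M components); the names feast_conv, verts and neighbors and the toy sizes are illustrative, not part of the patent:

```python
import numpy as np

def feast_conv(x, neighbors, W, t, c, b):
    """One dynamic-filter graph convolution over mesh vertices (illustrative sketch).

    x         : (n, d_in)  input feature per vertex
    neighbors : list of n lists, neighbors[i] = indices of N_i (1-ring of vertex i)
    W         : (M, d_in, d_out) trainable filter matrices W_m
    t, c      : (M, d_in), (M,)  trainable parameters of the weight function q_m
    b         : (d_out,)   trainable bias
    returns   : (n, d_out) output feature y_i per vertex
    """
    n, _ = x.shape
    M, _, d_out = W.shape
    y = np.tile(b, (n, 1)).astype(np.float64)
    for i in range(n):
        for j in neighbors[i]:
            # q_m(x_i, x_j) proportional to exp(t_m^T (x_j - x_i) + c_m), softmax over m
            logits = t @ (x[j] - x[i]) + c            # (M,)
            q = np.exp(logits - logits.max())
            q /= q.sum()                               # positive weights summing to 1
            # weighted sum of the M learned filters applied to the neighbor feature
            y[i] += sum(q[m] * (x[j] @ W[m]) for m in range(M))
    return y

# toy usage: a 4-vertex patch with 3-D coordinates as input features, M = 2 filters
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
neighbors = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
M, d_in, d_out = 2, 3, 16
y = feast_conv(x, neighbors,
               W=rng.normal(size=(M, d_in, d_out)) * 0.1,
               t=rng.normal(size=(M, d_in)), c=np.zeros(M), b=np.zeros(d_out))
print(y.shape)  # (4, 16)
```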
2-2) Mesh down-sampling algorithm. The sampling algorithm for a Mesh differs from traditional image sampling algorithms; a stable sampling algorithm improves the feature extraction ability of the network and therefore effectively boosts the performance of the whole network. In our network we use a permutation matrix P_d ∈ {0,1}^{k×n} to perform fast down-sampling, down-sampling a Mesh with n vertices to k vertices (k < n). P_d(p, q) indicates whether the q-th vertex is kept during down-sampling: it is kept if the value is 1 and discarded if it is 0. The down-sampling algorithm iteratively contracts vertex pairs using quadric matrices. Up-sampling is the inverse of down-sampling: a Mesh with k vertices is up-sampled to n vertices (k < n) using an up-sampling permutation matrix P_u ∈ R^{n×k}. The up-sampling process adds the vertices v_q discarded during down-sampling back into the down-sampled mesh, i.e. v_q is mapped onto the closest triangle (h, i, j) of the down-sampled mesh, and its barycentric coordinates are computed so that v = w_h v_h + w_i v_i + w_j v_j, with v_h, v_i, v_j ∈ V_d (the down-sampled vertex set) and w_h + w_i + w_j = 1. The weights in P_u are set as P_u(q, h) = w_h, P_u(q, i) = w_i, P_u(q, j) = w_j.
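As a minimal illustration of these sampling operators (assuming the kept-vertex indices and the barycentric weights have already been obtained from quadric-based mesh simplification; build_sampling_matrices and the toy data below are hypothetical names, not from the patent), the matrices P_d and P_u can be assembled as follows:

```python
import numpy as np

def build_sampling_matrices(n, kept, bary):
    """Assemble the down-sampling matrix P_d and up-sampling matrix P_u (sketch).

    n    : number of vertices of the original mesh
    kept : list of k original vertex indices kept by quadric-based simplification
    bary : dict mapping each discarded vertex q to ((h, i, j), (w_h, w_i, w_j)),
           the closest triangle in the down-sampled mesh and its barycentric weights
    """
    k = len(kept)
    # P_d in {0,1}^{k x n}: row p picks the p-th kept vertex
    Pd = np.zeros((k, n))
    for p, q in enumerate(kept):
        Pd[p, q] = 1.0
    # position of each kept original vertex inside the down-sampled mesh
    pos = {q: p for p, q in enumerate(kept)}
    # P_u in R^{n x k}: kept vertices are copied back, discarded vertices are
    # re-expressed with barycentric weights of their closest triangle (h, i, j)
    Pu = np.zeros((n, k))
    for q in range(n):
        if q in pos:
            Pu[q, pos[q]] = 1.0
        else:
            (h, i, j), (wh, wi, wj) = bary[q]   # w_h + w_i + w_j = 1
            Pu[q, pos[h]], Pu[q, pos[i]], Pu[q, pos[j]] = wh, wi, wj
    return Pd, Pu

# toy usage: 6-vertex mesh down-sampled to 4 vertices, then up-sampled back
V = np.random.rand(6, 3)                      # vertex coordinates
kept = [0, 2, 3, 5]
bary = {1: ((0, 2, 3), (0.5, 0.3, 0.2)),      # discarded vertex 1 -> triangle (0, 2, 3)
        4: ((2, 3, 5), (0.2, 0.2, 0.6))}      # discarded vertex 4 -> triangle (2, 3, 5)
Pd, Pu = build_sampling_matrices(6, kept, bary)
V_down = Pd @ V                               # (4, 3) down-sampled vertices
V_up = Pu @ V_down                            # (6, 3) up-sampled vertices
```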
2-3) Network structure. The network is divided into an encoder part and a decoder part. The encoder consists of 6 graph convolutions with feature counts set to (16, 32, 64, 96, 128, 256); every layer uses batch normalization and a ReLU activation function, and every convolution layer is followed by down-sampling with factors [2, 2, 2, 4, 4, 4] respectively. The output feature dimensions of the encoder layers are 2512 × 16, 1256 × 32, 628 × 64, 157 × 96, 40 × 128 and 10 × 256, and the last layer maps the features to a 128-dimensional latent space. The decoder first uses a fully connected layer to map the 128-dimensional features back to the Mesh space, followed by 6 graph convolution layers, each using batch normalization and a ReLU activation function, with up-sampling factors [4, 4, 4, 2, 2, 2]; the whole decoder is equivalent to the inverse of the encoder. The output feature dimensions of its layers are 40 × 128, 157 × 96, 628 × 64, 1256 × 32, 2512 × 16 and 5023 × 3. Note that the last layer of the network uses neither batch normalization nor an activation function.
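The following sketch assembles the layer dimensions described above into an encoder/decoder outline (a pure NumPy forward pass with random weights, only to make the tensor shapes and the [2, 2, 2, 4, 4, 4] / [4, 4, 4, 2, 2, 2] sampling schedule explicit; the graph convolutions and mesh sampling are replaced here by plain per-vertex linear maps and index slicing/repetition, so this is a shape sketch rather than the actual network):

```python
import numpy as np

VERTS = [5023, 2512, 1256, 628, 157, 40, 10]     # vertex counts after each down-sampling
FEATS = [3, 16, 32, 64, 96, 128, 256]            # feature widths of the 6 encoder layers
LATENT = 128

def relu(x):
    return np.maximum(x, 0.0)

def encode(x, rng):
    """x: (5023, 3) mesh vertices -> (LATENT,) code. Graph convs are stand-in linear maps."""
    for l in range(6):
        x = relu(x @ rng.normal(size=(FEATS[l], FEATS[l + 1])) * 0.1)   # "graph conv" + ReLU
        x = x[:VERTS[l + 1]]                                            # mesh down-sampling
        print(f"encoder layer {l + 1}: {x.shape}")                      # (2512, 16) ... (10, 256)
    return x.reshape(-1) @ rng.normal(size=(VERTS[6] * FEATS[6], LATENT)) * 0.01

def decode(z, rng):
    """z: (LATENT,) code -> (5023, 3) mesh. Mirrors the encoder with up-sampling [4,4,4,2,2,2]."""
    x = (z @ rng.normal(size=(LATENT, VERTS[6] * FEATS[6])) * 0.01).reshape(VERTS[6], FEATS[6])
    for l in range(6, 0, -1):
        reps = int(np.ceil(VERTS[l - 1] / VERTS[l]))
        x = np.repeat(x, reps, axis=0)[:VERTS[l - 1]]                   # mesh up-sampling (nearest)
        x = x @ rng.normal(size=(FEATS[l], FEATS[l - 1])) * 0.1         # "graph conv"
        if l > 1:
            x = relu(x)                                                 # last layer: no BN / activation
    return x

rng = np.random.default_rng(0)
mesh = rng.normal(size=(5023, 3))
z = encode(mesh, rng)
recon = decode(z, rng)
print(z.shape, recon.shape)    # (128,) (5023, 3)
```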
4) training.We set hyper parameter M=16, and sheaf space dimension of diving is 128, and the number of vertex of training Mesh is 5023, The vertex set in 1 field, whole data set training 100 times are used, learning rate is set as 0.002, and every batch of fills 8 Mesh data, the Adam optimizer used.Training Variation Model when, we use self-propagation weight design, network with The continuous weight for increasing KL variation loss function of the increase of frequency of training, can reach optimal training effect with this.The net Network is trained using TensorFlow deep learning frame, is operated in tall and handsome up to GTX1080Ti video card.

Claims (3)

1. A 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks, characterized in that a variational generative network based on deep convolution is constructed, and a variational autoencoder neural network is used to learn the high-dimensional representation of 3D faces and to reconstruct them; the generative capacity of the variational autoencoder is used to generate a wider variety of 3D face data; the variational decoder then decodes the high-dimensional features into a facial triangle mesh model (Mesh), realizing 3D face reconstruction.
2. The 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks according to claim 1, characterized in that the deep-convolution variational generative network uses a graph convolution method and a mesh sampling algorithm; each layer of the encoder consists of a graph convolution, batch normalization, a ReLU activation function and mesh down-sampling, and each layer of the decoder consists of mesh up-sampling, a graph convolution, batch normalization and a ReLU activation function.
3. The 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks according to claim 1, characterized in that the variational generative network is a deep convolutional network comprising the graph convolution method, the Mesh down-sampling algorithm and the network structure, specifically as follows:
1) Graph convolution method
Mesh data is processed with a dynamic filter convolution layer that learns a mapping from a vertex neighborhood to filter weights and takes the intrinsic features of the mesh into account; specifically, the input of the network layer is a feature vector x_i associated with each vertex i ∈ {1, ..., n}, and the output is also a vector y_i:
y_i = b + Σ_{j∈N_i} Σ_{m=1}^{M} q_m(x_i, x_j) W_m x_j, where q_m(x_i, x_j) ∝ exp(t_m^T (x_j − x_i) + c_m) and Σ_{m=1}^{M} q_m(x_i, x_j) = 1
Formula explanation: N_i is the set of neighbor vertices of vertex i in the Mesh, and q_m(x_i, x_j) are the positive edge weights in the algorithm; normalizing the M weights so that they sum to 1 makes the weights translation-invariant in feature space; when the 3D coordinates of the original space are used as the input features of the shape mesh, the translation-invariant features give better training results; b, W_m, t_m and c_m are all trainable parameters, and M is a manually set hyperparameter;
2) Mesh down-sampling algorithm
A permutation matrix P_d ∈ {0,1}^{k×n} is used to perform fast down-sampling, down-sampling a Mesh with n vertices to k vertices, k < n; P_d(p, q) indicates whether the q-th vertex is kept during down-sampling: it is kept if the value is 1 and discarded if it is 0; the down-sampling algorithm iteratively contracts vertex pairs using quadric matrices; up-sampling is the inverse of down-sampling: a Mesh with k vertices is up-sampled to n vertices, k < n, using an up-sampling permutation matrix P_u ∈ R^{n×k}; the up-sampling process adds the vertices v_q discarded during down-sampling back into the down-sampled mesh, i.e. v_q is mapped onto the closest triangle (h, i, j) of the down-sampled mesh, and its barycentric coordinates are computed so that v = w_h v_h + w_i v_i + w_j v_j, with v_h, v_i, v_j ∈ V_d (the down-sampled vertex set) and w_h + w_i + w_j = 1; the weights in P_u are set as P_u(q, h) = w_h, P_u(q, i) = w_i, P_u(q, j) = w_j;
3) Network structure
The network is divided into an encoder part and a decoder part; the encoder consists of 6 graph convolutions with feature counts set to (16, 32, 64, 96, 128, 256); every layer uses batch normalization and a ReLU activation function, and every convolution layer is followed by down-sampling with factors [2, 2, 2, 4, 4, 4] respectively; the output feature dimensions of the encoder layers are 2512 × 16, 1256 × 32, 628 × 64, 157 × 96, 40 × 128 and 10 × 256, and the last layer maps the features to a 128-dimensional latent space; the decoder first uses a fully connected layer to map the 128-dimensional features back to the Mesh space, followed by 6 graph convolution layers, each using batch normalization and a ReLU activation function, with up-sampling factors [4, 4, 4, 2, 2, 2]; the whole decoder is equivalent to the inverse of the encoder; the output feature dimensions of its layers are 40 × 128, 157 × 96, 628 × 64, 1256 × 32, 2512 × 16 and 5023 × 3; the 128-dimensional features produced by the encoder are compared against a Gaussian distribution through a Kullback-Leibler divergence loss function, so that the code distribution produced by the encoder approaches a Gaussian distribution as closely as possible.
CN201910551003.7A 2019-06-24 2019-06-24 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks Pending CN110288697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910551003.7A CN110288697A (en) 2019-06-24 2019-06-24 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910551003.7A CN110288697A (en) 2019-06-24 2019-06-24 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks

Publications (1)

Publication Number Publication Date
CN110288697A true CN110288697A (en) 2019-09-27

Family

ID=68005387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910551003.7A Pending CN110288697A (en) 2019-06-24 2019-06-24 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks

Country Status (1)

Country Link
CN (1) CN110288697A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447020A (en) * 2018-03-12 2018-08-24 南京信息工程大学 A kind of face super-resolution reconstruction method based on profound convolutional neural networks
CN108845833A (en) * 2018-05-25 2018-11-20 深圳市零度智控科技有限公司 Intelligent shutdown method, apparatus and computer readable storage medium
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
GB201902524D0 (en) * 2019-02-25 2019-04-10 Facesoft Ltd Joint shape and texture decoders for three-dimensional rendering

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANURAG RANJAN et al.: "Generating 3D faces using Convolutional Mesh Autoencoders", 《COMPUTER VISION-ECCV 2018》 *
NITIKA VERMA et al.: "Dynamic Filters in Graph Convolutional Networks", 《ARXIV:1706.05206V1》 *
刘尚旺 et al.: "Real-time facial expression and gender classification based on depthwise separable convolutional neural networks", 《计算机应用》 (Journal of Computer Applications) *
李思泉 et al.: "Research on facial expression recognition based on convolutional neural networks", 《软件导刊》 (Software Guide) *
李欣怡 et al.: "A survey of speech-driven facial animation research", 《计算机工程与应用》 (Computer Engineering and Applications) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network
CN110689618A (en) * 2019-09-29 2020-01-14 天津大学 Three-dimensional deformable object filling method based on multi-scale variational graph convolution
CN110728219B (en) * 2019-09-29 2023-09-26 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network
CN111259745B (en) * 2020-01-09 2022-07-12 西安交通大学 3D face decoupling representation learning method based on distribution independence
CN111259745A (en) * 2020-01-09 2020-06-09 西安交通大学 3D face decoupling representation learning method based on distribution independence
CN111243096A (en) * 2020-01-14 2020-06-05 天津大学 Three-dimensional face representation and reconstruction method based on edge-constrained spectrum convolution neural network
CN111753770A (en) * 2020-06-29 2020-10-09 北京百度网讯科技有限公司 Person attribute identification method and device, electronic equipment and storage medium
CN112001268A (en) * 2020-07-31 2020-11-27 中科智云科技有限公司 Face calibration method and device
CN112001268B (en) * 2020-07-31 2024-01-12 中科智云科技有限公司 Face calibration method and equipment
CN111768354A (en) * 2020-08-05 2020-10-13 哈尔滨工业大学 Face image restoration system based on multi-scale face part feature dictionary
RU2779271C2 (en) * 2020-11-12 2022-09-05 Акционерное общество "Концерн радиостроения "Вега" Method for reconstruction of 3d-model of object
CN112767264A (en) * 2021-01-08 2021-05-07 中国科学院计算技术研究所 Image deblurring method and system based on graph convolution neural network
CN113486751A (en) * 2021-06-29 2021-10-08 西北大学 Pedestrian feature extraction method based on graph convolution and edge weight attention
CN113486751B (en) * 2021-06-29 2023-07-04 西北大学 Pedestrian feature extraction method based on graph convolution and edge weight attention
CN113470182A (en) * 2021-09-03 2021-10-01 中科计算技术创新研究院 Face geometric feature editing method and deep face remodeling editing method
CN113470182B (en) * 2021-09-03 2022-02-18 中科计算技术创新研究院 Face geometric feature editing method and deep face remodeling editing method
CN114187408A (en) * 2021-12-15 2022-03-15 中国电信股份有限公司 Three-dimensional face model reconstruction method and device, electronic equipment and storage medium
CN115050087A (en) * 2022-08-16 2022-09-13 之江实验室 Method and device for decoupling identity and expression of key points of human face
CN115050087B (en) * 2022-08-16 2022-11-18 之江实验室 Method and device for decoupling identity and expression of key points of human face

Similar Documents

Publication Publication Date Title
CN110288697A (en) 3D face representation and reconstruction method based on multi-scale graph convolutional neural networks
Richter et al. Matryoshka networks: Predicting 3d geometry via nested shape layers
Liu et al. Meshdiffusion: Score-based generative 3d mesh modeling
Yuan et al. Star: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
US11276231B2 (en) Semantic deep face models
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN108537871B (en) Information processing apparatus and information processing method
CN110599395B (en) Target image generation method, device, server and storage medium
Genova et al. Deep structured implicit functions
CN112132739B (en) 3D reconstruction and face pose normalization method, device, storage medium and equipment
CN110570522A (en) Multi-view three-dimensional reconstruction method
CN108124489B (en) Information processing method, apparatus, cloud processing device and computer program product
CN113077554A (en) Three-dimensional structured model reconstruction method based on any visual angle picture
CN113762147B (en) Facial expression migration method and device, electronic equipment and storage medium
CN114245215B (en) Method, device, electronic equipment, medium and product for generating speaking video
WO2022205755A1 (en) Texture generation method and apparatus, device, and storage medium
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
Liu et al. Facial image inpainting using multi-level generative network
Sarkar et al. 3d shape processing by convolutional denoising autoencoders on local patches
CN115311127A (en) Face processing method and device, computer equipment and storage medium
Hara et al. Enhancement of novel view synthesis using omnidirectional image completion
Yuan et al. 3D face representation and reconstruction with multi-scale graph convolutional autoencoders
CN116993948A (en) Face three-dimensional reconstruction method, system and intelligent terminal
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190927