CN111524226A - Method for detecting key point and three-dimensional reconstruction of ironic portrait painting - Google Patents
- Publication number
- CN111524226A (application CN202010316895.5A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- face
- model
- vertex
- exaggerated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Processing Or Creating Images (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for keypoint detection and three-dimensional reconstruction of caricatures (ironic portraits), comprising the following steps: constructing a convolutional neural network, and collecting a data set comprising a three-dimensional face template model, caricature images, labeled two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by an existing method; training the network on this data set so that, for an input caricature, it outputs a deformation representation model and camera projection parameters, from which the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates are predicted. The method removes the need to label keypoints on caricatures by hand; with the help of a new face deformation representation and a large data set, the trained convolutional neural network reconstructs an exaggerated three-dimensional face directly from the predicted deformation representation model, and obtains the two-dimensional keypoint coordinates from the simultaneously predicted camera projection parameters.
Description
Technical Field
The invention relates to the technical fields of image processing and three-dimensional modeling, and in particular to a method for keypoint detection and three-dimensional reconstruction of caricatures.
Background
Caricature is an artistic form spanning two-dimensional images and three-dimensional models. It creates a humorous visual effect by exaggerating certain features or details of a human face, and is widely used in movies, advertising, social media, and other everyday settings. In computer vision, cognitive psychology, and related fields, this art form has also been shown to improve the accuracy of face recognition. Because of its research potential and broad applications, problems related to caricature are attracting a growing number of researchers and enterprises.
Keypoint detection on caricatures: compared with a normal face, a caricature is exaggerated and highly varied, which makes keypoint detection difficult; consequently, few automatic keypoint detection algorithms exist for caricatures. On the other hand, many research topics on caricature rely on keypoints, and labeling them manually is tedious, time-consuming, and laborious. Developing a keypoint detection algorithm for caricatures is therefore of great significance: it fills a gap in this line of research and supports the development of related topics.
Most popular keypoint detection algorithms for normal faces are data-driven and depend on the design of a deep neural network. Such algorithms extract visual features of the face, or statistical features of the image pixels, from a single picture and regress the keypoint positions; the extraction methods include knowledge-based and algebraic-feature-based approaches. An exaggerated face is rooted in a normal face and must satisfy the basic structure of a face, such as having the expected number of eyes, a mouth, a nose, and ears. However, a caricature exaggerates features of the normal face, so a given feature, for example the distribution of keypoints around the eyes, varies greatly between pictures. Because of this exaggerated variation and diversity, few keypoint detection algorithms exist for caricatures.
Three-dimensional reconstruction of caricatures: at present there are two main ways to obtain a three-dimensional exaggerated face model: manual modeling and reconstruction based on deformation algorithms. Manual modeling, the earliest means of three-dimensional modeling, is still widely used to produce exaggerated face models, but it typically requires a professionally trained artist working in specialized modeling software such as Autodesk Maya. Although manual modeling is highly accurate, it costs substantial time and manpower, so deformation-based methods are more popular. Deformation algorithms generate models automatically, but the results are often limited in exaggeration style and, compared with the diverse shapes obtainable by manual modeling, are neither varied nor accurate enough. Moreover, most existing deformation algorithms depend on keypoints, so time and labor are still needed to label them, and once the labels are inaccurate the generated model may fail to match the original two-dimensional caricature.
Traditional image-based methods for generating a normal three-dimensional face model usually acquire three-dimensional models of a set of subjects, for example with cameras, build a face database from them, and construct a parameterized face model (linear or nonlinear) by statistical or dimensionality-reduction methods. A complex three-dimensional face is thereby parameterized into a low-dimensional space, and a normal face can be reconstructed from its coordinates in that space. Following this idea, the conventional approach to exaggerated face generation first labels two-dimensional keypoints on a single picture and then generates the exaggerated face through keypoint constraints and the constructed parameterized model. This approach depends heavily on keypoints: labeling costs time, and inaccurate labels directly degrade the reconstructed three-dimensional model.
Disclosure of Invention
The invention aims to provide a method for keypoint detection and three-dimensional reconstruction of caricatures, which automatically and quickly detects keypoints of an exaggerated face and generates the corresponding three-dimensional model, and which has important practical value in face recognition, animation generation, expression transfer, AR/VR, and related fields.
The purpose of the invention is realized by the following technical scheme:
a method for detecting key points and reconstructing three-dimensional of ironic portrait painting comprises the following steps:
constructing a convolutional neural network, and collecting a data set which comprises a three-dimensional face template model, sarcasia portrait, marked two-dimensional key point coordinates and a three-dimensional exaggerated face model generated based on the existing method; the three-dimensional face template model and the three-dimensional exaggerated face model have the same topological structure;
in the training stage, a three-dimensional face template model is used as a template face, a deformation representation model of each ironic portrait is calculated, and camera projection parameters are output; predicting the corresponding three-dimensional exaggerated face model vertex coordinates and two-dimensional key point coordinates according to the deformation representation model and the camera projection parameters, and constructing a loss function in a training stage according to the three-dimensional exaggerated face model vertex coordinates and the two-dimensional key point coordinates, so that the network is trained in a supervision mode;
after training, corresponding deformation representation model and camera projection parameters are obtained for the ironic portrait painting input, and therefore the vertex coordinates and the two-dimensional key point coordinates of the three-dimensional exaggerated face model are predicted.
The above technical solution provides the following benefits: 1) constraining facial deformation with the deformation representation keeps the generated shape face-like, while the expressive deformation representation model can still produce faces in exaggerated styles; 2) the face deformation model and the camera projection parameters are regressed directly from a single picture by the convolutional neural network; 3) together, these yield a more accurate three-dimensional exaggerated face model and, at the same time, more accurate two-dimensional keypoint coordinates.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for caricature keypoint detection and three-dimensional reconstruction according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the convolutional neural network according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of test results produced by the trained convolutional neural network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the invention, not all of them; all other embodiments derived by those skilled in the art without creative effort fall within the protection scope of the present invention.
In caricature face recognition, keypoint detection algorithms designed for normal faces are often not accurate enough, because the distribution of some facial features differs greatly between pictures, and considerable time is still needed to adjust the detected keypoint positions. In caricature three-dimensional reconstruction, the basis models of traditional methods lack expressive power, so the reconstructed face is not exaggerated enough; reconstruction algorithms based on optimization and keypoint constraints depend too heavily on keypoint labels, and once the labels are inaccurate, the generated three-dimensional model deviates substantially from the two-dimensional picture. To this end, an embodiment of the present invention provides a method for caricature keypoint detection and three-dimensional reconstruction, as shown in Fig. 1, which mainly comprises the following steps.
Step 1 mainly comprises constructing the network and collecting the data. Because the data may be acquired in different ways and processed differently, the three-dimensional exaggerated face models in the data set are required to have the same topology as the three-dimensional face template model: all models share the same number of vertices, the same adjacency relations, and the same vertex ordering. In addition, the collected face data should be sufficiently diverse.
Those skilled in the art will appreciate that a normal face data set satisfying these conditions can be obtained by conventional means.
First, the principle of computing the deformation representation model is described.
Record the vertex set of the three-dimensional face template model as V = {v_i | i = 1, ..., N_v}, i.e., V consists of all vertices v_i of a single three-dimensional face, where i is the index subscript and N_v is the total number of vertices. The collected data set satisfies the condition that all face data share the same vertex count, vertex ordering, and adjacency relations; hence, given the vertex set V and an index i, the referenced vertex is determined.
Take the three-dimensional face template model as the template face and the three-dimensional exaggerated face model corresponding to the caricature as the deformed face. Construct an energy function of the deformation gradient T_i between the vertex v'_i with index i on the deformed face and the vertex v_i with index i on the template face, and minimize it to solve for T_i:

E(T_i) = Σ_{j ∈ N_i} c_ij ‖e'_ij − T_i e_ij‖²

where N_i is the index set of the 1-ring neighborhood centered on the vertex with index i; the vertex with index j ∈ N_i is denoted v_j on the template face and v'_j on the deformed face; e'_ij is the edge from vertex v'_i to vertex v'_j on the deformed face, and e_ij is the edge from vertex v_i to vertex v_j on the template face; c_ij is the cotangent Laplacian weight of the template face.
After the deformation gradient of each vertex is obtained, decompose T_i by matrix polar decomposition into R_i S_i, where R_i is the rotation matrix component of the deformation gradient from vertex v_i to vertex v'_i, and S_i is its scaling matrix component.
By matrix operations the rotation matrix R_i is equivalently expressed as exp(log R_i), and the deformation representation model from the template face to the deformed face is written as:

f_n = {log R_i ; S_i − I | i = 1, ..., N_v}

where I is the identity matrix, introduced so that the representation is zero for an undeformed vertex, and V_n = {v'_i | i = 1, ..., N_v} is the vertex set of the three-dimensional exaggerated face model. The purpose of log R is that the product of rotation matrices R_i R_j can be expressed as exp(log R_i + log R_j), simplifying multiplication to addition.
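The log-map identity above can be checked numerically. The following sketch (an illustration, not part of the patent) implements the matrix logarithm of a rotation and Rodrigues' exponential with NumPy; note that exp(log R_i + log R_j) equals R_i R_j exactly only for commuting rotations (e.g. rotations about a shared axis), and is otherwise the first-order approximation the representation relies on:

```python
import numpy as np

def log_rotation(R):
    # Matrix logarithm of a 3x3 rotation matrix (returns a skew-symmetric matrix).
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return np.zeros((3, 3))
    return theta / (2.0 * np.sin(theta)) * (R - R.T)

def exp_skew(W):
    # Rodrigues' formula: matrix exponential of a skew-symmetric matrix.
    w = np.array([W[2, 1], W[0, 2], W[1, 0]])
    theta = np.linalg.norm(w)
    if np.isclose(theta, 0.0):
        return np.eye(3)
    K = W / theta
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rot_z(a):
    # Rotation by angle a about the z axis.
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Rotations about a shared axis commute, so exp(log Ri + log Rj) = Ri @ Rj exactly.
Ri, Rj = rot_z(0.3), rot_z(0.5)
combined = exp_skew(log_rotation(Ri) + log_rotation(Rj))
```

For general, non-commuting rotations the sum of logs only approximates the composed rotation, which is why the representation treats it as a linearization.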
Encoding all deformations from the template face to each deformed face over the three-dimensional exaggerated face model data set gives a deformation representation set F = {f_n | n = 1, ..., N} based on the template face, where N is the number of elements in the set, i.e., the number of three-dimensional models in the face data set. Illustratively, the number of elements in F is 7800, i.e., N = 7800.
The deformation representation set F is recorded as a matrix of size N × M, where the n-th row of the matrix is the deformation representation f_n of the exaggerated face numbered n, based on the template face. For each f_n, the deformation {log R_i ; S_i − I} of its i-th vertex v'_i is recorded as a 9-dimensional vector, so M = N_v × 9, where N_v is, as above, the total number of vertices of the three-dimensional face mesh.
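The 9 dimensions per vertex come from the 3 free parameters of log R_i plus the 6 independent entries of the symmetric matrix S_i − I. A minimal sketch of extracting this per-vertex feature from a deformation gradient, assuming an SVD-based polar decomposition (the function names are illustrative, not from the patent):

```python
import numpy as np

def polar_decompose(T):
    # T = R @ S with R a proper rotation and S symmetric (polar decomposition via SVD).
    U, sig, Vt = np.linalg.svd(T)
    R = U @ Vt
    if np.linalg.det(R) < 0:            # flip a column to keep det(R) = +1
        U[:, -1] *= -1
        sig = sig.copy()
        sig[-1] *= -1
        R = U @ Vt
    S = Vt.T @ np.diag(sig) @ Vt
    return R, S

def log_axis_angle(R):
    # The 3 free parameters of log R for a 3x3 rotation (axis times angle).
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return np.zeros(3)
    W = theta / (2.0 * np.sin(theta)) * (R - R.T)
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def drep_feature(T):
    # 9-dim per-vertex feature: 3 rotation DOF from log R plus the 6
    # independent upper-triangular entries of the symmetric matrix S - I.
    R, S = polar_decompose(T)
    D = S - np.eye(3)
    return np.concatenate([log_axis_angle(R), D[np.triu_indices(3)]])

# Stacking the per-vertex features of one face gives one row of F (length Nv * 9).
F_row = np.concatenate([drep_feature(2.0 * np.eye(3)) for _ in range(4)])  # toy Nv = 4
```

An undeformed vertex (T = I) maps to the zero vector, which is the motivation for subtracting I from S.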
As shown in Fig. 2, the convolutional neural network comprises an encoder and a decoder. The encoder encodes the caricature into a K-dimensional latent vector, which is split into two parts: a K1-dimensional vector, the camera projection parameters, and a K2-dimensional vector, which is decoded by the decoder into the deformation representation model, where K1 + K2 = K.
Illustratively, ResNet-34 may be used as the encoder and a 3-layer fully connected neural network as the decoder.
For example, the resolution of the input caricature may be 224 × 224, with K1 = 6 camera projection parameters.
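A toy version of such an encoder–decoder can be sketched in PyTorch, with a small stand-in convolutional encoder in place of ResNet-34 (all layer sizes and names here are assumptions for illustration, not the patent's configuration):

```python
import torch
import torch.nn as nn

class CaricatureNet(nn.Module):
    # Hypothetical sketch: an encoder produces a K-dim latent vector that is
    # split into K1 camera parameters and a K2-dim code, which an MLP decodes
    # into the per-vertex deformation representation (Nv * 9 values).
    def __init__(self, n_vertices=100, k1=6, k2=64):
        super().__init__()
        self.k1 = k1
        self.encoder = nn.Sequential(          # stand-in for ResNet-34
            nn.Conv2d(3, 16, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, k1 + k2),
        )
        self.decoder = nn.Sequential(          # 3-layer fully connected decoder
            nn.Linear(k2, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_vertices * 9),
        )

    def forward(self, img):
        z = self.encoder(img)                  # K-dim latent vector
        cam = z[:, :self.k1]                   # K1 camera projection parameters
        drep = self.decoder(z[:, self.k1:])    # deformation representation
        return cam, drep

net = CaricatureNet()
cam, drep = net(torch.zeros(2, 3, 224, 224))   # batch of two 224x224 inputs
```

The split of the latent vector into camera and shape parts is the structural point; the real encoder would be ResNet-34 as stated above.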
Based on this principle, during training the three-dimensional face template model is taken as the template face and a caricature is input; the deformation gradients are obtained from the predicted rotation and scaling matrix components of the deformation representation model, the vertex coordinates of the three-dimensional exaggerated face model corresponding to the caricature are predicted, and the two-dimensional keypoint coordinates are predicted in combination with the camera projection parameters output by the network. A loss function is then constructed from the labeled two-dimensional keypoint coordinates and the three-dimensional exaggerated face model (ground truth) corresponding to the caricature in the data set, and continued training drives the predicted vertex coordinates and two-dimensional keypoint coordinates toward the ground-truth values in the data set.
The preferred embodiment of network training is as follows:
for a ironic portrait, a deformation representation model can be obtained by a convolutional neural network, and is represented as:
wherein the content of the first and second substances,representing predicted vertices viTo v 'to vertex'iThe rotational matrix component of the deformation gradient,representing predicted vertices viTo v 'to vertex'iA scaling matrix component of the deformation gradient; note the book Denotes a vertex v 'with index subscript i on the predicted warped face'iAnd a vertex v with subscript i corresponding to the template faceiA deformation gradient;
From the predicted deformation gradients T̂_i, the vertex coordinates of the three-dimensional exaggerated face model are predicted by solving the optimization problem:

min over {v̂'_i} of Σ_i Σ_{j ∈ N_i} c_ij ‖(v̂'_i − v̂'_j) − T̂_i e_ij‖²

where v̂'_i is the predicted vertex coordinate with index i in the three-dimensional exaggerated face model, and v̂'_j is the predicted vertex coordinate with index j in the set N_i. Solving this optimization problem is equivalent to solving a sparse linear system, which yields the vertex coordinates of the three-dimensional exaggerated face.
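The linear system can be illustrated on a toy mesh. The sketch below uses uniform weights c_ij = 1, edge targets from averaged gradients, and one anchored vertex to fix the free translation (all simplifications relative to the patent's cotangent-weighted system), and recovers deformed vertices from prescribed deformation gradients by least squares:

```python
import numpy as np

# Toy template: a triangle with three edges; all deformation gradients are a
# uniform 2x scale, so the recovered mesh should be the template scaled by 2.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = [(0, 1), (0, 2), (1, 2)]
T = [2.0 * np.eye(3)] * 3                     # predicted per-vertex gradients

rows, rhs = [], []
n = len(template)
for i, j in edges:
    e = template[j] - template[i]             # template edge e_ij
    target = 0.5 * (T[i] + T[j]) @ e          # transformed edge (averaged gradients)
    r = np.zeros((3, 3 * n))
    r[:, 3 * j:3 * j + 3] = np.eye(3)         # coefficient of v'_j
    r[:, 3 * i:3 * i + 3] = -np.eye(3)        # coefficient of v'_i
    rows.append(r)
    rhs.append(target)

# Anchor vertex 0 at the origin: edge constraints only fix vertices up to translation.
anchor = np.zeros((3, 3 * n))
anchor[:, :3] = np.eye(3)
rows.append(anchor)
rhs.append(np.zeros(3))

A = np.vstack(rows)
b = np.concatenate(rhs)
verts = np.linalg.lstsq(A, b, rcond=None)[0].reshape(n, 3)
```

In practice the system is large and sparse (one block row per edge), so a sparse solver would replace the dense `lstsq` call.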
the camera projection parameters P are expressed as:whereinIs a scaling parameter that is a function of the zoom level,is the rotation matrix (derived from the euler angle vector),is a translation parameter. As in the previous example, K1 ═ 6, thenSequentially 1-dimensional, 3-dimensional and 2-dimensional vectors. According to the predicted vertex coordinates of the three-dimensional exaggerated face model and a weak perspective projection formula, two-dimensional key point coordinates can be obtained:
wherein L' is a three-dimensional key point set selected from a predicted vertex set of the three-dimensional exaggerated face model;is a two-dimensional keypoint set, and T is the total number of two-dimensional keypoints.
For example, the keypoints may be the 68 standard keypoints covering the contour, eyebrows, eyes, nose, and mouth, or keypoints in another form; the corresponding three-dimensional keypoints are selected according to the chosen form to make up the set L'.
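The weak perspective projection of a selected keypoint subset can be sketched as follows (the index set and coordinates are made-up illustrative values, not from the patent):

```python
import numpy as np

def weak_perspective(V, s, R, t):
    # q = s * Pi(R v) + t: rotate, keep x/y orthographically, scale, translate.
    # V: (T, 3) keypoint coordinates; s: scalar; R: 3x3 rotation; t: (2,) translation.
    return s * (V @ R.T)[:, :2] + t

landmark_idx = [0, 2]                              # hypothetical index set L'
verts = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0],
                  [-1.0, 1.0, 0.5]])
keypoints = weak_perspective(verts[landmark_idx], 2.0, np.eye(3),
                             np.array([1.0, -1.0]))
```

With the identity rotation, each projected keypoint is just the scaled x/y of the vertex plus the translation, which makes the depth-independence of weak perspective explicit.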
During training, the data in the data set serve as the ground-truth (supervision) information. For a single input caricature, the convolutional neural network constructed in step 1 outputs a deformation representation model f and camera projection parameters P as described above, from which the predicted vertex coordinates of the three-dimensional model and the predicted two-dimensional keypoint coordinates are obtained.
In the embodiment of the present invention, the loss function in the training phase includes three parts:
1) Vertex-based loss function E_ver.
Using the vertex coordinates of the three-dimensional exaggerated face corresponding to the caricature in the data set as supervision, the loss is:

E_ver = Σ_{i=1}^{N_v} ‖v̂'_i − v'_i‖²

where v̂'_i is the predicted vertex coordinate with index i in the three-dimensional exaggerated face model, and v'_i is the vertex coordinate with index i in the corresponding three-dimensional exaggerated face model in the data set.
2) Loss function E_lan based on two-dimensional keypoints.
Using the labeled two-dimensional keypoint coordinates in the data set as supervision, the loss is:

E_lan = Σ_{t=1}^{T} ‖q̂_t − q'_t‖²

where L' is the set of three-dimensional keypoints selected from the predicted vertex set of the three-dimensional exaggerated face model; Q̂ is the two-dimensional keypoint set and T the total number of two-dimensional keypoints; q̂_t are the predicted two-dimensional keypoint coordinates, and q'_t are the corresponding labeled two-dimensional keypoint coordinates in the data set.
3) Loss function E_srt based on camera projection parameters.
Since the keypoint loss involves both the three-dimensional vertex coordinates and the camera parameters, additional supervision is needed at the start of training to constrain the camera parameters individually. The loss penalizes the deviation of the predicted parameters from their ground-truth values:

E_srt = ‖ŝ − s‖² + ‖R̂ − R‖² + ‖t̂ − t‖²

where s is the scaling parameter, R is the rotation matrix, and t is the translation parameter.
Finally, the loss function of the training phase is:
E = λ1 E_ver + λ2 E_lan + λ3 E_srt
where {λ_k | k = 1, 2, 3} are weight parameters; illustratively, λ1 = 1, λ2 = 0.00001, λ3 = 0.0001.
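The combined loss with the example weights above can be sketched as follows; the patent does not show the exact normalization of each term, so plain sums of squared errors are assumed here:

```python
import numpy as np

def total_loss(v_pred, v_gt, q_pred, q_gt, cam_pred, cam_gt,
               lam=(1.0, 1e-5, 1e-4)):
    # E = l1*E_ver + l2*E_lan + l3*E_srt, each term a sum of squared errors
    # (the normalization is an assumption, not taken from the patent).
    e_ver = np.sum((v_pred - v_gt) ** 2)      # vertex loss over Nv vertices
    e_lan = np.sum((q_pred - q_gt) ** 2)      # keypoint loss over T keypoints
    e_srt = np.sum((cam_pred - cam_gt) ** 2)  # camera-parameter loss (s, R euler, t)
    return lam[0] * e_ver + lam[1] * e_lan + lam[2] * e_srt
```

The small weights on the keypoint and camera terms match the example values λ2 = 0.00001 and λ3 = 0.0001 given above, keeping the vertex term dominant.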
In the embodiment of the present invention, the model is trained with the PyTorch deep learning framework; supervised learning may be performed by reading in multiple groups of data at a time (for example, batches of 32), and training may end after multiple epochs (for example, 2000).
Step 3: after training is finished, for an input caricature the corresponding deformation representation model and camera projection parameters are obtained, from which the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates are predicted.
The test procedure is identical to the training procedure: the caricature is input into the trained convolutional neural network, which outputs the deformation representation model and the camera projection parameters, from which the vertex coordinates of the three-dimensional exaggerated face model (which can be constructed directly, since the topology is known) and the two-dimensional keypoint coordinates are predicted.
Fig. 3 shows some example test results: the first row is the input two-dimensional caricature (224 × 224), the second row is the predicted three-dimensional exaggerated face model, and the third row is the image annotated with the predicted two-dimensional keypoints.
Compared with traditional picture-based keypoint detection and three-dimensional reconstruction algorithms, the scheme of the embodiment of the present invention has the following main advantages:
1) by means of the parameterized three-dimensional nonlinear deformation model, the expressive power of the convolutional neural network is enhanced and keypoint detection on exaggerated faces is achieved;
2) through the convolutional neural network, an end-to-end method for reconstructing a three-dimensional face model from a single two-dimensional exaggerated face picture is realized;
3) trained on the large data set described above, the model recognizes and reconstructs caricatures of different styles and different artists far more accurately than traditional algorithms.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A method for caricature keypoint detection and three-dimensional reconstruction, characterized by comprising the following steps:
constructing a convolutional neural network, and collecting a data set comprising a three-dimensional face template model, caricature images, labeled two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by an existing method, wherein the three-dimensional face template model and the three-dimensional exaggerated face models share the same topology;
in the training stage, taking the three-dimensional face template model as the template face, computing the deformation representation model of each caricature, and outputting camera projection parameters; predicting the corresponding vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates from the deformation representation model and the camera projection parameters, and constructing the training loss from these predictions, so that the network is trained in a supervised manner;
after training, for an input caricature the network outputs the corresponding deformation representation model and camera projection parameters, from which the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates are predicted.
2. The caricature keypoint detection and three-dimensional reconstruction method of claim 1, wherein the convolutional neural network comprises an encoder and a decoder; the encoder encodes the caricature into a K-dimensional latent vector, which is split into two parts: a K1-dimensional vector, the camera projection parameters, and a K2-dimensional vector, which is decoded by the decoder into the deformation representation model, where K1 + K2 = K.
3. The caricature keypoint detection and three-dimensional reconstruction method of claim 1, wherein the three-dimensional face template model and the three-dimensional exaggerated face model having the same topology means that the two models share the same number of vertices and the same adjacency relations, and the vertex ordering is the same across models; the vertex set of the three-dimensional face template model is recorded as V = {v_i | i = 1, ..., N_v}, i.e., V consists of all vertices v_i of a single three-dimensional face, where i is the index subscript and N_v is the total number of vertices;
during training, the three-dimensional face template model is taken as the template face and a caricature is input, yielding the deformation representation model f and the camera projection parameters P.
4. The ironic portrait keypoint detection and three-dimensional reconstruction method of claim 3,
the deformation representation model is expressed as:
wherein the content of the first and second substances,representing predicted vertices viTo v 'to vertex'iThe rotational matrix component of the deformation gradient,representing predicted vertices viTo v 'to vertex'iA scaling matrix component of the deformation gradient; note the book Denotes a vertex v 'with index subscript i on the predicted warped face'iAnd a vertex v with subscript i corresponding to the template faceiA deformation gradient;
according to the predicted deformation gradients T̂_i, the vertex coordinates of the three-dimensional exaggerated face model are predicted by solving the optimization problem

min over {v̂'_i} of Σ_{i=1}^{N_v} Σ_{j ∈ N_i} || (v̂'_i − v̂'_j) − R̂_i Ŝ_i (v_i − v_j) ||²

wherein v̂'_i is the vertex coordinate with index subscript i in the predicted three-dimensional exaggerated face model, and v̂'_j is the vertex coordinate with index subscript j in N_i, the set of neighbor vertices of vertex i on the predicted face;
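Recovering vertex coordinates from per-vertex deformation gradients can be sketched as an ordinary least-squares problem: for every edge (i, j), require v'_i − v'_j ≈ T_i (v_i − v_j), then solve the stacked linear system. This is our own minimal formulation (uniform edge weights, one vertex anchored to fix the free translation), not the patent's solver:

```python
import numpy as np

def reconstruct(verts, edges, T):
    """Least-squares vertex recovery from per-vertex deformation gradients T."""
    n = len(verts)
    rows, rhs = [], []
    for i, j in edges:
        # one equation per edge: v'_i - v'_j = T_i (v_i - v_j)
        r = np.zeros(n)
        r[i], r[j] = 1.0, -1.0
        rows.append(r)
        rhs.append(T[i] @ (verts[i] - verts[j]))
    # anchor vertex 0 at the origin to remove the translational ambiguity
    a = np.zeros(n)
    a[0] = 1.0
    rows.append(a)
    rhs.append(np.zeros(3))
    A, b = np.array(rows), np.array(rhs)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol

verts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0]])
edges = [(0, 1), (0, 2), (1, 2)]
T = np.stack([2.0 * np.eye(3)] * 3)   # uniform 2x scaling as the gradients
print(np.round(reconstruct(verts, edges, T), 3))
```

With uniform 2x-scaling gradients, the recovered triangle is the template scaled by two about the anchored vertex.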
the camera projection parameters P are expressed as P = {s, R, t}, wherein s is a scaling parameter, R is a rotation matrix, and t is a translation parameter; according to the predicted vertex coordinates of the three-dimensional exaggerated face model and the weak perspective projection formula q̂'_t = s · Π · R · v̂'_t + t, wherein Π is the orthographic projection retaining the x and y coordinates, the two-dimensional key point coordinates can be obtained.
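The weak perspective projection step can be illustrated as follows; the numeric values of s, R, and t are made-up examples, and Π is the standard 2x3 orthographic projection that drops the z coordinate:

```python
import numpy as np

def weak_perspective(v, s, R, t):
    """Project a 3D point to 2D: q = s * Pi * R * v + t."""
    Pi = np.array([[1., 0., 0.],
                   [0., 1., 0.]])      # keep x and y, drop z
    return s * (Pi @ (R @ v)) + t

v = np.array([1.0, 2.0, 5.0])          # a 3D vertex
s = 0.5                                 # scaling parameter
R = np.eye(3)                           # identity rotation for the example
t = np.array([10.0, 20.0])              # 2D translation
print(weak_perspective(v, s, R, t))     # x' = 10.5, y' = 21.0
```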
5. The ironic portrait keypoint detection and three-dimensional reconstruction method of any one of claims 1 to 4, wherein the loss function of the training phase is:
E = λ_1 E_ver + λ_2 E_lan + λ_3 E_srt

wherein {λ_k | k = 1, 2, 3} are weight parameters;
E_ver is the vertex-based loss function

E_ver = (1/N_v) Σ_{i=1}^{N_v} || v̂'_i − v'_i ||²

wherein v̂'_i is the vertex coordinate with index subscript i in the predicted three-dimensional exaggerated face model, v'_i is the vertex coordinate with index subscript i in the corresponding three-dimensional exaggerated face model in the data set, and N_v is the total number of vertices;
E_lan is the loss function based on two-dimensional key points

E_lan = (1/T) Σ_{t=1}^{T} || q̂'_t − q'_t ||²

wherein L' is the set of three-dimensional key points selected from the predicted vertex set of the three-dimensional exaggerated face model, {q̂'_t | t = 1, ..., T} is the set of predicted two-dimensional key point coordinates obtained by projecting L', q'_t is the corresponding two-dimensional key point coordinate annotated in the data set, and T is the total number of two-dimensional key points;
E_srt is the loss function based on the camera projection parameters, measuring the error between the predicted parameters s, R, t and the corresponding parameters annotated in the data set.
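The claim-5 training objective can be sketched as a weighted sum of the three losses. The weights and, in particular, the exact form of E_srt (here a simple squared error on the camera parameters) are our assumptions, not the patent's definitions:

```python
import torch

def total_loss(v_pred, v_gt, q_pred, q_gt, cam_pred, cam_gt,
               lambdas=(1.0, 1.0, 0.1)):
    """E = l1 * E_ver + l2 * E_lan + l3 * E_srt (illustrative weights)."""
    # E_ver: mean squared distance between predicted and ground-truth vertices
    e_ver = torch.mean(torch.sum((v_pred - v_gt) ** 2, dim=-1))
    # E_lan: mean squared distance between predicted and annotated 2D keypoints
    e_lan = torch.mean(torch.sum((q_pred - q_gt) ** 2, dim=-1))
    # assumed E_srt: squared error on the camera projection parameters
    e_srt = torch.mean((cam_pred - cam_gt) ** 2)
    l1, l2, l3 = lambdas
    return l1 * e_ver + l2 * e_lan + l3 * e_srt

# example: all-zero predictions against all-one targets
loss = total_loss(torch.zeros(4, 3), torch.ones(4, 3),
                  torch.zeros(2, 2), torch.ones(2, 2),
                  torch.zeros(6), torch.ones(6))
print(round(loss.item(), 4))  # 3.0 + 2.0 + 0.1 * 1.0 = 5.1
```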
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010316895.5A CN111524226B (en) | 2020-04-21 | 2020-04-21 | Method for detecting key point and three-dimensional reconstruction of ironic portrait painting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111524226A true CN111524226A (en) | 2020-08-11 |
CN111524226B CN111524226B (en) | 2023-04-18 |
Family
ID=71903414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010316895.5A Active CN111524226B (en) | 2020-04-21 | 2020-04-21 | Method for detecting key point and three-dimensional reconstruction of ironic portrait painting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111524226B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH1074271A (en) * | 1996-08-30 | 1998-03-17 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for preparing three-dimensional portrait |
CN101751689A (en) * | 2009-09-28 | 2010-06-23 | 中国科学院自动化研究所 | Three-dimensional facial reconstruction method |
CN108242074A (en) * | 2018-01-02 | 2018-07-03 | 中国科学技术大学 | A kind of three-dimensional exaggeration human face generating method based on individual satire portrait painting |
CN108805977A (en) * | 2018-06-06 | 2018-11-13 | 浙江大学 | A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks |
CN109508678A (en) * | 2018-11-16 | 2019-03-22 | 广州市百果园信息技术有限公司 | Training method, the detection method and device of face key point of Face datection model |
US20190392634A1 (en) * | 2018-10-23 | 2019-12-26 | Hangzhou Qu Wei Technology Co., Ltd. | Real-Time Face 3D Reconstruction System and Method on Mobile Device |
WO2020001082A1 (en) * | 2018-06-30 | 2020-01-02 | 东南大学 | Face attribute analysis method based on transfer learning |
Non-Patent Citations (2)
Title |
---|
Wang Haijun; Yang Shiying; Wang Yanfei: "Research on a portrait caricature generation algorithm based on NMF and LS-SVM" *
Dong Xiaoli; Li Weijun; Ning Xin; Zhang Liping; Lu Yaxuan: "A stylized portrait generation algorithm using a triangular coordinate system" *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112308957A (en) * | 2020-08-14 | 2021-02-02 | 浙江大学 | Optimal fat and thin face portrait image automatic generation method based on deep learning |
CN112700524A (en) * | 2021-03-25 | 2021-04-23 | 江苏原力数字科技股份有限公司 | 3D character facial expression animation real-time generation method based on deep learning |
CN112700524B (en) * | 2021-03-25 | 2021-07-02 | 江苏原力数字科技股份有限公司 | 3D character facial expression animation real-time generation method based on deep learning |
CN113129347A (en) * | 2021-04-26 | 2021-07-16 | 南京大学 | Self-supervision single-view three-dimensional hairline model reconstruction method and system |
CN113129347B (en) * | 2021-04-26 | 2023-12-12 | 南京大学 | Self-supervision single-view three-dimensional hairline model reconstruction method and system |
CN113538221A (en) * | 2021-07-21 | 2021-10-22 | Oppo广东移动通信有限公司 | Three-dimensional face processing method, training method, generating method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111524226B (en) | Method for detecting key point and three-dimensional reconstruction of ironic portrait painting | |
Liu et al. | Editing conditional radiance fields | |
Chai et al. | Autohair: Fully automatic hair modeling from a single image | |
CN111325851B (en) | Image processing method and device, electronic equipment and computer readable storage medium | |
Hu et al. | Single-view hair modeling using a hairstyle database | |
Hu et al. | Robust hair capture using simulated examples | |
Wang et al. | High resolution acquisition, learning and transfer of dynamic 3‐D facial expressions | |
Tao et al. | Bayesian tensor approach for 3-D face modeling | |
Shen et al. | Deepsketchhair: Deep sketch-based 3d hair modeling | |
Zhang et al. | Hair-GAN: Recovering 3D hair structure from a single image using generative adversarial networks | |
Yu et al. | Content-aware photo collage using circle packing | |
CN108242074B (en) | Three-dimensional exaggeration face generation method based on single irony portrait painting | |
Lv et al. | 3D facial expression modeling based on facial landmarks in single image | |
Bao et al. | A survey of image-based techniques for hair modeling | |
Luo et al. | Simpmodeling: Sketching implicit field to guide mesh modeling for 3d animalmorphic head design | |
CN110717978B (en) | Three-dimensional head reconstruction method based on single image | |
Shi et al. | Geometric granularity aware pixel-to-mesh | |
Jung et al. | Deep deformable 3d caricatures with learned shape control | |
Sang et al. | Agileavatar: Stylized 3d avatar creation via cascaded domain bridging | |
Sun et al. | Cgof++: Controllable 3d face synthesis with conditional generative occupancy fields | |
Luo et al. | Facial metamorphosis using geometrical methods for biometric applications | |
Xi et al. | A data-driven approach to human-body cloning using a segmented body database | |
Yu et al. | Mean value coordinates–based caricature and expression synthesis | |
He et al. | Data-driven 3D human head reconstruction | |
CN113379890A (en) | Character bas-relief model generation method based on single photo |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||