CN111524226B - Method for keypoint detection and three-dimensional reconstruction of caricature portraits - Google Patents

Method for keypoint detection and three-dimensional reconstruction of caricature portraits

Info

Publication number
CN111524226B
CN111524226B (application CN202010316895.5A)
Authority
CN
China
Prior art keywords
dimensional
face
model
vertex
exaggerated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010316895.5A
Other languages
Chinese (zh)
Other versions
CN111524226A (en)
Inventor
张举勇
蔡泓锐
郭玉东
彭妆
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010316895.5A priority Critical patent/CN111524226B/en
Publication of CN111524226A publication Critical patent/CN111524226A/en
Application granted granted Critical
Publication of CN111524226B publication Critical patent/CN111524226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a method for keypoint detection and three-dimensional reconstruction of caricature portraits, comprising the following steps: constructing a convolutional neural network, and collecting a dataset comprising a three-dimensional face template model, caricature portraits, labeled two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by an existing method; training the network on this dataset so that, for an input caricature, the convolutional neural network outputs a deformation representation model and camera projection parameters, from which the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates are predicted. The method removes the need to label keypoints on caricatures by hand: with the help of a new face deformation representation and a large dataset, the trained convolutional neural network reconstructs an exaggerated three-dimensional face directly from the predicted deformation representation model, and obtains the two-dimensional keypoint coordinates from the simultaneously predicted camera projection parameters.

Description

Method for keypoint detection and three-dimensional reconstruction of caricature portraits
Technical Field
The invention relates to the technical fields of image processing and three-dimensional modeling, and in particular to a method for keypoint detection and three-dimensional reconstruction of caricature portraits.
Background
Caricature is an artistic form expressed through two-dimensional images and three-dimensional models. By exaggerating certain features or details of a human face it creates a humorous visual effect, and it is widely used in films, advertisements, and social media. Work in computer vision, cognitive psychology, and related fields has also shown that this art form can effectively improve the accuracy of face recognition. Because of its research potential and wide range of uses, caricature-related problems are attracting a growing number of researchers and companies.
Keypoint detection for caricatures: compared with a normal face, a caricature is exaggerated and diverse, which makes its keypoints difficult to identify; as a result, there are few automatic keypoint detection algorithms for caricatures. On the other hand, many caricature research topics rely on keypoints, and labeling them manually is not only tedious but also time-consuming and labor-intensive. Developing a keypoint detection algorithm for caricatures is therefore significant: it fills a gap in this line of research and supports the development of related topics.
Most popular keypoint detection algorithms for normal faces are data-driven and depend on the design of a deep neural network. Such algorithms generally extract visual features of the face, or statistical features of the face-image pixels, from a single picture and regress the keypoint positions; the extraction methods include knowledge-based and algebraic-feature-based approaches. An exaggerated face is rooted in a normal face and must satisfy the basic structure of a face, such as having the expected number of eyes, a mouth, a nose, and ears. However, a caricature usually exaggerates features of the normal face, so a given feature, such as the distribution of keypoints around the eyes, can differ greatly between pictures. Because exaggeration differentiates and diversifies the features, keypoint detection algorithms for caricatures remain relatively rare.
Three-dimensional reconstruction of caricatures: at present there are two main ways to obtain a three-dimensional exaggerated face model: manual modeling and reconstruction based on deformation algorithms. Manual modeling, the earliest three-dimensional modeling approach, is still widely used to produce exaggerated three-dimensional face models, but it typically requires specially trained artists working in professional modeling software such as Autodesk Maya. Although manual modeling is highly accurate, it costs a great deal of time and manpower, so obtaining three-dimensional exaggerated face models with deformation algorithms is more popular. Deformation algorithms have the advantage of automatic generation, yet the generated models are often limited in exaggeration style, and are neither as diverse nor as accurate as the variously shaped three-dimensional exaggerated faces obtained by manual modeling. Moreover, most existing deformation algorithms depend on keypoints, so labeling remains time-consuming and labor-intensive, and once the labels are inaccurate the generated model may not match the original two-dimensional caricature.
Traditional image-based generation of a normal three-dimensional face model usually first builds three-dimensional models of a group of people, for example by camera capture, then constructs a corresponding face database and, through statistical or dimensionality-reduction methods, builds a parameterized face model (linear or nonlinear). The complex three-dimensional face is thereby parameterized into a low-dimensional space, and the corresponding normal face is reconstructed from its coordinate representation in that space. The traditional idea for generating an exaggerated face is to label two-dimensional keypoints on a single picture and to generate the corresponding exaggerated face through keypoint constraints and the constructed parameterized model. This approach depends heavily on keypoints: labeling takes time, and once the labels are inaccurate the reconstructed three-dimensional model is directly degraded.
Disclosure of Invention
The invention aims to provide a method for keypoint detection and three-dimensional reconstruction of caricature portraits, which can automatically and quickly detect the keypoints of an exaggerated face and generate the corresponding three-dimensional model, and which has practical value in face recognition, animation generation, expression transfer, AR/VR, and other fields.
The purpose of the invention is realized by the following technical scheme:
a method for detecting key points and reconstructing three-dimensional of ironic portrait painting comprises the following steps:
constructing a convolutional neural network, and collecting a data set which comprises a three-dimensional face template model, sarcasia portrait, marked two-dimensional key point coordinates and a three-dimensional exaggerated face model generated based on the existing method; the three-dimensional face template model and the three-dimensional exaggerated face model have the same topological structure;
in the training stage, a three-dimensional face template model is used as a template face, a deformation representation model of each ironic portrait is calculated, and camera projection parameters are output; predicting the corresponding three-dimensional exaggerated face model vertex coordinates and two-dimensional key point coordinates according to the deformation representation model and the camera projection parameters, and constructing a loss function in a training stage according to the three-dimensional exaggerated face model vertex coordinates and the two-dimensional key point coordinates, so that the network is trained in a supervision mode;
after training, corresponding deformation representation model and camera projection parameters are obtained for the ironic portrait painting input, and therefore the vertex coordinates and the two-dimensional key point coordinates of the three-dimensional exaggerated face model are predicted.
As can be seen from the above technical solution: 1) the deformation representation constrains deformations of the face, so a generated face keeps the properties of a face, while the expressive deformation representation model can still produce faces in exaggerated styles; 2) the face deformation model and the camera projection parameters can be regressed directly from a single picture by the convolutional neural network; 3) acting together, the two yield a more accurate three-dimensional exaggerated face model and, at the same time, more accurate two-dimensional keypoint coordinates.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings described here show only some embodiments of the present invention; other drawings can be derived from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a flowchart of a method for keypoint detection and three-dimensional reconstruction of caricature portraits according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolutional neural network according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of test results produced by a trained convolutional neural network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments derived by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In caricature face recognition, keypoint detection algorithms designed for normal faces are often not accurate enough, because the distribution of some facial features differs greatly between pictures, and substantial time is still needed to adjust the keypoint positions after detection. In caricature three-dimensional reconstruction, the basis models of traditional methods lack expressive power, so the reconstructed face model is not exaggerated enough; reconstruction algorithms based on optimization and keypoint constraints depend too heavily on keypoint labels, and once the labels are not accurate enough the generated three-dimensional model deviates considerably from the two-dimensional picture. To this end, an embodiment of the present invention provides a method for keypoint detection and three-dimensional reconstruction of caricature portraits, as shown in fig. 1, which mainly includes the following steps:
step 1, constructing a convolutional neural network, and collecting a data set which comprises a three-dimensional face template model, sarcasm portrait painting, marked two-dimensional key point coordinates and a three-dimensional exaggerated face model generated based on the existing method.
The method mainly comprises the steps of constructing a network and collecting data; because the data set has the diversity of acquisition modes and the possibility of different data set processing, the three-dimensional exaggerated face model in the data set is required to have the same topological structure as the three-dimensional face template model, namely, different data share the same vertex number and adjacency relation, and the vertex sequence is the same on different models; in addition, the acquired face data is set to be sufficiently diverse.
Those skilled in the art will appreciate that the above-described normal face data set satisfying such conditions may be obtained by conventional means.
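The topology requirement above (same vertex count, vertex ordering, and adjacency across all models) can be checked mechanically. Below is a minimal NumPy sketch; the function name and the toy meshes are illustrative, not part of the patent.

```python
import numpy as np

def same_topology(verts_a, faces_a, verts_b, faces_b):
    """Check that two triangle meshes share vertex count and vertex ordering.
    Identical face index arrays imply identical adjacency relations."""
    if verts_a.shape[0] != verts_b.shape[0]:
        return False  # different number of vertices
    return faces_a.shape == faces_b.shape and np.array_equal(faces_a, faces_b)

# Template face and a deformed (exaggerated) face with the same connectivity.
faces = np.array([[0, 1, 2], [0, 2, 3]])
template = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
deformed = template * np.array([1.0, 2.5, 1.0])  # exaggerate along y

print(same_topology(template, faces, deformed, faces))  # → True
```

Only the connectivity is compared; the vertex positions are free to differ, which is exactly what a deformed face does.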
Step 2, in the training stage, using the three-dimensional face template model as the template face, computing a deformation representation model for each caricature and outputting camera projection parameters; predicting the corresponding vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates from the deformation representation model and the camera projection parameters, and constructing the training-stage loss function from these vertex and keypoint coordinates, so that the network is trained with supervision.
First, the computation of the deformation representation model is described.
Denote the set of vertices on the three-dimensional face template model as V = {v_i | i = 1, ..., N_v}, formed by all vertices v_i of the single three-dimensional face, where i is the index subscript and N_v is the total number of vertices. Since all face data in the dataset share the same vertex count, vertex ordering, and adjacency relations, the vertex set V together with an index i identifies the same vertex on every model.
Take the three-dimensional face template model as the template face, and the three-dimensional exaggerated face model corresponding to the caricature as the deformed face. For the vertex v'_i with index i on the deformed face and the vertex v_i with index i on the template face, the deformation gradient T_i between them is obtained by minimizing the energy function:

E(T_i) = Σ_{j∈N_i} c_ij ‖ e'_ij - T_i e_ij ‖²

where N_i is the index set of the 1-ring neighborhood centered at the vertex with index i; the vertex with index j ∈ N_i is denoted v_j on the template face and v'_j on the deformed face; e'_ij is the edge from vertex v'_i to vertex v'_j on the deformed face, and e_ij is the edge from vertex v_i to vertex v_j on the template face; c_ij is the cotangent Laplacian weight of the template face.

After the deformation gradient of each vertex is obtained, matrix polar decomposition factors T_i into R_i S_i, where R_i is the rotation matrix component of the deformation gradient from vertex v_i to vertex v'_i, and S_i is the scaling matrix component.
rotating the matrix R by matrix operation i Equivalent is expressed as exp (logR) i ) Then, the deformation representation model from the template face to the deformed face is written as:
f n ={logR i ;S i -I|i=1,...,N v }
wherein, I is a unit array, and the introduction aims at constructing a coordinate system, V n ={v' i |i=1,...,N v The vertex set on the three-dimensional exaggerated face model is used as the vertex set; the purpose of logR is to make the operation R on the rotation matrix i R j Can be expressed as exp (logR) i +logR j ) This allows the multiplication to be simplified to an addition.
By encoding the deformation from the template face to every deformed face over the three-dimensional exaggerated face model dataset, a deformation representation set F = {f_n | n = 1, ..., N} based on the template face is obtained, where N is the number of elements in the set, i.e., the number of three-dimensional models in the face dataset. Illustratively, N = 7800.
The set F is recorded as a matrix of size N × M; the n-th row of the matrix is the deformation representation f_n of the exaggerated face numbered n with respect to the template face. For each f_n, the deformation {log R_i ; S_i - I} of its i-th vertex v'_i is recorded as a 9-dimensional vector (3 parameters for the skew-symmetric log R_i and 6 for the symmetric S_i - I), so M = N_v × 9, where N_v is, as above, the total number of vertices on the three-dimensional face mesh.
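The per-vertex quantities just described (the deformation gradient, its polar decomposition, and the 9-dimensional {log R_i ; S_i - I} entry) can be sketched in NumPy as follows. This is an illustrative implementation under simplifying assumptions: uniform edge weights stand in for the cotangent Laplacian weights c_ij, and all function names are invented for the example.

```python
import numpy as np

def deformation_gradient(edges_t, edges_d):
    """Least-squares T_i with T_i e_ij ≈ e'_ij, for template edges (rows of
    edges_t) and deformed edges (rows of edges_d); uniform weights here."""
    Tt, *_ = np.linalg.lstsq(edges_t, edges_d, rcond=None)  # solves E Tᵀ = E'
    return Tt.T

def polar_decompose(T):
    """Matrix polar decomposition T = R S: R a proper rotation, S symmetric."""
    U, sig, Vt = np.linalg.svd(T)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt, Vt.T @ D @ np.diag(sig) @ Vt

def log_rotation(R):
    """Logarithm of a rotation matrix (Rodrigues); a skew-symmetric matrix."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-10:
        return np.zeros((3, 3))
    return theta / (2.0 * np.sin(theta)) * (R - R.T)

def feature9(R, S):
    """The 9 numbers stored per vertex: 3 from skew log R, 6 from symmetric S - I."""
    L, D = log_rotation(R), S - np.eye(3)
    return np.array([L[2, 1], L[0, 2], L[1, 0],
                     D[0, 0], D[0, 1], D[0, 2], D[1, 1], D[1, 2], D[2, 2]])

# Example: deform template edges by a 45° z-rotation composed with axis scaling.
a = np.pi / 4
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
S0 = np.diag([1.5, 1.0, 0.8])
T0 = Rz @ S0
edges_t = np.eye(3)            # template edges e_ij to three neighbors
edges_d = edges_t @ T0.T       # deformed edges e'_ij = T0 e_ij

T = deformation_gradient(edges_t, edges_d)
R, S = polar_decompose(T)
feat = feature9(R, S)
print(np.allclose(T, T0), np.round(feat[:3], 3))  # recovers T0; log R ≈ (0, 0, π/4)
```

The decomposition cleanly separates the 45° rotation from the anisotropic scaling, and the feature entry D[0, 0] = 0.5 records the 1.5× stretch relative to the identity.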
As shown in fig. 2, the convolutional neural network comprises an encoder and a decoder. The encoder encodes the caricature into a K-dimensional latent vector, which is split into two parts: a K1-dimensional vector holding the camera projection parameters, and a K2-dimensional vector that the decoder decodes into the deformation representation model, with K1 + K2 = K.
Illustratively, ResNet-34 may be used as the encoder and a 3-layer fully connected neural network as the decoder.
For example, the resolution of the input caricature may be 224 × 224, with K = 216, K1 = 6, K2 = 210.
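The encoder/decoder split described above can be sketched in PyTorch. This is a hedged sketch: a tiny convolutional encoder stands in for ResNet-34, the vertex count is reduced to a toy value, and all layer sizes other than K = 216, K1 = 6, K2 = 210 are invented for illustration.

```python
import torch
import torch.nn as nn

K, K1, K2 = 216, 6, 210   # latent size = camera parameters + deformation code
NV = 100                  # toy vertex count; the real mesh is much larger

class CaricatureNet(nn.Module):
    def __init__(self, n_vertices=NV):
        super().__init__()
        # Stand-in encoder; the patent uses ResNet-34 on a 224x224 input.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, K),
        )
        # 3-layer fully connected decoder: K2-dim code -> 9 values per vertex.
        self.decoder = nn.Sequential(
            nn.Linear(K2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 9),
        )

    def forward(self, img):
        z = self.encoder(img)
        cam = z[:, :K1]        # scale (1) + Euler angles (3) + translation (2)
        code = z[:, K1:]       # K2-dim deformation code
        deform = self.decoder(code).view(-1, NV, 9)
        return cam, deform

net = CaricatureNet()
cam, deform = net(torch.randn(2, 3, 224, 224))
print(cam.shape, deform.shape)
```

The forward pass returns a (batch, 6) camera-parameter tensor and a (batch, N_v, 9) deformation representation, matching the K1/K2 split in the text.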
Based on the above principle, during training the three-dimensional face template model serves as the template face and a caricature is input; the deformation gradient is formed from the predicted rotation and scaling matrix components in the deformation representation model, the vertex coordinates of the three-dimensional exaggerated face model corresponding to the caricature are predicted, and the two-dimensional keypoint coordinates are predicted using the camera projection parameters output by the network. A loss function is then constructed from the labeled two-dimensional keypoint coordinates and the three-dimensional exaggerated face model (ground truth) corresponding to the caricature in the dataset, and continued training drives the vertex coordinates and two-dimensional keypoint coordinates predicted by the network toward these ground-truth values.
The preferred embodiment of network training is as follows:
for a ironic portrait, a deformation representation model can be obtained by a convolutional neural network, and is represented as:
Figure BDA0002459748670000051
wherein the content of the first and second substances,
Figure BDA0002459748670000052
representing predicted vertices v i To v 'to vertex' i Rotation matrix component of deformation gradient,/>
Figure BDA0002459748670000053
Representing predicted vertices v i To vertex v' i A scaling matrix component of the deformation gradient; marking/conjunction>
Figure BDA0002459748670000054
Figure BDA0002459748670000055
Denotes a vertex v 'with index subscript i on the predicted warped face' i And a vertex v with subscript i corresponding to the template face i A deformation gradient;
From the predicted deformation gradients T̂_i, the vertex coordinates of the three-dimensional exaggerated face model are predicted by solving the optimization problem:

min_{v̂'} Σ_{i=1}^{N_v} Σ_{j∈N_i} c_ij ‖ (v̂'_i - v̂'_j) - T̂_i (v_i - v_j) ‖²

where v̂'_i is the predicted vertex coordinate with index i of the three-dimensional exaggerated face model, and v̂'_j is the predicted coordinate of the vertex with index j in the set N_i. Since the objective is quadratic, solving this optimization problem is equivalent to solving a linear system of equations, which yields the vertex coordinates of the three-dimensional exaggerated face.
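The quadratic problem above reduces to a linear least-squares system in the unknown vertex coordinates. A toy NumPy sketch follows, with two simplifying assumptions: uniform weights stand in for the cotangent weights c_ij, and one vertex is pinned, since the objective determines the solution only up to a global translation.

```python
import numpy as np

def reconstruct_vertices(V, nbr, T, c, anchor=0):
    """Least-squares solve for v' given per-vertex deformation gradients T_i:
    minimize sum_i sum_{j in N_i} c_ij || (v'_i - v'_j) - T_i (v_i - v_j) ||^2.
    The solution is unique only up to translation, so vertex `anchor` is
    pinned (here, at its template position, for the demo)."""
    n = len(V)
    rows, rhs = [], []
    for i in range(n):
        for j in nbr[i]:
            w = np.sqrt(c[i][j])
            r = np.zeros((3, 3 * n))
            r[:, 3 * i:3 * i + 3] = w * np.eye(3)
            r[:, 3 * j:3 * j + 3] = -w * np.eye(3)
            rows.append(r)
            rhs.append(w * T[i] @ (V[i] - V[j]))
    r = np.zeros((3, 3 * n))
    r[:, 3 * anchor:3 * anchor + 3] = np.eye(3)   # pin the anchor vertex
    rows.append(r)
    rhs.append(V[anchor])
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return x.reshape(n, 3)

# Toy tetrahedron, uniform weights, identity gradients -> recovers the template.
V = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
nbr = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
c = {i: {j: 1.0 for j in nbr[i]} for i in nbr}
print(np.allclose(reconstruct_vertices(V, nbr, [np.eye(3)] * 4, c), V))  # → True
```

With T̂_i = 2I for every vertex the same solve returns the uniformly doubled mesh, which is the behavior the energy prescribes.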
The camera projection parameters P are expressed as P = (ŝ, r̂, t̂), where ŝ is a scale parameter, r̂ is an Euler-angle vector from which a rotation matrix R(r̂) is derived, and t̂ is a translation parameter. As in the previous example with K1 = 6, ŝ, r̂, and t̂ are 1-, 3-, and 2-dimensional vectors, respectively. From the predicted vertex coordinates of the three-dimensional exaggerated face model and the weak-perspective projection formula, the two-dimensional keypoint coordinates are obtained:

q̂_t = ŝ · Π · R(r̂) · v̂'_t + t̂,   v̂'_t ∈ L',  t = 1, ..., T

where Π is the orthographic projection onto the image plane, L' is the set of three-dimensional keypoints selected from the predicted vertex set of the three-dimensional exaggerated face model, {q̂_t} is the two-dimensional keypoint set, and T is the total number of two-dimensional keypoints.
For example, the keypoints may be the 68 keypoints covering the face contour, eyebrows, eyes, nose, and mouth, or keypoints in another form; the corresponding three-dimensional keypoints are selected according to the chosen form to compose the set L'.
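The weak-perspective projection above can be sketched as follows; the Euler-angle convention and the function names are assumptions made for illustration.

```python
import numpy as np

def euler_to_R(rx, ry, rz):
    """Rotation matrix from Euler angles (x, then y, then z; assumed order)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def weak_perspective(points3d, s, euler, t):
    """q = s * Pi * R * v + t, where Pi orthographically drops the z axis."""
    Pi = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    R = euler_to_R(*euler)
    return s * (points3d @ R.T) @ Pi.T + t

landmarks3d = np.array([[0.0, 0.0, 1.0], [1.0, 2.0, 3.0]])
q = weak_perspective(landmarks3d, s=2.0, euler=(0.0, 0.0, 0.0),
                     t=np.array([10.0, 20.0]))
print(q)  # identity rotation: [[10. 20.] [12. 24.]]
```

Depth only influences the result through the rotation; after R is applied, the z coordinate is discarded, which is what distinguishes weak perspective from a full perspective camera.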
During training, the data in the dataset serve as ground truth (supervision information). For an input caricature, the convolutional neural network constructed in step 1 outputs, following the method introduced above, a deformation representation model f̂ and camera projection parameters P, from which the predicted three-dimensional vertex coordinates v̂'_i and two-dimensional keypoint coordinates q̂_t are obtained.
In the embodiment of the present invention, the loss function in the training phase includes three parts:
1) Vertex-based loss function E_ver
Using the vertex coordinates of the three-dimensional exaggerated face corresponding to the caricature in the dataset as supervision information, the loss is:

E_ver = Σ_{i=1}^{N_v} ‖ v̂'_i - v'_i ‖²

where v̂'_i is the vertex coordinate with index i in the predicted three-dimensional exaggerated face model, and v'_i is the vertex coordinate with index i in the corresponding three-dimensional exaggerated face model in the dataset.
2) Two-dimensional keypoint loss function E_lan
Using the corresponding labeled two-dimensional keypoint coordinates in the dataset as supervision information, the loss is:

E_lan = Σ_{t=1}^{T} ‖ q̂_t - q'_t ‖²

where q̂_t is a predicted two-dimensional keypoint coordinate, obtained by projecting the three-dimensional keypoints L' selected from the predicted vertex set of the three-dimensional exaggerated face model; q'_t is the corresponding labeled two-dimensional keypoint coordinate in the dataset; and T is the total number of two-dimensional keypoints.
3) Camera-projection-parameter loss function E_srt
Since the keypoint loss depends on both the three-dimensional vertex coordinates and the camera parameters, additional supervision information is needed to constrain the camera parameters individually at the start of training. The loss is:

E_srt = ‖ ŝ - s ‖² + ‖ r̂ - r ‖² + ‖ t̂ - t ‖²

where ŝ, r̂, and t̂ are the predicted scale, Euler-angle, and translation parameters, and s, r, and t are the corresponding ground-truth values.
Finally, the loss function for the training phase is:
E = λ_1 E_ver + λ_2 E_lan + λ_3 E_srt

where {λ_k | k = 1, 2, 3} are weight parameters; illustratively, λ_1 = 1, λ_2 = 0.00001, λ_3 = 0.0001.
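The combined loss can be checked numerically. In the sketch below, each term is taken as a plain sum of squared errors, matching the formulas above; any normalization by vertex or keypoint count is omitted as an assumption, and the variable names are illustrative.

```python
import numpy as np

def total_loss(v_pred, v_gt, q_pred, q_gt, cam_pred, cam_gt,
               lam=(1.0, 1e-5, 1e-4)):
    """E = l1*E_ver + l2*E_lan + l3*E_srt, each a sum of squared errors."""
    e_ver = np.sum((v_pred - v_gt) ** 2)   # 3D vertex loss
    e_lan = np.sum((q_pred - q_gt) ** 2)   # 2D keypoint loss
    e_srt = sum(np.sum((p - g) ** 2) for p, g in zip(cam_pred, cam_gt))
    return lam[0] * e_ver + lam[1] * e_lan + lam[2] * e_srt

v = np.zeros((4, 3))
q = np.zeros((3, 2))
cam = (np.array([1.0]), np.zeros(3), np.zeros(2))  # (scale, euler, translation)
print(total_loss(v, v, q, q, cam, cam))  # → 0.0 for a perfect prediction
```

Shifting every predicted vertex by 1 raises the loss by exactly λ_1 · N_v · 3 = 12 in this toy setup, which makes the weighting easy to sanity-check.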
In an embodiment of the present invention, the model is trained with the PyTorch deep learning framework; supervised learning may proceed by reading multiple groups of data at a time (for example, 32), and training is completed after multiple cycles (for example, 2000).
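The supervised training described in this paragraph (batches of 32, many cycles) can be sketched as a PyTorch skeleton. A toy regression model and random stand-in data replace the full encoder/decoder and the three-part loss; everything here is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in: regress a (K=216)-dim latent from flattened 16x16 "portraits".
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 64), nn.ReLU(),
                      nn.Linear(64, 216))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 1, 16, 16)   # one batch of 32 samples, as in the text
target = torch.randn(32, 216)         # stand-in supervision (latent ground truth)

losses = []
for epoch in range(50):               # the text trains ~2000 cycles at full scale
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(images), target)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(losses[-1] < losses[0])  # the loss decreases on this toy problem
```

In the full method, `mse_loss` would be replaced by the weighted sum λ_1 E_ver + λ_2 E_lan + λ_3 E_srt and the batch would come from a dataset loader.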
Step 3, after training, obtaining the corresponding deformation representation model and camera projection parameters for an input caricature, and thereby predicting the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates.
The test procedure mirrors training: the caricature is fed to the trained convolutional neural network, which yields the deformation representation model and the camera projection parameters; from these, the vertex coordinates of the three-dimensional exaggerated face model (which can be constructed directly, since the topology is known) and the two-dimensional keypoint coordinates are predicted.
Fig. 3 schematically shows some test results: the first row is the input two-dimensional caricature (224 × 224), the second row is the predicted three-dimensional exaggerated face model, and the third row is the image annotated with the predicted two-dimensional keypoints.
Compared with traditional picture-based keypoint detection and three-dimensional reconstruction algorithms, the scheme of the embodiment of the invention has the following main advantages:
1) By parameterizing a three-dimensional nonlinear deformation model, the algorithm strengthens the expressive power of the convolutional neural network and accomplishes keypoint detection on exaggerated faces.
2) Through the convolutional neural network, the algorithm reconstructs a three-dimensional face model from a two-dimensional exaggerated face picture end to end.
3) Trained on the large dataset built for this task, the algorithm recognizes and models caricatures of different styles and by different artists far more accurately than traditional algorithms.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (3)

1. A method for keypoint detection and three-dimensional reconstruction of caricature portraits, characterized by comprising the following steps:
constructing a convolutional neural network, and collecting a dataset comprising a three-dimensional face template model, caricature portraits, labeled two-dimensional keypoint coordinates, and three-dimensional exaggerated face models generated by an existing method, the three-dimensional face template model and the three-dimensional exaggerated face models having the same topological structure;
in the training stage, using the three-dimensional face template model as the template face, computing a deformation representation model for each caricature and outputting camera projection parameters; predicting the corresponding vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates from the deformation representation model and the camera projection parameters, and constructing the training-stage loss function from these vertex and keypoint coordinates, so that the network is trained with supervision;
after training, obtaining the corresponding deformation representation model and camera projection parameters for an input caricature, and thereby predicting the vertex coordinates of the three-dimensional exaggerated face model and the two-dimensional keypoint coordinates;
the three-dimensional face template model and the three-dimensional exaggerated face model having the same topological structure means that the two models share the same number of vertices and the same adjacency relations, and that the vertex order is the same across the different models; the set of vertices on the three-dimensional face template model is denoted $V = \{v_i \mid i = 1, \ldots, N_v\}$, i.e., $V$ consists of all vertices $v_i$ of the single-face three-dimensional data, where $i$ is the index subscript and $N_v$ is the total number of vertices;
during training, the three-dimensional face template model is used as the template face, and an ironic portrait painting is input to obtain a deformation representation model $f$ and camera projection parameters $P$;
the deformation representation model is expressed as:

$f = \{ (\log R_i, S_i) \mid i = 1, \ldots, N_v \}$

where $R_i \in \mathbb{R}^{3 \times 3}$ is the rotation matrix component of the deformation gradient from vertex $v_i$ to the predicted vertex $v'_i$, and $S_i \in \mathbb{R}^{3 \times 3}$ is the scaling matrix component of that deformation gradient; $T_i = R_i S_i$ denotes the deformation gradient between the vertex $v'_i$ with index subscript $i$ on the predicted deformed face and the corresponding vertex $v_i$ with index subscript $i$ on the template face;
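As an illustrative aside (not part of the claim), the split of a deformation gradient $T_i$ into its rotation component $R_i$ and scaling component $S_i$ corresponds to a standard polar decomposition, which can be sketched via SVD; the function name and test values below are assumptions for illustration:

```python
import numpy as np

def polar_decompose(T):
    """Split a 3x3 deformation gradient T into a rotation R and a
    symmetric scaling/shear factor S so that T = R @ S (polar decomposition)."""
    U, sigma, Vt = np.linalg.svd(T)
    R = U @ Vt
    # Guard against reflections: keep R a proper rotation (det = +1).
    if np.linalg.det(R) < 0:
        U[:, -1] *= -1
        sigma[-1] *= -1
        R = U @ Vt
    S = Vt.T @ np.diag(sigma) @ Vt
    return R, S

# A deformation gradient built from a known rotation and scaling.
angle = np.pi / 6
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
S_true = np.diag([1.5, 1.0, 0.8])
R, S = polar_decompose(Rz @ S_true)
print(np.allclose(R, Rz), np.allclose(S, S_true))  # True True
```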
according to the predicted deformation gradients $T_i = R_i S_i$, the vertex coordinates of the three-dimensional exaggerated face model are predicted by solving the optimization problem:

$\min_{v'_1, \ldots, v'_{N_v}} \sum_{i=1}^{N_v} \sum_{j \in N_i} \left\| (v'_i - v'_j) - R_i S_i (v_i - v_j) \right\|_2^2$

where $v'_i$ is the predicted vertex coordinate with index subscript $i$ in the three-dimensional exaggerated face model, and $v'_j$ is the predicted vertex coordinate with index subscript $j$ in the neighborhood set $N_i$ of vertex $i$;
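A minimal sketch of this optimization on a toy three-vertex "mesh": because every term involves only differences of vertices, it is a linear least-squares problem in the unknown coordinates, with one vertex pinned to remove the translation ambiguity. The toy data and variable names are assumptions, not taken from the patent:

```python
import numpy as np

# Toy "mesh": 3 vertices, each pair adjacent (a single triangle).
template = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

# Per-vertex deformation gradients T_i = R_i S_i; here a uniform 2x scaling.
T = [2.0 * np.eye(3) for _ in range(3)]

# Build the least-squares system  (v'_i - v'_j) = T_i (v_i - v_j)
# for all i and j in N_i.  Unknowns: the 9 coordinates of v'_0 .. v'_2.
rows, rhs = [], []
for i, nbrs in neighbors.items():
    for j in nbrs:
        for d in range(3):              # one equation per coordinate
            row = np.zeros(9)
            row[3 * i + d] = 1.0
            row[3 * j + d] = -1.0
            rows.append(row)
            rhs.append(T[i][d] @ (template[i] - template[j]))
# Pin v'_0 to the origin: the energy only constrains vertex differences.
for d in range(3):
    row = np.zeros(9)
    row[d] = 1.0
    rows.append(row)
    rhs.append(0.0)

sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
v_pred = sol.reshape(3, 3)
print(v_pred)   # the template vertices uniformly scaled by 2
```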
the camera projection parameters $P$ are expressed as:

$P = \{ s, R, t \}$

where $s \in \mathbb{R}$ is the scale parameter, $R \in \mathbb{R}^{3 \times 3}$ is the rotation matrix, and $t \in \mathbb{R}^2$ is the translation parameter; according to the predicted vertex coordinates of the three-dimensional exaggerated face model and the weak perspective projection formula, the two-dimensional key point coordinates are obtained:

$q_t = s \Pi R v'_t + t, \quad v'_t \in L', \quad t = 1, \ldots, T$

where $L'$ is the set of three-dimensional key points selected from the predicted vertex set of the three-dimensional exaggerated face model, $\Pi = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$ is the orthographic projection matrix, $Q = \{ q_t \mid t = 1, \ldots, T \}$ is the two-dimensional key point set, and $T$ is the total number of two-dimensional key points.
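The weak perspective projection step can be sketched as follows; the function name and toy inputs are assumptions for illustration:

```python
import numpy as np

def project_keypoints(V3d, s, R, t):
    """Weak-perspective projection q = s * Pi * R * v + t for each 3D
    keypoint v, where Pi keeps the x and y components after rotation."""
    Pi = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])        # orthographic projection
    return s * (V3d @ R.T @ Pi.T) + t       # (T, 2) array of 2D keypoints

# Identity rotation, scale 0.5, shift by (10, 20).
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
q = project_keypoints(V, 0.5, np.eye(3), np.array([10.0, 20.0]))
print(q)   # rows: (10.5, 21.0) and (12.0, 22.5)
```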
2. The method for key point detection and three-dimensional reconstruction of ironic portrait paintings of claim 1, characterized in that the convolutional neural network comprises an encoder and a decoder; the encoder encodes the ironic portrait painting as a K-dimensional hidden vector, which is split into two parts: one part is a K1-dimensional vector, namely the camera projection parameters; the other part is a K2-dimensional vector, which is decoded by the decoder into the deformation representation model; wherein K1 + K2 = K.
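A hedged sketch of the latent-vector split described in claim 2; the concrete values of K1 and K2 and the internal layout of the camera part are assumptions (the claim only requires K1 + K2 = K):

```python
import numpy as np

K1, K2 = 6, 128          # illustrative sizes: K1 camera dims, K2 shape dims

def split_latent(z):
    """Split the K-dim hidden vector produced by the encoder into the
    camera-projection part (K1 dims) and the deformation code (K2 dims,
    which a decoder would map to the deformation representation f)."""
    assert z.shape[-1] == K1 + K2
    cam, shape_code = z[..., :K1], z[..., K1:]
    # A possible layout of the K1 camera dims: scale (1), rotation as
    # three Euler angles (3), 2D translation (2) -- an assumption here.
    s, euler, t = cam[..., 0], cam[..., 1:4], cam[..., 4:6]
    return s, euler, t, shape_code

z = np.arange(K1 + K2, dtype=float)
s, euler, t, code = split_latent(z)
print(s, euler, t, code.shape)
```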
3. The method for key point detection and three-dimensional reconstruction of ironic portrait paintings of claim 1 or 2, characterized in that the loss function of the training stage is:

$E = \lambda_1 E_{ver} + \lambda_2 E_{lan} + \lambda_3 E_{srt}$

where $\{\lambda_k \mid k = 1, 2, 3\}$ are weight parameters;

$E_{ver}$ is the vertex-based loss function:

$E_{ver} = \frac{1}{N_v} \sum_{i=1}^{N_v} \left\| \hat{v}_i - v'_i \right\|_2^2$

where $\hat{v}_i$ is the predicted vertex coordinate with index subscript $i$ in the three-dimensional exaggerated face model, $v'_i$ is the vertex coordinate with index subscript $i$ in the corresponding three-dimensional exaggerated face model in the data set, and $N_v$ is the total number of vertices;

$E_{lan}$ is the loss function based on the two-dimensional key points:

$E_{lan} = \frac{1}{T} \sum_{t=1}^{T} \left\| \hat{q}_t - q'_t \right\|_2^2$

where $\hat{q}_t$ are the predicted two-dimensional key point coordinates obtained by projecting the three-dimensional key point set $L'$ selected from the predicted vertex set of the three-dimensional exaggerated face model, $q'_t$ are the corresponding annotated two-dimensional key point coordinates in the data set, and $T$ is the total number of two-dimensional key points;

$E_{srt}$ is the loss function based on the camera projection parameters:

$E_{srt} = \left\| \hat{s} - s' \right\|_2^2 + \left\| \hat{R} - R' \right\|_F^2 + \left\| \hat{t} - t' \right\|_2^2$

where $\hat{s}$, $\hat{R}$ and $\hat{t}$ are the predicted scale parameter, rotation matrix and translation parameter, and $s'$, $R'$ and $t'$ are the corresponding camera projection parameters in the data set.
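The three-term loss can be assembled as in this sketch; the choice of norms for the camera term, the uniform weights, and the toy inputs are assumptions made for illustration:

```python
import numpy as np

def training_loss(v_pred, v_gt, q_pred, q_gt,
                  s_pred, s_gt, R_pred, R_gt, t_pred, t_gt,
                  lambdas=(1.0, 1.0, 1.0)):
    """Total loss E = l1*E_ver + l2*E_lan + l3*E_srt, assembled from the
    three terms of claim 3 (exact norms and weights are an assumption)."""
    l1, l2, l3 = lambdas
    E_ver = np.mean(np.sum((v_pred - v_gt) ** 2, axis=1))    # vertex term
    E_lan = np.mean(np.sum((q_pred - q_gt) ** 2, axis=1))    # landmark term
    E_srt = ((s_pred - s_gt) ** 2                            # camera term
             + np.sum((R_pred - R_gt) ** 2)
             + np.sum((t_pred - t_gt) ** 2))
    return l1 * E_ver + l2 * E_lan + l3 * E_srt

v = np.zeros((4, 3))
q = np.zeros((5, 2))
E = training_loss(v, v + 1.0, q, q, 1.0, 1.0, np.eye(3), np.eye(3),
                  np.zeros(2), np.zeros(2))
print(E)   # 3.0: each of the 4 vertices is off by (1, 1, 1)
```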
CN202010316895.5A 2020-04-21 2020-04-21 Method for detecting key point and three-dimensional reconstruction of ironic portrait painting Active CN111524226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010316895.5A CN111524226B (en) 2020-04-21 2020-04-21 Method for detecting key point and three-dimensional reconstruction of ironic portrait painting

Publications (2)

Publication Number Publication Date
CN111524226A CN111524226A (en) 2020-08-11
CN111524226B true CN111524226B (en) 2023-04-18

Family

ID=71903414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010316895.5A Active CN111524226B (en) 2020-04-21 2020-04-21 Method for detecting key point and three-dimensional reconstruction of ironic portrait painting

Country Status (1)

Country Link
CN (1) CN111524226B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308957B (en) * 2020-08-14 2022-04-26 浙江大学 Optimal fat and thin face portrait image automatic generation method based on deep learning
CN112700524B (en) * 2021-03-25 2021-07-02 江苏原力数字科技股份有限公司 3D character facial expression animation real-time generation method based on deep learning
CN113129347B (en) * 2021-04-26 2023-12-12 南京大学 Self-supervision single-view three-dimensional hairline model reconstruction method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1074271A (en) * 1996-08-30 1998-03-17 Nippon Telegr & Teleph Corp <Ntt> Method and device for preparing three-dimensional portrait
CN101751689A (en) * 2009-09-28 2010-06-23 中国科学院自动化研究所 Three-dimensional facial reconstruction method
CN108242074A (en) * 2018-01-02 2018-07-03 中国科学技术大学 A kind of three-dimensional exaggeration human face generating method based on individual satire portrait painting
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
CN109508678A (en) * 2018-11-16 2019-03-22 广州市百果园信息技术有限公司 Training method, the detection method and device of face key point of Face datection model
WO2020001082A1 (en) * 2018-06-30 2020-01-02 东南大学 Face attribute analysis method based on transfer learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10755477B2 (en) * 2018-10-23 2020-08-25 Hangzhou Qu Wei Technology Co., Ltd. Real-time face 3D reconstruction system and method on mobile device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Haijun; Yang Shiying; Wang Yanfei. Research on a portrait caricature generation algorithm based on NMF and LS-SVM. Video Engineering. 2013, (19), full text. *
Dong Xiaoli; Li Weijun; Ning Xin; Zhang Liping; Lu Yaxuan. A stylized portrait generation algorithm using a triangular coordinate system. Journal of Xi'an Jiaotong University. 2018, (04), full text. *

Similar Documents

Publication Publication Date Title
Gafni et al. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction
Shen et al. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis
CN111524226B (en) Method for detecting key point and three-dimensional reconstruction of ironic portrait painting
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
Pighin et al. Modeling and animating realistic faces from images
Liao et al. Automatic caricature generation by analyzing facial features
Hu et al. Robust hair capture using simulated examples
Zhuang et al. Dreameditor: Text-driven 3d scene editing with neural fields
Shamai et al. Synthesizing facial photometries and corresponding geometries using generative adversarial networks
Shen et al. Deepsketchhair: Deep sketch-based 3d hair modeling
CN108242074B (en) Three-dimensional exaggeration face generation method based on single irony portrait painting
Yu et al. Content-aware photo collage using circle packing
Zhang et al. Hair-GAN: Recovering 3D hair structure from a single image using generative adversarial networks
Clarke et al. Automatic generation of 3D caricatures based on artistic deformation styles
Lv et al. 3D facial expression modeling based on facial landmarks in single image
Bao et al. A survey of image-based techniques for hair modeling
Shi et al. Geometric granularity aware pixel-to-mesh
CN110717978A (en) Three-dimensional head reconstruction method based on single image
Jung et al. Deep deformable 3d caricatures with learned shape control
Kao et al. Towards 3d face reconstruction in perspective projection: Estimating 6dof face pose from monocular image
Sun et al. Cgof++: Controllable 3d face synthesis with conditional generative occupancy fields
Xi et al. A data-driven approach to human-body cloning using a segmented body database
Zhang et al. Dyn-e: Local appearance editing of dynamic neural radiance fields
Du et al. SAniHead: Sketching animal-like 3D character heads using a view-surface collaborative mesh generative network
Yu et al. Mean value coordinates–based caricature and expression synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant