CN110598601A - Face 3D key point detection method and system based on distributed thermodynamic diagram - Google Patents

Face 3D key point detection method and system based on distributed thermodynamic diagram

Info

Publication number
CN110598601A
CN110598601A (Application CN201910818437.9A)
Authority
CN
China
Prior art keywords
distributed
thermodynamic diagram
network
thermodynamic
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910818437.9A
Other languages
Chinese (zh)
Inventor
王正宁
何庆东
赵德明
刘怡君
曾仪
曾浩
张翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910818437.9A priority Critical patent/CN110598601A/en
Publication of CN110598601A publication Critical patent/CN110598601A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T3/06
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face 3D key point detection method and system based on distributed thermodynamic diagrams, comprising the following steps: performing dimensionality-reduction projection of the N 3D reference coordinate vectors of the face key points in a database onto three two-dimensional planes; encoding each 2D reference coordinate vector into a distributed thermodynamic diagram with a distributed encoding sub-network; combining the N distributed thermodynamic diagrams into a 2D joint thermodynamic diagram by coordinate mapping; superposing the three 2D joint thermodynamic diagrams into a 3D joint thermodynamic diagram with a concat operation; and decoding the 3D joint thermodynamic diagram into N 3D detected coordinate vectors with a decoding sub-network. The method combines the advantages of existing 2D and 3D face key point detection methods: it constructs distributed thermodynamic diagrams and merges them by coordinate mapping, the distributed encoding sub-network model is simple with a small computational load, and model parameters can be further reduced and inference speed increased while maintaining high detection accuracy.

Description

Face 3D key point detection method and system based on distributed thermodynamic diagram
Technical Field
The invention relates to the technical field of image processing and computer vision, and in particular to a face 3D key point detection method and system based on distributed thermodynamic diagrams (heat maps).
Background
With the rapid development of deep learning technology in the field of computer vision, various face image processing tasks are widely applied in life, wherein face key point detection plays an important role in face recognition, expression recognition, face reconstruction and the like.
Face key point detection has achieved tremendous success in the past decade, particularly in 2D face key point detection. The ASM (Active Shape Model) algorithm proposed by Cootes et al., based on a point distribution model, is a classic face key point detection algorithm: a training set is first annotated manually, a shape model is obtained by training, and matching of a specific object is then realized by matching key points. The CPR (Cascaded Pose Regression) algorithm proposed by Dollar refines a designated initial prediction step by step through a series of regressors, each depending on the output of the previous one to execute simple image operations, so that the whole system can learn automatically from training samples. In addition, Zhang et al. proposed the multi-task cascaded convolutional neural network MTCNN (Multi-Task Cascaded Convolutional Network) for handling face detection and face key point localization simultaneously. However, in complex scenes such as large-angle poses and face occlusion, 2D-based face key point detection is difficult and limited. To address this limitation, more and more researchers have turned to 3D face key point detection, which conveys more information, including occlusion cues, than 2D detection.
3D face key point detection methods are roughly classified into model-based and non-model-based methods. Model-based: the three-dimensional morphable model (3DMM) proposed by Blanz et al. is a common approach to 3D face key point detection. Non-model-based: Tulyakov et al. proposed locating 3D face key points by computing three-dimensional shape features through cascade regression, extending the cascade regression method to 3D face key point detection. In addition, deep learning models are also used for face key point detection, mainly divided into two-stage regression methods and volumetric representation methods: a typical two-stage regression method decouples the coordinates by axis and regresses them in two stages, while the volumetric representation method expands the traditional 2D thermodynamic diagram into a 3D volumetric form and is widely applied in human body key point detection.
However, as the dimensionality of 3D space increases, the processing speed and accuracy of the corresponding algorithms face severe challenges, and existing 3D face key point detection algorithms have shortcomings, to varying degrees, in processing speed, model size and complexity, and model accuracy.
Disclosure of Invention
An objective of the present invention is to overcome the above problems in the prior art and provide a face 3D key point detection method and system based on distributed thermodynamic diagrams that simplify the model and increase processing speed while preserving accuracy.
In order to achieve the above object, the present invention adopts the following aspects.
A face 3D key point detection method based on distributed thermodynamic diagrams comprises the following steps:
Step 101, performing dimensionality-reduction projection of N 3D reference coordinate vectors of face key points in a database onto three two-dimensional planes, the xy, xz and yz planes, where x, y and z are all positive or all negative simultaneously; each two-dimensional plane contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors;
Step 102, encoding each 2D reference coordinate vector into a distributed thermodynamic diagram with a distributed encoding sub-network; N distributed thermodynamic diagrams are obtained for each two-dimensional plane;
Step 103, combining the N distributed thermodynamic diagrams of a two-dimensional plane into one 2D joint thermodynamic diagram by coordinate mapping;
Step 104, superposing the 2D joint thermodynamic diagrams of the three two-dimensional planes into a 3D joint thermodynamic diagram with a concat operation;
Step 105, decoding the 3D joint thermodynamic diagram into N 3D detected coordinate vectors with a decoding sub-network.
Preferably, the distributed encoding sub-network encodes each 2D reference coordinate vector into a set of continuous values and selects the maximum of the set as the coded value; the thermodynamic diagram corresponding to the coded value is the distributed thermodynamic diagram of that 2D reference coordinate vector.
Preferably, the distributed encoding sub-network is constructed from a k-order hourglass network and trained on face images with coordinate annotations, forming a nonlinear mapping whose input is a face image with coordinate vectors and whose output is a distributed thermodynamic diagram.
Preferably, the decoding sub-network is constructed from a 2D fully convolutional network and trained on 3D joint thermodynamic diagrams, forming a nonlinear mapping whose input is the 3D joint thermodynamic diagram and whose output is the 3D detected coordinate vectors.
Preferably, the decoding sub-network comprises five 2D convolutional layers with 128, 128, 256, 256 and 512 convolution kernels respectively; each kernel is 4 × 4 with a stride of 2; batch normalization and LeakyReLU activation are inserted between the convolutional layers.
A distributed thermodynamic diagram based 3D face keypoint detection system comprising at least one processor, and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.
In summary, due to the adoption of the technical scheme, the invention at least has the following beneficial effects:
The representation dimension of the 3D key point coordinate vectors is reduced by projecting them onto two-dimensional planes; combining the advantages of 2D and 3D face key point detection, a face key point detection model coupling distributed thermodynamic diagrams with coordinate regression is provided; the projected two-dimensional coordinates are distributively encoded and then expressed as a 2D joint thermodynamic diagram through coordinate mapping, preserving the relations among the coordinates while reducing the representation dimensionality; and the joint thermodynamic diagram is decoded by coordinate regression into the final detected 3D key point coordinates, realizing direct detection of N 3D key point coordinates from a single 2D face image.
Drawings
Fig. 1 is a flowchart of a distributed thermodynamic diagram-based face 3D keypoint detection method according to an exemplary embodiment of the present invention.
Fig. 2 is a schematic diagram of a distributed coding subnetwork structure according to an exemplary embodiment of the present invention.
Fig. 3 shows 2D face key points and their corresponding distributed thermodynamic diagrams, combined by coordinate mapping into a joint thermodynamic diagram (leftmost: the projection of the joint thermodynamic diagram onto a plane), according to an exemplary embodiment of the present invention.
Fig. 4 is a schematic diagram of a one-stage stacked hourglass network configuration according to an exemplary embodiment of the present invention.
Fig. 5 is a schematic diagram of a first order hourglass network structure according to an exemplary embodiment of the present invention.
Fig. 6 is a schematic diagram of a decoding sub-network structure according to an exemplary embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a face 3D keypoint detection system based on a distributed thermodynamic diagram according to an exemplary embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments, so that the objects, technical solutions and advantages of the present invention will be more clearly understood. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 illustrates a face 3D keypoint detection method based on distributed thermodynamic diagrams according to an exemplary embodiment of the present invention. The method of this embodiment mainly includes:
Step 101, performing dimensionality-reduction projection of N 3D reference coordinate vectors of face key points in a database onto three two-dimensional planes, the xy, xz and yz planes, where x, y and z are all positive or all negative simultaneously; each two-dimensional plane contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors.
specifically, the 3D reference coordinate vectors of N key points of the face are extracted from a group route (generally abbreviated as GT information) data set, and the total number of the key points of the general face is 68, so that N is preferably 68 in this embodiment. And performing dimensionality reduction decomposition on the extracted N3D key point reference coordinate vectors (x, y, z) in three two-dimensional planes.
In the projection, each 3D vector is decomposed into the three 2D reference coordinate vectors (x, y), (y, z) and (x, z). Let V_{x,y,z} = (x, y, z) denote a key point's 3D reference coordinate vector; the three 2D reference coordinate vectors generated from it are v_{x,y} = (x, y), v_{y,z} = (y, z) and v_{x,z} = (x, z).
for example: a three-dimensional space coordinate point is (1, -2, 3), and dimension reduction decomposition is carried out to obtain (1, -2), (-2, 3) and (1, 3), but in order to form a joint 2D thermodynamic diagram later, three coordinate planes of xy, yz and xz (x, y and z have the same positive and negative polarities and are positive or negative at the same time) are projected during dimension reduction; therefore, three two-dimensional reference coordinates with the same positive and negative characters can be obtained by each three-dimensional coordinate after dimension reduction. Preferably, we project it in three planes in the first quadrant (x, y, z are all positive) of the spatial coordinate system.
102, respectively encoding each 2D reference coordinate vector into a distributed thermodynamic diagram by adopting a distributed encoding sub-network; wherein, N distributed thermodynamic diagrams can be obtained by one two-dimensional plane;
specifically, fig. 2 shows the distributed coding sub-network structure, where the distributed coding sub-network codes each 2D reference coordinate vector as a set of continuous values, and selects the maximum value as a coding value, and uses the thermodynamic diagram corresponding to the coding value as the distributed thermodynamic diagram of the 2D reference coordinate vector. Order toIndicates that the mth thermodynamic diagram is located at (i)m,jm) The value of (c), m ∈ {1,2,3 }. For the nth key point on the face image, the position is vx,y,vy,z,vx,zThe (x, y) coordinate vector is encoded in 2D gaussian form (the other two coordinate vectors do the same operation), as shown in equation (1) (σ is the variance):
for a face image with N key points, selecting the maximum value of each key point in a series of continuous values coded by the key point through a max function as a coded value, wherein the thermodynamic diagram corresponding to the coded value is the distributed thermodynamic diagram of the corresponding 2D reference coordinate vector. Fig. 3 (taking N ═ 15 as an example) shows the projection (leftmost) of the corresponding 2D face key points and their corresponding distributed thermodynamic diagrams, and the joint thermodynamic diagram formed by the joint of the distributed thermodynamic diagrams through coordinate mapping, on the image plane.
The distributed encoding sub-network is constructed from a k-order hourglass network model (for example, k = 1), and the network ultimately used for distributed encoding is obtained by training: the sub-network is trained on face images with coordinate annotations to form a nonlinear mapping whose input is a face image with coordinate vectors and whose output is a distributed thermodynamic diagram. Since the distributed encoding sub-network only needs to map a face image with coordinate vectors to a distributed thermodynamic diagram, the corresponding model is simple and compact, which further increases its execution speed.
As shown in figs. 4 and 5, the k-order (k = 1) hourglass network model is centered on an hourglass sub-network (256 input channels, 512 output channels; the specific structure is shown in fig. 5), with other modules added around it to form a one-stage stacked hourglass network. The original image first passes through a convolutional layer (kernel size 7) followed by batch normalization, is then down-sampled by max-value pooling, and then passes through three residual modules (each with 128 input and output channels) before entering the hourglass sub-network. The hourglass output is processed by two linear-transformation modules and then channel-converted by a convolution (kernel size 1) to obtain the final thermodynamic diagram. The first-order hourglass network is shown in fig. 5: the upper and lower branches both comprise several residual modules (the first number denotes input channels, the second output channels) that extract deeper features step by step. The upper branch operates at the original scale, while the lower branch undergoes down-sampling followed by up-sampling; down-sampling uses max pooling and up-sampling uses nearest-neighbor interpolation. Finally, the outputs of the two branches are added to obtain the final output.
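The two-branch structure of one hourglass level can be sketched as below (Python/NumPy assumed). The residual modules are represented by placeholder callables, since their internals are not the point here; the 2×2 max pooling and nearest-neighbor up-sampling match the text.

```python
import numpy as np

def maxpool2(x):
    """2x2 max pooling, the down-sampling used by the lower branch."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbor 2x up-sampling, restoring the original scale."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def hourglass_level(x, upper_branch, lower_branch):
    """One level of a first-order hourglass: the upper branch keeps the
    original scale, the lower branch runs at half resolution, and the
    two outputs are added (residual modules are stand-in callables)."""
    return upper_branch(x) + upsample2(lower_branch(maxpool2(x)))

# Identity branches just demonstrate that resolution is preserved.
out = hourglass_level(np.ones((8, 8)), lambda x: x, lambda x: x)
```

In the actual network the placeholder branches would be stacks of residual modules, and the recursion to deeper levels happens inside the lower branch.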
Since the size of the joint thermodynamic diagram is w × h × 3, the encoding resolution is typically set to 128 × 128 × 3 for a face image of size 256 × 256, so that the distributed encoding sub-network E forms a mapping E(I) → H from the input face image I to the distributed thermodynamic diagram H. The network takes a face image of size 128 × 128 as input and outputs a w × h distributed thermodynamic diagram (the output resolution can be set according to actual needs). The loss function, a mean-square error between the predicted and reference thermodynamic diagrams, is shown in equation (2):
L_E = (1 / (w·h)) Σ_{i,j} (E(I)_{i,j} − H_{i,j})²   (2)
103, combining the N distributed thermodynamic diagrams under a two-dimensional plane into a 2D combined thermodynamic diagram in a coordinate mapping mode;
specifically, the N thermodynamic diagrams obtained in each two-dimensional plane in step 102 are combined into one thermodynamic diagram in a corresponding point coordinate mapping manner, so as to obtain three 2D combined thermodynamic diagrams with the size of w × h.
Step 104, superposing the 2D joint thermodynamic diagrams under the three two-dimensional planes into a 3D joint thermodynamic diagram through a concat algorithm;
specifically, 2D joint thermodynamic diagrams under three two-dimensional planes are superposed by adopting a concat method to obtain a 3D thermodynamic diagram. The Concat method is a joint vector algorithm used to join two or more arrays. These three 2D joint thermodynamic diagrams can be superimposed together by the concat method, resulting in a 3D thermodynamic diagram with a size of w × h × 3 (where 3 represents 3 channels), as shown in equation (3):
H=concat(p1,p2,p3) (3)
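The concat step amounts to stacking the three w × h joint maps along a new channel axis (Python/NumPy assumed; channel-last layout is an assumption):

```python
import numpy as np

# Three w x h joint maps (one per projection plane), here w = h = 64.
p1, p2, p3 = (np.zeros((64, 64)) for _ in range(3))

# Stack into one w x h x 3 volume, mirroring H = concat(p1, p2, p3).
H = np.stack([p1, p2, p3], axis=-1)
```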
and 105, decoding the 3D joint thermodynamic diagram into N3D detection coordinate vectors by adopting a decoding sub network.
Specifically, the 3D joint thermodynamic diagram is decoded by the decoding sub-network to obtain the detected coordinate vectors of the N 3D key points.
The decoding sub-network may be pre-trained to form a mapping D(H) → c from the joint thermodynamic diagram H to the corresponding 3D coordinate vectors c. Since the size of the joint thermodynamic diagram H is w × h × 3, the decoding sub-network is constructed as a 2D fully convolutional network, as shown in fig. 6. It comprises five 2D convolutional layers with 128, 128, 256, 256 and 512 convolution kernels respectively, each of size 4 × 4 with a stride of 2; the last convolutional layer has N × 3 channels, batch normalization and LeakyReLU activation are inserted between the convolutional layers, and the final layer is a global average pooling layer. Passing the 3D joint thermodynamic diagram obtained by the concat method through the decoding sub-network yields the N 3D key point coordinate vectors. Further, the decoding sub-network is pre-trained with a mean-square-error loss function, shown in equation (5):
L_D = ||D(H) − c||²   (5)
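The spatial shrinkage through the five stride-2 convolutional layers can be checked with the standard convolution output-size formula (a padding of 1 is an assumption; kernel size 4 and stride 2 are given in the text):

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of one conv layer: (size + 2p - k) // s + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Trace a 128 x 128 joint map through the five decoder conv layers.
sizes = [128]
for _ in range(5):
    sizes.append(conv_out(sizes[-1]))
# 128 -> 64 -> 32 -> 16 -> 8 -> 4, then global average pooling
```

With these assumptions the 4 × 4 output of the last layer is what the global average pooling collapses into the N × 3 coordinate values.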
therefore, the extraction of the detection coordinate vectors of the N3D key points of the human face is completed.
Further, when building the algorithm model, the distributed encoding sub-network and the decoding sub-network can each be pre-trained separately and then connected together and fine-tuned as a whole (in the implementation, coordinate mapping is added between the two network models to combine the N thermodynamic diagrams, and the concat operation is added to superpose the 2D joint thermodynamic diagrams into the 3D thermodynamic diagram). This proceeds in two steps:
the first step is as follows: in the pre-training stage, the distributed coding sub-network is trained by using a face image with a coordinate vector, so that the non-linear mapping with 1 face image with the coordinate vector as input and one distributed thermodynamic diagram as output layer is formed. At the same time, the decoding sub-networks are trained using a 3D joint thermodynamic diagram to form a non-linear mapping whose input is the 3D joint thermodynamic diagram and whose output is the 3D detection coordinate vector.
Second step, fine-tuning: the pre-trained decoding sub-network is connected behind the pre-trained distributed encoding sub-network, with the coordinate-mapping and concat operations added between the two networks, forming the complete distributed-thermodynamic-diagram face 3D key point detection network model. The complete model is fine-tuned, and finally the whole network is trained end to end with the following loss function:
L = L_E + λ · L_D
where L_D is the loss function of the decoding sub-network; L_E is the loss function of the distributed encoding sub-network; λ is the weight of the coordinate regression loss (typically a number less than 1, such as 0.1); D denotes the decoding sub-network; c the 3D detected coordinate vectors; H the distributed thermodynamic diagram; E the distributed encoding sub-network; and I the face image with coordinate vectors.
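The end-to-end objective, the encoding loss plus λ times the coordinate-regression loss, can be sketched as below (Python/NumPy assumed; plain MSE for both terms is an assumption consistent with the mean-square-error losses described for the two sub-networks):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def total_loss(H_pred, H_gt, c_pred, c_gt, lam=0.1):
    """Fine-tuning loss: heatmap (encoding) loss plus lambda times the
    coordinate-regression (decoding) loss. lam = 0.1 follows the text's
    example weight; the MSE form of each term is assumed."""
    return mse(H_pred, H_gt) + lam * mse(c_pred, c_gt)

# Toy values: both terms equal 1.0, so the total is 1.0 + 0.1 * 1.0.
loss = total_loss(np.zeros((4, 4)), np.ones((4, 4)),
                  np.zeros(6), np.ones(6))
```

During fine-tuning, gradients from both terms flow through the shared encoding sub-network, which is why the coordinate term is down-weighted.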
Fig. 7 illustrates a face 3D key point detection system based on joint thermodynamic diagrams according to an exemplary embodiment of the invention, namely an electronic device 310 (e.g., a computer server with program-execution capability) comprising at least one processor 311, a power supply 314, and a memory 312 and an input/output interface 313 communicatively connected to the at least one processor 311. The memory 312 stores instructions executable by the at least one processor 311 to enable the at least one processor 311 to perform the method disclosed in any of the foregoing embodiments; the input/output interface 313 may include a display, keyboard, mouse and USB interface for data input/output; and the power supply 314 powers the electronic device 310.
Those skilled in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
When the integrated unit of the present invention is implemented in the form of a software functional unit and sold or used as a separate product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The foregoing is merely a detailed description of specific embodiments of the invention and is not intended to limit the invention. Various alterations, modifications and improvements will occur to those skilled in the art without departing from the spirit and scope of the invention.

Claims (6)

1. A face 3D key point detection method based on distributed thermodynamic diagrams, characterized by comprising the following steps:
Step 101, performing dimensionality-reduction projection of N 3D reference coordinate vectors of face key points in a database onto three two-dimensional planes, the xy, xz and yz planes, where x, y and z are all positive or all negative simultaneously; each two-dimensional plane contains N 2D reference coordinate vectors corresponding to the N 3D reference coordinate vectors;
Step 102, encoding each 2D reference coordinate vector into a distributed thermodynamic diagram with a distributed encoding sub-network; N distributed thermodynamic diagrams are obtained for each two-dimensional plane;
Step 103, combining the N distributed thermodynamic diagrams of a two-dimensional plane into one 2D joint thermodynamic diagram by coordinate mapping;
Step 104, superposing the 2D joint thermodynamic diagrams of the three two-dimensional planes into a 3D joint thermodynamic diagram with a concat operation;
Step 105, decoding the 3D joint thermodynamic diagram into N 3D detected coordinate vectors with a decoding sub-network.
2. The method of claim 1, characterized in that the distributed encoding sub-network encodes each 2D reference coordinate vector as a set of continuous values and selects the maximum of the set as the coded value, the thermodynamic diagram corresponding to the coded value being the distributed thermodynamic diagram of that 2D reference coordinate vector.
3. The method of claim 1, characterized in that the distributed encoding sub-network is constructed from a k-order hourglass network and trained on face images with coordinate annotations, forming a nonlinear mapping whose input is a face image with coordinate vectors and whose output is a distributed thermodynamic diagram.
4. The method of claim 1, characterized in that the decoding sub-network is constructed from a 2D fully convolutional network and trained on 3D joint thermodynamic diagrams, forming a nonlinear mapping whose input is the 3D joint thermodynamic diagram and whose output is the 3D detected coordinate vectors.
5. The method of claim 4, characterized in that the decoding sub-network comprises five 2D convolutional layers with 128, 128, 256, 256 and 512 convolution kernels respectively, each kernel being 4 × 4 with a stride of 2, and batch normalization and LeakyReLU activation inserted between the convolutional layers.
6. A human face 3D key point detection system based on distributed thermodynamic diagrams is characterized by comprising at least one processor and a memory which is in communication connection with the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
CN201910818437.9A 2019-08-30 2019-08-30 Face 3D key point detection method and system based on distributed thermodynamic diagram Pending CN110598601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910818437.9A CN110598601A (en) 2019-08-30 2019-08-30 Face 3D key point detection method and system based on distributed thermodynamic diagram


Publications (1)

Publication Number Publication Date
CN110598601A true CN110598601A (en) 2019-12-20

Family

ID=68856555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910818437.9A Pending CN110598601A (en) 2019-08-30 2019-08-30 Face 3D key point detection method and system based on distributed thermodynamic diagram

Country Status (1)

Country Link
CN (1) CN110598601A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106530389A (en) * 2016-09-23 2017-03-22 西安电子科技大学 Three-dimensional reconstruction method based on medium wave infrared face image
CN108805977A (en) * 2018-06-06 2018-11-13 浙江大学 A kind of face three-dimensional rebuilding method based on end-to-end convolutional neural networks
US20190377409A1 (en) * 2018-06-11 2019-12-12 Fotonation Limited Neural network image processing apparatus
CN109241910A (en) * 2018-09-07 2019-01-18 高新兴科技集团股份有限公司 A kind of face key independent positioning method returned based on the cascade of depth multiple features fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENGNING WANG et al.: "A Light-Weighted Network for Facial Landmark Detection via Combined Heatmap and Coordinate Regression", 2019 IEEE International Conference on Multimedia and Expo *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523480A (en) * 2020-04-24 2020-08-11 北京嘀嘀无限科技发展有限公司 Method and device for detecting face obstruction, electronic equipment and storage medium
CN111523480B (en) * 2020-04-24 2021-06-18 北京嘀嘀无限科技发展有限公司 Method and device for detecting face obstruction, electronic equipment and storage medium
CN111737396A (en) * 2020-08-26 2020-10-02 成都四方伟业软件股份有限公司 Method and device for improving thermodynamic diagram display performance based on 2D convolution
CN113688664A (en) * 2021-07-08 2021-11-23 三星(中国)半导体有限公司 Face key point detection method and face key point detection device
CN113688664B (en) * 2021-07-08 2024-04-26 三星(中国)半导体有限公司 Face key point detection method and face key point detection device
CN113705488A (en) * 2021-08-31 2021-11-26 中国电子科技集团公司第二十八研究所 Remote sensing image fine-grained airplane identification method based on local segmentation and feature fusion
CN114757822A (en) * 2022-06-14 2022-07-15 之江实验室 Binocular-based human body three-dimensional key point detection method and system
CN115565207A (en) * 2022-11-29 2023-01-03 武汉图科智能科技有限公司 Occlusion scene downlink person detection method with feature simulation fused

Similar Documents

Publication Publication Date Title
CN110598601A (en) Face 3D key point detection method and system based on distributed thermodynamic diagram
Mittal et al. Autosdf: Shape priors for 3d completion, reconstruction and generation
Parmar et al. Image transformer
CN110188768B (en) Real-time image semantic segmentation method and system
Lu et al. 3DCTN: 3D convolution-transformer network for point cloud classification
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
EP3396603A1 (en) Learning an autoencoder
Jiang et al. Dual attention mobdensenet (damdnet) for robust 3d face alignment
CN112215050A (en) Nonlinear 3DMM face reconstruction and posture normalization method, device, medium and equipment
CN110675316B (en) Multi-domain image conversion method, system and medium for generating countermeasure network based on condition
Spurek et al. Hypernetwork approach to generating point clouds
JP7135659B2 (en) SHAPE COMPLEMENTATION DEVICE, SHAPE COMPLEMENTATION LEARNING DEVICE, METHOD, AND PROGRAM
CN114677412B (en) Optical flow estimation method, device and equipment
CN113706686A (en) Three-dimensional point cloud reconstruction result completion method and related components
CN112132739B (en) 3D reconstruction and face pose normalization method, device, storage medium and equipment
CN114049435A (en) Three-dimensional human body reconstruction method and system based on Transformer model
CN110516643A (en) A kind of face 3D critical point detection method and system based on joint thermodynamic chart
CN111598111A (en) Three-dimensional model generation method and device, computer equipment and storage medium
US20220335685A1 (en) Method and apparatus for point cloud completion, network training method and apparatus, device, and storage medium
CN111462274A (en) Human body image synthesis method and system based on SMPL model
CN110516642A (en) A kind of lightweight face 3D critical point detection method and system
Kim et al. Deep translation prior: Test-time training for photorealistic style transfer
CN114494543A (en) Action generation method and related device, electronic equipment and storage medium
Zamyatin et al. Learning to generate chairs with generative adversarial nets
WO2023071806A1 (en) Apriori space generation method and apparatus, and computer device, storage medium, computer program and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191220