CN109902641B

CN109902641B - Semantic alignment-based face key point detection method, system and device

Info

Publication number: CN109902641B
Application number: CN201910168643.XA
Authority: CN
Inventors: 朱翔昱; 雷震; 王金桥; 刘智威
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2019-03-06
Filing date: 2019-03-06
Publication date: 2021-03-02
Anticipated expiration: 2039-03-06
Also published as: CN109902641A

Abstract

The invention belongs to the field of face recognition and in particular relates to a face key point detection method, system and device based on semantic alignment, aiming to improve the accuracy of face key point detection. Starting from a face key point detection network that has been trained to basic convergence, the method uses constructed training samples, each comprising a face image sample labeled with key points and a standard Gaussian response map centered on each key point, and optimizes the network with a probability model containing hidden variables as the target of maximum likelihood estimation. The coordinates of the face key points are then predicted by the finally optimized face key point detection network. The invention effectively solves the problem of training oscillation caused by labeling randomness in the network training process and improves the accuracy of face key point detection.

Description

Semantic alignment-based face key point detection method, system and device

Technical Field

The invention belongs to the field of face recognition, and particularly relates to a face key point detection method, system and device based on semantic alignment.

Background

The face key points play an important role in computer vision and pattern recognition applications based on faces, such as video monitoring and identity recognition systems. For most face applications, accurate detection of key points of the face is required first.

The mainstream face key point detection methods of recent years fall into two categories: traditional methods and methods based on convolutional neural networks. Traditional methods regress the model parameters directly from hand-crafted image features. A representative example is cascaded regression, whose fitting process can be summarized by the following formula:

p^{k+1} = p^k + Reg^k(Fea(I, p^k))

At the k-th iteration, the regressor Reg^k regresses on the shape-indexed feature Fea to update the shape parameter p^k, where Fea depends on the input image I and the current shape parameter p^k. The regressor updates the model parameters according to the shape-indexed features, and new features are then computed for the next iteration. In this way, a number of weak regressors can be cascaded into a strong regressor that gradually reduces the error.
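The cascaded update can be illustrated with a minimal sketch. The feature extractor `fea` and the linear stage regressors `(A, b)` below are hypothetical stand-ins chosen for illustration, not the hand-crafted features or regressors of any particular prior-art method; the sketch only shows how each stage adds a correction computed from features indexed by the current shape.

```python
import numpy as np

def fea(image, p):
    """Hypothetical shape-indexed feature: pixel values sampled at the current points."""
    upper = np.array(image.shape[::-1]) - 1          # (width - 1, height - 1)
    pts = np.clip(np.round(p.reshape(-1, 2)).astype(int), 0, upper)
    return image[pts[:, 1], pts[:, 0]].astype(float)

def cascade_fit(image, p0, stages):
    """Apply p^{k+1} = p^k + Reg^k(Fea(I, p^k)) once per stage."""
    p = p0.astype(float).copy()
    for A, b in stages:                              # assumed linear regressor per stage
        p = p + A @ fea(image, p) + b
    return p

# Usage: two key points (x1, y1, x2, y2) refined by two toy stages.
image = np.zeros((8, 8))
p0 = np.array([1.0, 1.0, 2.0, 2.0])
stages = [(np.zeros((4, 2)), np.full(4, 0.5))] * 2
p_final = cascade_fit(image, p0, stages)
```

Each stage sees features recomputed at the shape produced by the previous stage, which is what lets a chain of weak regressors act as one strong regressor.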

Methods based on convolutional neural networks are in turn mainly divided into two categories. The first is coordinate regression, which treats key point localization as a regression from image pixels to key point coordinates: a face image is fed into the convolutional neural network, which directly predicts a vector made up of the coordinates of all key points. The second is response-map-based: the convolutional neural network predicts a response map for each key point, and the position of the response peak is taken as the predicted key point position.

In the above face key point detection methods, the manually labeled key point positions are used as regression targets during training, so that the model fits the manually labeled positions as closely as possible. However, face key points include a large number of weak semantic points. These points are usually only required to be uniformly distributed along designated edges, such as the face contour, the eye sockets and the nose bridge, and have no strict semantic position. Because the texture around weak semantic points is not distinctive, manual labeling results inevitably contain random errors, so the labels of different samples are semantically inconsistent. Training the model directly on the manually labeled points therefore introduces a large number of invalid errors, and the fitting capability of the network cannot be concentrated where it is really needed. Existing methods have so far not considered the influence of key point labeling randomness on model training.

Disclosure of Invention

In order to solve the above-mentioned problems in the prior art, that is, in order to improve the accuracy of face key point detection, a first aspect of the present invention provides a face key point detection method based on semantic alignment, including:

step S10, acquiring a response image of each key point of the face image to be detected based on the face key point detection network;

step S20, selecting the coordinate of the response peak value as the predicted coordinate of each key point for the obtained response graph of each key point;

the face key point detection network is constructed based on a convolutional neural network, is optimized with a probability model containing hidden variables as the target of maximum likelihood estimation, and is used for outputting a response map of the key points in a face image.

In some preferred embodiments, the training samples of the face key point detection network include face image samples labeled with key points and standard Gaussian response maps centered on the respective key points.

In some preferred embodiments, the face key point detection network is optimized with an objective function in which x is the input face image; the true positions form a set of coordinates for the K key points, whose k-th element is the semantically consistent true position of the k-th key point in x predicted by the network; K is the number of key points; W is the network weight; o^k is the k-th labeled key point of x; σ₁ and σ₂ are a preset first weight and a preset second weight, respectively; the distance term is the distribution distance between the response map predicted by the network and the standard Gaussian response map; and N(o^k) is the neighborhood centered on o^k.

In some preferred embodiments, during iterative optimization of the face key point detection network, the network weight W is fixed in the t-th iteration and the true position is calculated by a formula in which the value of o_t^k is the true position obtained by optimization in the previous iteration. Then, based on that true position, the optimized network weight W is obtained by a further formula.

In some preferred embodiments, the iterative optimization process of the face key point detection network ends when the objective function reaches a preset convergence condition, or when the number of iterations reaches a preset number.

In a second aspect of the present invention, a face key point detection system based on semantic alignment is provided, which comprises a response map acquisition module and a key point predicted coordinate acquisition module;

the response image acquisition module is configured to acquire a response image of each key point of the face image to be detected based on the face key point detection network;

the key point prediction coordinate acquisition module is configured to select the coordinate of the response peak value as the prediction coordinate of each key point for the acquired response graph of each key point;

the face key point detection network is constructed based on a convolutional neural network and is used for outputting a response image of key points in a face image.

Wherein, x is an input face image;

a set of coordinates for k keypoints;

In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being suitable for being loaded and executed by a processor to implement the above-mentioned face key point detection method based on semantic alignment.

In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; the programs are suitable to be loaded and executed by the processor to realize the above-mentioned face key point detection method based on semantic alignment.

The invention has the beneficial effects that:

The semantic ambiguity of weak semantic key points distributed along edges causes random errors in manual labeling. Using the manual labels directly as the optimization target of network training introduces a large number of invalid errors into training and makes the predicted positions oscillate continuously, so that the fitting capability of the network cannot be concentrated where it is really needed. The face key point detection method based on semantic alignment can calculate error-free, semantically consistent true positions of the key points, which effectively solves the problem of training oscillation caused by labeling randomness and at the same time makes the fitting of weak semantic points more flexible. Because the fitting capability of the network is allocated more reasonably, performance can be improved significantly without changing the amount of computation.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

fig. 1 is a schematic diagram of training and detection processes in a face key point detection method based on semantic alignment according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

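The invention discloses a face key point detection method based on semantic alignment, which comprises the following steps:

step S10, acquiring a response map of each key point of the face image to be detected based on the face key point detection network;

step S20, for the obtained response map of each key point, selecting the coordinate of the response peak as the predicted coordinate of that key point;

the face key point detection network is constructed based on a convolutional neural network, is optimized with a probability model containing hidden variables as the target of maximum likelihood estimation, and is used for outputting a response map of the key points in a face image.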

In order to more clearly explain the human face key point detection method based on semantic alignment, the method of the invention is expanded and detailed from three aspects of training data, network optimization and human face key point detection by combining with the attached figure 1.

1. Training data

Face key points are manually labeled to serve as training samples: a face bounding box is given in advance, and for each face box the two-dimensional coordinates of every key point inside it are accurately labeled.

For each labeled keypoint in the training sample, a response graph of a standard Gaussian centered on the labeled position is generated to be used as network supervision information.

In the constructed training data, each training sample comprises a face image sample labeled with key points and the standard Gaussian response maps centered on those key points.
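Generating the supervision maps can be sketched as follows. This is a minimal illustration under stated assumptions: the standard deviation `sigma`, the unit peak value and the function names are hypothetical choices, since the text only specifies a standard Gaussian centered on each labeled position.

```python
import numpy as np

def gaussian_response_map(center_xy, height, width, sigma=1.5):
    """Return a (height, width) map with a Gaussian bump peaked at center_xy = (x, y)."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def build_supervision(keypoints_xy, height, width, sigma=1.5):
    """Stack one response map per labeled key point: shape (K, height, width)."""
    return np.stack([gaussian_response_map(kp, height, width, sigma)
                     for kp in keypoints_xy])

# Usage: two labeled key points on a 16 x 16 response grid.
maps = build_supervision([(3, 5), (10, 2)], 16, 16)
```

Each training sample then pairs the face image with a stack such as `maps`, one channel per key point.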

2. Network optimization

The face key point detection network adopted in this embodiment may be an hourglass network or another fully convolutional network structure. When the network is trained on the original manually labeled training set until it has basically converged, it keeps oscillating because it continues to fit key points whose labels are random, and therefore cannot converge to a better local optimum.

2.1 construction of the objective function

Because manual labels are semantically inconsistent, using them directly as the optimization target of network training makes convergence difficult. The invention therefore introduces real key points, which are error-free and semantically consistent across different samples, as the target of network optimization; the manually labeled positions can be regarded as observed values of the true key point positions, which models the face key point detection problem more reasonably. Since the training process of the detection network can be regarded as maximum likelihood estimation, the invention designs a probability model containing hidden variables as the target of the maximum likelihood estimation. The probability model is shown in formula (1), where the hidden variable represents the semantically consistent true positions (true values) of the key points, o represents the observed values, such as the manually labeled key points, x represents the input face image, and W represents the network weight.

The prior probability model gives the probability of obtaining the current observed value o for a given true value. Since experience suggests that the observed value should be close to the true value, the prior probability model is defined as in formula (2), where K is the number of key points, σ₁ is the weight of the prior probability model, denoted the first weight, and o^k is the k-th labeled key point of x. The likelihood function defines the confidence that the current network prediction is the true hidden variable, as shown in formulas (3) and (4).

In these formulas, the distribution distance between the response map predicted by the network and the standard Gaussian response map is measured with a chi-square test; i denotes the pixel index, E denotes the standard Gaussian response map, and Φ(y | x; W) denotes the response map predicted for the input image x. σ₂ is the weight of the likelihood function, denoted the second weight. By combining the prior probability model and the likelihood function, training of the face key point detection network can be converted into a constrained optimization problem, as shown in formula (5). Because the true value should not lie too far from the observed value, the true position of the k-th key point is constrained to the neighborhood N(o^k) centered on the current observed value o^k.
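One plausible reading of formulas (1) to (5), consistent with the symbol definitions in this section, is given below. The symbol ŷ for the hidden true positions, the Gaussian form of the prior, the Pearson form of the chi-square statistic, the superscript k on Φ and the exact placement of σ₁ and σ₂ are assumptions made for illustration; they are not the patent's verbatim formulas.

```latex
\begin{align}
P(o, \hat{y} \mid x; W) &= P(o \mid \hat{y})\, P(\hat{y} \mid x; W) \tag{1}\\
P(o \mid \hat{y}) &\propto \prod_{k=1}^{K}
  \exp\!\left(-\frac{\lVert o^{k} - \hat{y}^{k} \rVert^{2}}{2\sigma_{1}^{2}}\right) \tag{2}\\
P(\hat{y} \mid x; W) &\propto \prod_{k=1}^{K}
  \exp\!\left(-\frac{\chi^{2}\!\left(E(\hat{y}^{k}),\, \Phi^{k}(y \mid x; W)\right)}{2\sigma_{2}^{2}}\right) \tag{3}\\
\chi^{2}(E, \Phi) &= \sum_{i} \frac{(E_{i} - \Phi_{i})^{2}}{E_{i}} \tag{4}\\
\min_{\hat{y},\, W}\;& \sum_{k=1}^{K} \left(
  \frac{\lVert o^{k} - \hat{y}^{k} \rVert^{2}}{2\sigma_{1}^{2}}
  + \frac{\chi^{2}\!\left(E(\hat{y}^{k}),\, \Phi^{k}(y \mid x; W)\right)}{2\sigma_{2}^{2}}
  \right)
  \quad \text{s.t. } \hat{y}^{k} \in N(o^{k}) \tag{5}
\end{align}
```

Under this reading, (5) is the negative logarithm of (1) after substituting (2) and (3) and dropping constants, so maximizing the likelihood and minimizing the objective are the same problem.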

2.2 iterative optimization

In the training process, the network weights of the face key point detection network are optimized iteratively. In the t-th iteration, the network weight W is fixed and the true position is calculated by formula (6). The true position obtained in this way takes both the position predicted by the network and the observed position into account. Because the network has learned prior knowledge from the samples of the entire training set, the optimized true position can smooth out, to some extent, the labeling randomness of a single sample.
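The true-position update with fixed weights can be sketched as a small grid search. This is a minimal illustration under stated assumptions, not the patented implementation: the squared-distance prior, the Pearson form of the chi-square distance with a small `eps` guard, the square window of half-width `radius` standing in for N(o^k), and all function names are hypothetical; `sigma1` and `sigma2` play the role of the first and second weights.

```python
import numpy as np

def gaussian_map(center_xy, shape, sigma=1.5):
    """Gaussian response map of the given (height, width) shape, peaked at (x, y)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def chi_square(expected, predicted, eps=1e-6):
    """Pearson-style distance between two response maps (eps avoids division by zero)."""
    return float(np.sum((expected - predicted) ** 2 / (expected + eps)))

def search_true_position(obs_xy, predicted_map, radius=2, sigma1=2.0, sigma2=1.0):
    """Search the window around obs_xy for the position with the lowest combined cost."""
    h, w = predicted_map.shape
    ox, oy = obs_xy
    best_xy, best_cost = obs_xy, np.inf
    for y in range(max(0, oy - radius), min(h, oy + radius + 1)):
        for x in range(max(0, ox - radius), min(w, ox + radius + 1)):
            # Term 1: stay close to the observed (labeled) position.
            prior = ((x - ox) ** 2 + (y - oy) ** 2) / (2.0 * sigma1 ** 2)
            # Term 2: agree with the response map predicted by the network.
            fit = chi_square(gaussian_map((x, y), (h, w)), predicted_map) / (2.0 * sigma2 ** 2)
            if prior + fit < best_cost:
                best_xy, best_cost = (x, y), prior + fit
    return best_xy

# Usage: the label says (5, 5) but the network response peaks one pixel to the right.
predicted = gaussian_map((6, 5), (16, 16))
refined = search_true_position((5, 5), predicted)
```

A smaller `sigma1` pulls the result toward the manual label, while a smaller `sigma2` pulls it toward the network prediction.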

Then, based on the true position, the optimized network weight W is obtained by formula (7). With the objective function regarded as a loss function, optimizing the weight W is a standard network training process. The whole training procedure of the key point detection model can thus be regarded as obtaining the true positions of the key points by optimization in each iteration and then using those true positions to train the network. Compared with existing algorithms, the method provides a more flexible optimization target, so that unreasonable errors in the training process are effectively reduced, the fitting capability of the convolutional neural network is allocated more reasonably, and the accuracy of key point localization is improved.
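The two alternating sub-problems referred to as formulas (6) and (7) can be written, under assumptions, as follows. The symbol ŷ, the iteration subscripts on W, and the forms of the two terms are illustrative choices that mirror the constrained objective of section 2.1. Following the summary, o_t^k is taken to be the true position obtained in the previous iteration; using the manual label as its initial value is an additional assumption.

```latex
\begin{align}
\hat{y}_{t}^{k} &= \arg\min_{\hat{y}^{k} \in N(o_{t}^{k})}
  \frac{\lVert o_{t}^{k} - \hat{y}^{k} \rVert^{2}}{2\sigma_{1}^{2}}
  + \frac{\chi^{2}\!\left(E(\hat{y}^{k}),\, \Phi^{k}(y \mid x; W_{t-1})\right)}{2\sigma_{2}^{2}} \tag{6}\\
W_{t} &= \arg\min_{W} \sum_{k=1}^{K}
  \frac{\chi^{2}\!\left(E(\hat{y}_{t}^{k}),\, \Phi^{k}(y \mid x; W)\right)}{2\sigma_{2}^{2}} \tag{7}
\end{align}
```

In (7) the prior term is omitted because it does not depend on W once the true positions are fixed, which is why the weight update reduces to ordinary supervised training against the refined response maps.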

The preferred steps in the training process are: (1) training the face key point detection network directly on the manually labeled face image training samples of section 1 until basic convergence; (2) further optimizing the face key point detection network with the iterative optimization method of section 2.2, based on the training samples constructed in section 1. The iterative optimization process of the face key point detection network ends when the objective function reaches a preset convergence condition, or when the number of iterations reaches a preset number.
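The alternation and its two end conditions can be sketched generically. The callables `update_positions`, `update_weights` and `objective` are hypothetical placeholders for the true-position update, the network training step and the objective function, and the threshold `tol` on the change of the objective is only one assumed way to express a "preset convergence condition".

```python
def alternate_until_converged(update_positions, update_weights, objective,
                              positions, weights, tol=1e-4, max_iters=50):
    """Alternate the two updates until the objective stabilises or max_iters is reached."""
    prev = objective(positions, weights)
    t = 0
    for t in range(1, max_iters + 1):
        positions = update_positions(positions, weights)   # weights held fixed
        weights = update_weights(positions, weights)       # positions held fixed
        cur = objective(positions, weights)
        if abs(prev - cur) < tol:                          # preset convergence condition
            break
        prev = cur
    return positions, weights, t

# Usage on a toy scalar problem: p is a "true position", w a "weight", o a label.
o = 3.0
toy_objective = lambda p, w: (p - w) ** 2 + 0.1 * (p - o) ** 2
toy_update_p = lambda p, w: (w + 0.1 * o) / 1.1            # exact minimiser over p
toy_update_w = lambda p, w: p                              # exact minimiser over w
p, w, iters = alternate_until_converged(toy_update_p, toy_update_w,
                                        toy_objective, 3.0, 0.0)
```

Because each update minimizes the same objective with the other variable fixed, the objective never increases from one iteration to the next in this toy setting.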

3. Face keypoint detection

Step S10, acquiring a response image of each key point of the face image to be detected based on the face key point detection network trained by the method;

step S20, for each obtained response map of the key point, selecting the coordinate of the response peak as the predicted coordinate of the key point.
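Steps S10 and S20 reduce, on the output side, to taking the peak of each response map. A minimal sketch follows, assuming the network output is already available as a NumPy array of shape (K, height, width); the function name and the (x, y) output convention are illustrative choices.

```python
import numpy as np

def peak_coordinates(response_maps):
    """Return a (K, 2) integer array of (x, y) peak locations, one row per response map."""
    k, h, w = response_maps.shape
    flat_idx = response_maps.reshape(k, -1).argmax(axis=1)   # peak index per map
    ys, xs = np.unravel_index(flat_idx, (h, w))              # back to 2-D indices
    return np.stack([xs, ys], axis=1)

# Usage: two toy response maps with known peaks.
maps = np.zeros((2, 4, 5))
maps[0, 1, 3] = 1.0
maps[1, 2, 0] = 2.0
coords = peak_coordinates(maps)
```

If the response maps are at a lower resolution than the input image, the coordinates would additionally need to be scaled back to image space.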

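A face key point detection system based on semantic alignment according to a second embodiment of the invention comprises a response map acquisition module and a key point predicted coordinate acquisition module. The response map acquisition module is configured to acquire a response map of each key point of the face image to be detected based on the face key point detection network; the key point predicted coordinate acquisition module is configured to select, for the acquired response map of each key point, the coordinate of the response peak as the predicted coordinate of that key point.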

The specific training method of the face key point detection network has been described in detail above, and is not repeated here.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.

It should be noted that, the face keypoint detection system based on semantic alignment provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.

A storage device according to a third embodiment of the present invention stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to implement the above-mentioned face key point detection method based on semantic alignment.

A processing apparatus according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; the programs are suitable to be loaded and executed by the processor to realize the above-mentioned face key point detection method based on semantic alignment.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims

1. A face key point detection method based on semantic alignment is characterized by comprising the following steps:

in the face key point detection network, an objective function adopted in the optimization process is as follows:

wherein x is an input face image; the true positions form a set of coordinates for the K key points, whose k-th element is the semantically consistent true position of the k-th key point in x predicted by the network; K is the number of key points; W is the network weight; o^k represents the k-th labeled key point of x; σ₁ and σ₂ are respectively a preset first weight and a preset second weight; the distance term is the distribution distance between the response map predicted by the network and the standard Gaussian response map; and N(o^k) is a neighborhood centered on o^k;

in the iterative optimization process of the face key point detection network, the network weight W is fixed in the t-th iteration and the true position is calculated by a formula; then, based on that true position, the optimized network weight W is obtained by a further formula;

the end condition of the iterative optimization process of the face key point detection network is as follows: the objective function reaches a preset convergence condition, or the number of iterations reaches a preset number;

2. The method for detecting the key points of the human face based on the semantic alignment as claimed in claim 1, wherein the training samples of the human face key point detection network comprise human face image samples marked with key points and response graphs of standard gaussians with the key points as centers.

3. A face key point detection system based on semantic alignment is characterized by comprising: a response map acquisition module and a key point prediction coordinate acquisition module;

the human face key point detection network adopts an objective function in the optimization process of

Wherein, x is an input face image;

a set of coordinates for k keypoints;

4. The semantic alignment-based face keypoint detection system of claim 3, wherein the training samples of the face keypoint detection network comprise face image samples marked with keypoints and response graphs of standard gaussians with the keypoints as centers.

5. A storage device having stored therein a plurality of programs adapted to be loaded and executed by a processor to implement the semantic alignment based face keypoint detection method of any of claims 1-2.

6. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the semantic alignment based face keypoint detection method of any of claims 1-2.