CN109886121B - Occlusion-robust face keypoint localization method - Google Patents

Occlusion-robust face keypoint localization method Download PDF

Info

Publication number
CN109886121B
CN109886121B CN201910061018.5A CN201910061018A
Authority
CN
China
Prior art keywords
face
key point
key
pixel
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910061018.5A
Other languages
Chinese (zh)
Other versions
CN109886121A (en)
Inventor
吴思
王梁昊
李东晓
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201910061018.5A priority Critical patent/CN109886121B/en
Publication of CN109886121A publication Critical patent/CN109886121A/en
Application granted granted Critical
Publication of CN109886121B publication Critical patent/CN109886121B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an occlusion-robust face keypoint localization method, belonging to the field of face keypoint localization. The method comprises the following steps: S1: collecting face images annotated with keypoints as a training data set and a test data set; S2: for the training and test data sets of S1, first performing face detection and 5-point keypoint localization with the MTCNN face detection algorithm, then frontalizing each face with a Procrustes analysis transformation T to obtain rectified face samples, which form the training and test samples; the rectified face samples are used as the input of stage S3; S3: learning the training samples of S2 with a convolutional network; S4: after the convolutional network has completed the training of S3, inputting the test samples of S2 into the network to obtain the positions of the face keypoints in the image. The method is robust for face keypoint localization under occlusion, and its effectiveness is demonstrated by the test results and the qualitative localization results on the corresponding pictures.

Description

Occlusion-robust face keypoint localization method
Technical Field
The invention belongs to the field of face keypoint localization, and in particular relates to an occlusion-robust face keypoint localization method.
Background
Face keypoint localization technology works on face images: for an input image or video stream it first detects whether a face is present and, if so, further gives the position of each face and of its keypoints, such as the nose, eyes, mouth and the other main facial organs. The technology can be used for face pose correction, pose recognition, fatigue monitoring, 3D face reconstruction, face animation, face recognition, expression analysis and so on. Incorrect keypoint localization distorts and deforms the face, so an algorithm that can accurately extract face keypoints is very important. The whole pipeline comprises face detection, face preprocessing, face keypoint localization and other modules.
Current face keypoint localization algorithms fall mainly into three classes: generative-model methods such as ASM and AAM together with their extensions, cascaded shape regression methods, and deep learning methods. Deep learning is the most widely studied of the three, so the invention mainly concerns the application of deep learning to face keypoint localization.
The problems of current face keypoint localization algorithms include:
1. Acquisition and curation of training data. Training a deep learning model relies on a large number of valid face samples to improve keypoint localization accuracy. In principle, the richer the training data, the stronger the generalization ability of the trained model.
2. Design of the deep learning network model. The network model is one of the key factors affecting keypoint localization accuracy. The more reasonable the network design, the richer the features it can extract, and usually the higher the final localization accuracy, so the network structure needs to be designed and optimized carefully.
3. Choice of the training loss function. The loss function is crucial for training the network: a good loss function guides the network towards reasonable optima, and designing a loss function suited to the specific task is also key to solving the problem.
Although many relatively mature algorithms exist for face keypoint localization at moderate poses, keypoint localization under occlusion still leaves considerable room for improvement. Addressing the three problems above, the invention discloses an occlusion-robust face keypoint localization method that localizes occluded keypoints by designing a cascaded neural network framework and exploiting the shape constraints among keypoints together with the semantic information of neighbouring regions.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides an occlusion-robust face keypoint localization method.
The invention adopts the following specific technical scheme:
An occlusion-robust face keypoint localization method comprises the following steps:
S1: collecting face images annotated with keypoints as a training data set and a test data set;
S2: for the training and test data sets of S1, first performing face detection and 5-point keypoint localization with the MTCNN face detection algorithm, then frontalizing each face with a Procrustes analysis transformation T to form training and test samples; the rectified face samples are used as the input of stage S3;
S3: learning the training samples of S2 through a convolutional network, with the following specific steps:
S31: reading the face images and the corresponding keypoint positions in the training samples, and normalizing every input face image to 3 channels with a height of 224 pixels and a width of 224 pixels;
S32: extracting features from the read face images with a convolutional neural network, applying convolution, BN and ReLU activation, and down-sampling with max-pooling layers to extract a 28 × 28 × A multi-channel feature map, where A is the number of channels of the feature map;
S33: coarsely localizing on the extracted multi-channel feature map: during coarse localization the multi-channel feature map is convolved with convolution kernels of different sizes, and the visible face keypoints are supervised with a pixel-wise cross-entropy loss to obtain C keypoint heat maps of size 28 × 28, where C is the number of visible face keypoints; the pixel-wise cross-entropy loss is expressed as follows:
$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{(x,y)}\Big[M_i(x,y)\log \hat{M}_i(x,y) + \big(1-M_i(x,y)\big)\log\big(1-\hat{M}_i(x,y)\big)\Big]$$
where N is the number of visible face keypoints, $M_i(x,y)$ is the ground-truth pixel value at pixel $(x,y)$ on the i-th visible-keypoint heat map, and $\hat{M}_i(x,y)$ is the pixel value at $(x,y)$ on the i-th keypoint heat map output by the network after sigmoid activation;
S34: fusing the keypoint heat maps obtained by coarse localization in S33 with the multi-channel feature map extracted in S32 along the channel dimension to obtain a first fused feature map, and performing fine localization; during fine localization the first fused feature map is convolved with convolution kernels of different sizes, and the face keypoints are supervised with an L2 loss on the pixel values at corresponding positions of the keypoint heat maps, yielding K + 1 keypoint heat maps of size 28 × 28, where K is the number of all visible and invisible face keypoints; the L2 loss is expressed as follows:
$$L_2 = \sum_{i=1}^{K}\sum_{(x,y)}\big(\hat{H}_i(x,y) - H_i(x,y)\big)^2 + \sum_{(x,y)}\big(\hat{H}_{bg}(x,y) - H_{bg}(x,y)\big)^2$$
where $\hat{H}_i$ is the i-th face keypoint heat map predicted by the network, $H_i$ is the i-th heat map generated from the keypoint ground truth, $i \in [1, 2, \ldots, K]$; $\hat{H}_{bg}$ is the background-region heat map predicted by the network and $H_{bg}$ is the heat map generated from the background-region ground truth; $(x, y)$ denotes the position of a pixel on the heat map;
S35: fusing the multi-channel feature map extracted in S32 with the keypoint heat maps obtained by fine localization in S34 along the channel dimension to obtain a second fused feature map; first fusing the channel information of the second fused feature map through a point-wise convolution, then obtaining a 7 × 7 feature map through two convolution-and-pooling layers, and then obtaining a 2 × K-dimensional keypoint offset vector ΔS through two fully connected layers; adding ΔS to the face keypoint positions obtained by fine localization in S34, computing the residual with respect to the face keypoint ground truth, and adopting a 2D-coordinate L2 loss into which a new loss term is introduced; the loss function has the following form:
$$L_3 = \frac{\big\|S_{gt} - (\hat{S}_2 + \Delta S)\big\|_2}{D_{box}} + \sum_{\text{adjacent pairs}}\theta$$
where $S_{gt}$ is the 2D coordinate vector of the ground-truth face keypoint positions, $\hat{S}_2$ is the face keypoint positions obtained by fine localization in S34, and $D_{box}$, the length of the diagonal of the face box, serves as the normalization factor of the loss; θ is the angle between a line segment formed by two predicted adjacent keypoints and the line segment formed by the ground truth of the same adjacent keypoints; the angle θ is defined as follows:
$$\vec{d}_{pred} = \big(\hat{x}_j - \hat{x}_i,\; \hat{y}_j - \hat{y}_i\big), \qquad \vec{d}_{gt} = \big(x^{gt}_j - x^{gt}_i,\; y^{gt}_j - y^{gt}_i\big)$$
$$\theta = \arccos\frac{\vec{d}_{pred}\cdot\vec{d}_{gt}}{\|\vec{d}_{pred}\|\,\|\vec{d}_{gt}\|}$$
where $(\hat{x}_i, \hat{y}_i)$ and $(\hat{x}_j, \hat{y}_j)$ are the positions of two adjacent face keypoints predicted in S35, and $(x^{gt}_i, y^{gt}_i)$ and $(x^{gt}_j, y^{gt}_j)$ are the ground-truth positions of the same two adjacent keypoints; the four points define the angle θ between the two line segments;
S36: obtaining the face keypoint positions $S_{final}$ on the original image by applying the inverse of the frontalization transformation of S2 to the keypoint positions on the frontalized face obtained in S35; the inverse transformation is
$$S_{final} = T^{-1}(S_3)$$
where $S_3$ is the keypoint positions on the rectified face obtained in S35 and $T^{-1}$ is the inverse transformation matrix corresponding to the Procrustes analysis transformation T;
S4: after the convolutional network has completed the training of S3, inputting the test samples of S2 into the convolutional network to obtain the positions of the face keypoints in the image.
Preferably, in step S1 the keypoints are annotated as follows: the sample keypoint positions are first calibrated automatically with the 3DDFA algorithm and then adjusted manually.
Preferably, in step S33 there are 4 convolution kernels of different sizes: 3 × 3 × 256, 3 × 3 × 512, 1 × 1 × 512 and 1 × 1 × 68.
Preferably, in step S34 there are 4 convolution kernels of different sizes: 3 × 3 × 128, 3 × 3 × 256, 1 × 1 × 128 and 1 × 1 × 68.
Preferably, in step S33 the ground-truth pixel values on the keypoint heat maps are generated as follows:
set the value of every pixel within a 4-pixel radius of a visible face keypoint to 1, set the pixels outside that radius to 0, and generate one such heat map for every visible face keypoint.
Preferably, in step S34 the face keypoint ground truth on the keypoint heat maps is generated as follows:
a Gaussian response is placed in the 3-pixel-radius region around each face keypoint, with the Gaussian response function
$$H(j,k) = \exp\!\left(-\frac{(j-x_g)^2}{2\sigma_x^2} - \frac{(k-y_g)^2}{2\sigma_y^2}\right)$$
where $(j, k)$ is the pixel index on the 28 × 28 × K heat maps, $(x_g, y_g)$ are the coordinates of the ground-truth face keypoint on the corresponding heat map, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian response in the x and y directions, respectively.
The innovation of the method lies in redesigning and recombining the individual CNN modules into a brand-new occlusion-robust face keypoint localization framework and introducing a new loss function, which together improve the localization accuracy of face keypoints under occlusion. The method is robust for face keypoint localization under occlusion, and its effectiveness is demonstrated by the test results and the qualitative localization results on the corresponding pictures.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is the regression-target ground truth adopted for coarse localization;
FIG. 3 is the regression-target ground truth adopted for fine localization;
FIG. 4 is the new angle constraint introduced in the 2D coordinate regression;
FIG. 5 shows the keypoint localization results of the 2D coordinate regression network on a face image with and without the new angle constraint;
FIG. 6 is the key point location result of the algorithm of the present invention on some occluded pictures.
Detailed Description
The invention is further explained and described below with reference to the drawings and the detailed description. The technical features of the individual embodiments of the invention can be combined with one another as long as they do not conflict.
In the invention, the occlusion-robust face keypoint localization method comprises the following steps:
S1: a certain number of face images are collected as a training data set and a test data set. The face images must be annotated with keypoints beforehand; the annotation method is: the keypoint positions of each face image are first calibrated automatically with the 3DDFA algorithm, then screened manually, and any inaccurate points are adjusted by hand. When the training data set is too small, data augmentation can be applied to enlarge it; the test samples are not augmented. Augmentation of the training data set includes Gaussian blur (noise standard deviation 2), rotation in [-30, 30] degrees, scaling in [0.8, 1.2], horizontal image flipping, contrast enhancement and colour jittering (to simulate illumination changes).
It should be noted that after an image is flipped horizontally, the labels of the corresponding face keypoints must be remapped accordingly.
S2: for the training and test data sets of S1, face detection and 5-point keypoint localization are performed with the MTCNN face detection algorithm, and then each face is frontalized with a Procrustes analysis transformation T to form the training and test samples. The rectified face samples are used as the input of stage S3 and of stage S4.
Through this transformation all samples are aligned to an average shape, so the subsequent network learning can concentrate on the keypoint features and on the non-rigid variations of the face, such as pose, expression and occlusion, which benefits the accuracy of the final localization.
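A minimal sketch of the similarity (Procrustes) alignment used here is given below, assuming the standard Umeyama least-squares solution between the 5 detected keypoints and a mean 5-point shape; the function names and the use of NumPy are illustrative, not taken from the patent.

```python
import numpy as np

def procrustes_similarity(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    with dst ≈ s * src @ R.T + t (Umeyama); src, dst are N x 2 point sets,
    e.g. the 5 MTCNN keypoints and the mean 5-point shape."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[1, 1] = -1.0                       # guard against reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * (R @ mu_s)
    return s, R, t

# The transform T = (s, R, t) warps the image and its labels to the mean shape,
# e.g. with cv2.warpAffine using M = np.hstack([s * R, t[:, None]]).
```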
S3: learning the training samples in the S2 through a convolutional network, which comprises the following specific steps:
s31: reading face images and corresponding key point positions in a training sample, and normalizing each input face image into 3 channels, wherein the height of each input face image is 224 pixels, and the width of each input face image is 224 pixels;
s32: extracting features of the read face image by adopting a convolutional neural network, performing convolution, BN and relu activation, and performing down-sampling by combining a maximum pooling layer to extract a 28 multiplied by A multi-channel feature map, wherein A is the number of channels in the feature map;
the feature extraction section generates a feature map containing structural and semantic information pixel by pixel for a subsequent localization task. For the occlusion problem, context semantic information is needed to locate the position of the occluded key point, so the size of the receptive field is important. Large convolution kernels, e.g. 5 x 5, 7 x 7 convolution kernels, can be considered, but this significantly increases the number of parameters and increases the time for model inference. Through a plurality of maximum pooling layers, the resolution of the feature map is gradually reduced, the sliding of convolution kernels is performed on a small feature map, namely the convolution is performed on a large original map by using a large convolution kernel, the fact that the receptive field is enlarged is also equivalent to the introduction of position constraint among key points, and the robustness under the shielding condition is improved. And because the resolution of the characteristic diagram is small, the sliding times of each convolution kernel are few, and the forward propagation time can be effectively reduced.
S33: and performing coarse positioning on the extracted multi-channel feature map, performing convolution on the multi-channel feature map by convolution kernels with different sizes during coarse positioning, and supervising the visible human face key points by adopting a pixel-by-pixel cross entropy loss function to obtain C pieces of key point heat maps of 28 multiplied by 28, wherein C represents the number of the visible human face key points. The convolution kernels of the different sizes are 4 types in the present example, and the sizes are 3 × 3 × 256, 3 × 3 × 512, 1 × 1 × 512, and 1 × 1 × 68, respectively. The expression of the above pixel-by-pixel cross entropy loss function is as follows:
Figure BDA0001954113430000061
where N is the number of visible face keypoints,
Figure BDA0001954113430000062
representative is the ith visible facePixel true value at pixel point (x, y) on the keypoint heat map;
Figure BDA0001954113430000063
is the value after sigmoid activation is performed on the pixel value at (x, y) on the ith key point heat map output by the network.
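A hedged PyTorch sketch of this supervision follows; masking the occluded keypoints out of the loss and the exact normalization over the N visible keypoints are assumptions consistent with the description, not copied from the patent's implementation.

```python
import torch
import torch.nn.functional as F

def coarse_heatmap_loss(logits, target, visible):
    """Pixel-wise cross-entropy over visible-keypoint heat maps.
    logits:  (B, C, 28, 28) raw network output (sigmoid applied inside the loss)
    target:  (B, C, 28, 28) binary ground truth (1 inside the 4-pixel disk)
    visible: (B, C) mask, 1 for visible keypoints, 0 for occluded ones."""
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    loss = loss.mean(dim=(2, 3)) * visible           # per-map mean, occluded maps masked
    return loss.sum() / visible.sum().clamp(min=1)   # average over visible keypoints
```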
During the training of this step, the keypoint positions are always predicted from the whole face region rather than from local image patches around the keypoints, which introduces a global shape constraint and avoids falling into local optima. The regression target of this part is not represented directly as 2D coordinates but is converted into 28 × 28 × C keypoint heat maps. The ground-truth pixel values on the keypoint heat maps are generated as follows: the value of every pixel within a 4-pixel radius of a visible face keypoint is set to 1, the pixels outside that radius are set to 0, and one such heat map is generated for every visible face keypoint. The ground-truth heat map generated in this stage is shown in FIG. 2 (enlarged for clarity of display).
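A minimal sketch of this ground-truth construction (heat-map size and radius as stated in the text, coordinate convention assumed):

```python
import numpy as np

def binary_disk_heatmap(kp, size=28, radius=4):
    """Coarse-stage ground truth: 1 within `radius` pixels of the keypoint,
    0 elsewhere. kp = (x, y) in 28 x 28 heat-map coordinates."""
    ys, xs = np.mgrid[0:size, 0:size]
    dist2 = (xs - kp[0]) ** 2 + (ys - kp[1]) ** 2
    return (dist2 <= radius ** 2).astype(np.float32)
```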
No supervision is applied to the occluded keypoints in this stage, so on occluded keypoints the heat maps output by the coarse-localization network show a blurry, cloud-like response around the true position, with only small response values at a few points and 0 elsewhere. At visible keypoints, in contrast, the response is strong and its extent is small.
S34: the keypoint heat maps obtained by coarse localization in S33 and the multi-channel feature map extracted in S32 are fused along the channel dimension into a first fused feature map, and fine localization is performed. This multi-channel form guides the fine localization to focus on the inherent relations between the keypoints, so in the fine-localization network the invisible keypoints can rely on the visible keypoints to provide contextual semantic information for more accurate localization.
During fine localization, convolution kernels of different sizes are convolved with the first fused feature map, and the face keypoints are supervised with an L2 loss on the pixel values at corresponding positions of the keypoint heat maps, yielding K + 1 keypoint heat maps of size 28 × 28, where K is the number of all visible and invisible face keypoints. In this step there are 4 convolution kernels of different sizes: 3 × 3 × 128, 3 × 3 × 256, 1 × 1 × 128 and 1 × 1 × 68. The L2 loss is expressed as follows:
$$L_2 = \sum_{i=1}^{K}\sum_{(x,y)}\big(\hat{H}_i(x,y) - H_i(x,y)\big)^2 + \sum_{(x,y)}\big(\hat{H}_{bg}(x,y) - H_{bg}(x,y)\big)^2$$
where $\hat{H}_i$ is the i-th face keypoint heat map predicted by the network, $H_i$ is the i-th heat map generated from the keypoint ground truth, $i \in [1, 2, \ldots, K]$; $\hat{H}_{bg}$ is the background-region heat map predicted by the network and $H_{bg}$ is the heat map generated from the background-region ground truth; $(x, y)$ denotes the position of a pixel on the heat map.
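A sketch of this supervision is shown below; constructing the background target as one minus the maximum over the keypoint maps is an assumption (the patent only states that a background ground-truth heat map is used), and the reduction over pixels and channels is illustrative.

```python
import torch

def fine_heatmap_loss(pred, target_kp):
    """L2 supervision over K keypoint heat maps plus one background map.
    pred:      (B, K + 1, 28, 28) network output, channel K is the background map
    target_kp: (B, K, 28, 28) Gaussian ground-truth maps for the K keypoints."""
    bg = 1.0 - target_kp.max(dim=1, keepdim=True).values   # assumed background target
    target = torch.cat([target_kp, bg], dim=1)
    return ((pred - target) ** 2).sum(dim=(2, 3)).mean()
```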
The face keypoint ground truth on the keypoint heat maps is generated as follows:
a Gaussian response is placed in the 3-pixel-radius region around each face keypoint, with the Gaussian response function
$$H(j,k) = \exp\!\left(-\frac{(j-x_g)^2}{2\sigma_x^2} - \frac{(k-y_g)^2}{2\sigma_y^2}\right)$$
where $(j, k)$ is the pixel index on the 28 × 28 × K heat maps, $(x_g, y_g)$ are the coordinates of the ground-truth face keypoint on the corresponding heat map, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian response in the x and y directions, respectively. Here the standard deviation is set to 3. The farther a pixel is from the centre point, the lower the response value; the response value at the centre point is 1.
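A minimal sketch of this Gaussian ground truth (σ = 3 as in the text; whether the response is truncated outside the 3-pixel radius is left out here):

```python
import numpy as np

def gaussian_heatmap(kp, size=28, sigma=3.0):
    """Fine-stage ground truth: unnormalized Gaussian centred on the keypoint,
    equal to 1 at the centre and decaying with distance."""
    ys, xs = np.mgrid[0:size, 0:size]
    g = np.exp(-((xs - kp[0]) ** 2) / (2 * sigma ** 2)
               - ((ys - kp[1]) ** 2) / (2 * sigma ** 2))
    return g.astype(np.float32)
```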
The ground truth generated for the fine-localization stage is shown in FIG. 3 (enlarged for clarity of display). Here a heat map is generated for every keypoint and for the background region. This amounts to weighting keypoints and background differently: localizing a point of the background region as a keypoint is penalized more heavily, because its adverse effect is greater than merely failing to localize a keypoint (i.e., judging it to be background), which in some cases may not matter much.
Through this stage we obtain heat maps for all visible and invisible keypoints. Compared with the keypoint heat maps produced by the first-stage coarse localization, the heat maps output by the second-stage fine localization respond over a smaller region around the keypoints; in particular, for the invisible keypoints the response values are clearly stronger than in the first stage and the response region shrinks, so the localization is more accurate.
The heat map serves as the representation of the regression target and encodes both the spatial position of each keypoint and its confidence. A keypoint heat map also passes the positions predicted at the current stage on to the next stage, which, given more features and information, further adjusts the keypoint positions and achieves accurate localization.
Whereas in the first-stage coarse localization the confidence of the occluded points is lower than that of the visible keypoints, in the second-stage fine localization the invisible points are predicted from the context information of the visible points. A background-region heat map is also introduced, in which the value of each pixel represents the probability that the pixel belongs to the background.
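For illustration, a position and a confidence can be read out of a single heat map with a simple argmax, a sketch not described explicitly in the patent (sub-pixel accuracy is instead recovered by the third-stage 2D coordinate regression):

```python
import numpy as np

def decode_heatmap(hm):
    """Return ((x, y), confidence) for one 28 x 28 heat map via argmax."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    return (float(x), float(y)), float(hm[y, x])
```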
The networks of these two stages extend well: no matter how many face keypoints are annotated in a data set, a corresponding number of heat maps can be generated, so the method has good compatibility. Because directly fusing the low-level features, which carry much irrelevant information (noise), with the later high-level features would cause interference, a regularization term is added to the convolution kernels when the feature map of the feature-extraction part is fused, via point-wise convolution, with the heat maps output by the first and second parts of the network: the kernel regularization suppresses the noise in the low-level features, after which the low-level and high-level features are fused.
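A hedged PyTorch sketch of this channel-wise fusion via point-wise (1 × 1) convolution is given below; the channel counts are placeholders, and expressing the kernel regularization as optimizer weight decay is an assumption about how the constraint is applied.

```python
import torch
import torch.nn as nn

class PointwiseFusion(nn.Module):
    """Concatenate backbone features with a stage's heat maps along the channel
    axis, then mix channels with a 1 x 1 (point-wise) convolution."""
    def __init__(self, feat_ch, hm_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(feat_ch + hm_ch, out_ch, kernel_size=1)

    def forward(self, feats, heatmaps):
        return torch.relu(self.mix(torch.cat([feats, heatmaps], dim=1)))

# The kernel regularization could be realized as L2 weight decay, e.g.
# torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=5e-4)
```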
S35: and fusing the multi-channel feature map extracted in the S32 and the key point heat map obtained by fine positioning in the S34 together according to the channel direction to obtain a second fused feature map. The reason for introducing this step is that global regression learning can effectively perform global shape constraint and reduce errors due to local appearance blurring, in the first place. Secondly, since indexes of pixel points on the feature map are integers, and real key point coordinates are floating point type decimal numbers, in order to further improve positioning accuracy, binary coordinate regression of a third part is introduced after fine positioning of the second part, and the specific method is described as follows: firstly, fusing channel information of a second fusion characteristic diagram through point-wise convolution, then obtaining a 7 multiplied by 7 characteristic diagram through two convolution and pooling layers, and then obtaining K characteristic vectors delta S of key points in 2 multiplied by K dimensions through two full-connection layers; adding the delta S and the position of the human face key point obtained by fine positioning in the S34, performing residual calculation with a human face key point true value, and introducing a new loss term by adopting a loss function which is a binary coordinate L2 loss, wherein the loss function has the following form:
$$L_3 = \frac{\big\|S_{gt} - (\hat{S}_2 + \Delta S)\big\|_2}{D_{box}} + \sum_{\text{adjacent pairs}}\theta$$
where $S_{gt}$ is the 2D coordinate vector of the ground-truth face keypoint positions, $\hat{S}_2$ is the face keypoint positions obtained by fine localization in S34, and $D_{box}$, the length of the diagonal of the face box, serves as the normalization factor of the loss; θ is the angle between a line segment formed by two predicted adjacent keypoints and the line segment formed by the ground truth of the same adjacent keypoints, as shown in FIG. 4. The angle θ is defined as follows:
$$\vec{d}_{pred} = \big(\hat{x}_j - \hat{x}_i,\; \hat{y}_j - \hat{y}_i\big), \qquad \vec{d}_{gt} = \big(x^{gt}_j - x^{gt}_i,\; y^{gt}_j - y^{gt}_i\big)$$
$$\theta = \arccos\frac{\vec{d}_{pred}\cdot\vec{d}_{gt}}{\|\vec{d}_{pred}\|\,\|\vec{d}_{gt}\|}$$
where $(\hat{x}_i, \hat{y}_i)$ and $(\hat{x}_j, \hat{y}_j)$ are the positions of two adjacent face keypoints predicted in S35, and $(x^{gt}_i, y^{gt}_i)$ and $(x^{gt}_j, y^{gt}_j)$ are the ground-truth positions of the same two adjacent keypoints; the four points define the angle θ between the two line segments.
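A minimal sketch of the angle computation for one pair of adjacent keypoints (how the angles are weighted inside the total loss follows the formula above):

```python
import numpy as np

def segment_angle(p1, p2, g1, g2, eps=1e-8):
    """Angle theta between the predicted segment p1->p2 and the ground-truth
    segment g1->g2 of two adjacent keypoints."""
    v_pred = np.asarray(p2, float) - np.asarray(p1, float)
    v_gt = np.asarray(g2, float) - np.asarray(g1, float)
    cos_t = v_pred @ v_gt / (np.linalg.norm(v_pred) * np.linalg.norm(v_gt) + eps)
    return float(np.arccos(np.clip(cos_t, -1.0, 1.0)))
```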
we add such a penalty constraint to 51 keypoints inside the face, such as at positions such as eyebrows, eyes, mouth, nose, etc., but not to 17 points on the face contour.
The input to the network's computation is in the transformed (frontalized) space; to obtain the keypoint positions on the original face image, a post-processing module must transform the network output back into the original space, which is done as follows.
S36: the face keypoint positions $S_{final}$ on the original image are obtained by applying the inverse of the frontalization transformation of S2 to the keypoint positions on the frontalized face obtained in S35; the inverse transformation is
$$S_{final} = T^{-1}(S_3)$$
where $S_3$ is the keypoint positions on the rectified face obtained in S35 and $T^{-1}$ is the inverse transformation matrix corresponding to the Procrustes analysis transformation T.
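A sketch of this inverse mapping, assuming T is the similarity transform (s, R, t) from the Procrustes alignment sketched in step S2:

```python
import numpy as np

def to_original_coords(kps_frontal, s, R, t):
    """Invert x = s * R @ x0 + t for row-vector keypoints (K x 2):
    x0 = R.T @ (x - t) / s."""
    return (kps_frontal - t) @ R / s
```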
s4: and after the convolutional network finishes the training of S3, inputting the test sample in S2 into the convolutional network to obtain the positions of the key points of the face in the image, wherein the test result meets the requirement to indicate that the training of the convolutional network is finished. After the training is finished, the face key point positioning can be carried out on the new face sample through the convolution network.
The methods of S1-S4 are applied to a specific example below to show the technical effects of the invention. The specific procedure of each step is as described above; in the embodiment the steps are therefore only summarized and not repeated in full, and only the specific parameters and implementation details of each step are given.
Embodiment:
This embodiment mainly comprises an input-image preprocessing module, a first-part coarse-localization module, a second-part fine-localization module and a third-part 2D coordinate regression module, through which the accurate localization result is finally obtained. The occlusion-robust face keypoint localization method comprises the following steps:
1. The data collection and annotation of step S1: in this embodiment the open-source face image data sets 300W, 300VW, COFW and AFLW are combined, some face images under occlusion and large poses (self-occlusion) are additionally collected, the face keypoints are annotated with the existing high-accuracy 3DDFA algorithm, and the results are then screened manually, with any inaccurate keypoints adjusted by hand.
2. The preprocessing of step S2: a face image and its keypoint ground-truth labels are read, the average face is first computed from the keypoint ground truth, the face region is then detected with MTCNN and the positions of 5 keypoints are localized, and the face is finally frontalized with a Procrustes analysis.
3. A deep convolutional network is built by cascading all the modules of FIG. 1, and the training samples of S2 are learned through this convolutional network as in step S3:
First, the feature-extraction part consists of 6 convolutional layers, with a max-pooling layer down-sampling after every 2 convolutional layers, so that the feature map finally produced by this part has a resolution of 28 × 28. The convolution kernels of this part are initialized from a VGG19 pre-trained model.
The first and second parts are fully convolutional networks comprising ordinary convolutional layers and point-wise convolutional layers; their convolution kernels are initialized with weights drawn from a random Gaussian distribution with an initialization variance of 0.005.
Previous keypoint localization algorithms apply supervision only to the keypoint regions and ignore the constraint of the background region. Therefore, besides constructing heat maps for the keypoints in the first and second stages of the network, a background-region heat map is introduced: mis-localizing a point of the background region (a non-keypoint) as a keypoint has a worse adverse effect than merely failing to localize a keypoint (i.e., judging it to be background), which in some cases may not matter much. The algorithm therefore constructs heat maps for the keypoints and also for the background region, and the joint constraint of the two makes the keypoint localization more accurate.
The third part comprises ordinary convolutional layers, point-wise convolutional layers and fully connected layers; its convolution kernels are initialized with Xavier initialization.
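A hedged sketch of this third part is given below; the channel widths, the hidden size of the first fully connected layer and the 28 × 28 input resolution are assumptions consistent with the description (point-wise convolution, two convolution-and-pooling blocks down to 7 × 7, then two fully connected layers producing the 2 × K offsets ΔS, with dropout before the last one).

```python
import torch.nn as nn

def coordinate_regression_head(in_ch, K, mid_ch=128):
    """Third part: point-wise conv -> two conv+pool blocks (28 -> 14 -> 7)
    -> two fully connected layers -> 2 * K keypoint offsets (Delta S)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                  # 28 -> 14
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                  # 14 -> 7
        nn.Flatten(),
        nn.Linear(mid_ch * 7 * 7, 1024), nn.ReLU(inplace=True),
        nn.Dropout(0.5),                                  # dropout before the last FC layer
        nn.Linear(1024, 2 * K),
    )
```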
In this embodiment each module of the deep convolutional network is prior art; the modules are rearranged and recombined, and the convolution kernel sizes, convolution strides and image padding are adjusted accordingly. Except for the pooling layers and the output layers, every layer in the network uses the ReLU activation function and batch normalization, which keeps the distributions of the training and test data fluctuating within a limited range, avoids vanishing and exploding gradients, and accelerates network convergence; as a regularization technique it also helps prevent overfitting and improves the generalization ability of the model. A dropout layer is added before the last fully connected layer of the 2D coordinate regression network.
In the third stage of the network, on top of the traditional L2 loss on the 2D keypoint coordinates, a positional constraint between adjacent points is introduced, namely the new angle-constraint term shown in FIG. 4.
The input to the network's computation is in the transformed space; to obtain the keypoint positions on the original face image, the post-processing module must transform the network output back into the original space, as in formula (6):
$$S_{final} = T^{-1}(S_3) \qquad (6)$$
where $S_3$ is the keypoint positions on the frontalized face obtained by adding ΔS, the output of the last fully connected layer of the third stage, to the fine-localization result, and $T^{-1}$ is the inverse transformation matrix.
Finally, the localization result of the corresponding face keypoints on the original image is obtained through the inverse similarity transformation.
4. After the convolutional network completes the training of S3, the test sample in S2 is input into the convolutional network, and the positions of the key points of the face in the image are obtained.
As can be seen from Table 1, adding the new angle-constraint term effectively reduces the NME (%) of face keypoint localization. FIG. 5 visualizes the influence of adding or omitting this constraint on the final face keypoint localization accuracy, and FIG. 6 shows the robustness of the algorithm to face keypoint localization under occlusion. Table 1 compares, on the 300W test sets, the average error (%) normalized by the diagonal of the face box with and without the new constraint, showing that the invention improves the accuracy of face keypoint localization under occlusion.
TABLE 1
300W test set                 challenge   common   public   private
With the new constraint         2.50       1.47     1.67     2.06
Without the new constraint      2.68       1.50     1.71     2.16
The embodiment described above is merely a preferred embodiment of the invention and should not be construed as limiting it. Those of ordinary skill in the art may make various changes and modifications without departing from the spirit and scope of the invention, and technical solutions obtained by equivalent substitution or equivalent transformation therefore fall within the scope of protection of the invention.

Claims (4)

1. An occlusion-robust face keypoint localization method, characterized by comprising the following steps:
S1: collecting face images annotated with keypoints as a training data set and a test data set;
S2: for the training and test data sets of S1, first performing face detection and 5-point keypoint localization with the MTCNN face detection algorithm, then frontalizing each face with a Procrustes analysis transformation T to form training and test samples; the rectified face samples are used as the input of stage S3;
S3: learning the training samples of S2 through a convolutional network, with the following specific steps:
S31: reading the face images and the corresponding keypoint positions in the training samples, and normalizing every input face image to 3 channels with a height of 224 pixels and a width of 224 pixels;
S32: extracting features from the read face images with a convolutional neural network, applying convolution, BN and ReLU activation, and down-sampling with max-pooling layers to extract a 28 × 28 × A multi-channel feature map, where A is the number of channels of the feature map;
S33: coarsely localizing on the extracted multi-channel feature map: during coarse localization the multi-channel feature map is convolved with convolution kernels of different sizes, and the visible face keypoints are supervised with a pixel-wise cross-entropy loss to obtain C keypoint heat maps of size 28 × 28, where C is the number of visible face keypoints; the pixel-wise cross-entropy loss is expressed as follows:
$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{(x,y)}\Big[M_i(x,y)\log \hat{M}_i(x,y) + \big(1-M_i(x,y)\big)\log\big(1-\hat{M}_i(x,y)\big)\Big]$$
where N is the number of visible face keypoints, $M_i(x,y)$ is the ground-truth pixel value at pixel $(x,y)$ on the i-th visible-keypoint heat map, and $\hat{M}_i(x,y)$ is the pixel value at $(x,y)$ on the i-th keypoint heat map output by the network after sigmoid activation;
in step S33, the ground-truth pixel values on the keypoint heat maps are generated as follows:
set the value of every pixel within a 4-pixel radius of a visible face keypoint to 1, set the pixels outside that radius to 0, and generate one such heat map for every visible face keypoint;
S34: fusing the keypoint heat maps obtained by coarse localization in S33 with the multi-channel feature map extracted in S32 along the channel dimension to obtain a first fused feature map, and performing fine localization; during fine localization the first fused feature map is convolved with convolution kernels of different sizes, and the face keypoints are supervised with an L2 loss on the pixel values at corresponding positions of the keypoint heat maps, yielding K + 1 keypoint heat maps of size 28 × 28, where K is the number of all visible and invisible face keypoints; the L2 loss is expressed as follows:
$$L_2 = \sum_{i=1}^{K}\sum_{(x,y)}\big(\hat{H}_i(x,y) - H_i(x,y)\big)^2 + \sum_{(x,y)}\big(\hat{H}_{bg}(x,y) - H_{bg}(x,y)\big)^2$$
where $\hat{H}_i$ is the i-th face keypoint heat map predicted by the network, $H_i$ is the i-th heat map generated from the keypoint ground truth, $i \in [1, 2, \ldots, K]$; $\hat{H}_{bg}$ is the background-region heat map predicted by the network and $H_{bg}$ is the heat map generated from the background-region ground truth; $(x, y)$ denotes the position of a pixel on the heat map;
in step S34, the face keypoint ground truth on the keypoint heat maps is generated as follows:
a Gaussian response is placed in the 3-pixel-radius region around each face keypoint, with the Gaussian response function
$$H(j,k) = \exp\!\left(-\frac{(j-x_g)^2}{2\sigma_x^2} - \frac{(k-y_g)^2}{2\sigma_y^2}\right)$$
where $(j, k)$ is the pixel index on the 28 × 28 × K heat maps, $(x_g, y_g)$ are the coordinates of the ground-truth face keypoint on the corresponding heat map, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian response in the x and y directions, respectively;
S35: fusing the multi-channel feature map extracted in S32 with the keypoint heat maps obtained by fine localization in S34 along the channel dimension to obtain a second fused feature map; first fusing the channel information of the second fused feature map through a point-wise convolution, then obtaining a 7 × 7 feature map through two convolution-and-pooling layers, and then obtaining a 2 × K-dimensional keypoint offset vector ΔS through two fully connected layers; adding ΔS to the face keypoint positions obtained by fine localization in S34, computing the residual with respect to the face keypoint ground truth, and adopting a 2D-coordinate L2 loss into which a new loss term is introduced; the loss function has the following form:
$$L_3 = \frac{\big\|S_{gt} - (\hat{S}_2 + \Delta S)\big\|_2}{D_{box}} + \sum_{\text{adjacent pairs}}\theta$$
where $S_{gt}$ is the 2D coordinate vector of the ground-truth face keypoint positions, $\hat{S}_2$ is the face keypoint positions obtained by fine localization in S34, and $D_{box}$, the length of the diagonal of the face box, serves as the normalization factor of the loss; θ is the angle between a line segment formed by two predicted adjacent keypoints and the line segment formed by the ground truth of the same adjacent keypoints; the angle θ is defined as follows:
$$\vec{d}_{pred} = \big(\hat{x}_j - \hat{x}_i,\; \hat{y}_j - \hat{y}_i\big), \qquad \vec{d}_{gt} = \big(x^{gt}_j - x^{gt}_i,\; y^{gt}_j - y^{gt}_i\big)$$
$$\theta = \arccos\frac{\vec{d}_{pred}\cdot\vec{d}_{gt}}{\|\vec{d}_{pred}\|\,\|\vec{d}_{gt}\|}$$
where $(\hat{x}_i, \hat{y}_i)$ and $(\hat{x}_j, \hat{y}_j)$ are the positions of two adjacent face keypoints predicted in S35, and $(x^{gt}_i, y^{gt}_i)$ and $(x^{gt}_j, y^{gt}_j)$ are the ground-truth positions of the same two adjacent keypoints; the four points define the angle θ between the two line segments;
S36: obtaining the face keypoint positions $S_{final}$ on the original image by applying the inverse of the frontalization transformation of S2 to the keypoint positions on the frontalized face obtained in S35; the inverse transformation is
$$S_{final} = T^{-1}(S_3)$$
where $S_3$ is the keypoint positions on the rectified face obtained in S35 and $T^{-1}$ is the inverse transformation matrix corresponding to the Procrustes analysis transformation T;
S4: after the convolutional network has completed the training of S3, inputting the test samples of S2 into the convolutional network to obtain the positions of the face keypoints in the image.
2. The occlusion-robust face keypoint localization method of claim 1, wherein in step S1 the keypoints are annotated as follows: the sample keypoint positions are first calibrated automatically with the 3DDFA algorithm and then adjusted manually.
3. The occlusion-robust face keypoint localization method of claim 1, wherein in step S33 there are 4 convolution kernels of different sizes: 3 × 3 × 256, 3 × 3 × 512, 1 × 1 × 512 and 1 × 1 × 68.
4. The occlusion-robust face keypoint localization method of claim 1, wherein in step S34 there are 4 convolution kernels of different sizes: 3 × 3 × 128, 3 × 3 × 256, 1 × 1 × 128 and 1 × 1 × 68.
CN201910061018.5A 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method Active CN109886121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910061018.5A CN109886121B (en) 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910061018.5A CN109886121B (en) 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method

Publications (2)

Publication Number Publication Date
CN109886121A CN109886121A (en) 2019-06-14
CN109886121B true CN109886121B (en) 2021-04-06

Family

ID=66926560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910061018.5A Active CN109886121B (en) 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method

Country Status (1)

Country Link
CN (1) CN109886121B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287846B (en) * 2019-06-19 2023-08-04 南京云智控产业技术研究院有限公司 Attention mechanism-based face key point detection method
CN110415171B (en) * 2019-07-08 2021-06-25 北京三快在线科技有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN110674730A (en) * 2019-09-20 2020-01-10 华南理工大学 Monocular-based face silence living body detection method
CN110909690B (en) * 2019-11-26 2023-03-31 电子科技大学 Method for detecting occluded face image based on region generation
CN111080576B (en) * 2019-11-26 2023-09-26 京东科技信息技术有限公司 Key point detection method and device and storage medium
CN111652240B (en) * 2019-12-18 2023-06-27 南京航空航天大学 CNN-based image local feature detection and description method
CN111046826B (en) * 2019-12-20 2023-07-04 北京碧拓科技有限公司 Method for positioning key points of far infrared thermal imaging face
CN110874587B (en) * 2019-12-26 2020-07-28 浙江大学 Face characteristic parameter extraction system
CN113128277A (en) * 2019-12-31 2021-07-16 Tcl集团股份有限公司 Generation method of face key point detection model and related equipment
CN111339883A (en) * 2020-02-19 2020-06-26 国网浙江省电力有限公司 Method for identifying and detecting abnormal behaviors in transformer substation based on artificial intelligence in complex scene
CN111723709B (en) * 2020-06-09 2023-07-11 大连海事大学 Fly face recognition method based on deep convolutional neural network
WO2021258588A1 (en) * 2020-06-24 2021-12-30 北京百度网讯科技有限公司 Face image recognition method, apparatus and device and storage medium
CN112287802A (en) * 2020-10-26 2021-01-29 汇纳科技股份有限公司 Face image detection method, system, storage medium and equipment
CN112464809B (en) * 2020-11-26 2023-06-06 北京奇艺世纪科技有限公司 Face key point detection method and device, electronic equipment and storage medium
CN112699837A (en) * 2021-01-13 2021-04-23 新大陆数字技术股份有限公司 Gesture recognition method and device based on deep learning
CN112884326A (en) * 2021-02-23 2021-06-01 无锡爱视智能科技有限责任公司 Video interview evaluation method and device based on multi-modal analysis and storage medium
CN113158870B (en) * 2021-04-15 2023-07-18 华南理工大学 Antagonistic training method, system and medium of 2D multi-person gesture estimation network
CN113298052B (en) * 2021-07-26 2021-10-15 浙江霖研精密科技有限公司 Human face detection device and method based on Gaussian attention and storage medium
CN113762136A (en) * 2021-09-02 2021-12-07 北京格灵深瞳信息技术股份有限公司 Face image occlusion judgment method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568026A (en) * 2011-12-12 2012-07-11 浙江大学 Three-dimensional enhancing realizing method for multi-viewpoint free stereo display
CN104866829A (en) * 2015-05-25 2015-08-26 苏州大学 Cross-age face verify method based on characteristic learning
WO2018093796A1 (en) * 2016-11-15 2018-05-24 Magic Leap, Inc. Deep learning system for cuboid detection
CN108229509A (en) * 2016-12-16 2018-06-29 北京市商汤科技开发有限公司 For identifying object type method for distinguishing and device, electronic equipment
CN108229488A (en) * 2016-12-27 2018-06-29 北京市商汤科技开发有限公司 For the method, apparatus and electronic equipment of detection object key point
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016370B (en) * 2017-04-10 2019-10-11 电子科技大学 A kind of partial occlusion face identification method based on data enhancing
CN108805040A (en) * 2018-05-24 2018-11-13 复旦大学 It is a kind of that face recognition algorithms are blocked based on piecemeal

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568026A (en) * 2011-12-12 2012-07-11 浙江大学 Three-dimensional enhancing realizing method for multi-viewpoint free stereo display
CN104866829A (en) * 2015-05-25 2015-08-26 苏州大学 Cross-age face verify method based on characteristic learning
WO2018093796A1 (en) * 2016-11-15 2018-05-24 Magic Leap, Inc. Deep learning system for cuboid detection
CN108229509A (en) * 2016-12-16 2018-06-29 北京市商汤科技开发有限公司 For identifying object type method for distinguishing and device, electronic equipment
CN108229488A (en) * 2016-12-27 2018-06-29 北京市商汤科技开发有限公司 For the method, apparatus and electronic equipment of detection object key point
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hand and face tracking under occlusion with anthropomorphic constraints; Abhishek Sen et al.; 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 2012-12-13; pp. 242-247 *
Research on face recognition technology based on deep learning (基于深度学习的人脸识别技术研究); Liu Shile (刘施乐); 《电子制作》 (Electronic Production); 2018-12-31; pp. 50-51, 96 *

Also Published As

Publication number Publication date
CN109886121A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109886121B (en) Occlusion-robust face keypoint localization method
US11556797B2 (en) Systems and methods for polygon object annotation and a method of training an object annotation system
Li et al. A deep learning method for change detection in synthetic aperture radar images
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN112287941B (en) License plate recognition method based on automatic character region perception
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Yan et al. Learning part-aware attention networks for kinship verification
Fang et al. Laser stripe image denoising using convolutional autoencoder
CN111652240A (en) Image local feature detection and description method based on CNN
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN111445496B (en) Underwater image recognition tracking system and method
Wang et al. Low-light image enhancement based on deep learning: a survey
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN108009512A (en) A kind of recognition methods again of the personage based on convolutional neural networks feature learning
Li et al. AEMS: an attention enhancement network of modules stacking for lowlight image enhancement
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN110020986B (en) Single-frame image super-resolution reconstruction method based on Euclidean subspace group double-remapping
Zeng et al. Masanet: Multi-angle self-attention network for semantic segmentation of remote sensing images
CN116758340A (en) Small target detection method based on super-resolution feature pyramid and attention mechanism
Vijayalakshmi K et al. Copy-paste forgery detection using deep learning with error level analysis
CN115439669A (en) Feature point detection network based on deep learning and cross-resolution image matching method
Bhattacharya et al. Simplified face quality assessment (sfqa)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant