CN109886121B - Occlusion-robust face keypoint localization method - Google Patents

Occlusion-robust face keypoint localization method Download PDF

Info

Publication number
CN109886121B
CN109886121B CN201910061018.5A CN201910061018A
Authority
CN
China
Prior art keywords
face
key point
key
pixel
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910061018.5A
Other languages
Chinese (zh)
Other versions
CN109886121A (en)
Inventor
吴思
王梁昊
李东晓
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201910061018.5A priority Critical patent/CN109886121B/en
Publication of CN109886121A publication Critical patent/CN109886121A/en
Application granted granted Critical
Publication of CN109886121B publication Critical patent/CN109886121B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an occlusion-robust face keypoint localization method, belonging to the field of face keypoint localization. The method comprises the following steps: S1: collecting face images annotated with keypoints as a training data set and a test data set; S2: for the training and test data sets of S1, first performing face detection and 5-point keypoint localization with the MTCNN face detection algorithm, then frontalizing each face with a Procrustes analysis transformation T to obtain rectified face samples, which form the training and test samples; the rectified face samples are used as the input of stage S3; S3: learning the training samples of S2 with a convolutional network; S4: after the convolutional network has completed the training of S3, inputting the test samples of S2 into the network to obtain the positions of the face keypoints in the image. The method is robust for face keypoint localization under occlusion, and its effectiveness is demonstrated by the test results and the qualitative localization results on the corresponding pictures.

Description

Occlusion-robust face keypoint localization method
Technical Field
The invention belongs to the field of face keypoint localization, and in particular relates to an occlusion-robust face keypoint localization method.
Background
Face keypoint localization technology works on face images: for an input image or video stream it first detects whether a face is present and, if so, further gives the position of each face and of its keypoints, such as the nose, eyes, mouth and the other main facial organs. The technology can be used for face pose correction, pose recognition, fatigue monitoring, 3D face reconstruction, face animation, face recognition, expression analysis and so on. Incorrect keypoint localization distorts and deforms the face, so an algorithm that can accurately extract face keypoints is very important. The whole pipeline comprises face detection, face preprocessing, face keypoint localization and other modules.
Current face keypoint localization algorithms fall mainly into three classes: generative-model methods such as ASM and AAM together with their extensions, cascaded shape regression methods, and deep learning methods. Deep learning is the most widely studied of the three, so the invention mainly concerns the application of deep learning to face keypoint localization.
The problems of current face keypoint localization algorithms include:
1. Acquisition and curation of training data. Training a deep learning model relies on a large number of valid face samples to improve keypoint localization accuracy. In principle, the richer the training data, the stronger the generalization ability of the trained model.
2. Design of the deep learning network model. The network model is one of the key factors affecting keypoint localization accuracy. The more reasonable the network design, the richer the features it can extract, and usually the higher the final localization accuracy, so the network structure needs to be designed and optimized carefully.
3. Choice of the training loss function. The loss function is crucial for training the network: a good loss function guides the network towards reasonable optima, and designing a loss function suited to the specific task is also key to solving the problem.
Although many relatively mature algorithms exist for face keypoint localization at moderate poses, keypoint localization under occlusion still leaves considerable room for improvement. Addressing the three problems above, the invention discloses an occlusion-robust face keypoint localization method that localizes occluded keypoints by designing a cascaded neural network framework and exploiting the shape constraints among keypoints together with the semantic information of neighbouring regions.
Disclosure of Invention
The invention aims to solve the above problems in the prior art and provides an occlusion-robust face keypoint localization method.
The invention adopts the following specific technical scheme:
An occlusion-robust face keypoint localization method comprises the following steps:
S1: collecting face images annotated with keypoints as a training data set and a test data set;
S2: for the training and test data sets of S1, first performing face detection and 5-point keypoint localization with the MTCNN face detection algorithm, then frontalizing each face with a Procrustes analysis transformation T to form training and test samples; the rectified face samples are used as the input of stage S3;
S3: learning the training samples of S2 through a convolutional network, with the following specific steps:
S31: reading the face images and the corresponding keypoint positions in the training samples, and normalizing every input face image to 3 channels with a height of 224 pixels and a width of 224 pixels;
S32: extracting features from the read face images with a convolutional neural network, applying convolution, BN and ReLU activation, and down-sampling with max-pooling layers to extract a 28 × 28 × A multi-channel feature map, where A is the number of channels of the feature map;
S33: coarsely localizing on the extracted multi-channel feature map: during coarse localization the multi-channel feature map is convolved with convolution kernels of different sizes, and the visible face keypoints are supervised with a pixel-wise cross-entropy loss to obtain C keypoint heat maps of size 28 × 28, where C is the number of visible face keypoints; the pixel-wise cross-entropy loss is expressed as follows:
$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{(x,y)}\Big[M_i(x,y)\log \hat{M}_i(x,y) + \big(1-M_i(x,y)\big)\log\big(1-\hat{M}_i(x,y)\big)\Big]$$
where N is the number of visible face keypoints, $M_i(x,y)$ is the ground-truth pixel value at pixel $(x,y)$ on the i-th visible-keypoint heat map, and $\hat{M}_i(x,y)$ is the pixel value at $(x,y)$ on the i-th keypoint heat map output by the network after sigmoid activation;
S34: fusing the keypoint heat maps obtained by coarse localization in S33 with the multi-channel feature map extracted in S32 along the channel dimension to obtain a first fused feature map, and performing fine localization; during fine localization the first fused feature map is convolved with convolution kernels of different sizes, and the face keypoints are supervised with an L2 loss on the pixel values at corresponding positions of the keypoint heat maps, yielding K + 1 keypoint heat maps of size 28 × 28, where K is the number of all visible and invisible face keypoints; the L2 loss is expressed as follows:
$$L_2 = \sum_{i=1}^{K}\sum_{(x,y)}\big(\hat{H}_i(x,y) - H_i(x,y)\big)^2 + \sum_{(x,y)}\big(\hat{H}_{bg}(x,y) - H_{bg}(x,y)\big)^2$$
where $\hat{H}_i$ is the i-th face keypoint heat map predicted by the network, $H_i$ is the i-th heat map generated from the keypoint ground truth, $i \in [1, 2, \ldots, K]$; $\hat{H}_{bg}$ is the background-region heat map predicted by the network and $H_{bg}$ is the heat map generated from the background-region ground truth; $(x, y)$ denotes the position of a pixel on the heat map;
S35: fusing the multi-channel feature map extracted in S32 with the keypoint heat maps obtained by fine localization in S34 along the channel dimension to obtain a second fused feature map; first fusing the channel information of the second fused feature map through a point-wise convolution, then obtaining a 7 × 7 feature map through two convolution-and-pooling layers, and then obtaining a 2 × K-dimensional keypoint offset vector ΔS through two fully connected layers; adding ΔS to the face keypoint positions obtained by fine localization in S34, computing the residual with respect to the face keypoint ground truth, and adopting a 2D-coordinate L2 loss into which a new loss term is introduced; the loss function has the following form:
$$L_3 = \frac{\big\|S_{gt} - (\hat{S}_2 + \Delta S)\big\|_2}{D_{box}} + \sum_{\text{adjacent pairs}}\theta$$
where $S_{gt}$ is the 2D coordinate vector of the ground-truth face keypoint positions, $\hat{S}_2$ is the face keypoint positions obtained by fine localization in S34, and $D_{box}$, the length of the diagonal of the face box, serves as the normalization factor of the loss; θ is the angle between a line segment formed by two predicted adjacent keypoints and the line segment formed by the ground truth of the same adjacent keypoints; the angle θ is defined as follows:
$$\vec{d}_{pred} = \big(\hat{x}_j - \hat{x}_i,\; \hat{y}_j - \hat{y}_i\big), \qquad \vec{d}_{gt} = \big(x^{gt}_j - x^{gt}_i,\; y^{gt}_j - y^{gt}_i\big)$$
$$\theta = \arccos\frac{\vec{d}_{pred}\cdot\vec{d}_{gt}}{\|\vec{d}_{pred}\|\,\|\vec{d}_{gt}\|}$$
where $(\hat{x}_i, \hat{y}_i)$ and $(\hat{x}_j, \hat{y}_j)$ are the positions of two adjacent face keypoints predicted in S35, and $(x^{gt}_i, y^{gt}_i)$ and $(x^{gt}_j, y^{gt}_j)$ are the ground-truth positions of the same two adjacent keypoints; the four points define the angle θ between the two line segments;
S36: obtaining the face keypoint positions $S_{final}$ on the original image by applying the inverse of the frontalization transformation of S2 to the keypoint positions on the frontalized face obtained in S35; the inverse transformation is
$$S_{final} = T^{-1}(S_3)$$
where $S_3$ is the keypoint positions on the rectified face obtained in S35 and $T^{-1}$ is the inverse transformation matrix corresponding to the Procrustes analysis transformation T;
S4: after the convolutional network has completed the training of S3, inputting the test samples of S2 into the convolutional network to obtain the positions of the face keypoints in the image.
Preferably, in step S1 the keypoints are annotated as follows: the sample keypoint positions are first calibrated automatically with the 3DDFA algorithm and then adjusted manually.
Preferably, in step S33 there are 4 convolution kernels of different sizes: 3 × 3 × 256, 3 × 3 × 512, 1 × 1 × 512 and 1 × 1 × 68.
Preferably, in step S34 there are 4 convolution kernels of different sizes: 3 × 3 × 128, 3 × 3 × 256, 1 × 1 × 128 and 1 × 1 × 68.
Preferably, in step S33 the ground-truth pixel values on the keypoint heat maps are generated as follows:
set the value of every pixel within a 4-pixel radius of a visible face keypoint to 1, set the pixels outside that radius to 0, and generate one such heat map for every visible face keypoint.
Preferably, in step S34 the face keypoint ground truth on the keypoint heat maps is generated as follows:
a Gaussian response is placed in the 3-pixel-radius region around each face keypoint, with the Gaussian response function
$$H(j,k) = \exp\!\left(-\frac{(j-x_g)^2}{2\sigma_x^2} - \frac{(k-y_g)^2}{2\sigma_y^2}\right)$$
where $(j, k)$ is the pixel index on the 28 × 28 × K heat maps, $(x_g, y_g)$ are the coordinates of the ground-truth face keypoint on the corresponding heat map, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian response in the x and y directions, respectively.
The innovation of the method lies in redesigning and recombining the individual CNN modules into a brand-new occlusion-robust face keypoint localization framework and introducing a new loss function, which together improve the localization accuracy of face keypoints under occlusion. The method is robust for face keypoint localization under occlusion, and its effectiveness is demonstrated by the test results and the qualitative localization results on the corresponding pictures.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is the regression-target ground truth adopted for coarse localization;
FIG. 3 is the regression-target ground truth adopted for fine localization;
FIG. 4 is the new angle constraint introduced in the 2D coordinate regression;
FIG. 5 shows the keypoint localization results of the 2D coordinate regression network on a face image with and without the new angle constraint;
FIG. 6 is the key point location result of the algorithm of the present invention on some occluded pictures.
Detailed Description
The invention is further explained and described below with reference to the drawings and the detailed description. The technical features of the individual embodiments of the invention can be combined with one another as long as they do not conflict.
In the invention, the occlusion-robust face keypoint localization method comprises the following steps:
S1: a certain number of face images are collected as a training data set and a test data set. The face images must be annotated with keypoints beforehand; the annotation method is: the keypoint positions of each face image are first calibrated automatically with the 3DDFA algorithm, then screened manually, and any inaccurate points are adjusted by hand. When the training data set is too small, data augmentation can be applied to enlarge it; the test samples are not augmented. Augmentation of the training data set includes Gaussian blur (noise standard deviation 2), rotation in [-30, 30] degrees, scaling in [0.8, 1.2], horizontal image flipping, contrast enhancement and colour jittering (to simulate illumination changes).
It should be noted that after an image is flipped horizontally, the labels of the corresponding face keypoints must be remapped accordingly.
S2: for the training and test data sets of S1, face detection and 5-point keypoint localization are performed with the MTCNN face detection algorithm, and then each face is frontalized with a Procrustes analysis transformation T to form the training and test samples. The rectified face samples are used as the input of stage S3 and of stage S4.
Through this transformation all samples are aligned to an average shape, so the subsequent network learning can concentrate on the keypoint features and on the non-rigid variations of the face, such as pose, expression and occlusion, which benefits the accuracy of the final localization.
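A minimal sketch of the similarity (Procrustes) alignment used here is given below, assuming the standard Umeyama least-squares solution between the 5 detected keypoints and a mean 5-point shape; the function names and the use of NumPy are illustrative, not taken from the patent.

```python
import numpy as np

def procrustes_similarity(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    with dst ≈ s * src @ R.T + t (Umeyama); src, dst are N x 2 point sets,
    e.g. the 5 MTCNN keypoints and the mean 5-point shape."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[1, 1] = -1.0                       # guard against reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * (R @ mu_s)
    return s, R, t

# The transform T = (s, R, t) warps the image and its labels to the mean shape,
# e.g. with cv2.warpAffine using M = np.hstack([s * R, t[:, None]]).
```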
S3: learning the training samples in the S2 through a convolutional network, which comprises the following specific steps:
s31: reading face images and corresponding key point positions in a training sample, and normalizing each input face image into 3 channels, wherein the height of each input face image is 224 pixels, and the width of each input face image is 224 pixels;
s32: extracting features of the read face image by adopting a convolutional neural network, performing convolution, BN and relu activation, and performing down-sampling by combining a maximum pooling layer to extract a 28 multiplied by A multi-channel feature map, wherein A is the number of channels in the feature map;
the feature extraction section generates a feature map containing structural and semantic information pixel by pixel for a subsequent localization task. For the occlusion problem, context semantic information is needed to locate the position of the occluded key point, so the size of the receptive field is important. Large convolution kernels, e.g. 5 x 5, 7 x 7 convolution kernels, can be considered, but this significantly increases the number of parameters and increases the time for model inference. Through a plurality of maximum pooling layers, the resolution of the feature map is gradually reduced, the sliding of convolution kernels is performed on a small feature map, namely the convolution is performed on a large original map by using a large convolution kernel, the fact that the receptive field is enlarged is also equivalent to the introduction of position constraint among key points, and the robustness under the shielding condition is improved. And because the resolution of the characteristic diagram is small, the sliding times of each convolution kernel are few, and the forward propagation time can be effectively reduced.
S33: and performing coarse positioning on the extracted multi-channel feature map, performing convolution on the multi-channel feature map by convolution kernels with different sizes during coarse positioning, and supervising the visible human face key points by adopting a pixel-by-pixel cross entropy loss function to obtain C pieces of key point heat maps of 28 multiplied by 28, wherein C represents the number of the visible human face key points. The convolution kernels of the different sizes are 4 types in the present example, and the sizes are 3 × 3 × 256, 3 × 3 × 512, 1 × 1 × 512, and 1 × 1 × 68, respectively. The expression of the above pixel-by-pixel cross entropy loss function is as follows:
Figure BDA0001954113430000061
where N is the number of visible face keypoints,
Figure BDA0001954113430000062
representative is the ith visible facePixel true value at pixel point (x, y) on the keypoint heat map;
Figure BDA0001954113430000063
is the value after sigmoid activation is performed on the pixel value at (x, y) on the ith key point heat map output by the network.
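A hedged PyTorch sketch of this supervision follows; masking the occluded keypoints out of the loss and the exact normalization over the N visible keypoints are assumptions consistent with the description, not copied from the patent's implementation.

```python
import torch
import torch.nn.functional as F

def coarse_heatmap_loss(logits, target, visible):
    """Pixel-wise cross-entropy over visible-keypoint heat maps.
    logits:  (B, C, 28, 28) raw network output (sigmoid applied inside the loss)
    target:  (B, C, 28, 28) binary ground truth (1 inside the 4-pixel disk)
    visible: (B, C) mask, 1 for visible keypoints, 0 for occluded ones."""
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    loss = loss.mean(dim=(2, 3)) * visible           # per-map mean, occluded maps masked
    return loss.sum() / visible.sum().clamp(min=1)   # average over visible keypoints
```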
During the training of this step, the keypoint positions are always predicted from the whole face region rather than from local image patches around the keypoints, which introduces a global shape constraint and avoids falling into local optima. The regression target of this part is not represented directly as 2D coordinates but is converted into 28 × 28 × C keypoint heat maps. The ground-truth pixel values on the keypoint heat maps are generated as follows: the value of every pixel within a 4-pixel radius of a visible face keypoint is set to 1, the pixels outside that radius are set to 0, and one such heat map is generated for every visible face keypoint. The ground-truth heat map generated in this stage is shown in FIG. 2 (enlarged for clarity of display).
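A minimal sketch of this ground-truth construction (heat-map size and radius as stated in the text, coordinate convention assumed):

```python
import numpy as np

def binary_disk_heatmap(kp, size=28, radius=4):
    """Coarse-stage ground truth: 1 within `radius` pixels of the keypoint,
    0 elsewhere. kp = (x, y) in 28 x 28 heat-map coordinates."""
    ys, xs = np.mgrid[0:size, 0:size]
    dist2 = (xs - kp[0]) ** 2 + (ys - kp[1]) ** 2
    return (dist2 <= radius ** 2).astype(np.float32)
```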
No supervision is applied to the occluded keypoints in this stage, so on occluded keypoints the heat maps output by the coarse-localization network show a blurry, cloud-like response around the true position, with only small response values at a few points and 0 elsewhere. At visible keypoints, in contrast, the response is strong and its extent is small.
S34: the keypoint heat maps obtained by coarse localization in S33 and the multi-channel feature map extracted in S32 are fused along the channel dimension into a first fused feature map, and fine localization is performed. This multi-channel form guides the fine localization to focus on the inherent relations between the keypoints, so in the fine-localization network the invisible keypoints can rely on the visible keypoints to provide contextual semantic information for more accurate localization.
During fine localization, convolution kernels of different sizes are convolved with the first fused feature map, and the face keypoints are supervised with an L2 loss on the pixel values at corresponding positions of the keypoint heat maps, yielding K + 1 keypoint heat maps of size 28 × 28, where K is the number of all visible and invisible face keypoints. In this step there are 4 convolution kernels of different sizes: 3 × 3 × 128, 3 × 3 × 256, 1 × 1 × 128 and 1 × 1 × 68. The L2 loss is expressed as follows:
$$L_2 = \sum_{i=1}^{K}\sum_{(x,y)}\big(\hat{H}_i(x,y) - H_i(x,y)\big)^2 + \sum_{(x,y)}\big(\hat{H}_{bg}(x,y) - H_{bg}(x,y)\big)^2$$
where $\hat{H}_i$ is the i-th face keypoint heat map predicted by the network, $H_i$ is the i-th heat map generated from the keypoint ground truth, $i \in [1, 2, \ldots, K]$; $\hat{H}_{bg}$ is the background-region heat map predicted by the network and $H_{bg}$ is the heat map generated from the background-region ground truth; $(x, y)$ denotes the position of a pixel on the heat map.
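A sketch of this supervision is shown below; constructing the background target as one minus the maximum over the keypoint maps is an assumption (the patent only states that a background ground-truth heat map is used), and the reduction over pixels and channels is illustrative.

```python
import torch

def fine_heatmap_loss(pred, target_kp):
    """L2 supervision over K keypoint heat maps plus one background map.
    pred:      (B, K + 1, 28, 28) network output, channel K is the background map
    target_kp: (B, K, 28, 28) Gaussian ground-truth maps for the K keypoints."""
    bg = 1.0 - target_kp.max(dim=1, keepdim=True).values   # assumed background target
    target = torch.cat([target_kp, bg], dim=1)
    return ((pred - target) ** 2).sum(dim=(2, 3)).mean()
```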
The face keypoint ground truth on the keypoint heat maps is generated as follows:
a Gaussian response is placed in the 3-pixel-radius region around each face keypoint, with the Gaussian response function
$$H(j,k) = \exp\!\left(-\frac{(j-x_g)^2}{2\sigma_x^2} - \frac{(k-y_g)^2}{2\sigma_y^2}\right)$$
where $(j, k)$ is the pixel index on the 28 × 28 × K heat maps, $(x_g, y_g)$ are the coordinates of the ground-truth face keypoint on the corresponding heat map, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian response in the x and y directions, respectively. Here the standard deviation is set to 3. The farther a pixel is from the centre point, the lower the response value; the response value at the centre point is 1.
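A minimal sketch of this Gaussian ground truth (σ = 3 as in the text; whether the response is truncated outside the 3-pixel radius is left out here):

```python
import numpy as np

def gaussian_heatmap(kp, size=28, sigma=3.0):
    """Fine-stage ground truth: unnormalized Gaussian centred on the keypoint,
    equal to 1 at the centre and decaying with distance."""
    ys, xs = np.mgrid[0:size, 0:size]
    g = np.exp(-((xs - kp[0]) ** 2) / (2 * sigma ** 2)
               - ((ys - kp[1]) ** 2) / (2 * sigma ** 2))
    return g.astype(np.float32)
```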
The ground truth generated for the fine-localization stage is shown in FIG. 3 (enlarged for clarity of display). Here a heat map is generated for every keypoint and for the background region. This amounts to weighting keypoints and background differently: localizing a point of the background region as a keypoint is penalized more heavily, because its adverse effect is greater than merely failing to localize a keypoint (i.e., judging it to be background), which in some cases may not matter much.
Through this stage we obtain heat maps for all visible and invisible keypoints. Compared with the keypoint heat maps produced by the first-stage coarse localization, the heat maps output by the second-stage fine localization respond over a smaller region around the keypoints; in particular, for the invisible keypoints the response values are clearly stronger than in the first stage and the response region shrinks, so the localization is more accurate.
The heat map serves as the representation of the regression target and encodes both the spatial position of each keypoint and its confidence. A keypoint heat map also passes the positions predicted at the current stage on to the next stage, which, given more features and information, further adjusts the keypoint positions and achieves accurate localization.
Whereas in the first-stage coarse localization the confidence of the occluded points is lower than that of the visible keypoints, in the second-stage fine localization the invisible points are predicted from the context information of the visible points. A background-region heat map is also introduced, in which the value of each pixel represents the probability that the pixel belongs to the background.
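For illustration, a position and a confidence can be read out of a single heat map with a simple argmax, a sketch not described explicitly in the patent (sub-pixel accuracy is instead recovered by the third-stage 2D coordinate regression):

```python
import numpy as np

def decode_heatmap(hm):
    """Return ((x, y), confidence) for one 28 x 28 heat map via argmax."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    return (float(x), float(y)), float(hm[y, x])
```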
The networks of these two stages extend well: no matter how many face keypoints are annotated in a data set, a corresponding number of heat maps can be generated, so the method has good compatibility. Because directly fusing the low-level features, which carry much irrelevant information (noise), with the later high-level features would cause interference, a regularization term is added to the convolution kernels when the feature map of the feature-extraction part is fused, via point-wise convolution, with the heat maps output by the first and second parts of the network: the kernel regularization suppresses the noise in the low-level features, after which the low-level and high-level features are fused.
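A hedged PyTorch sketch of this channel-wise fusion via point-wise (1 × 1) convolution is given below; the channel counts are placeholders, and expressing the kernel regularization as optimizer weight decay is an assumption about how the constraint is applied.

```python
import torch
import torch.nn as nn

class PointwiseFusion(nn.Module):
    """Concatenate backbone features with a stage's heat maps along the channel
    axis, then mix channels with a 1 x 1 (point-wise) convolution."""
    def __init__(self, feat_ch, hm_ch, out_ch):
        super().__init__()
        self.mix = nn.Conv2d(feat_ch + hm_ch, out_ch, kernel_size=1)

    def forward(self, feats, heatmaps):
        return torch.relu(self.mix(torch.cat([feats, heatmaps], dim=1)))

# The kernel regularization could be realized as L2 weight decay, e.g.
# torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=5e-4)
```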
S35: and fusing the multi-channel feature map extracted in the S32 and the key point heat map obtained by fine positioning in the S34 together according to the channel direction to obtain a second fused feature map. The reason for introducing this step is that global regression learning can effectively perform global shape constraint and reduce errors due to local appearance blurring, in the first place. Secondly, since indexes of pixel points on the feature map are integers, and real key point coordinates are floating point type decimal numbers, in order to further improve positioning accuracy, binary coordinate regression of a third part is introduced after fine positioning of the second part, and the specific method is described as follows: firstly, fusing channel information of a second fusion characteristic diagram through point-wise convolution, then obtaining a 7 multiplied by 7 characteristic diagram through two convolution and pooling layers, and then obtaining K characteristic vectors delta S of key points in 2 multiplied by K dimensions through two full-connection layers; adding the delta S and the position of the human face key point obtained by fine positioning in the S34, performing residual calculation with a human face key point true value, and introducing a new loss term by adopting a loss function which is a binary coordinate L2 loss, wherein the loss function has the following form:
$$L_3 = \frac{\big\|S_{gt} - (\hat{S}_2 + \Delta S)\big\|_2}{D_{box}} + \sum_{\text{adjacent pairs}}\theta$$
where $S_{gt}$ is the 2D coordinate vector of the ground-truth face keypoint positions, $\hat{S}_2$ is the face keypoint positions obtained by fine localization in S34, and $D_{box}$, the length of the diagonal of the face box, serves as the normalization factor of the loss; θ is the angle between a line segment formed by two predicted adjacent keypoints and the line segment formed by the ground truth of the same adjacent keypoints, as shown in FIG. 4. The angle θ is defined as follows:
$$\vec{d}_{pred} = \big(\hat{x}_j - \hat{x}_i,\; \hat{y}_j - \hat{y}_i\big), \qquad \vec{d}_{gt} = \big(x^{gt}_j - x^{gt}_i,\; y^{gt}_j - y^{gt}_i\big)$$
$$\theta = \arccos\frac{\vec{d}_{pred}\cdot\vec{d}_{gt}}{\|\vec{d}_{pred}\|\,\|\vec{d}_{gt}\|}$$
where $(\hat{x}_i, \hat{y}_i)$ and $(\hat{x}_j, \hat{y}_j)$ are the positions of two adjacent face keypoints predicted in S35, and $(x^{gt}_i, y^{gt}_i)$ and $(x^{gt}_j, y^{gt}_j)$ are the ground-truth positions of the same two adjacent keypoints; the four points define the angle θ between the two line segments.
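A minimal sketch of the angle computation for one pair of adjacent keypoints (how the angles are weighted inside the total loss follows the formula above):

```python
import numpy as np

def segment_angle(p1, p2, g1, g2, eps=1e-8):
    """Angle theta between the predicted segment p1->p2 and the ground-truth
    segment g1->g2 of two adjacent keypoints."""
    v_pred = np.asarray(p2, float) - np.asarray(p1, float)
    v_gt = np.asarray(g2, float) - np.asarray(g1, float)
    cos_t = v_pred @ v_gt / (np.linalg.norm(v_pred) * np.linalg.norm(v_gt) + eps)
    return float(np.arccos(np.clip(cos_t, -1.0, 1.0)))
```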
we add such a penalty constraint to 51 keypoints inside the face, such as at positions such as eyebrows, eyes, mouth, nose, etc., but not to 17 points on the face contour.
The input to the network's computation is in the transformed (frontalized) space; to obtain the keypoint positions on the original face image, a post-processing module must transform the network output back into the original space, which is done as follows.
S36: the face keypoint positions $S_{final}$ on the original image are obtained by applying the inverse of the frontalization transformation of S2 to the keypoint positions on the frontalized face obtained in S35; the inverse transformation is
$$S_{final} = T^{-1}(S_3)$$
where $S_3$ is the keypoint positions on the rectified face obtained in S35 and $T^{-1}$ is the inverse transformation matrix corresponding to the Procrustes analysis transformation T.
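A sketch of this inverse mapping, assuming T is the similarity transform (s, R, t) from the Procrustes alignment sketched in step S2:

```python
import numpy as np

def to_original_coords(kps_frontal, s, R, t):
    """Invert x = s * R @ x0 + t for row-vector keypoints (K x 2):
    x0 = R.T @ (x - t) / s."""
    return (kps_frontal - t) @ R / s
```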
s4: and after the convolutional network finishes the training of S3, inputting the test sample in S2 into the convolutional network to obtain the positions of the key points of the face in the image, wherein the test result meets the requirement to indicate that the training of the convolutional network is finished. After the training is finished, the face key point positioning can be carried out on the new face sample through the convolution network.
The methods of S1-S4 are applied to a specific example below to show the technical effects of the invention. The specific procedure of each step is as described above; in the embodiment the steps are therefore only summarized and not repeated in full, and only the specific parameters and implementation details of each step are given.
Embodiment:
This embodiment mainly comprises an input-image preprocessing module, a first-part coarse-localization module, a second-part fine-localization module and a third-part 2D coordinate regression module, through which the accurate localization result is finally obtained. The occlusion-robust face keypoint localization method comprises the following steps:
1. The data collection and annotation of step S1: in this embodiment the open-source face image data sets 300W, 300VW, COFW and AFLW are combined, some face images under occlusion and large poses (self-occlusion) are additionally collected, the face keypoints are annotated with the existing high-accuracy 3DDFA algorithm, and the results are then screened manually, with any inaccurate keypoints adjusted by hand.
2. The preprocessing of step S2: a face image and its keypoint ground-truth labels are read, the average face is first computed from the keypoint ground truth, the face region is then detected with MTCNN and the positions of 5 keypoints are localized, and the face is finally frontalized with a Procrustes analysis.
3. A deep convolutional network is built by cascading all the modules of FIG. 1, and the training samples of S2 are learned through this convolutional network as in step S3:
First, the feature-extraction part consists of 6 convolutional layers, with a max-pooling layer down-sampling after every 2 convolutional layers, so that the feature map finally produced by this part has a resolution of 28 × 28. The convolution kernels of this part are initialized from a VGG19 pre-trained model.
The first and second parts are fully convolutional networks comprising ordinary convolutional layers and point-wise convolutional layers; their convolution kernels are initialized with weights drawn from a random Gaussian distribution with an initialization variance of 0.005.
Previous keypoint localization algorithms apply supervision only to the keypoint regions and ignore the constraint of the background region. Therefore, besides constructing heat maps for the keypoints in the first and second stages of the network, a background-region heat map is introduced: mis-localizing a point of the background region (a non-keypoint) as a keypoint has a worse adverse effect than merely failing to localize a keypoint (i.e., judging it to be background), which in some cases may not matter much. The algorithm therefore constructs heat maps for the keypoints and also for the background region, and the joint constraint of the two makes the keypoint localization more accurate.
The third part comprises ordinary convolutional layers, point-wise convolutional layers and fully connected layers; its convolution kernels are initialized with Xavier initialization.
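A hedged sketch of this third part is given below; the channel widths, the hidden size of the first fully connected layer and the 28 × 28 input resolution are assumptions consistent with the description (point-wise convolution, two convolution-and-pooling blocks down to 7 × 7, then two fully connected layers producing the 2 × K offsets ΔS, with dropout before the last one).

```python
import torch.nn as nn

def coordinate_regression_head(in_ch, K, mid_ch=128):
    """Third part: point-wise conv -> two conv+pool blocks (28 -> 14 -> 7)
    -> two fully connected layers -> 2 * K keypoint offsets (Delta S)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                  # 28 -> 14
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                                  # 14 -> 7
        nn.Flatten(),
        nn.Linear(mid_ch * 7 * 7, 1024), nn.ReLU(inplace=True),
        nn.Dropout(0.5),                                  # dropout before the last FC layer
        nn.Linear(1024, 2 * K),
    )
```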
In this embodiment each module of the deep convolutional network is prior art; the modules are rearranged and recombined, and the convolution kernel sizes, convolution strides and image padding are adjusted accordingly. Except for the pooling layers and the output layers, every layer in the network uses the ReLU activation function and batch normalization, which keeps the distributions of the training and test data fluctuating within a limited range, avoids vanishing and exploding gradients, and accelerates network convergence; as a regularization technique it also helps prevent overfitting and improves the generalization ability of the model. A dropout layer is added before the last fully connected layer of the 2D coordinate regression network.
In the third stage of the network, on top of the traditional L2 loss on the 2D keypoint coordinates, a positional constraint between adjacent points is introduced, namely the new angle-constraint term shown in FIG. 4.
The input to the network's computation is in the transformed space; to obtain the keypoint positions on the original face image, the post-processing module must transform the network output back into the original space, as in formula (6):
$$S_{final} = T^{-1}(S_3) \qquad (6)$$
where $S_3$ is the keypoint positions on the frontalized face obtained by adding ΔS, the output of the last fully connected layer of the third stage, to the fine-localization result, and $T^{-1}$ is the inverse transformation matrix.
Finally, the localization result of the corresponding face keypoints on the original image is obtained through the inverse similarity transformation.
4. After the convolutional network completes the training of S3, the test sample in S2 is input into the convolutional network, and the positions of the key points of the face in the image are obtained.
As can be seen from Table 1, adding the new angle-constraint term effectively reduces the NME (%) of face keypoint localization. FIG. 5 visualizes the influence of adding or omitting this constraint on the final face keypoint localization accuracy, and FIG. 6 shows the robustness of the algorithm to face keypoint localization under occlusion. Table 1 compares, on the 300W test sets, the average error (%) normalized by the diagonal of the face box with and without the new constraint, showing that the invention improves the accuracy of face keypoint localization under occlusion.
TABLE 1
300W test set                 challenge   common   public   private
With the new constraint         2.50       1.47     1.67     2.06
Without the new constraint      2.68       1.50     1.71     2.16
The embodiment described above is merely a preferred embodiment of the invention and should not be construed as limiting it. Those of ordinary skill in the art may make various changes and modifications without departing from the spirit and scope of the invention, and technical solutions obtained by equivalent substitution or equivalent transformation therefore fall within the scope of protection of the invention.

Claims (4)

1. An occlusion-robust face keypoint localization method, characterized by comprising the following steps:
S1: collecting face images annotated with keypoints as a training data set and a test data set;
S2: for the training and test data sets of S1, first performing face detection and 5-point keypoint localization with the MTCNN face detection algorithm, then frontalizing each face with a Procrustes analysis transformation T to form training and test samples; the rectified face samples are used as the input of stage S3;
S3: learning the training samples of S2 through a convolutional network, with the following specific steps:
S31: reading the face images and the corresponding keypoint positions in the training samples, and normalizing every input face image to 3 channels with a height of 224 pixels and a width of 224 pixels;
S32: extracting features from the read face images with a convolutional neural network, applying convolution, BN and ReLU activation, and down-sampling with max-pooling layers to extract a 28 × 28 × A multi-channel feature map, where A is the number of channels of the feature map;
S33: coarsely localizing on the extracted multi-channel feature map: during coarse localization the multi-channel feature map is convolved with convolution kernels of different sizes, and the visible face keypoints are supervised with a pixel-wise cross-entropy loss to obtain C keypoint heat maps of size 28 × 28, where C is the number of visible face keypoints; the pixel-wise cross-entropy loss is expressed as follows:
$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{(x,y)}\Big[M_i(x,y)\log \hat{M}_i(x,y) + \big(1-M_i(x,y)\big)\log\big(1-\hat{M}_i(x,y)\big)\Big]$$
where N is the number of visible face keypoints, $M_i(x,y)$ is the ground-truth pixel value at pixel $(x,y)$ on the i-th visible-keypoint heat map, and $\hat{M}_i(x,y)$ is the pixel value at $(x,y)$ on the i-th keypoint heat map output by the network after sigmoid activation;
in step S33, the ground-truth pixel values on the keypoint heat maps are generated as follows:
set the value of every pixel within a 4-pixel radius of a visible face keypoint to 1, set the pixels outside that radius to 0, and generate one such heat map for every visible face keypoint;
S34: fusing the keypoint heat maps obtained by coarse localization in S33 with the multi-channel feature map extracted in S32 along the channel dimension to obtain a first fused feature map, and performing fine localization; during fine localization the first fused feature map is convolved with convolution kernels of different sizes, and the face keypoints are supervised with an L2 loss on the pixel values at corresponding positions of the keypoint heat maps, yielding K + 1 keypoint heat maps of size 28 × 28, where K is the number of all visible and invisible face keypoints; the L2 loss is expressed as follows:
$$L_2 = \sum_{i=1}^{K}\sum_{(x,y)}\big(\hat{H}_i(x,y) - H_i(x,y)\big)^2 + \sum_{(x,y)}\big(\hat{H}_{bg}(x,y) - H_{bg}(x,y)\big)^2$$
where $\hat{H}_i$ is the i-th face keypoint heat map predicted by the network, $H_i$ is the i-th heat map generated from the keypoint ground truth, $i \in [1, 2, \ldots, K]$; $\hat{H}_{bg}$ is the background-region heat map predicted by the network and $H_{bg}$ is the heat map generated from the background-region ground truth; $(x, y)$ denotes the position of a pixel on the heat map;
in step S34, the face keypoint ground truth on the keypoint heat maps is generated as follows:
a Gaussian response is placed in the 3-pixel-radius region around each face keypoint, with the Gaussian response function
$$H(j,k) = \exp\!\left(-\frac{(j-x_g)^2}{2\sigma_x^2} - \frac{(k-y_g)^2}{2\sigma_y^2}\right)$$
where $(j, k)$ is the pixel index on the 28 × 28 × K heat maps, $(x_g, y_g)$ are the coordinates of the ground-truth face keypoint on the corresponding heat map, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the Gaussian response in the x and y directions, respectively;
S35: fusing the multi-channel feature map extracted in S32 with the keypoint heat maps obtained by fine localization in S34 along the channel dimension to obtain a second fused feature map; first fusing the channel information of the second fused feature map through a point-wise convolution, then obtaining a 7 × 7 feature map through two convolution-and-pooling layers, and then obtaining a 2 × K-dimensional keypoint offset vector ΔS through two fully connected layers; adding ΔS to the face keypoint positions obtained by fine localization in S34, computing the residual with respect to the face keypoint ground truth, and adopting a 2D-coordinate L2 loss into which a new loss term is introduced; the loss function has the following form:
$$L_3 = \frac{\big\|S_{gt} - (\hat{S}_2 + \Delta S)\big\|_2}{D_{box}} + \sum_{\text{adjacent pairs}}\theta$$
where $S_{gt}$ is the 2D coordinate vector of the ground-truth face keypoint positions, $\hat{S}_2$ is the face keypoint positions obtained by fine localization in S34, and $D_{box}$, the length of the diagonal of the face box, serves as the normalization factor of the loss; θ is the angle between a line segment formed by two predicted adjacent keypoints and the line segment formed by the ground truth of the same adjacent keypoints; the angle θ is defined as follows:
$$\vec{d}_{pred} = \big(\hat{x}_j - \hat{x}_i,\; \hat{y}_j - \hat{y}_i\big), \qquad \vec{d}_{gt} = \big(x^{gt}_j - x^{gt}_i,\; y^{gt}_j - y^{gt}_i\big)$$
$$\theta = \arccos\frac{\vec{d}_{pred}\cdot\vec{d}_{gt}}{\|\vec{d}_{pred}\|\,\|\vec{d}_{gt}\|}$$
where $(\hat{x}_i, \hat{y}_i)$ and $(\hat{x}_j, \hat{y}_j)$ are the positions of two adjacent face keypoints predicted in S35, and $(x^{gt}_i, y^{gt}_i)$ and $(x^{gt}_j, y^{gt}_j)$ are the ground-truth positions of the same two adjacent keypoints; the four points define the angle θ between the two line segments;
S36: obtaining the face keypoint positions $S_{final}$ on the original image by applying the inverse of the frontalization transformation of S2 to the keypoint positions on the frontalized face obtained in S35; the inverse transformation is
$$S_{final} = T^{-1}(S_3)$$
where $S_3$ is the keypoint positions on the rectified face obtained in S35 and $T^{-1}$ is the inverse transformation matrix corresponding to the Procrustes analysis transformation T;
S4: after the convolutional network has completed the training of S3, inputting the test samples of S2 into the convolutional network to obtain the positions of the face keypoints in the image.
2. The occlusion-robust face keypoint localization method of claim 1, wherein in step S1 the keypoints are annotated as follows: the sample keypoint positions are first calibrated automatically with the 3DDFA algorithm and then adjusted manually.
3. The occlusion-robust face keypoint localization method of claim 1, wherein in step S33 there are 4 convolution kernels of different sizes: 3 × 3 × 256, 3 × 3 × 512, 1 × 1 × 512 and 1 × 1 × 68.
4. The occlusion-robust face keypoint localization method of claim 1, wherein in step S34 there are 4 convolution kernels of different sizes: 3 × 3 × 128, 3 × 3 × 256, 1 × 1 × 128 and 1 × 1 × 68.
CN201910061018.5A 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method Active CN109886121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910061018.5A CN109886121B (en) 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910061018.5A CN109886121B (en) 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method

Publications (2)

Publication Number Publication Date
CN109886121A CN109886121A (en) 2019-06-14
CN109886121B true CN109886121B (en) 2021-04-06

Family

ID=66926560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910061018.5A Active CN109886121B (en) 2019-01-23 2019-01-23 Occlusion-robust face keypoint localization method

Country Status (1)

Country Link
CN (1) CN109886121B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287846B (en) * 2019-06-19 2023-08-04 南京云智控产业技术研究院有限公司 Attention mechanism-based face key point detection method
CN110415171B (en) * 2019-07-08 2021-06-25 北京三快在线科技有限公司 Image processing method, image processing device, storage medium and electronic equipment
CN110674730A (en) * 2019-09-20 2020-01-10 华南理工大学 Monocular-based face silence living body detection method
CN110909690B (en) * 2019-11-26 2023-03-31 电子科技大学 Method for detecting occluded face image based on region generation
CN111080576B (en) * 2019-11-26 2023-09-26 京东科技信息技术有限公司 Key point detection method and device and storage medium
CN111652240B (en) * 2019-12-18 2023-06-27 南京航空航天大学 CNN-based image local feature detection and description method
CN111046826B (en) * 2019-12-20 2023-07-04 北京碧拓科技有限公司 Method for positioning key points of far infrared thermal imaging face
CN110874587B (en) * 2019-12-26 2020-07-28 浙江大学 Face characteristic parameter extraction system
CN113128277A (en) * 2019-12-31 2021-07-16 Tcl集团股份有限公司 Generation method of face key point detection model and related equipment
CN111339883A (en) * 2020-02-19 2020-06-26 国网浙江省电力有限公司 Method for identifying and detecting abnormal behaviors in transformer substation based on artificial intelligence in complex scene
CN111723709B (en) * 2020-06-09 2023-07-11 大连海事大学 Fly face recognition method based on deep convolutional neural network
WO2021258588A1 (en) * 2020-06-24 2021-12-30 北京百度网讯科技有限公司 Face image recognition method, apparatus and device and storage medium
CN112287802A (en) * 2020-10-26 2021-01-29 汇纳科技股份有限公司 Face image detection method, system, storage medium and equipment
CN112464809B (en) * 2020-11-26 2023-06-06 北京奇艺世纪科技有限公司 Face key point detection method and device, electronic equipment and storage medium
CN112699837A (en) * 2021-01-13 2021-04-23 新大陆数字技术股份有限公司 Gesture recognition method and device based on deep learning
CN112884326A (en) * 2021-02-23 2021-06-01 无锡爱视智能科技有限责任公司 Video interview evaluation method and device based on multi-modal analysis and storage medium
CN113158870B (en) * 2021-04-15 2023-07-18 华南理工大学 Antagonistic training method, system and medium of 2D multi-person gesture estimation network
CN113298052B (en) * 2021-07-26 2021-10-15 浙江霖研精密科技有限公司 Human face detection device and method based on Gaussian attention and storage medium
CN113762136A (en) * 2021-09-02 2021-12-07 北京格灵深瞳信息技术股份有限公司 Face image occlusion judgment method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568026A (en) * 2011-12-12 2012-07-11 浙江大学 Three-dimensional enhancing realizing method for multi-viewpoint free stereo display
CN104866829A (en) * 2015-05-25 2015-08-26 苏州大学 Cross-age face verify method based on characteristic learning
WO2018093796A1 (en) * 2016-11-15 2018-05-24 Magic Leap, Inc. Deep learning system for cuboid detection
CN108229509A (en) * 2016-12-16 2018-06-29 北京市商汤科技开发有限公司 For identifying object type method for distinguishing and device, electronic equipment
CN108229488A (en) * 2016-12-27 2018-06-29 北京市商汤科技开发有限公司 For the method, apparatus and electronic equipment of detection object key point
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016370B (en) * 2017-04-10 2019-10-11 电子科技大学 A kind of partial occlusion face identification method based on data enhancing
CN108805040A (en) * 2018-05-24 2018-11-13 复旦大学 It is a kind of that face recognition algorithms are blocked based on piecemeal

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568026A (en) * 2011-12-12 2012-07-11 浙江大学 Three-dimensional enhancing realizing method for multi-viewpoint free stereo display
CN104866829A (en) * 2015-05-25 2015-08-26 苏州大学 Cross-age face verify method based on characteristic learning
WO2018093796A1 (en) * 2016-11-15 2018-05-24 Magic Leap, Inc. Deep learning system for cuboid detection
CN108229509A (en) * 2016-12-16 2018-06-29 北京市商汤科技开发有限公司 For identifying object type method for distinguishing and device, electronic equipment
CN108229488A (en) * 2016-12-27 2018-06-29 北京市商汤科技开发有限公司 For the method, apparatus and electronic equipment of detection object key point
CN108388876A (en) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 A kind of image-recognizing method, device and relevant device
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hand and face tracking under occlusion with anthropomorphic constraints; Abhishek Sen et al.; 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 2012-12-13; pp. 242-247 *
Research on face recognition technology based on deep learning (基于深度学习的人脸识别技术研究); Liu Shile (刘施乐); 《电子制作》 (Electronic Production); 2018-12-31; pp. 50-51, 96 *

Also Published As

Publication number Publication date
CN109886121A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109886121B (en) Occlusion-robust face keypoint localization method
US11556797B2 (en) Systems and methods for polygon object annotation and a method of training an object annotation system
Li et al. A deep learning method for change detection in synthetic aperture radar images
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN112287941B (en) License plate recognition method based on automatic character region perception
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Yan et al. Learning part-aware attention networks for kinship verification
Fang et al. Laser stripe image denoising using convolutional autoencoder
CN111652240A (en) Image local feature detection and description method based on CNN
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN111445496B (en) Underwater image recognition tracking system and method
Wang et al. Low-light image enhancement based on deep learning: a survey
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN108009512A (en) A kind of recognition methods again of the personage based on convolutional neural networks feature learning
Li et al. AEMS: an attention enhancement network of modules stacking for lowlight image enhancement
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN110020986B (en) Single-frame image super-resolution reconstruction method based on Euclidean subspace group double-remapping
Zeng et al. Masanet: Multi-angle self-attention network for semantic segmentation of remote sensing images
CN116758340A (en) Small target detection method based on super-resolution feature pyramid and attention mechanism
Vijayalakshmi K et al. Copy-paste forgery detection using deep learning with error level analysis
CN115439669A (en) Feature point detection network based on deep learning and cross-resolution image matching method
Bhattacharya et al. Simplified face quality assessment (sfqa)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant