CN109389030B - Face characteristic point detection method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN109389030B
Authority
CN
China
Prior art keywords
picture
face
data set
convolution
preset
Prior art date
Legal status
Active
Application number
CN201810963841.0A
Other languages
Chinese (zh)
Other versions
CN109389030A (en)
Inventor
Dai Lei (戴磊)
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810963841.0A
Priority to PCT/CN2018/120857 (published as WO2020037898A1)
Publication of CN109389030A
Application granted
Publication of CN109389030B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial feature point detection method and device, a computer device, and a storage medium. The method comprises the following steps: dividing a sample data set into a training data set and a test data set according to a preset division ratio; training a face detection model comprising K parallel convolution layers, a splicing layer and a global pooling layer with the training data set; testing the face detection model with the test data set and calculating, from the test results, the positioning accuracy of the face detection model for the facial feature points; if the positioning accuracy is smaller than a preset accuracy threshold, re-dividing the sample data set and retraining and retesting until the positioning accuracy is greater than or equal to the preset accuracy threshold; and inputting a face picture to be detected into the trained face detection model for calculation to obtain the feature point prediction result of the face picture. The technical scheme of the invention can effectively improve the face detection model's ability to locate facial feature points and its prediction accuracy.

Description

Face characteristic point detection method and device, computer equipment and storage medium
Technical Field
The invention relates to the field of computers, in particular to a method and a device for detecting human face characteristic points, computer equipment and a storage medium.
Background
At present, face recognition is widely applied to various practical applications, identity authentication through face recognition gradually becomes a common identity authentication mode, and in the face recognition process, detection of face characteristic points is a precondition and a basis for face recognition and related applications.
In the existing process of designing depth models for facial feature point detection, the depth model usually has to be designed as a small model so that it suits practical application scenarios and consumes little execution time. However, existing depth models built in this small-model fashion have poor prediction capability and low prediction accuracy, so they cannot accurately locate the feature points of faces such as blurred faces, large-angle faces, and faces with exaggerated expressions.
Disclosure of Invention
The embodiment of the invention provides a face characteristic point detection method, a face characteristic point detection device, computer equipment and a storage medium, and aims to solve the problem that the accuracy of prediction of a current depth model on a face characteristic point is low.
A face feature point detection method comprises the following steps:
acquiring a sample data set, wherein the sample data set comprises face sample pictures and face characteristic point marking information of each face sample picture;
dividing the sample data set into a training data set and a test data set according to a preset division ratio;
training an initial face detection model by using the training data set to obtain a trained face detection model, wherein the initial face detection model is a convolutional neural network comprising K parallel convolutional layers, a splicing layer and a global pooling layer, each parallel convolutional layer has a visual perception range with different preset scales, and K is a positive integer greater than or equal to 3;
testing the trained face detection model by using the test data set, and calculating the positioning accuracy of the trained face detection model on the face characteristic points according to the test result;
if the positioning accuracy is smaller than a preset accuracy threshold, re-dividing the face sample pictures in the sample data set to obtain a new training data set and a new testing data set, training the trained face detection model by using the new training data set to update the trained face detection model, and testing the trained face detection model by using the new testing data set until the positioning accuracy is greater than or equal to the preset accuracy threshold;
if the positioning accuracy is greater than or equal to the preset accuracy threshold, determining the trained face detection model whose positioning accuracy is greater than or equal to the preset accuracy threshold as the final trained face detection model;
acquiring a human face picture to be detected;
and inputting the human face picture to be detected into the trained human face detection model for calculation to obtain a feature point prediction result of the human face picture, wherein the feature point prediction result comprises attribute information and position information of a target feature point.
A face feature point detecting device comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a sample data set, and the sample data set comprises human face sample pictures and human face characteristic point marking information of each human face sample picture;
the sample dividing module is used for dividing the sample data set into a training data set and a testing data set according to a preset dividing proportion;
the model training module is used for training an initial face detection model by using the training data set to obtain a trained face detection model, wherein the initial face detection model is a convolutional neural network comprising K parallel convolutional layers, a splicing layer and a global pooling layer, each parallel convolutional layer has a visual perception range with different preset scales, and K is a positive integer greater than or equal to 3;
the model testing module is used for testing the trained face detection model by using the testing data set and calculating the positioning accuracy of the trained face detection model to the face characteristic points according to the testing result;
a model optimization module, configured to, if the positioning accuracy is smaller than a preset accuracy threshold, re-divide the face sample picture in the sample data set to obtain a new training data set and a new testing data set, train the trained face detection model using the new training data set to update the trained face detection model, and test the trained face detection model using the new testing data set until the positioning accuracy is greater than or equal to the preset accuracy threshold;
a training result module, configured to determine, if the positioning accuracy is greater than or equal to the preset accuracy threshold, the trained face detection model whose positioning accuracy is greater than or equal to the preset accuracy threshold as the final trained face detection model;
a second acquisition module, configured to acquire a face picture to be detected;
and the model prediction module is used for inputting the human face picture to be detected into the trained human face detection model for calculation to obtain a feature point prediction result of the human face picture, wherein the feature point prediction result comprises attribute information and position information of a target feature point.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above-mentioned face feature point detection method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned face feature point detection method.
On the one hand, a convolutional neural network comprising a plurality of parallel convolution layers, a splicing layer and a global pooling layer is constructed as the face detection model, where the parallel convolution layers have visual perception ranges of different preset scales. Parallel convolution calculation is performed in each parallel convolution layer with a visual perception range of a different scale, and the calculation results of the parallel convolution layers are spliced together by the splicing layer, so that the face detection model can capture detail features of different scales at the same time, which improves its expression capability; the pooling calculation of the global pooling layer makes the output of the face detection model invariant with respect to position while avoiding over-fitting. On the other hand, a sample data set consisting of face sample pictures with accurate facial feature point labeling information is obtained and divided into a training data set and a test data set according to a preset ratio; the face detection model is trained with the training data set and tested with the test data set, the positioning accuracy of the face detection model is calculated from the test results, the prediction capability of the trained model is judged by this positioning accuracy, and the training is continually optimized by adjusting the training and test data sets until a satisfactory positioning accuracy is reached, thereby realizing the training optimization of the face detection model and further enhancing its prediction capability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a method for detecting facial feature points according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for detecting facial feature points according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a network structure of a face detection model including three parallel convolution layers in the face feature point detection method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating the step S8 of the face feature point detection method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating step S4 of the face feature point detection method according to an embodiment of the present invention, in which the positioning accuracy of the face detection model for the facial feature points is calculated from the test results;
FIG. 6 is a flowchart illustrating the step S1 of the face feature point detection method according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating the step S14 of the face feature point detection method according to the embodiment of the present invention;
FIG. 8 is a schematic view of a face feature point detection apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for detecting the human face characteristic points can be applied to an application environment shown in fig. 1, the application environment comprises a server and a client, the server and the client are connected through a network, the network can be a wired network or a wireless network, the client specifically comprises but is not limited to various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server can be specifically realized by an independent server or a server cluster formed by a plurality of servers. The client sends the collected sample data set and the human face picture to be detected to the server, the server performs model training according to the received sample data set, and performs feature point detection on the human face picture to be detected by using the trained human face detection model.
In an embodiment, as shown in fig. 2, a method for detecting a facial feature point is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and is detailed as follows:
S1: acquiring a sample data set, wherein the sample data set comprises face sample pictures and face characteristic point marking information of each face sample picture.
Specifically, the sample data set may be collected in advance and stored in a sample database; the sample data set includes a plurality of face sample pictures and the face feature point labeling information of each face sample picture.
As can be understood, the face sample picture and the face feature point labeling information of the face sample picture are stored in the sample data set in an associated manner.
The face feature point labeling information may include the attribute information and position information of each face feature point. The attribute information indicates the facial feature to which the feature point belongs, and the position information is the pixel coordinates of the feature point in the face sample picture.
For example, a specific piece of face feature point labeling information is "eyes (200, 150)", where "eyes" is the facial feature to which the feature point belongs, i.e., the attribute information, and "(200, 150)" is the pixel coordinate of the feature point in the face sample picture, i.e., the position information.
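For illustration only, one sample's labeling information could be laid out as follows; the field names and file name are assumptions, not a format specified by the text.

```python
# Hypothetical layout of one face sample picture and its labeling information.
sample_annotation = {
    "picture": "face_0001.jpg",          # assumed file name
    "feature_points": [
        {"attribute": "eyes", "position": (200, 150)},   # facial feature + pixel coordinates
        {"attribute": "nose", "position": (230, 210)},
    ],
}
```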
S2: and dividing the sample data set into a training data set and a test data set according to a preset division ratio.
Specifically, according to a preset division ratio, randomly dividing the face sample pictures in the sample data set obtained in the step S1 to obtain a training data set and a test data set.
For example, if the preset division ratio is 3:2 and the sample data set includes 1,000,000 face sample pictures, then 600,000 face sample pictures are randomly selected from the sample data set as the training data set, and the remaining 400,000 face sample pictures are used as the test data set.
It should be noted that the preset dividing ratio may be set according to the requirement of practical application, and is not limited herein.
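A minimal sketch of this random division by a preset ratio (e.g. 3:2) might look like the following; the helper name and signature are assumptions for illustration.

```python
import random

def split_dataset(samples, train_parts=3, test_parts=2, seed=None):
    """Randomly divide a list of (face picture, labeling info) samples into a
    training data set and a test data set according to a preset ratio, e.g. 3:2."""
    rng = random.Random(seed)
    shuffled = list(samples)          # keep the original sample data set intact
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# With 1,000,000 samples and a 3:2 ratio this yields 600,000 training / 400,000 test pictures.
```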
S3: the method comprises the steps of training an initial face detection model by using a training data set to obtain a trained face detection model, wherein the initial face detection model is a convolutional neural network comprising K parallel convolutional layers, a splicing layer and a global pooling layer, each parallel convolutional layer has a visual perception range with different preset scales, and K is a positive integer greater than or equal to 3.
In this embodiment, the initial face detection model, the trained face detection model, and the final trained face detection model mentioned below all refer to a face detection model with a laminated convolutional neural network structure. The convolutional neural network comprises K parallel convolution layers, a splicing layer, and a global pooling layer, and each parallel convolution layer is configured with convolution kernels whose visual perception ranges have different preset scales. The K parallel convolution layers are arranged in a preset sequence: the output data of each parallel convolution layer serves as the input data of the next parallel convolution layer, and the output data of every parallel convolution layer also serves as input data of the splicing layer. The output data of the splicing layer serves as the input data of the global pooling layer, and the output data of the global pooling layer is the output result of the face detection model, which includes the attribute information and position information of the face feature points predicted in the face sample picture.
As shown in fig. 3, fig. 3 is a schematic diagram of a network structure of a face detection model including three parallel convolutional layers. The three parallel convolution layers are a convolution layer A, a convolution layer B and a convolution layer C respectively, the visual perception ranges of the preset scales corresponding to each parallel convolution layer are convolution kernels of 3 x 3, 5 x 5 and 7 x 7 respectively, and the unit of each convolution kernel is a pixel point.
The parallel convolution calculation is carried out on each parallel convolution layer by using the visual perception ranges of different scales, and the calculation results of each parallel convolution layer are spliced together by the splicing layers, so that the face detection model can capture the detail features of different scales simultaneously, the expression capability of the face detection model is improved, the output result of the face detection model has the characteristic of invariance relative to the position through the pooling calculation of the global pooling layer, and the overfitting is avoided. The laminated convolution neural network structure can improve the positioning capability of the face detection model on the face characteristic points, and particularly can accurately position the characteristic points of the face such as a blurred face, a large-angle face, an exaggerated expression face and the like, so that the prediction accuracy of the face detection model is effectively improved.
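As a concrete illustration of this laminated layout, below is a minimal PyTorch sketch of a network with three chained convolution layers whose outputs are all spliced (concatenated) and then globally pooled. The input size, channel counts, number of landmarks, the final linear head, and the BN + ReLU normalization placed before each convolution (described further below) are illustrative assumptions, not values fixed by the text.

```python
import torch
import torch.nn as nn

class ParallelConvFaceModel(nn.Module):
    """Sketch of the laminated structure: each convolution layer feeds the next one,
    and every layer's output is also spliced together before global pooling."""

    def __init__(self, in_channels=3, channels=(16, 32, 64), num_landmarks=68):
        super().__init__()
        kernel_sizes = (3, 5, 7)  # visual perception ranges of the three parallel layers
        self.blocks = nn.ModuleList()
        prev = in_channels
        for out_ch, k in zip(channels, kernel_sizes):
            # global normalization (BN) + single-side suppression (ReLU) on the layer input
            self.blocks.append(nn.Sequential(
                nn.BatchNorm2d(prev),
                nn.ReLU(inplace=True),
                nn.Conv2d(prev, out_ch, kernel_size=k, padding=k // 2),
            ))
            prev = out_ch
        self.pool = nn.AdaptiveAvgPool2d(1)                       # global pooling layer
        self.head = nn.Linear(sum(channels), num_landmarks * 2)   # assumed output head

    def forward(self, x):
        outputs = []
        for block in self.blocks:
            x = block(x)          # each layer feeds the next one
            outputs.append(x)     # and also feeds the splicing layer
        spliced = torch.cat(outputs, dim=1)        # splicing (concatenation) layer
        pooled = self.pool(spliced).flatten(1)     # global pooling
        return self.head(pooled)                   # predicted landmark coordinates
```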
Specifically, when an initial face detection model is trained by using a training data set, a face sample picture in the training data set is input into the initial face detection model, layer-by-layer calculation is carried out according to a laminated convolutional neural network structure of the initial face detection model, an obtained output result of the initial face detection model is used as a test result, comparison learning is carried out between the test result and face characteristic point labeling information of the face sample picture, parameters of each layer of the laminated convolutional neural network structure are adjusted according to the comparison learning result, and the trained face detection model is obtained through repeated training and parameter adjustment.
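The repeated compare-and-adjust procedure just described can be sketched as a standard supervised training loop over a model such as the `ParallelConvFaceModel` sketch above; the optimizer, learning rate, epoch count and the L2 loss used to compare predictions with the labeling information are assumptions for illustration.

```python
import torch

def train(model, loader, epochs=10, lr=1e-3):
    """Train a landmark-regression model on batches of (pictures, landmarks),
    where landmarks is a tensor of shape (batch, num_landmarks * 2)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for pictures, landmarks in loader:           # batches from the training data set
            optimizer.zero_grad()
            predicted = model(pictures)              # layer-by-layer forward calculation
            loss = loss_fn(predicted, landmarks)     # compare with the labeling information
            loss.backward()
            optimizer.step()                         # adjust the parameters of each layer
    return model
```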
Further, before the convolution calculation in each parallel convolution layer, normalization processing may be performed on the input data of that layer; the normalization processing may specifically include global normalization (BN) processing and single-side suppression processing. The global normalization prevents gradient vanishing or explosion and speeds up training. The single-side suppression uses a rectified linear unit (ReLU) as the activation function to suppress one side of the output after global normalization, so that the sparser face detection model can mine face feature points more accurately and fit the training data better. In addition, performing the convolution calculation on normalized input data effectively reduces the amount of computation and improves computational efficiency.
S4: and testing the trained face detection model by using the test data set, and calculating the positioning accuracy of the trained face detection model to the face characteristic points according to the test result.
Specifically, the face sample pictures in the test data set are input into the trained face detection model obtained in step S3 for testing, so as to obtain the test result output by the face detection model, where the test result includes the predicted position information of each face feature point in each face sample picture.
And aiming at each face sample picture, comparing the test result of the face sample picture with the actual position information of each face characteristic point in the face characteristic point marking information of the face sample picture, judging whether the test result is accurate to obtain a judgment result, and calculating the positioning accuracy of the trained face detection model to the face characteristic points according to the judgment result of each face sample picture in the test data set.
In a specific embodiment, the determination result may take two values, correct and incorrect. When the test result of a face sample picture is consistent with the face feature point labeling information of that picture, the determination result is correct; otherwise, it is incorrect. The number of face sample pictures in the test data set with a correct determination result is counted, and the ratio of this number to the total number of face sample pictures in the test data set is used as the positioning accuracy.
Further, the positioning accuracy can also be obtained by calculating a Normalized Mean Error (NME) of the test data set.
S5: if the positioning accuracy is smaller than the preset accuracy threshold, the face sample pictures in the sample data set are divided again to obtain a new training data set and a new testing data set, the trained face detection model is trained by using the new training data set to update the trained face detection model, and the trained face detection model is tested by using the new testing data set until the positioning accuracy is larger than or equal to the preset accuracy threshold.
Specifically, the positioning accuracy obtained in step S4 is compared with a preset accuracy threshold, and if the positioning accuracy is smaller than the accuracy threshold, it is determined that the training of the trained face detection model is not completed, and the trained face detection model needs to be continuously adjusted and optimized in network parameters.
Then, face sample pictures in the sample data set are randomly selected again according to the preset division ratio to obtain a new training data set and a new test data set. The trained face detection model is trained with the new training data set, using the same training process as step S3, to update the trained face detection model; after the training is finished, the updated face detection model is tested with the new test data set, using the same testing process as step S4, and the positioning accuracy is calculated from the test result.
If the positioning accuracy is still smaller than the accuracy threshold, the step is continuously repeated, and the training and the testing are repeated until the positioning accuracy is larger than or equal to the accuracy threshold, and the training and the testing are finished.
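Putting steps S2 to S6 together, the retrain-until-accurate loop might look like the following sketch, which composes the helper sketches above; the function names and signatures are illustrative assumptions.

```python
def train_until_accurate(model, samples, accuracy_threshold, train_fn, evaluate_fn, ratio=(3, 2)):
    """Keep re-dividing the sample data set, retraining and retesting until the
    positioning accuracy reaches the preset accuracy threshold."""
    while True:
        train_set, test_set = split_dataset(samples, *ratio)   # re-divide the samples
        model = train_fn(model, train_set)                     # (re)train the model
        accuracy = evaluate_fn(model, test_set)                # positioning accuracy on the test set
        if accuracy >= accuracy_threshold:
            return model                                       # final trained face detection model
```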
S6: and if the positioning accuracy is greater than or equal to the preset accuracy threshold, determining the trained face detection model with the positioning accuracy greater than or equal to the preset accuracy threshold as the trained face detection model.
Specifically, if the positioning accuracy obtained in step S4 is greater than or equal to the preset accuracy threshold, or the positioning accuracy obtained after repeated training and testing in step S5 is greater than or equal to the preset accuracy threshold, the trained face detection model with the positioning accuracy greater than or equal to the preset accuracy threshold is the trained face detection model, and the trained face detection model can be used for detecting the face feature points.
S7: and acquiring a human face picture to be detected.
Specifically, the face picture to be detected may be a face picture input by a user through a client to perform identity recognition, and the server obtains the face picture to be detected from the client.
S8: inputting a human face picture to be detected into a trained human face detection model for calculation to obtain a feature point prediction result of the human face picture, wherein the feature point prediction result comprises attribute information and position information of a target feature point.
Specifically, the face picture to be detected obtained in step S7 is input into the trained face detection model obtained in step S6, and calculation is performed according to the laminated convolutional neural network structure of the trained face detection model to obtain its output, which includes the attribute information and position information of the target feature points identified in the face picture to be detected, i.e., the feature point prediction result of the face picture to be detected.
In this embodiment, on the one hand, a convolutional neural network comprising a plurality of parallel convolution layers, a splicing layer and a global pooling layer is constructed as the face detection model, where the parallel convolution layers have visual perception ranges of different preset scales. Parallel convolution calculation is performed in each parallel convolution layer with a visual perception range of a different scale, and the calculation results of the parallel convolution layers are spliced together by the splicing layer, so that the face detection model can capture detail features of different scales at the same time, which improves its expression capability; the pooling calculation of the global pooling layer makes the output of the face detection model invariant with respect to position while avoiding over-fitting. On the other hand, a sample data set consisting of face sample pictures with accurate facial feature point labeling information is obtained and divided into a training data set and a test data set according to a preset ratio; the face detection model is trained with the training data set and tested with the test data set, the positioning accuracy of the face detection model is calculated from the test results, the prediction capability of the trained model is judged by this positioning accuracy, and the training is continually optimized by adjusting the training and test data sets until a satisfactory positioning accuracy is reached, thereby realizing the training optimization of the face detection model and further enhancing its prediction capability.
In an embodiment, as shown in fig. 4, K is equal to 3, and the K parallel convolution layers include a first convolution layer, a second convolution layer, and a third convolution layer, and in step S8, inputting a face image to be detected into a trained face detection model for calculation, and obtaining a feature point prediction result of the face image specifically includes the following steps:
S81: And carrying out standardized processing on the face picture to be detected to obtain first face data.
The standardization treatment comprises global normalization treatment and single-side inhibition treatment, wherein the global normalization treatment is BN treatment, and the disappearance of gradients or explosion can be effectively prevented through the global normalization treatment; and the unilateral inhibition processing is to use the ReLU as an activation function to carry out unilateral inhibition on the output image after the global normalization processing, so as to avoid overfitting.
Specifically, after global normalization processing and single-side suppression processing are performed on a face picture to be detected, first face data are obtained.
S82: and inputting the first face data into the first convolution layer for convolution calculation to obtain a first convolution result.
Specifically, the first face data obtained in step S81 is input to the first convolution layer and convolution calculation is performed thereon, the convolution calculation performs convolution transformation on the image matrix of the first face data, and the Feature of the image matrix is extracted by the convolution kernel of the first convolution layer, thereby outputting a Feature Map (Feature Map), that is, a first convolution result.
S83: and carrying out standardization processing on the first convolution result to obtain second face data.
Specifically, the first convolution result obtained in step S82 is continuously normalized to obtain second face data.
The normalization processing procedure for the first convolution result may adopt the same global normalization processing and single-side suppression processing procedure as that in step S81, and details are not repeated here.
S84: and inputting the second face data into a second convolution layer for convolution calculation to obtain a second convolution result.
Specifically, the second face data obtained in step S83 is input to the second convolution layer to perform convolution calculation, the convolution calculation performs convolution transformation on the image matrix of the second face data, the feature of the image matrix is extracted by the convolution kernel of the second convolution layer, and the second convolution result is output.
S85: and carrying out standardization processing on the second convolution result to obtain third face data.
Specifically, the second convolution result obtained in step S84 is continuously normalized to obtain third face data.
The normalization processing procedure of the second convolution result may adopt the same global normalization processing and single-side suppression processing procedure as step S81, and details are not repeated here.
S86: and inputting the third face data into a third convolution layer for convolution calculation to obtain a third convolution result.
Specifically, the third face data obtained in step S85 is input to the third convolution layer to perform convolution calculation, the convolution calculation performs convolution transformation on the image matrix of the third face data, the feature of the image matrix is extracted by the convolution kernel of the third convolution layer, and the third convolution result is output.
It should be noted that, the sizes of the convolution kernels of the first convolution layer, the second convolution layer, and the third convolution layer may be set in advance according to the needs of practical application, and they may be the same or different from each other, and are not limited herein.
S87: and inputting the first convolution result, the second convolution result and the third convolution result into a splicing layer for splicing calculation to obtain a convolution output result.
Specifically, the first convolution result obtained in step S82, the second convolution result obtained in step S84, and the third convolution result obtained in step S86 are simultaneously input to the splice layer for splicing calculation, so as to obtain a convolution output result.
S88: and inputting the convolution output result into a global pooling layer to perform pooling calculation to obtain a feature point prediction result of the human face picture to be detected.
Specifically, the convolution output result obtained in step S87 is input into the global pooling layer to perform pooling calculation, so as to obtain a prediction result of the feature point of the face picture to be detected.
Because the number of the feature parameters contained in the convolution output result is large, and simultaneously, redundant and miscellaneous features which have no practical significance or are repeated and the like may exist, the redundant and miscellaneous features can be screened out through the pooling calculation of the global pooling layer, unnecessary parameters are reduced, and overfitting is avoided.
Further, the pooling calculation may be performed using either the max pooling method or the mean (average) pooling method. The max pooling method takes the maximum value of a feature map region as the pooled value of that region; the mean pooling method takes the average value of the feature map region as the pooled result of that region.
In this embodiment, when the face detection model comprises three parallel convolution layers, the face picture to be detected is normalized to obtain the first face data; the first face data is input into the first convolution layer for convolution calculation to obtain the first convolution result; the first convolution result is further normalized to obtain the second face data, which is input into the second convolution layer for convolution calculation to obtain the second convolution result; the second convolution result is further normalized to obtain the third face data, which is input into the third convolution layer for convolution calculation to obtain the third convolution result. The outputs of the three parallel convolution layers are all input into the splicing layer for splicing calculation to obtain the convolution output result, and the convolution output result is input into the global pooling layer for pooling calculation to obtain the feature point prediction result of the face picture to be detected. In this way the model can accurately locate facial feature points, especially for blurred faces, large-angle faces and faces with exaggerated expressions, which effectively improves the prediction accuracy for the face picture to be detected.
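For illustration, running a trained model of this shape on a single preprocessed face picture might look like the sketch below; the model class is the earlier sketch and the input size is an assumption.

```python
import torch

model = ParallelConvFaceModel()          # the sketch class defined earlier
model.eval()
picture = torch.rand(1, 3, 112, 112)     # stand-in for a preprocessed face picture to be detected
with torch.no_grad():
    prediction = model(picture)          # normalize, convolve, splice, pool (steps S81-S88)
landmarks = prediction.view(-1, 2)       # (num_landmarks, 2) predicted feature point positions
```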
In an embodiment, as shown in fig. 5, in step S4, calculating the accuracy of the trained face detection model for locating the face feature points according to the test result specifically includes the following steps:
S41: And calculating the normalized average error of each test sample in the test data set corresponding to the test result according to the test result.
Specifically, the test result includes the predicted position information of each human face feature point in the test sample of the test data set corresponding to the test result, and the Normalized Mean Error (NME) of each test sample is calculated according to the following formula:
P = (1/N) · Σ_{k=1}^{N} ‖x_k − y_k‖ / d
where P is the normalized average error of the test sample, N is the actual number of face feature points of the test sample, x_k is the actual position of the k-th face feature point of the test sample, y_k is the predicted position of the k-th face feature point in the test result of the test sample, ‖x_k − y_k‖ is the distance between the actual position and the predicted position of the k-th face feature point, and d is the face image size of the test sample. The actual position information and the predicted position information may specifically be coordinate information, and the face image size may specifically be the pixel area of the face picture.
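A small helper that evaluates this formula for one test sample might look as follows; NumPy is used here only for illustration.

```python
import numpy as np

def normalized_mean_error(actual, predicted, face_size):
    """Normalized mean error (NME) of one test sample.
    `actual` and `predicted` are (N, 2) arrays of landmark coordinates,
    `face_size` is the normalizing face image size d."""
    distances = np.linalg.norm(actual - predicted, axis=1)   # ||x_k - y_k|| per feature point
    return float(np.mean(distances) / face_size)
```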
S42: and averagely dividing the preset error threshold value according to a preset interval numerical value to obtain P sub-threshold values, wherein P is a positive integer.
Specifically, the values from 0 to the preset error threshold are divided equally according to the preset interval values to obtain P sub-thresholds.
It should be noted that, the preset error threshold and the preset interval value may be set according to the needs of practical application, and are not limited herein.
For example, if the preset error threshold is 0.07 and the preset interval value is 0.001, the values between 0 and 0.07 are divided equally at intervals of 0.001, and 70 sub-thresholds are obtained.
It should be noted that, there is no necessary sequential execution order between step S41 and step S42, and the steps may also be executed in parallel, which is not limited herein.
S43: and counting the statistical number of the test samples with the normalized average error smaller than each sub-threshold, and calculating the percentage of the statistical number in the total number of the test samples in the test data set corresponding to the test result to obtain P percentage values.
Specifically, for the normalized average error of each test sample obtained in step S41, the normalized average error of the test sample is compared with each sub-threshold, and according to the statistical number of test samples whose normalized average error is smaller than each sub-threshold, the quotient between the statistical number and the total number of test samples in the test data set corresponding to the test result is calculated, so as to obtain P quotients, that is, P percentage values.
For example, if the preset error threshold is 0.2 and the preset interval value is 0.05, then P is 4, and the 4 sub-thresholds are 0.05, 0.1, 0.15 and 0.2, respectively. Assume that the test data set corresponding to the test result contains 10 test samples in total, whose normalized average errors are 0.003, 0.075, 0.06, 0.07, 0.23, 0.18, 0.11, 0.04, 0.09 and 0.215, respectively. The following statistics can then be found:
normalized mean errors of less than 0.05 of 0.003 and 0.04, i.e., the statistical number of test samples with normalized mean errors of less than 0.05 was 2;
normalized mean errors of less than 0.1 of 0.003, 0.075, 0.04, 0.06, 0.07 and 0.09, i.e. the statistical number of test samples having a normalized mean error of less than 0.1 is 6;
normalized mean errors of less than 0.15 of 0.003, 0.075, 0.04, 0.06, 0.07, 0.09, and 0.11, i.e., the statistical number of test samples having normalized mean errors of less than 0.15 is 7;
normalized mean errors of less than 0.2 of 0.003, 0.075, 0.04, 0.06, 0.07, 0.09, 0.11, and 0.18, i.e., the statistical number of test samples having normalized mean errors of less than 0.2 is 8;
the 4 percentage values obtained according to the calculation mode in the step are respectively as follows: 2/10=20%, 6/10=60%, 7/10=70%, and 8/10=80%.
S44: and calculating the average value of the P percentage values, and taking the average value as the positioning accuracy.
Specifically, according to the P percentage values obtained in step S43, an arithmetic mean of the P percentage values is calculated, and the arithmetic mean is the positioning accuracy.
Continuing with the example of step S43, the average of the 4 percentage values is (20% + 60% + 70% + 80%) / 4 = 57.5%.
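Steps S41 to S44 can be sketched as the function below, which reproduces the worked example above; the strict "smaller than" comparison of step S43 and the function name are assumptions.

```python
import numpy as np

def positioning_accuracy(nmes, error_threshold=0.07, interval=0.001):
    """Divide (0, error_threshold] into equal sub-thresholds, compute the fraction of
    test samples whose NME is smaller than each sub-threshold, and return the
    arithmetic mean of those fractions as the positioning accuracy."""
    nmes = np.asarray(nmes, dtype=float)
    sub_thresholds = np.arange(interval, error_threshold + 1e-12, interval)
    fractions = [(nmes < t).mean() for t in sub_thresholds]
    return float(np.mean(fractions))

# Worked example from the text: threshold 0.2, interval 0.05, 10 test samples.
nmes = [0.003, 0.075, 0.06, 0.07, 0.23, 0.18, 0.11, 0.04, 0.09, 0.215]
print(positioning_accuracy(nmes, 0.2, 0.05))  # (0.2 + 0.6 + 0.7 + 0.8) / 4 -> 0.575
```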
In the embodiment, the normalized average error of the test sample is calculated, the preset error threshold is averagely divided according to the preset interval value, then the statistical quantity of the test samples with the normalized average error smaller than each sub-threshold is counted, the percentage of the statistical quantity to the total number of the test samples in the test data set corresponding to the test result is calculated, P percentage values are obtained, the arithmetic mean of the P percentage values is used as the positioning accuracy, the positioning accuracy obtained by the calculation method of the embodiment can objectively and accurately reflect the prediction accuracy of the trained face detection model to the feature points, and further, an accurate judgment basis is provided for further optimization of the model training parameters.
In an embodiment, as shown in fig. 6, in step S1, acquiring the sample data set specifically includes the following steps:
S11: Video data and pictures are acquired.
Specifically, the video data is acquired from a preset video source channel, where the video source channel may be video data recorded in the monitoring device, video data stored in a server database, video data collected in a video application, and the like. The method comprises the steps of obtaining pictures from a preset picture source channel, wherein the picture source channel can be pictures disclosed by the Internet, pictures pre-stored in a server database and the like.
It can be understood that the acquired video data and the acquired pictures are multiple.
S12: and extracting a target video frame image from the video data according to a preset frame extraction frequency and a preset maximum frame number.
Specifically, each piece of video data acquired in step S11 is processed, and a frame image is extracted from a preset position of the piece of video data according to a preset frame extraction frequency and a preset maximum frame number, so as to obtain a target video frame image. The preset position may be a first frame position of the video data, or may be other positions, which is not limited herein.
It should be noted that the preset frame extraction frequency may be generally set to randomly extract 1 frame image from every 2 consecutive frame images, and the preset maximum frame number is generally an empirical value, and a value range thereof may be 1700 to 1800, but is not limited thereto, and both the preset frame extraction frequency and the preset maximum frame number may be set according to needs of practical applications, and are not limited herein.
For example, assuming that the preset frame extraction frequency is 1 frame image randomly extracted from every 5 continuous frame images, and the preset maximum frame number is 1800, if the total frame number of the video data is 2500 frames and the extraction is started from the first frame of the video data, the number of the target video frame images is 500 frames.
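A sketch of this frame extraction, assuming OpenCV for video decoding and a group-based random pick (one frame out of every `group_size` consecutive frames), might look like the following.

```python
import random
import cv2

def extract_frames(video_path, group_size=2, max_frames=1800):
    """Randomly pick one frame out of every `group_size` consecutive frames,
    stopping once `max_frames` target video frame images have been collected."""
    capture = cv2.VideoCapture(video_path)
    frames, group = [], []
    while len(frames) < max_frames:
        ok, frame = capture.read()
        if not ok:                       # end of the video data
            break
        group.append(frame)
        if len(group) == group_size:
            frames.append(random.choice(group))   # 1 frame per group, chosen at random
            group = []
    capture.release()
    return frames
```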
S13: and respectively labeling the face characteristic points of the target video frame image and the picture to respectively obtain the face characteristic point labeling information of the target video frame image and the face characteristic point labeling information of the picture.
Specifically, face feature points are labeled for each target video frame image obtained in step S12 to obtain the face feature point labeling information of each target video frame image, and face feature points are labeled for each picture obtained in step S11 to obtain the face feature point labeling information of each picture, where the face feature point labeling information includes the attribute information and position information of the face feature points. The attribute information indicates the facial feature to which the feature point belongs, and the position information is the pixel coordinates of the feature point in the face sample picture. Further, the face feature points of the target video frame images and the pictures are labeled by combining a preset face feature point labeling tool with manual correction, as detailed below:
(1) The target video frame images and the pictures are respectively input into a preset face feature point labeling tool, which labels the face feature points in the target video frame images and the pictures to obtain a first labeling result.
The preset face characteristic point marking tool can be an existing neural network tool capable of achieving the face characteristic point marking function, and the face characteristic points comprise face characteristics such as ears, eyebrows, eyes, a nose, lips and a face shape.
Because the existing neural network tool capable of realizing the human face feature point labeling function has low labeling accuracy, further manual correction is needed.
(2) The first labeling result is sent to a target user for confirmation and adjustment; correction information returned by the target user is received, and the incorrectly labeled information in the first labeling result is updated according to the correction information to obtain accurate face feature point labeling information.
S14: and processing the picture according to a preset processing mode to obtain a new picture and the face characteristic point marking information of the new picture.
Specifically, the preset processing manner includes, but is not limited to, horizontal turning, clockwise random rotation, counterclockwise random rotation, translation, scaling, brightness increase and decrease, and the like, the image obtained in step S11 is processed according to the preset processing manner to obtain a new image, and the position information in the face characteristic point label information of the image is correspondingly and synchronously updated according to the processing manner to obtain the face characteristic point label information of the new image.
It should be noted that, by processing the picture according to a preset processing mode, a new picture and the corresponding face feature point labeling information thereof are obtained, the sample data set can be enriched rapidly, and the labeling process of the face feature point labeling information in the step S13 does not need to be repeated, so that abundant and diverse face sample pictures are provided for training and testing of the face detection model, the diversity and balance of the samples are ensured, and the training and testing of the face detection model can be better supported.
S15: and taking the target video frame image, the picture and the new picture as the face sample picture.
Specifically, the target video frame image obtained in step S12, the picture obtained in step S11, and the new picture obtained in step S14 are all used as face sample pictures of the sample data set, and the face characteristic point annotation information of the target video frame image, the picture, and the new picture is the face characteristic point annotation information of the face sample picture.
In this embodiment, on the one hand, video frames are extracted from the video data and facial feature points are labeled on the obtained target video frame images. Because the change of facial pose between consecutive frames of video data is small, labeling the target video frame images with a preset face feature point labeling tool combined with manual correction yields a large amount of accurate sample data at low cost. Meanwhile, when the target video frame images are extracted, setting the frame extraction frequency avoids the insufficient data diversity that would result from the small changes of pose and expression across consecutive frames, and setting the maximum frame number prevents a long video from dominating the samples and causing the face detection model to over-fit. On the other hand, by processing the pictures, the picture data is expanded to the same order of magnitude as the video data. This embodiment therefore reduces the labeling cost of the face sample pictures and yields a sample data set containing abundant face sample pictures, which can effectively support the training and testing of the face detection model and thereby improve its training accuracy and prediction capability.
In an embodiment, as shown in fig. 7, in step S14, the processing the picture according to a preset processing manner to obtain a new picture and face feature point labeling information of the new picture specifically includes the following steps:
S141: And horizontally turning the picture to obtain the first picture and the face characteristic point marking information of the first picture.
Specifically, the picture is horizontally flipped, and the position information of each face feature point in the face feature point labeling information of the picture is synchronously and correspondingly adjusted according to the horizontally flipped corresponding relationship, so that the first picture and the face feature point labeling information of the first picture are obtained.
It is understood that the number of pictures and the number of the first pictures are the same, and when the sum of the number of pictures and the number of the first pictures is taken as the first number, the first number is 2 times the number of the pictures.
S142: and respectively rotating the picture and the first picture according to a preset rotating mode to obtain the second picture and the face characteristic point marking information of the second picture.
Specifically, the picture and the first picture obtained in step S141 are respectively rotated according to a preset rotation mode to obtain a second picture, and the position information of each face feature point in the face feature point labeling information of the picture and the first picture is synchronously and correspondingly adjusted according to the corresponding relationship of the preset rotation mode to obtain the face feature point labeling information of the second picture.
It should be noted that the preset rotation mode may specifically be clockwise random rotation or counterclockwise random rotation, but is not limited thereto, and the preset rotation mode may be set according to the needs of the practical application, and is not limited herein.
It can be understood that, if the preset rotation mode is a clockwise random rotation mode and a counterclockwise random rotation mode, the number of the obtained second pictures is 4 times of the number of the pictures, and at this time, the sum of the number of the second pictures and the first number is taken as the second number, and the second number is 6 times of the number of the pictures.
S143: and respectively carrying out translation processing and zooming processing on the face rectangular frames in the picture, the first picture and the second picture in sequence according to a preset offset and a preset zooming ratio to obtain the face characteristic point labeling information of the third picture and the third picture.
Specifically, the method includes respectively performing translation processing on face rectangular frames in a picture, a first picture and a second picture according to a preset offset, then performing scaling processing on the face rectangular frames in the picture after the translation processing, the first picture and the second picture according to a preset scaling to obtain a third picture, and simultaneously, synchronously and correspondingly adjusting position information of each face characteristic point in face characteristic point marking information according to a corresponding relation between the preset offset and the preset scaling to obtain the face characteristic point marking information of the third picture.
The preset offset and the preset scaling may be random values within a preset range.
It is understood that the number of the third pictures is 2 × 3 × 2=12 times the number of pictures.
S144: and randomly selecting a target picture from the pictures, the first picture, the second picture and the third picture according to a preset extraction ratio, and carrying out random brightness change processing on the target picture to obtain the face characteristic point labeling information of the fourth picture and the fourth picture.
Specifically, the target picture is randomly selected from the first picture obtained in step S141, the second picture obtained in step S142, the third picture obtained in step S143, and the pictures according to a preset extraction ratio. And carrying out random brightness change processing on the selected target picture to obtain a fourth picture, wherein the face characteristic point marking information of the target picture is the face characteristic point marking information of the fourth picture.
The random brightness change processing comprises brightness increasing or brightness reducing processing on randomly selected pixel points, and the increasing amplitude and the reducing amplitude can be randomly generated or can be determined by a preset amplitude threshold value. The preset extraction ratio may be set to 30% in general, but is not limited thereto, and may be specifically set according to the needs of the actual application.
It is understood that, when the preset extraction ratio is 30%, the number of the fourth pictures is 12 × 1.3=15.6 times the number of pictures.
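The random brightness change could, for example, be sketched as follows; the 30% pixel fraction and the ±40 amplitude bound are illustrative assumptions rather than values fixed by this embodiment.

import numpy as np

def random_brightness(image, pixel_fraction=0.3, max_delta=40):
    """Randomly brighten or darken a fraction of the pixels of an 8-bit image."""
    out = image.astype(np.int16)
    h, w = out.shape[:2]
    # Choose which pixels to perturb.
    mask = np.random.rand(h, w) < pixel_fraction
    # Random signed brightness shift per selected pixel.
    delta = np.random.randint(-max_delta, max_delta + 1, size=(h, w))
    if out.ndim == 3:
        delta = delta[..., None]
        mask = mask[..., None]
    out = np.where(mask, out + delta, out)
    return np.clip(out, 0, 255).astype(np.uint8)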
S145: and taking the first picture, the second picture, the third picture and the fourth picture as new pictures.
Specifically, the first picture obtained in step S141, the second picture obtained in step S142, the third picture obtained in step S143, and the fourth picture obtained in step S144 are all used as new pictures, and the face feature point label information of the first picture, the second picture, the third picture, and the fourth picture is the face feature point label information of the new pictures.
For example, assuming that the number of acquired pictures is 3300, the number of new pictures obtained after augmentation in this embodiment is about 50,000, which effectively expands the sample data set.
In this embodiment, a series of operations such as horizontal flipping, rotation, translation, scaling and random brightness change is performed on the pictures, so that the number of new pictures grows stage by stage. The sample data set is thus expanded rapidly without increasing the cost of labeling the face feature point information, the efficiency of acquiring the sample data set is improved, and a sample data set containing abundant face sample pictures is obtained, which effectively supports the training and testing of the face detection model and improves the training accuracy and prediction capability of the model.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.
In an embodiment, a face feature point detection device is provided, which corresponds one-to-one to the face feature point detection method in the above embodiments. As shown in fig. 8, the face feature point detection apparatus includes a first obtaining module 81, a sample dividing module 82, a model training module 83, a model testing module 84, a model optimizing module 85, a training result module 86, a second obtaining module 87, and a model predicting module 88. The functional modules are explained in detail as follows:
a first obtaining module 81, configured to obtain a sample data set, where the sample data set includes face sample pictures and face feature point labeling information of each face sample picture;
the sample dividing module 82 is used for dividing the sample data set into a training data set and a testing data set according to a preset dividing proportion;
the model training module 83 is configured to train an initial face detection model by using a training data set to obtain a trained face detection model, where the initial face detection model is a convolutional neural network including K parallel convolutional layers, a stitching layer, and a global pooling layer, each parallel convolutional layer has a visual perception range with a different preset scale, and K is a positive integer greater than or equal to 3;
the model testing module 84 is used for testing the trained face detection model by using the test data set and calculating the positioning accuracy of the trained face detection model on the face characteristic points according to the test result;
the model optimization module 85 is configured to, if the positioning accuracy is smaller than the preset accuracy threshold, re-divide the face sample pictures in the sample data set to obtain a new training data set and a new testing data set, train the trained face detection model using the new training data set to update the trained face detection model, and test the trained face detection model using the new testing data set until the positioning accuracy is greater than or equal to the preset accuracy threshold;
a training result module 86, configured to determine, if the positioning accuracy is greater than or equal to the preset accuracy threshold, a trained face detection model whose positioning accuracy is greater than or equal to the preset accuracy threshold as a trained face detection model;
the second obtaining module 87 is used for obtaining a human face picture to be detected;
and the model prediction module 88 is configured to input the face picture to be detected into the trained face detection model for calculation, so as to obtain a feature point prediction result of the face picture, where the feature point prediction result includes attribute information and position information of the target feature point.
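The interaction between the sample dividing module 82, the model training module 83, the model testing module 84 and the model optimization module 85 amounts to a split/train/test/re-split loop, sketched below under stated assumptions: the helper names train_fn and test_fn and the shuffle-based split are illustrative, not the implementation prescribed by this embodiment.

import random

def train_until_accurate(samples, split_ratio, accuracy_threshold,
                         train_fn, test_fn, model):
    """Sketch of the split / train / test / re-split loop.

    samples: list of (face_picture, feature_point_labels) pairs.
    train_fn(model, train_set) updates the model in place;
    test_fn(model, test_set) returns the positioning accuracy.
    """
    while True:
        random.shuffle(samples)                      # re-divide the sample data set
        cut = int(len(samples) * split_ratio)
        train_set, test_set = samples[:cut], samples[cut:]
        train_fn(model, train_set)                   # train / update the model
        accuracy = test_fn(model, test_set)          # test and compute accuracy
        if accuracy >= accuracy_threshold:
            return model                             # trained face detection model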
Further, when K equals 3 and the K parallel convolutional layers include a first convolutional layer, a second convolutional layer and a third convolutional layer, the model prediction module 88 includes:
a first normalization sub-module 881, configured to perform normalization processing on the face image to be detected to obtain first face data;
the first convolution calculation sub-module 882 is configured to input the first face data into a first convolution layer for convolution calculation to obtain a first convolution result;
a second normalization sub-module 883, configured to perform normalization processing on the first convolution result to obtain second face data;
the second convolution calculation submodule 884 is configured to input the second face data into the second convolution layer for convolution calculation to obtain a second convolution result;
a third normalization submodule 885, configured to perform normalization processing on the second convolution result to obtain third face data;
a third convolution calculation submodule 886, configured to input the third face data into a third convolution layer for convolution calculation, so as to obtain a third convolution result;
the splicing submodule 887 is configured to input the first convolution result, the second convolution result, and the third convolution result into a splicing layer for splicing calculation, so as to obtain a convolution output result;
and the pooling sub-module 888 is used for inputting the convolution output result into the global pooling layer to perform pooling calculation so as to obtain a feature point prediction result of the human face picture to be detected.
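To make the data flow through these sub-modules concrete, the following is a minimal sketch of such a K=3 network. The framework (PyTorch), the channel counts, the kernel sizes 3/5/7, the BatchNorm layers standing in for the normalization processing, and the final 1x1 projection to two coordinates per feature point are all assumptions made only so the sketch runs end to end; they are not details fixed by this embodiment.

import torch
import torch.nn as nn

class ParallelConvLandmarkNet(nn.Module):
    """Sketch: three convolution stages with different receptive fields,
    whose outputs are concatenated and globally pooled into a prediction."""

    def __init__(self, num_points=68):
        super().__init__()
        self.norm0 = nn.BatchNorm2d(3)
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # small receptive field
        self.norm1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5, padding=2)  # medium receptive field
        self.norm2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=7, padding=3)  # large receptive field
        self.project = nn.Conv2d(96, num_points * 2, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)                       # global pooling layer

    def forward(self, x):
        r1 = self.conv1(self.norm0(x))            # first convolution result
        r2 = self.conv2(self.norm1(r1))           # second convolution result
        r3 = self.conv3(self.norm2(r2))           # third convolution result
        merged = torch.cat([r1, r2, r3], dim=1)   # splicing (concatenation) layer
        out = self.pool(self.project(merged))     # (N, 2*num_points, 1, 1)
        return out.flatten(1)                     # predicted (x, y) for each feature point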
Further, the model test module 84 includes:
the error calculation submodule 841 is configured to calculate, according to the test result, a normalized average error of each test sample in the test data set corresponding to the test result;
a threshold segmentation submodule 842, configured to evenly divide the range from 0 to a preset error threshold according to a preset interval value to obtain P sub-thresholds, where P is a positive integer;
the percentage calculation submodule 843 is configured to count the statistical number of the test samples whose normalized average error is smaller than each sub-threshold, and calculate the percentage of the statistical number to the total number of the test samples in the test data set corresponding to the test result, so as to obtain P percentage values;
an accuracy calculation sub-module 844 is used for calculating an average of the P percentage values and taking the average as the positioning accuracy.
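For illustration, one possible way to compute this positioning accuracy from per-sample normalized average errors is sketched below; the inter-ocular normalization distance, the 0.1 error threshold and the 0.01 interval are assumptions used only to make the sketch concrete.

import numpy as np

def positioning_accuracy(pred_points, true_points, norm_dists,
                         error_threshold=0.1, interval=0.01):
    """Compute the positioning accuracy from test results.

    pred_points, true_points: (S, N, 2) arrays for S test samples and N points.
    norm_dists: (S,) normalization distances (e.g. inter-ocular distance).
    """
    # Normalized average error of each test sample.
    per_point = np.linalg.norm(pred_points - true_points, axis=2)   # (S, N)
    nme = per_point.mean(axis=1) / norm_dists                       # (S,)

    # Evenly divide (0, error_threshold] into P sub-thresholds.
    sub_thresholds = np.arange(interval, error_threshold + 1e-9, interval)

    # Percentage of test samples whose error falls below each sub-threshold.
    percentages = [(nme < t).mean() for t in sub_thresholds]

    # The positioning accuracy is the average of the P percentage values.
    return float(np.mean(percentages))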
Further, the first obtaining module 81 includes:
a data acquisition submodule 811 for acquiring video data and pictures;
a video frame extraction sub-module 812 for extracting a target video frame image from the video data according to a preset frame extraction frequency and a preset maximum frame number;
the labeling submodule 813 is configured to label the face feature points of the target video frame image and the picture respectively to obtain face feature point labeling information of the target video frame image and face feature point labeling information of the picture respectively;
the picture processing submodule 814 is configured to process the picture according to a preset processing manner, so as to obtain a new picture and face feature point labeling information of the new picture;
and the sample augmentation submodule 815 is used for taking the target video frame image, the picture and the new picture as the face sample picture.
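As a sketch only, one way the video frame extraction sub-module 812 might sample frames with OpenCV is shown below; the frame interval of 10 and the cap of 100 frames per video are illustrative assumptions standing in for the preset frame extraction frequency and the preset maximum frame number.

import cv2

def extract_target_frames(video_path, frame_interval=10, max_frames=100):
    """Sample frames from a video at a fixed interval, up to a maximum count."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        if index % frame_interval == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames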
Further, the picture processing submodule 814 includes:
the turning submodule 8141 is used for horizontally turning the picture to obtain a first picture and the face characteristic point marking information of the first picture;
the rotation sub-module 8142 is configured to respectively rotate the picture and the first picture according to a preset rotation mode to obtain a second picture and face feature point labeling information of the second picture;
the translation and scaling submodule 8143 is configured to perform translation processing and scaling processing on the face rectangular frames in the picture, the first picture and the second picture in sequence according to a preset offset and a preset scaling ratio, so as to obtain face feature point labeling information of a third picture and the third picture;
the brightness processing submodule 8144 is configured to randomly select a target picture from the picture, the first picture, the second picture and the third picture according to a preset extraction ratio, and perform random brightness change processing on the target picture to obtain the fourth picture and the face feature point labeling information of the fourth picture;
and a new sample submodule 8145, configured to take the first picture, the second picture, the third picture and the fourth picture as new pictures.
For the specific definition of the face feature point detection device, reference may be made to the above definition of the face feature point detection method, which is not repeated here. The modules in the face feature point detection device may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing the sample data set. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the face feature point detection method.
In an embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the face feature point detection method of the above embodiment, such as steps S1 to S8 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the human face feature point detection apparatus in the above-described embodiments, such as the functions of the modules 81 to 88 shown in fig. 8. To avoid repetition, the description is omitted here.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program is executed by a processor to implement the method for detecting the face feature point in the above method embodiment, or the computer program is executed by the processor to implement the functions of each module/unit in the apparatus for detecting the face feature point in the above apparatus embodiment. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (9)

1. A face feature point detection method is characterized by comprising the following steps:
acquiring a sample data set, wherein the sample data set comprises face sample pictures and face characteristic point marking information of each face sample picture;
dividing the sample data set into a training data set and a test data set according to a preset division ratio;
training an initial face detection model by using the training data set to obtain a trained face detection model, wherein the initial face detection model is a convolutional neural network comprising K parallel convolutional layers, a splicing layer and a global pooling layer, each parallel convolutional layer has a visual perception range with different preset scales, and K is a positive integer greater than or equal to 3;
testing the trained face detection model by using the test data set, and calculating the normalized average error of each test sample in the test data set corresponding to the test result according to the test result;
evenly dividing the range from 0 to a preset error threshold according to a preset interval value to obtain P sub-thresholds, wherein P is a positive integer;
counting the number of the test samples with the normalized average error smaller than each sub-threshold, and calculating the percentage of the counted number in the total number of the test samples in the test data set corresponding to the test result to obtain P percentage values;
calculating the average value of the P percentage values, and taking the average value as the positioning accuracy;
if the positioning accuracy is smaller than a preset accuracy threshold, re-dividing the face sample pictures in the sample data set to obtain a new training data set and a new testing data set, training the trained face detection model by using the new training data set to update the trained face detection model, and testing the trained face detection model by using the new testing data set until the positioning accuracy is greater than or equal to the preset accuracy threshold;
if the positioning accuracy is greater than or equal to the preset accuracy threshold, determining the trained face detection model with the positioning accuracy greater than or equal to the preset accuracy threshold as a trained face detection model;
acquiring a human face picture to be detected;
and inputting the human face picture to be detected into the trained human face detection model for calculation to obtain a characteristic point prediction result of the human face picture, wherein the characteristic point prediction result comprises attribute information and position information of a target characteristic point.
2. The method according to claim 1, wherein K is equal to 3, and the K parallel convolutional layers include a first convolutional layer, a second convolutional layer and a third convolutional layer, and the inputting the face picture to be detected into the trained face detection model for calculation to obtain the feature point prediction result of the face picture comprises:
carrying out standardization processing on the face picture to be detected to obtain first face data;
inputting the first face data into the first convolution layer for convolution calculation to obtain a first convolution result;
the first convolution result is subjected to standardization processing to obtain second face data;
inputting the second face data into the second convolution layer to carry out convolution calculation to obtain a second convolution result;
the second convolution result is subjected to standardization processing to obtain third face data;
inputting the third face data into the third convolution layer for convolution calculation to obtain a third convolution result;
inputting the first convolution result, the second convolution result and the third convolution result into the splicing layer for splicing calculation to obtain a convolution output result;
and inputting the convolution output result into the global pooling layer to perform pooling calculation to obtain the feature point prediction result.
3. The method according to any one of claims 1 to 2, wherein the acquiring the sample data set comprises:
acquiring video data and pictures;
extracting a target video frame image from the video data according to a preset frame extraction frequency and a preset maximum frame number;
respectively carrying out face characteristic point labeling on the target video frame image and the picture to respectively obtain face characteristic point labeling information of the target video frame image and face characteristic point labeling information of the picture;
processing the picture according to a preset processing mode to obtain a new picture and face characteristic point marking information of the new picture;
and taking the target video frame image, the picture and the new picture as the face sample picture.
4. The method for detecting human face feature points according to claim 3, wherein the processing the picture according to a preset processing mode to obtain a new picture and human face feature point labeling information of the new picture comprises:
performing horizontal turning processing on the picture to obtain a first picture and face characteristic point labeling information of the first picture;
respectively rotating the picture and the first picture according to a preset rotating mode to obtain a second picture and the marking information of the human face characteristic points of the second picture;
respectively carrying out translation processing and scaling processing on the face rectangular frames in the picture, the first picture and the second picture in sequence according to a preset offset and a preset scaling ratio to obtain a third picture and face characteristic point marking information of the third picture;
randomly selecting a target picture from the picture, the first picture, the second picture and the third picture according to a preset extraction ratio, and carrying out random brightness change processing on the target picture to obtain a fourth picture and the face characteristic point marking information of the fourth picture;
and taking the first picture, the second picture, the third picture and the fourth picture as the new pictures.
5. A face feature point detection device, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a sample data set, and the sample data set comprises human face sample pictures and human face characteristic point marking information of each human face sample picture;
the sample dividing module is used for dividing the sample data set into a training data set and a test data set according to a preset dividing proportion;
the model training module is used for training an initial face detection model by using the training data set to obtain a trained face detection model, wherein the initial face detection model is a convolutional neural network comprising K parallel convolutional layers, a splicing layer and a global pooling layer, each parallel convolutional layer has a visual perception range with different preset scales, and K is a positive integer greater than or equal to 3;
a model testing module for testing the trained face detection model using the test data set, the model testing module comprising:
the error calculation submodule is used for calculating the normalized average error of each test sample in the test data set corresponding to the test result according to the test result;
the threshold segmentation sub-module is used for evenly dividing the range from 0 to a preset error threshold according to a preset interval value to obtain P sub-thresholds, wherein P is a positive integer;
the percentage calculation submodule is used for counting the statistical number of the test samples of which the normalized average error is smaller than each sub-threshold value and calculating the percentage of the statistical number to the total number of the test samples in the test data set corresponding to the test result to obtain P percentage values;
the accuracy calculation submodule is used for calculating the average value of the P percentage values and taking the average value as the positioning accuracy;
a model optimization module, configured to, if the positioning accuracy is smaller than a preset accuracy threshold, re-divide the face sample picture in the sample data set to obtain a new training data set and a new test data set, train the trained face detection model using the new training data set to update the trained face detection model, and test the trained face detection model using the new test data set until the positioning accuracy is greater than or equal to the preset accuracy threshold;
a training result module, configured to determine, if the positioning accuracy is greater than or equal to the preset accuracy threshold, the trained face detection model with the positioning accuracy greater than or equal to the preset accuracy threshold as a trained face detection model;
the second acquisition module is used for acquiring a human face picture to be detected;
and the model prediction module is used for inputting the human face picture to be detected into the trained human face detection model for calculation to obtain a feature point prediction result of the human face picture, wherein the feature point prediction result comprises attribute information and position information of a target feature point.
6. The human face feature point detection device of claim 5, wherein K is equal to 3, and the K parallel convolution layers include a first convolution layer, a second convolution layer, and a third convolution layer, the model prediction module comprising:
the first standardization sub-module is used for carrying out standardization processing on the human face picture to be detected to obtain first human face data;
the first convolution calculation submodule is used for inputting the first face data into the first convolution layer to carry out convolution calculation so as to obtain a first convolution result;
the second standardization sub-module is used for carrying out standardization processing on the first convolution result to obtain second face data;
the second convolution calculation submodule is used for inputting the second face data into the second convolution layer to carry out convolution calculation so as to obtain a second convolution result;
the third standardization sub-module is used for carrying out standardization processing on the second convolution result to obtain third face data;
the third convolution calculation submodule is used for inputting the third face data into the third convolution layer to carry out convolution calculation so as to obtain a third convolution result;
the splicing submodule is used for inputting the first convolution result, the second convolution result and the third convolution result into the splicing layer for splicing calculation to obtain a convolution output result;
and the pooling sub-module is used for inputting the convolution output result into the global pooling layer to carry out pooling calculation so as to obtain the feature point prediction result.
7. The face feature point detection device according to claim 5 or 6, wherein the first acquisition module includes:
the data acquisition submodule is used for acquiring video data and pictures;
the video frame extraction submodule is used for extracting a target video frame image from the video data according to a preset frame extraction frequency and a preset maximum frame number;
the marking submodule is used for marking the face characteristic points of the target video frame image and the picture respectively to obtain face characteristic point marking information of the target video frame image and face characteristic point marking information of the picture respectively;
the picture processing submodule is used for processing the picture according to a preset processing mode to obtain a new picture and face characteristic point marking information of the new picture;
and the sample augmentation submodule is used for taking the target video frame image, the picture and the new picture as the face sample picture.
8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the face feature point detection method according to any one of claims 1 to 4 when executing the computer program.
9. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for face feature point detection according to any one of claims 1 to 4.
CN201810963841.0A 2018-08-23 2018-08-23 Face characteristic point detection method and device, computer equipment and storage medium Active CN109389030B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810963841.0A CN109389030B (en) 2018-08-23 2018-08-23 Face characteristic point detection method and device, computer equipment and storage medium
PCT/CN2018/120857 WO2020037898A1 (en) 2018-08-23 2018-12-13 Face feature point detection method and apparatus, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN109389030A CN109389030A (en) 2019-02-26
CN109389030B (en) 2022-11-29

Family

ID=65418558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810963841.0A Active CN109389030B (en) 2018-08-23 2018-08-23 Face characteristic point detection method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109389030B (en)
WO (1) WO2020037898A1 (en)

Also Published As

Publication number Publication date
CN109389030A (en) 2019-02-26
WO2020037898A1 (en) 2020-02-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant