CN112101105A - Training method and device for face key point detection model and storage medium


Info

Publication number: CN112101105A (granted as CN112101105B)
Application number: CN202010794471.XA
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: face, target, data sequence, key point, position data
Inventors: 马啸, 张阿强
Applicant and current assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Legal status: Granted; Active

Classifications

    • G06V 40/161 Human faces: detection, localisation, normalisation
    • G06F 18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/20 Image or video recognition or understanding: image preprocessing
    • G06V 20/41 Video scenes: higher-level, semantic clustering, classification or understanding, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 40/168 Human faces: feature extraction, face representation

Abstract

The embodiment of the invention discloses a training method, a device and a storage medium for a face key point detection model. Consecutive multi-frame face video frame images in a face video whose face key points have been labeled are label-smoothed to remove jitter, and a first face key point detection model is trained with the re-labeled multi-frame face video frame images. Because training therefore uses the positional relationship of each face key point across consecutive, de-jittered face video frame images, the stability and accuracy of the trained target face key point detection model are effectively improved. The method, device and storage medium are particularly suitable for detecting face key points in face video, where jitter is effectively reduced.

Description

Training method and device for face key point detection model and storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to a training method and device for a face key point detection model, and to a storage medium.
Background
Face key points serve to precisely locate and segment the parts of a human face, such as the exact outer contours of the eyes, eyebrows and mouth, and the outer contour of the face itself. They are applied in many fields, such as face deformation (face thinning, eye enlarging and the like), virtual makeup and animated films.
As the application fields of face key points multiply, the accuracy required of face key point detection keeps rising. A common approach at present is to train a face key point detection model and use it to detect the face key points. However, such models still suffer from low detection stability and accuracy; in particular, detecting a face in a video can cause the detected face key points to jitter.
Disclosure of Invention
The invention mainly aims to provide a training method, a device and a storage medium for a face key point detection model, which can solve the prior-art problem that the face key points detected by a face key point detection model are inaccurate and jitter.
In order to achieve the above object, a first aspect of the present invention provides a method for training a face keypoint detection model, where the method includes:
acquiring a target face sample data set with marked face key points, wherein the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
performing annotation smoothing processing on face key points of a face video frame image in a target sample subset to obtain a re-annotated target sample subset, wherein the target sample subset is any one sample subset;
and training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
In order to achieve the above object, a second aspect of the present invention provides a device for training a face keypoint detection model, the device comprising:
the acquisition module is used for acquiring a target face sample data set with labeled face key points, wherein the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
the smoothing module is used for carrying out labeling smoothing processing on the face key points of the face video frame images in the target sample subset to obtain a re-labeled target sample subset, wherein the target sample subset is any sample subset;
and the training module is used for training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
To achieve the above object, a third aspect of the present invention provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps as described in the first aspect.
The embodiment of the invention has the following beneficial effects:
the invention provides a training method for a face key point detection model, the method comprising: acquiring a target face sample data set with labeled face key points, wherein the target face sample data set comprises at least one sample subset and the sample subset comprises continuous multi-frame face video frame images in a face video; carrying out labeling smoothing processing on the face key points of the face video frame images in the sample subsets to obtain re-labeled sample subsets; and training the first face key point detection model by using the target face sample data set containing the re-labeled sample subsets to obtain the trained target face key point detection model. Continuous multi-frame face video frame images in a face video containing labeled face key points are label-smoothed to remove jitter, and the first face key point detection model is trained with the re-labeled multi-frame face video frame images, so that the positional relationship of the same face key point across the de-jittered continuous face video frame images is used during training. This effectively improves the stability and accuracy of the trained target face key point detection model; the method is particularly suitable for detecting face key points in face video, and jitter can be effectively reduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Wherein:
FIG. 1 is a schematic flow chart of a training method of a face key point detection model in an embodiment of the present invention;
FIG. 2 is another schematic flow chart of a training method for a face keypoint detection model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method of label smoothing according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the deduplication processing of a target frame number segment in an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating a refinement step of step 101 shown in FIG. 1 according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a training device for a face keypoint detection model in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a training method for a face keypoint detection model according to an embodiment of the present invention, where the method includes:
step 101, obtaining a target face sample data set with labeled face key points, wherein the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
in an embodiment of the present invention, the training method for the face keypoint detection model is mainly implemented by a training device for the face keypoint detection model, the training device is a program module and is stored in a readable storage medium of a computer device, and a processor in the computer device can call and operate the training device for the face keypoint detection model to implement the training method.
In the embodiment of the invention, a target human face sample data set is used for training a first human face key point detection model, so that the trained target human face key point detection model is used for detecting the human face key points of a human face video to be detected, the accuracy of human face key point detection is improved, and the jitter is reduced.
A target face sample data set with labeled face key points is used. The target face sample data set contains the sample data used to train the first face key point detection model, including but not limited to at least one sample subset, where one sample subset contains continuous multi-frame face video frame images from one face video. Each frame of face video frame image is labeled with the same number N of face key points, where N is a positive integer; for example, N may be 68, i.e. 68 face key points, which can be numbered in sequence from 0 to 67. It can be understood that the number of face key points the trained target face key point detection model can detect matches the number of face key points labeled in each frame of face video frame image in the sample subsets, and may, for example, be 68.
Labeling the face key points means determining, according to the numbering scheme of the face key points, the position coordinate value of each face key point in each frame of face video frame image; once the position coordinate value of a face key point is determined, that key point's labeling is complete. For example, if number 39 is preset to mark the inner corner of a person's left eye, a manual annotator finds the inner corner of the left eye and labels it 39; the position labeled 39 then represents face key point 39, and the coordinates of that position are the position coordinate value of face key point 39.
Every frame of face video frame image is labeled with face key points, and the number of labeled face key points is the same in every frame; for example, each frame may be labeled with 68 face key points. It can be understood that face key points with the same number in different face video frame images represent the same key point: if the face key point numbered 39 identifies the inner corner of the left eye, then every key point numbered 39 in a sample subset represents the inner corner of the left eye, and if its position coordinate values differ between face video frame images, the facial expression and/or the overall position of the face differs between those images.
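As a concrete illustration, one frame's labeling under this convention might be held as a mapping from key point number to position coordinate value; the data structure and coordinate values below are assumptions for illustration, not part of the invention:

```python
# Hypothetical per-frame annotation: key point number -> (x, y) coordinate value.
frame_labels = {
    39: (412.0, 305.5),   # face key point 39: inner corner of the left eye
    40: (398.2, 301.7),   # a neighbouring key point
    # ... one entry for each of the N (e.g. 68) labeled face key points
}
```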
In addition, the face key points may be labeled manually, or by other automatic labeling methods that are described in detail later in this text.
It should be understood that, in the embodiment of the present invention, the target face sample data set may only include the sample subset described above, or include the sample subset described above and other sample data, and in the case of including other sample data, the number of face keypoints already labeled in each frame or image is the same, and the labeling rule is consistent.
102, performing annotation smoothing on face key points of a face video frame image in a target sample subset to obtain a re-annotated target sample subset, wherein the target sample subset is any sample subset;
and 103, training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
In the embodiment of the invention, although face key points are labeled in the continuous multi-frame face video frame images contained in the sample subsets of the target face sample data set, neither manual labeling nor other labeling methods attend to the positional relationship of the face key points between frames. The position of the same face key point across the multi-frame face video frame images is therefore not necessarily smooth, and the probability of jitter is very high. If the first face key point detection model were trained on continuous multi-frame face video frame images whose face key points are not smooth, the resulting target face key point detection model could not eliminate face key point jitter when detecting the face key points of a face video. Therefore, in order to fully use the positional relationship of the same face key point between the continuous multi-frame face video frame images of a sample subset and remove jitter from face key point detection on face video, the face key points of the continuous multi-frame face video frame images in each sample subset are label-smoothed to obtain re-labeled sample subsets.
Labeling smoothing processing updates the position coordinate values of the same face key point across the continuous multi-frame face video frame images of a sample subset, so that the updated position coordinate values form a smooth line across the frames; the jitter of the face key point within the sample subset is thereby removed.
It should be noted that the target face sample data set contains at least one sample subset, and the face key points of every sample subset need label smoothing. To keep the description clear, the concept of a target sample subset is used here: the target sample subset stands for any sample subset of the target face sample data set. The face key points of the face video frame images in the target sample subset are label-smoothed to obtain a re-labeled target sample subset, and the first face key point detection model is then trained with the target face sample data set containing the re-labeled sample subsets to obtain the trained target face key point detection model. It can be understood that at least one sample subset of the target face sample data set undergoes label smoothing.
In the embodiment of the invention, continuous multi-frame face video frame images in a face video containing labeled face key points are label-smoothed to remove jitter, and the first face key point detection model is trained with the re-labeled multi-frame face video frame images. Because the positional relationship of the same face key point across the de-jittered continuous face video frame images is used during training, the model can learn both the accuracy of each key point's position and the stability of key points across consecutive frames, which effectively improves the stability and accuracy of the trained target face key point detection model. The approach is particularly suitable for detecting face key points in face video and effectively reduces jitter.
For better understanding of the technical solution in the embodiment of the present invention, please refer to fig. 2, which is another schematic flow chart of the training method of the face keypoint detection model in the embodiment of the present invention, including:
step 201, obtaining a target face sample data set with labeled face key points, wherein the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
it is understood that the content of step 201 is similar to that described in step 101 in the embodiment shown in fig. 1, and specific reference may be made to the related content described in step 101, which is not described herein again.
202, acquiring a first position data sequence of the target key point, wherein the first position data sequence comprises position coordinate values of the target key point in a plurality of frames of target face video frame images in the target sample subset, and the position coordinate values in the first position data sequence are ordered according to the frame number of the target face video frame images;
step 203, performing label smoothing processing on the first position data sequence to obtain a second position data sequence of the target key point, and updating labels of the target key points in the multi-frame target face video frame images according to the second position data sequence to obtain a re-labeled target sample subset;
and 204, training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
In the embodiment of the present invention, the sample subsets of the target face sample data set need label smoothing because each sample subset contains continuous multi-frame face video frame images from a face video; label smoothing yields sample subsets with reduced or removed jitter, so that the trained target face key point detection model reduces jitter and achieves a better result when detecting face key points in face video.
Take the target sample subset as an example. It contains continuous multi-frame face video frame images, each frame labeled with the same number of face key points under the same numbering rules; for example, there may be 68 face key points numbered 0 to 67, and the same number represents the same kind of face key point in different face video frames: the key point numbered 39, say, represents the inner corner of the left eye. For a target key point in the target sample subset, a first position data sequence can be obtained, containing the position coordinate values of the target key point in the multi-frame target face video frame images of the target sample subset, sorted by the frame number of the target face video frame images. With 68 face key points, the target key point may be any one of the 68. If the face key point numbered 39 is to be label-smoothed, key point 39 is taken as the target key point and its first position data sequence is obtained; if the target sample subset contains 100 continuous frames of face video frame images, the first position data sequence of face key point 39 contains its position coordinate values in those 100 frames, ordered by their frame numbers.
In the embodiment of the invention, after the first position data sequence of the target key point is obtained, it is label-smoothed to obtain the second position data sequence of the target key point, and the labels of the target key point in the multi-frame target face video frame images of the target sample subset are updated according to the second position data sequence. Every face key point is processed in the same way as the target key point, which yields the re-labeled target sample subset.
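A minimal sketch of how such a first position data sequence could be collected is given below. The per-frame annotation structure follows the hypothetical mapping shown earlier and is an assumption; the text does not prescribe an implementation.

```python
import numpy as np

def first_position_sequence(subset_labels, kp_number):
    """Collect the position coordinate values of one target key point across
    the consecutive frames of a sample subset, ordered by frame number.

    subset_labels: list of per-frame annotations in frame-number order,
    each a dict mapping key point number -> (x, y).  Hypothetical structure.
    Returns an (M, 2) array: the first position data sequence of the key point.
    """
    return np.array([frame[kp_number] for frame in subset_labels], dtype=float)
```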
Specifically, in a feasible implementation manner, the method for labeling smoothing processing specifically refers to fig. 3, which is a schematic flow chart of the method for labeling smoothing processing in the embodiment of the present invention, and includes:
301, performing deduplication processing on position coordinate values belonging to a target frame number segment in the first position data sequence to enable one target frame number segment to correspond to one position coordinate value so as to obtain a third position data sequence, wherein the target frame number segment comprises at least two consecutive frame numbers, and the position coordinate values corresponding to the two consecutive frame numbers are the same;
step 302, performing label smoothing processing on the third position data sequence to obtain a fourth position data sequence;
step 303, using the position coordinate value corresponding to the target frame number segment in the fourth position data sequence as the position coordinate value corresponding to each of the two consecutive frame numbers to obtain the second position data sequence.
In the embodiment of the present invention, the above steps 301 to 303 describe a process of performing labeling smoothing processing on the first position data sequence to obtain a second position data sequence of the target keypoint.
Each position coordinate value in the first position data sequence corresponds to a frame number, so the target frame number segments of the first position data sequence can be determined: a target frame number segment comprises at least two continuous frame numbers whose position coordinate values are identical. The position coordinate values belonging to a target frame number segment are deduplicated so that each target frame number segment corresponds to one position coordinate value, which gives the third position data sequence. It can be understood that the first position data sequence, being formed of position coordinate values, can form a curve, and the label smoothing in the embodiment of the invention can also be understood as smoothing that curve.
Further, the third position data sequence is label-smoothed to obtain the fourth position data sequence, and the position coordinate value corresponding to each target frame number segment in the fourth position data sequence is then used as the position coordinate value of each of that segment's continuous frame numbers, giving the second position data sequence.
For example, in the target sample subset, the first position data sequence of target key point i may be represented as f_i(x, y, t), where x and y are coordinate values in a standard two-dimensional coordinate system and t is the frame number. Because the face may stay still for 2 or more frames during some period of the face video corresponding to the target sample subset, the position coordinate values of target key point i may be identical over consecutive frames, and part of these position coordinate values need to be removed by deduplication. Suppose the first position data sequence contains the position coordinate values of t1 to t7, and t4, t5 and t6 are consecutive frame numbers with identical position coordinate values; t4 to t6 are then taken as a target frame number segment. Fig. 4 is a schematic diagram of the deduplication of a target frame number segment in an embodiment of the present invention: the position coordinate values of t1, t2 and t3 differ and are kept, while the identical position coordinate values of t4 to t6 correspond, after deduplication, to one position coordinate value. The resulting third position data sequence contains 5 position coordinate values: one each for t1, t2 and t3, one for t4-t6, and one for t7.
After the third position data sequence is obtained, it is label-smoothed to obtain the fourth position data sequence, which contains 5 smoothed position coordinate values corresponding to t1, t2, t3, t4-t6 and t7, namely a1, a2, a3, a4 and a5. After the smoothing, the target frame number segment is restored; specifically, the position coordinate value corresponding to the target frame number segment in the fourth position data sequence is taken as the position coordinate value of each of the segment's consecutive frame numbers to obtain the second position data sequence. Since the smoothed position coordinate value of the target frame number segment t4-t6 is a4, t4, t5 and t6 each correspond to a4, and the second position data sequence obtained is a1, a2, a3, a4, a4, a4, a5, so that every frame number has a corresponding position coordinate value.
In the embodiment of the invention, deduplicating the first position data sequence makes the label smoothing of the target key point effective and improves its result, and restoring the fourth position data sequence obtained from the label smoothing avoids omitting face video image frame data, so that the finally re-labeled target sample subset is complete.
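The deduplication of step 301 and the restoration of step 303 could be sketched as follows, under the assumption that a trajectory is an (M, 2) array as above; the actual implementation is not specified by the text.

```python
import numpy as np

def dedup_consecutive(seq):
    """Step 301: collapse each run of identical consecutive position coordinate
    values (a target frame number segment) into a single value, remembering the
    run lengths so the original frame count can be restored afterwards."""
    kept, runs = [seq[0]], [1]
    for point in seq[1:]:
        if np.allclose(point, kept[-1]):
            runs[-1] += 1            # same coordinates as the previous frame
        else:
            kept.append(point)
            runs.append(1)
    return np.asarray(kept), runs    # third position data sequence + run lengths

def restore_runs(smoothed, runs):
    """Step 303: give every frame of a target frame number segment the segment's
    smoothed position coordinate value, yielding the second position data sequence."""
    return np.asarray([p for p, n in zip(smoothed, runs) for _ in range(n)])
```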
Further, in the embodiment of the present invention, the step 302 specifically includes the following steps:
b1, placing a preset sliding window at the beginning of the third position data sequence, and determining a position coordinate set in the preset sliding window in the third position data sequence, wherein the length of the preset sliding window is K, and K is a positive odd number, wherein the placing of the preset sliding window at the beginning of the third position data sequence means that a first value in the preset sliding window is a first position coordinate value in the third position data sequence;
b2, determining the side length ratio of the external rectangle of the position coordinate set and a fitting straight line of the position coordinate set, wherein the side length ratio is the ratio of the longest edge and the shortest edge of the external rectangle;
b3, updating the ith position coordinate value in the preset sliding window according to the side length ratio and the fitted straight line, wherein i is (K+1)/2, moving the preset sliding window, and executing the step of determining the position coordinate set in the preset sliding window in the third position data sequence until the preset sliding window is placed at the end of the third position data sequence, wherein the preset sliding window being placed at the end of the third position data sequence means that the last value in the preset sliding window is the last position coordinate value in the third position data sequence;
and b4, after the preset sliding window is arranged at the end of the third position data sequence, determining the position data sequence obtained after the ith position coordinate value in the preset sliding window is updated as a fourth position data sequence.
In the embodiment of the present invention, a preset sliding window of length K is provided, where K is a positive odd number. The preset sliding window is first placed at the beginning of the third position data sequence, and the position coordinate set inside the window is determined in the third position data sequence; the window being at the beginning of the sequence means that the first value in the window is the first position coordinate value of the third position data sequence. For example, if the third position data sequence contains 100 position coordinate values sorted by frame number and K is 7, then when the preset sliding window is at the beginning of the sequence it contains the 7 position coordinate values of t1 to t7, and the position coordinate value of t1 is the first position coordinate value of the third position data sequence.
The circumscribed rectangle of the position coordinate set is then determined, specifically its side lengths. Treating the position coordinate values as values in a standard two-dimensional coordinate system, the maximum of the position coordinate set in the x coordinate direction, max(Kx), and its minimum in the x coordinate direction, min(Kx), are determined, so that the side length of the circumscribed rectangle in the x coordinate direction is

Lx = max(Kx) - min(Kx)

Similarly, the side length of the circumscribed rectangle in the y coordinate direction is

Ly = max(Ky) - min(Ky)

where Lx denotes the side length of the circumscribed rectangle in the x coordinate direction, Ly the side length in the y coordinate direction, Kx the components of the position coordinate values of the position coordinate set in the x coordinate direction, and Ky their components in the y coordinate direction.

The ratio of the longest side to the shortest side of the circumscribed rectangle is taken as the side length ratio, with the formula: max(Lx, Ly) / min(Lx, Ly).
Furthermore, a fitted straight line of the position coordinate set may be determined, with formula y = kx + b. The ith position coordinate value in the preset sliding window can then be updated according to the side length ratio of the circumscribed rectangle of the position coordinate set and the fitted straight line of the position coordinate set, where i = (K+1)/2; for example, if K is 7, the 4th position coordinate value in the preset sliding window is updated.
The above places the preset sliding window in the third position data sequence. The sliding step of the preset sliding window may be set to 1, i.e. the window moves by one position coordinate value at a time, and after each move the ith position coordinate value in the window is updated as described above, until the preset sliding window reaches the end of the third position data sequence; the window being at the end means that its last value is the last position coordinate value of the sequence. Once the ith value of that final window has been updated, the updating of the whole third position data sequence is finished and the fourth position data sequence is obtained. For example, suppose the third position data sequence contains 100 position coordinate values with frame numbers 1 to 100 and the preset sliding window has size 7, so the window always holds 7 position coordinate values. When the window is at the beginning of the sequence, the position coordinate set contains the values of t1 to t7 and the 4th value, that of t4, is updated. Moving one step, the set contains t2 to t8, with t4 now holding its updated value, and the new 4th value, that of t5, is updated; continuing in this way, t4 through t96 are updated, each move sliding the window one step over the last updated position data sequence. The window finally reaches the end of the sequence, covering t94 to t100, and its 4th value, that of t97, is updated; since t100 is the last position coordinate value of the third position data sequence, the window cannot move further, the updating ends, and the final fourth position data sequence is obtained.
It can be understood that, in the embodiment of the present invention, because the ith position coordinate value in the window is updated with i = (K+1)/2, the first (K-1)/2 and the last (K-1)/2 position coordinate values of the third position data sequence cannot be updated using the sliding window. In practical applications, since these end values have relatively little influence on the whole, they may simply be left un-updated; alternatively, after the smoothing of the target key points of a target sample subset is finished, the face video image frames whose head and tail position coordinate values were not updated may be deleted from the subset. Whether to keep or delete them can be chosen as needed and is not limited here. In addition, i = (K+1)/2 is the preferred choice of the position coordinate value to update; other choices are possible, for example i = (K+1)/2 + 1 or i = (K+1)/2 - 1, or any index other than the first and last of the K values; i can be determined as needed in practical application and is not limited here.
In an embodiment of the present invention, in step b3, the ith position coordinate value in the preset sliding window may be updated according to the side length ratio and the fitted straight line specifically as follows: when the side length ratio is smaller than a preset first threshold and the coefficient of determination of the fitted straight line is smaller than a preset second threshold, the centre of gravity of the position coordinate values contained in the position coordinate set is determined; the distance from each position coordinate value in the set to the centre of gravity is then determined, and the ith position coordinate value is updated to the position coordinate value with the minimum distance.
A coefficient of determination of the fitted straight line is to be found, which may be denoted R². The coefficient of determination is the proportion of the regression sum of squares in the total sum of squares and reflects how well the straight line fits; its value lies in [0, 1]. The closer R² is to 1, the better the fit of the straight line; the closer R² is to 0, the worse the fit.
The coefficient of determination of the fitted straight line is compared with the preset second threshold; when it is smaller than the preset second threshold, the fitted points tend toward a curve or broken line rather than a straight line, and the fit is poor.
When the side length ratio is smaller than the preset first threshold, the direction of a connecting line of K position coordinate values in the preset sliding window is likely to tend to a curve or a broken line instead of a straight line.
When the side length ratio is smaller than the preset first threshold and the coefficient of determination of the fitted straight line is smaller than the preset second threshold, jitter exists among the K position coordinate values in the preset sliding window, and the ith position coordinate value needs to be eliminated and updated so that the jitter can be removed.
The ith position coordinate value is updated as follows: the centre of gravity of the K position coordinate values in the preset sliding window is calculated, the distance from each position coordinate value to the centre of gravity is found, and the position coordinate value with the minimum distance becomes the updated ith position coordinate value. For example, with K equal to 7 and position coordinate values t1 to t7, the centre of gravity o is calculated and the distances d1 to d7 from t1 to t7 to o are found; if the minimum of the 7 distances is d2, the position coordinate value of t2 is used as the updated position coordinate value of t4.
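Putting steps b1 to b4 together with the update rule just described, one pass of the window might be sketched as below. The two thresholds are not fixed by the text, so ratio_thresh and r2_thresh are placeholder assumptions, as is the handling of degenerate windows.

```python
import numpy as np

def smooth_window_pass(seq, K=7, ratio_thresh=3.0, r2_thresh=0.5):
    """One pass of steps b1-b4 over a third position data sequence `seq`
    (an (M, 2) array).  K is the preset sliding window length (positive odd);
    ratio_thresh and r2_thresh stand in for the unspecified first and
    second thresholds."""
    out = seq.astype(float).copy()
    i = (K + 1) // 2 - 1                      # 0-based index of the i = (K+1)/2 value
    for start in range(len(out) - K + 1):     # slide with step length 1
        win = out[start:start + K]            # window over the last updated sequence
        x, y = win[:, 0], win[:, 1]
        # side lengths of the circumscribed (axis-aligned bounding) rectangle
        lx, ly = x.max() - x.min(), y.max() - y.min()
        if min(lx, ly) < 1e-9:                # degenerate window: skip (assumption)
            continue
        ratio = max(lx, ly) / min(lx, ly)
        # least-squares fitted line y = kx + b and its coefficient of determination
        k, b = np.polyfit(x, y, 1)
        ss_res = np.sum((y - (k * x + b)) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / max(ss_tot, 1e-9)
        if ratio < ratio_thresh and r2 < r2_thresh:
            # jitter detected: replace the ith value with the window point
            # closest to the window's centre of gravity
            centroid = win.mean(axis=0)
            nearest = np.argmin(np.linalg.norm(win - centroid, axis=1))
            out[start + i] = win[nearest]
    return out                                # fourth position data sequence
```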
In the embodiment of the invention, label smoothing effectively removes the jitter of the target key point within the target sample subset, so that a target face key point detection model trained on the target sample subset can refer to the positional relationship of the target key point across different face video image frames, improving the accuracy of face key point detection with the target face key point detection model.
It can be understood that the above label smoothing method suits the curve formed by the first position data sequence of a target key point, and in particular the small-scale kinks and shakes that appear in such a curve; it is generally applied to smooth the jitter that arises when the face moves little or is static, which may be called the first jitter scenario.
In another jitter case, when the face moves relatively fast, the curve formed by the first position data sequence of the target key point is not smooth but takes a broken-line shape; this may be called the second jitter scenario.
For the second jitter scenario, smoothing may also be performed; in this case, the label smoothing of the third position data sequence in step 302 to obtain the fourth position data sequence specifically comprises:
and carrying out labeling smoothing processing on the third position data sequence by utilizing a preset smoothing algorithm to obtain a fourth position data sequence, wherein the smoothing algorithm is a median filtering algorithm, a Gaussian filtering algorithm or a Savitzky-Golay filter. By the method, the jitter in the second jitter scene can be effectively removed.
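As an illustrative sketch, the three algorithms named above could be applied with SciPy's standard filters to the x and y components of a third position data sequence independently; the window sizes and sigma below are assumptions, as the text does not specify any parameters.

```python
import numpy as np
from scipy.signal import medfilt, savgol_filter
from scipy.ndimage import gaussian_filter1d

def filter_sequence(seq, method="savgol"):
    """Label-smooth a third position data sequence (an (M, 2) array) with one
    of the algorithms named in the text; parameters are illustrative only."""
    x, y = seq[:, 0], seq[:, 1]
    if method == "median":
        x, y = medfilt(x, kernel_size=5), medfilt(y, kernel_size=5)
    elif method == "gaussian":
        x, y = gaussian_filter1d(x, sigma=1.5), gaussian_filter1d(y, sigma=1.5)
    else:                                   # Savitzky-Golay filter
        x = savgol_filter(x, window_length=7, polyorder=2)
        y = savgol_filter(y, window_length=7, polyorder=2)
    return np.stack([x, y], axis=1)         # fourth position data sequence
```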
It can be understood that, in practical application, the jitter of the first jitter scenario may be removed first and then that of the second, or the other way round; alternatively, it may first be determined whether jitter of the first and/or second scenario exists and the corresponding removal chosen. This can be set according to actual needs and is not limited here.
In the embodiment of the present invention, the multi-frame face video frame images in the sample subsets of the target face sample data set may be pre-labeled manually and then re-labeled by label smoothing, or they may be pre-labeled automatically, without manual labeling. Specifically, referring to fig. 5, a schematic flow chart of the refinement of step 101 of fig. 1 in an embodiment of the present invention comprises:
501, obtaining an initial face sample data set, wherein the initial face sample data set comprises a plurality of common face sample images with face key points marked, and the common face sample images are single-person single images which are discontinuous in time;
step 502, training a first face key point detection model by using an initial face sample data set to obtain a second face key point detection model;
step 503, inputting the face video into the second face key point detection model, so as to perform face key point labeling on the continuous multi-frame face video frame images in the face video, obtain a sample subset of the face video, and obtain a target face sample data set.
In the embodiment of the invention, an initial face sample data set is obtained; it comprises a plurality of ordinary face sample images with labeled face key points, each an individual, temporally discontinuous single-person image. The initial face sample data set is first used to train the first face key point detection model, yielding the second face key point detection model. It can be understood that the second face key point detection model can detect face key points and can therefore be used to label them: the face video is input into the second face key point detection model to label the face key points of its continuous multi-frame face video frame images, giving a sample subset for that face video, and the sample subsets form the target face sample data set.
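A sketch of this automatic pre-labeling (step 503) is shown below. It assumes OpenCV for video decoding and treats the second face key point detection model as a callable mapping a frame to an (N, 2) key point array; that interface is an assumption, not given by the text.

```python
import cv2  # assumes OpenCV is available for video decoding

def pre_label_video(video_path, second_model):
    """Run the second (image-trained) face key point detector over every frame
    of a face video to build one pre-labeled sample subset."""
    cap = cv2.VideoCapture(video_path)
    subset = []
    while True:
        ok, frame = cap.read()
        if not ok:                           # end of the face video
            break
        subset.append(second_model(frame))   # per-frame face key point labels
    cap.release()
    return subset
```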
Further, after the multi-frame continuous video frame images in the sample subsets have been labeled by the second face key point detection model, the first face key point detection model may, in step 103, be trained with the target face sample data set containing the re-labeled sample subsets, or with the initial face sample data set together with the re-labeled target face sample data set, to obtain the trained target face key point detection model and complete the model training.
Or, in another feasible implementation, the re-labeled target face sample data set, or the initial face sample data set together with the re-labeled target face sample data set, may be used to fine-tune the second face key point detection model to obtain the target face key point detection model.
The fine-tuning process generally involves two types of operation. One modifies the model's output (e.g. the number or kind of classification types, or the type or parameters of the loss function). The other initialises the network's parameters during training not randomly, but with the parameters of a model already trained on a large data set. The reason is that the parameters of a model trained on a large data set already contain a large number of useful convolution filters, so rather than initialising all of the model's parameters from scratch, the already-trained parameters are used as the starting point of training; this not only saves a large amount of training time but also improves performance. Fine-tuning the second face key point detection model obtained from the initial face sample data therefore effectively saves training time and improves the performance of the resulting target face key point detection model. Moreover, because at least the re-labeled multi-frame face video frame images are used for fine-tuning, the fine-tuned target face key point detection model can refer to the positional relationship between face key points across the multi-frame face video frame images, so that when it detects face key points it yields key points with little or no jitter and a good detection result.
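A minimal fine-tuning setup in this spirit might look as follows; PyTorch is an assumption, as are the `backbone` and `head` attribute names. The already-trained parameters of the second model serve as the starting point, and the output head is trained with a larger learning rate than the feature-extracting backbone.

```python
import torch

def finetune_optimizer(second_model, lr_backbone=1e-5, lr_head=1e-3):
    """Reuse the trained parameters of the second face key point detection
    model as initial values and fine-tune: the output head gets a larger
    learning rate than the backbone.  `backbone` and `head` are assumed
    attribute names for illustration."""
    return torch.optim.Adam([
        {"params": second_model.backbone.parameters(), "lr": lr_backbone},
        {"params": second_model.head.parameters(), "lr": lr_head},
    ])
```

Training would then proceed as in step 103, with the re-labeled sample subsets (optionally together with the initial face sample data set) as training data.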
In the embodiment of the invention, continuous multi-frame face video frame images in a face video containing labeled face key points are label-smoothed to remove jitter, and the first face key point detection model is trained with the re-labeled multi-frame face video frame images, so that the positional relationship of the same face key point across the de-jittered continuous face video frame images is used during training. This effectively improves the stability and accuracy of the trained target face key point detection model; the approach is particularly suitable for detecting face key points in face video and effectively reduces jitter.
It can be understood that the trained target face key point detection model is preferentially applied to detecting face key points in face video. Specifically, a face video to be detected can be input into the target face key point detection model, which outputs the face key points contained in each frame of face video frame image of the video; the output key points are smooth, with little jitter, and the detection effect is good.
Please refer to fig. 6, which is a schematic structural diagram of a training apparatus for a face keypoint detection model according to an embodiment of the present invention, including:
an obtaining module 601, configured to obtain a target face sample data set with labeled face key points, where the target face sample data set includes at least one sample subset, and the sample subset includes multiple continuous frames of face video frame images in a face video;
a smoothing module 602, configured to perform labeling smoothing processing on face key points of a face video frame image in a target sample subset to obtain a target sample subset that is re-labeled, where the target sample subset is any sample subset;
the training module 603 is configured to train the first face keypoint detection model by using the target face sample data set including the re-labeled sample subset, so as to obtain a trained target face keypoint detection model.
In the embodiment of the present invention, the content described in the apparatus embodiment shown in fig. 6 is similar to the content described in the foregoing method embodiment, and may specifically refer to the related content in the foregoing method embodiment, which is not described herein again.
In the embodiment of the invention, continuous multi-frame face video frame images in a face video containing labeled face key points are label-smoothed to remove jitter, and the first face key point detection model is trained with the re-labeled multi-frame face video frame images, so that the positional relationship of the same face key point across the de-jittered continuous face video frame images is used during training. This effectively improves the stability and accuracy of the trained target face key point detection model; the approach is particularly suitable for detecting face key points in face video and effectively reduces jitter.
In one embodiment, a computer-readable storage medium is proposed, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of:
acquiring a target face sample data set with marked face key points, wherein the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
performing annotation smoothing processing on face key points of a face video frame image in a target sample subset to obtain a re-annotated target sample subset, wherein the target sample subset is any one sample subset;
and training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A training method for a face key point detection model is characterized by comprising the following steps:
acquiring a target face sample data set with marked face key points, wherein the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
performing annotation smoothing processing on face key points of a face video frame image in a target sample subset to obtain a re-annotated target sample subset, wherein the target sample subset is any one sample subset;
and training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
2. The method according to claim 1, wherein the performing annotation smoothing on the face key points of the face video frame image in the target sample subset to obtain a re-annotated target sample subset comprises:
acquiring a first position data sequence of a target key point, wherein the first position data sequence comprises position coordinate values of the target key point in multi-frame target face video frame images in the target sample subset, the position coordinate values in the first position data sequence are sorted according to the frame number of the target face video frame images, and the target key point is any face key point;
and performing annotation smoothing processing on the first position data sequence to obtain a second position data sequence of the target key point, and updating the annotation of the target key point in the multi-frame target face video frame image according to the second position data sequence to obtain a re-annotated target sample subset.
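For claim 2, the first position data sequence is simply one key point's trajectory read out in frame order. A minimal sketch, assuming a (frames × key points × 2) label array as in the pipeline sketch above:

```python
import numpy as np

def first_position_sequence(labels, point_index):
    """Return one key point's (x, y) coordinates over consecutive
    frames, ordered by frame number (the first position data sequence).

    labels -- array of shape (num_frames, num_keypoints, 2)
    """
    return labels[:, point_index, :]  # shape (num_frames, 2)
```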
3. The method according to claim 2, wherein the performing labeling smoothing processing on the first position data sequence to obtain a second position data sequence of the target keypoint comprises:
deduplicating position coordinate values belonging to a target frame number segment in the first position data sequence, so that each target frame number segment corresponds to one position coordinate value, to obtain a third position data sequence, wherein a target frame number segment comprises at least two consecutive frame numbers whose corresponding position coordinate values are identical;
performing label smoothing processing on the third position data sequence to obtain a fourth position data sequence;
and taking the position coordinate value corresponding to the target frame number segment in the fourth position data sequence as the position coordinate value of each of the consecutive frame numbers in that segment, to obtain the second position data sequence.
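Claim 3's deduplication collapses each run of identical consecutive coordinates to a single value before smoothing, then expands the smoothed values back over the original frames. A minimal sketch, assuming NumPy arrays and a caller-supplied smoothing function:

```python
import numpy as np

def smooth_with_dedup(seq, smooth_fn):
    """Claim-3 style smoothing: deduplicate runs, smooth, expand back.

    seq       -- array of shape (n, 2), one coordinate per frame number
    smooth_fn -- smoothing applied to the deduplicated sequence
    """
    # Mark the first element of every run of identical consecutive values.
    keep = np.ones(len(seq), dtype=bool)
    keep[1:] = np.any(seq[1:] != seq[:-1], axis=1)
    third = seq[keep]                  # the third position data sequence
    fourth = smooth_fn(third)          # the fourth position data sequence
    # Map each frame number back to its run's (now smoothed) coordinate.
    run_id = np.cumsum(keep) - 1
    return fourth[run_id]              # the second position data sequence
```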
4. The method of claim 3, wherein performing label smoothing processing on the third position data sequence to obtain a fourth position data sequence comprises:
placing a preset sliding window at the head of the third position data sequence and determining the position coordinate set within the preset sliding window, wherein the length of the preset sliding window is K, K is a positive odd number, and placing the preset sliding window at the head of the third position data sequence means that the first value in the preset sliding window is the first position coordinate value in the third position data sequence;
determining the side length ratio of the circumscribed rectangle of the position coordinate set and a straight line fitted to the position coordinate set, wherein the side length ratio is the ratio of the longest side to the shortest side of the circumscribed rectangle;
updating the i-th position coordinate value in the preset sliding window according to the side length ratio and the fitted straight line, wherein i is any value in [1, K]; then moving the preset sliding window and repeating the step of determining the position coordinate set within the preset sliding window, until the preset sliding window reaches the tail of the third position data sequence, the preset sliding window being at the tail meaning that the last value in the preset sliding window is the last position coordinate value in the third position data sequence;
and after the preset sliding window reaches the tail of the third position data sequence, determining the position data sequence obtained after updating the i-th position coordinate values in the preset sliding window as the fourth position data sequence.
5. The method of claim 4, wherein the updating the i-th position coordinate value within the preset sliding window according to the side length ratio and the fitted straight line comprises:
when the side length ratio is smaller than a preset first threshold and the coefficient of determination of the fitted straight line is smaller than a preset second threshold, determining the center of gravity of the position coordinate values contained in the position coordinate set;
and determining the distance from each position coordinate value contained in the position coordinate set to the center of gravity, and updating the i-th position coordinate value to the position coordinate value with the minimum distance.
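Claims 4 and 5 together describe the window update rule. The sketch below is one reading of it; the value of K, both thresholds, and fixing i to the window's middle are assumptions made for illustration, not values from the patent.

```python
import numpy as np

def window_smooth(third, K=5, ratio_thresh=1.5, r2_thresh=0.5):
    """Sketch of the claims 4-5 sliding-window update.

    third -- array of shape (n, 2); requires n >= K and K odd.
    """
    seq = third.astype(float).copy()
    i = K // 2  # update the middle value at each window position
    for start in range(len(seq) - K + 1):
        win = seq[start:start + K]
        # Side length ratio of the axis-aligned circumscribed rectangle.
        extent = win.max(axis=0) - win.min(axis=0)
        ratio = extent.max() / max(extent.min(), 1e-9)
        # Coefficient of determination (R^2) of the least-squares line.
        a, b = np.polyfit(win[:, 0], win[:, 1], 1)
        residual = win[:, 1] - (a * win[:, 0] + b)
        ss_tot = max(((win[:, 1] - win[:, 1].mean()) ** 2).sum(), 1e-9)
        r2 = 1.0 - (residual ** 2).sum() / ss_tot
        if ratio < ratio_thresh and r2 < r2_thresh:
            # Low ratio and low R^2: the points cluster with no clear
            # direction of motion, so treat the spread as jitter and
            # snap the i-th value to the point nearest the centroid.
            center = win.mean(axis=0)
            nearest = np.argmin(np.linalg.norm(win - center, axis=1))
            seq[start + i] = win[nearest]
    return seq  # the fourth position data sequence
```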
6. The method of claim 3, wherein performing label smoothing processing on the third position data sequence to obtain a fourth position data sequence comprises:
and performing labeling smoothing processing on the third position data sequence by using a preset smoothing algorithm to obtain a fourth position data sequence, wherein the smoothing algorithm is a median filtering algorithm, a Gaussian filtering algorithm or a Savitzky-Golay filter.
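A minimal sketch of claim 6 using standard SciPy filters; the kernel size, sigma, window length, and polynomial order below are illustrative choices, not taken from the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import medfilt, savgol_filter

def smooth_third_sequence(third, method="savgol"):
    """Apply one of the claimed smoothing algorithms to an (n, 2)
    coordinate sequence, per axis; n must exceed the window sizes used."""
    third = np.asarray(third, dtype=float)
    if method == "median":
        return np.stack([medfilt(third[:, d], kernel_size=5)
                         for d in range(2)], axis=1)
    if method == "gaussian":
        return gaussian_filter1d(third, sigma=1.0, axis=0)
    # Savitzky-Golay: least-squares polynomial fit over a sliding window.
    return savgol_filter(third, window_length=5, polyorder=2, axis=0)
```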
7. The method according to any one of claims 1 to 6, wherein the obtaining the target face sample data set with labeled face key points comprises:
acquiring an initial face sample data set, wherein the initial face sample data set comprises a plurality of common face sample images with labeled face key points, each common face sample image being a single image of a single person that is temporally discontinuous from the others;
training the first face key point detection model by using the initial face sample data set to obtain a second face key point detection model;
inputting the face video into the second face key point detection model to perform face key point labeling on continuous multi-frame face video frame images in the face video, so as to obtain the sample subset of the face video and obtain the target face sample data set.
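Claim 7 describes a bootstrap: pretrain on still images, then let the resulting second model label the video frames. A hypothetical sketch, assuming a fit/predict model interface (an assumption made for illustration only):

```python
def build_target_dataset(first_model, still_images, still_labels, video_frames):
    # Train on temporally discontinuous single-person still images first.
    second_model = first_model.fit(still_images, still_labels)
    # The second model then labels every consecutive frame of the face
    # video; these predictions form a sample subset of the target set.
    subset = [(frame, second_model.predict(frame)) for frame in video_frames]
    return second_model, subset
```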
8. The method of claim 7, wherein training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model comprises:
fine-tuning the second face key point detection model by using the re-labeled target face sample data set, or by using both the initial face sample data set and the re-labeled target face sample data set, to obtain the target face key point detection model.
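Claim 8 fine-tunes the second model rather than retraining from scratch. A minimal sketch of the two claimed options, with the model interface again assumed:

```python
def finetune_to_target(second_model, relabeled_set, initial_set=None):
    """Continue training from the second model's weights, either on the
    re-labeled video set alone or mixed with the initial still-image set."""
    data = relabeled_set if initial_set is None else initial_set + relabeled_set
    return second_model.fit(data)  # the target face key point detection model
```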
9. A training device for a face key point detection model, the device comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a target face sample data set with labeled face key points, the target face sample data set comprises at least one sample subset, and the sample subset comprises continuous multi-frame face video frame images in a face video;
the smoothing module is used for performing annotation smoothing processing on the face key points of the face video frame images in a target sample subset to obtain a re-labeled target sample subset, wherein the target sample subset is any one of the sample subsets;
and the training module is used for training the first face key point detection model by using the target face sample data set containing the re-labeled sample subset to obtain the trained target face key point detection model.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 8.
CN202010794471.XA 2020-08-07 2020-08-07 Training method and device for human face key point detection model and storage medium Active CN112101105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794471.XA CN112101105B (en) 2020-08-07 2020-08-07 Training method and device for human face key point detection model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794471.XA CN112101105B (en) 2020-08-07 2020-08-07 Training method and device for human face key point detection model and storage medium

Publications (2)

Publication Number Publication Date
CN112101105A true CN112101105A (en) 2020-12-18
CN112101105B CN112101105B (en) 2024-04-09

Family

ID=73754402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794471.XA Active CN112101105B (en) 2020-08-07 2020-08-07 Training method and device for human face key point detection model and storage medium

Country Status (1)

Country Link
CN (1) CN112101105B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020037898A1 (en) * 2018-08-23 2020-02-27 平安科技(深圳)有限公司 Face feature point detection method and apparatus, computer device, and storage medium
CN109359575A (en) * 2018-09-30 2019-02-19 腾讯科技(深圳)有限公司 Method for detecting human face, method for processing business, device, terminal and medium
CN109508678A (en) * 2018-11-16 2019-03-22 广州市百果园信息技术有限公司 Training method, the detection method and device of face key point of Face datection model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAO Lisha; ZHANG Junwei; FANG Bo; ZHANG Shaolei; ZHOU Huan; ZHAO Feng: "Design and Implementation of a Facial Expression Recognition System Based on LBP and SVM", Journal of Guizhou Normal University (Natural Science Edition), no. 01 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750139A (en) * 2021-01-18 2021-05-04 腾讯科技(深圳)有限公司 Image processing method and device, computing equipment and storage medium
CN113158982A (en) * 2021-05-17 2021-07-23 广东中卡云计算有限公司 Semi-intrusive target key point marking method
CN113822254A (en) * 2021-11-24 2021-12-21 腾讯科技(深圳)有限公司 Model training method and related device
CN113822254B (en) * 2021-11-24 2022-02-25 腾讯科技(深圳)有限公司 Model training method and related device

Also Published As

Publication number Publication date
CN112101105B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN112101105A (en) Training method and device for face key point detection model and storage medium
CN108009465B (en) Face recognition method and device
CN105144239B (en) Image processing apparatus, image processing method
CN109035246B (en) Face image selection method and device
US11403874B2 (en) Virtual avatar generation method and apparatus for generating virtual avatar including user selected face property, and storage medium
US20160283780A1 (en) Positioning feature points of human face edge
CN110874594A (en) Human body surface damage detection method based on semantic segmentation network and related equipment
CN107633237B (en) Image background segmentation method, device, equipment and medium
CN109492576B (en) Image recognition method and device and electronic equipment
JP2015069495A (en) Person recognition apparatus, person recognition method, and person recognition program and recording medium therefor
CN113689324B (en) Automatic portrait object adding and deleting method and device based on two classification labels
CN111968134A (en) Object segmentation method and device, computer readable storage medium and computer equipment
CN109726195A (en) A kind of data enhancement methods and device
CN111368632A (en) Signature identification method and device
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN113657370B (en) Character recognition method and related equipment thereof
CN112084855B (en) Outlier elimination method for video stream based on improved RANSAC method
CN112036256A (en) Human face key point training device
CN117253110A (en) Diffusion model-based target detection model generalization capability improving method
CN112270747A (en) Face recognition method and device and electronic equipment
CN112464839A (en) Portrait segmentation method, device, robot and storage medium
WO2020244076A1 (en) Face recognition method and apparatus, and electronic device and storage medium
CN111125391B (en) Database updating method and device, electronic equipment and computer storage medium
CN111178200A (en) Identification method of instrument panel indicator lamp and computing equipment
CN115511731A (en) Noise processing method and noise processing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant