CN115457635A - Face key point detection model training method, live image processing method and device
- Publication number
- CN115457635A (application CN202211148240.7A)
- Authority
- CN
- China
- Prior art keywords
- face
- key point
- face image
- face key
- principal component
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/168: Feature extraction; Face representation (Human faces, e.g. facial parts, sketches or expressions)
- G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT] (Descriptors for shape, contour or point-related descriptors)
- G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
Abstract
The application relates to the technical fields of live webcasting and image processing, and provides a face key point detection model training method, a live image processing method, and a corresponding apparatus, device, and medium, which take into account both the accuracy and the efficiency of face key point detection. The method comprises the following steps: obtaining an average position according to the labeling position of each face key point on each face image in a data set; determining a principal component set of the face images in the data set and obtaining the fitting coefficient of each image for each principal component; inputting the images into a face key point detection model to be trained, which obtains the transformed face images through a first spatial transformation network, obtains the prediction fitting coefficient of each principal component through a coefficient prediction network, obtains the prediction position of each face key point according to the prediction fitting coefficients, the principal components and the average position, and obtains the prediction position of each face key point on the original face image through a second spatial transformation network; and training the model according to a first loss representing the consistency of the fitting coefficients and a second loss representing the consistency of the positions.
Description
Technical Field
The present application relates to the field of live webcast and image processing technologies, and in particular, to a method for training a face key point detection model, a method and an apparatus for processing live webcast face images, an electronic device, and a computer-readable storage medium.
Background
With the development of live webcasting technology, image processing technology has been applied to live broadcasts; for example, special effects such as beautification, makeup and face reshaping are applied to the face image in a live broadcast to improve the dissemination of the high-quality content shared there. The face key point detection algorithm is the basic algorithm underlying these special effects. With the rapid development of deep learning, the positioning accuracy of face key point detection has become increasingly high.
A deep-learning-based face key point detection algorithm needs to be trained on large face image data sets with manual labels. The approaches provided in the prior art, which fuse multiple data sets for model training and application, generally involve a very large amount of computation and have difficulty meeting the real-time requirements for face key point detection in fields such as live webcasting.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a training method for a face keypoint detection model, a live webcast face image processing method, a live webcast face image processing apparatus, an electronic device, and a computer-readable storage medium.
In a first aspect, the application provides a training method for a face key point detection model. The method comprises the following steps:
obtaining the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set;
determining a principal component set of the face images in the face image data set based on principal component analysis and labeling positions, and acquiring a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different morphological change dimensions of the human face;
inputting the face image into a face key point detection model to be trained, so that the face key point detection model to be trained obtains a transformed face image from the face image through a first spatial transformation network, obtains prediction fitting coefficients corresponding to the principal components from the transformed face image through a coefficient prediction network, obtains the prediction position of each face key point on the transformed face image according to the prediction fitting coefficients, the principal components and the average position, and obtains the prediction position of each face key point on the face image through a second spatial transformation network according to the prediction position of each face key point on the transformed face image;
acquiring a first model loss representing the consistency of the prediction fitting coefficient and the fitting coefficient, and acquiring a second model loss representing the consistency of the prediction position and the labeling position of each face key point;
and training the face key point detection model to be trained according to the first model loss and the second model loss.
In one embodiment, the obtaining the predicted position of each key point of the face on the transformed face image according to the prediction fitting coefficient and the principal components and average positions includes: obtaining the predicted position change of each face key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component; and obtaining the predicted position of each face key point on the transformed face image according to the predicted position change and the average position.
In one embodiment, the determining a principal component set of a face image in the face image data set includes: performing similarity transformation on the labeling positions of the face key points on each face image based on the average positions of the face key points to obtain the transformation positions of the face key points on each face image; and performing principal component analysis according to the transformation position of each face key point on each face image to obtain the principal component set.
In one embodiment, the performing principal component analysis according to the transformed position of each face key point on each face image to obtain the principal component set includes: normalizing the transformation position of each face key point on each face image relative to the center of the face image to obtain the normalized transformation position of each face key point on each face image; and performing principal component analysis according to the normalized transformation position of each face key point on each face image to obtain the principal component set.
In one embodiment, the obtaining of the fitting coefficient corresponding to each principal component of each face image includes: fitting the principal component set by using the normalized transformation position of each face key point on each face image to obtain the fitting coefficient corresponding to each principal component of each face image.
In one embodiment, the obtaining a first model loss characterizing the consistency of the predicted fit coefficients and the fit coefficients comprises: and obtaining the loss of the first model according to the difference value of the predicted fitting coefficient and the fitting coefficient.
In one embodiment, the obtaining a second model loss characterizing consistency of the predicted positions and the labeled positions of the face key points includes: and obtaining a difference value between the predicted position and the marked position corresponding to each face key point, obtaining a difference value average value based on the difference value between the predicted position and the marked position corresponding to each face key point, and obtaining a second model loss according to the difference value average value.
In a second aspect, the application provides a method for processing a face image in live webcasting. The method comprises the following steps:
acquiring a face image to be processed in live webcasting;
detecting through a trained human face key point detection model to obtain each human face key point on the human face image to be processed; the face key point detection model is obtained by training according to the training method of the face key point detection model;
and applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
In a third aspect, the application provides a training device for a face key point detection model. The device comprises:
the position acquisition module is used for acquiring the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set;
the principal component analysis module is used for determining a principal component set of the face images in the face image data set based on principal component analysis and labeling positions and acquiring a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different form change dimensions of the human face;
the image input module is used for inputting the face image into a face key point detection model to be trained so as to enable the face key point detection model to be trained to obtain a transformed face image according to the face image through a first spatial transformation network, obtain a prediction fitting coefficient corresponding to each principal component according to the transformed face image through a coefficient prediction network, obtain a prediction position of each face key point on the transformed face image according to the prediction fitting coefficient, each principal component and an average position, and obtain a prediction position of each face key point on the face image according to the prediction position of each face key point on the transformed face image through a second spatial transformation network;
the loss acquisition module is used for acquiring a first model loss representing the consistency of the predicted fitting coefficient and the fitting coefficient and acquiring a second model loss representing the consistency of the predicted position and the marked position of each face key point;
and the model training module is used for training the face key point detection model to be trained according to the first model loss and the second model loss.
In a fourth aspect, the application provides a live webcast face image processing device. The device comprises:
the image acquisition module is used for acquiring a face image to be processed in live webcasting;
the key point detection module is used for detecting and obtaining each face key point on the face image to be processed through the trained face key point detection model; the face key point detection model is obtained by training by using the training device of the face key point detection model;
and the image processing module is used for applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
In a fifth aspect, the present application provides an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor realizes the following steps when executing the computer program:
obtaining the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set; determining a principal component set of the face images in the face image data set based on principal component analysis and labeling positions, and obtaining a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different form change dimensions of the human face; inputting the face image into a face key point detection model to be trained, so that the face key point detection model to be trained obtains a transformed face image according to the face image through a first spatial transformation network, obtains a prediction fitting coefficient corresponding to each principal component according to the transformed face image through a coefficient prediction network, obtains a prediction position of each face key point on the transformed face image according to the prediction fitting coefficient, each principal component and an average position, and obtains a prediction position of each face key point on the face image according to the prediction position of each face key point on the transformed face image through a second spatial transformation network; acquiring a first model loss representing the consistency of the prediction fitting coefficient and the fitting coefficient, and acquiring a second model loss representing the consistency of the prediction position and the labeling position of each face key point; and training the face key point detection model to be trained according to the first model loss and the second model loss.
In a sixth aspect, the present application provides an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor realizes the following steps when executing the computer program:
acquiring a face image to be processed in live webcasting; detecting through a trained human face key point detection model to obtain each human face key point on the human face image to be processed; the face key point detection model is obtained by training according to the training method of the face key point detection model; and applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
In a seventh aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set; determining a principal component set of the face images in the face image data set based on principal component analysis and labeling positions, and obtaining a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different form change dimensions of the human face; inputting the face image into a face key point detection model to be trained, so that the face key point detection model to be trained obtains a transformed face image according to the face image through a first spatial transformation network, obtains prediction fitting coefficients corresponding to the principal components according to the transformed face image through a coefficient prediction network, obtains the prediction position of each face key point on the transformed face image according to the prediction fitting coefficients, the principal components and the average position, and obtains the prediction position of each face key point on the face image through a second spatial transformation network according to the prediction position of each face key point on the transformed face image; acquiring a first model loss representing the consistency of the prediction fitting coefficient and the fitting coefficient, and acquiring a second model loss representing the consistency of the prediction position and the labeling position of each face key point; and training the face key point detection model to be trained according to the first model loss and the second model loss.
In an eighth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a face image to be processed in live webcasting; detecting through a trained human face key point detection model to obtain each human face key point on the human face image to be processed; the face key point detection model is obtained by training according to the training method of the face key point detection model; and applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
In the training method of the face key point detection model, the live webcast face image processing method, and the corresponding apparatuses, devices and media described above, the average position of each face key point is obtained according to the labeling position of each face key point on each face image in a face image data set; based on principal component analysis and the labeling positions, the principal component set of the face images in the data set is determined and the fitting coefficient corresponding to each principal component of each face image is obtained, where different principal components in the principal component set correspond to different morphological change dimensions of a face. The face image is then input into the face key point detection model to be trained: the model obtains a transformed face image through a first spatial transformation network, obtains the prediction fitting coefficient corresponding to each principal component of the transformed face image through a coefficient prediction network, obtains the prediction position of each face key point on the transformed face image according to the prediction fitting coefficients, the principal components and the average position, and obtains the prediction position of each face key point on the original face image through a second spatial transformation network. The model is trained according to a first model loss representing the consistency of the prediction fitting coefficients with the fitting coefficients and a second model loss representing the consistency of the prediction positions with the labeling positions. This scheme combines the principal components obtained by principal component analysis of a face image data set and the fitting coefficient of each face image for each principal component with the labeling positions of the face key points on each face image and their average positions, and trains a face key point detection model to accurately predict the fitting coefficients of a face image, from which the positions of the face key points on the face image can be accurately predicted. The model structure is simple and efficient, so both the accuracy and the efficiency of face key point detection are taken into account, and the real-time requirements of fields such as live webcasting on face key point detection can be met while the face key points are detected accurately.
Drawings
FIG. 1 is a diagram illustrating an application scenario of a correlation method in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a training method of a face key point detection model in the embodiment of the present application;
FIG. 3 is a schematic diagram illustrating the average positions of the face key points in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a relationship between principal components and face key points in an embodiment of the present application;
FIG. 5 is a schematic diagram of a similarity transformation in an embodiment of the present application;
FIG. 6 is a schematic view of a process of processing a face image by a face keypoint detection model in an embodiment of the present application;
FIG. 7 is a schematic flow chart of a face image processing method of live webcasting in an embodiment of the present application;
FIG. 8 is a block diagram of a training apparatus for a face keypoint detection model in an embodiment of the present application;
FIG. 9 is a block diagram of a structure of a face image processing apparatus for live webcasting in an embodiment of the present application;
FIG. 10 is an internal structural diagram of an electronic device in an embodiment of the present application;
FIG. 11 is an internal structural diagram of an electronic device in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The training method of the face key point detection model and the live webcast face image processing method provided by the embodiment of the application can be applied to an application scenario shown in fig. 1, where the application scenario may include a terminal 110 and a server 120, and the terminal 110 communicates with the server 120 through a network. The terminal 110 may be, but not limited to, various personal computers, notebook computers, smart phones, and tablet computers, and the server 120 may be implemented by an independent server or a server cluster formed by a plurality of servers.
The following sections sequentially explain a training method of the face key point detection model and a live webcast face image processing method in the present application, based on the application scenario shown in fig. 1, in conjunction with various embodiments and related drawings.
In one embodiment, as shown in fig. 2, the present application provides a method for training a face keypoint detection model, which may be performed by a server 120, and the method may include the following steps:
step S201, obtaining the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set.
Specifically, the face image data set may include a plurality of face images together with the labeled position of each face key point on these face images, recorded as the labeling position. A face image is an image containing a face, and the labeling position of each face key point on a face image can be obtained through manual annotation; once annotation is finished, the face image data set is obtained. The labeling position can be represented by position coordinates on the face image. In an actual scenario, the relevant personnel can complete the position labeling of each face key point on the terminal 110 to obtain the face image data set, the terminal 110 then transmits the face image data set to the server 120 through the network, and the server 120 thus obtains the face image data set. In the face image data set obtained by the server 120, each face image has the same number of face key points; the total number of face images in the face image data set is denoted N, each face image has L face key points, and L is the total number of key points corresponding to a face. In this step, the server 120 may calculate the average position of each of the L face key points according to the labeling position of each face key point on each face image in the face image data set.
Specifically, the set containing the average positions of the L face key points is denoted as M, and the effect of marking the set M on a blank image in the form of points is shown in FIG. 3; that is, the set M also contains the position information of the L face key points. In the set M, the average position of the i-th face key point can be expressed as

$$M_i = \frac{1}{N}\sum_{j=1}^{N} q_i^{j},$$

where $q_i^{j}$ denotes the labeling position of the i-th face key point on the j-th face image.
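As a concrete illustration of this averaging step, the following is a minimal sketch; the array layout and the names used here are assumptions for illustration, not taken from the patent.

```python
import numpy as np

# q: labeled key point positions for the whole data set,
# shape (N, L, 2): N face images, L key points, (x, y) coordinates.
def mean_shape(q: np.ndarray) -> np.ndarray:
    """Average position M_i of each face key point over all N images."""
    return q.mean(axis=0)  # shape (L, 2)
```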
Step S202, based on principal component analysis and labeling position, determining a principal component set of the face images in the face image data set, and obtaining a fitting coefficient corresponding to each principal component of each face image.
In this step, the server 120 determines the principal component set of the face images in the face image data set based on principal component analysis and the labeling positions of the face key points on each face image in the data set, and obtains the fitting coefficient corresponding to each principal component of each face image. Different principal components in the principal component set correspond to different morphological change dimensions of the face. Specifically, principal component analysis (PCA) may be performed on the coordinates of the labeling positions of the face key points on each face image in the data set to obtain the principal component set. The principal component set includes a plurality of principal components, and different principal components correspond to different morphological change dimensions of the face. If the total number of face key points is L and the position of each face key point is represented by two coordinate values (x, y), then each principal component obtained by principal component analysis is a 2 × L-dimensional vector, the total number of principal components in the extracted principal component set is at most 2 × L, and each principal component corresponds to a different morphological change dimension of the face; in other words, each principal component controls a different morphological change attribute of the face. This is illustrated in FIG. 4, which shows the position changes of the face key points (overall, the morphological changes of the face) when the first five principal components in the principal component set are given different coefficients (+0.5 and -0.5): the first principal component may correspond to the morphological change dimension of left-right rotation of the face, the second principal component to raising and lowering the head, the third principal component to how plump or thin the face is, and so on; each principal component can thus be used to control a different morphological change attribute of the face. The server 120 may then fit the obtained principal component set based on the labeling positions of the face key points on each face image in the data set to obtain the fitting coefficient corresponding to each principal component of each face image. In a specific implementation, the principal component set is fitted through the standard PCA procedure based on the labeling positions of all face key points on a face image, yielding the fitting coefficients corresponding to the respective principal components of that face image. If the total number of face images in the data set is N and the number of principal components in the obtained principal component set is P, a PCA coefficient matrix of dimension N × P (N rows and P columns) is obtained by fitting, and the matrix elements of the PCA coefficient matrix are the fitting coefficients corresponding to the respective principal components.
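A minimal sketch of this PCA and coefficient-fitting step, assuming scikit-learn is available and the key point positions have already been aligned and normalized as described in the embodiments below; function and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_shape_pca(shapes: np.ndarray, n_components: int):
    """shapes: (N, L, 2) aligned/normalized key point positions.

    Returns the principal component set (P, 2*L), the mean shape (2*L,),
    and the per-image fitting coefficients (the N x P PCA coefficient matrix).
    """
    n, l, _ = shapes.shape
    x = shapes.reshape(n, 2 * l)        # one 2*L-dimensional vector per face
    pca = PCA(n_components=n_components)
    coeffs = pca.fit_transform(x)       # (N, P) fitting coefficients
    components = pca.components_        # (P, 2*L), each row one principal component
    return components, pca.mean_, coeffs

# A face shape is then approximated by mean + coeffs @ components,
# i.e. each coefficient scales one morphological change dimension.
```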
Further, in some embodiments, determining the principal component set of the face image in the face image data set in step S202 may include:
performing similarity transformation on the labeling positions of the key points of the human face on each human face image based on the average position of the key points of the human face to obtain the transformation positions of the key points of the human face on each human face image; and performing principal component analysis according to the transformation position of each face key point on each face image to obtain a principal component set.
In this embodiment, the server 120 may perform a similarity transformation on the labeling position of each face key point on each face image in the face image data set based on the average positions of the face key points, obtaining, for each face key point on each face image, its labeling position after the similarity transformation, recorded as the transformation position; principal component analysis may then be performed on the coordinates of the transformation positions of the face key points on each face image to obtain the principal component set, which improves the accuracy and reliability of face key point detection by the trained model. Regarding the similarity transformation, in a specific implementation, for each face image in the face image data set, the server 120 may compute a similarity transformation matrix between the manually labeled positions of the face key points and the set M of average positions of the face key points. As shown in FIG. 5, the server 120 may rectify each face image in the data set according to its similarity transformation matrix and may likewise rectify each face key point (labeling position) on the image according to the same matrix, thereby obtaining the transformation position of each face key point on each face image.
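A minimal sketch of this alignment, assuming OpenCV's estimateAffinePartial2D (which fits a rotation, uniform scale and translation, i.e. a similarity transform) is used; the output frame size and all names are assumptions, not the patent's implementation.

```python
import cv2
import numpy as np

def align_to_mean(image, landmarks, mean_positions):
    """Estimate a similarity transform mapping the labeled landmarks (L, 2)
    onto the average positions (L, 2), then apply it to the image and the points."""
    m, _ = cv2.estimateAffinePartial2D(
        landmarks.astype(np.float32), mean_positions.astype(np.float32))
    h, w = image.shape[:2]                         # reuse the original size as the canonical frame
    rectified = cv2.warpAffine(image, m, (w, h))
    ones = np.ones((landmarks.shape[0], 1), dtype=np.float32)
    transform_positions = np.hstack([landmarks, ones]) @ m.T   # (L, 2)
    return rectified, transform_positions
```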
Based on this, in an embodiment, in the above embodiment, performing principal component analysis according to the transformation position of each face key point on each face image to obtain a principal component set, further includes:
normalizing the transformation position of each face key point on each face image relative to the center of the face image to obtain the normalized transformation position of each face key point on each face image; and performing principal component analysis according to the normalized transformation position of each face key point on each face image to obtain a principal component set.
In this embodiment, the server 120 may perform the similarity transformation on the labeling positions of the face key points on each face image in the manner described above to obtain the transformation positions of the face key points on each face image, and on this basis perform normalization processing on these transformation positions to obtain the normalized transformation positions of the face key points on each face image. The server 120 then performs principal component analysis on the normalized transformation positions of the face key points on each face image to obtain the principal component set, which further improves the accuracy and reliability of face key point detection by the trained model. For the normalization processing, in a specific implementation, the server 120 may normalize the transformation positions of the face key points on each face image with respect to the center of the face image, so that the position coordinates range from -1 to 1, obtaining the normalized transformation positions of the face key points on each face image. As an example, assuming that the width and height of the face image are 100 and 200 respectively, and the coordinates of the transformation position of a face key point on it are (50, 200), the normalized transformation position obtained by normalizing with respect to the center of the face image is (0, 1).
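A short sketch of this center normalization, matching the numeric example above; the function name is an assumption.

```python
import numpy as np

def normalize_to_center(points, width, height):
    """Map pixel coordinates to [-1, 1] relative to the image center,
    e.g. (50, 200) on a 100 x 200 image -> (0.0, 1.0)."""
    cx, cy = width / 2.0, height / 2.0
    x = (points[:, 0] - cx) / cx
    y = (points[:, 1] - cy) / cy
    return np.stack([x, y], axis=1)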
Based on the foregoing embodiments, in an embodiment, the obtaining of the fitting coefficient corresponding to each main component of each face image in step S202 may further include: and fitting the principal component set by utilizing the normalized transformation position of each face key point on each face image to obtain a fitting coefficient corresponding to each principal component of each face image. In this embodiment, specifically, the server 120 may fit the principal component set obtained in the foregoing step by using the coordinates of the normalized transformation position of each face key point on each face image obtained after the normalization processing, so as to obtain a fitting coefficient corresponding to each principal component of each face image in the face image data set, so as to improve the accuracy and reliability of the trained model for detecting the face key points.
Step S203, inputting a face image into the face key point detection model to be trained, so that the face key point detection model to be trained obtains a transformed face image according to the face image through a first spatial transformation network, obtains a prediction fitting coefficient corresponding to each principal component according to the transformed face image through a coefficient prediction network, obtains a prediction position of each face key point on the transformed face image according to the prediction fitting coefficient, each principal component and an average position, and obtains a prediction position of each face key point on the face image according to the prediction position of each face key point on the transformed face image through a second spatial transformation network.
In this step, the server 120 inputs the face image into the face key point detection model to be trained, and the model processes the input face image as follows. As shown in FIG. 6, the face key point detection model may include two spatial transformer networks (STN), denoted the first spatial transformation network and the second spatial transformation network. The first spatial transformation network can be used to perform a rectifying transformation on the face image, and the second spatial transformation network can be used to apply the corresponding inverse transformation to the predicted positions of the face key points so that they are mapped back onto the original face image. The face key point detection model also includes a coefficient prediction network, which can be implemented based on a ResNet-18 model structure: it extracts image features and finally uses a fully connected layer to predict the fitting coefficients corresponding to each principal component of the transformed face image. Accordingly, after the server 120 inputs the face image into the face key point detection model to be trained, the first spatial transformation network obtains the transformed face image from the face image and passes it to the coefficient prediction network, which obtains the prediction fitting coefficients corresponding to the principal components from the transformed face image. The model can then obtain the prediction position of each face key point on the transformed face image from the prediction fitting coefficients, the principal components and the average positions of the face key points. These prediction positions are input into the second spatial transformation network, which obtains, from the prediction positions on the transformed face image, the prediction position of each face key point on the face image that was input to the model.
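A hedged PyTorch sketch of this architecture: a localization branch (the first spatial transformation) rectifies the face, a ResNet-18 head predicts the PCA fitting coefficients, the key points are reconstructed as mean + coefficients times components, and the predicted transform maps them back onto the original image (the second spatial transformation). The localization-network design, layer sizes and names are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class KeypointPCAModel(nn.Module):
    def __init__(self, components: torch.Tensor, mean_shape: torch.Tensor):
        super().__init__()
        num_components, _ = components.shape            # (P, 2*L)
        self.register_buffer("components", components)
        self.register_buffer("mean_shape", mean_shape)  # (L, 2), in [-1, 1] coordinates
        # First spatial transformation: predict a 2x3 affine (similarity-like) transform.
        self.loc_net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6))
        # Start from the identity transform so early training is stable.
        nn.init.zeros_(self.loc_net[-1].weight)
        self.loc_net[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        # Coefficient prediction network: ResNet-18 with a P-way regression head.
        self.coeff_net = resnet18(weights=None)
        self.coeff_net.fc = nn.Linear(self.coeff_net.fc.in_features, num_components)

    def forward(self, img):
        n = img.shape[0]
        theta = self.loc_net(img).view(n, 2, 3)
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        rectified = F.grid_sample(img, grid, align_corners=False)   # transformed face image
        coeffs = self.coeff_net(rectified)                          # (N, P) predicted coefficients
        delta = coeffs @ self.components                            # (N, 2*L) predicted position change
        pts_rect = self.mean_shape + delta.view(n, -1, 2)           # points on the transformed image
        # Second spatial transformation: with affine_grid, theta maps rectified-image
        # coordinates back to original-image coordinates, so applying it to the points
        # yields the predictions on the original face image.
        ones = torch.ones(n, pts_rect.shape[1], 1, device=img.device)
        pts_orig = torch.bmm(torch.cat([pts_rect, ones], dim=-1), theta.transpose(1, 2))
        return coeffs, pts_rect, pts_orig
```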
In one embodiment, the obtaining of the predicted position of each key point of the face on the transformed face image according to the prediction fitting coefficient, each principal component, and the average position in step S203 specifically includes:
obtaining the predicted position change of each face key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component; and obtaining the predicted position of each face key point on the transformed face image according to the predicted position change and the average position.
In this embodiment, after the server 120 inputs the face image into the face key point detection model to be trained, the face image is processed in turn by the first spatial transformation network and the coefficient prediction network, and the coefficient prediction network outputs the prediction fitting coefficient corresponding to each principal component for the transformed face image. The server 120 may then obtain the predicted position change of each face key point on the transformed face image according to these prediction fitting coefficients and the principal components; the predicted position change is the position change information (also called position offset information) of each face key point on the transformed face image relative to the average position of that face key point, as predicted by the model. The server 120 may therefore obtain the prediction position of each face key point on the transformed face image from the predicted position change and the average position of each face key point, so that the positions of the face key points on the transformed face image are predicted efficiently and accurately. Specifically, with reference to FIG. 6, after obtaining the prediction fitting coefficients output by the coefficient prediction network, the model may multiply each principal component by its prediction fitting coefficient and sum the results to obtain the predicted position change of each face key point on the transformed face image, and then add this predicted position change to the average position of each face key point to obtain the prediction position of each face key point on the transformed face image.
And step S204, acquiring a first model loss representing the consistency of the predicted fitting coefficient and the fitting coefficient, and acquiring a second model loss representing the consistency of the predicted position and the marked position of each face key point.
And S205, training the face key point detection model to be trained according to the first model loss and the second model loss.
Steps S204 and S205 are the steps in which the server 120 obtains the model losses and trains the face key point detection model to be trained according to them: gradients are computed from the model losses and back-propagated to update the network parameters of the model, including those of the first spatial transformation network, the coefficient prediction network and the second spatial transformation network. The model losses used for model training may include a first model loss and a second model loss. The first model loss characterizes the consistency between the prediction fitting coefficients predicted by the model and the fitting coefficients obtained in step S202, and the second model loss characterizes the consistency between the prediction positions of the face key points predicted by the model and the manually labeled positions.
Specifically, for the first model loss L1: in one embodiment, the obtaining in step S204 of a first model loss characterizing the consistency of the prediction fitting coefficient and the fitting coefficient may include obtaining the first model loss according to the difference between the prediction fitting coefficient and the fitting coefficient. The consistency between the prediction fitting coefficient and the fitting coefficient can be quantified simply and accurately by the absolute value of their difference, so the absolute value of the difference may be used as the first model loss, i.e.

$$L_1 = \left| S_{gt} - S_{pred} \right|,$$

where $S_{gt}$ denotes the fitting coefficient obtained in step S202 and $S_{pred}$ denotes the prediction fitting coefficient predicted by the model. For the second model loss L2: in one embodiment, the obtaining in step S204 of a second model loss characterizing the consistency between the prediction positions and the labeling positions of the face key points may include obtaining the difference between the prediction position and the labeling position corresponding to each face key point, obtaining the mean of these differences, and obtaining the second model loss from that mean. In this embodiment, the absolute value of the difference between the prediction position and the labeling position of each face key point may be obtained, the absolute values summed over all face key points and divided by the total number L of face key points to obtain the mean absolute difference, and the mean absolute difference then taken as the second model loss, i.e.

$$L_2 = \frac{1}{L}\sum_{i=1}^{L} \left| g_i - p_i \right|,$$

where $g_i$ denotes the coordinates of the manually labeled position of the i-th face key point and $p_i$ denotes the coordinates of the prediction position of the i-th face key point predicted by the model.
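A minimal sketch of these two losses; tensor shapes (coefficients (N, P), points (N, L, 2)) and the batch averaging are assumptions for illustration.

```python
import torch

def model_losses(pred_coeffs, gt_coeffs, pred_points, gt_points):
    """First and second model losses as absolute-difference means."""
    loss1 = (gt_coeffs - pred_coeffs).abs().mean()   # coefficient consistency (L1)
    loss2 = (gt_points - pred_points).abs().mean()   # key point position consistency (L2)
    return loss1, loss2
```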
For model training, as an example, the server 120 may obtain the total model loss as the sum of the first model loss and the second model loss; during training, when the server 120 determines that the total model loss is less than or equal to a preset total loss threshold, the server 120 obtains the trained face key point detection model. As another example, during training, when the server 120 determines that the total model loss is less than or equal to the preset total loss threshold and that both the first model loss and the second model loss are less than or equal to their respective preset model loss thresholds, the server 120 obtains the trained face key point detection model.
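A minimal training-loop sketch reusing the pieces above; the threshold value, loop structure, and the use of the last batch's loss as a proxy for the total model loss are assumptions, not the patent's procedure.

```python
def train(model, loader, optimizer, max_total_loss=0.01, epochs=100):
    for _ in range(epochs):
        for img, gt_coeffs, gt_points in loader:
            pred_coeffs, _, pred_points = model(img)
            loss1, loss2 = model_losses(pred_coeffs, gt_coeffs, pred_points, gt_points)
            total = loss1 + loss2              # total model loss = first + second model loss
            optimizer.zero_grad()
            total.backward()
            optimizer.step()
        if total.item() <= max_total_loss:     # stop once below the preset threshold
            return model
    return model
```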
In the training method of the face key point detection model described above, the average position of each face key point is obtained according to the labeling position of each face key point on each face image in a face image data set; based on principal component analysis and the labeling positions, the principal component set of the face images in the data set is determined and the fitting coefficient corresponding to each principal component of each face image is obtained, where different principal components in the principal component set correspond to different morphological change dimensions of a face. The face image is then input into the face key point detection model to be trained: the model obtains a transformed face image through a first spatial transformation network, obtains the prediction fitting coefficient corresponding to each principal component of the transformed face image through a coefficient prediction network, obtains the prediction position of each face key point on the transformed face image according to the prediction fitting coefficients, the principal components and the average position, and obtains the prediction position of each face key point on the original face image through a second spatial transformation network; the model is trained according to a first model loss representing the consistency of the prediction fitting coefficients and a second model loss representing the consistency of the prediction positions with the labeling positions. This scheme combines the principal components obtained by principal component analysis of a face image data set and the fitting coefficient of each face image for each principal component with the labeling positions of the face key points on each face image and their average positions, and trains a face key point detection model to accurately predict the fitting coefficients of a face image, from which the positions of the face key points on the face image can be accurately predicted. The model structure is simple and efficient, so both the accuracy and the efficiency of face key point detection are taken into account, and the real-time requirements of fields such as live webcasting on face key point detection can be met while the face key points are detected accurately.
In one embodiment, as shown in fig. 7, the present application provides a live webcast face image processing method, which may be applied to the terminal 110 or the server 120 shown in fig. 1, and the method may include the following steps:
step S701, acquiring a face image to be processed in live webcasting;
step S702, each face key point on the face image to be processed is obtained through the detection of the trained face key point detection model. The face key point detection model can be obtained by training according to the training method of the face key point detection model described in any one of the embodiments.
And step S703, applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
Specifically, the method provided in this embodiment is first described as applied to the terminal 110 shown in FIG. 1. The training of the face key point detection model may be performed on the server 120: the server 120 trains the face key point detection model according to the training method described in any of the embodiments of the present application and sends the trained model to the terminal 110. During live webcasting, the terminal 110 may acquire a face image of the anchor as the face image to be processed, detect each face key point on the face image to be processed through the trained face key point detection model, apply special effects such as beautification and makeup to the face image to be processed based on these face key points, and present the face image with the special effects to the anchor and the audience.
In addition, the method provided by this embodiment may be described as applied to the server 120 shown in FIG. 1. During live webcasting, the terminal 110 may obtain a face image of the anchor and send it to the server 120; the server 120 takes the received face image as the face image to be processed, detects each face key point on it through the trained face key point detection model, and applies special effects such as beautification and makeup based on these face key points to obtain a face image with special effects, which can then be presented to the anchor and the audience.
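A sketch of this live-processing flow using the model sketched earlier; `apply_effect` is a hypothetical callback standing in for the beautification/makeup routine and is not part of the patent.

```python
import torch

def process_live_frame(frame_tensor, model, apply_effect):
    """Detect face key points on one live frame, then hand them to an effect routine."""
    model.eval()
    with torch.no_grad():
        _, _, keypoints = model(frame_tensor.unsqueeze(0))   # (1, L, 2), normalized coords
    # The effect routine would convert normalized coordinates back to pixels as needed.
    return apply_effect(frame_tensor, keypoints[0])
```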
As can be seen, the training method of the face key point detection model provided by the present application can be applied to face key point detection in live webcast face images: it improves the precision of face key point detection while meeting the real-time requirements of live scenarios on face key point detection. Special effects such as beautification and makeup that are added on top of efficient and accurate face key point detection are themselves efficient and accurate, achieving a balance between model computation and effect, so that face key point detection technology can better serve live scenarios.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the illustrated order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a related apparatus for implementing the related method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the above method, so specific limitations in one or more embodiments of the related apparatus provided below may refer to the limitations on the related method in the foregoing, and details are not described here.
In one embodiment, as shown in fig. 8, there is provided an apparatus for training a face keypoint detection model, where the apparatus 800 may include:
a position obtaining module 801, configured to obtain an average position of each face key point according to a label position of each face key point on each face image in the face image data set;
a principal component analysis module 802, configured to determine a principal component set of the face images in the face image data set based on principal component analysis and labeling position, and obtain a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different morphological change dimensions of the human face;
an image input module 803, configured to input the face image into a face key point detection model to be trained, so that the face key point detection model to be trained obtains a transformed face image according to the face image through a first spatial transformation network, obtains a prediction fitting coefficient corresponding to each principal component according to the transformed face image through a coefficient prediction network, obtains a prediction position of each face key point on the transformed face image according to the prediction fitting coefficient, the principal components, and an average position, and obtains a prediction position of each face key point on the face image according to the prediction position of each face key point on the transformed face image through a second spatial transformation network;
a loss obtaining module 804, configured to obtain a first model loss representing the consistency between the predicted fitting coefficient and the fitting coefficient, and obtain a second model loss representing the consistency between the predicted position and the labeled position of each face keypoint;
and the model training module 805 is configured to train the face key point detection model to be trained according to the first model loss and the second model loss.
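To make the data flow of the apparatus 800 concrete, the following is a minimal PyTorch sketch of the pipeline traced by the modules above. The backbone architectures, the normalized [-1, 1] coordinate convention for key point positions, the identity initialization of the first spatial transformation network, the realization of the second spatial transformation as the affine map recovered from the first, and the equal weighting of the two losses are all assumptions made for illustration; names such as FaceKeypointModel, stn and coeff_net are hypothetical and are not prescribed by the present application.

```python
# Minimal sketch of the training apparatus; architectures, shapes and loss weights
# are assumptions, not the application's own implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FaceKeypointModel(nn.Module):  # hypothetical name
    def __init__(self, num_components, mean_shape, components):
        super().__init__()
        # mean_shape: (2K,) average key point positions; components: (C, 2K) principal components.
        self.register_buffer("mean_shape", mean_shape)
        self.register_buffer("components", components)
        # First spatial transformation network: predicts a 2x3 affine warp of the input image.
        self.stn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 6),
        )
        # Initialize the warp to the identity so early training is stable (an assumption).
        nn.init.zeros_(self.stn[-1].weight)
        self.stn[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        # Coefficient prediction network: one predicted fitting coefficient per principal component.
        self.coeff_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 16, num_components),
        )

    def forward(self, images):
        b = images.size(0)
        # 1) First spatial transformation: warp the face image toward a canonical pose.
        theta = self.stn(images).view(b, 2, 3)
        grid = F.affine_grid(theta, images.size(), align_corners=False)
        warped = F.grid_sample(images, grid, align_corners=False)
        # 2) Predict the fitting coefficients from the transformed face image.
        coeffs = self.coeff_net(warped)                                    # (B, C)
        # 3) Key points on the transformed image: average position + weighted principal components.
        pts_warped = (self.mean_shape + coeffs @ self.components).view(b, -1, 2)
        # 4) Second spatial transformation: map the key points back onto the original image.
        #    With grid_sample semantics, a point p in the warped image sits at A @ p + t
        #    in the original image (normalized [-1, 1] coordinates assumed throughout).
        a, t = theta[:, :, :2], theta[:, :, 2]
        pts = torch.einsum("bij,bnj->bni", a, pts_warped) + t.unsqueeze(1)
        return coeffs, pts_warped.flatten(1), pts.flatten(1)


# One hedged training step: the ground-truth fitting coefficients and labeled positions
# come from the principal component analysis stage; the equal loss weighting is assumed.
model = FaceKeypointModel(num_components=10,
                          mean_shape=torch.zeros(2 * 68),
                          components=torch.randn(10, 2 * 68))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
images = torch.randn(4, 3, 128, 128)      # placeholder face image batch
gt_coeffs = torch.randn(4, 10)            # fitting coefficients (first-loss target)
gt_points = torch.randn(4, 2 * 68)        # labeled key point positions (second-loss target)
pred_coeffs, _, pred_points = model(images)
loss = F.mse_loss(pred_coeffs, gt_coeffs) + F.mse_loss(pred_points, gt_points)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice the two losses would be weighted according to the training configuration of the apparatus; the placeholder weights above are illustrative only.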
In an embodiment, the image input module 803 is further configured to obtain the predicted position change of each face key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and the principal components themselves, and to obtain the predicted position of each face key point on the transformed face image according to the predicted position change and the average position.
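In other words, the predicted position change is a weighted sum of the principal components, and the predicted position is that change added to the average position. A minimal NumPy sketch under assumed array shapes (C principal components, K key points); the function name is hypothetical:

```python
import numpy as np

def reconstruct_keypoints(pred_coeffs, components, mean_positions):
    """pred_coeffs: (C,) predicted fitting coefficients; components: (C, 2K)
    principal components; mean_positions: (2K,) average key point positions."""
    predicted_change = pred_coeffs @ components   # weighted sum of the principal components
    return mean_positions + predicted_change      # predicted positions on the transformed image
```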
In an embodiment, the principal component analysis module 802 is further configured to perform a similarity transformation on the labeled positions of the face key points on each face image, based on the average positions of the face key points, to obtain the transformed positions of the face key points on each face image, and to perform principal component analysis on the transformed positions of the face key points on each face image to obtain the principal component set.
In an embodiment, the principal component analysis module 802 is further configured to normalize the transformed position of each face key point on each face image with respect to the center of the face image, to obtain the normalized transformed position of each face key point on each face image, and to perform principal component analysis on the normalized transformed positions of the face key points on each face image to obtain the principal component set.
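A sketch of these two steps (similarity alignment to the average positions, then normalization about the image center) together with the subsequent principal component analysis follows. A standard Procrustes-style similarity fit is assumed because the application does not prescribe a solver, reflections are ignored for brevity, and all function names are hypothetical:

```python
import numpy as np

def align_to_mean(points, mean_points):
    """Similarity transformation (scale, rotation, translation) of one face's labeled
    key points (K, 2) onto the average positions, via orthogonal Procrustes."""
    mu_p, mu_m = points.mean(axis=0), mean_points.mean(axis=0)
    p, m = points - mu_p, mean_points - mu_m
    u, s, vt = np.linalg.svd(p.T @ m)       # cross-covariance between the two shapes
    rotation = u @ vt
    scale = s.sum() / (p ** 2).sum()
    return scale * p @ rotation + mu_m      # transformed positions of the key points

def normalize_to_center(points, image_height, image_width):
    """Normalize transformed key points (x, y) relative to the face image center."""
    center = np.array([image_width / 2.0, image_height / 2.0])
    return (points - center) / center       # roughly in [-1, 1]

def principal_component_set(normalized_shapes, num_components):
    """PCA over all faces' normalized key point shapes (N, 2K); each returned row is
    one principal component, i.e. one morphological change dimension of the face."""
    centered = normalized_shapes - normalized_shapes.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:num_components]
```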
In an embodiment, the principal component analysis module 802 is further configured to fit the principal component set using the normalized transformed position of each face key point on each face image, so as to obtain the fitting coefficient corresponding to each principal component of each face image.
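If the principal components returned by the analysis are orthonormal, as they are for a standard PCA, the least-squares fit of each face's normalized shape reduces to a projection; a short sketch under that assumption:

```python
import numpy as np

def fitting_coefficients(normalized_shapes, components):
    """normalized_shapes: (N, 2K) normalized transformed key point positions per face;
    components: (C, 2K) orthonormal principal components."""
    mean_shape = normalized_shapes.mean(axis=0)
    deviations = normalized_shapes - mean_shape   # each face's offset from the average shape
    return deviations @ components.T              # (N, C) one fitting coefficient per component
```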
In an embodiment, the loss obtaining module 804 is configured to obtain the first model loss according to a difference between the predicted fitting coefficient and the fitting coefficient.
In an embodiment, the loss obtaining module 804 is configured to obtain the difference between the predicted position and the labeled position corresponding to each face key point, obtain a difference mean value based on these per-key-point differences, and obtain the second model loss according to the difference mean value.
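A minimal sketch of the two losses: a squared difference is assumed for the first model loss and a Euclidean per-key-point difference for the second, since the description only requires difference-based losses; the function names are hypothetical:

```python
import torch

def first_model_loss(pred_coeffs, gt_coeffs):
    """Consistency of predicted and ground-truth fitting coefficients, shape (B, C)."""
    return (pred_coeffs - gt_coeffs).pow(2).mean()

def second_model_loss(pred_points, labeled_points):
    """Mean over key points of the per-key-point prediction/label difference, shape (B, K, 2)."""
    per_keypoint = (pred_points - labeled_points).norm(dim=-1)   # (B, K) Euclidean differences
    return per_keypoint.mean()                                   # the difference mean value
```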
In one embodiment, as shown in fig. 9, there is provided a live webcast face image processing apparatus, where the apparatus 900 may include:
an image obtaining module 901, configured to obtain a face image to be processed in live webcasting;
a key point detection module 902, configured to detect each face key point on the to-be-processed face image using a trained face key point detection model, where the face key point detection model is obtained by training with the above training apparatus for the face key point detection model;
and the image processing module 903 is configured to apply a special effect to the to-be-processed face image based on each face key point on the to-be-processed face image.
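Putting the three modules of the apparatus 900 together, a hedged sketch of the live-streaming path follows; apply_effect stands in for an unspecified beauty or make-up routine, the model is assumed to follow the interface of the sketch given earlier (its third output being the key points on the original frame), and the frame format and coordinate conventions are assumptions:

```python
import numpy as np
import torch

def process_live_frame(frame_rgb, model, apply_effect):
    """frame_rgb: (H, W, 3) uint8 face image from the live stream; model: a trained
    face key point detection model; apply_effect: placeholder special-effect routine."""
    image = torch.from_numpy(frame_rgb).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        _, _, points = model(image)                  # predicted key points on this frame
    return apply_effect(frame_rgb, points.view(-1, 2).numpy())
```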
The modules in the above apparatuses may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, a processor in the electronic device, or may be stored in software form in a memory of the electronic device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, an electronic device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the electronic device is used for storing data such as a face image data set. The network interface of the electronic device is used for communicating with an external device through network connection. When the computer program is executed by a processor, a training method of a face key point detection model and a face image processing method of network live broadcast are realized.
In one embodiment, an electronic device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 11. The electronic device comprises a processor, a memory, a communication interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for communicating with an external device in a wired or wireless manner, and the wireless manner can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to realize a live webcast face image processing method. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the configurations shown in fig. 10 and fig. 11 are only block diagrams of part of the configurations related to the present application and do not limit the electronic devices to which the present application is applied; a specific electronic device may include more or fewer components than shown in the drawings, combine certain components, or arrange the components differently.
In one embodiment, an electronic device is further provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
For the sake of brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction between these combinations of technical features, they should be considered to be within the scope of the present disclosure.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.
Claims (11)
1. A training method for a face key point detection model is characterized by comprising the following steps:
obtaining the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set;
determining a principal component set of the face images in the face image data set based on principal component analysis and labeling positions, and obtaining a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different morphological change dimensions of the human face;
inputting the face image into a face key point detection model to be trained, so that the face key point detection model to be trained obtains a transformed face image according to the face image through a first spatial transformation network, obtains prediction fitting coefficients corresponding to main components according to the transformed face image through a coefficient prediction network, obtains the prediction position of each face key point on the transformed face image according to the prediction fitting coefficients, the main components and the average position, and obtains the prediction position of each face key point on the face image through a second spatial transformation network according to the prediction position of each face key point on the transformed face image;
acquiring a first model loss representing the consistency of the prediction fitting coefficient and the fitting coefficient, and acquiring a second model loss representing the consistency of the prediction position and the labeling position of each face key point;
and training the face key point detection model to be trained according to the first model loss and the second model loss.
2. The method according to claim 1, wherein the obtaining the predicted position of each face key point on the transformed face image according to the prediction fitting coefficients, the principal components, and the average position comprises:
obtaining the predicted position change of each face key point on the transformed face image according to the predicted fitting coefficient corresponding to each principal component of the transformed face image and each principal component;
and obtaining the predicted position of each face key point on the transformed face image according to the predicted position change and the average position.
3. The method of claim 1, wherein determining a set of principal components of a face image in the face image dataset comprises:
performing similarity transformation on the labeling positions of the face key points on each face image based on the average positions of the face key points to obtain the transformation positions of the face key points on each face image;
and performing principal component analysis according to the transformation position of each face key point on each face image to obtain the principal component set.
4. The method according to claim 3, wherein the performing principal component analysis according to the transformed positions of the face key points on each face image to obtain the principal component set comprises:
normalizing the transformation position of each face key point on each face image relative to the center of the face image to obtain the normalized transformation position of each face key point on each face image;
and performing principal component analysis according to the normalized transformation position of each face key point on each face image to obtain the principal component set.
5. The method according to claim 4, wherein the obtaining of the fitting coefficient corresponding to each principal component of each face image comprises:
and fitting the principal component set by utilizing the normalized transformation position of each face key point on each face image to obtain a fitting coefficient corresponding to each principal component of each face image.
6. The method according to any one of claims 1 to 5,
the obtaining a first model penalty characterizing a consistency of the prediction fitting coefficients and fitting coefficients includes: obtaining a first model loss according to the difference value between the prediction fitting coefficient and the fitting coefficient;
the obtaining of the second model loss representing the consistency of the predicted position and the labeled position of each face key point includes: and obtaining the difference value between the predicted position and the marked position corresponding to each face key point, obtaining the mean value of the difference values based on the difference value between the predicted position and the marked position corresponding to each face key point, and obtaining the loss of the second model according to the mean value of the difference values.
7. A live webcast face image processing method is characterized by comprising the following steps:
acquiring a face image to be processed in live webcasting;
detecting through a trained human face key point detection model to obtain each human face key point on the human face image to be processed; the face key point detection model is obtained by training according to the method of any one of claims 1 to 6;
and applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
8. An apparatus for training a face keypoint detection model, the apparatus comprising:
the position acquisition module is used for acquiring the average position of each face key point according to the labeling position of each face key point on each face image in the face image data set;
the principal component analysis module is used for determining a principal component set of the face images in the face image data set based on principal component analysis and labeling positions and acquiring a fitting coefficient corresponding to each principal component of each face image; different principal components in the principal component set respectively correspond to different morphological change dimensions of the human face;
the image input module is used for inputting the face image into a face key point detection model to be trained so as to enable the face key point detection model to be trained to obtain a transformed face image according to the face image through a first spatial transformation network, obtain a prediction fitting coefficient corresponding to each principal component according to the transformed face image through a coefficient prediction network, obtain a prediction position of each face key point on the transformed face image according to the prediction fitting coefficient, each principal component and an average position, and obtain a prediction position of each face key point on the face image according to the prediction position of each face key point on the transformed face image through a second spatial transformation network;
the loss acquisition module is used for acquiring a first model loss representing the consistency of the predicted fitting coefficient and the fitting coefficient and acquiring a second model loss representing the consistency of the predicted position and the labeled position of each face key point;
and the model training module is used for training the face key point detection model to be trained according to the first model loss and the second model loss.
9. A live webcast face image processing apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a face image to be processed in live webcasting;
the key point detection module is used for detecting and obtaining each face key point on the face image to be processed through the trained face key point detection model; the face key point detection model is obtained by training by using the device of claim 8;
and the image processing module is used for applying a special effect to the face image to be processed based on each face key point on the face image to be processed.
10. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 7 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211148240.7A | 2022-09-20 | 2022-09-20 | Face key point detection model training method, live image processing method and device |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211148240.7A | 2022-09-20 | 2022-09-20 | Face key point detection model training method, live image processing method and device |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN115457635A (en) | 2022-12-09 |
Family

ID=84304990

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202211148240.7A (Pending) | Face key point detection model training method, live image processing method and device | 2022-09-20 | 2022-09-20 |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN115457635A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN117788720A * | 2024-02-26 | 2024-03-29 | 山东齐鲁壹点传媒有限公司 | Method for generating user face model, storage medium and terminal |
| CN117788720B * | 2024-02-26 | 2024-05-17 | 山东齐鲁壹点传媒有限公司 | Method for generating user face model, storage medium and terminal |
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination