CN113221708B - Training method and device for facial movement unit detection model - Google Patents


Info

Publication number: CN113221708B
Application number: CN202110484143.4A
Other versions: CN113221708A (Chinese)
Inventors: 支瑞聪, 胡昕
Applicant / Assignee: University of Science and Technology Beijing (USTB)
Legal status: Active (granted)


Classifications

    • G06V 40/174 — Facial expression recognition
    • G06V 40/168 — Feature extraction; Face representation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/04 — Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/084 — Learning methods; Backpropagation, e.g. using gradient descent


Abstract

The present disclosure relates to a training method and apparatus for a facial motion unit detection model, the method comprising: acquiring a training sample set and dividing the face images in the training sample set into a preset number of batches; performing a preset number of training processes according to the training sample set, each training process comprising performing preset processing on all batches of face images; the preset processing of each batch comprises: extracting a feature vector of each facial motion unit of each face image in the current batch; constructing a plurality of relation units and learning, through each relation unit, the correlation between the two corresponding facial motion units; discarding at least one relation unit according to a preset ratio; and training the facial motion unit detection model according to the correlations learned by the relation units remaining for each face image. By discarding part of the relation units, the application suppresses the complex co-adaptation relationships between AUs while learning the AU relationships, enabling the model to learn more robust features.

Description

Training method and device for facial movement unit detection model
Technical Field
The present application relates to the field of facial expression detection technologies, and in particular, to a facial motion unit detection model training method, apparatus, computer device, and storage medium.
Background
Expression is a word frequently mentioned in daily life; in interpersonal communication, people strengthen the effect of communication by controlling their own facial expressions. Facial expression is an important means of conveying human emotional information and coordinating interpersonal relationships. The research of psychologist A. Mehrabian shows that in daily human communication, information conveyed through language accounts for only 7% of the total, while information conveyed through facial expressions reaches 55%. Humans have at least 21 facial expressions: besides the usual six of happiness, surprise, sadness, anger, disgust and fear, there are 15 distinguishable compound expressions such as happily surprised (happiness + surprise) and sadly angry (sadness + anger). Facial expressions thus play an important role in conveying our mental state. As facial expressions become increasingly important in human-computer interaction, advertisement recommendation, pain detection and the like, a series of facial expression detection systems have been developed. Among these systems, the Facial Action Coding System (FACS) is widely used. In FACS, a facial motion unit (AU) is a basic facial movement; the facial muscles of all people are almost identical, and facial motion units, based on the movements of these muscles, play a fundamental role in forming the various facial expressions. A facial motion unit detection system therefore greatly facilitates the analysis of complex facial movements and expressions.
Currently, facial motion unit detection is mostly performed in a supervised manner. For example, Baltrusaitis et al. designed a facial motion unit recognition system based on the fusion of appearance and geometric features. Zhao et al. proposed a facial motion unit detection scheme combining patch and multi-label learning. Shao et al. proposed an end-to-end deep learning framework that combines facial motion unit detection and face alignment, using the aligned features for detection. Ma et al. encoded prior knowledge into an R-CNN (a convolutional-neural-network-based object detection algorithm) for facial motion unit detection. Although these supervised methods greatly improve the performance of automatic facial motion unit recognition, the limited number of datasets with facial motion unit labels restricts the generalization of such schemes. Other approaches attempt semi-supervised facial motion unit recognition with a large number of unlabeled face images or face images carrying only emotion labels. For example, Peng et al. used the prior knowledge between facial motion units and emotions to generate pseudo facial motion unit labels from face images with only emotion labels for training. While these schemes do not require facial motion unit labels, other related labels are still needed.
In addition, strong correlations exist between facial motion units, and this domain knowledge can further improve facial motion unit detection performance. For example, Niu et al. embedded the prior knowledge existing between facial motion units through a graph convolutional network for facial motion unit detection. Li et al. enhanced the feature representation of facial motion units through semantic relationship propagation between facial motion units. However, these methods do not take into account the complex co-adaptation relationships that exist between facial motion units; learning the relationships simultaneously may instead prevent the network from learning more robust features.
Based on the above analysis, current facial movement unit detection schemes suffer from the following disadvantages:
(1) Most existing facial motion unit detection methods are supervised. However, the number of face images with facial motion unit labels is limited, most of them are collected in the laboratory, and they are homogeneous in pose, illumination and the like, which makes it challenging to build a robust facial motion unit detection system. Moreover, in the real world there are large numbers of unlabeled face images in social media, the internet, videos and the like, and labeling facial motion units for these images requires experts and is time consuming;
(2) Correlations exist between facial motion units, and some schemes improve facial motion unit detection performance by exploiting prior knowledge of these correlations. However, these schemes only consider learning the relationships between facial motion units simultaneously; they do not consider the complex co-adaptation relationships that exist between facial motion units, and learning the relationships simultaneously may instead suppress the more robust features the network could learn.
Disclosure of Invention
To solve or at least partially solve the above technical problems, the present application provides a facial movement unit detection model training method, apparatus, computer device, and storage medium.
In a first aspect, the present application provides a training method for a facial movement unit detection model, including:
acquiring a training sample set and dividing the face images in the training sample set into a preset number of batches, wherein the training sample set comprises first face images and second face images, the first face images being face images with a plurality of facial motion unit labels and the second face images being unlabeled face images;
performing a preset number of training processes according to the training sample set, wherein each training process comprises performing preset processing on all batches of face images;
the preset processing of each batch comprises:
extracting a feature vector of each facial motion unit of each face image in the current batch;
constructing a plurality of relation units according to the feature vectors of the facial motion units in each face image, wherein each relation unit corresponds to one facial motion unit pair, and learning, by each relation unit, the correlation between the two facial motion units in the corresponding facial motion unit pair;
discarding at least one relation unit according to a preset ratio;
and training the facial motion unit detection model according to the correlations learned by the relation units remaining for each face image.
In a second aspect, the present application provides a facial movement unit detection model training apparatus, comprising:
a batch dividing module, configured to acquire a training sample set and divide the face images in the training sample set into a preset number of batches, wherein the training sample set comprises first face images and second face images, the first face images being face images with a plurality of facial motion unit labels and the second face images being unlabeled face images;
a batch processing module, configured to perform a preset number of training processes according to the training sample set, wherein each training process comprises performing preset processing on all batches of face images; the preset processing of each batch comprises: extracting a feature vector of each facial motion unit of each face image in the current batch; constructing a plurality of relation units according to the feature vectors of the facial motion units in each face image, wherein each relation unit corresponds to one facial motion unit pair, and learning, by each relation unit, the correlation between the two facial motion units in the corresponding facial motion unit pair; discarding at least one relation unit according to a preset ratio; and training the facial motion unit detection model according to the correlations learned by the relation units remaining for each face image.
In a third aspect, the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
According to the training method and apparatus for the facial motion unit detection model, relation units are designed to learn the AU relationships during the processing of each batch, and relation units are discarded according to a preset ratio, that is, the idea of drop-out is introduced, so that the complex co-adaptation relationships between AUs are suppressed while the AU relationships are learned, enabling the model to learn more robust features.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1a is a schematic flow chart of a training method for a facial movement unit detection model according to an embodiment of the present application;
FIG. 1b is a schematic flow chart of a preset process for each lot according to an embodiment of the present application;
FIG. 2a is a schematic flow chart of a training method for a facial movement unit detection model according to an embodiment of the present application;
FIG. 2b is a flowchart illustrating a method for calculating a predicted probability value for a facial motion unit according to an embodiment of the present application;
FIG. 3a is a diagram showing the correlation of three facial motion units according to an embodiment of the present application;
FIG. 3b is a schematic diagram of the correlation of the three facial motion units after one relation unit in FIG. 3a is discarded;
FIG. 4 is a graph comparing the performance of the model without relation units and of models with relation units of three different structures according to an embodiment of the application;
fig. 5 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In a first aspect, a training method for a facial movement unit detection model according to an embodiment of the present application, as shown in fig. 1a and fig. 1b, includes the following steps:
s100, acquiring a training sample set, and dividing face images in the training sample set into batches with preset quantity;
the training sample set comprises a first face image and a second face image, wherein the first face image is a face image with a plurality of face motion unit labels, and the second face image is a label-free face image.
For example, there are 3000 face images in the training sample set, and these face images include a small number of first face images and a large number of second face images. The 3000 face images were divided into 30 batches, each batch containing 100 face images.
The preset number of times may be set as required, for example, to 40; that is, training of the model comprises 40 training processes. Assuming each training process comprises 30 batches, the entire training comprises 1200 batch iterations. In each batch iteration, training is performed on the face images of the current batch: after training on one batch is completed, the loss value of that batch is obtained, and the parameters of the model are then adjusted according to the loss value so that the loss value becomes smaller in the next batch iteration, until all 1200 batch iterations are completed, i.e. the 40 training processes are finished.
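By way of illustration, a minimal sketch of this epoch/batch schedule, assuming a PyTorch-style setup in which a hypothetical helper preset_process_batch stands for steps S210-S240 described below and returns the batch loss:

```python
def train(model, batches, optimizer, preset_process_batch, num_epochs=40):
    """Epoch/batch schedule from the example above (40 training processes x 30 batches)."""
    for epoch in range(num_epochs):          # the preset number of training processes
        for batch in batches:                # every batch is processed once per training process
            loss = preset_process_batch(model, batch, epoch)   # steps S210-S240, returns the batch loss
            optimizer.zero_grad()
            loss.backward()                  # adjust parameters so the next batch's loss shrinks
            optimizer.step()
```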
It will be appreciated that a single face image is assumed to contain C facial motion units (AUs), which are closely connected to the muscle structures that cause facial expressions to change. For example, the first facial motion unit (AU 1) raises the inner side of the eyebrow; the second facial motion unit (AU 2) raises the outer side of the eyebrow; the third facial motion unit (AU 3) raises the whole eyebrow and opens the eyes wider; the fourth facial motion unit (AU 4) presses the eyebrows down and draws them together; and of course there are many other facial motion units.
It will also be appreciated that, for the C facial motion units of a face image, if the i-th facial motion unit appears in the image, the label corresponding to the i-th facial motion unit is 1, and if the j-th facial motion unit does not appear, the label corresponding to the j-th facial motion unit is 0. A face image thus corresponds to C labels, which indicate whether the corresponding facial motion units appear.
S200, performing a preset number of training processes according to the training sample set, wherein each training process comprises performing preset processing on all batches of face images;
The preset processing process of each batch comprises the following steps:
s210, extracting feature vectors of face motion units of each face image in the current batch;
In particular implementations, a backbone model may be employed to extract the feature vectors of the individual facial motion units of each face image. Because convolutional neural networks such as ResNet have proven to have powerful feature generation capabilities, ResNet-34 can be chosen as the feature extraction model. In the application, the sigmoid function of the last layer of the conventional ResNet-34 is removed, and the final 1000-unit fully connected layer of ResNet-34 is replaced with a C-unit fully connected layer to match the number of facial motion units, so that inputting a face image into ResNet-34 yields the feature vectors of the C facial motion units rather than the prediction probabilities of the C facial motion units.
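By way of illustration, a minimal sketch of such a modified backbone, assuming a PyTorch/torchvision implementation of ResNet-34 (the value C = 12 is only illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

C = 12                                               # number of AUs; 12 is only an illustrative value

backbone = resnet34(weights=None)                    # plain ResNet-34 (pretrained weights optional)
backbone.fc = nn.Linear(backbone.fc.in_features, C)  # C-unit fully connected layer instead of 1000, no sigmoid

x = torch.randn(8, 3, 224, 224)                      # a batch of cropped 224x224 face images
features = backbone(x)                               # shape (8, C): one feature value per AU per image
```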
In a specific implementation, before the feature vectors of the facial motion units are extracted, the face image may be preprocessed. For example, if an image contains both face information and background, the face region is first extracted using an automatic face detection technique. The most commonly used technique is MTCNN (Multi-task Cascaded Convolutional Networks), which can extract the face region from an image frame. The application can therefore extract the face region using MTCNN and crop it to a size of 224x224.
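By way of illustration, a minimal preprocessing sketch, assuming the facenet-pytorch implementation of MTCNN (the specific library and the file name are assumptions, not specified by the application):

```python
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224)        # detect the face region and crop it to 224x224
img = Image.open("face.jpg")         # hypothetical input image containing face and background
face_224 = mtcnn(img)                # tensor of shape (3, 224, 224), or None if no face is found
```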
Of course, the extracted face region may be further preprocessed depending on the situation. For example, if the image is disturbed by noise such as white noise or Gaussian noise, the noise can be removed using methods such as wavelet analysis or Kalman filtering. As another example, if the image is affected by illumination, the influence of uneven illumination can be reduced using methods such as light compensation, edge extraction, quotient images, or gray-scale normalization.
S220, constructing a plurality of relation units according to the feature vectors of the face motion units in each face image, wherein each relation unit corresponds to one face motion unit pair; and learning, by each of the relationship units, a correlation between two face motion units in the corresponding pair of face motion units;
It will be appreciated that, owing to facial anatomy, strong correlations exist between facial motion units. The correlations can be divided into two kinds, positive and negative. A positive correlation means that certain facial motion units are likely to appear at the same time; a negative correlation means that certain facial motion units do not, or rarely, appear at the same time. In order to fully exploit the correlations between facial motion units, let P and N denote the sets of positively and negatively correlated facial motion unit pairs, respectively, and let S = P ∪ N contain M AU pairs. For the M AU pairs in S, the application constructs M relation units to learn the relationships between AUs, one relation unit for each AU pair.
In order for the relation units to learn the correlation between two facial motion units, each relation unit may specifically comprise 4 fully connected layers for fitting the correlations between facial motion units. For example, let $F = [X_1, X_2, \ldots, X_C]$ denote the feature vectors of the C AUs obtained by passing a face image through the backbone model; the original AU pair $X_{i,j}$ formed by the i-th AU and the j-th AU can be expressed as:

$$X_{i,j} = [X_i,\ X_j]^{\mathsf T}$$

The new AU pair $\tilde{X}_{i,j}$, obtained after the relation unit learns the relationship between the corresponding AU pair, can be expressed as:

$$\tilde{X}_{i,j} = g\big(W_{4,i,j}\, g\big(W_{3,i,j}\, g\big(W_{2,i,j}\, g\big(W_{1,i,j}\, X_{i,j}\big)\big)\big)\big)$$

where $g(\cdot)$ denotes the ReLU activation function added after each fully connected layer, and $W_{k,i,j} = [w_{k,i,j}\ \ b_{k,i,j}]$ are the parameters of the k-th fully connected layer, set so that the relation unit learns the relationship between the i-th AU and the j-th AU.

In practice we find that removing the nonlinearity in the relation unit improves performance, which means that complex nonlinearity is not necessarily beneficial; that is, the ReLU activation function after each fully connected layer is deleted, resulting in the first formula:

$$\tilde{X}_{i,j} = W_{4,i,j}\, W_{3,i,j}\, W_{2,i,j}\, W_{1,i,j}\, X_{i,j}$$

where $\tilde{X}_{i,j}$ is the correlation between the i-th and j-th facial motion units, $X_i$ is the feature vector of the i-th facial motion unit, $X_j$ is the feature vector of the j-th facial motion unit, and $W_{k,i,j}$ are the parameters of the k-th fully connected layer in the relation unit corresponding to the facial motion unit pair formed by the i-th and j-th facial motion units, k being a positive integer in the range [1, 4].
After removing the nonlinearity, an attempt was also made to replace the relation unit with a single linear layer (2 x 2), but it contains too few parameters to fit the relationships between AUs well. The application therefore ultimately uses four linear fully connected layers, whose larger number of parameters better fits the relationships between AUs.
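By way of illustration, a minimal sketch of one such linear four-layer relation unit, assuming a PyTorch implementation with scalar per-AU features (as implied by the 2 x 2 layer size); the AU pairs listed are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RelationUnit(nn.Module):
    """Four linear (2x2) fully connected layers dedicated to one AU pair (i, j), no activation in between."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Linear(2, 2) for _ in range(4)])   # W_{1,i,j} .. W_{4,i,j}

    def forward(self, x_i, x_j):
        pair = torch.stack([x_i, x_j], dim=-1)   # original AU pair X_{i,j}
        return self.layers(pair)                 # learned pair, i.e. the correlation \tilde{X}_{i,j}

# one relation unit per AU pair (i, j) in S = P ∪ N; the pairs below are illustrative only
relation_units = nn.ModuleDict({f"{i}_{j}": RelationUnit() for (i, j) in [(0, 1), (1, 5), (1, 6)]})
```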
Referring to FIG. 4, the impact of different relation unit structures on the final performance is discussed; a nonlinear four-layer network, a linear four-layer network, and a single 2 x 2 linear layer are evaluated as AU relation unit structures. A nonlinear four-layer network is first constructed as the relation unit. The performance of the model is found to improve after the nonlinearity is removed, indicating that complex nonlinearity is not necessarily beneficial. After replacing the four linear layers with a single linear layer, the performance degrades, showing that a single 2 x 2 linear layer contains too few parameters to fit the relationships between AUs well. The application therefore finally selects the linear four-layer network as the structure of the relation unit.
Also referring to FIG. 4, to evaluate the necessity of the relation units, we deleted them and used the output of the backbone model directly for the calculation of the loss function, i.e. the step of learning the correlations between AUs with relation units was omitted and drop-out was applied directly in the backbone model instead. It was found that deleting the relation units leads to a significant degradation of network performance, which demonstrates the necessity of the relation units in the present application.
In implementations, the relationship unit may learn, through relationship regularization, a correlation between two facial motion units in a corresponding facial motion unit pair.
S230, discarding at least one relation unit according to a preset proportion;
co-adaptation is often referred to in biology. Co-adaptation is the adaptation of two or more species into a pair or group according to the phenotypic trait of a gene or process. The features of such interactions are beneficial only when taken together, but sometimes result in increased interdependence. Hinton et al reduce complex co-adaptation relationships between neurons by dropping (i.e., drop-out) such that updating of weights is no longer dependent on the co-action of hidden nodes with fixed relationships, preventing the situation where certain features are only effective under other specific features.
Inspired by drop-out, the application applies the idea of drop-out to relation learning. Co-adaptation relationships also exist between AUs, and existing schemes do not take these complex co-adaptation relationships into account. As shown in FIG. 3a, suppose A, B and C represent 3 AUs, and according to prior knowledge (A, B) and (B, C) are positively correlated while (A, C) are negatively correlated. If these relationships are learned simultaneously, the occurrence of A increases the probability of B, and the occurrence of B increases the probability of C, which amounts to the occurrence of A increasing the probability of C; but A and C are negatively correlated, so the occurrence of A should suppress the occurrence of C. Learning the relationships simultaneously therefore creates a relationship conflict. If some relation units, such as (B, C), are deleted randomly, the network no longer encounters this conflict while learning the other relationships. The application therefore treats relation units analogously to neurons and randomly deletes them at a certain ratio during training, thereby reducing relationship conflicts. As shown in FIG. 3b, the red box represents the discarded relation unit.
Wherein the preset ratio may be determined based on an empirical value, for example, 0.25.
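By way of illustration, a minimal sketch of discarding relation units at the preset ratio, assuming simple uniform sampling without replacement (the sampling scheme itself is an assumption):

```python
import random

def sample_active_pairs(au_pairs, drop_ratio=0.25):
    """Return the AU pairs whose relation units are kept for the current batch."""
    n_drop = int(len(au_pairs) * drop_ratio)
    dropped = set(random.sample(au_pairs, n_drop))      # relation units discarded this batch
    return [pair for pair in au_pairs if pair not in dropped]
```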
S240, training a face motion unit detection model according to the correlation learned by each relation unit remained in each face image.
According to the training method for the facial motion unit detection model, relation units are designed to learn the AU relationships during the processing of each batch, and relation units are discarded according to a preset ratio, that is, the idea of drop-out is introduced, so that the complex co-adaptation relationships among AUs are suppressed while the AU relationships are learned, enabling the model to learn more robust features.
In specific implementation, S240 may include the following steps:
s241, determining respective prediction probability values of each face motion unit in each face image according to the learned relativity of each relation unit remained in each face image;
in specific implementation, S241 may specifically include:
s241a, determining an updated feature vector of each face motion unit in each face motion unit pair according to the learned relativity of each relation unit remained in each face image;
in this step, a second formula may be used to calculate an updated feature vector for each facial motion unit in each facial motion unit pair, the second formula comprising:
in the method, in the process of the application,for the correlation between the ith and jth facial movement units, ++ >The feature vector updated for the ith facial motion unit,/->The feature vector updated for the j-th facial motion unit.
That is, the new AU pair $\tilde{X}_{i,j}$ previously determined by the first formula is decomposed into an i-th new AU value and a j-th new AU value, namely the updated feature vector of the i-th facial motion unit and the updated feature vector of the j-th facial motion unit.
S241b, calculating an average feature vector updated by each face motion unit in each face image according to the feature vectors updated by each face motion unit in each pair of face motion units in each face image;
it can be understood that, for each relationship unit in a face image, after the second formula decomposition, multiple repeated AUs are obtained, for example, relationship units (AU 1, AU 2), (AU 2, AU 6), (AU 2, AU 7) are decomposed to obtain multiple AU2 updated feature vectors, and the updated feature vectors of the several AUs 2 are averaged to obtain an average feature vector of the AU 2.
S241c, calculating a prediction probability value of each face motion unit in each face image according to the average feature vector updated by each face motion unit in each face image.
In particular implementations, this step may calculate the predicted probability value of each facial motion unit using a third formula, the third formula comprising:

$$\hat{p}_j = \sigma\big(\bar{X}_j\big)$$

where $\hat{p}_j$ is the predicted probability value of the j-th facial motion unit in each face image, $\bar{X}_j$ is the updated average feature vector of the j-th facial motion unit in each face image, and $\sigma(\cdot)$ is the sigmoid function. Through the third formula, the predicted probability value of each facial motion unit in a face image is obtained; the larger the probability value, the more likely the corresponding facial motion unit appears.
It will be appreciated that the predicted probability value of each AU can be calculated in the above manner for any face image (first face image or second face image), and the predicted probability values of the third face image mentioned below can likewise be calculated using the above procedure.
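By way of illustration, a minimal sketch of steps S241a–S241c, assuming each remaining relation unit outputs a tensor of shape (B, 2) for its AU pair (the tensor layout and helper names are assumptions):

```python
import torch
from collections import defaultdict

def au_probabilities(pair_outputs):
    """pair_outputs: dict mapping an AU pair (i, j) to its relation-unit output of shape (B, 2)."""
    per_au = defaultdict(list)
    for (i, j), out in pair_outputs.items():
        per_au[i].append(out[:, 0])   # updated feature of AU i contributed by this pair
        per_au[j].append(out[:, 1])   # updated feature of AU j contributed by this pair
    probs = {}
    for j, feats in per_au.items():
        avg = torch.stack(feats, dim=0).mean(dim=0)   # average updated feature of AU j
        probs[j] = torch.sigmoid(avg)                 # predicted probability of AU j (third formula)
    return probs  # AUs covered by no remaining relation unit are simply absent here
```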
S242, generating a plurality of corresponding pseudo labels for each face image according to the respective prediction probability values of the face motion units in each face image;
it will be appreciated that the pseudo tags are set for both the first face image and the second face image, i.e. the first face image is also regarded as a non-tagged face image, and a corresponding plurality of pseudo tags are generated. For the subsequent third face image, no pseudo tag needs to be generated.
In a specific implementation, a fourth formula may be used to generate the j-th pseudo tag of each face image, the fourth formula comprising:

$$pl_j = \begin{cases} 1, & \hat{p}_j \ge 0.5 + \tau \\ 0, & \hat{p}_j \le 0.5 - \tau \\ \text{null}, & \text{otherwise} \end{cases}$$

where τ is a preset threshold and $\hat{p}_j$ is the predicted probability value of the j-th facial motion unit in each face image; $pl_j = 1$ means that pseudo tag 1 is generated for the j-th facial motion unit of the face image, $pl_j = 0$ means that pseudo tag 0 is generated for the j-th facial motion unit, and $pl_j = \text{null}$ means that no pseudo tag is generated for the j-th facial motion unit of the face image.

For example, if τ is set to 0.2, a pseudo tag 1 is generated for the j-th facial motion unit only when $\hat{p}_j$ is above 0.7, and a pseudo tag 0 only when $\hat{p}_j$ is below 0.3; if $\hat{p}_j$ lies between 0.3 and 0.7, no pseudo tag is generated for the j-th facial motion unit. If a face image has C facial motion units, at most C pseudo tags are generated for it.
In practice, the pseudo labels generated for each face image in each batch may be stored, so that when the current training process is finished, the pseudo labels generated for every face image (both first and second face images) in the training sample set are available and are used for calculating the loss value in the next training process. For example, an EPM ∈ R^{N×C} (Epoch Pseudo-label Map) may be constructed and used to store the pseudo tags.
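By way of illustration, a minimal sketch of pseudo-tag generation and EPM storage, assuming the thresholds 0.5 ± τ described above and a simple dictionary in place of the N×C map (both assumptions):

```python
def make_pseudo_labels(probs, tau=0.2):
    """probs: the C predicted probabilities of one face image."""
    labels = []
    for p in probs:
        if p >= 0.5 + tau:
            labels.append(1)        # pseudo tag 1
        elif p <= 0.5 - tau:
            labels.append(0)        # pseudo tag 0
        else:
            labels.append(None)     # no pseudo tag for this AU
    return labels

epm = {}                                           # image index -> C pseudo tags (stands in for the N x C EPM)
epm[42] = make_pseudo_labels([0.85, 0.10, 0.55])   # illustrative values only
```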
S243, calculating a loss value of the current batch according to the predicted probability value of each face motion unit in each face image and the pseudo tag generated in the last training process;
in specific implementation, S243 may specifically include the following steps:
s243a, calculating a relation regularization loss value of each face image, and calculating a relation regularization loss value of the current batch according to the relation regularization loss value of each face image;
it will be appreciated that the calculation of the relationship regularization loss value is for a first face image and a second face image.
It can be understood that the application employs M relation units for learning the relationships between the M AU pairs, mapping the original feature vectors to feature vectors containing the relationships. The M relation units are divided into a positively correlated group and a negatively correlated group according to their outputs. The purpose of the relation regularization is as follows: for the AU pair corresponding to each relation unit in the positively correlated group, the regularization ensures that the two positively correlated AUs are as close as possible in the prediction space; for the AU pair corresponding to each relation unit in the negatively correlated group, the regularization ensures that the two negatively correlated AUs are as far apart as possible in the prediction space.
In order to implement the above relation regularization, a fifth formula may be used to calculate the relation regularization loss value, the fifth formula comprising:

$$L_{RR} = L_{RRP} + L_{RRN}$$

$$L_{RRP} = \frac{1}{B}\sum_{b=1}^{B}\ \sum_{(i,j)\in \mathcal{S}_P}\big|\alpha^{(b)}_i - \alpha^{(b)}_j\big|, \qquad L_{RRN} = \frac{1}{B}\sum_{b=1}^{B}\ \sum_{(i,j)\in \mathcal{S}_N}\max\!\big(0,\ \mu - \big|\alpha^{(b)}_i - \alpha^{(b)}_j\big|\big)$$

where $L_{RR}$ is the relation regularization loss value, $\mathcal{S}_P$ is the set of positively correlated relation units, $\mathcal{S}_N$ is the set of negatively correlated relation units, μ is a preset threshold, B is the number of face images in the current batch, and $\alpha^{(b)}_i = \sigma\big(\bar{X}^{(b)}_i\big)$ and $\alpha^{(b)}_j = \sigma\big(\bar{X}^{(b)}_j\big)$ are the prediction probabilities corresponding to the updated average feature vectors of the i-th and j-th facial motion units of the b-th face image in the batch. That is, the relation regularization loss consists of a positively correlated regularization loss $L_{RRP}$ and a negatively correlated regularization loss $L_{RRN}$.

For the positively correlated regularization loss $L_{RRP}$ to be as small as possible, $|\alpha_i - \alpha_j|$ should be as small as possible, which ensures that two positively correlated AUs are as close as possible in the prediction space; for the negatively correlated regularization loss $L_{RRN}$ to be as small as possible, $|\alpha_i - \alpha_j|$ should be as large as possible, which ensures that two negatively correlated AUs are as far apart as possible in the prediction space. The relation regularization loss thus serves as part of the batch loss value, and back-propagation adjusts the corresponding parameters to achieve the purpose of the relation regularization described above.

The relation regularization loss contribution of each face image can be calculated through the formula, and aggregating these contributions over all face images in the batch gives the relation regularization loss value of the batch.
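By way of illustration, a minimal sketch of this relation regularization loss for one face image, assuming the margin form reconstructed above (an assumption):

```python
import torch

def relation_regularization(alpha, pos_pairs, neg_pairs, mu=0.5):
    """alpha: tensor of shape (C,), the AU prediction probabilities of one face image."""
    # positively correlated AUs: pull their predictions together
    loss_p = sum(torch.abs(alpha[i] - alpha[j]) for i, j in pos_pairs)
    # negatively correlated AUs: push their predictions apart, up to the margin mu
    loss_n = sum(torch.clamp(mu - torch.abs(alpha[i] - alpha[j]), min=0.0) for i, j in neg_pairs)
    return loss_p + loss_n   # averaged over the B images of the batch to give L_RR
```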
S243b, calculating a supervision loss value of each first face image according to the respective prediction probability value of each face motion unit in each first face image, and calculating the supervision loss value of the current batch according to the supervision loss value of each first face image in the current batch;
it will be appreciated that the supervised penalty value is for the first face image.
In particular implementations, the supervised loss value may be calculated using a sixth formula, the sixth formula comprising:

$$L_{au} = -\frac{1}{C}\sum_{j=1}^{C} w_j\Big[p_j \log \hat{p}_j + \big(1-p_j\big)\log\big(1-\hat{p}_j\big)\Big]$$

where $L_{au}$ is the supervised loss value, C is the number of facial motion units in each face image, $w_j$ is a preset balance parameter, $p_j$ is the label of the j-th facial motion unit in each first face image, and $\hat{p}_j$ is the predicted probability value of the j-th facial motion unit in each first face image.
It can be understood that the supervision loss value of each first face image can be obtained through the sixth formula, and then the supervision loss values of all the first face images in the batch are summed to obtain the supervision loss value of the batch.
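By way of illustration, a minimal sketch of the weighted binary cross-entropy of the sixth formula for one first face image, assuming PyTorch tensors and illustrative balance weights:

```python
import torch

def supervised_loss(pred, target, weights, eps=1e-7):
    """pred, target, weights: tensors of shape (C,) for one labeled (first) face image."""
    pred = pred.clamp(eps, 1 - eps)              # avoid log(0)
    bce = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred))
    return (weights * bce).mean()                # 1/C sum of the weighted per-AU terms
```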
S243c, enhancing each face image in the current batch to obtain a corresponding third face image; calculating the comprehensive unsupervised loss value of each face image and the corresponding third face image in the current batch according to the pseudo tag generated for each face image in the current batch and the respective prediction probability value of each face motion unit in the corresponding third face image in the last training process; according to the integrated unsupervised loss value of each face image and the corresponding third face image in the current batch, calculating the unsupervised loss value of the current batch;
for example, there are 30 face images in the current batch, including 10 first face images and 20 second face images, and then each of the 30 face images is enhanced to obtain a corresponding third face image.
It is understood that the face images in the current batch include a first face image and a second face image, and the third face image is a face image generated based on the first face image or the second face image in the current batch.
In particular implementations, the unsupervised loss value may be calculated using a seventh formula, the seventh formula comprising:

$$L_u = -\frac{1}{C}\sum_{j=1}^{C}\Big[pl_j^{\,t-1}\log \hat{p}_j^{\,t} + \big(1-pl_j^{\,t-1}\big)\log\big(1-\hat{p}_j^{\,t}\big)\Big]$$

where $L_u$ is the unsupervised loss value, t is the index of the current training process, $pl_j^{\,t-1}$ is the pseudo tag generated for the j-th facial motion unit of each face image in the current batch during the (t-1)-th training process, and $\hat{p}_j^{\,t}$ is the predicted probability value of the j-th facial motion unit in the corresponding third face image during the current training process.

It can be understood that the above formula combines, for each face image of the current batch, the pseudo tags generated for that image in the previous training process with the predicted probability values of the facial motion units of the third face image obtained by enhancing it, so each unsupervised loss value effectively joins the two images. Summing the unsupervised loss values generated from all face images of the current batch and their corresponding third face images gives the unsupervised loss value of the current batch.
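By way of illustration, a minimal sketch of the unsupervised loss for one face image, assuming that AUs without a pseudo tag are skipped (an assumption consistent with the null case of the fourth formula):

```python
import torch

def unsupervised_loss(pred_aug, pseudo_labels, eps=1e-7):
    """pred_aug: (C,) predictions for the enhanced (third) image; pseudo_labels: list of 0 / 1 / None."""
    terms = []
    for p, pl in zip(pred_aug, pseudo_labels):
        if pl is None:
            continue                              # this AU received no pseudo tag last epoch
        p = p.clamp(eps, 1 - eps)
        terms.append(-(pl * torch.log(p) + (1 - pl) * torch.log(1 - p)))
    return torch.stack(terms).mean() if terms else pred_aug.sum() * 0.0
```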
It will be appreciated that what is actually applied here is consistency regularization, which makes use of the second face images by relying on the following assumption: the model should be flat in the vicinity of the input data, so that the output of the model remains substantially unchanged even when the input data changes weakly.
Therefore, the application realizes semi-supervised AU detection based on consistency regularization and pseudo labels. Because of the difficulty in building high quality AU databases, most existing AU databases are collected in the laboratory, which presents challenges in building robust AU detection systems. The application improves the generalization capability of the model by combining a large number of untagged face images in reality with the existing AU data set.
S243d, calculating the loss value of the current batch according to the relation regularized loss value, the supervision loss value and the unsupervised loss value of the current batch.
Here the loss value of the current batch consists of three parts: the relation regularization loss value, the supervised loss value and the unsupervised loss value. Based on the relation regularization loss value, the relationship between the two AUs corresponding to each relation unit can be regularized; the unsupervised loss value is calculated by combining a large number of unlabeled real-world face images with the existing AU dataset, which improves the generalization capability of the model.
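By way of illustration, a minimal sketch of combining the three parts, assuming equal weighting coefficients (the application does not specify weights):

```python
def batch_loss(l_au, l_u, l_rr, lambda_u=1.0, lambda_rr=1.0):
    # The application names the three components but not their weights;
    # the lambda coefficients here are assumptions for illustration.
    return l_au + lambda_u * l_u + lambda_rr * l_rr
```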
S244, judging whether the current batch is the last batch in the training process:
if the current batch is the last batch in the training process, judging whether the current training times reach the preset times or not: if yes, training of the facial movement unit detection model is completed; otherwise, adjusting the facial movement unit detection model according to the loss value of the current batch, and performing the next training process;
if the current batch is not the last batch in the training process, model training is carried out according to the face image of the next batch.
Experimental analysis:
the detection model trained by the method provided by the application is compared with three other semi-supervised learning systems (Pseudo-Labeling, mean-teacher and MixMatch). The same backbone model (i.e., resNet-34) was also chosen for fair comparison during the experiment versus Pseudo-Labeling, mean-teacher and MixMatch. Meanwhile, four most advanced supervised learning AU detection methods (ROI, JAA-Net, LP-Net and AUR-CNN) are selected for comparison. Through comparison of experimental data and with reference to the following table 1, it can be found that the model trained by the application can significantly improve the identification accuracy of AU and is superior to other semi-supervised methods, namely Pseudo-Labeling, mean-teacher and MixMatch. When compared with ROI, JAA-Net, LP-Net and AUR-CNN, although the application uses unlabeled face images, the ResNet-34 network used does not need special design and additional information, and can realize better or equivalent average performance. In contrast, ROI, JAA-Net, LP-Net and AU R-CNN also use other information (e.g., facial marker points) that helps identify the AU of the small region. By contrast, the method provided by the application is overall superior to all the latest semi-supervised and supervised methods. The application realizes the best performance on most AUs, and the F1 average score of all AUs is very high, and the results show that the model trained by the method provided by the application has good generalization capability.
TABLE 1 comparison of different detection models
For example, as shown in FIG. 2a, each first face image and each second face image in a batch is randomly enhanced to obtain a corresponding third face image, and the first, second and third face images are input into the backbone model to obtain the feature vectors of all AUs of each face image. A relation unit is then built for each AU pair, the correlations between AUs are learned by the relation units, and part of the relation units are then discarded. The prediction probability of each AU in each face image is calculated according to the correlations learned by the remaining relation units. Pseudo tags are then generated for the first and second face images according to the prediction probabilities of all their AUs and stored in the EPM. The supervised loss value is calculated based on the prediction probabilities of the AUs in the first face images; the unsupervised loss value is calculated based on the pseudo tags generated for the first and second face images in the previous training process and the prediction probabilities of the AUs in the third face images. The relation regularization loss of each face image is then calculated, thereby obtaining the loss value of the current batch.
As shown in FIG. 2b, every two AUs form an AU pair corresponding to one relation unit, and each relation unit contains 4 fully connected layers. The i-th AU and the j-th AU form the AU pair $X_{i,j}$; after the relation unit learns on $X_{i,j}$, the relationship between the two AUs is $\tilde{X}_{i,j}$. According to $\tilde{X}_{i,j}$, the updated feature vector $\tilde{X}^{\,i}_{i,j}$ of the i-th AU and the updated feature vector $\tilde{X}^{\,j}_{i,j}$ of the j-th AU are obtained, and the prediction probability of each AU is then obtained by the sigmoid function. The relation unit corresponding to the solid-line part is a discarded relation unit.
The application therefore provides a semi-supervised model training method that is trained with a small number of expert-annotated face images carrying AU labels and a large number of unlabeled face images, so that a model with strong generalization capability can be trained by combining the large number of unlabeled face images existing in the real world (social media, the internet, videos and the like) with a small number of AU-labeled face images. The application can guide supervised learning by extracting learning signals from unlabeled face images through consistency regularization and pseudo tags. Meanwhile, the application designs relation units and guides the learning of AU relationships through relation regularization. Inspired by drop-out, the application introduces the idea of drop-out so that the complex co-adaptation relationships between AUs are suppressed while the AU relationships are learned, enabling the model to learn more robust features. Experimental comparison shows that the method outperforms existing semi-supervised and supervised methods on two widely used public AU detection datasets (BP4D and DISFA).
In a second aspect, an embodiment of the present application provides a training device for a facial movement unit detection model, including:
a batch dividing module, configured to acquire a training sample set and divide the face images in the training sample set into a preset number of batches, wherein the training sample set comprises first face images and second face images, the first face images being face images with a plurality of facial motion unit labels and the second face images being unlabeled face images;
a batch processing module, configured to perform a preset number of training processes according to the training sample set, wherein each training process comprises performing preset processing on all batches of face images; the preset processing of each batch comprises: extracting a feature vector of each facial motion unit of each face image in the current batch; constructing a plurality of relation units according to the feature vectors of the facial motion units in each face image, wherein each relation unit corresponds to one facial motion unit pair, and learning, by each relation unit, the correlation between the two facial motion units in the corresponding facial motion unit pair; discarding at least one relation unit according to a preset ratio; and training the facial motion unit detection model according to the correlations learned by the relation units remaining for each face image.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method provided in the first aspect when executing the computer program.
FIG. 5 illustrates an internal block diagram of a computer device in one embodiment. As shown in fig. 5, the computer device includes a processor, a memory, a network interface, an input device, a display screen, and the like, which are connected through a system bus. The memory includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement a facial motion unit detection model training method. The internal memory may also have stored therein a computer program which, when executed by the processor, causes the processor to perform a facial movement unit detection model training method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method provided in the first aspect.
It may be appreciated that, in the explanation, examples, beneficial effects, etc. of the content of the apparatus, the computer device, and the computer readable storage medium provided in the embodiments of the present application, reference may be made to corresponding parts in the first aspect, and details are not repeated here.
It is to be appreciated that any reference to memory, storage, database, or other medium used in the various embodiments provided herein can include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A method for training a facial movement unit detection model, comprising:
the method comprises the steps of obtaining a training sample set, dividing face images in the training sample set into batches with preset quantity, wherein the training sample set comprises first face images and second face images, the first face images are face images with a plurality of face motion unit labels, and the second face images are face images without labels;
carrying out a training process of preset times according to the training sample set, wherein each training process comprises the step of carrying out preset processing on all batches of face images;
the preset processing process of each batch comprises the following steps:
extracting feature vectors of each face motion unit of each face image in the current batch;
constructing a plurality of relation units according to the feature vectors of the face movement units in each face image, wherein each relation unit corresponds to one face movement unit pair; and learning, by each of the relationship units, a correlation between two face motion units in the corresponding pair of face motion units;
discarding at least one relation unit according to a preset proportion;
training the facial motion unit detection model according to the correlations learned by the relation units remaining for each face image;
wherein training the facial motion unit detection model according to the correlations learned by the relation units remaining for each face image comprises:
determining the prediction probability value of each facial motion unit in each face image according to the correlations learned by the relation units remaining for each face image;
generating a plurality of corresponding pseudo labels for each face image according to the respective prediction probability values of the face motion units in each face image;
calculating a loss value of the current batch according to the predicted probability value of each face motion unit in each face image and the pseudo tag generated in the last training process;
judging whether the current batch is the last batch in the training process;
if yes, judging whether the current training times reach the preset times, and if yes, finishing the training of the facial movement unit detection model; otherwise, adjusting the facial movement unit detection model according to the loss value of the current batch, and performing the next training process;
otherwise, performing model training according to the face images of the next batch;
wherein determining the respective prediction probability value of each face motion unit in each face image according to the correlation learned by each relation unit remaining in each face image comprises:
determining an updated feature vector of each facial motion unit in each facial motion unit pair according to the correlations learned by the remaining relation units in each face image;
calculating the updated average feature vector of each face motion unit in each face image according to the updated feature vectors of that face motion unit in each face image;
calculating a prediction probability value of each face motion unit in each face image according to the average feature vector updated by each face motion unit in each face image;
wherein the updated feature vector of each facial motion unit in each facial motion unit pair is calculated using a second formula, in which the correlation between the i-th and the j-th facial motion unit is used to obtain the updated feature vector of the i-th facial motion unit and the updated feature vector of the j-th facial motion unit; and/or,
calculating the prediction probability value of each facial motion unit using a third formula, in which the prediction probability value of the j-th facial motion unit in each face image is computed from the updated average feature vector of each facial motion unit in that face image by means of a sigmoid function σ().
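For readability, the prediction step of claim 1 can be illustrated with a minimal sketch: each facial motion unit's feature vector is updated within every surviving relation pair, the updates are averaged per unit, and a shared head with a sigmoid produces the prediction probabilities. The tensor shapes, the additive update rule and the shared linear head are illustrative assumptions, not the patent's second and third formulas.

```python
import torch

def predict_au_probs(features, correlations, keep_mask, head):
    """Illustrative sketch only. features: [N, C, D] per-unit feature vectors for N images,
    correlations: [N, C, C] learned pairwise correlations, keep_mask: [C, C] booleans marking
    the relation units kept after dropping, head: torch.nn.Linear(D, 1) shared prediction head
    (the shared linear head is an assumption)."""
    N, C, D = features.shape
    updated_sum = torch.zeros_like(features)                 # accumulates updated vectors per unit
    pair_count = torch.zeros(C, device=features.device)
    for i in range(C):
        for j in range(C):
            if i == j or not keep_mask[i][j]:
                continue
            v = correlations[:, i, j].unsqueeze(-1)          # [N, 1] correlation of pair (i, j)
            # assumed update: mix a unit's own feature with its partner's, weighted by the correlation
            updated_sum[:, i] += features[:, i] + v * features[:, j]
            pair_count[i] += 1
    avg = updated_sum / pair_count.clamp(min=1).view(1, C, 1)  # updated average feature per unit
    return torch.sigmoid(head(avg)).squeeze(-1)                # [N, C] prediction probabilities
```

Here keep_mask would come from the discarding step of the preceding limitation; a unit left with no surviving pair falls back to a zero average in this sketch, which is purely a convenience of the illustration.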
2. The method of claim 1, wherein each relation unit comprises 4 fully connected layers, and the relation unit learns the correlation between the i-th facial motion unit and the j-th facial motion unit using a first formula, in which the correlation between the i-th and the j-th facial motion unit is determined from X_i, the feature vector of the i-th facial motion unit, X_j, the feature vector of the j-th facial motion unit, and the parameters of the k-th fully connected layer in the relation unit corresponding to the facial motion unit pair formed by the i-th and the j-th facial motion unit, k being a positive integer in the range [1,4].
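As an illustration of claim 2, a relation unit could be implemented as four fully connected layers over the concatenated features of a facial motion unit pair. The layer widths, the ReLU activations, the concatenation and the tanh output range are assumptions, since the first formula itself is not reproduced in the text.

```python
import torch
import torch.nn as nn

class RelationUnit(nn.Module):
    """Sketch of one relation unit for the facial motion unit pair (i, j): four fully
    connected layers mapping (X_i, X_j) to a scalar correlation. Widths, activations
    and the tanh output range are assumptions."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),   # fully connected layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),         # fully connected layer 2
            nn.Linear(hidden, hidden), nn.ReLU(),         # fully connected layer 3
            nn.Linear(hidden, 1),                         # fully connected layer 4 -> scalar
        )

    def forward(self, x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
        # x_i, x_j: [N, feat_dim] feature vectors of the i-th and j-th facial motion unit
        return torch.tanh(self.fc(torch.cat([x_i, x_j], dim=-1))).squeeze(-1)  # [N] correlations
```

One such module per facial motion unit pair gives the "plurality of relation units" of claim 1; the pair-specific parameters play the role of the per-pair fully connected layer parameters named in the claim.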
3. The method of claim 1, wherein the j-th pseudo tag pl_j in each face image is generated using a fourth formula, in which τ is a preset threshold against which the prediction probability value of the j-th face motion unit in each face image is compared; pl_j = 1 means that pseudo tag 1 is generated for the j-th face motion unit in each face image, pl_j = 0 means that pseudo tag 0 is generated for the j-th face motion unit in each face image, and pl_j = null means that no corresponding pseudo tag is generated for the j-th face motion unit in each face image.
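Claim 3 reads as confidence thresholding: a high predicted probability yields pseudo tag 1, a sufficiently low one yields pseudo tag 0, and anything in between yields no tag. A minimal sketch follows; the symmetric use of τ and 1 - τ as the lower cutoff is an assumption about the fourth formula.

```python
from typing import List, Optional

def make_pseudo_labels(probs: List[float], tau: float = 0.9) -> List[Optional[int]]:
    """probs: predicted probabilities of the facial motion units in one face image.
    Returns 1 / 0 / None per unit; None means no pseudo tag is generated."""
    labels: List[Optional[int]] = []
    for p in probs:
        if p >= tau:             # confidently present -> pseudo tag 1
            labels.append(1)
        elif p <= 1.0 - tau:     # confidently absent -> pseudo tag 0 (assumed lower threshold)
            labels.append(0)
        else:                    # uncertain -> no pseudo tag for this unit
            labels.append(None)
    return labels

# e.g. make_pseudo_labels([0.95, 0.05, 0.50]) -> [1, 0, None]
```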
4. The method of claim 1, wherein the relation unit learns the correlation between the two facial motion units in the corresponding facial motion unit pair by relation regularization; calculating the loss value of the current batch comprises the following steps:
calculating a relation regularization loss value of each face image, and calculating a relation regularization loss value of the current batch according to the relation regularization loss value of each face image;
calculating the supervised loss value of each first face image according to the respective prediction probability values of the face motion units in that first face image, and calculating the supervised loss value of the current batch according to the supervised loss values of the first face images in the current batch;
augmenting each face image in the current batch to obtain a corresponding third face image; calculating the combined unsupervised loss value of each face image and the corresponding third face image in the current batch according to the pseudo tags generated for each face image in the current batch during the previous training process and the respective prediction probability values of the face motion units in the corresponding third face image; and calculating the unsupervised loss value of the current batch according to the combined unsupervised loss value of each face image and the corresponding third face image in the current batch;
and calculating the loss value of the current batch according to the relation regularization loss value, the supervised loss value and the unsupervised loss value of the current batch.
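Claim 4 thus combines three terms into the batch loss. How the terms are weighted is not stated in the text, so the sketch below simply assumes fixed weighting coefficients.

```python
def combine_batch_loss(l_supervised, l_relation_reg, l_unsupervised,
                       lambda_rr: float = 1.0, lambda_u: float = 1.0):
    """Assumed combination of the three per-batch loss terms of claim 4; the weighting
    coefficients lambda_rr and lambda_u are illustrative, not taken from the patent."""
    return l_supervised + lambda_rr * l_relation_reg + lambda_u * l_unsupervised
```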
5. The method of claim 4, wherein the relation regularization loss value is calculated using a fifth formula over a set of positively correlated relation units and a set of negatively correlated relation units, in which L_RR is the relation regularization loss value, μ is a preset threshold, B is the number of face images in the current batch, the updated average feature vectors of the i-th and the j-th facial motion unit in each face image are used, α_i is the prediction probability corresponding to the mean of the updated average feature vectors of the i-th facial motion unit over all face images in the batch, and α_j is the prediction probability corresponding to the mean of the updated average feature vectors of the j-th facial motion unit over all face images in the batch;
and/or,
calculating the supervised loss value using a sixth formula, in which L_au is the supervised loss value, C is the number of facial motion units in each face image, w_c is a preset balance parameter, p_j is the label of the j-th facial motion unit in each first face image, and the prediction probability value of the j-th facial motion unit in each first face image is used; and/or
calculating the unsupervised loss value using a seventh formula, in which L_u is the unsupervised loss value, t is the index of the current training process, and the pseudo tag generated for the j-th facial motion unit in each face image in the current batch during the (t-1)-th training process is used together with the prediction probability value of the j-th facial motion unit in the corresponding third face image in the t-th training process.
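The fifth to seventh formulas are described only through their variables, so the following sketch fills in plausible forms under stated assumptions: a margin-style regularizer over the positively and negatively correlated relation-unit sets, a class-balanced binary cross-entropy over the labelled (first) images, and a cross-entropy between the predictions on the augmented (third) images and the previous round's pseudo tags. The hinge form, the use of cosine similarity, and the NaN masking for units without a pseudo tag are all guesses, not the patent's formulas.

```python
import torch
import torch.nn.functional as F

def relation_regularization_loss(feat_mean, alpha, pos_pairs, neg_pairs, mu: float = 0.5):
    """feat_mean: [C, D] batch-mean updated average feature vector per facial motion unit;
    alpha: [C] prediction probabilities corresponding to those means (claim 5's alpha_i);
    pos_pairs / neg_pairs: (i, j) index pairs of positively / negatively correlated relation units."""
    loss = feat_mean.new_zeros(())
    for i, j in pos_pairs:   # positively correlated units: pull features and probabilities together
        loss = loss + F.relu(mu - F.cosine_similarity(feat_mean[i], feat_mean[j], dim=0)) \
                    + (alpha[i] - alpha[j]).abs()
    for i, j in neg_pairs:   # negatively correlated units: push features apart beyond the margin
        loss = loss + F.relu(F.cosine_similarity(feat_mean[i], feat_mean[j], dim=0) - mu)
    return loss / max(len(pos_pairs) + len(neg_pairs), 1)

def supervised_loss(pred, target, w):
    """pred, target: [B, C] predicted probabilities and ground-truth labels of the labelled
    (first) face images; w: [C] preset per-unit balance parameters."""
    bce = F.binary_cross_entropy(pred, target, reduction="none")   # [B, C]
    return (bce * w).mean()

def unsupervised_loss(pred_aug, pseudo):
    """pred_aug: [B, C] predictions on the augmented (third) face images in round t;
    pseudo: [B, C] pseudo tags from round t-1, NaN where no tag was generated."""
    mask = ~torch.isnan(pseudo)
    if not mask.any():
        return pred_aug.sum() * 0.0        # keeps the graph; nothing pseudo-labelled this batch
    return F.binary_cross_entropy(pred_aug[mask], pseudo[mask])
```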
6. A facial movement unit detection model training device, comprising:
a batch dividing module, used for obtaining a training sample set and dividing the face images in the training sample set into a preset number of batches, wherein the training sample set comprises first face images and second face images, the first face images are face images with a plurality of face motion unit labels, and the second face images are face images without labels;
the batch processing module is used for carrying out a preset number of training processes according to the training sample set, and each training process comprises carrying out preset processing on all batches of face images; the preset processing process of each batch comprises the following steps: extracting feature vectors of each face motion unit of each face image in the current batch; constructing a plurality of relation units according to the feature vectors of the face motion units in each face image, wherein each relation unit corresponds to one face motion unit pair; learning, by each of the relation units, a correlation between the two face motion units in the corresponding pair of face motion units; discarding at least one relation unit according to a preset proportion; and training a face motion unit detection model according to the correlation learned by each relation unit remaining in each face image;
The training of the facial motion unit detection model according to the correlation learned by each relation unit remaining in each face image comprises the following steps:
determining respective prediction probability values of all face motion units in each face image according to the correlations learned by all the remaining relation units in each face image;
generating a plurality of corresponding pseudo labels for each face image according to the respective prediction probability values of the face motion units in each face image;
calculating a loss value of the current batch according to the predicted probability value of each face motion unit in each face image and the pseudo tags generated in the previous training process;
judging whether the current batch is the last batch in the training process;
if yes, judging whether the current training times reach the preset times, and if yes, finishing the training of the facial movement unit detection model; otherwise, adjusting the facial movement unit detection model according to the loss value of the current batch, and performing the next training process;
otherwise, performing model training according to the face images of the next batch;
wherein determining the respective prediction probability value of each face motion unit in each face image according to the correlation learned by each relation unit remaining in each face image comprises:
determining an updated feature vector of each facial motion unit in each facial motion unit pair according to the correlations learned by the remaining relation units in each face image;
calculating the updated average feature vector of each face motion unit in each face image according to the updated feature vectors of that face motion unit in each face image;
calculating a prediction probability value of each face motion unit in each face image according to the average feature vector updated by each face motion unit in each face image;
wherein the updated feature vector of each facial motion unit in each facial motion unit pair is calculated using a second formula, in which the correlation between the i-th and the j-th facial motion unit is used to obtain the updated feature vector of the i-th facial motion unit and the updated feature vector of the j-th facial motion unit; and/or,
calculating the prediction probability value of each facial motion unit using a third formula, in which the prediction probability value of the j-th facial motion unit in each face image is computed from the updated average feature vector of each facial motion unit in that face image by means of a sigmoid function σ().
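Reading the loop-control limitations of claims 1 and 6 together, the overall procedure can be sketched as follows. Every callable argument is a placeholder for a component the claims describe only abstractly, and the per-pair random drop is one possible reading of "discarding at least one relation unit according to a preset proportion"; this is not the patented implementation.

```python
import random

def run_training(batches, num_rounds, drop_ratio,
                 extract_features, build_relations, predict_probs,
                 make_pseudo_tags, batch_loss, update_model):
    """Sketch of the per-batch processing loop of claims 1 and 6. The batches mix labelled
    (first) and unlabelled (second) face images; all callables are placeholders."""
    pseudo = {}                                    # pseudo tags produced in the previous round
    for t in range(num_rounds):                    # preset number of training processes
        for b, batch in enumerate(batches):
            feats = extract_features(batch)                    # per-unit feature vectors
            corr, pairs = build_relations(feats)               # one relation unit per unit pair
            kept = [p for p in pairs if random.random() >= drop_ratio]  # drop a preset proportion
            probs = predict_probs(feats, corr, kept)           # via updated average features
            loss = batch_loss(batch, probs, pseudo.get(b))     # uses round t-1 pseudo tags
            pseudo[b] = make_pseudo_tags(probs)                # stored for the next round
            if b == len(batches) - 1 and t < num_rounds - 1:
                update_model(loss)   # claims adjust the model after the last batch of a round
    # training finishes once the preset number of rounds has been completed
```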
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 5 when the computer program is executed.
CN202110484143.4A 2021-04-30 2021-04-30 Training method and device for facial movement unit detection model Active CN113221708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110484143.4A CN113221708B (en) 2021-04-30 2021-04-30 Training method and device for facial movement unit detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110484143.4A CN113221708B (en) 2021-04-30 2021-04-30 Training method and device for facial movement unit detection model

Publications (2)

Publication Number Publication Date
CN113221708A CN113221708A (en) 2021-08-06
CN113221708B true CN113221708B (en) 2023-11-10

Family

ID=77090706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110484143.4A Active CN113221708B (en) 2021-04-30 2021-04-30 Training method and device for facial movement unit detection model

Country Status (1)

Country Link
CN (1) CN113221708B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409198A (en) * 2018-08-31 2019-03-01 平安科技(深圳)有限公司 AU detection model training method, AU detection method, device, equipment and medium
CN109697399A (en) * 2017-10-24 2019-04-30 普天信息技术有限公司 A kind of facial expression recognizing method and device
CN110998696A (en) * 2017-07-06 2020-04-10 伊虎智动有限责任公司 System and method for data-driven mobile skill training
CN111028319A (en) * 2019-12-09 2020-04-17 首都师范大学 Three-dimensional non-photorealistic expression generation method based on facial motion unit

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111902077B (en) * 2018-01-25 2023-08-04 元平台技术有限公司 Calibration technique for hand state representation modeling using neuromuscular signals

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110998696A (en) * 2017-07-06 2020-04-10 伊虎智动有限责任公司 System and method for data-driven mobile skill training
CN109697399A (en) * 2017-10-24 2019-04-30 普天信息技术有限公司 A kind of facial expression recognizing method and device
CN109409198A (en) * 2018-08-31 2019-03-01 平安科技(深圳)有限公司 AU detection model training method, AU detection method, device, equipment and medium
CN111028319A (en) * 2019-12-09 2020-04-17 首都师范大学 Three-dimensional non-photorealistic expression generation method based on facial motion unit

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Action unit analysis enhanced facial expression recognition by deep neural network evolution; Ruicong Zhi et al.; Neurocomputing; Vol. 425 (2021); 135-148 *
Do deep neural networks learn facial action units when doing expression recognition; Pooya Khorrami et al.; ICCV 2015; 19-27 *
Drop-relationship learning for semi-supervised facial action unit recognition; Xin Hua et al.; Neurocomputing; Vol. 55 (2023); 126361-126370 *
Improving neural networks by preventing co-adaptation of feature detectors; G.E. Hinton et al.; arXiv:1207.0580; 1-18, main text paragraph 2 *
Multi-label Co-regularization for Semi-supervised Facial Action Unit Recognition; Xuesong Niu et al.; NeurIPS 2019; 1-11 *
Semantic Relationships Guided Representation Learning for Facial Action Unit Recognition; Guanbin Li et al.; Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 30, No. 01; 8594-8601, main text page 8596 paragraph 4, page 8597 paragraphs 2-6, page 8598 paragraph 1, Fig. 2, Fig. 3 *
Dynamic analysis of affective sensory testing technology (情感型感官测试技术动态分析); Zhi Ruicong et al.; Food Research and Development (食品研究与开发); Vol. 37, No. 13; 200-206 *
Exploration of facial cognitive behavior feature extraction (面部认知行为特征提取探索); Wang Xianmei; Journal of Chinese Computer Systems (小型微型计算机系统); Vol. 35, No. 03; 597-601 *

Also Published As

Publication number Publication date
CN113221708A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
Srivastava Improving neural networks with dropout
Thoma Analysis and optimization of convolutional neural network architectures
CN107526785B (en) Text classification method and device
CN111797683A (en) Video expression recognition method based on depth residual error attention network
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN113994341A (en) Facial behavior analysis
CN109615614B (en) Method for extracting blood vessels in fundus image based on multi-feature fusion and electronic equipment
CN110705490B (en) Visual emotion recognition method
Olsvik et al. Biometric fish classification of temperate species using convolutional neural network with squeeze-and-excitation
CN117475038B (en) Image generation method, device, equipment and computer readable storage medium
CN110414541B (en) Method, apparatus, and computer-readable storage medium for identifying an object
CN112861718A (en) Lightweight feature fusion crowd counting method and system
Epelbaum et al. Deep learning applied to road traffic speed forecasting
He et al. What catches the eye? Visualizing and understanding deep saliency models
Gat et al. Latent space explanation by intervention
Manna et al. Bird image classification using convolutional neural network transfer learning architectures
Jia et al. A novel dual-channel graph convolutional neural network for facial action unit recognition
CN113221708B (en) Training method and device for facial movement unit detection model
CN116884067A (en) Micro-expression recognition method based on improved implicit semantic data enhancement
CN113408721A (en) Neural network structure searching method, apparatus, computer device and storage medium
Panda et al. Feedback through emotion extraction using logistic regression and CNN
CN115797642A (en) Self-adaptive image semantic segmentation algorithm based on consistency regularization and semi-supervision field
Yu et al. Efficient Uncertainty Quantification for Multilabel Text Classification
Bhattacharya et al. Simplified face quality assessment (sfqa)
Mao et al. Pet dog facial expression recognition based on convolutional neural network and improved whale optimization algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant