CN112801006A - Training method of expression representation model, and facial expression representation method and device - Google Patents


Info

Publication number
CN112801006A
Authority
CN
China
Prior art keywords
trained
expression
sample
model
characterization
Prior art date
Legal status
Granted
Application number
CN202110166517.8A
Other languages
Chinese (zh)
Other versions
CN112801006B (en)
Inventor
张唯
冀先朋
丁彧
李林橙
范长杰
胡志鹏
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110166517.8A
Publication of CN112801006A
Application granted
Publication of CN112801006B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method for an expression representation model, and a facial expression representation method and device, relating to the technical field of image processing and solving the technical problem that existing facial expression representations have low accuracy. The method comprises the following steps: determining a sample set, wherein each sample in the sample set comprises a sample image and a sample label; and training an expression characterization model to be trained by using the sample set to obtain a trained expression characterization model. The expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model; the trained expression characterization model comprises the trained full-face characterization sub-model and the trained identity characterization sub-model; and the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model.

Description

Training method of expression representation model, and facial expression representation method and device
Technical Field
The application relates to the technical field of image processing, in particular to a training method of an expression representation model, and a facial expression representation method and device.
Background
Humans naturally perceive expressions, but machines do not possess this ability. An accurate way of representing expressions can improve a machine's understanding of human emotion, which provides an important technical basis for building friendly, intelligent and harmonious human-computer interaction systems. Making reasonable use of multiple expression representation modes can also help advance many related downstream tasks, including expression image retrieval, emotion recognition, facial Action Unit (AU) recognition, facial expression generation, and so on.
Current methods for characterizing facial expressions include AU-based characterization methods and similarity-based characterization methods, among others. However, the accuracy of these existing facial expression characterization methods is low.
Disclosure of Invention
The application aims to provide a training method of an expression representation model, a facial expression representation method and a facial expression representation device, so as to solve the technical problem that the accuracy of the existing facial expression representation method is low.
In a first aspect, an embodiment of the present application provides a method for training an expression representation model, where the method includes:
determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
training an expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization submodel to be trained and a trained identity characterization submodel, the trained expression characterization model comprises the trained full-face characterization submodel and the trained identity characterization submodel, and the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization submodel and the output of the trained identity characterization submodel.
In one possible implementation, the trained identity representation submodel and the full face representation submodel to be trained are the same model.
In one possible implementation, the trained expression representation model further comprises a dimension reduction layer;
and the dimension reduction layer is used for carrying out dimension reduction processing on the difference value to obtain an expression characteristic result output by the trained expression characterization model.
In a possible implementation, the fully connected neural network in the dimensionality reduction layer is used for performing dimensionality reduction on the difference value, and standardizing the difference value after dimensionality reduction through a two-norm to obtain the expression feature result.
In one possible implementation, the sample set includes: a first set of samples of a plurality of sets, the first set of samples of each set comprising a first reference sample, a first positive sample, and a first negative sample;
the sample set further comprises: a second set of samples, the second set of samples of each set including the first reference sample, a second positive sample;
wherein the second positive sample is a positive sample closest to the first reference sample among the plurality of first positive samples in the plurality of groups.
In one possible implementation, the second positive sample is a positive sample that is closest to the first reference sample and is obtained by sequentially comparing two adjacent first positive samples among a plurality of first positive samples corresponding to the same first reference sample.
In one possible implementation, a fully connected layer is also provided after the dimensionality reduction layer;
the full connection layer is used for performing annotation prediction based on the expression characteristic result to obtain annotation prediction results corresponding to a plurality of annotators respectively;
and the labeling prediction result is used as a correction sample to correct the trained expression characterization model.
In one possible implementation, the fully connected layer is configured to: compare Euclidean distances between a plurality of expression features corresponding to a plurality of sample labels annotated by each annotator for the same sample image, and obtain an annotation prediction result corresponding to each annotator based on the comparison result of the Euclidean distances; wherein the plurality of sample labels includes a positive sample label, a negative sample label, and a reference sample label.
In one possible implementation, the predictive annotation result is used to: correcting the trained expression characterization model by using a loss function corresponding to the prediction labeling result through a gradient descent method and continuously iterating until the prediction deviation is gradually converged to obtain a corrected expression characterization model;
the loss function is used for representing the difference between the prediction annotation result and the actual annotation result of the annotator.
In a second aspect, an embodiment of the present application provides a method for representing an expression of a face, the method including:
acquiring a facial image to be characterized;
processing the facial image through the trained full-face representation model to obtain a full-face feature vector;
processing the facial image through a preset identity representation model to obtain an identity feature vector;
and subtracting the identity feature vector from the full-face feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
In a possible implementation, the step of obtaining an expression feature result based on the expression feature vector includes:
and performing dimensionality reduction on the expression feature vector to obtain an expression feature result.
In a third aspect, an embodiment of the present application provides a training apparatus for an expression representation model, where the apparatus includes:
a determining module for determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
the training module is used for training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization submodel to be trained and a trained identity characterization submodel, the trained expression characterization model comprises the trained full-face characterization submodel and the trained identity characterization submodel, and the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization submodel and the output of the trained identity characterization submodel.
In a fourth aspect, an embodiment of the present application provides an expression characterization apparatus for a face, the apparatus including:
the acquisition module is used for acquiring a facial image to be characterized;
the first processing module is used for processing the facial image through the trained full-face representation model to obtain a full-face feature vector;
the second processing module is used for processing the facial image through a preset identity representation model to obtain an identity feature vector;
and the subtraction module is used for subtracting the identity feature vector from the full-face feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
In a fifth aspect, this application provides a computer device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor executes the computer program to implement the method of the first aspect or the second aspect.
In a sixth aspect, embodiments of the present application further provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to perform the method of the first or second aspect.
The embodiment of the application brings the following beneficial effects:
the embodiment of the application provides a training method of an expression characterization model, a facial expression characterization method and a facial expression characterization device, firstly determining a sample set, wherein each sample comprises a sample image and a sample label, then training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model, wherein the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full-face characterization sub-model and a trained identity characterization sub-model, the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model, in the scheme, the difference between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model is used, the identity information can be removed from the integrally trained expression characterization model, namely only information related to the expression characteristics is kept, the identity characteristics and the expression characteristics are decoupled, and the influence of personal identity information on facial expressions is removed, so that the accuracy rate can be improved by detecting the facial expression characterization through the trained expression characterization model, and the technical problem that the accuracy rate of the existing facial expression characterization method is low is solved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the detailed description of the present application or the technical solutions in the prior art, the drawings needed to be used in the detailed description of the present application or the prior art description will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flowchart of a training method for an expression representation model according to an embodiment of the present disclosure;
fig. 2 is a model schematic diagram of a training method for an expression representation model according to an embodiment of the present disclosure;
fig. 3 is a data sample diagram of a training method for an expression representation model according to an embodiment of the present disclosure;
fig. 4 is an overall logic diagram of a training method for an expression representation model according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a method for representing facial expressions according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training device for an expression representation model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an expression representation apparatus for a face according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as referred to in the embodiments of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Humans have the ability to perceive expressions, but machines do not, because it is difficult for a machine to simulate the abstract perception process of humans with well-defined features. Therefore, how to represent expressions reasonably, so as to promote the development of natural and harmonious human-computer interaction and provide technical support for multiple application fields, is a popular research direction.
With the development of deep learning, the emotion classification task is often completed using a Convolutional Neural Network (CNN) structure, and the features of the last or penultimate network layer are taken as the expression and emotion characteristics. However, some early works of this kind often use traditional handcrafted features, such as Local Binary Pattern (LBP) features and Scale-Invariant Feature Transform (SIFT) features, to represent expression features. This easily neglects the huge differences among the large number of expressions included in a single emotion category, so the granularity of the obtained expression representations is not fine enough, their distribution is discrete, and it is difficult to accurately represent some complex and subtle expressions.
Alternatively, expression characterization may be performed based on similarity comparison, which often enables a finer-grained characterization. However, this method requires a large amount of data for comparison and places high labeling requirements on that data.
In addition, facial expression data can also be represented by fixed, discrete AUs: the face is divided into a plurality of AUs according to facial muscle movement and anatomical characteristics, each AU represents the muscle movement of a local facial region, and different expressions can be represented by linear combinations of different AUs; for example, a happy expression can generally be represented by a combination of AU6 and AU12. However, this representation cannot cover all facial expression data with clear semantics, and its degree of representational freedom is insufficient. Furthermore, because AUs are closely coupled with an individual's identity information, AU detection accuracy is low.
Based on this, the embodiment of the application provides a training method of an expression representation model, and a facial expression representation method and device, and the technical problem that the accuracy of the existing facial expression representation method is low can be solved through the method.
Embodiments of the present application are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a training method for an expression representation model according to an embodiment of the present disclosure. As shown in fig. 1, the method includes:
in step S110, a sample set is determined.
Wherein each sample in the sample set comprises a sample image and a sample label. Note that each sample image includes a facial image, and the sample label is used to indicate the expression information of that facial image.
And step S120, training the expression characterization model to be trained by using the sample set to obtain the trained expression characterization model.
The expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model; the trained expression characterization model comprises the trained full-face characterization sub-model and the trained identity characterization sub-model; and the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model.
For example, as shown in fig. 2, the expression representation model to be trained includes a full-face representation sub-model (Face Model) to be trained and a trained identity representation sub-model (Identity Model). The output of the trained expression representation model is determined based on the difference between the output (V_face) of the trained full-face representation sub-model and the output (V_id) of the trained identity representation sub-model, i.e. the 512-dimensional difference feature V_exp; in other words, the amount by which the full-face features deviate from the identity features is taken as the real facial expression feature of the sample image.
In the embodiment of the application, by taking the difference between the output of the trained full-face representation sub-model and the output of the trained identity representation sub-model, identity information can be removed from the overall trained expression representation model; that is, only information related to the expression features is kept, the identity features and the expression features are decoupled, and the influence of personal identity information on the facial expression is removed. Therefore, accuracy can be improved by using the trained expression representation model to detect facial expression representations. Furthermore, because the facial expression features are decoupled from the identity features, the obtained expression features are not affected by an individual's facial appearance, which improves robustness and the accuracy of facial expression representation.
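As a purely illustrative aid, a minimal sketch of this difference-based structure is given below. It assumes PyTorch (the application does not prescribe a framework), and the class name and encoder handles are hypothetical; only the 512-dimensional outputs, the frozen identity sub-model and the subtraction come from the description above.

```python
# Minimal sketch, assuming PyTorch; class and variable names are illustrative.
import torch
import torch.nn as nn

class ExpressionRepresentationModel(nn.Module):
    """Full-face encoder minus frozen identity encoder -> expression feature V_exp."""

    def __init__(self, face_encoder: nn.Module, identity_encoder: nn.Module):
        super().__init__()
        self.face_encoder = face_encoder          # trainable full-face sub-model (V_face)
        self.identity_encoder = identity_encoder  # pre-trained identity sub-model (V_id)
        for p in self.identity_encoder.parameters():
            p.requires_grad = False               # identity parameters stay fixed

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        v_face = self.face_encoder(images)        # 512-d, identity + expression information
        with torch.no_grad():
            v_id = self.identity_encoder(images)  # 512-d, identity information only
        return v_face - v_id                      # V_exp: expression-related information
```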
The above steps are described in detail below.
In some embodiments, the full-face representation submodel to be trained may be a trained identity representation submodel. As an example, the trained identity token sub-model and the full face token sub-model to be trained are the same model.
For example, as shown in fig. 2, the full-face token sub-model (Face Model) to be trained and the trained identity token sub-model (Identity Model) are the same model, and each outputs a 512-dimensional feature vector; the parameters of the identity token sub-model (Identity Model) are fixed, while the parameters of the full-face token sub-model (Face Model) to be trained can be updated.
Setting the trained identity representation submodel and the full-face representation submodel to be trained to be the same model means that the output of the trained full-face representation submodel contains both identity and expression information, while the output of the trained identity representation submodel contains only identity information; the difference between the two is therefore cleaner expression information, which improves the training efficiency of the expression representation model.
In some embodiments, the difference values may be dimension reduced using a dimension reduction layer. As an example, the trained expression representation model further comprises a dimension reduction layer; and the dimension reduction layer is used for carrying out dimension reduction processing on the difference value to obtain an expression characteristic result output by the trained expression characterization model.
For example, as shown in FIG. 2, a dimension-reduction layer (High-order Module) can map the difference feature V_exp from the 512-dimensional high-dimensional space to a 16-dimensional low-dimensional space, and the expression feature result E_exp output by the trained expression characterization model is obtained through a K-order polynomial feature.
The dimension reduction processing can be carried out on the high-dimensional difference value through the dimension reduction layer so as to realize a more compact coding space and further improve the robustness of the model.
In some embodiments, the dimensionality reduction layer may normalize the difference values after the dimensionality reduction. As an example, a fully connected neural network in the dimension reduction layer is used to perform dimension reduction on the difference, and normalize the dimension-reduced difference by a two-norm to obtain an expression feature result.
For example, the fully connected neural network in the dimension-reduction layer helps to fit a nonlinear vector mapping, and the dimension-reduced difference can be normalized using the two-norm, finally yielding a 16-dimensional feature vector, namely the expression feature result E_exp. Standardizing the dimension-reduced difference makes collecting data samples easier and more efficient and improves the efficiency of expression representation processing.
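The following is a hedged sketch of one possible form of this dimension-reduction layer, again assuming PyTorch; only the 512-to-16 mapping and the two-norm normalization come from the description above, while the hidden-layer width is an illustrative assumption.

```python
# Sketch of the dimension-reduction layer: fully connected network + two-norm normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimensionReductionLayer(nn.Module):
    def __init__(self, in_dim: int = 512, out_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(              # fits the nonlinear vector mapping
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, v_exp: torch.Tensor) -> torch.Tensor:
        e = self.mlp(v_exp)                    # map the 512-d difference to 16 dimensions
        return F.normalize(e, p=2, dim=-1)     # two-norm standardization -> E_exp
```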
In some embodiments, the labeling of the sample images in the sample set may be done in a variety of ways. As an example, a sample set includes: a first group of samples of the plurality of groups, the first group of samples of each group including a first reference sample, a first positive sample, and a first negative sample; the sample set further includes: a second sample group, the second sample group of each group comprising a first reference sample, a second positive sample; and the second positive sample is the positive sample which is closest to the first reference sample in the plurality of first positive samples in the plurality of groups.
For example, as shown in fig. 3, the first sample group of each group includes a first reference sample, a first positive sample, and a first negative sample; that is, any three sample images may be taken as one triplet of data, where the sample image whose expression is most dissimilar to the other two is taken as the first negative sample (Negative, denoted N), and the other two sample images are taken as the first reference sample (Anchor, denoted A) and the first positive sample (Positive, denoted P), respectively.
Furthermore, on the basis of the current triplet scheme, for each reference sample, the positive sample most similar to it is selected by comparison from among the positive samples in all triplets corresponding to that reference sample; the negative samples from the triplets whose positive samples lost the comparison are then combined with it to form additional triplet data. These additional triplets, together with the previous sample groups, form a data set D consisting of a large number of triplets, so that the supervisory information is as close to the critical point as possible.
In the embodiment of the application, the training sample data comes from the FECNet public data set, which contains 449329 valid triplets. In addition, because each sample group contains a first reference sample that appears repeatedly across different triplets, the refinement of the data can be continuously enhanced, improving the accuracy of the trained model.
In some embodiments, the positive sample closest to the first reference sample may be obtained by continuously comparing adjacent first positive samples. As an example, the second positive sample is a positive sample closest to the first reference sample obtained by sequentially comparing two adjacent first positive samples among a plurality of first positive samples corresponding to the same first reference sample.
For example, as shown in fig. 3 and fig. 4, based on the above labels, a tree-structured comparison can be established for the same first reference sample. First, the set of triplets sharing the same first reference sample is referred to as a group. Then, the first positive samples of any two triplets in the group are compared pairwise, and the positive sample more similar to the first reference sample is selected to enter the next layer of comparison; in the next layer, adjacent positive samples are again paired and compared in the same way, until the positive sample most similar to the first reference sample is selected. If the comparison shows that P1 is closer to A than P2, and it is already known that P2 is closer to A than N2, then a new triplet (A, P1, N2) can be obtained; this triplet is inferred without a direct comparison.
Through a tree structure comparison mode, more critical samples with higher fine granularity can be supplemented, and the supervision effectiveness is improved.
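A sketch of this tree-structured (tournament) comparison is given below. `closer_to_anchor` is a hypothetical comparator standing in for whatever judgement decides which positive sample is more similar to the reference sample, which the description above does not fix; the step that infers extra triplets follows the (A, P1, N2) example.

```python
# Hedged sketch of mining the closest positive and extra triplets for one group.
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (anchor, positive, negative) image identifiers

def mine_group(
    group: List[Triplet],
    closer_to_anchor: Callable[[str, str, str], bool],
) -> Tuple[Tuple[str, str], List[Triplet]]:
    """Tournament over the positives of triplets sharing one anchor.
    Returns the second sample group (anchor, closest positive) and the extra
    triplets inferred from the losing comparisons."""
    anchor = group[0][0]
    remaining = list(group)                      # triplets still in the tournament
    extra: List[Triplet] = []
    while len(remaining) > 1:
        next_round = []
        for i in range(0, len(remaining) - 1, 2):
            (_, p1, n1), (_, p2, n2) = remaining[i], remaining[i + 1]
            if closer_to_anchor(anchor, p1, p2):           # p1 beats p2
                winner, loser_negative = (anchor, p1, n1), n2
            else:
                winner, loser_negative = (anchor, p2, n2), n1
            # the winning positive is also closer to the anchor than the losing
            # triplet's negative, so a new triplet is inferred without comparison
            extra.append((anchor, winner[1], loser_negative))
            next_round.append(winner)
        if len(remaining) % 2 == 1:                        # odd one out advances
            next_round.append(remaining[-1])
        remaining = next_round
    second_sample_group = (anchor, remaining[0][1])        # (first reference, second positive)
    return second_sample_group, extra
```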
In some embodiments, the annotation prediction data of different annotators can be used to further optimize the expression characterization model for subjective differences of different annotators. As an example, a fully connected layer is also provided after the dimensionality reduction layer; the full connection layer is used for performing annotation prediction based on the expression characteristic result to obtain annotation prediction results corresponding to a plurality of annotators respectively; and the labeling prediction result is used as a correction sample to correct the trained expression characterization model.
For example, as shown in fig. 2, a fully connected layer (Crowd Layer) may be constructed for each annotator, yielding N annotation prediction results (Annotation 1 to Annotation N) corresponding to the N annotators. The input of each fully connected layer is the common expression feature result E_exp, and its output is the annotation prediction result corresponding to that annotator. During testing and prediction, only the expression feature result E_exp before the fully connected layers is used as the expression feature code.
The fully connected layers can perform annotation prediction on the expression feature result for each of the plurality of annotators, so that the annotation prediction results corresponding to the annotators are obtained; this compensates for the subjective differences between different annotators and improves the accuracy of the trained model.
In some embodiments, in the process of predicting an annotator's annotation, the annotation result corresponding to each annotator can be predicted based on the comparison of Euclidean distances. As an example, the fully connected layer is used to: compare Euclidean distances between a plurality of expression features corresponding to a plurality of sample labels annotated by each annotator for the same sample image, and obtain an annotation prediction result corresponding to each annotator based on the comparison result of the Euclidean distances; wherein the plurality of sample labels includes a positive sample label, a negative sample label, and a reference sample label.
It should be noted that the plurality of expression features corresponding to the plurality of sample labels annotated on the same sample image by each annotator are compared by Euclidean distance, and the annotation prediction result corresponding to each annotator can then be obtained based on the comparison result of those Euclidean distances.
By utilizing the scheme of independent prediction and comparative correction of a plurality of annotators, a plurality of annotation data and sample labels can be effectively utilized, and labor cost is saved.
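A hedged sketch of the per-annotator fully connected (Crowd) layers follows, assuming PyTorch. The per-annotator mapping size and the rule that the sample left out of the smallest pairwise distance is the most dissimilar one are illustrative assumptions, chosen to be consistent with the Euclidean-distance comparison described above.

```python
# Sketch: one fully connected head per annotator on top of the shared E_exp.
import torch
import torch.nn as nn

class CrowdLayers(nn.Module):
    def __init__(self, num_annotators: int, feat_dim: int = 16):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_annotators)]
        )

    def forward(self, e_anchor, e_pos, e_neg):
        """For each annotator, predict which sample (0=anchor, 1=positive, 2=negative)
        is most dissimilar to the other two within the triplet."""
        predictions = []
        for head in self.heads:
            a, p, n = head(e_anchor), head(e_pos), head(e_neg)
            d_pn = torch.norm(p - n, dim=-1)   # smallest -> anchor is the outlier
            d_an = torch.norm(a - n, dim=-1)   # smallest -> positive is the outlier
            d_ap = torch.norm(a - p, dim=-1)   # smallest -> negative is the outlier
            dists = torch.stack([d_pn, d_an, d_ap], dim=-1)
            predictions.append(dists.argmin(dim=-1))
        return predictions
```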
In some embodiments, in the process of further optimizing the expression representation model, the trained expression representation model may be further modified and optimized by using a loss function corresponding to the prediction labeling result and through a gradient descent method. As an example, the predictive annotation result is used to: correcting the trained expression characterization model by using a loss function corresponding to the prediction labeling result through a gradient descent method, and continuously iterating until the prediction deviation is gradually converged, so that a corrected expression characterization model can be obtained; the loss function is used for representing the difference between the prediction annotation result and the actual annotation result of the annotator.
For example, as shown in fig. 4, during training, the annotator on each piece of data will output respective prediction annotation results, then a loss function can be calculated by using the plurality of prediction annotation results, and the trained expression representation model is further modified by a gradient descent method and iterated continuously until the prediction deviation gradually converges, so that the modified expression representation model can be finally obtained.
For example, in this embodiment the triplet loss function is used as the training basis of the model, and an SGD optimizer is used to perform the gradient computation and parameter updates. First, a triplet of data (a_k, p_k, n_k) is fed into the network, where a_k, p_k and n_k denote the Anchor, Positive and Negative samples annotated by annotator k. After passing through the expression characterization model, the corresponding expression feature codes (E_ak, E_pk, E_nk) of that annotator are obtained, where E_ak and E_pk should be close to each other, and both should be far from E_nk.
The triplet loss function is calculated as follows, where m represents the interval (margin) and d(·,·) denotes the Euclidean distance between two expression feature codes:
L_triplet = max(0, d(E_ak, E_pk) - d(E_ak, E_nk) + m)
The trained expression representation model is corrected by the gradient descent method and iterated continuously until the prediction deviation gradually converges, yielding the corrected expression representation model; the dimension reduction, in turn, realizes a more compact coding space.
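A sketch of this training step is shown below, assuming PyTorch and reusing the hypothetical module names from the earlier sketches; the margin value and learning rate are illustrative.

```python
# Sketch of the triplet loss and an SGD training step (values are illustrative).
import torch
import torch.nn.functional as F

def triplet_loss(e_a, e_p, e_n, margin: float = 0.2) -> torch.Tensor:
    d_ap = (e_a - e_p).pow(2).sum(dim=-1)      # squared distance anchor-positive
    d_an = (e_a - e_n).pow(2).sum(dim=-1)      # squared distance anchor-negative
    return F.relu(d_ap - d_an + margin).mean() # max(0, d_ap - d_an + m)

# model = ExpressionRepresentationModel(face_encoder, identity_encoder)
# reducer = DimensionReductionLayer()
# params = [p for m in (model, reducer) for p in m.parameters() if p.requires_grad]
# optimizer = torch.optim.SGD(params, lr=1e-3)
# for imgs_a, imgs_p, imgs_n in loader:        # triplets annotated by annotator k
#     e_a, e_p, e_n = (reducer(model(x)) for x in (imgs_a, imgs_p, imgs_n))
#     loss = triplet_loss(e_a, e_p, e_n)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```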
Fig. 5 is a flowchart illustrating a method for representing an expression of a face according to an embodiment of the present application. As shown in fig. 5, the method includes:
in step S510, a face image to be characterized is acquired.
In this step, a face image to be characterized may be obtained and input into the characterization model.
And step S520, processing the face image through the trained full-face representation model to obtain a full-face feature vector.
For example, a 512-dimensional full-face feature vector may be obtained by processing the face image with the trained full-face representation model.
Step S530, processing the face image through a preset identity representation model to obtain an identity feature vector.
For example, the facial image is processed by using the trained preset identity representation model, so that a 512-dimensional identity feature vector can be obtained.
And step S540, subtracting the full-face feature vector from the identity feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
In this embodiment, the expression feature vector without the identity information can be obtained by subtracting the full-face feature vector from the identity feature vector, that is, only information related to the expression feature is retained, the influence of the personal identity information on the facial expression is removed, decoupling between the identity feature and the expression feature is realized, and the accuracy of facial expression representation is improved.
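A minimal sketch of this inference flow (steps S510 to S540) is given below, assuming PyTorch; `face_model`, `identity_model` and `reducer` are hypothetical handles for the trained full-face representation model, the preset identity representation model and the dimension-reduction layer from the training phase.

```python
# Sketch of facial expression characterization at inference time.
import torch

@torch.no_grad()
def characterize_expression(image: torch.Tensor,
                            face_model, identity_model, reducer) -> torch.Tensor:
    v_face = face_model(image)        # S520: 512-d full-face feature vector
    v_id = identity_model(image)      # S530: 512-d identity feature vector
    v_exp = v_face - v_id             # S540: expression feature vector
    return reducer(v_exp)             # 16-d expression feature result
```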
The above steps are described in detail below.
In some embodiments, the expression feature vector may be subjected to a dimension reduction process. As an example, the step S540 may include the following steps:
and a), performing dimensionality reduction on the expression feature vector to obtain an expression feature result.
It should be noted that the expression feature vector can be mapped from 512 dimensions to 16 dimensions using the dimension reduction layer, which yields the expression feature result. Performing dimension reduction on the expression feature vector reduces the burden of processing sample data and further improves the robustness of the model.
The facial expression representation method provided by the embodiment of the application has the same technical characteristics as the facial expression representation model training method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Fig. 6 provides a schematic structural diagram of a training device for an expression representation model. As shown in fig. 6, the apparatus 600 for training an expression representation model includes:
a determining module 601, configured to determine a sample set; wherein each sample in the sample set comprises a sample image and a sample label;
the training module 602 is configured to train an expression representation model to be trained by using a sample set to obtain a trained expression representation model; the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a well-trained identity characterization sub-model, the well-trained expression characterization model comprises a well-trained full-face characterization sub-model and a well-trained identity characterization sub-model, and the output of the well-trained expression characterization model is determined based on the difference between the output of the well-trained full-face characterization sub-model and the output of the well-trained identity characterization sub-model.
In some embodiments, the trained identity token submodel is the same model as the full face token submodel to be trained.
In some embodiments, the trained expression representation model further includes a dimension reduction layer;
and the dimension reduction layer is used for carrying out dimension reduction processing on the difference value to obtain an expression characteristic result output by the trained expression characterization model.
In some embodiments, the fully-connected neural network in the dimension reduction layer is used for performing dimension reduction processing on the difference values, and normalizing the dimension-reduced difference values through a two-norm to obtain an expression feature result.
In some embodiments, the sample set comprises: a first group of samples of the plurality of groups, the first group of samples of each group including a first reference sample, a first positive sample, and a first negative sample;
the sample set further includes: a second sample group, the second sample group of each group comprising a first reference sample, a second positive sample;
and the second positive sample is the positive sample which is closest to the first reference sample in the plurality of first positive samples in the plurality of groups.
In some embodiments, the second positive sample is a positive sample that is closest to the first reference sample and is obtained by sequentially comparing two adjacent first positive samples among a plurality of first positive samples corresponding to the same first reference sample.
In some embodiments, a fully connected layer is also provided after the dimension reduction layer;
the full connection layer is used for performing annotation prediction based on the expression characteristic result to obtain annotation prediction results corresponding to a plurality of annotators respectively;
and the labeling prediction result is used as a correction sample to correct the trained expression characterization model.
In some embodiments, the fully connected layer is used to: compare Euclidean distances between a plurality of expression features corresponding to a plurality of sample labels annotated by each annotator for the same sample image, and obtain an annotation prediction result corresponding to each annotator based on the comparison result of the Euclidean distances; wherein the plurality of sample labels includes a positive sample label, a negative sample label, and a reference sample label.
In some embodiments, the predictive annotation result is used to: correcting the trained expression characterization model by using a loss function corresponding to the prediction labeling result through a gradient descent method, and continuously iterating until the prediction deviation is gradually converged to obtain a corrected expression characterization model; the loss function is used for representing the difference between the prediction annotation result and the actual annotation result of the annotator.
The training device for the expression representation model provided by the embodiment of the application has the same technical characteristics as the training method for the expression representation model and the facial expression representation method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Fig. 7 provides a schematic structural diagram of an expression representation apparatus for a face. As shown in fig. 7, the facial expression representation apparatus 700 includes:
an obtaining module 701, configured to obtain a facial image to be characterized;
a first processing module 702, configured to process the facial image through the trained full-face representation model to obtain a full-face feature vector;
the second processing module 703 is configured to process the facial image through a preset identity representation model to obtain an identity feature vector;
and the subtraction module 704 is configured to subtract the identity feature vector from the full-face feature vector to obtain an expression feature vector, and obtain an expression feature result based on the expression feature vector.
In some embodiments, the subtraction module 704 is specifically configured to:
and performing dimension reduction processing on the expression feature vector to obtain an expression feature result.
The facial expression representation device provided by the embodiment of the application has the same technical characteristics as the facial expression representation method, the facial expression representation model training method and the facial expression representation model training device provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
As shown in fig. 8, the computer device 800 includes a processor 802 and a memory 801, where a computer program operable on the processor is stored in the memory, and the processor executes the computer program to implement the steps of the method provided in the foregoing embodiments.
Referring to fig. 8, the computer apparatus further includes: a bus 803 and a communication interface 804, the processor 802, the communication interface 804, and the memory 801 being connected by the bus 803; the processor 802 is used to execute executable modules, such as computer programs, stored in the memory 801.
The Memory 801 may include a high-speed Random Access Memory (RAM), and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 804 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
The bus 803 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
The memory 801 is used for storing a program, and the processor 802 executes the program after receiving an execution instruction, and the method performed by the apparatus defined by the process disclosed in any of the foregoing embodiments of the present application may be applied to the processor 802, or implemented by the processor 802.
The processor 802 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 802. The Processor 802 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 801, and the processor 802 reads the information in the memory 801 and completes the steps of the method in combination with the hardware thereof.
Embodiments of the present application also provide a computer-readable storage medium storing machine executable instructions, which, when invoked and executed by a processor, cause the processor to execute the steps of the above-described method for training an expression characterization model and the method for expression characterization of a face.
The training device of the expression representation model and the expression representation device of the face provided by the embodiments of the application can be specific hardware on a device, or software or firmware installed on a device, and the like. The devices provided by the embodiments of the present application have the same implementation principles and technical effects as the foregoing method embodiments; for the sake of brevity, where the device embodiments do not mention a detail, reference may be made to the corresponding content in the foregoing method embodiments. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
For another example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method for training an expression characterization model and the method for representing facial expressions according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still, within the technical scope disclosed in the present application, modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some technical features; such modifications, changes or substitutions do not depart from the scope of the embodiments of the present application and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A method for training an expression characterization model, the method comprising:
determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
training an expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization submodel to be trained and a trained identity characterization submodel, the trained expression characterization model comprises the trained full-face characterization submodel and the trained identity characterization submodel, and the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization submodel and the output of the trained identity characterization submodel.
2. The method of claim 1, wherein the trained identity token submodel and the full face token submodel to be trained are the same model.
3. The method of claim 1, wherein the trained expression characterization model further comprises a dimension reduction layer;
and the dimension reduction layer is used for performing dimension reduction on the difference value to obtain an expression feature result output by the trained expression characterization model.
4. The method according to claim 3, wherein a fully connected neural network in the dimension reduction layer is used for performing dimension reduction on the difference value, and the dimension-reduced difference value is normalized by its two-norm (L2 norm) to obtain the expression feature result.
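A minimal sketch of the dimension reduction described in claims 3 and 4, assuming PyTorch; the input and output dimensions are assumptions, not values taken from the application:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DimensionReductionLayer(nn.Module):
        def __init__(self, in_dim: int = 512, out_dim: int = 16):   # dimensions assumed
            super().__init__()
            self.fc = nn.Linear(in_dim, out_dim)                    # fully connected reduction

        def forward(self, difference: torch.Tensor) -> torch.Tensor:
            reduced = self.fc(difference)
            # Normalize the reduced difference by its two-norm to obtain the expression feature result.
            return F.normalize(reduced, p=2, dim=-1)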
5. The method of claim 1, wherein the sample set comprises: a plurality of groups of first sample sets, each group of first sample sets comprising a first reference sample, a first positive sample, and a first negative sample;
the sample set further comprises: second sample sets, each second sample set comprising the first reference sample and a second positive sample;
wherein the second positive sample is the positive sample closest to the first reference sample among the plurality of first positive samples in the plurality of groups.
6. The method according to claim 5, wherein the second positive sample is the positive sample closest to the first reference sample, obtained by sequentially comparing adjacent pairs among the plurality of first positive samples corresponding to the same first reference sample.
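A minimal sketch of how the second positive sample of claims 5 and 6 might be selected, assuming the comparison is carried out on embedding vectors with Euclidean distance and that the sequential pairwise comparison amounts to a running "keep the closer one" pass; the function and variable names are illustrative only:

    import numpy as np

    def select_second_positive(reference: np.ndarray, positives: list) -> int:
        """Return the index of the first positive sample closest to the reference,
        found by sequentially comparing the current winner with the next candidate."""
        closest = 0
        for i in range(1, len(positives)):
            if (np.linalg.norm(positives[i] - reference)
                    < np.linalg.norm(positives[closest] - reference)):
                closest = i
        return closest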
7. The method according to claim 3, wherein a fully connected layer is further provided after the dimension reduction layer;
the fully connected layer is used for performing annotation prediction based on the expression feature result to obtain annotation prediction results corresponding to a plurality of annotators respectively;
and the annotation prediction results are used as correction samples to correct the trained expression characterization model.
8. The method of claim 7, wherein the fully connected layer is used to: compare Euclidean distances between a plurality of expression features corresponding to a plurality of sample labels annotated by each annotator for the same sample image, and obtain the annotation prediction result corresponding to each annotator based on the comparison result of the Euclidean distances; wherein the plurality of sample labels comprises a positive sample label, a negative sample label, and a reference sample label.
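A minimal sketch of the per-annotator prediction in claims 7 and 8, under the assumption that the fully connected layer acts as one linear projection per annotator and that the Euclidean-distance comparison is turned into a binary cross-entropy loss supplying the gradient-descent correction described in claim 9 below; all names, dimensions, and the soft (logistic) comparison are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AnnotatorPredictionLayer(nn.Module):
        def __init__(self, feat_dim: int = 16, num_annotators: int = 5):   # values assumed
            super().__init__()
            self.heads = nn.ModuleList([nn.Linear(feat_dim, feat_dim)
                                        for _ in range(num_annotators)])

        def forward(self, ref_feat, pos_feat, neg_feat, actual_choices):
            # actual_choices[k]: 1.0 if annotator k judged the positive sample closer, else 0.0
            losses = []
            for head, choice in zip(self.heads, actual_choices):
                r, p, n = head(ref_feat), head(pos_feat), head(neg_feat)
                # Compare Euclidean distances to predict this annotator's labeling.
                logit = torch.norm(r - n, dim=-1) - torch.norm(r - p, dim=-1)
                losses.append(F.binary_cross_entropy_with_logits(logit, choice))
            # The mean loss can be minimized by gradient descent until it converges (cf. claim 9).
            return torch.stack(losses).mean()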
9. The method of claim 7, wherein the annotation prediction result is used to: correct the trained expression characterization model by gradient descent, iterating with a loss function corresponding to the annotation prediction result until the prediction deviation converges, so as to obtain a corrected expression characterization model;
wherein the loss function represents the difference between the annotation prediction result and the actual annotation result of the annotator.
10. A method for characterizing a facial expression, the method comprising:
acquiring a facial image to be characterized;
processing the facial image through a trained full-face characterization model to obtain a full-face feature vector;
processing the facial image through a preset identity characterization model to obtain an identity feature vector;
and subtracting the identity feature vector from the full-face feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
11. The method of claim 10, wherein the step of deriving an expression feature result based on the expression feature vector comprises:
and performing dimensionality reduction on the expression feature vector to obtain an expression feature result.
12. An apparatus for training an expression characterization model, the apparatus comprising:
a determining module for determining a sample set; wherein each sample in the sample set comprises a sample image and a sample label;
and a training module for training an expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; wherein the expression characterization model to be trained comprises a full-face characterization submodel to be trained and a trained identity characterization submodel, the trained expression characterization model comprises a trained full-face characterization submodel and the trained identity characterization submodel, and the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization submodel and the output of the trained identity characterization submodel.
13. An expression characterization apparatus for a face, the apparatus comprising:
an acquisition module for acquiring a facial image to be characterized;
a first processing module for processing the facial image through a trained full-face characterization model to obtain a full-face feature vector;
a second processing module for processing the facial image through a preset identity characterization model to obtain an identity feature vector;
and a subtraction module for subtracting the identity feature vector from the full-face feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
14. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer readable storage medium having stored thereon computer executable instructions which, when invoked and executed by a processor, cause the processor to execute the method of any of claims 1 to 11.
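As an illustrative end-to-end sketch of the characterization method of claims 10 and 11, reusing the hypothetical modules sketched after claims 2 and 4 (image acquisition and preprocessing omitted; not the claimed implementation):

    import torch

    @torch.no_grad()
    def characterize_expression(image: torch.Tensor,
                                full_face_model, identity_model, reducer) -> torch.Tensor:
        full_face_vec = full_face_model(image)           # full-face feature vector
        identity_vec = identity_model(image)             # identity feature vector
        expression_vec = full_face_vec - identity_vec    # expression feature vector
        return reducer(expression_vec)                   # dimension-reduced expression feature result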
CN202110166517.8A 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device Active CN112801006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110166517.8A CN112801006B (en) 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110166517.8A CN112801006B (en) 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device

Publications (2)

Publication Number Publication Date
CN112801006A true CN112801006A (en) 2021-05-14
CN112801006B CN112801006B (en) 2023-09-05

Family

ID=75814585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110166517.8A Active CN112801006B (en) 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device

Country Status (1)

Country Link
CN (1) CN112801006B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685106A (en) * 2018-11-19 2019-04-26 深圳博为教育科技有限公司 A kind of image-recognizing method, face Work attendance method, device and system
CN110765873A (en) * 2019-09-19 2020-02-07 华中师范大学 Facial expression recognition method and device based on expression intensity label distribution
CN111259745A (en) * 2020-01-09 2020-06-09 西安交通大学 3D face decoupling representation learning method based on distribution independence
CN112052789A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Face recognition method and device, electronic equipment and storage medium
CN112200236A (en) * 2020-09-30 2021-01-08 网易(杭州)网络有限公司 Training method of face parameter recognition model and face parameter recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zihang Jiang et al.: "Disentangled Representation Learning for 3D Face Shape", CVPR 2019, pages 11949-11958 *
Sun Zhe: "Research on sparse representation facial expression recognition algorithms based on decoupled spatial feature learning", China Doctoral Dissertations Full-text Database, pages 20-44 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269145A (en) * 2021-06-22 2021-08-17 中国平安人寿保险股份有限公司 Expression recognition model training method, device, equipment and storage medium
CN113269145B (en) * 2021-06-22 2023-07-25 中国平安人寿保险股份有限公司 Training method, device, equipment and storage medium of expression recognition model

Also Published As

Publication number Publication date
CN112801006B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN112164391A (en) Statement processing method and device, electronic equipment and storage medium
CN111401339B (en) Method and device for identifying age of person in face image and electronic equipment
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111881671A (en) Attribute word extraction method
CN116015837A (en) Intrusion detection method and system for computer network information security
CN112804558B (en) Video splitting method, device and equipment
CN111914159A (en) Information recommendation method and terminal
CN112307337A (en) Association recommendation method and device based on label knowledge graph and computer equipment
CN112801006A (en) Training method of expression representation model, and facial expression representation method and device
CN110688411A (en) Text recognition method and device
CN112101488A (en) Training method and device for machine learning model and storage medium
CN111046786A (en) Generation method and device of age estimation neural network and electronic equipment
CN109657710B (en) Data screening method and device, server and storage medium
CN115687790B (en) Advertisement pushing method and system based on big data and cloud platform
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN110852103A (en) Named entity identification method and device
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN114822562A (en) Training method of voiceprint recognition model, voiceprint recognition method and related equipment
CN114996466A (en) Method and system for establishing medical standard mapping model and using method
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
US20230368571A1 (en) Training Method of Facial Expression Embedding Model, Facial Expression Embedding Method and Facial Expression Embedding Device
CN113704623A (en) Data recommendation method, device, equipment and storage medium
CN109657623B (en) Face image similarity calculation method and device, computer device and computer readable storage medium
CN109558582B (en) Visual angle-based sentence emotion analysis method and device
CN110929033A (en) Long text classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant