CN112801006B - Training method of expression representation model, and facial expression representation method and device - Google Patents

Training method of expression representation model, and facial expression representation method and device

Info

Publication number
CN112801006B
CN112801006B
Authority
CN
China
Prior art keywords
sample
model
expression
trained
characterization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110166517.8A
Other languages
Chinese (zh)
Other versions
CN112801006A (en)
Inventor
张唯
冀先朋
丁彧
李林橙
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110166517.8A
Publication of CN112801006A
Application granted
Publication of CN112801006B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method of an expression representation model, a facial expression representation method and a facial expression representation device, relates to the technical field of image processing, and solves the technical problem of low accuracy of existing facial expression representation. The method comprises the following steps: determining a sample set, wherein each sample in the sample set comprises a sample image and a sample label; training an expression characterization model to be trained by using a sample set to obtain a trained expression characterization model, wherein the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full-face characterization sub-model and a trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model.

Description

Training method of expression representation model, and facial expression representation method and device
Technical Field
The application relates to the technical field of image processing, in particular to a training method of an expression representation model, and an expression representation method and device of a face.
Background
Humans have an innate ability to perceive expressions, but machines do not. An accurate expression representation can improve a machine's understanding of human emotion and provides an important technical basis for building friendly, intelligent and harmonious human-computer interaction systems. Making reasonable use of expression representations can also assist the development of many related downstream tasks, including expression image retrieval, emotion recognition, facial Action Unit (AU) recognition, facial expression generation and the like.
Current methods for facial expression representation include AU-based representation methods, similarity-based representation methods and the like. However, these methods all have low accuracy in characterizing facial expressions.
Disclosure of Invention
The application aims to provide a training method of an expression representation model, a facial expression representation method and a facial expression representation device, so as to solve the technical problem that the accuracy of the existing facial expression representation method is low.
In a first aspect, an embodiment of the present application provides a training method for an expression characterization model, where the method includes:
determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
Training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full-face characterization sub-model and the trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model.
In one possible implementation, the trained identity representation sub-model and the full-face representation sub-model to be trained are the same model.
In one possible implementation, the trained expression characterization model further includes a dimension reduction layer;
the dimension reduction layer is used for carrying out dimension reduction processing on the difference value to obtain an expression characteristic result output by the trained expression characterization model.
In one possible implementation, the fully connected neural network in the dimension reduction layer is used for performing dimension reduction on the difference value, and normalizing the difference value after the dimension reduction by using a two-norm to obtain the expression characteristic result.
In one possible implementation, the sample set includes: a plurality of first sample groups, the first sample group of each group comprising a first reference sample, a first positive sample, and a first negative sample;
the sample set further comprises: a second sample group, the second sample group of each group comprising the first reference sample, a second positive sample;
the second positive samples are the positive samples closest to the first reference samples in the first positive samples in a plurality of groups.
In one possible implementation, the second positive samples are positive samples which are closest to the first reference samples and obtained after sequentially comparing two adjacent first positive samples in the plurality of first positive samples corresponding to the same first reference sample.
In one possible implementation, a fully-connected layer is further provided after the dimension reduction layer;
the full-connection layer is used for carrying out annotation prediction based on the expression characteristic result to obtain annotation prediction results corresponding to a plurality of annotators respectively;
and the labeling prediction result is used as a correction sample to correct the trained expression characterization model.
In one possible implementation, the fully connected layer is configured to: comparing Euclidean distances among a plurality of expression features corresponding to a plurality of sample labels marked by each marker for the same sample image, and obtaining marking prediction results corresponding to each marker respectively based on the comparison results of the Euclidean distances; wherein the plurality of sample tags includes a positive sample tag, a negative sample tag, and a reference sample tag.
In one possible implementation, the prediction labeling result is used to: correcting the trained expression representation model by using a loss function corresponding to the prediction labeling result through a gradient descent method, and iterating until the prediction deviation gradually converges to obtain a corrected expression representation model;
the loss function is used for representing a gap between the predicted labeling result and an actual labeling result of the labeling person.
In a second aspect, an embodiment of the present application provides a method for representing an expression of a face, the method including:
acquiring a face image to be characterized;
processing the facial image through the trained full-face representation model to obtain a full-face feature vector;
processing the facial image through a preset identity characterization model to obtain an identity feature vector;
and subtracting the full-face feature vector from the identity feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
In one possible implementation, the step of obtaining an expression feature result based on the expression feature vector includes:
and performing dimension reduction processing on the expression feature vector to obtain an expression feature result.
In a third aspect, an embodiment of the present application provides a training device for an expression characterization model, where the device includes:
a determining module for determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
the training module is used for training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full-face characterization sub-model and the trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model.
In a fourth aspect, an embodiment of the present application provides an expression characterization apparatus for a face, the apparatus including:
the acquisition module is used for acquiring the face image to be characterized;
the first processing module is used for processing the facial image through the trained full-face representation model to obtain a full-face feature vector;
the second processing module is used for processing the facial image through a preset identity characterization model to obtain an identity feature vector;
and the subtraction module is used for subtracting the full-face feature vector from the identity feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
In a fifth aspect, an embodiment of the present application further provides a computer device, including a memory, and a processor, where the memory stores a computer program that can be executed by the processor, and the processor executes the method according to the first aspect or the second aspect.
In a sixth aspect, embodiments of the present application further provide a computer readable storage medium storing computer executable instructions that, when invoked and executed by a processor, cause the processor to perform the method of the first or second aspect described above.
The embodiment of the application has the following beneficial effects:
according to the training method and the facial expression characterization device for the expression characterization model, a sample set is firstly determined, each sample comprises a sample image and a sample label, then the sample set is utilized to train the expression characterization model to be trained to obtain the trained expression characterization model, the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises the trained full-face characterization sub-model and the trained identity characterization sub-model, the output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model, in the scheme, the identity information can be removed from the whole trained expression characterization model, namely only the information related to the expression characteristics is reserved, decoupling between the identity characteristics and the identity information is achieved, the influence of personal identity information on the facial characterization is removed, and therefore the facial expression characterization accuracy rate can be improved by utilizing the facial expression characterization model to be detected well.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a training method of an expression characterization model according to an embodiment of the present application;
fig. 2 is a schematic diagram of a training method of an expression characterization model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a data sample of a training method of an expression characterization model according to an embodiment of the present application;
fig. 4 is an overall logic schematic diagram of a training method of an expression characterization model according to an embodiment of the present application;
fig. 5 is a flowchart of a method for representing an expression of a face according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a training device for expression characterization model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an expression characterization apparatus for a face according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "comprising" and "having" and any variations thereof, as used in the embodiments of the present application, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed but may optionally include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Humans have the nature of perceiving expressions, but machines do not, because it is difficult for a machine to simulate the abstract human perception process with well-defined features. Therefore, how to reasonably represent expressions, so as to promote the development of natural and harmonious human-machine interaction and provide technical support for many application fields, is a popular research direction.
With the development of deep learning, convolutional neural network (CNN) structures are often used to accomplish emotion classification tasks, and the features of the last or penultimate network layer are taken as expression and emotion representations. However, some early works of this kind often borrowed traditional hand-crafted features, such as Local Binary Pattern (LBP) features and scale-invariant feature transform (SIFT) features, to represent expressions. The large differences among the many expressions contained in a single emotion class are easily ignored, resulting in insufficient granularity, a relatively discrete distribution of the obtained expression representations, and difficulty in accurately representing certain complex and subtle expressions.
Alternatively, expression tokens may be detected based on a similarity comparison, which often enables relatively fine-grained tokens. However, this method needs to use a large amount of data for comparison, and the requirement for labeling the data is high.
In addition, facial expression data can be represented by fixed, discrete AUs: the face is divided into a plurality of AUs according to the muscle movement and anatomical characteristics of the face, each AU represents the muscle movement of a local position of the face, and different expressions can be represented by linear combinations of different AUs; for example, an expression of happiness can generally be represented by a combination of AU6 and AU12. However, this representation cannot describe all facial expression data with definite semantics, resulting in insufficient degrees of freedom. Moreover, because AUs are closely coupled with personal identity information, the accuracy of AU detection is low.
Based on the above, the embodiment of the application provides a training method of an expression representation model, a facial expression representation method and a facial expression representation device, and the technical problem that the accuracy of the existing facial expression representation method is low can be solved by the method.
Embodiments of the present application are further described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a training method of an expression characterization model according to an embodiment of the present application. As shown in fig. 1, the method includes:
step S110, a sample set is determined.
Wherein each sample in the sample set includes a sample image and a sample label. It should be noted that each sample image contains a facial image, and the sample label is used to represent the expression information of that facial image.
Step S120, training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model.
The expression characterization model to be trained comprises a full face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full face characterization sub-model and a trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full face characterization sub-model and the output of the trained identity characterization sub-model.
For example, as shown in FIG. 2, the expression characterization Model to be trained includes a full-Face characterization sub-Model (Face Model) to be trained and a trained Identity characterization sub-Model (Identity Model). The output of the trained expression characterization model is determined based on the difference between the output of the trained full-face characterization sub-model (V_face) and the output of the trained identity characterization sub-model (V_id): the 512-dimensional difference feature V_exp = V_face - V_id output by the difference module is the amount by which the full-face feature deviates from the identity feature, and is taken as the real facial expression feature of the sample image.
According to the embodiment of the application, through the difference between the output of the trained full-face representation sub-model and the output of the trained identity representation sub-model, identity information can be removed from the trained expression representation model as a whole, that is, only information related to the expression characteristics is retained; decoupling between the identity characteristics and the expression characteristics is achieved, and the influence of personal identity information on the facial expression is removed, so that detecting facial expression representations with the trained expression representation model improves accuracy. Moreover, because the expression characteristics are decoupled from the identity characteristics, the facial expression characteristics are not influenced by an individual's facial appearance, which improves robustness and further improves the accuracy of facial expression representation.
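To make the two-branch structure above more concrete, the following is a minimal sketch of the difference-based expression feature extraction, written in a PyTorch style as an assumption; the encoder modules and all names are illustrative placeholders rather than the patent's actual implementation.

```python
# Sketch only: two encoders share an input image; the identity branch is frozen and the
# full-face branch is trainable, and their 512-d outputs are subtracted to keep expression info.
import torch
import torch.nn as nn

class ExpressionCharacterizationModel(nn.Module):
    def __init__(self, face_encoder: nn.Module, identity_encoder: nn.Module):
        super().__init__()
        self.face_encoder = face_encoder          # full-face characterization sub-model (trainable)
        self.identity_encoder = identity_encoder  # trained identity characterization sub-model (fixed)
        for p in self.identity_encoder.parameters():
            p.requires_grad = False               # identity parameters stay fixed during training

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        v_face = self.face_encoder(image)         # 512-d: identity + expression information
        with torch.no_grad():
            v_id = self.identity_encoder(image)   # 512-d: identity information only
        return v_face - v_id                      # V_exp: expression-related information
```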
The above steps are described in detail below.
In some embodiments, the full-face representation sub-model to be trained may use the same model as the trained identity representation sub-model. As an example, the trained identity representation sub-model and the full-face representation sub-model to be trained are the same model.
For example, as shown in fig. 2, the full-Face representation sub-Model (Face Model) to be trained and the trained Identity representation sub-Model (Identity Model) are the same model, and each outputs a feature vector of dimension 512; the parameters of the trained Identity representation sub-Model (Identity Model) are fixed, while the parameters of the full-Face representation sub-Model (Face Model) to be trained are updatable.
Setting the trained identity representation sub-model and the full-face representation sub-model to be trained as the same model means that the output of the trained full-face representation sub-model contains both identity and expression information, while the output of the trained identity representation sub-model contains only identity information; the difference between the two is therefore purer expression information, which improves the training efficiency of the expression representation model.
In some embodiments, the difference may be dimension-reduced using a dimension-reduction layer. As an example, the trained expression characterization model further includes a dimension reduction layer; the dimension reduction layer is used for carrying out dimension reduction processing on the difference value to obtain an expression characteristic result output by the trained expression characterization model.
For example, as shown in FIG. 2, the dimension reduction layer (High-order Model) maps the difference feature V_exp from the 512-dimensional high-dimensional space to a 16-dimensional low-dimensional space through a K-order polynomial feature, so that the expression characteristic result E_exp output by the trained expression characterization model is obtained.
The dimension reduction layer can be used for carrying out dimension reduction treatment on the high-dimension difference value so as to realize a tighter coding space and further improve the robustness of the model.
In some embodiments, the dimension reduction layer may normalize the difference value after the dimension reduction process. As an example, the fully connected neural network in the dimension reduction layer is used for dimension reduction processing of the difference value, and the difference value after the dimension reduction processing is normalized through the two norms, so as to obtain the expression characteristic result.
For example, the fully connected neural network in the dimension reduction layer helps fit the nonlinear vector mapping, and the difference value after dimension reduction can be normalized by the two-norm (L2 norm) to finally obtain the 16-dimensional feature vector, namely the expression feature result E_exp. Normalizing the dimension-reduced difference makes it more convenient and efficient to collect data samples and improves the efficiency of expression characterization.
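A minimal sketch of such a dimension reduction layer follows, again in an assumed PyTorch style. The exact internal structure (including the K-order polynomial feature mentioned above) is not spelled out here, so the sketch only shows a small fully connected mapping from 512 to 16 dimensions followed by two-norm normalization; the hidden width is an arbitrary assumption.

```python
# Sketch only: fully connected dimension reduction plus L2 (two-norm) normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimensionReductionLayer(nn.Module):
    def __init__(self, in_dim: int = 512, out_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, v_exp: torch.Tensor) -> torch.Tensor:
        e = self.mlp(v_exp)
        return F.normalize(e, p=2, dim=-1)   # normalized 16-d expression feature result E_exp
```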
In some embodiments, labeling of sample images in a sample set may take a variety of forms. As one example, the sample set includes: a plurality of first sample groups, the first sample group of each group comprising a first reference sample, a first positive sample, and a first negative sample; the sample set further comprises: a second sample group, the second sample group of each group comprising a first reference sample, a second positive sample; the second positive samples are the positive samples closest to the first reference samples in the first positive samples in the plurality of groups.
For example, as shown in fig. 3, the first sample group of each group includes a first reference sample, a first positive sample and a first negative sample; that is, three sample images are taken from the first sample group as a piece of triplet data, the sample image whose expression is least similar to the other two is taken as the first negative sample (Negative, N), and the other two sample images are taken as the first reference sample (Anchor, A) and the first positive sample (Positive, P), respectively.
Furthermore, on the basis of the existing triplet scheme, for each reference sample, the positive sample most similar to that reference sample is picked by comparison from the positive samples of all triplets corresponding to it, and it is combined with the negative samples of the triplets whose positive samples lost the comparison to form some additionally obtained triplet data. These additional triplets, together with the previous groups of triplet samples, form a data set D containing a large number of triplets, so that the supervision information is as close to the critical point as possible.
In the embodiment of the application, the training sample data is obtained from the FECNet public data set, which contains 449329 valid triplets. In addition, each sample group contains a first reference sample that appears repeatedly in different triplets, so the refinement of the data can be continuously strengthened and the accuracy of the trained model improved.
In some embodiments, the positive sample that is closest to the first reference sample may be obtained by continuously comparing adjacent first positive samples. As an example, the second positive sample is a positive sample closest to the first reference sample obtained by sequentially comparing two adjacent first positive samples among the plurality of first positive samples corresponding to the same first reference sample.
For example, as shown in fig. 3 and fig. 4, based on the above labeling, a tree-structured comparison can be established for the same first reference sample. First, the triplets that share the same first reference sample are referred to as a group; then the first positive samples from the triplets of any two groups are compared pairwise, and the positive sample closer to the first reference sample is selected to enter the next layer of comparison; the next layer again pairs adjacent positive samples and performs the same comparison, until the positive sample closest to the first reference sample is selected. Through the above comparison it can be obtained that P1 is closer to A than P2; since it is already known that P2 is closer to A than N2, a new triplet (A, P1, N2) can be deduced without any additional comparison.
This tree-structured comparison helps supplement more critical, finer-grained samples and improves the effectiveness of supervision.
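The following is an illustrative sketch of this tournament-style selection, assuming an external comparison function (in practice a human annotation or a distance measure) decides which of two positive samples is closer to the reference; all names are hypothetical.

```python
# Sketch only: layer-by-layer pairwise comparison of the positive samples that share one
# reference sample (Anchor), keeping the winner of each pair until one positive remains.
from typing import Callable, List

def select_closest_positive(anchor, positives: List, closer: Callable):
    """`closer(anchor, p1, p2)` returns whichever of p1/p2 is more similar to `anchor`."""
    layer = list(positives)
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer) - 1, 2):
            next_layer.append(closer(anchor, layer[i], layer[i + 1]))
        if len(layer) % 2 == 1:            # an unpaired positive advances automatically
            next_layer.append(layer[-1])
        layer = next_layer
    return layer[0]

# Each comparison also yields extra triplets for free: if P1 beats P2 for anchor A, and the
# original triplet already established that P2 is closer to A than N2, then (A, P1, N2) holds.
```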
In some embodiments, the annotation prediction data of different annotators may be utilized to further optimize the representation model for subjective differences of different annotators. As an example, a fully connected layer is also provided after the dimension reduction layer; the full-connection layer is used for carrying out annotation prediction based on the expression characteristic results to obtain annotation prediction results corresponding to a plurality of annotators respectively; the labeling prediction result is used as a correction sample to correct the trained expression characterization model.
For example, as shown in fig. 2, a fully connected layer (Crowd Layer) may be built for each annotator, yielding the annotation prediction results (Annotator 1 ... Annotator N) corresponding to the N annotators. The input of each fully connected layer is the common expression feature result E_exp, and its output is the annotation prediction result of the corresponding annotator; during testing and prediction, only the expression feature result E_exp before the fully connected layers is used as the expression feature code.
Annotation prediction can be performed on the expression feature result through the per-annotator fully connected layers, so that the annotation prediction results corresponding to the plurality of annotators are obtained; in this way the subjective differences between different annotators can be accounted for, improving the accuracy of the trained model.
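A minimal sketch of such per-annotator heads follows, assuming a PyTorch style; the choice of a simple linear map per annotator and all names are assumptions for illustration, and these heads would only be used during training.

```python
# Sketch only: one fully connected head per annotator, all fed the shared 16-d feature E_exp.
import torch
import torch.nn as nn

class AnnotatorHeads(nn.Module):
    def __init__(self, num_annotators: int, feat_dim: int = 16):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_annotators)]
        )

    def forward(self, e_exp: torch.Tensor) -> list:
        # Returns one annotation-prediction feature per annotator; at test time only
        # the shared e_exp itself is kept as the expression feature code.
        return [head(e_exp) for head in self.heads]
```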
In some embodiments, in the process of predicting the labeling of the labeling person, the labeling result corresponding to each labeling person can be predicted based on the comparison result of the euclidean distance. As an example, the full connectivity layer is to: comparing Euclidean distances among a plurality of expression features corresponding to a plurality of sample labels marked by each marker for the same sample image, and obtaining marking prediction results respectively corresponding to each marker based on the comparison results of the Euclidean distances; wherein the plurality of sample tags includes a positive sample tag, a negative sample tag, and a reference sample tag.
It should be noted that, by comparing the Euclidean distances between the expression features corresponding to the several sample labels that each annotator assigned to the same sample image, the annotation prediction result corresponding to each annotator can be obtained from the comparison result.
By utilizing the scheme of independent prediction, comparison and correction of a plurality of annotators, a plurality of annotation data and sample labels can be effectively utilized, and labor cost is saved.
In some embodiments, in the process of further optimizing the expression characterization model, a loss function corresponding to the prediction labeling result can be utilized, and the trained expression characterization model can be further corrected and optimized through a gradient descent method. As one example, the prediction annotation result is used to: correcting the trained expression representation model by using a loss function corresponding to the prediction labeling result through a gradient descent method and iterating until the prediction deviation gradually converges, so that a corrected expression representation model can be obtained; the loss function is used for representing the difference between the predicted labeling result and the actual labeling result of the labeling person.
For example, as shown in fig. 4, during training, the annotators on each piece of data will output respective prediction annotation results, then a loss function can be calculated by using a plurality of prediction annotation results, further the trained expression characterization model is corrected by a gradient descent method and iterated continuously until the prediction deviation gradually converges, and finally the corrected expression characterization model can be obtained.
In this embodiment, the triplet loss (Triplet Loss) function is used as the training basis of the model, and an SGD optimizer is used for gradient calculation and parameter updates. A piece of triplet data (a_k, p_k, n_k) is first fed into the network, where a_k, p_k and n_k represent the Anchor, Positive and Negative samples labeled by annotator k; the expression feature codes corresponding to that annotator, E_exp(a_k), E_exp(p_k) and E_exp(n_k), are then obtained through the expression representation model, where E_exp(a_k) and E_exp(p_k) are relatively close in distance and both are relatively far from E_exp(n_k).
The triplet loss function is calculated as follows, where m represents the margin (interval):
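The formula itself is not reproduced in the source text; the expression below is the conventional triplet loss with margin m, written to match the notation above, and is given as an assumed reconstruction rather than the patent's verbatim formula.

```latex
L_{triplet} = \max\Big( \big\lVert E_{exp}(a_k) - E_{exp}(p_k) \big\rVert_2^{2}
            \;-\; \big\lVert E_{exp}(a_k) - E_{exp}(n_k) \big\rVert_2^{2} \;+\; m,\; 0 \Big)
```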
the trained expression representation model is corrected by a gradient descent method and iterated continuously until the prediction deviation gradually converges, the corrected expression representation model is obtained, and a more compact coding space can be realized by dimension reduction.
Fig. 5 is a flowchart of a facial expression representation method according to an embodiment of the present application. As shown in fig. 5, the method includes:
step S510, a face image to be characterized is acquired.
In this step, a face image to be characterized may be acquired and input into the characterization model.
Step S520, processing the facial image through the trained full-face representation model to obtain a full-face feature vector.
Illustratively, the face image is processed by using the trained full-face representation model, so that 512-dimensional full-face feature vectors can be obtained.
Step S530, processing the facial image through a preset identity characterization model to obtain an identity feature vector.
Illustratively, the face image is processed by using the trained preset identity characterization model, so that a 512-dimensional identity feature vector can be obtained.
Step S540, subtracting the identity feature vector from the full-face feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector.
In this embodiment, an expression feature vector with the identity information removed can be obtained by subtracting the identity feature vector from the full-face feature vector; that is, only the information related to the expression features is retained, the influence of personal identity information on the facial expression is removed, decoupling between the identity features and the expression features is realized, and the accuracy of the facial expression representation is improved.
The above steps are described in detail below.
In some embodiments, the expression feature vector may be dimension-reduced. As an example, the step S540 may include the following step:
and a step a), performing dimension reduction processing on the expression feature vector to obtain an expression feature result.
It should be noted that the dimension reduction layer can be used to map the expression feature vector from 512 dimensions to 16 dimensions, thereby obtaining the expression feature result. Performing dimension reduction on the expression feature vector relieves the pressure of processing sample data and further improves the robustness of the model.
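Putting the inference steps S510 to S540 together, the following is an end-to-end sketch in an assumed PyTorch style; the model objects and names are illustrative placeholders.

```python
# Sketch only: full-face feature minus identity feature, then dimension reduction, at inference time.
import torch

@torch.no_grad()
def represent_expression(face_image: torch.Tensor,
                         face_model,        # trained full-face characterization model
                         identity_model,    # preset / pre-trained identity characterization model
                         reduction_layer):  # trained dimension reduction layer
    v_face = face_model(face_image)         # 512-d full-face feature vector (step S520)
    v_id = identity_model(face_image)       # 512-d identity feature vector (step S530)
    v_exp = v_face - v_id                   # expression feature vector (step S540)
    return reduction_layer(v_exp)           # 16-d expression feature result
```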
The facial expression representation method provided by the embodiment of the application has the same technical characteristics as the training method of the expression representation model provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Fig. 6 provides a schematic structural diagram of a training device for expression characterization models. As shown in fig. 6, the training apparatus 600 of the expression characterization model includes:
a determining module 601, configured to determine a sample set; wherein each sample in the sample set comprises a sample image and a sample label;
the training module 602 is configured to train the expression characterization model to be trained by using the sample set, so as to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full face characterization sub-model and a trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full face characterization sub-model and the output of the trained identity characterization sub-model.
In some embodiments, the trained identity representation sub-model is the same model as the full-face representation sub-model to be trained.
In some embodiments, the trained expression characterization model further includes a dimension reduction layer;
the dimension reduction layer is used for carrying out dimension reduction processing on the difference value to obtain an expression characteristic result output by the trained expression characterization model.
In some embodiments, the fully connected neural network in the dimension reduction layer is used for dimension reduction processing of the difference value, and the dimension-reduced difference value is normalized through the two norms to obtain the expression characteristic result.
In some embodiments, the sample set comprises: a plurality of first sample groups, the first sample group of each group comprising a first reference sample, a first positive sample, and a first negative sample;
the sample set further comprises: a second sample group, the second sample group of each group comprising a first reference sample, a second positive sample;
the second positive samples are the positive samples closest to the first reference samples in the first positive samples in the plurality of groups.
In some embodiments, the second positive samples are positive samples which are closest to the first reference samples and are obtained after sequentially comparing two adjacent first positive samples in the plurality of first positive samples corresponding to the same first reference sample.
In some embodiments, a fully-connected layer is also provided after the dimension-reduction layer;
the full-connection layer is used for carrying out annotation prediction based on the expression characteristic results to obtain annotation prediction results corresponding to a plurality of annotators respectively;
the labeling prediction result is used as a correction sample to correct the trained expression characterization model.
In some embodiments, the fully connected layer is to: comparing Euclidean distances among a plurality of expression features corresponding to a plurality of sample labels marked by each marker for the same sample image, and obtaining marking prediction results respectively corresponding to each marker based on the comparison results of the Euclidean distances; wherein the plurality of sample tags includes a positive sample tag, a negative sample tag, and a reference sample tag.
In some embodiments, the prediction labeling result is used to: correcting the trained expression characterization model by using a loss function corresponding to the prediction labeling result through a gradient descent method and iterating until the prediction deviation gradually converges to obtain a corrected expression characterization model; the loss function is used for representing the difference between the predicted labeling result and the actual labeling result of the labeling person.
The training device of the expression characterization model provided by the embodiment of the application has the same technical characteristics as the training method of the expression characterization model and the facial expression characterization method provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
Fig. 7 provides a schematic structural view of an expression characterization apparatus for a face. As shown in fig. 7, the expression characterization apparatus 700 of a face includes:
an acquisition module 701, configured to acquire a face image to be characterized;
a first processing module 702, configured to process the facial image through the trained full-face representation model to obtain a full-face feature vector;
a second processing module 703, configured to process the facial image through a preset identity characterization model to obtain an identity feature vector;
and the subtracting module 704 is configured to subtract the full-face feature vector from the identity feature vector to obtain an expression feature vector, and obtain an expression feature result based on the expression feature vector.
In some embodiments, the subtraction module 704 is specifically configured to:
and performing dimension reduction processing on the expression feature vector to obtain an expression feature result.
The facial expression representation device provided by the embodiment of the application has the same technical characteristics as the facial expression representation method, the training method of the expression representation model and the training device of the expression representation model provided by the embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
The embodiment of the present application further provides a computer device, as shown in fig. 8, where the computer device 800 includes a processor 802 and a memory 801, where the memory stores a computer program that can be run on the processor, and the processor implements the steps of the method provided in the above embodiment when executing the computer program.
Referring to fig. 8, the computer apparatus further includes: a bus 803 and a communication interface 804, the processor 802, the communication interface 804, and the memory 801 being connected by the bus 803; the processor 802 is configured to execute executable modules, such as computer programs, stored in the memory 801.
The memory 801 may include a high-speed random access memory (Random Access Memory, simply referred to as RAM), and may further include a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. Communication connection between the system network element and at least one other network element is achieved through at least one communication interface 804 (which may be wired or wireless), and the internet, wide area network, local network, metropolitan area network, etc. may be used.
Bus 803 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 8, but not only one bus or type of bus.
The memory 801 is configured to store a program, and the processor 802 executes the program after receiving an execution instruction, and a method executed by the apparatus for defining a process according to any of the foregoing embodiments of the present application may be applied to the processor 802, or implemented by the processor 802.
The processor 802 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or instructions in software in the processor 802. The processor 802 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but may also be a digital signal processor (Digital Signal Processing, DSP for short), application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA for short), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 801 and the processor 802 reads the information in the memory 801 and in combination with its hardware performs the steps of the above method.
The embodiment of the application also provides a computer readable storage medium, which stores machine executable instructions that, when invoked and executed by a processor, cause the processor to execute the steps of the training method of the expression characterization model and the expression characterization method of the face.
The training device of the expression representation model and the expression representation device of the face provided by the embodiment of the application can be specific hardware on equipment or software or firmware installed on the equipment. The device provided by the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brevity, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned. It will be clear to those skilled in the art that, for convenience and brevity, the specific operation of the system, apparatus and unit described above may refer to the corresponding process in the above method embodiment, which is not described in detail herein.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
As another example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or partly in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the training method of the expression characterization model and the expression characterization method of the face according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should be noted that: like reference numerals and letters in the following figures denote like items, and thus once an item is defined in one figure, no further definition or explanation of it is required in the following figures, and furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above examples are only specific embodiments of the present application, intended to illustrate rather than limit its technical solutions, and the scope of protection of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art should understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions of some of the technical features, within the technical scope of the present disclosure; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from their spirit and are intended to be encompassed within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for training an expression characterization model, the method comprising:
determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full-face characterization sub-model and the trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model;
the sample set includes: a plurality of first sample groups, the first sample group of each group comprising a first reference sample, a first positive sample, and a first negative sample;
the sample set further comprises: a second sample group, the second sample group of each group comprising the first reference sample, a second positive sample;
wherein the second positive samples are positive samples closest to the first reference sample among the plurality of first positive samples in a plurality of groups;
And the second positive samples are positive samples which are obtained by sequentially comparing two adjacent first positive samples in a plurality of first positive samples corresponding to the same first reference sample and are closest to the first reference sample.
2. The method of claim 1, wherein the trained identity representation sub-model and the full-face representation sub-model to be trained are the same model.
3. The method of claim 1, wherein the trained expression characterization model further comprises a dimension reduction layer;
the dimension reduction layer is used for performing dimension reduction processing on the difference value to obtain an expression feature result output by the trained expression characterization model.
4. The method of claim 3, wherein the fully connected neural network in the dimension reduction layer is configured to perform dimension reduction on the difference value and to normalize the dimension-reduced difference by a two-norm, so as to obtain the expression feature result.
5. The method of claim 3, wherein a fully connected layer is further provided after the dimension reduction layer;
the fully connected layer is used for performing annotation prediction based on the expression feature result to obtain annotation prediction results respectively corresponding to a plurality of annotators;
and the annotation prediction results are used as correction samples to correct the trained expression characterization model.
6. The method of claim 5, wherein the fully connected layer is configured to: compare Euclidean distances among a plurality of expression features corresponding to a plurality of sample labels annotated by each annotator for the same sample image, and obtain the annotation prediction result corresponding to each annotator based on the comparison results of the Euclidean distances; wherein the plurality of sample labels comprise a positive sample label, a negative sample label, and a reference sample label.
7. The method of claim 5, wherein the annotation prediction result is used to: correct the trained expression characterization model by a gradient descent method using a loss function corresponding to the annotation prediction result, iterating until the prediction deviation converges, so as to obtain a corrected expression characterization model;
the loss function is used for representing the gap between the predicted annotation result and the actual annotation result of the annotator.
8. A method of expression characterization of a face, the method comprising:
acquiring a face image to be characterized;
processing the facial image through the trained full-face representation model to obtain a full-face feature vector;
processing the facial image through a preset identity characterization model to obtain an identity feature vector;
computing the difference between the full-face feature vector and the identity feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector;
the sample set for training the full-face representation model comprises: a plurality of first sample groups, each first sample group comprising a first reference sample, a first positive sample, and a first negative sample; the sample set further comprises: second sample groups, each second sample group comprising the first reference sample and a second positive sample; wherein the second positive sample is the positive sample closest to the first reference sample among the plurality of first positive samples across the plurality of groups; and the second positive sample is obtained by sequentially comparing two adjacent first positive samples among the plurality of first positive samples corresponding to the same first reference sample, and is the first positive sample closest to the first reference sample.
9. The method of claim 8, wherein the step of obtaining an expression feature result based on the expression feature vector comprises:
performing dimension reduction processing on the expression feature vector to obtain the expression feature result.
10. A training device for expression characterization models, the device comprising:
a determining module for determining a sample set; wherein each sample in the set of samples comprises a sample image and a sample label;
the training module is used for training the expression characterization model to be trained by using the sample set to obtain a trained expression characterization model; the expression characterization model to be trained comprises a full-face characterization sub-model to be trained and a trained identity characterization sub-model, the trained expression characterization model comprises a trained full-face characterization sub-model and the trained identity characterization sub-model, and the output of the trained expression characterization model is determined based on the difference value between the output of the trained full-face characterization sub-model and the output of the trained identity characterization sub-model;
the sample set includes: a plurality of first sample groups, each first sample group comprising a first reference sample, a first positive sample, and a first negative sample;
the sample set further comprises: second sample groups, each second sample group comprising the first reference sample and a second positive sample;
wherein the second positive sample is the positive sample closest to the first reference sample among the plurality of first positive samples across the plurality of groups;
and the second positive sample is obtained by sequentially comparing two adjacent first positive samples among the plurality of first positive samples corresponding to the same first reference sample, and is the first positive sample closest to the first reference sample.
11. An expression characterization apparatus for a face, the apparatus comprising:
the acquisition module is used for acquiring the face image to be characterized;
the first processing module is used for processing the facial image through the trained full-face representation model to obtain a full-face feature vector;
the second processing module is used for processing the facial image through a preset identity characterization model to obtain an identity feature vector;
the subtracting module is used for computing the difference between the full-face feature vector and the identity feature vector to obtain an expression feature vector, and obtaining an expression feature result based on the expression feature vector;
the sample set for training the full-face representation model comprises: a plurality of first sample groups, each first sample group comprising a first reference sample, a first positive sample, and a first negative sample; the sample set further comprises: second sample groups, each second sample group comprising the first reference sample and a second positive sample; wherein the second positive sample is the positive sample closest to the first reference sample among the plurality of first positive samples across the plurality of groups; and the second positive sample is obtained by sequentially comparing two adjacent first positive samples among the plurality of first positive samples corresponding to the same first reference sample, and is the first positive sample closest to the first reference sample.
12. A computer device comprising a memory and a processor, the memory having stored therein a computer program executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
13. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to perform the method of any one of claims 1 to 9.
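The following NumPy sketch illustrates how the second sample group described in claims 1, 10, and 11 could be assembled; it is an illustration only and not part of the claims. The function name build_second_sample_group, the representation of samples as feature vectors, and the use of Euclidean distance as the "closeness" measure are all assumptions; the claims do not fix these details.

```python
# Illustrative sketch only: assumes each sample is represented by a feature vector
# and that "closest" is measured by Euclidean distance (both are assumptions).
import numpy as np

def build_second_sample_group(first_reference, first_positives, distance=None):
    """Sequentially compare adjacent first positive samples belonging to the same
    first reference sample and keep the one closest to that reference sample,
    yielding the second sample group (first reference sample, second positive sample)."""
    if distance is None:
        distance = lambda a, b: float(np.linalg.norm(a - b))
    second_positive = first_positives[0]
    for candidate in first_positives[1:]:
        # pairwise comparison of the current best positive with the next adjacent positive
        if distance(candidate, first_reference) < distance(second_positive, first_reference):
            second_positive = candidate
    return first_reference, second_positive

# Example usage with toy 4-dimensional features:
# ref = np.array([0.0, 0.1, 0.0, 0.2])
# positives = [np.array([0.0, 0.5, 0.1, 0.2]), np.array([0.0, 0.1, 0.0, 0.25])]
# second_group = build_second_sample_group(ref, positives)
```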
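The next PyTorch sketch illustrates the structure shared by claims 1, 3, 4, 8, and 9: the expression feature is taken as the difference between a full-face feature and an identity feature, reduced in dimension by a fully connected layer, and normalized by its two-norm. The class name ExpressionEmbedder, the backbone interfaces, the frozen identity sub-model, and the feature dimensions (512 in, 16 out) are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionEmbedder(nn.Module):
    """Hypothetical wiring: expression feature = full-face feature - identity feature,
    followed by a fully connected dimension-reduction layer and two-norm (L2) normalization."""

    def __init__(self, full_face_model: nn.Module, identity_model: nn.Module,
                 feat_dim: int = 512, expr_dim: int = 16):
        super().__init__()
        self.full_face_model = full_face_model        # full-face characterization sub-model (trainable)
        self.identity_model = identity_model          # pre-trained identity characterization sub-model
        for p in self.identity_model.parameters():    # the identity sub-model stays frozen (assumption)
            p.requires_grad = False
        self.reduce = nn.Linear(feat_dim, expr_dim)   # dimension-reduction layer

    def forward(self, face_image: torch.Tensor) -> torch.Tensor:
        full_face_vec = self.full_face_model(face_image)
        identity_vec = self.identity_model(face_image)
        diff = full_face_vec - identity_vec           # difference value carrying expression information
        expr = self.reduce(diff)                      # dimension reduction
        return F.normalize(expr, p=2, dim=-1)         # two-norm normalization -> expression feature result
```

Freezing the identity sub-model reflects claim 1, where only the full-face characterization sub-model is described as "to be trained"; whether it is frozen in practice is not stated explicitly and is assumed here.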
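Claims 5 through 7 describe correcting the trained model with per-annotator annotation predictions, Euclidean distance comparisons, and gradient descent. The sketch below is one hedged reading of that procedure; the per-annotator fully connected heads, the cross-entropy loss, the SGD optimizer, and the label encoding (0 = positive sample closer, 1 = negative sample closer) are all assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F

def annotator_correction_loss(embedder, annotator_heads, anchor_img, pos_img, neg_img, annotator_labels):
    """Sketch of the correction step: each annotator is modelled by a small fully connected
    head; the predicted annotation is derived from Euclidean distances between expression
    features, and the loss measures the gap to that annotator's actual annotation."""
    a = embedder(anchor_img)   # reference-sample expression features, shape (batch, dim)
    p = embedder(pos_img)      # positive-sample expression features
    n = embedder(neg_img)      # negative-sample expression features
    losses = []
    for k, head in enumerate(annotator_heads):
        d_pos = torch.norm(head(a) - head(p), dim=-1)    # Euclidean distance: reference <-> positive
        d_neg = torch.norm(head(a) - head(n), dim=-1)    # Euclidean distance: reference <-> negative
        logits = torch.stack([-d_pos, -d_neg], dim=-1)   # smaller distance -> higher score
        target = annotator_labels[:, k]                  # 0: annotator called the positive closer, 1: the negative
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()

# Gradient-descent correction loop (optimizer and learning rate are assumptions):
# params = list(embedder.parameters()) + [q for h in annotator_heads for q in h.parameters()]
# optimizer = torch.optim.SGD(params, lr=1e-3)
# for anchor_img, pos_img, neg_img, annotator_labels in loader:
#     loss = annotator_correction_loss(embedder, annotator_heads, anchor_img, pos_img, neg_img, annotator_labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()   # iterate until the loss converges
```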
CN202110166517.8A 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device Active CN112801006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110166517.8A CN112801006B (en) 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110166517.8A CN112801006B (en) 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device

Publications (2)

Publication Number Publication Date
CN112801006A CN112801006A (en) 2021-05-14
CN112801006B true CN112801006B (en) 2023-09-05

Family

ID=75814585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110166517.8A Active CN112801006B (en) 2021-02-05 2021-02-05 Training method of expression representation model, and facial expression representation method and device

Country Status (1)

Country Link
CN (1) CN112801006B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269145B (en) * 2021-06-22 2023-07-25 中国平安人寿保险股份有限公司 Training method, device, equipment and storage medium of expression recognition model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685106A (en) * 2018-11-19 2019-04-26 深圳博为教育科技有限公司 A kind of image-recognizing method, face Work attendance method, device and system
CN110765873A (en) * 2019-09-19 2020-02-07 华中师范大学 Facial expression recognition method and device based on expression intensity label distribution
CN111259745A (en) * 2020-01-09 2020-06-09 西安交通大学 3D face decoupling representation learning method based on distribution independence
CN112052789A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Face recognition method and device, electronic equipment and storage medium
CN112200236A (en) * 2020-09-30 2021-01-08 网易(杭州)网络有限公司 Training method of face parameter recognition model and face parameter recognition method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685106A (en) * 2018-11-19 2019-04-26 深圳博为教育科技有限公司 A kind of image-recognizing method, face Work attendance method, device and system
CN110765873A (en) * 2019-09-19 2020-02-07 华中师范大学 Facial expression recognition method and device based on expression intensity label distribution
CN111259745A (en) * 2020-01-09 2020-06-09 西安交通大学 3D face decoupling representation learning method based on distribution independence
CN112052789A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Face recognition method and device, electronic equipment and storage medium
CN112200236A (en) * 2020-09-30 2021-01-08 网易(杭州)网络有限公司 Training method of face parameter recognition model and face parameter recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Disentangled Representation Learning for 3D Face Shape; Zi-Hang Jiang, et al.; CVPR 2019; pp. 11949-11958 *

Also Published As

Publication number Publication date
CN112801006A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
US11170257B2 (en) Image captioning with weakly-supervised attention penalty
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
WO2019075967A1 (en) Enterprise name recognition method, electronic device, and computer-readable storage medium
CN110851641A (en) Cross-modal retrieval method and device and readable storage medium
CN116015837A (en) Intrusion detection method and system for computer network information security
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN112801006B (en) Training method of expression representation model, and facial expression representation method and device
CN116127019B (en) Dynamic parameter and visual model generation WEB 2D automatic modeling engine system
CN111967383A (en) Age estimation method, and training method and device of age estimation model
CN111046786A (en) Generation method and device of age estimation neural network and electronic equipment
CN116933137A (en) Electroencephalogram cross-tested emotion recognition method, device, equipment and medium
CN109657710B (en) Data screening method and device, server and storage medium
CN113688243B (en) Method, device, equipment and storage medium for labeling entities in sentences
CN115599392A (en) Code processing method, device, medium and electronic equipment
CN114067362A (en) Sign language recognition method, device, equipment and medium based on neural network model
CN114120074A (en) Training method and training device of image recognition model based on semantic enhancement
CN114155387A (en) Similarity Logo discovery method by utilizing Logo mark graphic and text information
US20230368571A1 (en) Training Method of Facial Expression Embedding Model, Facial Expression Embedding Method and Facial Expression Embedding Device
CN109657623B (en) Face image similarity calculation method and device, computer device and computer readable storage medium
CN109558582B (en) Visual angle-based sentence emotion analysis method and device
CN112287723A (en) In-vivo detection method and device based on deep learning and storage medium
CN111860662B (en) Training method and device, application method and device of similarity detection model
CN111325068A (en) Video description method and device based on convolutional neural network
CN116049446B (en) Event extraction method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant