CN114998960A - Expression recognition method based on positive and negative sample comparison learning - Google Patents

Expression recognition method based on positive and negative sample comparison learning Download PDF

Info

Publication number
CN114998960A
CN202210595007.7A CN202210595007A CN114998960A
Authority
CN
China
Prior art keywords
sample
samples
positive
negative
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210595007.7A
Other languages
Chinese (zh)
Other versions
CN114998960B (en)
Inventor
文贵华
诸俊浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210595007.7A priority Critical patent/CN114998960B/en
Publication of CN114998960A publication Critical patent/CN114998960A/en
Application granted granted Critical
Publication of CN114998960B publication Critical patent/CN114998960B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Abstract

The invention discloses an expression recognition method based on positive and negative sample comparison learning: S1, collecting a facial image; S2, inputting the facial image into a trained machine learning model to identify the expression in the facial image; S3, outputting the expression type of the facial image. During training of the machine learning model, negative samples are introduced into the structural-similarity contrastive learning method so that the distance between positive and negative samples within a batch is increased. At the same time, considering the imbalance between the numbers of positive and negative samples in a batch, hard-sample mining is performed on the negative samples, so that the model strengthens the structural-similarity constraint on the positive samples while suppressing the negative samples most similar to them. This brings the effect of contrastive learning into full play and improves the accuracy of expression recognition.

Description

Expression recognition method based on positive and negative sample comparison learning
Technical Field
The invention relates to the technical field of expression recognition, in particular to an expression recognition method based on positive and negative sample comparison learning.
Background
Facial expression recognition technology recognizes the emotion of a person in an image by computer and has important practical value, including in intelligent healthcare, safe-driving monitoring, online education, psychological counseling and entertainment interaction. However, in terms of recognition accuracy in real-world scenes, facial expression recognition has not yet reached the level that humans can achieve.
The expression recognition task in natural scenes is more challenging than in controlled laboratory scenes, because images captured under natural conditions contain larger differences in environmental factors such as illumination and resolution, as well as more extensive pose changes. Secondly, because the acquisition conditions are not uniform, expression recognition in natural scenes exhibits large intra-class differences and large inter-class similarity. For example, two images may both show a "surprised" expression and yet look very different because of differences in shooting angle, illumination and the way the person expresses the emotion; conversely, images from different classes can have similar appearances. As another example, a gentle smile and a broad laugh captured under different acquisition conditions both belong to the category "happy" even though their images differ greatly in appearance, whereas images of the two different categories "disgusted" and "fear" may differ only slightly in appearance. Finally, expression datasets collected in natural scenes usually require manual annotation, which consumes a large amount of labor; more importantly, because of human subjectivity, different annotators may make different judgments for the same image, so the expression labels carry considerable ambiguity.
The key to facial expression recognition in natural scenes is to extract expression-discriminative features, and many methods that learn and optimize local features, as well as metric-learning-based methods, have achieved good results. Although many methods have been proposed to address inter-class similarity, large intra-class variance and label ambiguity, they are all trained in a traditional supervised manner. In natural scenes the label-ambiguity problem of the expression recognition task cannot be ignored: simply relying on the original labels can introduce erroneous information, and ambiguous labels can mislead the network model when learning expression features, so supervised learning alone cannot fully mine fine-grained expression-discriminative features.
Therefore, how to provide an expression recognition method based on positive and negative sample comparison learning that can effectively improve expression recognition accuracy is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the invention provides an expression recognition method based on positive and negative sample comparison learning, aiming at more accurate expression recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
An expression recognition method based on positive and negative sample comparison learning comprises the following steps:
S1, collecting a face image;
S2, inputting the facial image into a trained machine learning model to identify the expression in the facial image;
S3, outputting the expression type of the facial image;
the training method of the machine learning model in S2 specifically includes:
S21, generating a plurality of new samples for each sample;
for a batch of samples D = {(x_i, y_i) | i = 1, 2, …, N}, strong data enhancement yields D' = {(x'_i, y_i) | i = 1, 2, …, N} and weak data enhancement yields D'' = {(x''_i, y_i) | i = 1, 2, …, N}, where x_i is a face image and y_i is the expression label of x_i; the current weakly enhanced sample x''_i and its strongly enhanced sample x'_i are taken as a positive sample pair; the enhanced samples of all other samples belonging to a different class from the current sample x''_i are taken as negative sample pairs, and the negative sample pair set of the i-th sample is denoted as Ω_i = {x'_j, x''_j | y_j ≠ y_i};
S22, inputting the samples into a deep neural network and extracting the features of each sample;
for weak enhancement sample x ″ i Sum strong enhancement sample x ″ i After feature extraction, respectively obtaining feature representations: u ═ U i |i=1,2,……HW},V={v j 1,2, … … HW }, where u i ,v j As feature vector representations of the feed point and the destination point, respectively;
S23, calculating a loss function of the deep neural network;
the negative-sample similarity set of the i-th sample is defined as S_i, containing the structural similarity s_{i,j} between sample i and each enhanced sample j in Ω_i; the size of this set is K_i;
The total loss function is:
L = L_cls + β·L_hard
where β is a hyperparameter serving as a balance coefficient; L_cls is the softmax classification loss function, and L_hard is the adaptive weighting function for the negative samples (its formula is shown in the accompanying figure), where N is the number of samples in the batch;
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j=1}^{K_i} exp(γ·s_{i,j})
is the structural-similarity-to-probability conversion, where γ is a scaling coefficient and s_{i,j} represents the structural similarity between the i-th sample and the j-th enhanced sample; the probabilities sum to 1;
and S24, optimizing parameters of the deep neural network according to the loss function.
Preferably, in S23, the structural similarity s_{i,j} between the i-th feature representation and the j-th enhanced feature is specifically calculated as follows:
s_{i,j} = -s(U_i, V_j)
s(U_i, V_j) = Σ_{i=1}^{HW} Σ_{j=1}^{HW} c_{ij}·f*_{ij}
where H and W are the height and width of the convolutional feature maps U and V respectively, c_{ij} represents the transmission cost from the source (supply-side) node i to the destination node j, U_i represents the convolutional feature map of the i-th sample, V_j represents the convolutional feature map of the j-th sample, and f*_{ij} represents the optimal transportation scheme, obtained by solving:
min_f Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij}·f_{ij}
subject to f_{ij} ≥ 0, i = 1, …, m, j = 1, …, k
Σ_{j=1}^{k} f_{ij} = w^U_i, i = 1, …, m
Σ_{i=1}^{m} f_{ij} = w^V_j, j = 1, …, k
where w^U_i and w^V_j represent the local weights of the source node and the destination node, calculated by the following formulas:
w^U_i = max{0, u_i^T · G_avg(V)}
w^V_j = max{0, v_j^T · G_avg(U)}
where G_avg represents a global average pooling operation.
Preferably, c_{ij} is calculated as follows:
c_{ij} = 1 - (u_i^T·v_j) / (||u_i||·||v_j||)
where u_i and v_j are the feature vector representations of the supply point and the destination point.
Preferably, in S23, the relative magnitude p_{i,k} of the structural similarity between the i-th sample and the k-th enhanced sample is specifically calculated as follows:
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j=1}^{K_i} exp(γ·s_{i,j})
The larger the value of p_{i,k}, the greater the learning weight given to the corresponding part of the samples.
Compared with the prior art, the invention provides an expression recognition method based on positive and negative sample comparison learning with the following beneficial effects:
1. By applying the EMD distance to the feature maps, the method can guide the model to attend to expression-related regions and to suppress expression-irrelevant regions such as noisy background, so that expression regions are attended to effectively.
2. Following the idea of self-supervision, the method designs a structural-similarity constraint loss and uses contrastive learning on positive and negative pairs of the enhanced images; this constraint is optimized together with the classification loss without depending on the original labels, so that more generalizable expression features are learned.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flow chart of an expression recognition method based on positive and negative sample comparison learning according to the present invention;
FIG. 2 is a schematic diagram of a training method of a machine learning model in an expression recognition method based on positive and negative sample comparison learning according to the present invention;
fig. 3 is a schematic diagram of adaptive weighting of negative samples in the expression recognition method based on positive and negative sample comparison learning according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses an expression recognition method based on positive and negative sample comparison learning, which comprises the following steps:
S1, collecting a face image;
S2, inputting the facial image into a trained machine learning model to identify the expression in the facial image;
S3, outputting the expression type of the facial image;
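As an illustration of steps S1 to S3, a minimal PyTorch inference sketch is given below. The expression class list, the checkpoint handling and the preprocessing sizes are illustrative assumptions (the 256/224 sizes follow the embodiment described further down); this is a sketch, not the definitive implementation of the invention.

    import torch
    from torchvision import transforms
    from PIL import Image

    # Hypothetical expression label set; the patent does not fix the class names.
    EXPRESSIONS = ["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]

    preprocess = transforms.Compose([
        transforms.Resize((256, 256)),   # scale to 256 x 256 as in the embodiment below
        transforms.CenterCrop(224),      # test-time center crop of 224 x 224
        transforms.ToTensor(),
    ])

    def recognize_expression(model: torch.nn.Module, image_path: str) -> str:
        """S1: collect a face image; S2: run the trained model; S3: output the expression type."""
        face = Image.open(image_path).convert("RGB")
        x = preprocess(face).unsqueeze(0)        # add a batch dimension
        model.eval()
        with torch.no_grad():
            logits = model(x)                    # classifier scores over the expression classes
        return EXPRESSIONS[logits.argmax(dim=1).item()]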
the training method of the machine learning model in the S2 specifically comprises the following steps:
S21, generating a plurality of new samples for each sample;
for a batch of samples D = {(x_i, y_i) | i = 1, 2, …, N}, strong data enhancement yields D' = {(x'_i, y_i) | i = 1, 2, …, N} and weak data enhancement yields D'' = {(x''_i, y_i) | i = 1, 2, …, N}, where x_i is a face image and y_i is the expression label of x_i; the current weakly enhanced sample x''_i and its strongly enhanced sample x'_i are taken as a positive sample pair; the enhanced samples of all other samples belonging to a different class from the current sample x''_i are taken as negative sample pairs, and the negative sample pair set of the i-th sample is denoted as Ω_i = {x'_j, x''_j | y_j ≠ y_i};
Weak enhancement and strong enhancement produce images from two different views. Weak enhancement applies only a small degree of scaling and flipping to the image, while strong enhancement additionally includes rotation and color transformation.
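One possible torchvision realization of the two enhancement views is sketched below; the 0.5 flip probability and the 0.2 color-jitter probability follow the embodiment described later, while the crop scales, rotation range and jitter strengths are assumptions.

    from torchvision import transforms

    # Weak enhancement: only a small degree of scaling/cropping and flipping.
    weak_augment = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # assumed mild zoom range
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
    ])

    # Strong enhancement: the weak operations plus rotation and color transformation.
    strong_augment = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # assumed wider zoom range
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=15),                  # assumed rotation range
        transforms.RandomApply(
            [transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)],
            p=0.2),
        transforms.ToTensor(),
    ])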
S22, inputting the samples into a deep neural network and extracting the features of each sample;
for the weakly enhanced sample x''_i and the strongly enhanced sample x'_i, after feature extraction the feature representations U = {u_i | i = 1, 2, …, HW} and V = {v_j | j = 1, 2, …, HW} are obtained respectively, where u_i and v_j are the feature vector representations of the supply point and the destination point, respectively;
S23, calculating a loss function of the deep neural network;
the negative-sample similarity set of the i-th sample is defined as S_i, containing the structural similarity s_{i,j} between sample i and each enhanced sample j in Ω_i; the size of this set is K_i;
The total loss function is:
L = L_cls + β·L_hard
where β is a hyperparameter serving as a balance coefficient; L_cls is the softmax classification loss function, and L_hard is the adaptive weighting function for the negative samples (its formula is shown in the accompanying figure), where N is the number of samples in the batch;
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j=1}^{K_i} exp(γ·s_{i,j})
is the structural-similarity-to-probability conversion, where γ is a scaling coefficient that controls how balanced the probabilities are across samples: the larger γ is, the larger the differences between the classification probabilities, and the smaller γ is, the closer the probabilities of the samples are to one another; when γ approaches zero, the probability distribution degenerates to a uniform distribution (a short numerical sketch of this behaviour follows step S24). Here s_{i,j} represents the structural similarity between the i-th sample and the j-th enhanced sample, and the probabilities sum to 1.
And S24, optimizing parameters of the deep neural network according to the loss function.
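The following short numerical sketch, with made-up similarity scores, illustrates the behavior of the scaling coefficient γ described above: γ near zero gives a uniform distribution, while a larger γ concentrates the probability mass on the most similar (hardest) negative.

    import torch

    sims = torch.tensor([0.9, 0.5, 0.1])        # made-up structural similarities of three negatives

    for gamma in (0.0, 1.0, 5.0):
        p = torch.softmax(gamma * sims, dim=0)  # structural-similarity-to-probability conversion
        print(f"gamma={gamma}: {p.tolist()}")
    # gamma=0.0 gives [1/3, 1/3, 1/3]; gamma=5.0 puts most of the weight on the hardest negative.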
In order to further implement the above technical solution, in S23 the structural similarity s_{i,j} between the i-th feature representation and the j-th enhanced feature is specifically calculated as follows:
s_{i,j} = -s(U_i, V_j)
s(U_i, V_j) = Σ_{i=1}^{HW} Σ_{j=1}^{HW} c_{ij}·f*_{ij}
where H and W are the height and width of the convolutional feature maps U and V respectively, c_{ij} represents the transmission cost from the source (supply-side) node i to the destination node j, U_i represents the convolutional feature map of the i-th sample, V_j represents the convolutional feature map of the j-th sample, and f*_{ij} represents the optimal transportation scheme, obtained by solving:
min_f Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij}·f_{ij}
subject to f_{ij} ≥ 0, i = 1, …, m, j = 1, …, k
Σ_{j=1}^{k} f_{ij} = w^U_i, i = 1, …, m
Σ_{i=1}^{m} f_{ij} = w^V_j, j = 1, …, k
where w^U_i and w^V_j represent the local weights of the source node and the destination node, calculated by the following formulas:
w^U_i = max{0, u_i^T · G_avg(V)}
w^V_j = max{0, v_j^T · G_avg(U)}
where G_avg represents a global average pooling operation.
In order to further implement the above technical solution, c_{ij} is calculated as follows:
c_{ij} = 1 - (u_i^T·v_j) / (||u_i||·||v_j||)
where u_i and v_j are the feature vector representations of the supply point and the destination point.
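A possible Python realization of this structural-similarity computation is sketched below, using OpenCV's EMD solver with a user-defined cost matrix (the embodiment mentions computing the optimal transport solution with an OpenCV library function). The local-weight normalization and the fallback for degenerate weights are assumptions, since the corresponding formulas appear only as figures in the original publication.

    import cv2
    import numpy as np

    def structural_similarity(U: np.ndarray, V: np.ndarray) -> float:
        """Return s_{i,j} = -s(U_i, V_j) for two feature maps flattened to shape (HW, C)."""
        # Cosine-based transmission cost c_{ij} = 1 - cos(u_i, v_j).
        Un = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-8)
        Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-8)
        cost = (1.0 - Un @ Vn.T).astype(np.float32)

        # Cross-reference local weights: relevance of each location to the other map's
        # globally average-pooled vector, clipped at zero (normalization is an assumption).
        w_u = np.maximum(U @ V.mean(axis=0), 0.0)
        w_v = np.maximum(V @ U.mean(axis=0), 0.0)
        if w_u.sum() <= 0:
            w_u = np.ones_like(w_u)              # fall back to uniform weights
        if w_v.sum() <= 0:
            w_v = np.ones_like(w_v)
        w_u = (w_u / w_u.sum()).astype(np.float32)
        w_v = (w_v / w_v.sum()).astype(np.float32)

        # Optimal transport cost via OpenCV's EMD with the user-defined cost matrix;
        # a weights-only (single column) signature is allowed when a cost matrix is given.
        emd, _, _ = cv2.EMD(w_u.reshape(-1, 1), w_v.reshape(-1, 1), cv2.DIST_USER, cost=cost)

        # Negated optimal transport cost, so that more similar feature maps score higher.
        return -float(emd)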
In order to further implement the above technical solution, in S23 the relative magnitude p_{i,k} of the structural similarity between the i-th sample and the k-th enhanced sample is specifically calculated as follows:
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j=1}^{K_i} exp(γ·s_{i,j})
The larger the value of p_{i,k}, the greater the learning weight given to the corresponding part of the samples.
p_{i,k} reflects the relative magnitude of the structural similarity between the i-th sample and the k-th enhanced sample, i.e. how similar the features of the two views are. The larger p_{i,k} is, the more similar the sample pair, the harder the enhanced sample is to distinguish, and the greater the learning weight the model should give to this portion of the samples.
It should be noted that, for a sample x_i, its positive sample is defined as its own enhanced view, i.e. the weakly enhanced sample x''_i and the strongly enhanced sample x'_i form a positive pair. Negative samples should keep a large structural-similarity difference from the current sample; however, a batch may contain samples originally labeled with the same class as the current sample, and such samples may have high structural similarity, so treating their enhanced samples as negatives would introduce erroneous information. For correctness, the invention therefore takes the negative samples to be the enhanced samples of the other classes, with the set defined as Ω_i, i.e. the enhanced samples of all samples belonging to a different class from the current sample. This guarantees the correctness of the negative samples to a certain extent.
Taking the contrastive learning of negative samples into account, the loss function is calculated as described above, with adaptive weights added for the difficult negative samples. Sample difficulty is measured as follows: each sample is regarded as a separate category, and each negative sample is converted into a category probability through a softmax function according to its structural similarity score.
In this embodiment, β is a balance coefficient that needs to be tuned as a hyperparameter; here β takes the value 0.6.
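Putting the preceding pieces together, a hedged sketch of the loss computation follows. Because the L_hard formula itself is only shown as a figure in the original publication, the exact form below (a p_{i,k}-weighted sum of negative-pair structural similarities added to the softmax classification loss) is an assumption consistent with the surrounding description; β = 0.6 follows this embodiment, while the default γ value is an assumption.

    import torch
    import torch.nn.functional as F

    def contrastive_expression_loss(logits, labels, sim_matrix, gamma=2.0, beta=0.6):
        """Assumed form of L = L_cls + beta * L_hard.

        logits:     (N, num_classes) classifier outputs for the weakly enhanced samples
        labels:     (N,) expression labels of the batch
        sim_matrix: (N, N) structural similarities s_{i,j} between weakly enhanced sample i
                    and strongly enhanced sample j (e.g. from the EMD sketch above)
        """
        # Softmax classification loss on the original labels.
        l_cls = F.cross_entropy(logits, labels)

        # Negative pairs: enhanced samples whose class differs from the current sample's.
        neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)            # (N, N) boolean

        # Structural-similarity-to-probability conversion over each row's negatives;
        # harder (more similar) negatives receive larger adaptive weights p_{i,k}.
        scaled = (gamma * sim_matrix).masked_fill(~neg_mask, -1e9)       # exclude non-negative pairs
        p = torch.softmax(scaled, dim=1)

        # Assumed L_hard: weighted mean of negative-pair similarities, minimized so that the
        # hardest negatives are pushed furthest away from the current sample.
        l_hard = (p * sim_matrix.masked_fill(~neg_mask, 0.0)).sum(dim=1).mean()

        return l_cls + beta * l_hard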
The deep neural network model of this embodiment runs on an NVIDIA Titan 3090 GPU server with the PyTorch (v1.7) deep learning framework. All images are scaled to 256 × 256 for a ResNet-18-based backbone network. During training the input image is cropped to 224 × 224 at a random position, and during testing a 224 × 224 crop is taken at the center. The data enhancement stage uses weak enhancement and strong enhancement: under weak enhancement the image is flipped with a probability of 0.5; under strong enhancement, random color jitter is applied with a probability of 0.2 in addition to the weak enhancement.
Because the structural-similarity computation consumes a large amount of GPU memory, especially in the step that computes the structural similarity between every pair of samples in the batch, the batch size is set to 32. The model is optimized with stochastic gradient descent with momentum, using a learning rate of 0.01, momentum of 0.9 and a weight decay of 0.0001. The EMD distance module needs to compute the corresponding optimal transport solution, which is calculated with an OpenCV library function.
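For completeness, a sketch of the optimization setup of this embodiment (ResNet-18 backbone, batch size 32, SGD with momentum 0.9, learning rate 0.01, weight decay 0.0001), reusing the loss sketch above. How the two enhanced views are batched, the hypothetical compute_sim_matrix helper that produces the pairwise structural similarities, and the 7-class head are assumptions.

    import torch
    import torchvision

    # ResNet-18 backbone with an expression classification head (7 classes assumed).
    model = torchvision.models.resnet18(num_classes=7)

    optimizer = torch.optim.SGD(model.parameters(),
                                lr=0.01, momentum=0.9, weight_decay=0.0001)

    def train_one_epoch(train_loader, compute_sim_matrix, beta=0.6):
        """Assumed loop skeleton: train_loader yields (weak, strong, label) batches of size 32
        built with the weak/strong enhancements sketched earlier; compute_sim_matrix is a
        hypothetical helper returning the (N, N) pairwise structural similarities."""
        model.train()
        for weak, strong, labels in train_loader:
            logits = model(weak)                               # classify the weak view
            sim = compute_sim_matrix(model, weak, strong)      # pairwise EMD-based similarities
            loss = contrastive_expression_loss(logits, labels, sim, beta=beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()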
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. An expression recognition method based on positive and negative sample comparison learning is characterized by comprising the following steps:
S1, collecting a face image;
S2, inputting the facial image into a trained machine learning model to identify the expression in the facial image;
S3, outputting the expression type of the facial image;
the training method of the machine learning model in S2 specifically includes:
S21, generating a plurality of new samples for each sample;
for a batch of samples D = {(x_i, y_i) | i = 1, 2, …, N}, strong data enhancement yields D' = {(x'_i, y_i) | i = 1, 2, …, N} and weak data enhancement yields D'' = {(x''_i, y_i) | i = 1, 2, …, N}, where x_i is a face image and y_i is the expression label of x_i; the current weakly enhanced sample x''_i and its strongly enhanced sample x'_i are taken as a positive sample pair; the enhanced samples of all other samples belonging to a different class from the current sample x''_i are taken as negative sample pairs, and the negative sample pair set of the i-th sample is denoted as Ω_i = {x'_j, x''_j | y_j ≠ y_i};
S22, inputting the samples into a deep neural network and extracting the features of each sample;
for the weakly enhanced sample x''_i and the strongly enhanced sample x'_i, after feature extraction the feature representations U = {u_i | i = 1, 2, …, HW} and V = {v_j | j = 1, 2, …, HW} are obtained respectively, where u_i and v_j are the feature vector representations of the supply point and the destination point, respectively;
S23, calculating a loss function of the deep neural network;
the negative-sample similarity set of the i-th sample is defined as S_i, containing the structural similarity s_{i,j} between sample i and each enhanced sample j in Ω_i; the size of this set is K_i;
The total loss function is:
L = L_cls + β·L_hard
where β is a hyperparameter serving as a balance coefficient; L_cls is the softmax classification loss function, and L_hard is the adaptive weighting function for the negative samples (its formula is shown in the accompanying figure), where N is the number of samples in the batch;
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j=1}^{K_i} exp(γ·s_{i,j})
is the structural-similarity-to-probability conversion, where γ is a scaling coefficient and s_{i,j} represents the structural similarity between the i-th sample and the j-th enhanced sample; the probabilities sum to 1;
and S24, optimizing parameters of the deep neural network according to the loss function.
2. The expression recognition method based on positive and negative sample comparison learning according to claim 1, wherein in S23 the structural similarity s_{i,j} between the i-th feature representation and the j-th enhanced feature is specifically calculated as follows:
s_{i,j} = -s(U_i, V_j)
s(U_i, V_j) = Σ_{i=1}^{HW} Σ_{j=1}^{HW} c_{ij}·f*_{ij}
where H and W are the height and width of the convolutional feature maps U and V respectively, c_{ij} represents the transmission cost from the source (supply-side) node i to the destination node j, U_i represents the convolutional feature map of the i-th sample, V_j represents the convolutional feature map of the j-th sample, and f*_{ij} represents the optimal transportation scheme, obtained by solving:
min_f Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij}·f_{ij}
subject to f_{ij} ≥ 0, i = 1, …, m, j = 1, …, k
Σ_{j=1}^{k} f_{ij} = w^U_i, i = 1, …, m
Σ_{i=1}^{m} f_{ij} = w^V_j, j = 1, …, k
where w^U_i and w^V_j represent the local weights of the source node and the destination node, calculated by the following formulas:
w^U_i = max{0, u_i^T · G_avg(V)}
w^V_j = max{0, v_j^T · G_avg(U)}
where G_avg represents a global average pooling operation.
3. The expression recognition method based on positive and negative sample comparison learning according to claim 2, wherein c_{ij} is calculated as follows:
c_{ij} = 1 - (u_i^T·v_j) / (||u_i||·||v_j||)
where u_i and v_j are the feature vector representations of the supply point and the destination point.
4. The expression recognition method based on positive and negative sample comparison learning according to claim 2, wherein in S23 the relative magnitude p_{i,k} of the structural similarity between the i-th sample and the k-th enhanced sample is specifically calculated as follows:
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j=1}^{K_i} exp(γ·s_{i,j})
The larger the value of p_{i,k}, the greater the learning weight given to the corresponding part of the samples.
CN202210595007.7A 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning Active CN114998960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210595007.7A CN114998960B (en) 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210595007.7A CN114998960B (en) 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning

Publications (2)

Publication Number Publication Date
CN114998960A true CN114998960A (en) 2022-09-02
CN114998960B CN114998960B (en) 2024-03-26

Family

ID=83028670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210595007.7A Active CN114998960B (en) 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning

Country Status (1)

Country Link
CN (1) CN114998960B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210042580A1 (en) * 2018-10-10 2021-02-11 Tencent Technology (Shenzhen) Company Limited Model training method and apparatus for image recognition, network device, and storage medium
CN110851645A (en) * 2019-11-08 2020-02-28 吉林大学 Image retrieval method based on similarity maintenance under depth metric learning
CN110866134A (en) * 2019-11-08 2020-03-06 吉林大学 Image retrieval-oriented distribution consistency keeping metric learning method
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning
US20210390355A1 (en) * 2020-06-13 2021-12-16 Zhejiang University Image classification method based on reliable weighted optimal transport (rwot)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王博威; 潘宗序; 胡玉新; 马闻: "SAR Target Recognition Based on Siamese CNN with a Small Number of Samples", Radar Science and Technology, no. 06, 15 December 2019 (2019-12-15), pages 17-23 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079336A (en) * 2023-10-16 2023-11-17 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for sample classification model
CN117079336B (en) * 2023-10-16 2023-12-22 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for sample classification model

Also Published As

Publication number Publication date
CN114998960B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
Mao et al. Explain images with multimodal recurrent neural networks
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN111460980B (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
CN109214001A (en) A kind of semantic matching system of Chinese and method
CN109242400A (en) A kind of logistics express delivery odd numbers recognition methods based on convolution gating cycle neural network
CN109710804B (en) Teaching video image knowledge point dimension reduction analysis method
CN110188708A (en) A kind of facial expression recognizing method based on convolutional neural networks
CN109344759A (en) A kind of relatives' recognition methods based on angle loss neural network
CN110263174B (en) Topic category analysis method based on focus attention
CN111414862A (en) Expression recognition method based on neural network fusion key point angle change
Yang et al. Handwriting text recognition based on faster R-CNN
CN110287952A (en) A kind of recognition methods and system for tieing up sonagram piece character
CN108345833A (en) The recognition methods of mathematical formulae and system and computer equipment
CN115457568B (en) Historical document image noise reduction method and system based on generation countermeasure network
CN114580362B (en) System and method for generating return mark file
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN114998960A (en) Expression recognition method based on positive and negative sample comparison learning
CN112434686B (en) End-to-end misplaced text classification identifier for OCR (optical character) pictures
CN116543269B (en) Cross-domain small sample fine granularity image recognition method based on self-supervision and model thereof
CN111931630B (en) Dynamic expression recognition method based on facial feature point data enhancement
CN116579348A (en) False news detection method and system based on uncertain semantic fusion
Oktaviani et al. Optical character recognition for hangul character using artificial neural network
CN115982652A (en) Cross-modal emotion analysis method based on attention network
CN115588220A (en) Two-stage multi-scale self-adaptive low-resolution face recognition method and application
CN115564988A (en) Remote sensing image scene classification and semantic segmentation task method based on label smoothing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant