CN114998960B - Expression recognition method based on positive and negative sample contrast learning - Google Patents

Expression recognition method based on positive and negative sample contrast learning

Info

Publication number
CN114998960B
CN114998960B CN202210595007.7A
Authority
CN
China
Prior art keywords
sample
samples
positive
negative
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210595007.7A
Other languages
Chinese (zh)
Other versions
CN114998960A (en)
Inventor
文贵华
诸俊浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210595007.7A priority Critical patent/CN114998960B/en
Publication of CN114998960A publication Critical patent/CN114998960A/en
Application granted granted Critical
Publication of CN114998960B publication Critical patent/CN114998960B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an expression recognition method based on positive and negative sample contrast learning, which comprises the following steps: S1, collecting a face image; S2, inputting the face image into a trained machine learning model to identify the expression in the face image; S3, outputting the expression category of the face image. When training the machine learning model, negative samples are introduced into the structural-similarity contrast learning method so that the distance between positive and negative samples within a batch is enlarged. Meanwhile, considering the imbalance between the amounts of positive-sample and negative-sample data in a batch, hard sample mining is applied to the negative samples, so that the model strengthens the structural-similarity constraint on positive samples while pushing apart the negative samples that are most similar to the positives, thereby realizing the effect of contrast learning and improving expression recognition accuracy.

Description

Expression recognition method based on positive and negative sample contrast learning
Technical Field
The invention relates to the technical field of expression recognition, in particular to an expression recognition method based on positive and negative sample contrast learning.
Background
Facial expression recognition technology recognizes a person's emotion in an image by computer and has important practical value in applications such as intelligent medical care, safe-driving detection, online education, psychological counseling and entertainment interaction. However, in terms of recognition accuracy in real-world scenes, facial expression recognition has not yet reached human-level performance.
Expression recognition in natural scenes is more challenging than in controlled laboratory scenes, because images acquired under natural conditions contain greater variation in environmental factors such as illumination, resolution and pose. Second, because acquisition conditions are not uniform, expression recognition in natural scenes exhibits large intra-class differences: for the "surprise" expression, for example, appearance varies widely with shooting angle, illumination and the way individuals express themselves. At the same time, images from different classes can look alike, exhibiting large inter-class similarity. For instance, smiles and laughter both belong to the "happy" class yet differ greatly in appearance because of different acquisition conditions, while images of the "sad" and "fear" classes may differ only slightly in appearance. Finally, expression datasets collected in natural scenes usually require manual annotation, which consumes a great deal of manpower; more importantly, different annotators may judge the same image differently because of subjective bias, so expression labels carry considerable ambiguity.
The key to facial expression recognition in natural scenes is to extract features that carry expressive meaning, and many methods that learn and optimize local features, as well as metric-learning-based methods, have achieved good results. Although many methods have been proposed to address large intra-class variance, inter-class similarity and label ambiguity, they are all trained in a traditional supervised manner. In natural scenes, label ambiguity in the expression recognition task cannot be ignored: simply relying on the original labels may introduce erroneous information, ambiguous labels can mislead the network's learning of expression features, and supervised learning alone cannot fully mine fine-grained discriminative expression features.
Therefore, how to provide an expression recognition method based on positive and negative sample contrast learning that can effectively improve recognition accuracy is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides an expression recognition method based on positive and negative sample contrast learning, and aims to provide a more accurate expression recognition method.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An expression recognition method based on positive and negative sample contrast learning comprises the following steps:
S1, collecting a face image;
S2, inputting the face image into a trained machine learning model to identify the expression in the face image;
S3, outputting the expression category of the face image;
The training method of the machine learning model in S2 specifically comprises the following steps:
S21, generating a plurality of new samples for each sample;
For a batch of samples {(x_i, y_i) | i = 1, …, N}, a strongly enhanced version of each sample is obtained after strong data enhancement and a weakly enhanced version x'_i is obtained after weak data enhancement, where x_i is a face image and y_i is the expression label corresponding to x_i. The current weakly enhanced sample x'_i and its strongly enhanced counterpart are regarded as a positive sample pair; the enhanced samples of all other samples whose classes differ from that of the current sample x'_i are taken as negative sample pairs, and the set of negative sample pairs of the i-th sample is denoted Ω_i.
S22, inputting the samples into a deep neural network and extracting the features of each sample;
After feature extraction, the weakly enhanced sample x'_i and its strongly enhanced counterpart yield the feature representations U = {u_i | i = 1, 2, …, HW} and V = {v_j | j = 1, 2, …, HW}, respectively, where u_i and v_j are the feature vector representations of a supply point and a destination point;
S23, calculating the loss function of the deep neural network;
Define the negative-sample similarity set of the i-th sample as {s_{i,k} | k ∈ Ω_i}, with set size K_i.
The total loss function is:
L = L_cls + β·L_hard
where β is a hyper-parameter representing a balance coefficient, L_cls is the softmax classification loss function, and L_hard is the adaptive weighting function for the negative samples;
where N is the number of samples in the batch, p_{i,j} is the conversion probability of the structural similarity, γ is a scaling coefficient, s_{i,j} denotes the structural similarity between the i-th sample and the j-th enhanced sample, and the probabilities over all samples sum to 1;
S24, optimizing the parameters of the deep neural network according to the loss function.
Preferably, in S23, the structural similarity s_{i,j} between the i-th feature representation and the j-th enhanced feature representation is calculated as follows:
s_{i,j} = -s(U_i, V_j)
where s(U_i, V_j) denotes the earth mover's distance between the convolution feature map U_i of the i-th sample and the convolution feature map V_j of the j-th sample, i.e. the total cost Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij} f̃_{ij} of the optimal transportation scheme f̃; H and W are the height and width of the convolution feature maps U and V (so each map contains HW local nodes), and c_{ij} denotes the transport cost between source (supply-side) node i and destination node j. The optimal transportation scheme f̃ is obtained by solving the following linear program:
f̃ = argmin_f Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij} f_{ij}
subject to f_{ij} ≥ 0, i = 1, …, m, j = 1, …, k; Σ_{j=1}^{k} f_{ij} = w_i, i = 1, …, m; Σ_{i=1}^{m} f_{ij} = ŵ_j, j = 1, …, k
where w_i and ŵ_j denote the local weights of the source node and the destination node, which are computed from the feature maps using a global average pooling operation G_avg.
Preferably, the transport cost c_{ij} is calculated from u_i and v_j, where u_i and v_j are the feature vector representations of the supply point and the destination point.
Preferably, in S23, the relative magnitude p_{i,k} of the structural similarity between the i-th sample and the k-th enhanced sample is calculated as follows:
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j∈Ω_i} exp(γ·s_{i,j})
The larger the value of p_{i,k}, the greater the learning weight given to the corresponding portion of samples.
Compared with the prior art, the invention discloses an expression recognition method based on positive and negative sample contrast learning, which has the following beneficial effects:
1. By applying the EMD distance to the feature map, the model can be guided to attend to regions related to the expression and to suppress regions irrelevant to the expression, such as a noisy background, so that attention is effectively focused on the expression region.
2. The invention designs a structural-similarity constraint loss following the idea of self-supervision; without relying on the original labels, the contrast learning of positive and negative enhanced samples is used together with the classification loss to optimize the model, so that more generalizable expression features are learned.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of an expression recognition method based on positive and negative sample contrast learning provided by the invention;
FIG. 2 is a schematic diagram of a training method of a machine learning model in an expression recognition method based on positive and negative sample contrast learning;
fig. 3 is a schematic diagram of self-adaptive weighting of negative samples in an expression recognition method based on positive and negative sample contrast learning.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses an expression recognition method based on positive and negative sample contrast learning, which is shown in fig. 1-3 and comprises the following steps:
S1, collecting a face image;
S2, inputting the face image into a trained machine learning model to identify the expression in the face image;
S3, outputting the expression category of the face image;
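A minimal inference sketch of steps S1-S3 in PyTorch is given below. The checkpoint file name, the seven-class label set and the preprocessing sizes are assumptions for illustration; the embodiment itself only specifies a ResNet-18 backbone and 224×224 center crops at test time.

```python
# Minimal S1-S3 inference sketch (PyTorch); checkpoint name and label set are assumed.
import torch
from torchvision import models, transforms
from PIL import Image

EXPRESSIONS = ["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]  # assumed label set

preprocess = transforms.Compose([
    transforms.Resize(256),        # scale the face image
    transforms.CenterCrop(224),    # center crop used in the test phase
    transforms.ToTensor(),
])

model = models.resnet18(num_classes=len(EXPRESSIONS))
model.load_state_dict(torch.load("fer_contrastive.pth", map_location="cpu"))  # hypothetical checkpoint
model.eval()

def recognise(face_path: str) -> str:
    """S1: load a collected face image; S2: run the trained model; S3: return the class."""
    x = preprocess(Image.open(face_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)
    return EXPRESSIONS[logits.argmax(dim=1).item()]
```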
The training method of the machine learning model in S2 specifically comprises the following steps:
S21, generating a plurality of new samples for each sample;
For a batch of samples {(x_i, y_i) | i = 1, …, N}, a strongly enhanced version of each sample is obtained after strong data enhancement and a weakly enhanced version x'_i is obtained after weak data enhancement, where x_i is a face image and y_i is the expression label corresponding to x_i. The current weakly enhanced sample x'_i and its strongly enhanced counterpart are regarded as a positive sample pair; the enhanced samples of all other samples whose classes differ from that of the current sample x'_i are taken as negative sample pairs, and the set of negative sample pairs of the i-th sample is denoted Ω_i.
Two different views of each image are thus obtained by weak enhancement and strong enhancement. Weak enhancement means that the image is slightly scaled and flipped; strong enhancement additionally includes rotation and color transformation.
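A sketch of the two augmentation views using torchvision follows. The exact crop scales and rotation angle are assumptions; the embodiment only states slight scaling/flipping for the weak view (flip probability 0.5) and additional rotation and color transformation for the strong view (color jitter probability 0.2).

```python
# Weak/strong augmentation sketch; magnitudes marked below are assumptions.
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),  # slight scaling (scale assumed)
    transforms.RandomHorizontalFlip(p=0.5),               # flip with probability 0.5
    transforms.ToTensor(),
])

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # scale assumed
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(15),                        # extra rotation (angle assumed)
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.2),  # color transform
    transforms.ToTensor(),
])
```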
S22, inputting the samples into a deep neural network and extracting the features of each sample;
After feature extraction, the weakly enhanced sample x'_i and its strongly enhanced counterpart yield the feature representations U = {u_i | i = 1, 2, …, HW} and V = {v_j | j = 1, 2, …, HW}, respectively, where u_i and v_j are the feature vector representations of a supply point and a destination point;
S23, calculating the loss function of the deep neural network;
Define the negative-sample similarity set of the i-th sample as {s_{i,k} | k ∈ Ω_i}, with set size K_i.
The total loss function is:
L = L_cls + β·L_hard
where β is a hyper-parameter representing a balance coefficient, L_cls is the softmax classification loss function, and L_hard is the adaptive weighting function for the negative samples;
where N is the number of samples in the batch and p_{i,j} is the conversion probability of the structural similarity. γ is a scaling coefficient that controls the degree of equalization between the sample classification probabilities: the larger γ is, the larger the differences between the classification probabilities; the smaller γ is, the closer the classification probabilities of the samples; when γ approaches zero, the probability distribution degenerates into a uniform distribution. s_{i,j} denotes the structural similarity between the i-th sample and the j-th enhanced sample, and the probabilities over all samples sum to 1.
S24, optimizing parameters of the deep neural network according to the loss function.
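A minimal PyTorch sketch of the total loss in S23 follows. The exact expression for L_hard is not reproduced in the text above, so it is assumed here to be the p_{i,k}-weighted sum of negative-pair structural similarities, with p_{i,k} the γ-scaled softmax described earlier; β = 0.6 follows this embodiment, while γ = 1.0 and the function name are placeholders.

```python
# Sketch of L = L_cls + beta * L_hard; the form of L_hard is an assumption.
import torch
import torch.nn.functional as F

def total_loss(logits, labels, struct_sim, beta=0.6, gamma=1.0):
    """
    logits:     (N, C) classifier outputs for the weakly enhanced views
    labels:     (N,)   expression labels y_i
    struct_sim: (N, N) struct_sim[i, j] = s_{i,j}, structural similarity between
                weak view i and strong (enhanced) view j, e.g. negative EMD
    """
    l_cls = F.cross_entropy(logits, labels)                # softmax classification loss

    # Negative-pair mask: enhanced samples whose class differs from sample i.
    neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)  # (N, N) bool

    # p_{i,k}: softmax over each row's negatives, scaled by gamma;
    # more similar (harder) negatives receive larger weights.
    masked_sim = struct_sim.masked_fill(~neg_mask, float("-inf"))
    p = F.softmax(gamma * masked_sim, dim=1)

    # Assumed form of L_hard: adaptively weighted negative similarities,
    # averaged over the batch.
    l_hard = (p * struct_sim.masked_fill(~neg_mask, 0.0)).sum(dim=1).mean()

    return l_cls + beta * l_hard
```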
In order to further implement the above technical solution, in S23 the structural similarity s_{i,j} between the i-th feature representation and the j-th enhanced feature representation is calculated as follows:
s_{i,j} = -s(U_i, V_j)
where s(U_i, V_j) denotes the earth mover's distance between the convolution feature map U_i of the i-th sample and the convolution feature map V_j of the j-th sample, i.e. the total cost Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij} f̃_{ij} of the optimal transportation scheme f̃; H and W are the height and width of the convolution feature maps U and V (so each map contains HW local nodes), and c_{ij} denotes the transport cost between source (supply-side) node i and destination node j. The optimal transportation scheme f̃ is obtained by solving the following linear program:
f̃ = argmin_f Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij} f_{ij}
subject to f_{ij} ≥ 0, i = 1, …, m, j = 1, …, k; Σ_{j=1}^{k} f_{ij} = w_i, i = 1, …, m; Σ_{i=1}^{m} f_{ij} = ŵ_j, j = 1, …, k
where w_i and ŵ_j denote the local weights of the source node and the destination node, which are computed from the feature maps using a global average pooling operation G_avg.
In order to further implement the above technical solution, the transport cost c_{ij} is calculated from u_i and v_j, where u_i and v_j are the feature vector representations of the supply point and the destination point.
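A sketch of the structural-similarity computation is given below. It assumes a cosine transport cost c_{ij} = 1 - cos(u_i, v_j) and local weights obtained by relating each node to the global-average-pooled vector of the opposite feature map; both choices are assumptions where the formulas are not reproduced above. The optimal transportation scheme is solved with OpenCV's EMD routine, as mentioned later in this embodiment.

```python
# Structural similarity s_{i,j} = -EMD(U, V); cost and weight formulas are assumptions.
import cv2
import numpy as np

def structural_similarity(U, V):
    """
    U, V: numpy arrays of shape (C, H, W), convolution feature maps of the
    weak and strong views. Returns -EMD(U, V), so similar maps score higher.
    """
    C, H, W = U.shape
    u = U.reshape(C, H * W).T                    # HW x C local feature vectors u_i
    v = V.reshape(C, H * W).T                    # HW x C local feature vectors v_j

    # Assumed cosine transport cost c_ij = 1 - cos(u_i, v_j).
    un = u / (np.linalg.norm(u, axis=1, keepdims=True) + 1e-8)
    vn = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    cost = (1.0 - un @ vn.T).astype(np.float32)

    # Assumed local weights: relevance of each node to the other map's
    # global-average-pooled vector (clipped to non-negative, then normalised).
    g_u, g_v = u.mean(axis=0), v.mean(axis=0)    # G_avg of each feature map
    w_src = np.clip(u @ g_v, 0, None) + 1e-6
    w_dst = np.clip(v @ g_u, 0, None) + 1e-6
    sig1 = (w_src / w_src.sum()).astype(np.float32).reshape(-1, 1)
    sig2 = (w_dst / w_dst.sum()).astype(np.float32).reshape(-1, 1)

    # Solve the transportation problem; cv2.EMD returns the minimal total cost.
    emd, _, _flow = cv2.EMD(sig1, sig2, cv2.DIST_USER, cost)
    return -float(emd)
```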
In order to further implement the above technical solution, in S23 the relative magnitude p_{i,k} of the structural similarity between the i-th sample and the k-th enhanced sample is calculated as follows:
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j∈Ω_i} exp(γ·s_{i,j})
p_{i,k} reflects the relative magnitude of the structural similarity between the i-th sample and the k-th enhanced sample, i.e. the degree of similarity between the two view features. The larger the value of p_{i,k}, the more similar the sample pair, the harder the enhanced sample is to distinguish, and the more learning weight the model should give to that portion of samples.
It should be noted that:
for sample x i The positive samples are defined as enhanced samples x i . The negative samples should maintain a larger structural similarity difference, and there may be samples originally labeled as the same class in one batch, and this part of samples may have a higher structural similarity, and regarding the enhanced samples of the same class as the negative samples may introduce more error information. For correctness, the invention regards negative samples as enhancement samples of other classes, the set being defined asI.e. enhanced samples of all other samples of a different class than the current sample. This ensures the correctness of the negative samples to some extent.
The loss function is calculated as follows, taking into account the contrast learning of the negative samples and adding adaptive weights to the difficult negative samples.
Difficulty in measuring samples. Each sample is considered as a separate class, and each negative sample is converted to a class probability by a softamx function according to the structural similarity score.
In this embodiment, β is a balance coefficient that needs to be adjusted as a hyper-parameter; here β takes the value 0.6.
The deep neural network model of this embodiment runs on an NVIDIA Titan 3090 GPU server with the PyTorch (v1.7) deep learning framework installed. This embodiment uses a ResNet-18 backbone network, and all images are scaled to 256×256. In the training phase the input image is cropped to 224×224 at a random position, and in the testing phase it is cropped to 224×224 at the center position. The data enhancement phase includes weak and strong enhancement: for weak enhancement the image is flipped with a probability of 0.5; for strong enhancement, random color jittering is applied with a probability of 0.2 in addition to the weak enhancement.
Because the GPU memory consumption of the structural-similarity computation is large, in particular in the step that computes the pairwise structural similarity within a batch, the batch size is set to 32. The model is optimized by stochastic gradient descent with momentum, with a learning rate of 0.01, a momentum of 0.9 and a weight decay of 0.0001. The EMD distance calculation module needs to compute the corresponding optimal transportation scheme, which is calculated with OpenCV library functions.
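A sketch of the training configuration stated in this embodiment follows: ResNet-18 backbone, batch size 32, SGD with momentum 0.9, learning rate 0.01 and weight decay 0.0001. The data loader and the loss function (classification plus weighted negative-sample term) are passed in by the caller, and the seven-class output size is an assumption.

```python
# Optimiser/training-loop sketch matching the hyper-parameters of this embodiment.
import torch
from torchvision import models

def build_model_and_optimizer(num_classes: int = 7):
    model = models.resnet18(num_classes=num_classes)       # ResNet-18 backbone
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=0.01, momentum=0.9, weight_decay=1e-4)
    return model, optimizer

def train_one_epoch(model, loader, optimizer, loss_fn):
    """loader yields (weak_view, strong_view, label) batches of size 32;
    loss_fn implements L = L_cls + beta * L_hard as described above."""
    model.train()
    for weak_imgs, strong_imgs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model, weak_imgs, strong_imgs, labels)
        loss.backward()
        optimizer.step()
```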
In the present specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. For the device disclosed in an embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively brief, and relevant points can be found in the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. An expression recognition method based on positive and negative sample contrast learning, characterized by comprising the following steps:
S1, collecting a face image;
S2, inputting the face image into a trained machine learning model to identify the expression in the face image;
S3, outputting the expression category of the face image;
the training method of the machine learning model in the S2 specifically comprises the following steps:
s21, generating a plurality of new samples for each sample;
for a batch of samplesObtaining ∈10 after performing strong data enhancement>Obtaining ∈10 after weak data enhancement>Wherein x is i Is a face image, y i Is x i Corresponding expression labels; obtaining a current weak enhancement sample x' i And its strong enhancement sample x i Considered as a positive sample pair; will be identical to the current sample x' i Enhancement samples of all other samples of different classes are taken as negative sample pairs, the set of negative sample pairs of the ith sample is denoted +.>
S22, inputting the samples into a deep neural network and extracting the features of each sample;
after feature extraction, the weakly enhanced sample x'_i and its strongly enhanced counterpart yield the feature representations U = {u_i | i = 1, 2, …, HW} and V = {v_j | j = 1, 2, …, HW}, respectively, where u_i and v_j are the feature vector representations of a supply point and a destination point;
S23, calculating the loss function of the deep neural network;
defining the negative-sample similarity set of the i-th sample as {s_{i,k} | k ∈ Ω_i}, with set size K_i;
the total loss function is:
L = L_cls + β·L_hard
wherein β is a hyper-parameter representing a balance coefficient, L_cls is the softmax classification loss function, and L_hard is the adaptive weighting function for the negative samples;
wherein N is the number of samples in the batch, p_{i,j} is the conversion probability of the structural similarity, γ is a scaling coefficient, s_{i,j} denotes the structural similarity between the i-th sample and the j-th enhanced sample, and the probabilities over all samples sum to 1;
S24, optimizing the parameters of the deep neural network according to the loss function.
2. The expression recognition method based on positive and negative sample contrast learning according to claim 1, wherein in S23 the structural similarity s_{i,j} between the i-th feature representation and the j-th enhanced feature representation is calculated as follows:
s_{i,j} = -s(U_i, V_j)
wherein s(U_i, V_j) denotes the earth mover's distance between the convolution feature map U_i of the i-th sample and the convolution feature map V_j of the j-th sample, i.e. the total cost Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij} f̃_{ij} of the optimal transportation scheme f̃; H and W are the height and width of the convolution feature maps U and V, and c_{ij} denotes the transport cost between source (supply-side) node i and destination node j; the optimal transportation scheme f̃ is obtained by solving:
f̃ = argmin_f Σ_{i=1}^{m} Σ_{j=1}^{k} c_{ij} f_{ij}
subject to f_{ij} ≥ 0, i = 1, …, m, j = 1, …, k; Σ_{j=1}^{k} f_{ij} = w_i, i = 1, …, m; Σ_{i=1}^{m} f_{ij} = ŵ_j, j = 1, …, k
wherein w_i and ŵ_j denote the local weights of the source node and the destination node, which are computed from the feature maps using a global average pooling operation G_avg.
3. The expression recognition method based on positive and negative sample contrast learning according to claim 2, wherein the transport cost c_{ij} is calculated from u_i and v_j, where u_i and v_j are the feature vector representations of the supply point and the destination point.
4. The expression recognition method based on positive and negative sample contrast learning according to claim 2, wherein in S23 the relative magnitude p_{i,k} of the structural similarity between the i-th sample and the k-th enhanced sample is calculated as follows:
p_{i,k} = exp(γ·s_{i,k}) / Σ_{j∈Ω_i} exp(γ·s_{i,j})
wherein the larger the value of p_{i,k}, the greater the learning weight given to the corresponding portion of samples.
CN202210595007.7A 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning Active CN114998960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210595007.7A CN114998960B (en) 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210595007.7A CN114998960B (en) 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning

Publications (2)

Publication Number Publication Date
CN114998960A CN114998960A (en) 2022-09-02
CN114998960B true CN114998960B (en) 2024-03-26

Family

ID=83028670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210595007.7A Active CN114998960B (en) 2022-05-28 2022-05-28 Expression recognition method based on positive and negative sample contrast learning

Country Status (1)

Country Link
CN (1) CN114998960B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079336B (en) * 2023-10-16 2023-12-22 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for sample classification model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851645A (en) * 2019-11-08 2020-02-28 吉林大学 Image retrieval method based on similarity maintenance under depth metric learning
CN110866134A (en) * 2019-11-08 2020-03-06 吉林大学 Image retrieval-oriented distribution consistency keeping metric learning method
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163234B (en) * 2018-10-10 2023-04-18 腾讯科技(深圳)有限公司 Model training method and device and storage medium
CN111814871B (en) * 2020-06-13 2024-02-09 浙江大学 Image classification method based on reliable weight optimal transmission

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851645A (en) * 2019-11-08 2020-02-28 吉林大学 Image retrieval method based on similarity maintenance under depth metric learning
CN110866134A (en) * 2019-11-08 2020-03-06 吉林大学 Image retrieval-oriented distribution consistency keeping metric learning method
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SAR Target Recognition Based on Siamese CNN with Limited Samples; 王博威; 潘宗序; 胡玉新; 马闻; Radar Science and Technology; 2019-12-15 (06); pp. 17-23 *

Also Published As

Publication number Publication date
CN114998960A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN106803069B (en) Crowd happiness degree identification method based on deep learning
CN110399821B (en) Customer satisfaction acquisition method based on facial expression recognition
CN107895160A (en) Human face detection and tracing device and method
CN109522973A (en) Medical big data classification method and system based on production confrontation network and semi-supervised learning
CN112528928B (en) Commodity identification method based on self-attention depth network
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN109255289B (en) Cross-aging face recognition method based on unified generation model
CN109359608A (en) A kind of face identification method based on deep learning model
US20230162522A1 (en) Person re-identification method of integrating global features and ladder-shaped local features and device thereof
CN109710804B (en) Teaching video image knowledge point dimension reduction analysis method
CN110263174B (en) Topic category analysis method based on focus attention
Bu Human motion gesture recognition algorithm in video based on convolutional neural features of training images
He et al. Open-vocabulary multi-label classification via multi-modal knowledge transfer
CN110969073B (en) Facial expression recognition method based on feature fusion and BP neural network
CN108520213A (en) A kind of face beauty prediction technique based on multiple dimensioned depth
CN109377429A (en) A kind of recognition of face quality-oriented education wisdom evaluation system
CN113435335B (en) Microscopic expression recognition method and device, electronic equipment and storage medium
CN105404865A (en) Probability state restricted Boltzmann machine cascade based face detection method
CN112818850A (en) Cross-posture face recognition method based on progressive neural network and attention mechanism
CN114998960B (en) Expression recognition method based on positive and negative sample contrast learning
CN108345833A (en) The recognition methods of mathematical formulae and system and computer equipment
CN111368768A (en) Human body key point-based employee gesture guidance detection method
CN111461162B (en) Zero-sample target detection model and establishing method thereof
Zheng et al. Attention assessment based on multi‐view classroom behaviour recognition
CN111931630B (en) Dynamic expression recognition method based on facial feature point data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant