CN117115505A - Emotion enhancement continuous training method combining knowledge distillation and contrast learning - Google Patents

Emotion enhancement continuous training method combining knowledge distillation and contrast learning

Info

Publication number
CN117115505A
CN117115505A (application CN202310712211.7A)
Authority
CN
China
Prior art keywords
emotion
text
picture
mask
token
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310712211.7A
Other languages
Chinese (zh)
Inventor
毋立芳
邢乐豪
石戈
邓斯诺
李雪芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310712211.7A priority Critical patent/CN117115505A/en
Publication of CN117115505A publication Critical patent/CN117115505A/en
Pending legal-status Critical Current

Classifications

    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06F 16/435: Information retrieval of multimedia data; filtering based on additional data, e.g. user or group profiles
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/0455: Neural networks; auto-encoder networks, encoder-decoder networks
    • G06N 3/0464: Neural networks; convolutional networks [CNN, ConvNet]
    • G06N 3/0495: Neural networks; quantised, sparse or compressed networks
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/0895: Learning methods; weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

An emotion enhancement continuous training method combining knowledge distillation and contrastive learning belongs to the field of computer vision. First, an existing text emotion classification model and an emotion dictionary are used to screen a large number of image-text pairs with clear emotional tendencies, forming a large-scale emotion image-text pair dataset. A teacher network is then trained on this dataset to obtain strong generalization: emotional natural language is used as the supervision signal, and multi-granularity emotion information mined from the supervision signal is integrated into the picture representation to enhance the emotion expression capability of the visual encoder. The visual module of the resulting teacher network is used to initialize the student network and to provide pseudo-label data for it, and designed training tasks further mine the detailed emotion information in pictures. Finally, the student network is applied to downstream image emotion classification tasks. The method addresses the low prediction accuracy and poor applicability of existing emotion analysis models.

Description

Emotion enhancement continuous training method combining knowledge distillation and contrast learning
Technical Field
The invention belongs to the field of computer vision, and particularly relates to deep learning, emotion analysis and other technologies.
Background
With the development of the Internet, ways to socialize have multiplied, and people tend to share their experiences by publishing information in various modalities, such as text and images, on social platforms including WeChat Moments, Weibo, Twitter, and Facebook. Through such emotion-laden data, people express happiness or vent dissatisfaction. Extracting emotional tendencies from massive numbers of pictures makes it possible to understand people's preferences and emotional states, which plays an important role in many real-life applications such as social public-opinion monitoring, personalized recommendation, and prevention of potential psychological illness.
Emotion is expressed mainly through text, images, and video. As the pace of life accelerates and social platforms proliferate, people increasingly value the ability of pictures to convey emotion quickly and widely: today they tend to express their feelings by posting one or several photos together with a short sentence, or with pictures alone. Research on image emotion is therefore of great importance, and how to identify and analyze emotion-bearing images quickly, accurately, and automatically has naturally become a research hotspot.
At present, image emotion analysis methods generally start from an existing backbone network and then design a fine-tuning network that is trained to obtain a model with emotion analysis capability; they can be summarized as "backbone network + fine-tuning network" training schemes. The backbone determines the model structure. Backbones used in traditional methods are obtained by training on object recognition tasks, with the object labels in pictures as supervision signals. However, such training learns only limited, shallow picture semantics: a backbone whose knowledge is restricted to target recognition has reduced generalization and struggles with a label-supervised, deep semantic-understanding task such as image emotion analysis. Moreover, pre-training such a model is expensive, and most training data must be annotated manually.
Recently, large-scale language pre-training models (LPMs) have been proposed to mine knowledge from text and have achieved remarkable results. Work has shown that these models can serve as a starting point for downstream tasks and greatly improve final results. Among them, the CLIP model has enjoyed tremendous success in the vision and language fields. CLIP uses information from text as a visual supervision signal, fusing textual and pictorial information, and exhibits strong generalization. However, we find that directly applying such a natural-language-supervised pre-trained model to emotion analysis gives unsatisfactory results because of domain shift; our analysis attributes this to the omission of task-specific knowledge, such as emotion knowledge, during pre-training. A backbone network comparison is shown in Fig. 1.
To solve these problems, the invention proposes an emotion enhancement continuous training method combining knowledge distillation and contrastive learning. A large number of image-text pairs with clear emotion polarity are screened and collected, and a knowledge-distillation-based training procedure over this large dataset reduces the gap between domains. The model training itself is based on contrastive learning: by mining multi-granularity emotion knowledge in both the visual and textual modalities of the shared image-text space, the method understands the emotional semantics in pictures in depth, helping the pre-trained model output accurate emotion representations and identify image emotion accurately in downstream tasks.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an emotion enhancement continuous training method combining knowledge distillation and contrastive learning that makes full use of multi-modal emotion knowledge in the field. Its main purpose is to address the low prediction accuracy and poor applicability of emotion analysis models.
The invention designs an emotion enhancement continuous training method combining knowledge distillation and contrastive learning. The method comprises the following stages: teacher training data acquisition, teacher network training, student training data acquisition, student network training, and downstream task testing.
The invention comprises the following steps:
S1, acquiring teacher network training data. Image-text pairs with emotion polarity are screened from the CC12M dataset by an existing text emotion classification model; the text emotion is used as the picture pseudo label, and emotion words in the text are marked with an emotion dictionary. The final dataset is named SR-CC12M (Sentiment Rich CC12M), as shown in Fig. 2: the first row shows the pictures, the second row the corresponding text, the third row the emotion label of the picture-text pair (which has clear emotion polarity), and the fourth row the marked words in the text.
S2, preprocessing teacher network training data. The pictures and texts in the training data are unified in format, and the texts are emotion-masked to obtain text original samples, text mask samples, and picture samples.
S3, constructing a teacher model. Based on the input data, contrastive learning and emotion knowledge learning are performed on the teacher model; the overall structure of the teacher model is shown in Fig. 3. The student model is initialized by the teacher model's visual encoder.
S4, acquiring student network training data. The teacher network is used as a picture emotion classification tool to analyze the emotion polarity of each small block in a picture; the positions and emotion categories of picture blocks with clear emotion polarity are recorded to obtain original picture-block samples, and the picture blocks are emotion-masked to obtain picture-block mask samples.
S5, constructing a student model. Based on the picture-block samples, picture emotion knowledge learning is performed on the student model; the overall structure of the student model is shown in Fig. 4. The student model serves as the image emotion analysis pre-training model.
S6, downstream task testing. Images in the dataset to be tested are preprocessed as in S2. The image encoder obtained from student network training is applied to binary and multi-class emotion classification experiments on the three common downstream image emotion analysis datasets FI, Twitter, and EmotionROI, and corresponding experimental results are obtained under zero-shot, linear-probe, and supervised settings.
Optionally, the teacher network training data in S2 are processed as follows:
Step 1: the input picture is scaled and cropped to a three-channel 224×224 pixel value matrix; the pixel matrix is divided evenly into small patches, which are passed through a visual embedding layer to obtain the corresponding picture-block encodings used as the picture sample.
Step 2: the input text is encoded into tokens by a text embedding layer; a mask probability is set, the number of masks is obtained from the mask probability and the text tokens, and mask positions are obtained by random sampling over the input text tokens according to that number. The tokens at the mask positions are then masked. The input text tokens form the text original sample, and the masked text forms the text mask sample.
Optionally, the tokens at the mask positions are masked as follows:
Step 1: determine, according to the mask positions, which tokens in the input text need to be masked;
Step 2: according to the tokens to be masked and the masking strategy, replace them with the corresponding mask tokens.
Optionally, the masking strategy of Step 2 comprises:
(1) with 80% probability, a token requiring masking in the input text tokens is replaced with the [MASK] token;
(2) with 10% probability, it is replaced with a random token from the CLIP pre-training model vocabulary;
(3) with 10% probability, it is kept unchanged.
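A minimal sketch of this 80/10/10 masking strategy (PyTorch is assumed; the mask_id and vocab_size arguments stand in for the CLIP tokenizer's [MASK] id and vocabulary size, and the helper name is hypothetical):

```python
import random
import torch

def mask_tokens(token_ids: torch.Tensor, mask_positions, mask_id: int, vocab_size: int) -> torch.Tensor:
    """Apply the 80/10/10 strategy to the token positions selected for masking."""
    masked = token_ids.clone()
    for pos in mask_positions:
        r = random.random()
        if r < 0.8:                                   # 80%: replace with the [MASK] token
            masked[pos] = mask_id
        elif r < 0.9:                                 # 10%: replace with a random vocabulary token
            masked[pos] = random.randrange(vocab_size)
        # remaining 10%: keep the original token unchanged
    return masked
```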
Optionally, the teacher-model visual encoder and the text encoder each consist of a 12-layer Transformer model.
Optionally, in S3, contrastive learning and emotion knowledge learning are performed on the teacher model based on the input data as follows:
Step 1: contrastive learning. The text original sample corresponding to a picture sample is taken as the training positive example, and the remaining texts in the same batch as negative samples. Picture and text sample features are obtained through the Transformer networks, and the similarity between each picture sample in the training batch and the positive and negative examples is calculated as
sim(f_I, f_T) = cos(f_I, f_T)
where sim denotes the similarity function, f_I the picture sample features, f_T the text sample features, and cos the cosine similarity function.
Step 2: the contrastive loss is calculated from the similarities between the original sample and the positive and negative examples:
L_cl = -(1/N) Σ_{j=1}^{N} log [ exp(sim(f_I^j, f_T^j)) / Σ_{k=1}^{N} exp(sim(f_I^j, f_T^k)) ]
where L_cl denotes the contrastive learning loss, N is the number of training picture-text pairs in one batch, sim denotes the similarity function, and f_I^j and f_T^k are the picture features of the j-th pair and the text features of the k-th pair.
Step 3: learning text emotion knowledge. The text mask sample is used as the training example, and mask-token features are obtained through the Transformer network; the last-layer feature vectors of the Transformer are passed through a fully connected layer and mapped to the CLIP dictionary space to obtain the predicted token label at each mask position, and the cross-entropy loss between the predicted and real token labels is calculated.
Step 4: the token features output by the last Transformer layer (excluding the first-dimension class feature vector) are passed through a fully connected layer and mapped to the emotion space to obtain the predicted emotion label at each mask position, and the cross-entropy loss between the predicted and real emotion labels is calculated.
optionally, the processing procedure of the student network training data in S4 is:
step 1: and (3) processing the picture to obtain a picture sample in accordance with the operation of the step S2.
Step 2: and generating a mask picture. Inputting the picture samples into a teacher network visual encoder to obtain emotion tendency scores of each picture block, setting a threshold value to be 0.8, recording positions of the picture blocks with obvious emotion polarities, and setting random mask probabilities to obtain mask positions. Masking the picture block corresponding to the masking position; the input picture is an original image sample, and the picture after masking is an image mask sample;
optionally, the process of masking the token corresponding to the mask position is as follows:
step 1: determining a picture block to be masked in the input picture block coding according to the mask position;
step 2: and replacing the coding position of the mask required in the input picture block coding with the corresponding mask coding according to the picture block required to be masked and the masking strategy (masking 75% of emotion picture blocks).
Optionally, the student model comprises a visual encoder initialized by the teacher model's visual encoder and a decoder consisting of a simple convolutional layer.
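The picture-block masking described above could be sketched as follows (PyTorch is assumed; the 768-dimensional learnable mask embedding mirrors the 1×768 tensor described in the detailed implementation, and the function name is hypothetical):

```python
import torch
import torch.nn as nn

# Learnable mask embedding that replaces masked picture-block encodings (size assumed: 1x768).
mask_embedding = nn.Parameter(torch.randn(1, 768))

def mask_emotion_patches(patch_embeds: torch.Tensor, emotion_idx: torch.Tensor, ratio: float = 0.75):
    """Randomly mask `ratio` of the emotion picture blocks.
    patch_embeds: [49, 768] patch encodings of one picture; emotion_idx: indices of emotion blocks."""
    n_mask = int(ratio * emotion_idx.numel())
    chosen = emotion_idx[torch.randperm(emotion_idx.numel())[:n_mask]]
    masked = patch_embeds.clone()
    masked[chosen] = mask_embedding          # replace the selected blocks with the mask embedding
    return masked, chosen                    # chosen positions serve as reconstruction targets
```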
Optionally, in S5, picture emotion knowledge learning is performed on the student model based on the input data as follows:
Step 1: learning image emotion knowledge. The image mask sample is used as the training example, and mask-prediction features are obtained through the Transformer network; the last-layer predicted feature vectors at the mask positions are passed to the decoder to obtain predicted pixels, and the L1 loss between the predicted and real pixels is calculated:
L_mim = ||y_M - x_M||_1 / Ω(x_M)
where L_mim denotes the image mask reconstruction loss, y_M the predicted pixels, x_M the real pixels, ||·||_1 the distance between features, and Ω(x_M) the number of elements.
Step 2: the last-layer picture-block feature vectors of the Transformer (excluding the first-dimension class feature vector) are passed through a fully connected layer and mapped to the emotion space to obtain the predicted emotion label at each mask position, and the cross-entropy loss between the predicted and real emotion labels is calculated.
Drawings
Fig. 1 is a comparison of backbone networks.
FIG. 2 is an example of an SR-CC12M dataset.
FIG. 3 is a diagram of the teacher network structure.
FIG. 4 is a diagram of the student network structure.
Detailed Description
The invention discloses an emotion enhancement continuous training method combining knowledge distillation and contrastive learning for the image emotion classification task. The embodiments of the invention are described below clearly and completely with reference to the accompanying drawings; the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention. The specific implementation steps of the invention are as follows:
step 1: acquiring teacher network training data
In the first step, image-text pairs with emotion polarity are screened from the CC12M dataset with an existing text emotion classification model, the text emotion is used as the picture pseudo label, and emotion words in the text are marked with an emotion dictionary. This step has two parts: sentence-level data preparation and word-level data preparation. For the sentence level, the TweetEval model is applied to screen out the texts T with emotion polarity (positive or negative) in the CC12M image-text dataset together with their corresponding pictures I, and emotion labels are assigned (positive = 1, negative = 0); these form the weakly labelled image-text dataset used for model training. At the same time, a template generator is designed with the text template "it is a photo of [sentiment label]"; placing the emotion tags positive and negative in the corresponding position yields the label sentences L. To obtain fine-grained emotion information, word-level data are prepared: individual words W are extracted from the text with a word segmenter, their emotion polarity is judged with the emotion dictionary, and emotion labels are assigned (negative = 0, neutral = 1, positive = 2). The final dataset contains 73,000 image-text pairs, of which 61,000 are positive and 12,000 negative, and is used directly for model training.
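An illustrative sketch of the sentence-level template and the word-level dictionary labelling (the helper names and the sentiment_dict mapping are hypothetical; the TweetEval screening itself is not shown):

```python
# Emotion labels used in the dataset: sentence level (positive=1, negative=0)
# and word level (negative=0, neutral=1, positive=2).
WORD_LABEL = {"negative": 0, "neutral": 1, "positive": 2}

def label_sentences() -> list[str]:
    """Template generator: place the sentiment label into the fixed text template."""
    return [f"it is a photo of {label}" for label in ("positive", "negative")]

def label_words(words: list[str], sentiment_dict: dict[str, str]) -> list[tuple[str, int]]:
    """Assign a word-level emotion label to each word using an emotion dictionary."""
    return [(w, WORD_LABEL[sentiment_dict.get(w, "neutral")]) for w in words]
```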
Step 2: teacher network data preprocessing and model parameter setting
The input pictures and texts are unified in data format: each image is scaled and cropped to a three-channel 224×224 result and passed to the image encoder. The text is converted into the corresponding token encoding by the tokenizer and trimmed to length 77 (padded with 0 if shorter, truncated if longer) before being passed to the text encoder. The model input batch size is set to 128 and training runs for 10 epochs; gradients are propagated and parameters updated every 1500 steps. The learning rate is initially set to 10^-6 and decays to 10^-7 after 3 epochs. An AdamW optimizer with a gradient-descent scheduler performs parameter learning for the image-text encoders in the model, and two further AdamW optimizers are set up independently to learn the parameters of the fully connected layers in the model.
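A sketch of this training configuration (PyTorch is assumed; the nn.Linear modules are placeholders for the 12-layer Transformer encoders and the fully connected heads, and the scheduler choice is one way to realise the stated decay from 1e-6 to 1e-7 after 3 of the 10 epochs):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the image/text encoders and the two FC heads.
image_encoder = nn.Linear(768, 512)
text_encoder  = nn.Linear(512, 512)
token_head    = nn.Linear(512, 49408)   # vocabulary-sized head (CLIP BPE vocab size assumed)
emotion_head  = nn.Linear(512, 3)       # word-level emotion head

encoder_params = list(image_encoder.parameters()) + list(text_encoder.parameters())
opt_encoder = torch.optim.AdamW(encoder_params, lr=1e-6)
opt_token   = torch.optim.AdamW(token_head.parameters(), lr=1e-6)     # separate optimizer
opt_emotion = torch.optim.AdamW(emotion_head.parameters(), lr=1e-6)   # separate optimizer

# Decay the encoder learning rate from 1e-6 to 1e-7 after epoch 3 of 10.
scheduler = torch.optim.lr_scheduler.MultiStepLR(opt_encoder, milestones=[3], gamma=0.1)

BATCH_SIZE  = 128
ACCUM_STEPS = 1500   # gradients are propagated and parameters updated every 1500 steps
```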
Step 3 teacher network design
Step 3.1 feature extraction
Each training step takes N image-text pairs as model input; images are fed to the visual encoder and texts to the text encoder. A learnable visual embedding layer followed by the self-attention modules yields the overall image feature tensor f_I ∈ R^(N×512). The text is converted into token ids by the tokenizer module, and a learnable text embedding layer followed by the attention modules yields the text feature tensor f_T ∈ R^(N×512), the text token feature tensor f_Tok ∈ R^(N×77×512), and the picture emotion label-sentence features f_L ∈ R^(2×512).
Step 3.2 teacher network training
Step 3.2.1 contrast learning
On the large-scale image-text dataset, the teacher network is trained with contrastive learning as the dominant objective, using emotional natural language as the supervision signal under the guidance of a fine-grained text emotion knowledge reasoning task. Contrastive learning maps image features and text features into the same space, integrates the emotion information of the text into the image, and increases semantic similarity by minimizing the distance between the two modalities, so that the image naturally carries the information of its text. Concretely, the semantic similarity between a picture and its positive example is increased while the similarity between the picture and the negative examples is reduced: among N picture-text pairs, the distance between the j-th picture and its corresponding text is pulled closer while its distance to the remaining N-1 texts is pushed apart. The training objective is
L_cl = -(1/N) Σ_{j=1}^{N} log [ exp(sim(f_I^j, f_T^j)) / Σ_{k=1}^{N} exp(sim(f_I^j, f_T^k)) ]
where sim(f_I, f_T) = cos(f_I, f_T) is the cosine similarity, f_I^j denotes the picture features of the j-th picture-text pair, and f_T^j the text features of the j-th pair. This contrastive learning scheme is consistent with CLIP and helps the encoders maximize the mutual information between true pairs.
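A hedged PyTorch sketch of this contrastive objective (the temperature τ and the symmetric text-to-image term follow the CLIP recipe the text says it is consistent with, but their exact values and weighting are assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_img: torch.Tensor, f_txt: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """CLIP-style contrastive loss over N image-text pairs (f_img, f_txt: [N, 512])."""
    f_img = F.normalize(f_img, dim=-1)
    f_txt = F.normalize(f_txt, dim=-1)
    logits = f_img @ f_txt.t() / tau                       # cosine similarities, [N, N]
    targets = torch.arange(f_img.size(0), device=f_img.device)
    # Pull the j-th picture towards its own text, push it from the other N-1 texts,
    # and do the same symmetrically for the texts.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```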
Step 3.2.2 Text fine-grained emotion knowledge reasoning
Mapping images and text into the same space means that knowledge mining in one modality can also improve the representation of the other. The invention therefore designs a text emotion knowledge mining task that attends to fine-grained textual emotion knowledge and thereby further interprets the emotion information in pictures. This task is referred to as the emotion text mask reasoning task; its purpose is to train the encoding capability of the model's text encoder and to extend the embedding layer so that it can correspondingly encode the new MASK token.
The input data are manipulated by building masks at the text word level. 15% of all tokens need to be masked, and words with emotion polarity are masked preferentially; because a word may be split into several tokens, all tokens belonging to one word are masked together, i.e. the original token ids are replaced by the id corresponding to "[MASK]", and the positions and original token ids are recorded as labels. 10% of all emotion words are masked in this way; if the number of masked emotion-word tokens is still below 15% of all tokens, the remaining tokens are masked with a modified strategy: with 80% probability the token is replaced by the "[MASK]" id, with 10% probability by a random token from the dictionary, and with 10% probability it is left unchanged. After the input is passed through the embedding layer for feature encoding, the self-attention mechanism in the Transformer performs knowledge reasoning at the MASK positions from the context, i.e. the context features are combined by weighted summation to produce the features of each MASK position. The result at a MASK position is passed to a softmax layer to obtain a normalized probability distribution over the whole vocabulary for the prediction; this vector has the length of the vocabulary, and each dimension represents the likelihood that the predicted word is the corresponding word in the vocabulary. The model encoder is trained by reducing the difference between this probability distribution and the distribution of the real word label (w_i).
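The masked-token prediction step could be sketched as follows (PyTorch assumed; the vocabulary size and head name are assumptions, and the encoder producing the mask-position features is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 49408                       # CLIP BPE vocabulary size (assumption)
token_head = nn.Linear(512, VOCAB_SIZE)  # fully connected layer mapping features to the vocabulary

def masked_token_loss(mask_features: torch.Tensor, true_token_ids: torch.Tensor) -> torch.Tensor:
    """mask_features: [M, 512] encoder outputs at the M masked positions;
    true_token_ids: [M] original token ids recorded as labels."""
    logits = token_head(mask_features)              # [M, VOCAB_SIZE]
    return F.cross_entropy(logits, true_token_ids)  # softmax over the vocabulary + cross entropy
```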
To obtain the representation of a word, the token features corresponding to each emotion word are extracted from the last layer of the text encoder; since one word may correspond to several tokens, these token features are summed and averaged to obtain the final word feature g_φ(x_W) of the text, which is fed into a learnable fully connected layer. For example, g_φ(x_W^{j,g}) denotes the feature of the g-th word of the j-th sentence in the input text, of size 1×1×512, and the emotion probability distribution of the word is h_ψ(g_φ(x_W)), characterizing the likelihood that the word expresses positive or negative emotion, where ψ are the parameters of the fully connected layer h. The model is trained by reducing the difference between this probability distribution and the distribution of the word's emotion label y_g, in the following manner:
L = (1/N) Σ_{j=1}^{N} (1/H_j) Σ_{g=1}^{H_j} CE( h_ψ(g_φ(x_W^{j,g})), y_g )
where CE denotes the cross-entropy, H_j is the number of words in the text of the j-th sentence (the number of words contained in different sentences may differ), and N is the number of image-text pairs in the input data.
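A sketch of this word-level emotion head (PyTorch assumed; whether the head predicts two or three emotion classes is not fully specified, so the 3-class version matching the word-level dictionary labels is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emotion_head = nn.Linear(512, 3)   # h_psi: word feature -> negative / neutral / positive

def word_emotion_loss(token_feats: torch.Tensor, word_spans, word_labels: torch.Tensor) -> torch.Tensor:
    """token_feats: [77, 512] last-layer token features of one sentence;
    word_spans: list of token-index lists, one per emotion word (a word may span several tokens);
    word_labels: [H] emotion labels of the H words."""
    # g_phi: average the token features belonging to each word to get its 512-d representation.
    word_feats = torch.stack([token_feats[idx].mean(dim=0) for idx in word_spans])
    return F.cross_entropy(emotion_head(word_feats), word_labels)
```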
Step 4: student network training data generation and model parameter setting
The teacher network is used as a picture emotion classification tool to analyze the emotion polarity of each small block in a picture; the positions and emotion categories of picture blocks with clear emotion polarity are recorded to obtain original picture-block samples, and the picture blocks are emotion-masked to obtain picture-block mask samples. Specifically, to keep the model size and training time under control, the visual encoder module of the teacher network is used to initialize the student network. The pictures in SR-CC12M and their corresponding emotion labels serve as training data. In the first branch, the teacher model outputs, from the last layer of the visual encoder, the emotion features of the whole picture and of every picture block, and performs emotion classification on all picture-block features. Because the teacher model has excellent emotion classification ability, the reliability of the classification results is ensured; the results are the probabilities that the corresponding picture block expresses positive or negative emotion. Picture blocks whose emotion tendency is higher than 0.7 are selected, their positions and the corresponding pseudo emotion labels are recorded, and these positions and labels are passed to the student network as training data. The second branch passes the training data directly to the student network. The remaining model parameters are as follows: the model input batch size is set to 256 and training runs for 10 epochs; gradients are propagated and parameters updated every 1500 steps; the learning rate is initially set to 10^-5 and decays to 10^-6 after 3 epochs. An AdamW optimizer with a gradient-descent scheduler performs parameter learning for the picture encoder in the model, and two further AdamW optimizers are set up independently to learn the parameters of the fully connected layers in the model.
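How the per-block emotion-tendency scores of the first branch are computed is not spelled out; one hedged reading, scoring each block against the two emotion label-sentence features f_L by cosine similarity (consistent with the classification scheme used for the student network in step 5.2), is sketched below:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def patch_pseudo_labels(patch_feats: torch.Tensor, label_sent_feats: torch.Tensor,
                        threshold: float = 0.7):
    """patch_feats: [49, 512] per-block features from the teacher visual encoder;
    label_sent_feats: [2, 512] features of the positive/negative label sentences.
    Returns the positions and pseudo emotion labels of blocks whose score exceeds the threshold."""
    feats  = F.normalize(patch_feats, dim=-1)
    labels = F.normalize(label_sent_feats, dim=-1)
    probs = (feats @ labels.t()).softmax(dim=-1)    # [49, 2] emotion-tendency scores
    scores, pseudo = probs.max(dim=-1)
    keep = scores > threshold
    return keep.nonzero(as_tuple=True)[0], pseudo[keep]
```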
Step 5: student network design
Step 5.1 feature extraction
Each training step takes N pictures as model input; the pictures are fed to the visual encoder, a learnable visual embedding layer computes the patch encodings by convolution to give the encoded feature tensor f ∈ R^(N×50×768), and the self-attention modules yield the overall picture feature f_I ∈ R^(N×1×512) and the picture-block features f_P ∈ R^(N×49×512).
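The patch encoding by convolution could be sketched as follows (PyTorch assumed; the 32-pixel patch size is inferred from the 224×224 input and the 49 patch tokens, i.e. a 7×7 grid):

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: a strided convolution maps each 32x32 patch to a 768-d vector.
patch_embed = nn.Conv2d(3, 768, kernel_size=32, stride=32)
cls_token   = nn.Parameter(torch.zeros(1, 1, 768))        # class token prepended to the patches

def embed_picture(pixels: torch.Tensor) -> torch.Tensor:
    """pixels: [N, 3, 224, 224] preprocessed pictures -> [N, 50, 768] encoded feature tensor f."""
    patches = patch_embed(pixels)                          # [N, 768, 7, 7]
    patches = patches.flatten(2).transpose(1, 2)           # [N, 49, 768] patch tokens
    cls = cls_token.expand(pixels.size(0), -1, -1)         # [N, 1, 768]
    return torch.cat([cls, patches], dim=1)                # [N, 50, 768]
```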
Step 5.2 student network training
The invention designs two training tasks to train the student network's ability to extract global and local emotion information from images. In the student network, unlike conventional methods that apply an FC layer for emotion classification, classification is done by computing the cosine similarity between each picture's features and the features of the two emotion label sentences; reducing the difference between the resulting emotion probability distribution and the distribution of the picture's emotion label helps the model learn to capture the global emotion of an image.
Meanwhile, to give the model the ability to capture local emotion information in images, the invention designs an image-region emotion mask reconstruction task and an emotion prediction task. Because picture information is rich, a large proportion of the regions must be covered; here masking is applied to the emotion picture blocks by randomly selecting 75% of all emotion picture blocks, recording their positions and the RGB pixel values of the original picture as labels, and replacing them with a randomly initialized learnable mask embedding for picture-block features, concretely a tensor of size 1×768. The self-attention mechanism in the Transformer performs knowledge reasoning at the mask positions from the context to obtain the predicted emotion features f_P of the picture blocks. The features output by the encoder at the mask positions are fed to the decoder to obtain the corresponding predicted pixel values; the decoder is implemented as a convolutional layer. The L1 loss between the predicted and real pixels is calculated:
L_mim = ||y_M - x_M||_1 / Ω(x_M)
where L_mim denotes the image mask reconstruction loss, y_M the predicted pixels, x_M the real pixels, ||·||_1 the distance between features, and Ω(x_M) the number of elements.
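A one-line PyTorch rendering of this reconstruction loss (assuming the decoder outputs and ground-truth RGB values have already been gathered at the masked positions):

```python
import torch
import torch.nn.functional as F

def mim_loss(pred_pixels: torch.Tensor, true_pixels: torch.Tensor) -> torch.Tensor:
    """L_mim = ||y_M - x_M||_1 / Omega(x_M): mean absolute error over the masked blocks' pixels."""
    return F.l1_loss(pred_pixels, true_pixels, reduction="mean")
```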
The predicted emotion features f_P of the picture blocks are passed through a learnable fully connected layer to obtain predicted picture-block labels, and their distribution is brought close to the distribution of the real picture-block emotion labels, implemented concretely as a cross-entropy loss.
By increasing the similarity between the predicted pixel values and the label pixel values, the model acquires the ability to understand regions; at the same time, performing emotion prediction at the mask positions further trains the model's ability to capture emotional details in images.
Step 6, downstream task test
After the same preprocessing as for the training images, the images in the dataset to be tested are input into the trained student model, and corresponding experimental results are obtained under zero-shot and supervised settings. The best published results for binary and six-class emotion classification on the EmotionROI dataset and for binary classification on the FI dataset, reported in 2020 in the journal IEEE TRANSACTIONS ON MULTIMEDIA in "WSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and Detection", are 0.8510, 0.6041, and 0.9097 respectively; the best result for eight-class emotion classification on the FI dataset, 0.7546, was published in 2019 in "Multi-level region-based convolutional neural network for image emotion classification"; and the best classification result on the Twitter dataset, 0.8965, was published in 2021 in "Discovering sentimental interaction via graph convolutional network for visual sentiment prediction". On these datasets, the model of the invention reaches 0.8805 and 0.6751 for EmotionROI binary and six-class classification, 0.9375 and 0.7833 for FI binary and eight-class classification, and 0.9016 for Twitter classification.

Claims (6)

1. An emotion enhancement continuous training method combining knowledge distillation and contrastive learning, characterized by comprising the following steps:
S1, acquiring teacher network training data: image-text pairs with emotion polarity are screened from the CC12M dataset by an existing text emotion classification model, the text emotion is taken as the picture pseudo label, and emotion words in the text are marked with an emotion dictionary; the final dataset is named SR-CC12M; the first row shows the pictures, the second row the corresponding text, the third row the emotion label of the picture-text pair with clear emotion polarity, and the fourth row the marked words in the text;
S2, preprocessing teacher network training data: the pictures and texts in the training data are unified in format, and the texts are emotion-masked to obtain text original samples, text mask samples, and picture samples;
S3, constructing a teacher model: based on the input data, contrastive learning and emotion knowledge learning are performed on the teacher model; the student model is initialized by the teacher model's visual encoder;
S4, acquiring student network training data: the teacher network is used as a picture emotion classification tool to analyze the emotion polarity of each small block in a picture, and the positions and emotion categories of picture blocks with clear emotion polarity are recorded to obtain original picture-block samples; the picture blocks are emotion-masked to obtain picture-block mask samples;
S5, constructing a student model: based on the picture-block samples, picture emotion knowledge learning is performed on the student model; the student model serves as the image emotion analysis pre-training model;
S6, downstream task testing: images in the dataset to be tested are preprocessed as in S2; the image encoder obtained from student network training is applied to binary and multi-class emotion classification experiments on the three common downstream image emotion analysis datasets FI, Twitter, and EmotionROI; and corresponding experimental results are obtained under zero-shot, linear-probe, and supervised settings.
2. The method according to claim 1, characterized in that the teacher network training data in S2 are processed as follows:
step 1: the input picture is scaled and cropped to a three-channel 224×224 pixel value matrix; the pixel matrix is divided evenly into small patches, which are passed through a visual embedding layer to obtain the corresponding picture-block encodings used as picture samples;
step 2: the input text is encoded into tokens by a text embedding layer, a mask probability is set, the number of masks is obtained from the mask probability and the text tokens, and mask positions are obtained by random sampling over the input text tokens according to that number; the tokens at the mask positions are masked; the input text tokens form the text original sample and the masked text forms the text mask sample;
the tokens at the mask positions are masked as follows:
step 1.1: determine, according to the mask positions, which tokens in the input text need to be masked;
step 1.2: according to the tokens to be masked and the masking strategy, replace them with the corresponding mask tokens; the masking strategy comprises:
(1) with 80% probability, a token requiring masking in the input text tokens is replaced with the [MASK] token;
(2) with 10% probability, it is replaced with a random token from the CLIP pre-training model vocabulary;
(3) with 10% probability, it is kept unchanged.
3. The method according to claim 1, characterized in that the teacher-model visual encoder and the text encoder are each composed of a 12-layer Transformer model.
4. The method according to claim 1, characterized in that, in S3, contrastive learning and emotion knowledge learning are performed on the teacher model based on the input data as follows:
step 1: contrastive learning; the text original sample corresponding to a picture sample is taken as the training positive example and the remaining texts in the same batch as negative samples; picture and text sample features are obtained through the Transformer networks, and the similarity between each picture sample in the training batch and the positive and negative examples is calculated as
sim(f_I, f_T) = cos(f_I, f_T)
where sim denotes the similarity function, f_I the picture sample features, f_T the text sample features, and cos the cosine similarity function;
step 2: the contrastive loss is calculated from the similarities between the original sample and the positive and negative examples:
L_cl = -(1/N) Σ_{j=1}^{N} log [ exp(sim(f_I^j, f_T^j)) / Σ_{k=1}^{N} exp(sim(f_I^j, f_T^k)) ]
where L_cl denotes the contrastive learning loss, N is the number of training picture-text pairs in one batch, sim denotes the similarity function, and f_I^j and f_T^k are the picture features of the j-th pair and the text features of the k-th pair;
step 3: learning text emotion knowledge; the text mask sample is used as the training example and mask-token features are obtained through the Transformer network; the last-layer feature vectors of the Transformer are passed through a fully connected layer and mapped to the CLIP dictionary space to obtain the predicted token label at each mask position, and the cross-entropy loss between the predicted and real token labels is calculated;
step 4: the token features output by the last Transformer layer are passed through a fully connected layer and mapped to the emotion space to obtain the predicted emotion label at each mask position, and the cross-entropy loss between the predicted and real emotion labels is calculated.
5. The method according to claim 1, characterized in that the student network training data in S4 are processed as follows:
step 1: consistent with the operation of S2, the picture is processed to obtain a picture sample;
step 2: mask pictures are generated; the picture sample is input to the teacher network's visual encoder to obtain an emotion-tendency score for each picture block, a threshold of 0.8 is set, the positions of picture blocks with clear emotion polarity are recorded, and a random mask probability is set to obtain mask positions; the picture blocks at the mask positions are masked; the input picture is the original image sample and the masked picture is the image mask sample;
the picture blocks at the mask positions are masked as follows:
step 4.1: determine, according to the mask positions, which blocks in the input picture-block encoding need to be masked;
step 4.2: according to the blocks to be masked and the masking strategy, namely masking 75% of the emotion picture blocks, replace the encodings at those positions with the corresponding mask encoding;
the student model comprises a visual encoder initialized by the teacher model's visual encoder and a decoder consisting of a convolutional layer.
6. The method according to claim 1, characterized in that, in S5, picture emotion knowledge learning is performed on the student model based on the input data as follows:
step 1: learning image emotion knowledge; the image mask sample is used as the training example and mask-prediction features are obtained through the Transformer network; the last-layer predicted feature vectors at the mask positions are passed to the decoder to obtain predicted pixels, and the L1 loss between the predicted and real pixels is calculated:
L_mim = ||y_M - x_M||_1 / Ω(x_M)
where L_mim denotes the image mask reconstruction loss, y_M the predicted pixels, x_M the real pixels, ||·||_1 the distance between features, and Ω(x_M) the number of elements;
step 2: the last-layer picture-block feature vectors of the Transformer are passed through a fully connected layer and mapped to the emotion space to obtain the predicted emotion label at each mask position, and the cross-entropy loss between the predicted and real emotion labels is calculated.
CN202310712211.7A 2023-06-15 2023-06-15 Emotion enhancement continuous training method combining knowledge distillation and contrast learning Pending CN117115505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310712211.7A CN117115505A (en) 2023-06-15 2023-06-15 Emotion enhancement continuous training method combining knowledge distillation and contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310712211.7A CN117115505A (en) 2023-06-15 2023-06-15 Emotion enhancement continuous training method combining knowledge distillation and contrast learning

Publications (1)

Publication Number Publication Date
CN117115505A

Family

ID=88797285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310712211.7A Pending CN117115505A (en) 2023-06-15 2023-06-15 Emotion enhancement continuous training method combining knowledge distillation and contrast learning

Country Status (1)

Country Link
CN (1) CN117115505A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117708336A (en) * 2024-02-05 2024-03-15 南京邮电大学 Multi-strategy emotion analysis method based on theme enhancement and knowledge distillation
CN117708336B (en) * 2024-02-05 2024-04-19 南京邮电大学 Multi-strategy emotion analysis method based on theme enhancement and knowledge distillation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination