CN114969338A - Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation - Google Patents

Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Info

Publication number
CN114969338A
Authority
CN
China
Prior art keywords
vector
text
sentence
vectors
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210580293.XA
Other languages
Chinese (zh)
Inventor
孙新
李瑾仪
任翔渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210580293.XA priority Critical patent/CN114969338A/en
Publication of CN114969338A publication Critical patent/CN114969338A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/55 Rule-based translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The image-text emotion classification method comprises: extracting features from the original image-text data to obtain feature vectors for the words and the pictures; fusing the text and the pictures with Transformer encoders and splicing their output vectors into a guide vector; encoding the word feature vectors with an attention mechanism to obtain sentence feature vectors; encoding the sentence feature vectors with the attention mechanism and the guide vector to obtain a text feature vector; and combining the output vectors of the two encoders with the text feature vector for emotion classification. The image-text emotion classification method and system provided by the invention take fuller account of the role of the text modality and of the influence of each sentence on the emotion of the text, and use the vector fused by the symmetric translation module to guide the generation of the text vector. This resolves the heterogeneity between the guide vector and the text vector, and because the fusion of text and pictures is considered again during text-vector generation rather than in a single fusion step, the image-text fusion effect is further improved.

Description

Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation
Technical Field
The invention relates to the technical field of natural language processing and deep learning, in particular to a method and a system for classifying image-text emotions based on heterogeneous fusion and symmetric translation.
Background
With the rapid development of the internet and computer science, a large number of social platforms and commercial websites have emerged, such as Weibo, Douyin, Xiaohongshu and Meituan; video websites such as Bilibili and shopping platforms such as Taobao are widely used, greatly changing how people live and communicate. The ways in which people express themselves on these platforms are increasingly diverse: users can convey their views through pictures and videos, for example by uploading their own videos to Bilibili with some accompanying text, or by reviewing products they have used on shopping websites with pictures and text. Consumers can decide whether to buy based on existing product reviews, while merchants can judge users' preferences for a product and its popularity from those reviews and decide how to serve users better.
Each source or form of information can be called a modality; text, pictures and audio are the three most common modalities in practice. With the growing amount of multi-modal content on the internet, the task of multi-modal sentiment analysis has arisen. Multi-modal sentiment analysis aims to exploit the complementarity among text, pictures, audio and other multi-modal content to analyse the user emotion they contain. Earlier sentiment analysis was limited to a single modality, such as text sentiment analysis, which only mines and infers the emotion contained in text; multi-modal sentiment analysis must process data from multiple modalities, which brings many challenges. The multi-modal sentiment analysis task is an important research topic in social computing and sentiment analysis, and has become a research hotspot in recent years.
Among existing approaches, the VistaNet model and the Huang model use the picture feature vector to guide the generation of the text feature vector. Their main innovation is to alleviate the inconsistency of the two modal vector spaces, i.e. data heterogeneity, so that the direct associations between text and pictures can be learned implicitly, giving a better fusion effect than earlier models. However, these models only use the picture feature vector to guide the text feature vector; although implicit fusion is achieved, the picture feature vector used for guidance is still heterogeneous with respect to the text feature vector, which limits the fusion effect. Deep multi-modal fusion models first use two independent attention modules to extract text and picture features separately, then use a multi-modal attention module to fuse them and perform the final emotion classification. However, such models do not fully consider the role of the text modality, and their emotion classification performance is mediocre.
Disclosure of Invention
The invention provides a method and a system for image-text emotion classification based on heterogeneous fusion and symmetric translation, aiming to solve the problems of existing image-text emotion classification methods: insufficient feature extraction capability, modal heterogeneity, a single image-text fusion step, and insufficient consideration of the text modality.
In order to achieve the above object, according to a first aspect of the present invention, there is provided an image-text emotion classification method based on heterogeneous fusion and symmetric translation, the method comprising:
inputting texts and pictures, and obtaining the emotion classification through an emotion classification model, wherein the training method of the emotion classification model comprises the following steps:
S1, extracting features from the text and picture data in the data set to obtain the feature vector representations of the words in the text and the feature vector representations of the pictures respectively;
S2, encoding the word feature vectors based on the attention mechanism to obtain sentence feature vectors;
S3, based on a Transformer architecture, setting the sentences in the text and the pictures as source modality and target modality respectively for encoding, and splicing the fusion vectors output by the Transformer encoders to serve as a guide vector;
S4, encoding the sentence feature vectors based on the attention mechanism and the guide vector to obtain a text feature vector;
S5, splicing the output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation, performing emotion classification, and adjusting the parameters of the emotion classification model.
Further, in step S1, the feature vector representation x_{i,t} of each word is obtained using the BERT model, and the feature vector representation P_j of each picture is obtained using VGG16.
Further, the encoding process in step S2 comprises the following steps:
S21, for the selected word feature vectors x_{i,t}, encoding x_{i,t} with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the final hidden-layer vector h_{i,t};
S22, applying an attention mechanism to the hidden-layer vectors h_{i,t} and then performing normalization:

v_{i,t} = V · tanh(W_v h_{i,t} + b_v)
α_{i,t} = exp(v_{i,t}) / Σ_{t=1}^{C} exp(v_{i,t})
s_i = Σ_{t=1}^{C} α_{i,t} h_{i,t}

where V is a randomly initialized matrix, v_{i,t} is the weight value of the word for the whole sentence, tanh is the activation function, exp is the exponential function, W_v and b_v are randomly initialized values adjusted automatically during training, α_{i,t} is the normalized weight of the word for the whole sentence, and s_i is the resulting sentence feature vector.
Further, in step S3 the Transformer encoders encode as follows:

ε_{s→p} = f_{s→p}(X_s)
ε_{p→s} = f_{p→s}(X_p)
ε = ε_{s→p} ⊕ ε_{p→s}

where X_s denotes the text as source modality, X_p denotes the pictures as source modality, ε_{s→p} and ε_{p→s} are the two output vectors of the encoders, f denotes the activation function of the Transformer architecture, ⊕ denotes the splicing operation, and ε denotes the guide vector.
Further, in step S4, encoding the sentence feature vectors comprises the following steps:
S41, for the sentence feature vectors s_i, encoding s_i with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the sentence hidden-layer vectors h_i;
S42, nonlinearly mapping the guide vector ε and the sentence hidden-layer vectors h_i into the same vector space using the tanh activation function, thereby obtaining the feature vector representation f of the guide vector generated after image-text fusion and the text feature vector representations g_i;
S43, based on the attention mechanism, taking the inner product of f and g_i and, to take more text information into account, separately adding g_i, to obtain the vector u_i representing the attention weight of each sentence, computed as:

u_i = U · (f ⊙ g_i + g_i)

where U is a randomly initialized parameter matrix and ⊙ denotes the inner product operation;
S44, normalizing the weight vectors u_i, and taking the weighted sum of the normalized weights γ_i and the sentence hidden-layer vectors h_i to obtain the text feature vector d, where h_i denotes the sentence hidden-layer vector and γ_i denotes the normalized weight of the i-th sentence with respect to the whole text.
Further, in step S5, the obtained guide vector and the text feature vector are first spliced to obtain the final vector representation, the final vector representation is then classified using a fully connected network, and the parameters of the emotion classification model are adjusted by a back-propagation algorithm.
According to a second aspect of the invention, an image-text emotion classification system based on heterogeneous fusion and symmetric translation is provided. The system comprises an emotion classification model and a training module, wherein the training module is used for inputting the pictures and texts in a data set into the emotion classification model, comparing the emotion classifications corresponding to the pictures and texts in the data set with the classifications output by the emotion classification model, and adjusting the parameters of the emotion classification model through a back-propagation algorithm.
The emotion classification model comprises: a modal feature extraction module for extracting features from the data of the two modalities in the data set to obtain the feature vectors of the words in the text and the feature vectors of the pictures respectively;
a heterogeneous fusion attention module for generating the feature vector of each sentence based on the attention mechanism combined with the influence of each word on the emotion polarity of the sentence, and generating the feature vector of the text based on the attention mechanism combined with the influence of each sentence on the emotion polarity of the text;
a symmetric translation and fusion module for setting the sentences and pictures as source modality and target modality respectively, translating the source modality into the target modality using a Transformer, and taking the encoder output as the vector representation after the two modalities are fused;
and an emotion classification module for splicing the output vectors of the two Transformer encoders and the text feature vector to obtain the final vector representation and outputting the emotion classification through a fully connected network.
Compared with existing image-text emotion classification methods and systems, the image-text emotion classification method and system based on heterogeneous fusion and symmetric translation have the following beneficial effects:
1. The invention gives fuller consideration to the role of the text modality, using the heterogeneous fusion attention mechanism to take into account the influence of each word in a sentence on the emotion polarity and the influence of each sentence in the text on the emotion polarity.
2. The invention uses the vector produced by the fusion in the symmetric translation module as the guide, which solves the heterogeneity between the guide vector and the text vector, and at the same time considers the fusion of text and pictures during text-vector generation rather than relying on a single fusion step, thereby further improving the image-text fusion effect.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flowchart of an image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an image-text emotion classification method based on heterogeneous fusion and symmetric translation according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S2 of the image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating step S3 of the image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating step S4 of the image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an image-text emotion classification system based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides an image-text emotion classification method based on heterogeneous fusion and symmetric translation, which comprises the following steps:
inputting texts and pictures, and obtaining the emotion classification through an emotion classification model, wherein the training method of the emotion classification model comprises the following steps:
S1, extract features from the text and picture data in the data set to obtain the word feature vectors x_{i,t} and the picture feature vectors P_j, where i denotes the i-th sentence in the selected text, t denotes the t-th word in the i-th sentence (the i-th sentence is assumed to contain C words in total), and j denotes the j-th picture in the sample.
S2, based on the attention mechanism, encode the selected word feature vectors x_{i,t} to obtain the feature vector s_i of each sentence, where i denotes the i-th sentence in the selected text and the text is assumed to contain L sentences in total.
S3, based on the Transformer, set the sentence feature vectors s_i and the picture feature vectors P_j as source modality and target modality respectively for encoding, and splice the encoder outputs ε_{s→p} and ε_{p→s} into the guide vector ε.
S4, based on the attention mechanism and the guide vector ε, encode the sentence feature vectors s_i to obtain the text feature vector d.
S5, splice the output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation, perform emotion classification, and adjust the parameters of the emotion classification model.
In step S1, the feature vector representation x_{i,t} of each word is obtained using the BERT model, and the feature vector representation P_j of each picture is obtained using VGG16.
The specific process is as follows: feature extraction is performed on the input text and picture data to obtain the word feature vectors and the picture feature vectors respectively.
The word feature vectors are denoted by x_{i,t} and the picture feature vectors by P_j, where i denotes the i-th sentence in the selected text, t denotes the t-th word in the i-th sentence (the i-th sentence is assumed to contain C words in total), and j denotes the j-th picture in the sample. Taking the i-th sentence "They have a large selection of words to a white area selectable" as an example, the word vector x_{i,t} is the vector representation of the t-th word in that sentence.
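By way of illustration only, the following minimal sketch shows how this extraction step could be realized with the Hugging Face Transformers BERT model and the torchvision VGG16 model; the function names, the pretrained checkpoints, and the use of the penultimate VGG16 layer as the picture feature are assumptions of this sketch rather than specifics of the invention.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
vgg16 = models.vgg16(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def word_vectors(sentence: str) -> torch.Tensor:
    """x_{i,t}: one contextual vector per token of the i-th sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    return bert(**inputs).last_hidden_state.squeeze(0)        # (num_tokens, 768)

@torch.no_grad()
def picture_vector(path: str) -> torch.Tensor:
    """P_j: a single feature vector for the j-th picture (penultimate VGG16 layer)."""
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    feats = torch.flatten(vgg16.avgpool(vgg16.features(img)), 1)
    return vgg16.classifier[:5](feats).squeeze(0)             # (4096,)
```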
In step S2, during encoding, each input word feature vector x_{i,t} is processed by a bidirectional LSTM in both the forward and backward directions, and the hidden-layer vectors of the two directions at each time step are spliced to obtain the final hidden-layer output h_{i,t}.
Then an attention mechanism is applied to the vectors encoded by the bidirectional LSTM in order to obtain the weight of each word with respect to the whole sentence; this weight value represents the influence of the current word on the whole sentence, and once the weights are obtained the final vector representation s_i of the current sentence is computed, as shown in FIG. 3.
For example, for "They have a large selection of words to a white hand area selectable", after encoding is completed a weight of each word with respect to the whole sentence is obtained, and this weight represents the influence of the current word on the emotion polarity of the whole sentence. The specific encoding method comprises the following steps:
S21, for the selected word feature vectors x_{i,t}, encode x_{i,t} with a bidirectional LSTM in both the forward and backward directions:

h^{→}_{i,t} = LSTM^{→}(x_{i,t})
h^{←}_{i,t} = LSTM^{←}(x_{i,t})
h_{i,t} = [h^{→}_{i,t} ; h^{←}_{i,t}]

where x_{i,t} is the feature vector of the t-th word of the i-th sentence, h^{→}_{i,t} and h^{←}_{i,t} are the hidden-layer vectors of x_{i,t} in the left-to-right and right-to-left directions respectively, and h_{i,t} is the final hidden-layer vector obtained by splicing the two. At each time step, the current hidden-layer vector h_{i,t} depends on the hidden-layer vector of the previous time step h_{i,t-1} and on the current input x_{i,t}.
S22, apply the attention mechanism to the hidden-layer vectors h_{i,t}, and then compute the final feature vector representation of the current sentence:

v_{i,t} = V · tanh(W_v h_{i,t} + b_v)
α_{i,t} = exp(v_{i,t}) / Σ_{t=1}^{C} exp(v_{i,t})
s_i = Σ_{t=1}^{C} α_{i,t} h_{i,t}

where V is a randomly initialized matrix, v_{i,t} is the weight value of the word for the whole sentence, tanh is the activation function, exp is the exponential function, W_v and b_v are randomly initialized values adjusted automatically during training, α_{i,t} is the normalized weight of the word for the whole sentence with value range [0, 1], and s_i is the resulting sentence feature vector.
The sentence vector obtained by this encoding takes into account the emotional influence of each word on the whole sentence, rather than being a sentence vector obtained by simple splicing.
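The word-level encoder of steps S21 and S22 can be sketched as follows; this is a minimal PyTorch rendering under assumed dimensions (768-dimensional word vectors, 256 LSTM hidden units), the class name is hypothetical, and V is treated here as a vector so that each word receives a scalar score.

```python
import torch
import torch.nn as nn

class WordAttentionEncoder(nn.Module):
    """Encodes the word vectors x_{i,t} of one sentence into a sentence vector s_i."""
    def __init__(self, word_dim=768, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.W_v = nn.Linear(2 * hidden_dim, 2 * hidden_dim)    # W_v and b_v
        self.V = nn.Parameter(torch.randn(2 * hidden_dim))      # randomly initialized V

    def forward(self, x):                            # x: (batch, C words, word_dim)
        h, _ = self.bilstm(x)                        # h_{i,t}: (batch, C, 2*hidden_dim)
        v = torch.tanh(self.W_v(h)) @ self.V         # v_{i,t} = V · tanh(W_v h + b_v)
        alpha = torch.softmax(v, dim=1)              # normalized word weights α_{i,t}
        return (alpha.unsqueeze(-1) * h).sum(dim=1)  # s_i = Σ_t α_{i,t} h_{i,t}
```

Applied to batched BERT outputs of shape (batch, C, 768), this sketch yields one 512-dimensional sentence vector s_i per sentence.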
In step S3, the sentence feature vectors and the picture feature vectors of the selected text are set as source modality and target modality respectively, and the source modality is then translated into the target modality using an encoder-decoder framework. The two output vectors of the encoders are spliced to serve as the guide vector that guides the generation of the text vector in the next step, which resolves the heterogeneity between the guide vector and the text vector; at the same time, the fusion of text and pictures is considered again during text-vector generation, further improving the image-text fusion effect, as shown in FIG. 4. Translation here refers to encoding and decoding, and the process is as follows:
(1) Determine the sentence vectors and picture vectors: the sentence feature vectors s_i obtained by the encoding in S2 and the picture feature vectors P_j obtained in S1.
(2) Compute the fusion vectors: two Transformers set the two kinds of feature vectors as source modality and target modality respectively for encoding, yielding the fusion vectors ε_{s→p} and ε_{p→s}:

ε_{s→p} = f_{s→p}(X_s)
ε_{p→s} = f_{p→s}(X_p)

where X_s denotes the text as source modality, X_p denotes the pictures as source modality, ε_{s→p} and ε_{p→s} are the fusion vectors output by the encoders, and f denotes the activation function in the Transformer architecture.
(3) Obtain the guide vector: splice the two fusion vectors to obtain the guide vector ε:

ε = ε_{s→p} ⊕ ε_{p→s}

where ⊕ denotes the splicing operation.
(4) Through steps (1) to (3), the image-text fusion vector ε is obtained and is used to guide the work of the next step.
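A possible sketch of the symmetric translation step is given below. It uses two independent standard Transformer encoders and mean-pools their outputs into ε_{s→p} and ε_{p→s}; the patent describes a translation (encoding and decoding) of which only the encoder output is kept, so the decoder side and its training objective are omitted here, and it is assumed that both modalities have already been projected to a common dimension.

```python
import torch
import torch.nn as nn

class SymmetricTranslationFusion(nn.Module):
    """Produces eps_{s->p}, eps_{p->s} and the spliced guide vector eps."""
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        self.text_to_pic = nn.TransformerEncoder(            # f_{s->p}: text as source modality
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)
        self.pic_to_text = nn.TransformerEncoder(            # f_{p->s}: pictures as source modality
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)

    def forward(self, sent_vecs, pic_vecs):
        # sent_vecs: (batch, L sentences, dim); pic_vecs: (batch, J pictures, dim)
        eps_sp = self.text_to_pic(sent_vecs).mean(dim=1)     # eps_{s->p}
        eps_ps = self.pic_to_text(pic_vecs).mean(dim=1)      # eps_{p->s}
        guide = torch.cat([eps_sp, eps_ps], dim=-1)          # eps = eps_{s->p} ⊕ eps_{p->s}
        return eps_sp, eps_ps, guide
```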
In step S4, the sentence feature vectors s_i are combined with the guide vector ε to generate the text feature vector d, as shown in FIG. 5. The specific process is as follows:
S41, encode the sentence vectors s_i: use a bidirectional LSTM to encode s_i in both the forward and backward directions:

h^{→}_i = LSTM^{→}(s_i)
h^{←}_i = LSTM^{←}(s_i)
h_i = [h^{→}_i ; h^{←}_i]

where s_i is the feature vector of the i-th sentence, h^{→}_i and h^{←}_i are the hidden-layer vectors of the i-th sentence in the left-to-right and right-to-left directions, and h_i is the final hidden-layer vector obtained by splicing the two hidden-layer vectors produced by the bidirectional LSTM.
S42, encode the hidden-layer vectors and the guide vector: nonlinearly map the guide vector and the sentence hidden-layer vectors into the same vector space using the tanh activation function:

f = tanh(W_f ε + b_f)
g_i = tanh(W_g h_i + b_g)

where W_f, W_g, b_f and b_g are randomly initialized values adjusted automatically during training, f is the feature vector representation of the guide vector generated after image-text fusion, and g_i is the text feature vector representation.
S43, compute the sentence attention weight vector u_i: take the inner product of f and g_i and, to take more text information into account, add g_i separately, obtaining the vector u_i representing the attention weight of each sentence:

u_i = U · (f ⊙ g_i + g_i)

where U is a parameter matrix, and the inner product of f and g_i expresses the similarity between the two. Because the inner product of f and g_i only reflects their interaction and not the information of the text itself, g_i is added so that the influence of the text is considered more fully. The vector u_i obtained above represents the attention weight of the i-th sentence; the larger the weight, the greater the influence of that sentence on the emotion of the text.
S44, normalize the weight vectors u_i to obtain the text representation: first normalize the weight vectors u_i, then take the weighted sum of the normalized weights γ_i and the sentence hidden-layer vectors h_i to obtain the text feature vector d:

γ_i = exp(u_i) / Σ_{i=1}^{L} exp(u_i)
d = Σ_{i=1}^{L} γ_i h_i

where γ_i is the attention score; the weighted sum of the attention scores and the respective sentence hidden-layer vectors gives the final text vector representation, which serves as one of the sources of the emotion classification representation vector. This text vector contains not only the information of the text itself but also the information of the fusion of text and pictures, because in this second-layer attention mechanism the fusion vector ε generated by the symmetric translation is used to guide the generated attention weights.
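The guided sentence-level encoder of steps S41 to S44 can be sketched as follows; the class name and dimensions are hypothetical, and U is treated here as a vector so that each sentence receives a scalar attention score.

```python
import torch
import torch.nn as nn

class GuidedSentenceEncoder(nn.Module):
    """Encodes sentence vectors s_i into the text vector d, guided by the fusion vector eps."""
    def __init__(self, sent_dim=512, guide_dim=1024, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.W_f = nn.Linear(guide_dim, 2 * hidden_dim)        # f   = tanh(W_f eps + b_f)
        self.W_g = nn.Linear(2 * hidden_dim, 2 * hidden_dim)   # g_i = tanh(W_g h_i + b_g)
        self.U = nn.Parameter(torch.randn(2 * hidden_dim))     # treated as a vector here

    def forward(self, sent_vecs, guide):
        # sent_vecs: (batch, L, sent_dim); guide: (batch, guide_dim)
        h, _ = self.bilstm(sent_vecs)                 # h_i: (batch, L, 2*hidden_dim)
        f = torch.tanh(self.W_f(guide)).unsqueeze(1)  # broadcast over the L sentences
        g = torch.tanh(self.W_g(h))
        u = (f * g + g) @ self.U                      # u_i = U · (f ⊙ g_i + g_i)
        gamma = torch.softmax(u, dim=1)               # normalized sentence weights γ_i
        return (gamma.unsqueeze(-1) * h).sum(dim=1)   # d = Σ_i γ_i h_i
```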
In step S5, the output vectors of the two Transformer encoders and the text feature vector are first spliced to obtain the final vector representation, which is then classified using a fully connected network. The specific process is as follows:

a = concat(d, ε_{s→p}, ε_{p→s})
ŷ = Linear(a)

where d is the text feature vector, ε_{s→p} and ε_{p→s} are the two output vectors of the encoders, concat denotes the splicing operation, Linear denotes the fully connected network, and ŷ is the finally obtained emotion classification result. The parameters of the emotion classification model are then adjusted by comparing ŷ with the classification labels that accompany the image-text samples and applying the back-propagation algorithm.
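As an illustrative sketch of this splicing, classification and back-propagation step: the number of emotion classes, the input dimensions, the loss function and the optimizer handling are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

text_dim, enc_dim, num_classes = 512, 512, 3            # assumed dimensions and class count
classifier = nn.Linear(text_dim + 2 * enc_dim, num_classes)   # fully connected network
criterion = nn.CrossEntropyLoss()

def emotion_logits(d, eps_sp, eps_ps):
    """a = concat(d, eps_{s->p}, eps_{p->s}); y_hat = Linear(a)."""
    a = torch.cat([d, eps_sp, eps_ps], dim=-1)
    return classifier(a)

def train_step(optimizer, d, eps_sp, eps_ps, labels):
    """One step: compare predictions with the labels carried by the image-text samples
    and adjust the parameters by back-propagation."""
    logits = emotion_logits(d, eps_sp, eps_ps)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```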
The invention also provides an image-text emotion classification system based on heterogeneous fusion and symmetric translation, which comprises a training module and an emotion classification model, wherein the training module is used for inputting the pictures and texts in the data set into the emotion classification model, comparing the emotion classifications corresponding to the pictures and texts in the data set with the classifications output by the emotion classification model, and adjusting the parameters of the emotion classification model through a back-propagation algorithm.
The emotion classification model comprises a modal feature extraction module, which extracts features from the data of the two modalities in the data set to obtain the feature vectors of the words in the text and the feature vectors of the pictures respectively; see step S1 above.
The heterogeneous fusion attention module generates the feature vector of each sentence based on the attention mechanism combined with the influence of each word on the emotion polarity of the sentence, and generates the feature vector of the text based on the attention mechanism combined with the influence of each sentence on the emotion polarity of the text; see steps S2 and S4 above.
The symmetric translation and fusion module sets the sentences and pictures as source modality and target modality respectively, translates the source modality into the target modality using a Transformer, and takes the encoder output as the vector representation after the two modalities are fused; see step S3 above.
The emotion classification module splices the two output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation for emotion classification; see step S5 above.
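Putting the modules together, a hypothetical top-level composition might look like the sketch below; the class name and the wrapped module interfaces are placeholders for the modules described above, not names defined by the invention, and the reshaping of per-sentence word vectors is assumed to be handled inside the wrapper modules.

```python
import torch
import torch.nn as nn

class ImageTextEmotionModel(nn.Module):
    """Hypothetical composition of the modules described above."""
    def __init__(self, feature_extractor, word_encoder, sentence_encoder,
                 symmetric_translation, fused_dim=1536, num_classes=3):
        super().__init__()
        self.feature_extractor = feature_extractor            # BERT + VGG16 wrappers (step S1)
        self.word_encoder = word_encoder                      # word-level attention (step S2)
        self.symmetric_translation = symmetric_translation    # Transformer fusion (step S3)
        self.sentence_encoder = sentence_encoder              # guided sentence attention (step S4)
        self.classifier = nn.Linear(fused_dim, num_classes)   # fully connected network (step S5)

    def forward(self, text_inputs, image_inputs):
        # word_vecs assumed grouped per sentence by the wrapper; pic_vecs one per picture
        word_vecs, pic_vecs = self.feature_extractor(text_inputs, image_inputs)
        sent_vecs = self.word_encoder(word_vecs)              # s_i
        eps_sp, eps_ps, guide = self.symmetric_translation(sent_vecs, pic_vecs)
        d = self.sentence_encoder(sent_vecs, guide)           # text feature vector d
        return self.classifier(torch.cat([d, eps_sp, eps_ps], dim=-1))
```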
Technical contents not described in detail in the present invention belong to the well-known techniques of those skilled in the art.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the present invention as defined by the appended claims, all matters utilizing the inventive concept are protected.

Claims (7)

1. An image-text emotion classification method based on heterogeneous fusion and symmetric translation, characterized by comprising the following steps:
inputting texts and pictures, and obtaining the emotion classification through an emotion classification model, wherein the training method of the emotion classification model comprises the following steps:
S1, extracting features from the text and picture data in the data set to obtain the feature vector representations of the words in the text and the feature vector representations of the pictures respectively;
S2, encoding the word feature vectors based on the attention mechanism to obtain sentence feature vectors;
S3, based on a Transformer architecture, setting the sentences in the text and the pictures as source modality and target modality respectively for encoding, and splicing the fusion vectors output by the Transformer encoders to serve as a guide vector;
S4, encoding the sentence feature vectors based on the attention mechanism and the guide vector to obtain a text feature vector;
S5, splicing the output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation, performing emotion classification, and adjusting the parameters of the emotion classification model.
2. The method according to claim 1, wherein in step S1 the feature vector representation x_{i,t} of each word is obtained using the BERT model, and the feature vector representation P_j of each picture is obtained using VGG16.
3. The method according to claim 1, wherein the encoding process in step S2 comprises the following steps:
S21, for the selected word feature vectors x_{i,t}, encoding x_{i,t} with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the final hidden-layer vector h_{i,t};
S22, applying an attention mechanism to the hidden-layer vectors h_{i,t} and then performing normalization:

v_{i,t} = V · tanh(W_v h_{i,t} + b_v)
α_{i,t} = exp(v_{i,t}) / Σ_{t=1}^{C} exp(v_{i,t})
s_i = Σ_{t=1}^{C} α_{i,t} h_{i,t}

where V is a randomly initialized matrix, v_{i,t} is the weight value of the word for the whole sentence, tanh is the activation function, exp is the exponential function, W_v and b_v are randomly initialized values adjusted automatically during training, α_{i,t} is the normalized weight of the word for the whole sentence, and s_i is the resulting sentence feature vector.
4. The method according to claim 1, wherein in step S3 the Transformer encoders encode as follows:

ε_{s→p} = f_{s→p}(X_s)
ε_{p→s} = f_{p→s}(X_p)
ε = ε_{s→p} ⊕ ε_{p→s}

where X_s denotes the text as source modality, X_p denotes the pictures as source modality, ε_{s→p} and ε_{p→s} are the two output vectors of the encoders, f denotes the activation function of the Transformer architecture, ⊕ denotes the splicing operation, and ε denotes the guide vector.
5. The method according to claim 2, wherein in step S4 encoding the sentence feature vectors comprises the following steps:
S41, for the sentence feature vectors s_i, encoding s_i with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the sentence hidden-layer vectors h_i;
S42, nonlinearly mapping the guide vector ε and the sentence hidden-layer vectors h_i into the same vector space using the tanh activation function, thereby obtaining the feature vector representation f of the guide vector generated after image-text fusion and the text feature vector representations g_i;
S43, based on the attention mechanism, performing an inner product of f and g_i and, to take more text information into account, separately adding g_i, to obtain the vector u_i representing the attention weight of each sentence, computed as:

u_i = U · (f ⊙ g_i + g_i)

where U is a randomly initialized parameter matrix and ⊙ denotes the inner product operation;
S44, normalizing the weight vectors u_i, and taking the weighted sum of the normalized weights γ_i and the sentence hidden-layer vectors h_i to obtain the text feature vector d, where h_i denotes the sentence hidden-layer vector and γ_i denotes the normalized weight of the i-th sentence with respect to the whole text.
6. The method according to claim 1, wherein in step S5 the obtained guide vector and the text feature vector are first spliced to obtain the final vector representation, the final vector representation is then classified using a fully connected network, and the parameters of the emotion classification model are adjusted by a back-propagation algorithm.
7. An image-text emotion classification system based on heterogeneous fusion and symmetric translation, characterized by comprising an emotion classification model and a training module, wherein the training module is used for inputting the pictures and texts in a data set into the emotion classification model, comparing the emotion classifications corresponding to the pictures and texts in the data set with the classifications output by the emotion classification model, and adjusting the parameters of the emotion classification model through a back-propagation algorithm;
the emotion classification model comprises: a modal feature extraction module for extracting features from the data of the two modalities in the data set to obtain the feature vectors of the words in the text and the feature vectors of the pictures respectively;
a heterogeneous fusion attention module for generating the feature vector of each sentence based on the attention mechanism combined with the influence of each word on the emotion polarity of the sentence, and generating the feature vector of the text based on the attention mechanism combined with the influence of each sentence on the emotion polarity of the text;
a symmetric translation and fusion module for setting the sentences and pictures as source modality and target modality respectively, translating the source modality into the target modality using a Transformer, and taking the encoder output as the vector representation after the two modalities are fused;
and an emotion classification module for splicing the two output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation and outputting the emotion classification through a fully connected network.
CN202210580293.XA 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation Pending CN114969338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210580293.XA CN114969338A (en) 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210580293.XA CN114969338A (en) 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Publications (1)

Publication Number Publication Date
CN114969338A true CN114969338A (en) 2022-08-30

Family

ID=82955929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210580293.XA Pending CN114969338A (en) 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Country Status (1)

Country Link
CN (1) CN114969338A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116039653A (en) * 2023-03-31 2023-05-02 小米汽车科技有限公司 State identification method, device, vehicle and storage medium
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention


Similar Documents

Publication Publication Date Title
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
Poria et al. Context-dependent sentiment analysis in user-generated videos
US11379736B2 (en) Machine comprehension of unstructured text
CN109447242B (en) Image description regeneration system and method based on iterative learning
Gan et al. Scalable multi-channel dilated CNN–BiLSTM model with attention mechanism for Chinese textual sentiment analysis
CN111881262B (en) Text emotion analysis method based on multi-channel neural network
JP2023509031A (en) Translation method, device, device and computer program based on multimodal machine learning
CN111680159B (en) Data processing method and device and electronic equipment
CN114969338A (en) Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation
CN110688832B (en) Comment generation method, comment generation device, comment generation equipment and storage medium
CN113657115B (en) Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion
CN113407663B (en) Image-text content quality identification method and device based on artificial intelligence
CN111311364B (en) Commodity recommendation method and system based on multi-mode commodity comment analysis
CN114201605A (en) Image emotion analysis method based on joint attribute modeling
CN113420212A (en) Deep feature learning-based recommendation method, device, equipment and storage medium
Chen et al. Multimodal detection of hateful memes by applying a vision-language pre-training model
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
Zaoad et al. An attention-based hybrid deep learning approach for bengali video captioning
Cao et al. Visual question answering research on multi-layer attention mechanism based on image target features
Yang et al. Fast RF-UIC: a fast unsupervised image captioning model
CN113569584A (en) Text translation method and device, electronic equipment and computer readable storage medium
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN113673222A (en) Social media text fine-grained emotion analysis method based on bidirectional collaborative network
Eunice et al. Deep learning and sign language models based enhanced accessibility of e-governance services for speech and hearing-impaired

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination