CN114969338A - Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation - Google Patents

Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Info

Publication number
CN114969338A
Authority
CN
China
Prior art keywords
vector
text
sentence
vectors
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210580293.XA
Other languages
Chinese (zh)
Inventor
孙新
李瑾仪
任翔渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210580293.XA priority Critical patent/CN114969338A/en
Publication of CN114969338A publication Critical patent/CN114969338A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/55 Rule-based translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The image-text emotion classification method comprises: extracting features from the original image-text data to obtain feature vectors for the words and the pictures; fusing the text and the pictures with Transformer encoders and splicing their output vectors into a guide vector; encoding the word feature vectors with an attention mechanism to obtain sentence feature vectors; encoding the sentence feature vectors with the attention mechanism and the guide vector to obtain a text feature vector; and combining the output vectors of the two encoders with the text feature vector for emotion classification. The image-text emotion classification method and system provided by the invention take fuller account of the role of the text modality and of the influence of each sentence on the emotion of the text, and use the vector fused by the symmetric translation module to guide the generation of the text vector. This resolves the heterogeneity between the guide vector and the text vector, and because the fusion of text and pictures is considered again during text-vector generation rather than in a single fusion step, the image-text fusion effect is further improved.

Description

Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation
Technical Field
The invention relates to the technical field of natural language processing and deep learning, in particular to a method and a system for classifying image-text emotions based on heterogeneous fusion and symmetric translation.
Background
With the rapid development of the internet and computer science, a large number of social platforms and commercial websites have emerged, such as Weibo, Douyin, Xiaohongshu and Meituan; video websites such as Bilibili and shopping platforms such as Taobao are widely used, greatly changing how people live and communicate. The ways in which people express themselves on these platforms are increasingly diverse: users can convey their views through pictures and videos, for example by uploading their own videos to Bilibili with some accompanying text, or by reviewing products they have used on shopping websites with pictures and text. Consumers can decide whether to buy based on existing product reviews, while merchants can judge users' preferences for a product and its popularity from those reviews and decide how to serve users better.
Each source or form of information can be called a modality; text, pictures and audio are the three most common modalities in practice. With the growing amount of multi-modal content on the internet, the task of multi-modal sentiment analysis has arisen. Multi-modal sentiment analysis aims to exploit the complementarity among text, pictures, audio and other multi-modal content to analyse the user emotion they contain. Earlier sentiment analysis was limited to a single modality, such as text sentiment analysis, which only mines and infers the emotion contained in text; multi-modal sentiment analysis must process data from multiple modalities, which brings many challenges. The multi-modal sentiment analysis task is an important research topic in social computing and sentiment analysis, and has become a research hotspot in recent years.
Among existing approaches, the VistaNet model and the Huang model use the picture feature vector to guide the generation of the text feature vector. Their main innovation is to alleviate the inconsistency of the two modal vector spaces, i.e. data heterogeneity, so that the direct associations between text and pictures can be learned implicitly, giving a better fusion effect than earlier models. However, these models only use the picture feature vector to guide the text feature vector; although implicit fusion is achieved, the picture feature vector used for guidance is still heterogeneous with respect to the text feature vector, which limits the fusion effect. Deep multi-modal fusion models first use two independent attention modules to extract text and picture features separately, then use a multi-modal attention module to fuse them and perform the final emotion classification. However, such models do not fully consider the role of the text modality, and their emotion classification performance is mediocre.
Disclosure of Invention
The invention provides a method and a system for image-text emotion classification based on heterogeneous fusion and symmetric translation, aiming to solve the problems of existing image-text emotion classification methods: insufficient feature extraction capability, modal heterogeneity, a single image-text fusion step, and insufficient consideration of the text modality.
In order to achieve the above object, according to a first aspect of the present invention, there is provided an image-text emotion classification method based on heterogeneous fusion and symmetric translation, the method comprising:
inputting texts and pictures, and obtaining the emotion classification through an emotion classification model, wherein the training method of the emotion classification model comprises the following steps:
S1, extracting features from the text and picture data in the data set to obtain the feature vector representations of the words in the text and the feature vector representations of the pictures respectively;
S2, encoding the word feature vectors based on the attention mechanism to obtain sentence feature vectors;
S3, based on a Transformer architecture, setting the sentences in the text and the pictures as source modality and target modality respectively for encoding, and splicing the fusion vectors output by the Transformer encoders to serve as a guide vector;
S4, encoding the sentence feature vectors based on the attention mechanism and the guide vector to obtain a text feature vector;
S5, splicing the output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation, performing emotion classification, and adjusting the parameters of the emotion classification model.
Further, in step S1, the feature vector representation x_{i,t} of each word is obtained using the BERT model, and the feature vector representation P_j of each picture is obtained using VGG16.
Further, the encoding process in step S2 comprises the following steps:
S21, for the selected word feature vectors x_{i,t}, encoding x_{i,t} with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the final hidden-layer vector h_{i,t};
S22, applying an attention mechanism to the hidden-layer vectors h_{i,t} and then performing normalization:

v_{i,t} = V · tanh(W_v h_{i,t} + b_v)
α_{i,t} = exp(v_{i,t}) / Σ_{t=1}^{C} exp(v_{i,t})
s_i = Σ_{t=1}^{C} α_{i,t} h_{i,t}

where V is a randomly initialized matrix, v_{i,t} is the weight value of the word for the whole sentence, tanh is the activation function, exp is the exponential function, W_v and b_v are randomly initialized values adjusted automatically during training, α_{i,t} is the normalized weight of the word for the whole sentence, and s_i is the resulting sentence feature vector.
Further, in step S3 the Transformer encoders encode as follows:

ε_{s→p} = f_{s→p}(X_s)
ε_{p→s} = f_{p→s}(X_p)
ε = ε_{s→p} ⊕ ε_{p→s}

where X_s denotes the text as source modality, X_p denotes the pictures as source modality, ε_{s→p} and ε_{p→s} are the two output vectors of the encoders, f denotes the activation function of the Transformer architecture, ⊕ denotes the splicing operation, and ε denotes the guide vector.
Further, in step S4, encoding the sentence feature vectors comprises the following steps:
S41, for the sentence feature vectors s_i, encoding s_i with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the sentence hidden-layer vectors h_i;
S42, nonlinearly mapping the guide vector ε and the sentence hidden-layer vectors h_i into the same vector space using the tanh activation function, thereby obtaining the feature vector representation f of the guide vector generated after image-text fusion and the text feature vector representations g_i;
S43, based on the attention mechanism, taking the inner product of f and g_i and, to take more text information into account, separately adding g_i, to obtain the vector u_i representing the attention weight of each sentence, computed as:

u_i = U · (f ⊙ g_i + g_i)

where U is a randomly initialized parameter matrix and ⊙ denotes the inner product operation;
S44, normalizing the weight vectors u_i, and taking the weighted sum of the normalized weights γ_i and the sentence hidden-layer vectors h_i to obtain the text feature vector d, where h_i denotes the sentence hidden-layer vector and γ_i denotes the normalized weight of the i-th sentence with respect to the whole text.
Further, in step S5, the obtained guide vector and the text feature vector are first spliced to obtain the final vector representation, the final vector representation is then classified using a fully connected network, and the parameters of the emotion classification model are adjusted by a back-propagation algorithm.
According to a second aspect of the invention, an image-text emotion classification system based on heterogeneous fusion and symmetric translation is provided. The system comprises an emotion classification model and a training module, wherein the training module is used for inputting the pictures and texts in a data set into the emotion classification model, comparing the emotion classifications corresponding to the pictures and texts in the data set with the classifications output by the emotion classification model, and adjusting the parameters of the emotion classification model through a back-propagation algorithm.
The emotion classification model comprises: a modal feature extraction module for extracting features from the data of the two modalities in the data set to obtain the feature vectors of the words in the text and the feature vectors of the pictures respectively;
a heterogeneous fusion attention module for generating the feature vector of each sentence based on the attention mechanism combined with the influence of each word on the emotion polarity of the sentence, and generating the feature vector of the text based on the attention mechanism combined with the influence of each sentence on the emotion polarity of the text;
a symmetric translation and fusion module for setting the sentences and pictures as source modality and target modality respectively, translating the source modality into the target modality using a Transformer, and taking the encoder output as the vector representation after the two modalities are fused;
and an emotion classification module for splicing the output vectors of the two Transformer encoders and the text feature vector to obtain the final vector representation and outputting the emotion classification through a fully connected network.
Compared with existing image-text emotion classification methods and systems, the image-text emotion classification method and system based on heterogeneous fusion and symmetric translation have the following beneficial effects:
1. The invention gives fuller consideration to the role of the text modality, using the heterogeneous fusion attention mechanism to take into account the influence of each word in a sentence on the emotion polarity and the influence of each sentence in the text on the emotion polarity.
2. The invention uses the vector produced by the fusion in the symmetric translation module as the guide, which solves the heterogeneity between the guide vector and the text vector, and at the same time considers the fusion of text and pictures during text-vector generation rather than relying on a single fusion step, thereby further improving the image-text fusion effect.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flowchart of an image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an image-text emotion classification method based on heterogeneous fusion and symmetric translation according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S2 of the image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating step S3 of the image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating step S4 of the image-text emotion classification method based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an image-text emotion classification system based on heterogeneous fusion and symmetric translation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides an image-text emotion classification method based on heterogeneous fusion and symmetric translation, which comprises the following steps:
inputting texts and pictures, and obtaining the emotion classification through an emotion classification model, wherein the training method of the emotion classification model comprises the following steps:
S1, extract features from the text and picture data in the data set to obtain the word feature vectors x_{i,t} and the picture feature vectors P_j, where i denotes the i-th sentence in the selected text, t denotes the t-th word in the i-th sentence (the i-th sentence is assumed to contain C words in total), and j denotes the j-th picture in the sample.
S2, based on the attention mechanism, encode the selected word feature vectors x_{i,t} to obtain the feature vector s_i of each sentence, where i denotes the i-th sentence in the selected text and the text is assumed to contain L sentences in total.
S3, based on the Transformer, set the sentence feature vectors s_i and the picture feature vectors P_j as source modality and target modality respectively for encoding, and splice the encoder outputs ε_{s→p} and ε_{p→s} into the guide vector ε.
S4, based on the attention mechanism and the guide vector ε, encode the sentence feature vectors s_i to obtain the text feature vector d.
S5, splice the output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation, perform emotion classification, and adjust the parameters of the emotion classification model.
In step S1, the feature vector representation x_{i,t} of each word is obtained using the BERT model, and the feature vector representation P_j of each picture is obtained using VGG16.
The specific process is as follows: feature extraction is performed on the input text and picture data to obtain the word feature vectors and the picture feature vectors respectively.
The word feature vectors are denoted by x_{i,t} and the picture feature vectors by P_j, where i denotes the i-th sentence in the selected text, t denotes the t-th word in the i-th sentence (the i-th sentence is assumed to contain C words in total), and j denotes the j-th picture in the sample. Taking the i-th sentence "They have a large selection of words to a white area selectable" as an example, the word vector x_{i,t} is the vector representation of the t-th word in that sentence.
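By way of illustration only, the following minimal sketch shows how this extraction step could be realized with the Hugging Face Transformers BERT model and the torchvision VGG16 model; the function names, the pretrained checkpoints, and the use of the penultimate VGG16 layer as the picture feature are assumptions of this sketch rather than specifics of the invention.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
vgg16 = models.vgg16(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def word_vectors(sentence: str) -> torch.Tensor:
    """x_{i,t}: one contextual vector per token of the i-th sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    return bert(**inputs).last_hidden_state.squeeze(0)        # (num_tokens, 768)

@torch.no_grad()
def picture_vector(path: str) -> torch.Tensor:
    """P_j: a single feature vector for the j-th picture (penultimate VGG16 layer)."""
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    feats = torch.flatten(vgg16.avgpool(vgg16.features(img)), 1)
    return vgg16.classifier[:5](feats).squeeze(0)             # (4096,)
```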
In step S2, during encoding, each input word feature vector x_{i,t} is processed by a bidirectional LSTM in both the forward and backward directions, and the hidden-layer vectors of the two directions at each time step are spliced to obtain the final hidden-layer output h_{i,t}.
Then an attention mechanism is applied to the vectors encoded by the bidirectional LSTM in order to obtain the weight of each word with respect to the whole sentence; this weight value represents the influence of the current word on the whole sentence, and once the weights are obtained the final vector representation s_i of the current sentence is computed, as shown in FIG. 3.
For example, for "They have a large selection of words to a white hand area selectable", after encoding is completed a weight of each word with respect to the whole sentence is obtained, and this weight represents the influence of the current word on the emotion polarity of the whole sentence. The specific encoding method comprises the following steps:
S21, for the selected word feature vectors x_{i,t}, encode x_{i,t} with a bidirectional LSTM in both the forward and backward directions:

h^{→}_{i,t} = LSTM^{→}(x_{i,t})
h^{←}_{i,t} = LSTM^{←}(x_{i,t})
h_{i,t} = [h^{→}_{i,t} ; h^{←}_{i,t}]

where x_{i,t} is the feature vector of the t-th word of the i-th sentence, h^{→}_{i,t} and h^{←}_{i,t} are the hidden-layer vectors of x_{i,t} in the left-to-right and right-to-left directions respectively, and h_{i,t} is the final hidden-layer vector obtained by splicing the two. At each time step, the current hidden-layer vector h_{i,t} depends on the hidden-layer vector of the previous time step h_{i,t-1} and on the current input x_{i,t}.
S22, apply the attention mechanism to the hidden-layer vectors h_{i,t}, and then compute the final feature vector representation of the current sentence:

v_{i,t} = V · tanh(W_v h_{i,t} + b_v)
α_{i,t} = exp(v_{i,t}) / Σ_{t=1}^{C} exp(v_{i,t})
s_i = Σ_{t=1}^{C} α_{i,t} h_{i,t}

where V is a randomly initialized matrix, v_{i,t} is the weight value of the word for the whole sentence, tanh is the activation function, exp is the exponential function, W_v and b_v are randomly initialized values adjusted automatically during training, α_{i,t} is the normalized weight of the word for the whole sentence with value range [0, 1], and s_i is the resulting sentence feature vector.
The sentence vector obtained by this encoding takes into account the emotional influence of each word on the whole sentence, rather than being a sentence vector obtained by simple splicing.
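The word-level encoder of steps S21 and S22 can be sketched as follows; this is a minimal PyTorch rendering under assumed dimensions (768-dimensional word vectors, 256 LSTM hidden units), the class name is hypothetical, and V is treated here as a vector so that each word receives a scalar score.

```python
import torch
import torch.nn as nn

class WordAttentionEncoder(nn.Module):
    """Encodes the word vectors x_{i,t} of one sentence into a sentence vector s_i."""
    def __init__(self, word_dim=768, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.W_v = nn.Linear(2 * hidden_dim, 2 * hidden_dim)    # W_v and b_v
        self.V = nn.Parameter(torch.randn(2 * hidden_dim))      # randomly initialized V

    def forward(self, x):                            # x: (batch, C words, word_dim)
        h, _ = self.bilstm(x)                        # h_{i,t}: (batch, C, 2*hidden_dim)
        v = torch.tanh(self.W_v(h)) @ self.V         # v_{i,t} = V · tanh(W_v h + b_v)
        alpha = torch.softmax(v, dim=1)              # normalized word weights α_{i,t}
        return (alpha.unsqueeze(-1) * h).sum(dim=1)  # s_i = Σ_t α_{i,t} h_{i,t}
```

Applied to batched BERT outputs of shape (batch, C, 768), this sketch yields one 512-dimensional sentence vector s_i per sentence.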
In step S3, the sentence feature vectors and the picture feature vectors of the selected text are set as source modality and target modality respectively, and the source modality is then translated into the target modality using an encoder-decoder framework. The two output vectors of the encoders are spliced to serve as the guide vector that guides the generation of the text vector in the next step, which resolves the heterogeneity between the guide vector and the text vector; at the same time, the fusion of text and pictures is considered again during text-vector generation, further improving the image-text fusion effect, as shown in FIG. 4. Translation here refers to encoding and decoding, and the process is as follows:
(1) Determine the sentence vectors and picture vectors: the sentence feature vectors s_i obtained by the encoding in S2 and the picture feature vectors P_j obtained in S1.
(2) Compute the fusion vectors: two Transformers set the two kinds of feature vectors as source modality and target modality respectively for encoding, yielding the fusion vectors ε_{s→p} and ε_{p→s}:

ε_{s→p} = f_{s→p}(X_s)
ε_{p→s} = f_{p→s}(X_p)

where X_s denotes the text as source modality, X_p denotes the pictures as source modality, ε_{s→p} and ε_{p→s} are the fusion vectors output by the encoders, and f denotes the activation function in the Transformer architecture.
(3) Obtain the guide vector: splice the two fusion vectors to obtain the guide vector ε:

ε = ε_{s→p} ⊕ ε_{p→s}

where ⊕ denotes the splicing operation.
(4) Through steps (1) to (3), the image-text fusion vector ε is obtained and is used to guide the work of the next step.
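A possible sketch of the symmetric translation step is given below. It uses two independent standard Transformer encoders and mean-pools their outputs into ε_{s→p} and ε_{p→s}; the patent describes a translation (encoding and decoding) of which only the encoder output is kept, so the decoder side and its training objective are omitted here, and it is assumed that both modalities have already been projected to a common dimension.

```python
import torch
import torch.nn as nn

class SymmetricTranslationFusion(nn.Module):
    """Produces eps_{s->p}, eps_{p->s} and the spliced guide vector eps."""
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        self.text_to_pic = nn.TransformerEncoder(            # f_{s->p}: text as source modality
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)
        self.pic_to_text = nn.TransformerEncoder(            # f_{p->s}: pictures as source modality
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True),
            num_layers=n_layers)

    def forward(self, sent_vecs, pic_vecs):
        # sent_vecs: (batch, L sentences, dim); pic_vecs: (batch, J pictures, dim)
        eps_sp = self.text_to_pic(sent_vecs).mean(dim=1)     # eps_{s->p}
        eps_ps = self.pic_to_text(pic_vecs).mean(dim=1)      # eps_{p->s}
        guide = torch.cat([eps_sp, eps_ps], dim=-1)          # eps = eps_{s->p} ⊕ eps_{p->s}
        return eps_sp, eps_ps, guide
```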
In step S4, the sentence feature vectors s_i are combined with the guide vector ε to generate the text feature vector d, as shown in FIG. 5. The specific process is as follows:
S41, encode the sentence vectors s_i: use a bidirectional LSTM to encode s_i in both the forward and backward directions:

h^{→}_i = LSTM^{→}(s_i)
h^{←}_i = LSTM^{←}(s_i)
h_i = [h^{→}_i ; h^{←}_i]

where s_i is the feature vector of the i-th sentence, h^{→}_i and h^{←}_i are the hidden-layer vectors of the i-th sentence in the left-to-right and right-to-left directions, and h_i is the final hidden-layer vector obtained by splicing the two hidden-layer vectors produced by the bidirectional LSTM.
S42, encode the hidden-layer vectors and the guide vector: nonlinearly map the guide vector and the sentence hidden-layer vectors into the same vector space using the tanh activation function:

f = tanh(W_f ε + b_f)
g_i = tanh(W_g h_i + b_g)

where W_f, W_g, b_f and b_g are randomly initialized values adjusted automatically during training, f is the feature vector representation of the guide vector generated after image-text fusion, and g_i is the text feature vector representation.
S43, compute the sentence attention weight vector u_i: take the inner product of f and g_i and, to take more text information into account, add g_i separately, obtaining the vector u_i representing the attention weight of each sentence:

u_i = U · (f ⊙ g_i + g_i)

where U is a parameter matrix, and the inner product of f and g_i expresses the similarity between the two. Because the inner product of f and g_i only reflects their interaction and not the information of the text itself, g_i is added so that the influence of the text is considered more fully. The vector u_i obtained above represents the attention weight of the i-th sentence; the larger the weight, the greater the influence of that sentence on the emotion of the text.
S44, normalize the weight vectors u_i to obtain the text representation: first normalize the weight vectors u_i, then take the weighted sum of the normalized weights γ_i and the sentence hidden-layer vectors h_i to obtain the text feature vector d:

γ_i = exp(u_i) / Σ_{i=1}^{L} exp(u_i)
d = Σ_{i=1}^{L} γ_i h_i

where γ_i is the attention score; the weighted sum of the attention scores and the respective sentence hidden-layer vectors gives the final text vector representation, which serves as one of the sources of the emotion classification representation vector. This text vector contains not only the information of the text itself but also the information of the fusion of text and pictures, because in this second-layer attention mechanism the fusion vector ε generated by the symmetric translation is used to guide the generated attention weights.
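The guided sentence-level encoder of steps S41 to S44 can be sketched as follows; the class name and dimensions are hypothetical, and U is treated here as a vector so that each sentence receives a scalar attention score.

```python
import torch
import torch.nn as nn

class GuidedSentenceEncoder(nn.Module):
    """Encodes sentence vectors s_i into the text vector d, guided by the fusion vector eps."""
    def __init__(self, sent_dim=512, guide_dim=1024, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.W_f = nn.Linear(guide_dim, 2 * hidden_dim)        # f   = tanh(W_f eps + b_f)
        self.W_g = nn.Linear(2 * hidden_dim, 2 * hidden_dim)   # g_i = tanh(W_g h_i + b_g)
        self.U = nn.Parameter(torch.randn(2 * hidden_dim))     # treated as a vector here

    def forward(self, sent_vecs, guide):
        # sent_vecs: (batch, L, sent_dim); guide: (batch, guide_dim)
        h, _ = self.bilstm(sent_vecs)                 # h_i: (batch, L, 2*hidden_dim)
        f = torch.tanh(self.W_f(guide)).unsqueeze(1)  # broadcast over the L sentences
        g = torch.tanh(self.W_g(h))
        u = (f * g + g) @ self.U                      # u_i = U · (f ⊙ g_i + g_i)
        gamma = torch.softmax(u, dim=1)               # normalized sentence weights γ_i
        return (gamma.unsqueeze(-1) * h).sum(dim=1)   # d = Σ_i γ_i h_i
```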
In step S5, the output vectors of the two Transformer encoders and the text feature vector are first spliced to obtain the final vector representation, which is then classified using a fully connected network. The specific process is as follows:

a = concat(d, ε_{s→p}, ε_{p→s})
ŷ = Linear(a)

where d is the text feature vector, ε_{s→p} and ε_{p→s} are the two output vectors of the encoders, concat denotes the splicing operation, Linear denotes the fully connected network, and ŷ is the finally obtained emotion classification result. The parameters of the emotion classification model are then adjusted by comparing ŷ with the classification labels that accompany the image-text samples and applying the back-propagation algorithm.
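As an illustrative sketch of this splicing, classification and back-propagation step: the number of emotion classes, the input dimensions, the loss function and the optimizer handling are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

text_dim, enc_dim, num_classes = 512, 512, 3            # assumed dimensions and class count
classifier = nn.Linear(text_dim + 2 * enc_dim, num_classes)   # fully connected network
criterion = nn.CrossEntropyLoss()

def emotion_logits(d, eps_sp, eps_ps):
    """a = concat(d, eps_{s->p}, eps_{p->s}); y_hat = Linear(a)."""
    a = torch.cat([d, eps_sp, eps_ps], dim=-1)
    return classifier(a)

def train_step(optimizer, d, eps_sp, eps_ps, labels):
    """One step: compare predictions with the labels carried by the image-text samples
    and adjust the parameters by back-propagation."""
    logits = emotion_logits(d, eps_sp, eps_ps)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```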
The invention also provides an image-text emotion classification system based on heterogeneous fusion and symmetric translation, which comprises a training module and an emotion classification model, wherein the training module is used for inputting the pictures and texts in the data set into the emotion classification model, comparing the emotion classifications corresponding to the pictures and texts in the data set with the classifications output by the emotion classification model, and adjusting the parameters of the emotion classification model through a back-propagation algorithm.
The emotion classification model comprises a modal feature extraction module, which extracts features from the data of the two modalities in the data set to obtain the feature vectors of the words in the text and the feature vectors of the pictures respectively; see step S1 above.
The heterogeneous fusion attention module generates the feature vector of each sentence based on the attention mechanism combined with the influence of each word on the emotion polarity of the sentence, and generates the feature vector of the text based on the attention mechanism combined with the influence of each sentence on the emotion polarity of the text; see steps S2 and S4 above.
The symmetric translation and fusion module sets the sentences and pictures as source modality and target modality respectively, translates the source modality into the target modality using a Transformer, and takes the encoder output as the vector representation after the two modalities are fused; see step S3 above.
The emotion classification module splices the two output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation for emotion classification; see step S5 above.
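Putting the modules together, a hypothetical top-level composition might look like the sketch below; the class name and the wrapped module interfaces are placeholders for the modules described above, not names defined by the invention, and the reshaping of per-sentence word vectors is assumed to be handled inside the wrapper modules.

```python
import torch
import torch.nn as nn

class ImageTextEmotionModel(nn.Module):
    """Hypothetical composition of the modules described above."""
    def __init__(self, feature_extractor, word_encoder, sentence_encoder,
                 symmetric_translation, fused_dim=1536, num_classes=3):
        super().__init__()
        self.feature_extractor = feature_extractor            # BERT + VGG16 wrappers (step S1)
        self.word_encoder = word_encoder                      # word-level attention (step S2)
        self.symmetric_translation = symmetric_translation    # Transformer fusion (step S3)
        self.sentence_encoder = sentence_encoder              # guided sentence attention (step S4)
        self.classifier = nn.Linear(fused_dim, num_classes)   # fully connected network (step S5)

    def forward(self, text_inputs, image_inputs):
        # word_vecs assumed grouped per sentence by the wrapper; pic_vecs one per picture
        word_vecs, pic_vecs = self.feature_extractor(text_inputs, image_inputs)
        sent_vecs = self.word_encoder(word_vecs)              # s_i
        eps_sp, eps_ps, guide = self.symmetric_translation(sent_vecs, pic_vecs)
        d = self.sentence_encoder(sent_vecs, guide)           # text feature vector d
        return self.classifier(torch.cat([d, eps_sp, eps_ps], dim=-1))
```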
Technical contents not described in detail in the present invention belong to the well-known techniques of those skilled in the art.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the present invention as defined by the appended claims, all matters utilizing the inventive concept are protected.

Claims (7)

1. An image-text emotion classification method based on heterogeneous fusion and symmetric translation, characterized by comprising the following steps:
inputting texts and pictures, and obtaining the emotion classification through an emotion classification model, wherein the training method of the emotion classification model comprises the following steps:
S1, extracting features from the text and picture data in the data set to obtain the feature vector representations of the words in the text and the feature vector representations of the pictures respectively;
S2, encoding the word feature vectors based on the attention mechanism to obtain sentence feature vectors;
S3, based on a Transformer architecture, setting the sentences in the text and the pictures as source modality and target modality respectively for encoding, and splicing the fusion vectors output by the Transformer encoders to serve as a guide vector;
S4, encoding the sentence feature vectors based on the attention mechanism and the guide vector to obtain a text feature vector;
S5, splicing the output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation, performing emotion classification, and adjusting the parameters of the emotion classification model.
2. The method according to claim 1, wherein in step S1 the feature vector representation x_{i,t} of each word is obtained using the BERT model, and the feature vector representation P_j of each picture is obtained using VGG16.
3. The method according to claim 1, wherein the encoding process in step S2 comprises the following steps:
S21, for the selected word feature vectors x_{i,t}, encoding x_{i,t} with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the final hidden-layer vector h_{i,t};
S22, applying an attention mechanism to the hidden-layer vectors h_{i,t} and then performing normalization:

v_{i,t} = V · tanh(W_v h_{i,t} + b_v)
α_{i,t} = exp(v_{i,t}) / Σ_{t=1}^{C} exp(v_{i,t})
s_i = Σ_{t=1}^{C} α_{i,t} h_{i,t}

where V is a randomly initialized matrix, v_{i,t} is the weight value of the word for the whole sentence, tanh is the activation function, exp is the exponential function, W_v and b_v are randomly initialized values adjusted automatically during training, α_{i,t} is the normalized weight of the word for the whole sentence, and s_i is the resulting sentence feature vector.
4. The method according to claim 1, wherein in step S3 the Transformer encoders encode as follows:

ε_{s→p} = f_{s→p}(X_s)
ε_{p→s} = f_{p→s}(X_p)
ε = ε_{s→p} ⊕ ε_{p→s}

where X_s denotes the text as source modality, X_p denotes the pictures as source modality, ε_{s→p} and ε_{p→s} are the two output vectors of the encoders, f denotes the activation function of the Transformer architecture, ⊕ denotes the splicing operation, and ε denotes the guide vector.
5. The method according to claim 2, wherein in step S4 encoding the sentence feature vectors comprises the following steps:
S41, for the sentence feature vectors s_i, encoding s_i with a bidirectional LSTM in both the forward and backward directions, and splicing the hidden-layer vectors obtained from the two directions to obtain the sentence hidden-layer vectors h_i;
S42, nonlinearly mapping the guide vector ε and the sentence hidden-layer vectors h_i into the same vector space using the tanh activation function, thereby obtaining the feature vector representation f of the guide vector generated after image-text fusion and the text feature vector representations g_i;
S43, based on the attention mechanism, performing an inner product of f and g_i and, to take more text information into account, separately adding g_i, to obtain the vector u_i representing the attention weight of each sentence, computed as:

u_i = U · (f ⊙ g_i + g_i)

where U is a randomly initialized parameter matrix and ⊙ denotes the inner product operation;
S44, normalizing the weight vectors u_i, and taking the weighted sum of the normalized weights γ_i and the sentence hidden-layer vectors h_i to obtain the text feature vector d, where h_i denotes the sentence hidden-layer vector and γ_i denotes the normalized weight of the i-th sentence with respect to the whole text.
6. The method according to claim 1, wherein in step S5 the obtained guide vector and the text feature vector are first spliced to obtain the final vector representation, the final vector representation is then classified using a fully connected network, and the parameters of the emotion classification model are adjusted by a back-propagation algorithm.
7. An image-text emotion classification system based on heterogeneous fusion and symmetric translation, characterized by comprising an emotion classification model and a training module, wherein the training module is used for inputting the pictures and texts in a data set into the emotion classification model, comparing the emotion classifications corresponding to the pictures and texts in the data set with the classifications output by the emotion classification model, and adjusting the parameters of the emotion classification model through a back-propagation algorithm;
the emotion classification model comprises: a modal feature extraction module for extracting features from the data of the two modalities in the data set to obtain the feature vectors of the words in the text and the feature vectors of the pictures respectively;
a heterogeneous fusion attention module for generating the feature vector of each sentence based on the attention mechanism combined with the influence of each word on the emotion polarity of the sentence, and generating the feature vector of the text based on the attention mechanism combined with the influence of each sentence on the emotion polarity of the text;
a symmetric translation and fusion module for setting the sentences and pictures as source modality and target modality respectively, translating the source modality into the target modality using a Transformer, and taking the encoder output as the vector representation after the two modalities are fused;
and an emotion classification module for splicing the two output vectors of the Transformer encoders and the text feature vector to obtain the final vector representation and outputting the emotion classification through a fully connected network.
CN202210580293.XA 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation Pending CN114969338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210580293.XA CN114969338A (en) 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210580293.XA CN114969338A (en) 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Publications (1)

Publication Number Publication Date
CN114969338A true CN114969338A (en) 2022-08-30

Family

ID=82955929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210580293.XA Pending CN114969338A (en) 2022-05-25 2022-05-25 Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation

Country Status (1)

Country Link
CN (1) CN114969338A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116039653A (en) * 2023-03-31 2023-05-02 小米汽车科技有限公司 State identification method, device, vehicle and storage medium
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention


Similar Documents

Publication Publication Date Title
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
Poria et al. Context-dependent sentiment analysis in user-generated videos
US11379736B2 (en) Machine comprehension of unstructured text
CN109447242B (en) Image description regeneration system and method based on iterative learning
Gan et al. Scalable multi-channel dilated CNN–BiLSTM model with attention mechanism for Chinese textual sentiment analysis
CN111881262B (en) Text emotion analysis method based on multi-channel neural network
JP2023509031A (en) Translation method, device, device and computer program based on multimodal machine learning
CN111680159B (en) Data processing method and device and electronic equipment
CN114969338A (en) Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation
CN110688832B (en) Comment generation method, comment generation device, comment generation equipment and storage medium
CN113657115B (en) Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion
CN113407663B (en) Image-text content quality identification method and device based on artificial intelligence
CN111311364B (en) Commodity recommendation method and system based on multi-mode commodity comment analysis
CN114201605A (en) Image emotion analysis method based on joint attribute modeling
CN113420212A (en) Deep feature learning-based recommendation method, device, equipment and storage medium
Chen et al. Multimodal detection of hateful memes by applying a vision-language pre-training model
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
Zaoad et al. An attention-based hybrid deep learning approach for bengali video captioning
Cao et al. Visual question answering research on multi-layer attention mechanism based on image target features
Yang et al. Fast RF-UIC: a fast unsupervised image captioning model
CN113569584A (en) Text translation method and device, electronic equipment and computer readable storage medium
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN113673222A (en) Social media text fine-grained emotion analysis method based on bidirectional collaborative network
Eunice et al. Deep learning and sign language models based enhanced accessibility of e-governance services for speech and hearing-impaired

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination