CN115982652A - Cross-modal emotion analysis method based on attention network

Info

Publication number
CN115982652A
Authority
CN
China
Prior art keywords
modal
text
picture
modality
features
Prior art date
Legal status
Pending
Application number
CN202211623613.1A
Other languages
Chinese (zh)
Inventor
章韵
王梦婷
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211623613.1A priority Critical patent/CN115982652A/en
Publication of CN115982652A publication Critical patent/CN115982652A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention belongs to the fields of natural language processing, computer vision and emotion analysis, and discloses a cross-modal emotion analysis method based on an attention network, which comprises the following steps. Step 1: extract picture features, text features and aspect features. Step 2: the extracted picture and text features enter stacked modality update layers; each modality update layer comprises a modality alignment module and two modality update modules; the modalities are first aligned in the modality alignment module and then enter the modality update modules, where the correlations between the different modalities are used to supplement each other step by step, finally yielding interacted picture features and text features. Step 3: perform multi-modal fusion of the interacted picture features and text features from step 2 using a self-attention mechanism. Step 4: perform a concat operation on the picture features, the text features and the multi-modal features for emotion prediction. The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.

Description

Cross-modal emotion analysis method based on attention network
Technical Field
The invention relates to the fields of natural language processing, computer vision and emotion analysis, in particular to a cross-modal emotion analysis method based on an attention network.
Background
With the development of social networking platforms and network technologies, the ways in which users express themselves on the internet have become more diverse, and more and more users choose to express their emotions and opinions through videos, pictures or articles. How to analyze the emotional tendency and public-opinion orientation contained in such multi-modal information has become a challenge in the field of emotion analysis. However, fusing multi-modal information is not easy because of the heterogeneity and asynchrony of multi-modal data. In terms of heterogeneity, different modalities live in different feature spaces. In terms of asynchrony, the non-uniform sampling rates of time-series data from different modalities mean that an optimal mapping between the modalities cannot be obtained directly. Existing work on multi-modal analysis can be summarized into two categories. The first models the asynchrony of multi-modal data with cross-modal attention, which provides a soft mapping between different modalities; such approaches, however, do not take the heterogeneity of multi-modal data into account. The second considers the heterogeneity of multi-modal data: methods in this category separate each modality into a modality-shared part and a modality-private part, represented by different neural networks. The limitation of these approaches is that they do not take the asynchrony between different modalities into account.
Disclosure of Invention
In order to solve the problems of multi-modal heterogeneity and asynchrony, the invention provides a cross-modal emotion analysis method based on an attention network.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a cross-modal emotion analysis method based on an attention network, which specifically comprises the following steps:
step 1: extracting picture characteristics and picture text characteristics corresponding to an input picture text;
step 2: the extracted picture text features enter modality updating layers, each modality updating layer comprises a modality alignment module for aligning a representation space and two modality updating modules, each modality is aligned in the modality alignment module, the aligned modalities enter the modality updating modules, and the interactive picture features and text features are finally obtained by utilizing the correlation of different modalities to supplement step by step;
and step 3: performing multi-mode fusion on the interactive picture features and text features obtained in the step 2 by adopting a self-attention mechanism to obtain multi-mode features;
and 4, step 4: and (4) performing concat operation on the picture features and the picture text features in the step (1) and the fused multi-modal features in the step (3) to perform emotion prediction.
Preferably, the following components: the step 2 specifically comprises the following steps:
step 2.1: the modal alignment module aligns feature spaces of different modes before modal interaction to obtain multi-modal information;
step 2.2: the aligned multi-modal information enters a modal updating module to gradually enhance each modal, and each modal updating layer comprises two modal updating modules
Figure BDA0004003502480000021
And &>
Figure BDA0004003502480000022
Namely a text update module and a picture update module, in order to make text and visual features focus more on the information part of a given aspect and suppress the less important parts, an aspect-guided attention method is adopted at the first layer of the modality update layer, and the specific process is as follows:
Figure BDA0004003502480000023
wherein
Figure BDA0004003502480000024
Hidden representation of the generated target modality, I A Representative facet feature vector, b (1) Represents a learnable parameter, <' > or>
Figure BDA0004003502480000025
Represents a variable parameter, <' > is combined with>
Figure BDA0004003502480000026
Representing a modality vector;
calculating a normalized attention weight:
Figure BDA0004003502480000027
using attention weights
Figure BDA0004003502480000028
Carrying out weighted average on the feature vectors of the target modes to obtain a new target mode vector->
Figure BDA0004003502480000029
Step 2.3: in order to capture the bidirectional interaction between different modalities and strengthen the inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality. In the corresponding formulas, * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the picture. SA_mul, CMA_mul and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism and the additive attention mechanism, and a normalization function is applied to the attention outputs. In order to better fuse the image and text modalities, the invention uses the additive attention mechanism, in which G, W_c and b_c are learnable parameters; the weight of each modality update module is obtained by dynamic calculation through the additive attention mechanism, achieving information interaction between the two modalities and finally producing the strengthened multi-modal sequences of the text and picture modalities.
In order to learn a deep abstract representation of the multi-modal features, a GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer: in the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism are first used to obtain the enhanced multi-modal sequence, and the GRU is then used to obtain the new text and picture features, where SA_mul denotes the multi-head self-attention mechanism, the other input is the target modality vector, and n denotes the layer index.
Preferably, the following components: in the step 3, the picture features and the text features obtained in the step 2 are subjected to multi-modal fusion by using an attention-oriented mechanism, which is specifically represented as follows:
Figure BDA0004003502480000041
wherein:
Figure BDA0004003502480000042
all represent a multimodal sequence, FC is a fused multimodal function.
Preferably: the step 4 specifically comprises the following steps: performing concat operation on the text features, the picture features and the fused multi-modal features in the steps 1 and 3 to obtain a representation E containing three features mul As input data:
E mul =concat(X mul ,X L ,X V )
performing feature fusion on the data by using a full-connection network, and performing emotion prediction by using a softmax classifier at the last layer, wherein the emotion prediction calculation formula is as follows:
P=softmax(W m E+b m )
wherein W m Weight representing full connection layer, b m Representing bias and P representing emotion prediction.
Preferably: the specific process of extracting the picture features by adopting the VGG16 network is as follows:
step 11: inputting: inputting 224 x 3 matrix of image pixels;
step 12: convolution pooling: the input image pixel matrix is subjected to 5 rounds of convolution pooling, the size of each convolution kernel is 3 x w, w represents the depth of the matrix, a plurality of feature maps are obtained through an activation function ReLU after convolution, and local features are screened by adopting maximum pooling, wherein the convolution calculation formula is as follows:
f j =R(X i *K j +b)
where R represents the ReLU activation function, b represents the bias term, K j Convolution kernels representing different matrix depths;
step 13: fully connecting: obtaining 1 × 1000 image feature representation vectors through three times of full connection;
step 14: finally, obtaining X for picture feature vector through pre-trained VGG16 network Vp ={X V1 ,X V2 …X Vn Denotes.
Preferably, the following components: in the step 1, a Bert pre-training model is adopted to obtain the picture text characteristics, and the specific process is as follows:
step 21: text preprocessing: preprocessing is carried out aiming at nonsense words and symbols in the network words, and words which do not influence the judgment of the text emotional tendency are used as stop words and deleted;
step 22: extracting a word vector sequence of an input text by adopting a pretrained Bert model, inputting the text, segmenting and labeling the input text, inputting the input text by using the word sequence, performing word embedding, segment embedding and position embedding of the Bert model, flowing layer by layer in a stack, finally generating a word vector of text characteristics, and using X to use Lp ={X L1 ,X L2 …X Ln Denotes.
Preferably: the method for extracting the aspect features of the aspect phrases given in the step 1 comprises the following steps:
given aspect phrase a = { a = { [ a ] 1 ,A 2 …A n Using word embedding to obtain word embedding vector a j Then, a bidirectional LSTM model is adopted to learn hidden representation V of each aspect word embedding vector j
Figure BDA0004003502480000051
Then all hidden tokens V are taken j As the final aspect feature vector V A
Figure BDA0004003502480000052
The invention has the following beneficial effects:
(1) The emotion analysis method uses the modality alignment module and the modality update module together with an attention mechanism to carry out cross-modal interaction, thereby improving the accuracy of multi-modal emotion analysis.
(2) The modality update layer comprises a modality alignment module and modality update modules; the modality alignment module aligns the feature sequences of the different modalities, which benefits the interaction between modalities.
(3) The modality update module uses a multi-head self-attention mechanism and a cross-modal attention mechanism to strengthen the interaction between modalities and fully integrates the shared and private features of the different modalities.
(4) In order to preserve the rich features of each modality, the fused multi-modal features are fused again with the initial modality features before emotion classification is carried out.
(5) The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.
Drawings
FIG. 1 is a flow chart of the emotion analysis method of the present invention.
FIG. 2 is a diagram of the emotion analysis method architecture of the present invention.
FIG. 3 is a modal update block diagram of the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
As shown in Figs. 1 to 3, the invention is a cross-modal emotion analysis method based on an attention network. It provides a cross-modal emotion analysis model built on a visual attention network and strengthens the information interaction between the image and text modalities through modality update layers, thereby improving the robustness and accuracy of the model. Specifically, the cross-modal emotion analysis method comprises the following steps.
Step 1: extract the picture features, text features and aspect features of the given aspect phrase corresponding to the input picture-text pair.
The picture features are extracted with a VGG16 network, which consists of 13 convolutional layers, 5 pooling layers and 3 fully connected layers. The convolutional layers obtain the picture feature maps by convolution: the convolution kernel slides over the image representation matrix with a certain stride and is multiplied element-wise with the corresponding positions of the input matrix, giving the feature map of the image under the current convolution kernel. The pooling layers reduce the dimension of the feature maps after convolution and screen local features with max pooling. Finally, the fully connected layers aggregate the features output by the previous layers.
The specific process of extracting the picture features is as follows (a code sketch is given after this list):
Step 11: input: an image pixel matrix of size 224 x 224 x 3;
Step 12: convolution and pooling: the input image pixel matrix goes through 5 rounds of convolution and pooling; each convolution kernel has size 3 x 3 x w, where w denotes the depth of the matrix; after convolution, a number of feature maps are obtained through the ReLU activation function, and max pooling is adopted to screen local features. The convolution is calculated as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes the convolution kernels of different matrix depths;
Step 13: full connection: a 1 x 1000 image feature representation vector is obtained through three fully connected layers;
Step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2, ..., X_Vn}.
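A minimal sketch of this picture-feature extraction with torchvision's pre-trained VGG16 is given below; the ImageNet normalization constants and the use of the final 1 x 1000 output as the picture feature vector X_Vp are standard-practice assumptions rather than values fixed by the patent.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Resize to the 224 x 224 x 3 input expected by VGG16 and apply ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def extract_picture_features(image_path: str) -> torch.Tensor:
    """Return the 1 x 1000 vector produced after the three fully connected layers."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        return vgg16(x)                                                 # (1, 1000), i.e. X_Vp
```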
The text features of the input picture-text pair are obtained with a BERT pre-trained model; the specific process is as follows (a code sketch is given after this list):
Step 21: text preprocessing: meaningless words and symbols in the internet text are preprocessed, and words that do not influence the judgment of the text's emotional tendency are treated as stop words and deleted;
Step 22: a pre-trained BERT model is adopted to extract the word vector sequence of the input text: the input text is segmented and labeled, the resulting word sequence is fed into BERT, the word embedding, segment embedding and position embedding of the BERT model are applied, the representation flows layer by layer through the Transformer stack, and the word vectors of the text features are finally generated and denoted X_Lp = {X_L1, X_L2, ..., X_Ln}.
The aspect features of a given aspect phrase are extracted as follows (a code sketch is given below):
Given an aspect phrase A = {A_1, A_2, ..., A_n}, word embedding is first used to obtain the word embedding vector a_j of each aspect word; a bidirectional LSTM model is then adopted to learn the hidden representation V_j of each aspect word embedding vector, and all hidden representations V_j are combined to form the final aspect feature vector V_A.
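The following is a minimal sketch of the aspect encoder: word embeddings of the aspect phrase are fed to a bidirectional LSTM and the hidden states are pooled into V_A. Mean pooling and the embedding/hidden sizes are assumptions, since the exact pooling formula is not reproduced in this text.

```python
import torch
import torch.nn as nn

class AspectEncoder(nn.Module):
    def __init__(self, vocab_size=30522, emb_dim=300, hidden=150):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)                      # word embedding a_j
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, aspect_ids: torch.Tensor) -> torch.Tensor:
        # aspect_ids: (B, n) token ids of the aspect phrase A = {A_1 ... A_n}.
        a = self.embedding(aspect_ids)
        v, _ = self.bilstm(a)              # hidden representations V_j, shape (B, n, 2 * hidden)
        return v.mean(dim=1)               # aspect feature vector V_A (mean pooling assumed)

encoder = AspectEncoder()
v_a = encoder(torch.randint(0, 30522, (2, 3)))  # (2, 300)
```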
Step 2: the extracted picture and text features enter the stacked modality update layers; each modality update layer comprises a modality alignment module for aligning the representation spaces and two modality update modules; the modalities are aligned in the modality alignment module and then enter the modality update modules, where the correlations between the different modalities are used to supplement each other step by step, finally obtaining the interacted picture features and text features.
Step 2.1: the modality alignment module aims to align the feature spaces of the different modalities before modality interaction. It first maps the single-modality representations of the modalities into the same memory space: the text (or picture) vector of the n-th modality update layer is transformed by an exchange function f(·), parameterized by θ, between the modality vector and the memory space vector Mem_n, producing the aligned modality vector. The specific calculation of the modality alignment module is as follows: the two modality representations are linearly transformed into Q_* with the parameter W_q, and the memory space is linearly transformed into K = Mem_n · W_K with the parameter W_K; the similarity between the modality vectors and the memory space vectors is then computed, and the weight of the j-th memory vector is obtained by normalizing these similarities; the memory space vectors are linearly transformed into V = Mem_n · W_v, where W_v is a learnable parameter; finally, the query vector is obtained from the memory space vectors and the weights, where * ∈ {L, V} indexes the text and image features and V_*j denotes a memory space vector.
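A hypothetical sketch of the modality alignment module is given below: a shared learnable memory Mem_n is attended to by each modality through the projections W_q, W_K and W_v, so that text and picture are re-expressed in the same memory space. Because the exact formulas are only available as images in the original, the scaled-dot-product similarity and the memory size are assumptions that follow the prose.

```python
import torch
import torch.nn as nn

class ModalityAlignment(nn.Module):
    def __init__(self, dim=768, num_slots=64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))   # Mem_n, shared by both modalities
        self.w_q = nn.Linear(dim, dim, bias=False)                 # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)                 # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)                 # W_v

    def align(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim), a text or picture sequence.
        q = self.w_q(x)                          # Q_*: modality representation after linear transform
        k = self.w_k(self.memory)                # K = Mem_n * W_K
        v = self.w_v(self.memory)                # V = Mem_n * W_v
        sim = q @ k.t() / k.shape[-1] ** 0.5     # similarity between modality and memory vectors
        w = torch.softmax(sim, dim=-1)           # weight of the j-th memory vector
        return w @ v                             # query vector: weighted sum of memory values

    def forward(self, x_l, x_v):
        return self.align(x_l), self.align(x_v)  # both modalities expressed in the memory space

align = ModalityAlignment()
a_l, a_v = align(torch.randn(2, 30, 768), torch.randn(2, 49, 768))
```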
Step 2.2: the aligned multi-modal information enters the modality update modules, which gradually enhance each modality. Each modality update layer contains two modality update modules, namely a text update module and a picture update module. In order to make the text and visual features focus more on the information relevant to the given aspect and suppress the less important parts, an aspect-guided attention method is adopted in the first modality update layer. The specific process is as follows: a hidden representation of the target modality is generated from the target modality vector, the aspect feature vector I_A and learnable parameters (a weight matrix and a bias b^(1)); a normalized attention weight is then calculated from this hidden representation; finally, the attention weights are used to carry out a weighted average of the target modality's feature vectors, obtaining the new target modality vector.
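A hypothetical sketch of the first-layer aspect-guided attention follows: a score for each position of the target modality is computed from the aspect vector I_A and learnable parameters, normalized with a softmax, and used for a weighted average. The additive tanh scoring form and the dimensions are assumptions, since the original formula is only shown as an image.

```python
import torch
import torch.nn as nn

class AspectGuidedAttention(nn.Module):
    def __init__(self, dim=768, aspect_dim=300):
        super().__init__()
        self.w_x = nn.Linear(dim, dim, bias=False)       # weight applied to the modality vectors X_*
        self.w_a = nn.Linear(aspect_dim, dim, bias=True)  # weight and bias b^(1) applied to I_A
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x: torch.Tensor, aspect: torch.Tensor):
        # x: (B, L, dim) target modality sequence, aspect: (B, aspect_dim) aspect vector I_A.
        h = torch.tanh(self.w_x(x) + self.w_a(aspect).unsqueeze(1))     # hidden representation
        beta = torch.softmax(self.score(h).squeeze(-1), dim=-1)         # normalized attention weights
        pooled = (beta.unsqueeze(-1) * x).sum(dim=1)                    # weighted average of X_*
        return pooled, beta                                             # new target-modality vector

attn = AspectGuidedAttention()
vec, weights = attn(torch.randn(2, 30, 768), torch.randn(2, 300))
```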
Step 2.3: in order to capture the bidirectional interaction between different modalities and strengthen the inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality. In the corresponding formulas, * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the picture. SA_mul, CMA_mul and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism and the additive attention mechanism, and a normalization function is applied to the attention outputs. In order to better fuse the image and text modalities, the invention uses the additive attention mechanism, in which G, W_c and b_c are learnable parameters; the weight of each modality update module is obtained by dynamic calculation through the additive attention mechanism, achieving information interaction between the two modalities and finally producing the strengthened multi-modal sequences of the text and picture modalities.
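The following is a hypothetical sketch of one modality update module: the target modality is enhanced with multi-head self-attention (SA_mul) and multi-head cross-modal attention (CMA_mul, queries from the target, keys and values from the supplementary modality), and the two branches are combined by additive attention with learnable parameters standing in for G, W_c and b_c. The residual connections, LayerNorm as the normalization function and the branch-stacking layout are assumptions that only mirror the prose.

```python
import torch
import torch.nn as nn

class ModalityUpdateModule(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)    # SA_mul
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # CMA_mul
        self.norm = nn.LayerNorm(dim)                                            # normalization
        self.w_c = nn.Linear(dim, dim)                                           # W_c, b_c
        self.g = nn.Linear(dim, 1, bias=False)                                   # G

    def forward(self, target: torch.Tensor, supplement: torch.Tensor) -> torch.Tensor:
        # target: (B, Lt, dim) modality to enhance; supplement: (B, Ls, dim) other modality.
        sa, _ = self.self_attn(target, target, target)             # enhancement within the modality
        ca, _ = self.cross_attn(target, supplement, supplement)    # enhancement from the other modality
        branches = torch.stack([self.norm(target + sa), self.norm(target + ca)], dim=2)  # (B, Lt, 2, dim)
        scores = self.g(torch.tanh(self.w_c(branches)))            # additive attention scores
        weights = torch.softmax(scores, dim=2)                     # dynamic weight of each branch
        return (weights * branches).sum(dim=2)                     # strengthened multi-modal sequence

update_text = ModalityUpdateModule()
x_l_new = update_text(torch.randn(2, 30, 768), torch.randn(2, 49, 768))  # text enhanced by picture
```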
Further in step 2.3, in order to learn a deep abstract representation of the multi-modal features, a GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer: in the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism are first used to obtain the enhanced multi-modal sequence, and the GRU is then used to obtain the new text and picture features, where SA_mul denotes the multi-head self-attention mechanism, the other input is the target modality vector, and n denotes the layer index.
Step 3: the picture features and text features obtained in step 2 are fused across modalities with a self-attention mechanism: the interacted picture and text sequences form the multi-modal sequence fed to the self-attention mechanism, and FC denotes the multi-modal fusion function that produces the fused multi-modal feature X_mul.
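A minimal sketch of the step 3 fusion follows: the interacted text and picture sequences are concatenated, passed through multi-head self-attention, and reduced by a fully connected layer standing in for FC. Mean pooling of the attended sequence is an assumption.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)                       # FC: the multi-modal fusion function

    def forward(self, x_l: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        seq = torch.cat([x_l, x_v], dim=1)                  # joint multi-modal sequence
        attended, _ = self.self_attn(seq, seq, seq)
        return self.fc(attended.mean(dim=1))                # fused multi-modal feature X_mul

fusion = MultiModalFusion()
x_mul = fusion(torch.randn(2, 30, 768), torch.randn(2, 49, 768))  # (2, 768)
```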
Step 4: perform a concat operation on the picture features and text features from step 1 and the fused multi-modal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
Feature fusion is then performed on this data with a fully connected network, and a softmax classifier in the last layer performs emotion prediction, calculated as:
P = softmax(W_m E + b_m)
where W_m denotes the weight of the fully connected layer, b_m denotes the bias and P denotes the emotion prediction.
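A minimal sketch of step 4 is given below: E_mul = concat(X_mul, X_L, X_V), a fully connected fusion network, and a softmax classifier computing P = softmax(W_m E + b_m). Pooling the sequence features to single vectors before the concat and the three-class output are assumptions.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, dim=768, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())  # fully connected fusion network
        self.out = nn.Linear(dim, num_classes)                         # W_m, b_m

    def forward(self, x_mul, x_l, x_v):
        e_mul = torch.cat([x_mul, x_l, x_v], dim=-1)        # E_mul = concat(X_mul, X_L, X_V)
        e = self.fuse(e_mul)
        return torch.softmax(self.out(e), dim=-1)           # emotion prediction P

clf = EmotionClassifier()
p = clf(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))  # (2, 3)
```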
To preserve the richer features between the modalities, the invention uses an L2 penalty in the loss function, where α denotes the hyper-parameter that weights the penalty term.
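The sketch below adds an L2 penalty weighted by the hyper-parameter alpha to the training objective; pairing it with cross-entropy over pre-softmax logits is an assumption, since the full loss formula is only shown as an image in the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_loss(model: nn.Module, logits: torch.Tensor, labels: torch.Tensor, alpha: float = 1e-4):
    """Cross-entropy task loss (assumed) plus an alpha-weighted L2 penalty on the parameters."""
    task = F.cross_entropy(logits, labels)
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    return task + alpha * l2
```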
The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.

Claims (8)

1. A cross-modal emotion analysis method based on an attention network, characterized in that the cross-modal emotion analysis method comprises the following steps:
Step 1: extracting the picture features corresponding to an input picture-text pair, the text features, and the aspect features of a given aspect phrase;
Step 2: the extracted picture and text features enter stacked modality update layers, each modality update layer comprising a modality alignment module and two modality update modules; the modalities are aligned in the modality alignment module and then enter the modality update modules, where the correlations between the different modalities are used to supplement each other step by step, finally obtaining the interacted picture features and text features;
Step 3: performing multi-modal fusion on the interacted picture features and text features obtained in step 2 by adopting a self-attention mechanism to obtain the multi-modal features;
Step 4: performing a concat operation on the picture features and text features from step 1 and the fused multi-modal features from step 3 to perform emotion prediction.
2. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, wherein step 2 specifically comprises the following steps:
Step 2.1: the modality alignment module aligns the feature spaces of the different modalities before modality interaction to obtain aligned multi-modal information;
Step 2.2: the aligned multi-modal information enters the modality update modules, which gradually enhance each modality; each modality update layer comprises two modality update modules, namely a text update module and a picture update module; the first modality update layer adopts an aspect-guided attention method, whose specific process is as follows: a hidden representation of the target modality is generated from the target modality vector, the aspect feature vector I_A and learnable parameters (a weight matrix and a bias b^(1)); a normalized attention weight is then calculated; and the attention weights are used to carry out a weighted average of the target modality's feature vectors, obtaining the new target modality vector;
Step 2.3: in order to capture the bidirectional interaction between different modalities and strengthen the inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality; in the corresponding formulas, * denotes the target modality to be enhanced and α denotes the supplementary modality, and if the target modality is text, the supplementary modality is the picture; SA_mul, CMA_mul and Att respectively denote a multi-head self-attention mechanism, a multi-head cross-modal attention mechanism and an additive attention mechanism, and a normalization function is also applied; the additive attention mechanism is used, in which G, W_c and b_c are learnable parameters, and the weight of each modality update module is obtained by dynamic calculation through the additive attention mechanism, achieving information interaction between the two modalities and finally obtaining the strengthened multi-modal sequences of the text and picture modalities.
3. The attention-network-based cross-modal emotion analysis method of claim 2, wherein in step 2.3, in order to learn a deep abstract representation of the multi-modal features, a GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer: in the n-th layer, the cross-modal attention mechanism and the self-attention mechanism are used to obtain the enhanced multi-modal sequence, and the GRU is then used to obtain the new text and picture features, where SA_mul denotes the multi-head self-attention mechanism, the other input is the target modality vector, and n denotes the layer index.
4. The attention-network-based cross-modal emotion analysis method of claim 1, wherein in step 3 the picture features and text features obtained in step 2 are subjected to multi-modal fusion by using an attention mechanism, the inputs being the multi-modal sequences obtained in step 2 and FC being the multi-modal fusion function.
5. The attention-network-based cross-modal emotion analysis method of claim 1, wherein step 4 specifically comprises the following steps: performing a concat operation on the text features and picture features from step 1 and the fused multi-modal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
performing feature fusion on the data by using a fully connected network, and performing emotion prediction by using a softmax classifier in the last layer, the emotion prediction being calculated as:
P = softmax(W_m E + b_m)
where W_m denotes the weight of the fully connected layer, b_m denotes the bias and P denotes the emotion prediction.
6. The attention-network-based cross-modal emotion analysis method of claim 1, wherein the aspect features of the aspect phrase given in step 1 are extracted as follows: given an aspect phrase A = {A_1, A_2, ..., A_n}, word embedding is first used to obtain the word embedding vector a_j of each aspect word, a bidirectional LSTM model is then adopted to learn the hidden representation V_j of each aspect word embedding vector, and all hidden representations V_j are combined as the final aspect feature vector V_A.
7. The attention-network-based cross-modal emotion analysis method of claim 1, wherein in step 1 a VGG16 network is adopted to extract the picture features, the VGG16 network being composed of 13 convolutional layers, 5 pooling layers and 3 fully connected layers, and the specific process of extracting the picture features with the VGG16 network being as follows:
Step 11: input: an image pixel matrix of size 224 x 224 x 3;
Step 12: convolution and pooling: the input image pixel matrix goes through 5 rounds of convolution and pooling, each convolution kernel has size 3 x 3 x w, where w denotes the depth of the matrix; after convolution, a number of feature maps are obtained through the ReLU activation function, and max pooling is adopted to screen local features; the convolution is calculated as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes the convolution kernels of different matrix depths;
Step 13: full connection: a 1 x 1000 image feature representation vector is obtained through three fully connected layers;
Step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2, ..., X_Vn}.
8. The attention-network-based cross-modal emotion analysis method of claim 1, wherein in step 1 a BERT pre-trained model is adopted to obtain the text features, the specific process being as follows:
Step 21: text preprocessing: meaningless words and symbols in the internet text are preprocessed, and words that do not influence the judgment of the text's emotional tendency are treated as stop words and deleted;
Step 22: a pre-trained BERT model is adopted to extract the word vector sequence of the input text: the input text is segmented and labeled, the resulting word sequence is fed into BERT, the word embedding, segment embedding and position embedding of the BERT model are applied, the representation flows layer by layer through the stack, and the word vectors of the text features are finally generated and denoted X_Lp = {X_L1, X_L2, ..., X_Ln}.
CN202211623613.1A (filed 2022-12-16): Cross-modal emotion analysis method based on attention network, Pending, published as CN115982652A (en)

Priority Applications (1)

Application Number: CN202211623613.1A
Priority Date / Filing Date: 2022-12-16
Title: Cross-modal emotion analysis method based on attention network (published as CN115982652A)

Publications (1)

Publication Number: CN115982652A
Publication Date: 2023-04-18

Family

ID=85962068

Family Applications (1)

Application Number: CN202211623613.1A (priority date 2022-12-16, filing date 2022-12-16), status Pending
Title: Cross-modal emotion analysis method based on attention network

Country Status (1)

CN: CN115982652A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number: CN116719930A * (priority date 2023-04-28, published 2023-09-08), assignee 西安工程大学: Multi-mode emotion analysis method based on visual attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination