CN115982652A - A Cross-Modal Sentiment Analysis Method Based on Attention Network - Google Patents
- Publication number: CN115982652A
- Application number: CN202211623613.1A
- Authority: CN (China)
- Prior art keywords: modal, modality, features, text, picture
- Legal status: Pending
Description
Technical Field
The present invention relates to the fields of natural language processing, computer vision, and sentiment analysis, and in particular to a cross-modal sentiment analysis method based on an attention network.
Background Art
With the development of online social platforms and network technologies, users express themselves on the Internet in increasingly diverse ways, and more and more users post videos, pictures, or articles to convey their emotions and opinions. Analyzing the emotional tendencies and public-opinion orientations contained in such multimodal information has become a challenge for the field of sentiment analysis. However, the heterogeneity and asynchrony of multimodal data make it difficult to fuse multimodal information. Regarding heterogeneity, different modalities live in different feature spaces. Regarding asynchrony, the inconsistent sampling rates of the time-series data of different modalities make it impossible to obtain an optimal mapping between modalities. Existing studies on multimodal analysis can be grouped into two categories. The first uses cross-modal attention to provide a soft mapping between modalities and thereby model the asynchrony of multimodal data; however, such methods do not account for the heterogeneity of multimodal data. The second accounts for heterogeneity by splitting each modality into a shared part and a private part, represented by different neural networks; the limitation of these methods is that they do not consider the asynchrony between modalities.
Summary of the Invention
To address the heterogeneity and asynchrony of multimodal data, the present invention proposes a cross-modal sentiment analysis method based on an attention network. It employs a modality alignment module and modality update modules and uses attention mechanisms to perform cross-modal interaction, thereby improving the accuracy of multimodal sentiment analysis.
To achieve the above object, the present invention is realized through the following technical solution:
The present invention provides a cross-modal sentiment analysis method based on an attention network, which specifically comprises the following steps:
Step 1: extract the image features and the text features corresponding to an input image-text pair;
Step 2: feed the extracted image and text features into modality update layers, each of which comprises a modality alignment module for aligning the representation spaces and two modality update modules; each modality is aligned in the modality alignment module and then enters the modality update modules, where it is progressively supplemented by exploiting the correlations between modalities, finally yielding the interacted image features and text features;
Step 3: fuse the interacted image features and text features obtained in Step 2 with a self-attention mechanism to obtain multimodal features;
Step 4: concatenate (concat) the image features and text features from Step 1 with the fused multimodal features from Step 3 and perform sentiment prediction.
Preferably, Step 2 specifically comprises the following steps:
Step 2.1: the modality alignment module aligns the feature spaces of the different modalities before modality interaction to obtain multimodal information;
Step 2.2: the aligned multimodal information enters the modality update modules, which progressively enhance each modality. Each modality update layer contains two modality update modules, namely a text update module and an image update module. To make the text and visual features focus on the parts of the information relevant to a given aspect and suppress the less important parts, the first modality update layer adopts an aspect-guided attention method, whose process is as follows:
a hidden representation of the target modality is generated, where I_A denotes the aspect feature vector, b^(1) denotes a learnable parameter, and the remaining symbols denote the trainable parameters and the modality vector;
the normalized attention weights are then computed;
the attention weights are used to take a weighted average of the target modality's feature vectors, yielding a new target-modality vector.
Step 2.3: to capture the bidirectional interaction between modalities and strengthen inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism. The target modality is enhanced as follows:
* denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the image. In the corresponding formulas,
SA_mul, CMA_mul, and Att denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism, and the normalization function together with the additive attention mechanism, respectively. To better fuse the image and text modalities, the present invention uses the additive attention mechanism,
in which G, W_c, and b_c are learnable parameters. The weight of each modality update module is computed dynamically through the additive attention mechanism, so that information is exchanged between the two modalities and the enhanced multimodal sequences are finally obtained.
To learn deep abstract representations of the multimodal features, a GRU combines the result of the interactive attention mechanisms with the input of the current layer: in the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism first produce the enhanced multimodal sequences, and the GRU then produces the new text and image features.
Here SA_mul denotes the multi-head self-attention mechanism, the target-modality vector is as defined above, and n denotes the layer index.
Preferably, in Step 3 the image features and text features obtained in Step 2 are fused multimodally using the self-attention mechanism,
where the inputs are the enhanced multimodal sequences and FC is the multimodal fusion function.
Preferably, Step 4 is specifically: concatenate the text features and image features from Step 1 with the fused multimodal features from Step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
A fully connected network fuses these features, and a softmax classifier in the last layer performs the sentiment prediction:
P = softmax(W_m E + b_m)
where W_m denotes the weights of the fully connected layer, b_m denotes the bias, and P denotes the sentiment prediction.
Preferably, the specific process of extracting image features with the VGG16 network is as follows:
Step 11: input a 224*224*3 image pixel matrix;
Step 12: convolution and pooling: the input image pixel matrix goes through 5 rounds of convolution and pooling; in each round the convolution kernels are of size 3*3*w, where w denotes the matrix depth. After each convolution, the ReLU activation function produces multiple feature maps, and max pooling selects the local features. The convolution is computed as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes convolution kernels of different matrix depths;
Step 13: full connection: three fully connected layers produce a 1*1*1000 image feature representation vector;
Step 14: the image feature vectors obtained from the pretrained VGG16 network are denoted X_Vp = {X_V1, X_V2, ..., X_Vn}.
Preferably, in Step 1 the Bert pretrained model is used to obtain the text features, as follows:
Step 21: text preprocessing: meaningless words and symbols in Internet slang are preprocessed, and words that do not affect the judgment of the text's sentiment tendency are treated as stop words and removed;
Step 22: the pretrained Bert model extracts the word-vector sequence of the input text. After segmentation and labeling, the word sequence is taken as input and passes through Bert's word embeddings, segment embeddings, and position embeddings, flowing layer by layer through the encoder stack to finally produce the word vectors of the text features, denoted X_Lp = {X_L1, X_L2, ..., X_Ln}.
Preferably, in Step 1 the aspect features of a given aspect phrase are extracted as follows:
Given an aspect phrase A = {A_1, A_2, ..., A_n}, word embedding vectors a_j are first obtained, and a bidirectional LSTM model then learns the hidden representation V_j of each aspect word embedding.
The average of all hidden representations V_j is taken as the final aspect feature vector V_A.
The beneficial effects of the present invention are as follows:
(1) The sentiment analysis method of the present invention uses a modality alignment module and modality update modules together with attention mechanisms to perform cross-modal interaction, thereby improving the accuracy of multimodal sentiment analysis.
(2) The modality update layer of the present invention comprises a modality alignment module and modality update modules; the modality alignment module aligns the feature sequences of different modalities and facilitates interaction between modalities.
(3) The modality update module uses a multi-head self-attention mechanism and a cross-modal attention mechanism to strengthen inter-modality interaction and to fully fuse the shared and private characteristics of the different modalities.
(4) To preserve the rich features shared between modalities, the present invention fuses the fused multimodal features with the initial modality features once more before sentiment classification.
(5) The present invention makes full use of cross-modal information interaction, which helps improve the accuracy of sentiment prediction.
Brief Description of the Drawings
FIG. 1 is a flow chart of the sentiment analysis method of the present invention.
FIG. 2 is an architecture diagram of the sentiment analysis method of the present invention.
FIG. 3 is a diagram of the modality update module of the present invention.
Detailed Description
Embodiments of the present invention are disclosed below with reference to the drawings. For clarity, many practical details are described together in the following description. It should be understood, however, that these practical details are not intended to limit the present invention; in some embodiments of the present invention, these practical details are not necessary.
As shown in FIGS. 1-3, the present invention is a cross-modal sentiment analysis method based on an attention network. A cross-modal sentiment analysis model based on a visual attention network is proposed, in which modality update layers strengthen the information interaction between the image and text modalities, thereby improving the robustness and accuracy of the model. Specifically, the cross-modal sentiment analysis method comprises the following steps:
Step 1: extract the image features, the text features, and the aspect features of a given aspect phrase corresponding to the input image-text pair.
The VGG16 network is used to extract the image features. VGG16 consists of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. The convolutional layers obtain feature maps of the image by convolution: element-wise multiplication is applied to the image representation matrix as the convolution kernel slides with a given stride, multiplying each element at the corresponding position of the input matrix and finally yielding the feature map of the image for the current kernel. The pooling layers reduce the dimensionality of the convolved feature maps and use max pooling to select local features. Finally, the fully connected layers aggregate the features output by the preceding layers.
The specific process of extracting image features is as follows:
Step 11: input a 224*224*3 image pixel matrix;
Step 12: convolution and pooling: the input image pixel matrix goes through 5 rounds of convolution and pooling; in each round the convolution kernels are of size 3*3*w, where w denotes the matrix depth. After each convolution, the ReLU activation function produces multiple feature maps, and max pooling selects the local features. The convolution is computed as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes convolution kernels of different matrix depths;
Step 13: full connection: three fully connected layers produce a 1*1*1000 image feature representation vector;
Step 14: the image feature vectors obtained from the pretrained VGG16 network are denoted X_Vp = {X_V1, X_V2, ..., X_Vn}.
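As a concrete illustration of Steps 11-14, the following is a minimal sketch that extracts the 1*1*1000 feature vector of one picture with a pretrained VGG16. It assumes the torchvision implementation of VGG16 and standard ImageNet preprocessing, neither of which is prescribed by the patent.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained VGG16: 13 convolutional layers, 5 max-pooling stages, 3 fully connected layers.
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

# 224*224*3 input with ImageNet normalization (assumed preprocessing).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(image_path: str) -> torch.Tensor:
    """Return the 1000-dimensional VGG16 output vector X_V for one picture."""
    img = Image.open(image_path).convert("RGB")
    x = preprocess(img).unsqueeze(0)   # (1, 3, 224, 224) pixel matrix
    with torch.no_grad():
        feat = vgg16(x)                # (1, 1000) after the three fully connected layers
    return feat.squeeze(0)
```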
The Bert pretrained model is used to obtain the text features, as follows:
Step 21: text preprocessing: meaningless words and symbols in Internet slang are preprocessed, and words that do not affect the judgment of the text's sentiment tendency are treated as stop words and removed;
Step 22: the pretrained Bert model extracts the word-vector sequence of the input text. After segmentation and labeling, the word sequence is taken as input and passes through Bert's word embeddings, segment embeddings, and position embeddings, flowing layer by layer through the encoder stack to finally produce the word vectors of the text features, denoted X_Lp = {X_L1, X_L2, ..., X_Ln}.
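A minimal sketch of this text branch with the Hugging Face transformers library is shown below; the bert-base-chinese checkpoint and the 128-token limit are illustrative assumptions, not choices stated in the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def extract_text_features(text: str) -> torch.Tensor:
    """Return the word vectors X_Lp = {X_L1, ..., X_Ln} for one preprocessed sentence."""
    # Word, segment and position embeddings are summed inside BERT and flow
    # layer by layer through the encoder stack.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # (sequence_length, 768)
```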
The aspect features of a given aspect phrase are extracted as follows:
Given an aspect phrase A = {A_1, A_2, ..., A_n}, word embedding vectors a_j are first obtained, and a bidirectional LSTM model then learns the hidden representation V_j of each aspect word embedding.
The average of all hidden representations V_j is taken as the final aspect feature vector V_A.
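A sketch of such an aspect encoder is given below: a bidirectional LSTM over the aspect word embeddings followed by mean pooling. The vocabulary size and the embedding and hidden dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AspectEncoder(nn.Module):
    """Encode an aspect phrase A = {A_1, ..., A_n} into a single aspect vector V_A."""
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, aspect_ids: torch.Tensor) -> torch.Tensor:
        a = self.embedding(aspect_ids)   # word embedding vectors a_j, (batch, n, emb_dim)
        V, _ = self.bilstm(a)            # hidden representations V_j, (batch, n, 2*hidden)
        return V.mean(dim=1)             # average over aspect words -> V_A
```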
Step 2: the extracted image and text features enter the modality update layers. Each modality update layer comprises a modality alignment module for aligning the representation spaces and two modality update modules. Each modality is aligned in the modality alignment module and then enters the modality update modules, where it is progressively supplemented by exploiting the correlations between modalities, finally yielding the interacted image features and text features.
Step 2.1: the modality alignment module aligns the feature spaces of the different modalities before modality interaction. The unimodal representation of each modality is first mapped into the same memory space,
where the mapping takes the text (or image) vector and the memory-space vectors Mem_n with parameters θ, f(·) denotes the exchange function between a modality vector and the memory-space vectors, n denotes the index of the modality update layer, and the output is the aligned modality vector. The modality alignment module is computed as follows:
K = Mem_n · W_K
where W_q and W_K denote linear-transformation parameters, Q_* denotes the linearly transformed vector representation of each modality, and K denotes the transformed memory space. The similarity between a modality vector and the memory-space vectors is then computed,
and the weight of the j-th memory vector is obtained from these similarities.
After linear transformation the memory-space vectors are expressed as:
V = Mem_n · W_v
where W_v denotes learnable parameters. The aligned output is obtained from the memory-space value vectors and the weights,
where * ∈ {L, V} denotes the text and image modalities and V_{*j} denotes the memory-space vectors.
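One possible reading of the alignment module is attention from each modality's features onto a shared learnable memory Mem_n, as sketched below. The memory size, the scaled-dot-product similarity, and the softmax weighting are assumptions, since the patent gives the formulas only in its drawings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAlignment(nn.Module):
    """Align a modality's features by attending over a shared memory space Mem_n."""
    def __init__(self, dim: int, mem_size: int = 64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_size, dim))   # Mem_n
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- text features X_L or image features X_V
        Q = self.W_q(x)                        # Q_* = X_* W_q
        K = self.W_k(self.memory)              # K   = Mem_n W_K
        V = self.W_v(self.memory)              # V   = Mem_n W_v
        # Similarity between modality vectors and memory vectors, then weights w_j.
        w = F.softmax(Q @ K.t() / K.size(-1) ** 0.5, dim=-1)
        # Aligned modality vectors: weighted combination of the memory value vectors V_{*j}.
        return w @ V
```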
Step 2.2: the aligned multimodal information enters the modality update modules, which progressively enhance each modality. Each modality update layer contains two modality update modules, namely a text update module and an image update module. To make the text and visual features focus on the parts of the information relevant to a given aspect and suppress the less important parts, the first modality update layer adopts an aspect-guided attention method, whose process is as follows:
a hidden representation of the target modality is generated, where I_A denotes the aspect feature vector, b^(1) denotes a learnable parameter, and the remaining symbols denote the trainable parameters and the modality vector;
the normalized attention weights are then computed;
the attention weights are used to take a weighted average of the target modality's feature vectors, yielding a new target-modality vector.
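Since the aspect-guided attention formula itself appears only in the patent's drawings, the sketch below uses a common additive formulation that matches the definitions above (a hidden representation built from the aspect vector I_A and a learnable parameter b^(1), softmax-normalized weights, and a weighted average of the target-modality features); the exact expression may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectGuidedAttention(nn.Module):
    """Weight the target modality's feature vectors by their relevance to the aspect."""
    def __init__(self, dim: int, aspect_dim: int):
        super().__init__()
        self.W_x = nn.Linear(dim, dim, bias=False)          # transforms the modality vectors
        self.W_a = nn.Linear(aspect_dim, dim, bias=False)   # transforms the aspect vector I_A
        self.b = nn.Parameter(torch.zeros(dim))             # learnable parameter b(1)
        self.w = nn.Linear(dim, 1, bias=False)

    def forward(self, x: torch.Tensor, aspect: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) target-modality features; aspect: (batch, aspect_dim)
        h = torch.tanh(self.W_x(x) + self.W_a(aspect).unsqueeze(1) + self.b)  # hidden representation
        alpha = F.softmax(self.w(h), dim=1)              # normalized attention weights
        return (alpha * x).sum(dim=1)                    # weighted average -> new target-modality vector
```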
Step 2.3: to capture the bidirectional interaction between modalities and strengthen inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism. The target modality is enhanced as follows:
* denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the image. In the corresponding formulas,
SA_mul, CMA_mul, and Att denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism, and the normalization function together with the additive attention mechanism, respectively. To better fuse the image and text modalities, the present invention uses the additive attention mechanism,
in which G, W_c, and b_c are learnable parameters. The weight of each modality update module is computed dynamically through the additive attention mechanism, so that information is exchanged between the two modalities and the enhanced multimodal sequences are finally obtained.
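The sketch below illustrates one way to realize the modality update module. PyTorch's nn.MultiheadAttention stands in for SA_mul and CMA_mul, and the additive gate built from G, W_c, and b_c is written in one common form (a small two-layer scorer followed by a sigmoid), since the patent's exact expression is given only in its drawings.

```python
import torch
import torch.nn as nn

class ModalityUpdate(nn.Module):
    """Enhance the target modality (*) using the supplementary modality (alpha)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # SA_mul
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # CMA_mul
        self.norm = nn.LayerNorm(dim)                                          # normalization function
        # Additive-attention gate (learnable G, W_c, b_c): dynamically weights the
        # self-attended and cross-attended sequences when they are merged.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, target: torch.Tensor, supplement: torch.Tensor) -> torch.Tensor:
        sa, _ = self.self_attn(target, target, target)             # intra-modality context
        ca, _ = self.cross_attn(target, supplement, supplement)    # information from the other modality
        g = torch.sigmoid(self.gate(torch.cat([sa, ca], dim=-1)))  # dynamic per-position weight
        return self.norm(g * sa + (1 - g) * ca)                    # enhanced multimodal sequence
```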
To learn deep abstract representations of the multimodal features, a GRU combines the result of the interactive attention mechanisms with the input of the current layer: in the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism first produce the enhanced multimodal sequences, and the GRU then produces the new text and image features.
Here SA_mul denotes the multi-head self-attention mechanism, the target-modality vector is as defined above, and n denotes the layer index.
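One plausible reading of the GRU step is to apply a GRU cell position-wise, treating the attention-enhanced sequence as the input and the current layer's input as the previous hidden state, as sketched below; the patent does not spell out this wiring, so it is an assumption.

```python
import torch
import torch.nn as nn

class GRUCombine(nn.Module):
    """Combine the attention-enhanced sequence with the current layer's input."""
    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, enhanced: torch.Tensor, layer_input: torch.Tensor) -> torch.Tensor:
        # Both tensors: (batch, seq_len, dim). The GRU cell is applied per position,
        # with the enhanced features as input and the layer input as the hidden state.
        b, n, d = enhanced.shape
        out = self.cell(enhanced.reshape(b * n, d), layer_input.reshape(b * n, d))
        return out.reshape(b, n, d)   # new text or image features for the next layer
```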
Step 3: the image features and text features obtained in Step 2 are fused multimodally using the self-attention mechanism,
where the inputs are the enhanced multimodal sequences and FC is the multimodal fusion function.
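A minimal sketch of this fusion step is shown below; concatenating the two interacted sequences before self-attention and mean-pooling the result are assumptions made for illustration, and FC is realized here as a single linear layer.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuse the interacted text and image sequences into one multimodal feature X_mul."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)   # FC: the multimodal fusion function

    def forward(self, text_seq: torch.Tensor, image_seq: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([text_seq, image_seq], dim=1)   # (batch, n_L + n_V, dim)
        fused, _ = self.self_attn(joint, joint, joint)    # self-attention over both modalities
        return self.fc(fused.mean(dim=1))                 # pooled multimodal feature X_mul
```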
Step 4: the image features and text features from Step 1 are concatenated with the fused multimodal features from Step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
A fully connected network fuses these features, and a softmax classifier in the last layer performs the sentiment prediction:
P = softmax(W_m E + b_m)
where W_m denotes the weights of the fully connected layer, b_m denotes the bias, and P denotes the sentiment prediction.
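The prediction head can be sketched as below. The hidden width and the three sentiment classes are assumptions; the concat and the final softmax follow the formulas above.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Concatenate X_mul, X_L, X_V and predict sentiment with a fully connected network."""
    def __init__(self, dim_mul: int, dim_l: int, dim_v: int,
                 hidden: int = 256, num_classes: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(dim_mul + dim_l + dim_v, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_classes)   # weights W_m and bias b_m

    def forward(self, x_mul: torch.Tensor, x_l: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        e_mul = torch.cat([x_mul, x_l, x_v], dim=-1)       # E_mul = concat(X_mul, X_L, X_V)
        e = self.fuse(e_mul)                               # fully connected feature fusion
        return torch.softmax(self.classifier(e), dim=-1)   # P = softmax(W_m E + b_m)
```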
To preserve richer features across modalities, the present invention uses an L2 loss as the loss function,
where α denotes a hyperparameter.
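The exact loss formula is given only in the patent's drawings; the sketch below shows one plausible arrangement consistent with the description, namely a prediction loss plus an α-weighted L2 term, where α is the stated hyperparameter.

```python
import torch
import torch.nn.functional as F

def training_loss(probs: torch.Tensor, target: torch.Tensor, model, alpha: float = 1e-4):
    """Prediction loss plus an alpha-weighted L2 term (illustrative only)."""
    # probs: class probabilities P from the softmax classifier; target: class indices.
    task_loss = F.nll_loss(torch.log(probs.clamp_min(1e-9)), target)
    l2_term = sum(p.pow(2).sum() for p in model.parameters())
    return task_loss + alpha * l2_term
```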
The present invention makes full use of cross-modal information interaction, which helps improve the accuracy of sentiment prediction.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211623613.1A | 2022-12-16 | 2022-12-16 | A Cross-Modal Sentiment Analysis Method Based on Attention Network |

Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN115982652A | 2023-04-18 |