CN115982652A - A Cross-Modal Sentiment Analysis Method Based on Attention Network - Google Patents

A Cross-Modal Sentiment Analysis Method Based on Attention Network

Info

Publication number
CN115982652A
Authority
CN
China
Prior art keywords
modal
modality
features
text
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211623613.1A
Other languages
Chinese (zh)
Inventor
章韵
王梦婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-04-18
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211623613.1A
Publication of CN115982652A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the fields of natural language processing, computer vision, and sentiment analysis, and discloses a cross-modal sentiment analysis method based on an attention network, which comprises the following steps. Step 1: extract image features, text features, and aspect features. Step 2: the extracted image and text features enter modality update layers, each comprising a modality alignment module and two modality update modules; each modality is aligned in the modality alignment module, the aligned modalities enter the modality update modules, and the interacted image features and text features are finally obtained by step-by-step supplementation using the correlation between different modalities. Step 3: the interacted image features and text features are fused across modalities using a self-attention mechanism. Step 4: a concat operation is performed on the image features, text features, and multimodal features to predict the sentiment. The method makes full use of cross-modal information interaction and helps improve the accuracy of sentiment prediction.

Description

A cross-modal sentiment analysis method based on an attention network

Technical Field

The present invention relates to the fields of natural language processing, computer vision, and sentiment analysis, and specifically to a cross-modal sentiment analysis method based on an attention network.

Background Art

With the development of online social platforms and network technologies, users have more diverse ways to express their opinions on the Internet, and more and more users choose videos, pictures, or articles to express their emotions and views. How to analyze the emotional tendencies and public-opinion orientation contained in this multimodal information has become a challenge in the field of sentiment analysis. However, due to the heterogeneity and asynchrony of multimodal data, fusing multimodal information is not easy. In terms of heterogeneity, different modalities live in different feature spaces. In terms of asynchrony, the inconsistent sampling rates of time-series data from different modalities make it impossible to obtain the best mapping between modalities. There are already many studies on multimodal analysis, and the specific methods can be summarized into two categories. One uses cross-modal attention to provide a soft mapping between different modalities, thereby modeling the asynchrony of multimodal data; however, this type of method does not consider the heterogeneity of multimodal data. The other considers the heterogeneity of multimodal data: methods in this category divide each modality into a shared part and a private part, each represented by a different neural network. The limitation of these methods is that they do not consider the asynchrony between different modalities.

Summary of the Invention

In order to solve the problems of heterogeneity and asynchrony in multimodal data, the present invention proposes a cross-modal sentiment analysis method based on an attention network, which adopts a modality alignment module and a modality update module and uses an attention mechanism to perform cross-modal interaction, thereby improving the accuracy of multimodal sentiment analysis.

In order to achieve the above object, the present invention is realized through the following technical solutions:

The present invention is a cross-modal sentiment analysis method based on an attention network, which specifically includes the following steps:

Step 1: Extract the image features and text features corresponding to the input image-text pair;

Step 2: The extracted image and text features enter the modality update layers. Each modality update layer includes a modality alignment module for aligning the representation spaces and two modality update modules. Each modality is aligned in the modality alignment module and then enters the modality update modules. The features are gradually supplemented using the correlation between different modalities, and the interacted image features and text features are finally obtained;

Step 3: The interacted image features and text features obtained in step 2 are fused across modalities using a self-attention mechanism to obtain the multimodal features;

Step 4: A concat operation is performed on the image features and text features from step 1 and the fused multimodal features from step 3 to perform sentiment prediction.

Preferably, step 2 specifically includes the following steps:

Step 2.1: The modality alignment module aligns the feature spaces of different modalities before modality interaction to obtain the multimodal information;

Step 2.2: The aligned multimodal information enters the modality update modules, which gradually enhance each modality. Each modality update layer contains two modality update modules, namely a text update module and an image update module. In order to make the text and visual features focus on the parts of the information relevant to the given aspect and suppress the less important parts, an aspect-guided attention method is adopted in the first modality update layer. The specific process is as follows:

[Equation image: aspect-guided attention over the target modality]

where the quantities in the formula are the generated hidden representation of the target modality, the aspect feature vector I_A, the learnable parameter b^(1), a learnable weight parameter, and the modality vector;

Compute the normalized attention weights:

[Equation image: softmax normalization of the attention weights]

The attention weights are then used to take a weighted average of the feature vectors of the target modality, yielding a new target modality vector.

Step 2.3: To capture the bidirectional interaction between different modalities and strengthen the interaction between them, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism. The specific process of enhancing the target modality is as follows:

[Equation image: enhancement of the target modality]

where * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the image. The formulas are as follows:

[Equation images: multi-head self-attention over the target modality and multi-head cross-modal attention between the target and supplementary modalities]

where SA_mul, CMA_mul, and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism with a normalization function, and the additive attention mechanism. To better fuse the image and text modalities, the present invention uses additive attention, expressed as follows:

[Equation images: additive attention weighting that combines the self-attention and cross-modal attention outputs]

where G, W_c, and b_c denote learnable parameters. The weight of each modality update module is computed dynamically through the additive attention mechanism, which achieves information interaction between the two modalities and finally yields the enhanced multimodal sequences of the text and image modalities.

To learn deep abstract representations of the multimodal features, a GRU is used to combine the result of the interactive attention mechanism with the input of the current layer. In the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism are first used to obtain the enhanced multimodal sequence, and then the GRU is used to obtain new text and image features. The specific process is as follows:

[Equation image: GRU update combining the enhanced sequence with the current layer input]

where SA_mul denotes the multi-head self-attention mechanism, the operand is the target modality vector, and n denotes the number of layers.

Preferably, in step 3, the image features and text features obtained in step 2 are fused across modalities using a self-attention mechanism, expressed as follows:

[Equation image: self-attention-based multimodal fusion]

where both inputs represent multimodal sequences and FC is the multimodal fusion function.

Preferably, step 4 is specifically as follows: a concat operation is performed on the text features and image features from step 1 and the fused multimodal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:

E_mul = concat(X_mul, X_L, X_V)

A fully connected network performs feature fusion on the data, and a softmax classifier in the last layer performs the sentiment prediction, computed as follows:

P = softmax(W_m E + b_m)

where W_m denotes the weight of the fully connected layer, b_m denotes the bias, and P denotes the sentiment prediction.

Preferably, the specific process of extracting image features with the VGG16 network is as follows:

Step 11: Input: a 224*224*3 image pixel matrix is input;

Step 12: Convolution and pooling: the input image pixel matrix undergoes 5 rounds of convolution and pooling. Each convolution kernel has size 3*3*w, where w denotes the matrix depth. After convolution, multiple feature maps are obtained through the ReLU activation function, and max pooling is used to select local features. The convolution is computed as follows:

f_j = R(X_i * K_j + b)

where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes convolution kernels of different matrix depths;

Step 13: Full connection: a 1*1*1000 image feature representation vector is obtained after three fully connected layers;

Step 14: Finally, the image feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2 … X_Vn}.

Preferably, in step 1, a pre-trained BERT model is used to obtain the text features of the image-text pair. The specific process is as follows:

Step 21: Text preprocessing: meaningless words and symbols in Internet slang are preprocessed, and words that do not affect the judgment of the text's sentiment tendency are treated as stop words and removed;

Step 22: The pre-trained BERT model extracts the word vector sequence of the input text. After segmentation and tokenization, the word sequence is taken as input and passes through BERT's word, segment, and position embeddings, flowing layer by layer through the stack and finally generating the word vectors of the text features, denoted X_Lp = {X_L1, X_L2 … X_Ln}.

Preferably, the aspect features of the given aspect phrase in step 1 are extracted as follows:

Given an aspect phrase A = {A_1, A_2 … A_n}, word embeddings are first used to obtain the word embedding vectors a_j, and then a bidirectional LSTM model learns the hidden representation V_j of each aspect word embedding vector:

[Equation image: bidirectional LSTM encoding of the aspect word embeddings]

The average of all hidden representations V_j is then taken as the final aspect feature vector V_A:

[Equation image: averaging of the hidden representations V_j to obtain V_A]

The beneficial effects of the present invention are as follows:

(1) The sentiment analysis method of the present invention uses a modality alignment module and a modality update module together with an attention mechanism to perform cross-modal interaction, thereby improving the accuracy of multimodal sentiment analysis.

(2) The modality update layer of the present invention includes a modality alignment module and a modality update module; the modality alignment module aligns the feature sequences of different modalities, which facilitates interaction between modalities.

(3) The modality update module uses a multi-head self-attention mechanism and a cross-modal attention mechanism to strengthen the interaction between modalities and fully fuse the shared and private characteristics of the different modalities.

(4) To preserve the rich features between modalities, the present invention fuses the fused multimodal features with the initial modal features again before performing sentiment classification.

(5) The present invention makes full use of cross-modal information interaction, which helps improve the accuracy of sentiment prediction.

Brief Description of the Drawings

FIG. 1 is a flow chart of the sentiment analysis method of the present invention.

FIG. 2 is an architecture diagram of the sentiment analysis method of the present invention.

FIG. 3 is a diagram of the modality update module of the present invention.

Detailed Description

The following discloses embodiments of the present invention with reference to the drawings. For clarity, many practical details are described together in the following description. It should be understood, however, that these practical details are not intended to limit the present invention; in some embodiments of the present invention, these practical details are unnecessary.

As shown in FIGS. 1-3, the present invention is a cross-modal sentiment analysis method based on an attention network. A cross-modal sentiment analysis model based on a visual attention network is proposed, in which the modality update layers enhance the information interaction between the image and text modalities, thereby improving the robustness and accuracy of the model. Specifically, the cross-modal sentiment analysis method includes the following steps:

Step 1: Extract the image features and text features corresponding to the input image-text pair, and the aspect features of the given aspect phrase.

A VGG16 network is used to extract the image features. The VGG16 network consists of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. The convolutional layers obtain image feature maps through convolution: on the image representation matrix, the convolution kernel slides with a certain stride and is multiplied element-wise with the corresponding entries of the input matrix, finally yielding the feature map of the image for the current convolution kernel. The pooling layers reduce the dimensionality of the convolved feature maps and use max pooling to select local features. Finally, the fully connected layers integrate the features output by the preceding layers.

The specific process of extracting the image features is as follows:

Step 11: Input: a 224*224*3 image pixel matrix is input;

Step 12: Convolution and pooling: the input image pixel matrix undergoes 5 rounds of convolution and pooling. Each convolution kernel has size 3*3*w, where w denotes the matrix depth. After convolution, multiple feature maps are obtained through the ReLU activation function, and max pooling is used to select local features. The convolution is computed as follows:

f_j = R(X_i * K_j + b)

where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes convolution kernels of different matrix depths;

Step 13: Full connection: a 1*1*1000 image feature representation vector is obtained after three fully connected layers;

Step 14: Finally, the image feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2 … X_Vn}.
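
For illustration, this image-feature extraction step can be sketched in PyTorch with a pre-trained VGG16 from torchvision (assuming a recent torchvision version); the 1*1*1000 output of the final fully connected layer is taken as the image feature vector. The function name and preprocessing constants are illustrative, not taken from the patent.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Sketch of the VGG16 feature extractor described above (assumed setup, not the
# patent's reference implementation): the input is resized to 224x224x3 and the
# 1*1*1000 output of the last fully connected layer is used as the image feature X_V.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def extract_image_features(image_path: str) -> torch.Tensor:
    """Return a (1, 1000) feature vector for one image."""
    image = Image.open(image_path).convert("RGB")
    x = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)
    return vgg16(x)                             # (1, 1000), output of the last FC layer
```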

A pre-trained BERT model is used to obtain the text features of the image-text pair. The specific process is as follows:

Step 21: Text preprocessing: meaningless words and symbols in Internet slang are preprocessed, and words that do not affect the judgment of the text's sentiment tendency are treated as stop words and removed;

Step 22: The pre-trained BERT model extracts the word vector sequence of the input text. After segmentation and tokenization, the word sequence is taken as input and passes through BERT's word, segment, and position embeddings, flowing layer by layer through the stack and finally generating the word vectors of the text features, denoted X_Lp = {X_L1, X_L2 … X_Ln}.
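
A minimal sketch of the BERT-based text feature extraction using the Hugging Face transformers library is shown below; the checkpoint name bert-base-chinese and the choice of the last hidden states as the word vectors X_L are assumptions of the sketch, not specified by the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch of the BERT-based text feature extraction (assumed configuration):
# the tokenizer handles segmentation, and BERT adds word/segment/position
# embeddings internally; the last hidden states are taken as X_L.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def extract_text_features(text: str) -> torch.Tensor:
    """Return a (seq_len, hidden_size) sequence of word vectors for one sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # X_L = {X_L1, ..., X_Ln}
```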

The aspect features of the given aspect phrase are extracted as follows:

Given an aspect phrase A = {A_1, A_2 … A_n}, word embeddings are first used to obtain the word embedding vectors a_j, and then a bidirectional LSTM model learns the hidden representation V_j of each aspect word embedding vector:

[Equation image: bidirectional LSTM encoding of the aspect word embeddings]

The average of all hidden representations V_j is then taken as the final aspect feature vector V_A:

[Equation image: averaging of the hidden representations V_j to obtain V_A]
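
A minimal sketch of this aspect encoder is given below; the embedding and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AspectEncoder(nn.Module):
    """Sketch of the aspect-feature extractor: embed the aspect words, run a
    bidirectional LSTM, and mean-pool the hidden states into V_A."""

    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, aspect_ids: torch.Tensor) -> torch.Tensor:
        # aspect_ids: (batch, n) indices of the aspect words A_1..A_n
        a = self.embedding(aspect_ids)        # word embedding vectors a_j
        v, _ = self.bilstm(a)                 # hidden representations V_j, (batch, n, 2*hidden)
        return v.mean(dim=1)                  # V_A: average over the aspect words
```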

Step 2: The extracted image and text features enter the modality update layers. Each modality update layer includes a modality alignment module for aligning the representation spaces and two modality update modules. Each modality is aligned in the modality alignment module and then enters the modality update modules. By gradually supplementing each modality using the correlation between different modalities, the interacted image features and text features are finally obtained.

Step 2.1: The modality alignment module aims to align the feature spaces of different modalities before modality interaction. First, the unimodal representations of the multiple modalities are mapped into the same memory space, expressed as follows:

[Equation image: mapping of the unimodal representations into the shared memory space]

where the first term denotes the text vector, Mem_n denotes the memory space vector, θ denotes the parameters, the result is the aligned modality vector, f(·) denotes the exchange function between the modality vector and the memory space vector, and n denotes the n-th modality update layer. The specific computation of the modality alignment module is as follows:

[Equation image: linear projection of the modality vector to the query Q_*]

K = Mem_n · W_K
where W_q and W_K denote the parameters of the linear transformations, Q_* denotes the linearly transformed vector representation of each of the two modalities, and K denotes the linearly transformed memory space (the keys). The similarity between the modality vector and the memory space vector is computed as follows:

[Equation image: similarity between the modality query and the memory space keys]

The weight of the j-th memory vector is expressed as:

[Equation image: softmax weight of the j-th memory vector]

After linear transformation, the memory space vector is expressed as:

V = Mem_n · W_v

W_v denotes learnable parameters. The query vector is then computed from the memory space vectors and the weights:

[Equation image: weighted sum of the transformed memory vectors]

where *∈{L,V} denotes the image and text modalities, and V_*j denotes the memory space vector.
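
The alignment computation above can be sketched as attention over a learnable memory bank; the sketch below assumes a shared feature dimension and a scaled dot-product similarity, details the patent does not spell out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAlignment(nn.Module):
    """Sketch of the modality alignment module: each unimodal sequence queries a
    shared, learnable memory Mem_n; the attention-weighted memory values are
    returned as the aligned representation. Sizes are illustrative assumptions."""

    def __init__(self, dim: int = 256, mem_slots: int = 64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_slots, dim))  # Mem_n
        self.w_q = nn.Linear(dim, dim, bias=False)               # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)               # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)               # W_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) unimodal features (text X_L or image X_V)
        q = self.w_q(x)                                  # Q_* = X_* · W_q
        k = self.w_k(self.memory)                        # K = Mem_n · W_K
        v = self.w_v(self.memory)                        # V = Mem_n · W_v
        scores = q @ k.t() / k.size(-1) ** 0.5           # similarity of query and memory
        weights = F.softmax(scores, dim=-1)              # weight of the j-th memory vector
        return weights @ v                               # aligned modality vector
```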

Step 2.2: The aligned multimodal information enters the modality update modules, which gradually enhance each modality. Each modality update layer contains two modality update modules, namely a text update module and an image update module. In order to make the text and visual features focus on the parts of the information relevant to the given aspect and suppress the less important parts, an aspect-guided attention method is adopted in the first modality update layer. The specific process is as follows:

[Equation image: aspect-guided attention over the target modality]

where the quantities in the formula are the generated hidden representation of the target modality, the aspect feature vector I_A, the learnable parameter b^(1), a learnable weight parameter, and the modality vector;

Compute the normalized attention weights:

[Equation image: softmax normalization of the attention weights]

The attention weights are then used to take a weighted average of the feature vectors of the target modality, yielding a new target modality vector.
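
Since the aspect-guided attention formula appears only as an image, the sketch below assumes a common additive formulation consistent with the surrounding description: each modality vector is scored against the aspect vector I_A, the scores are softmax-normalized, and the weighted average gives the new target modality vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectGuidedAttention(nn.Module):
    """Sketch of the aspect-guided attention used in the first modality update
    layer. The parameter shapes and the tanh scoring are assumptions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.w = nn.Linear(2 * dim, dim)        # assumed learnable weight W^(1), bias b^(1)
        self.u = nn.Linear(dim, 1, bias=False)  # scoring vector

    def forward(self, x: torch.Tensor, aspect: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) modality vectors; aspect: (batch, dim) I_A
        a = aspect.unsqueeze(1).expand(-1, x.size(1), -1)
        h = torch.tanh(self.w(torch.cat([x, a], dim=-1)))   # hidden representation
        weights = F.softmax(self.u(h), dim=1)                # normalized attention weights
        return (weights * x).sum(dim=1)                      # new target modality vector
```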

Step 2.3: To capture the bidirectional interaction between different modalities and strengthen the interaction between them, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism. The specific process of enhancing the target modality is as follows:

[Equation image: enhancement of the target modality]

where * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the image. The formulas are as follows:

[Equation images: multi-head self-attention over the target modality and multi-head cross-modal attention between the target and supplementary modalities]

where SA_mul, CMA_mul, and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism with a normalization function, and the additive attention mechanism. To better fuse the image and text modalities, the present invention uses additive attention, expressed as follows:

[Equation images: additive attention weighting that combines the self-attention and cross-modal attention outputs]

where G, W_c, and b_c denote learnable parameters. The weight of each modality update module is computed dynamically through the additive attention mechanism, which achieves information interaction between the two modalities and finally yields the enhanced multimodal sequences of the text and image modalities.
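
A sketch of one modality update module consistent with the description above is given below; the sigmoid gating used to mix the self-attention and cross-modal attention outputs is an assumed realization of the additive-attention weighting, since the exact formulas appear only as images.

```python
import torch
import torch.nn as nn

class ModalityUpdateModule(nn.Module):
    """Sketch of one modality update module: multi-head self-attention over the
    target modality, multi-head cross-modal attention from the supplementary
    modality, and a dynamically computed weight that mixes the two results."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # SA_mul
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # CMA_mul
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                  nn.Linear(dim, 1))                            # additive attention

    def forward(self, target: torch.Tensor, supplement: torch.Tensor) -> torch.Tensor:
        # target: (batch, len_t, dim); supplement: (batch, len_s, dim)
        sa, _ = self.self_attn(target, target, target)              # self-attention result
        cma, _ = self.cross_attn(target, supplement, supplement)    # cross-modal result
        g = torch.sigmoid(self.gate(torch.cat([sa, cma], dim=-1)))  # dynamic mixing weight
        return g * sa + (1.0 - g) * cma                             # enhanced sequence
```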

In step 2.3, to learn deep abstract representations of the multimodal features, a GRU is used to combine the result of the interactive attention mechanism with the input of the current layer. In the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism are first used to obtain the enhanced multimodal sequence, and then the GRU is used to obtain new text and image features. The specific process is as follows:

[Equation image: GRU update combining the enhanced sequence with the current layer input]

where SA_mul denotes the multi-head self-attention mechanism, the operand is the target modality vector, and n denotes the number of layers.
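
The GRU-based layer update can be sketched as follows; applying a GRUCell position-wise, with the enhanced sequence as input and the current layer's input as the hidden state, is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class LayerUpdateGRU(nn.Module):
    """Sketch of the GRU update: combine the sequence enhanced by the attention
    mechanisms with the input of the current layer, position by position."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, enhanced: torch.Tensor, layer_input: torch.Tensor) -> torch.Tensor:
        # enhanced, layer_input: (batch, seq_len, dim)
        b, n, d = enhanced.shape
        out = self.gru(enhanced.reshape(b * n, d), layer_input.reshape(b * n, d))
        return out.reshape(b, n, d)   # new text/image features for the next layer
```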

Step 3: The image features and text features obtained in step 2 are fused across modalities using a self-attention mechanism, expressed as follows:

[Equation image: self-attention-based multimodal fusion]

where both inputs represent multimodal sequences and FC is the multimodal fusion function.
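
Step 3 can be sketched as self-attention over the concatenated text and image sequences followed by pooling and a fusion layer FC; the mean pooling is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of step 3: concatenate the interacted text and image sequences,
    apply multi-head self-attention over the joint sequence, and pool it into a
    single multimodal vector X_mul."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)   # the fusion function FC

    def forward(self, text_seq: torch.Tensor, image_seq: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([text_seq, image_seq], dim=1)   # joint multimodal sequence
        fused, _ = self.attn(joint, joint, joint)
        return self.fc(fused.mean(dim=1))                 # X_mul: (batch, dim)
```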

Step 4: A concat operation is performed on the image features and text features from step 1 and the fused multimodal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:

E_mul = concat(X_mul, X_L, X_V)

A fully connected network performs feature fusion on the data, and a softmax classifier in the last layer performs the sentiment prediction, computed as follows:

P = softmax(W_m E + b_m)

where W_m denotes the weight of the fully connected layer, b_m denotes the bias, and P denotes the sentiment prediction.
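
A sketch of the prediction head in step 4 is given below; the feature dimensions, the three-class output, and the use of pooled text/image vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Sketch of step 4: concatenate X_mul with the initial text and image
    features, fuse them with a fully connected network, and predict the
    sentiment with a softmax classifier."""

    def __init__(self, mul_dim: int = 256, text_dim: int = 768,
                 image_dim: int = 1000, num_classes: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(mul_dim + text_dim + image_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),                 # W_m, b_m
        )

    def forward(self, x_mul, x_l, x_v):
        # x_mul, x_l, x_v: pooled multimodal, text, and image feature vectors
        e_mul = torch.cat([x_mul, x_l, x_v], dim=-1)     # E_mul = concat(X_mul, X_L, X_V)
        return torch.softmax(self.fuse(e_mul), dim=-1)   # P = softmax(W_m E + b_m)
```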

To preserve richer features between modalities, the present invention uses an L2 loss as the loss function, as follows:

[Equation image: L2 loss with hyperparameter α]

where α denotes a hyperparameter.

The present invention makes full use of cross-modal information interaction, which helps improve the accuracy of sentiment prediction.

Claims (8)

1. A cross-modal emotion analysis method based on an attention network, characterized in that the cross-modal emotion analysis method comprises the following steps:
step 1: extracting picture features corresponding to an input picture-text pair, picture text features, and aspect features of a given aspect phrase;
step 2: the extracted picture and text features enter modality updating layers, each modality updating layer comprises a modality alignment module and two modality updating modules, each modality is aligned in the modality alignment module, the aligned modalities enter the modality updating modules, and the picture features and text features after interaction are finally obtained by supplementing them step by step using the correlation of different modalities;
step 3: performing multi-modal fusion on the interacted picture features and text features obtained in step 2 by adopting a self-attention mechanism to obtain multi-modal features;
step 4: performing a concat operation on the picture features and picture text features in step 1 and the fused multi-modal features in step 3 to perform emotion prediction.
2. The method for cross-modal emotion analysis based on attention network as claimed in claim 1, wherein: the step 2 specifically comprises the following steps:
step 2.1: the modal alignment module aligns feature spaces of different modes before modal interaction to obtain multi-modal information;
step 2.2: the aligned multi-modal information enters the modality updating modules to gradually enhance each modality; each modality updating layer comprises two modality updating modules, namely a text updating module and a picture updating module; the first layer of the modality updating layers adopts an aspect-guided attention method, and the specific process is as follows:
[Equation image: aspect-guided attention for the target modality]
wherein the quantities in the formula are the hidden representation of the generated target modality, the aspect feature vector I_A, the learnable parameter b^(1), a variable (learnable) parameter, and the modality vector;
calculating a normalized attention weight:
[Equation image: normalization of the attention weights]
the attention weights are used to carry out a weighted average of the feature vectors of the target modality to obtain a new target modality vector;
step 2.3: in order to capture the bidirectional interaction among different modalities and enhance the interaction among them, the modality updating module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality; the specific process is as follows:
[Equation image: enhancement of the target modality]
wherein * represents the target modality to be enhanced and α represents the supplementary modality; if the target modality is text, the supplementary modality is a picture, and the formulas are as follows:
[Equation images: multi-head self-attention over the target modality and multi-head cross-modal attention between the target and supplementary modalities]
wherein SA_mul, CMA_mul and Att respectively represent a multi-head self-attention mechanism, a multi-head cross-modal attention mechanism with a normalization function, and an additive attention mechanism; the additive attention mechanism is used and is specifically expressed as follows:
[Equation images: additive attention weighting combining the self-attention and cross-modal attention outputs]
wherein G, W_c and b_c represent learnable parameters; the weight of each modality updating module is obtained by dynamic calculation through the additive attention mechanism, so that information interaction between the two modalities is achieved and the enhanced multi-modal sequences of the two modalities are finally obtained.
3. The attention network-based cross-modal emotion analysis method of claim 2, wherein: in step 2.3, in order to learn the deep abstract representation of the multi-modal features, the GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer; in the nth layer, the cross-modal attention mechanism and the self-attention mechanism are used to obtain an enhanced multi-modal sequence, and then the GRU is used to obtain new text and picture features, which specifically includes the following steps:
[Equation image: GRU update combining the enhanced sequence with the current layer input]
wherein SA_mul represents the multi-head self-attention mechanism, the operand is the target modality vector, and n represents the number of layers.
4. The method for cross-modal emotion analysis based on attention network as claimed in claim 1, wherein: in the step 3, the picture features and the text features obtained in the step 2 are subjected to multi-modal fusion by using an attention mechanism, which is specifically represented as follows:
[Equation image: self-attention-based multi-modal fusion]
wherein the inputs all represent multi-modal sequences, and FC is the multi-modal fusion function.
5. The attention network-based cross-modal emotion analysis method of claim 1, wherein: the step 4 specifically comprises the following steps: performing a concat operation on the text features, the picture features and the fused multi-modal features in steps 1 and 3 to obtain a representation E_mul containing the three features as input data:
E_mul = concat(X_mul, X_L, X_V)
performing feature fusion on the data by using a full-connection network, and performing emotion prediction by using a softmax classifier at the last layer, wherein the emotion prediction calculation formula is as follows:
P = softmax(W_m E + b_m)
wherein W_m represents the weight of the fully connected layer, b_m represents the bias, and P represents the emotion prediction.
6. The method for cross-modal emotion analysis based on attention network as claimed in claim 1, wherein: the method for extracting the aspect features of the aspect phrases given in the step 1 comprises the following steps:
given an aspect phrase A = {A_1, A_2 … A_n}, a word embedding vector a_j is first obtained using word embedding, and then a bidirectional LSTM model is adopted to learn the hidden representation V_j of each aspect word embedding vector:
[Equation image: bidirectional LSTM encoding of the aspect word embeddings]
then the average of all hidden representations V_j is taken as the final aspect feature vector V_A:
[Equation image: averaging of the hidden representations V_j to obtain V_A]
7. The attention network-based cross-modal emotion analysis method of claim 1, wherein: in the step 1, a VGG16 network is adopted to extract picture features, the VGG16 network is composed of 13 convolution layers, 5 pooling layers and 3 full-connection layers, and the specific process of extracting the picture features by adopting the VGG16 network is as follows:
step 11: input: inputting a 224*224*3 image pixel matrix;
step 12: convolution pooling: the input image pixel matrix is subjected to 5 rounds of convolution and pooling, the size of each convolution kernel is 3*3*w, w represents the depth of the matrix, a plurality of feature maps are obtained through the activation function ReLU after convolution, and local features are screened by adopting max pooling, wherein the convolution calculation formula is as follows:
f_j = R(X_i * K_j + b)
where R represents the ReLU activation function, * represents the convolution operation, b represents the bias term, and K_j represents convolution kernels of different matrix depths;
step 13: full connection: a 1*1*1000 image feature representation vector is obtained through three fully connected layers;
step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is represented by X_Vp = {X_V1, X_V2 … X_Vn}.
8. The attention network-based cross-modal emotion analysis method of claim 1, wherein: in the step 1, a Bert pre-training model is adopted to obtain the picture text characteristics, and the specific process is as follows:
step 21: text preprocessing: preprocessing is carried out aiming at nonsense words and symbols in the network words, and words which do not influence the judgment of the text emotional tendency are used as stop words and deleted;
step 22: a pre-trained Bert model is adopted to extract the word vector sequence of the input text; after segmentation and labeling, the word sequence is taken as input and passes through the word embedding, segment embedding and position embedding of the Bert model, flowing layer by layer through the stack and finally generating the word vectors of the text features, represented by X_Lp = {X_L1, X_L2 … X_Ln}.
CN202211623613.1A 2022-12-16 2022-12-16 A Cross-Modal Sentiment Analysis Method Based on Attention Network Pending CN115982652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211623613.1A CN115982652A (en) 2022-12-16 2022-12-16 A Cross-Modal Sentiment Analysis Method Based on Attention Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211623613.1A CN115982652A (en) 2022-12-16 2022-12-16 A Cross-Modal Sentiment Analysis Method Based on Attention Network

Publications (1)

Publication Number Publication Date
CN115982652A true CN115982652A (en) 2023-04-18

Family

ID=85962068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211623613.1A Pending CN115982652A (en) 2022-12-16 2022-12-16 A Cross-Modal Sentiment Analysis Method Based on Attention Network

Country Status (1)

Country Link
CN (1) CN115982652A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention
CN118690259A (en) * 2024-08-28 2024-09-24 齐鲁工业大学(山东省科学院) A cross-modal positive and negative semantic classification method based on text sentiment and image content perception


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination