CN115982652A - Cross-modal emotion analysis method based on attention network

Info

Publication number
CN115982652A
Authority
CN
China
Prior art keywords
modal
text
picture
modality
features
Prior art date
Legal status
Pending
Application number
CN202211623613.1A
Other languages
Chinese (zh)
Inventor
章韵
王梦婷
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211623613.1A priority Critical patent/CN115982652A/en
Publication of CN115982652A publication Critical patent/CN115982652A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention belongs to the fields of natural language processing, computer vision and emotion analysis, and discloses a cross-modal emotion analysis method based on an attention network, which comprises the following steps. Step 1: extract picture features, text features and aspect features. Step 2: the extracted picture and text features enter stacked modality update layers; each modality update layer comprises a modality alignment module and two modality update modules; the modalities are first aligned in the modality alignment module and then enter the modality update modules, where the correlations between the different modalities are used to supplement each other step by step, finally yielding interacted picture features and text features. Step 3: perform multi-modal fusion of the interacted picture features and text features from step 2 using a self-attention mechanism. Step 4: perform a concat operation on the picture features, the text features and the multi-modal features for emotion prediction. The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.

Description

Cross-modal emotion analysis method based on attention network
Technical Field
The invention relates to the fields of natural language processing, computer vision and emotion analysis, in particular to a cross-modal emotion analysis method based on an attention network.
Background
With the development of social networking platforms and network technologies, the ways in which users express themselves on the internet have become more diverse, and more and more users choose to express their emotions and opinions through videos, pictures or articles. How to analyze the emotional tendency and public-opinion orientation contained in such multi-modal information has become a challenge in the field of emotion analysis. However, fusing multi-modal information is not easy because of the heterogeneity and asynchrony of multi-modal data. In terms of heterogeneity, different modalities live in different feature spaces. In terms of asynchrony, the non-uniform sampling rates of time-series data from different modalities mean that an optimal mapping between the modalities cannot be obtained directly. Existing work on multi-modal analysis can be summarized into two categories. The first models the asynchrony of multi-modal data with cross-modal attention, which provides a soft mapping between different modalities; such approaches, however, do not take the heterogeneity of multi-modal data into account. The second considers the heterogeneity of multi-modal data: methods in this category separate each modality into a modality-shared part and a modality-private part, represented by different neural networks. The limitation of these approaches is that they do not take the asynchrony between different modalities into account.
Disclosure of Invention
In order to solve the problems of multi-modal heterogeneity and asynchrony, the invention provides a cross-modal emotion analysis method based on an attention network.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a cross-modal emotion analysis method based on an attention network, which specifically comprises the following steps:
step 1: extracting picture characteristics and picture text characteristics corresponding to an input picture text;
step 2: the extracted picture text features enter modality updating layers, each modality updating layer comprises a modality alignment module for aligning a representation space and two modality updating modules, each modality is aligned in the modality alignment module, the aligned modalities enter the modality updating modules, and the interactive picture features and text features are finally obtained by utilizing the correlation of different modalities to supplement step by step;
and step 3: performing multi-mode fusion on the interactive picture features and text features obtained in the step 2 by adopting a self-attention mechanism to obtain multi-mode features;
and 4, step 4: and (4) performing concat operation on the picture features and the picture text features in the step (1) and the fused multi-modal features in the step (3) to perform emotion prediction.
Preferably, the following components: the step 2 specifically comprises the following steps:
step 2.1: the modal alignment module aligns feature spaces of different modes before modal interaction to obtain multi-modal information;
step 2.2: the aligned multi-modal information enters a modal updating module to gradually enhance each modal, and each modal updating layer comprises two modal updating modules
Figure BDA0004003502480000021
And &>
Figure BDA0004003502480000022
Namely a text update module and a picture update module, in order to make text and visual features focus more on the information part of a given aspect and suppress the less important parts, an aspect-guided attention method is adopted at the first layer of the modality update layer, and the specific process is as follows:
Figure BDA0004003502480000023
wherein
Figure BDA0004003502480000024
Hidden representation of the generated target modality, I A Representative facet feature vector, b (1) Represents a learnable parameter, <' > or>
Figure BDA0004003502480000025
Represents a variable parameter, <' > is combined with>
Figure BDA0004003502480000026
Representing a modality vector;
calculating a normalized attention weight:
Figure BDA0004003502480000027
using attention weights
Figure BDA0004003502480000028
Carrying out weighted average on the feature vectors of the target modes to obtain a new target mode vector->
Figure BDA0004003502480000029
Step 2.3: in order to capture the bidirectional interaction between different modalities and strengthen the inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality. In the corresponding formulas, * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the picture. SA_mul, CMA_mul and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism and the additive attention mechanism, and a normalization function is applied to the attention outputs. In order to better fuse the image and text modalities, the invention uses the additive attention mechanism, in which G, W_c and b_c are learnable parameters; the weight of each modality update module is obtained by dynamic calculation through the additive attention mechanism, achieving information interaction between the two modalities and finally producing the strengthened multi-modal sequences of the text and picture modalities.
In order to learn a deep abstract representation of the multi-modal features, a GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer: in the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism are first used to obtain the enhanced multi-modal sequence, and the GRU is then used to obtain the new text and picture features, where SA_mul denotes the multi-head self-attention mechanism, the other input is the target modality vector, and n denotes the layer index.
Preferably, the following components: in the step 3, the picture features and the text features obtained in the step 2 are subjected to multi-modal fusion by using an attention-oriented mechanism, which is specifically represented as follows:
Figure BDA0004003502480000041
wherein:
Figure BDA0004003502480000042
all represent a multimodal sequence, FC is a fused multimodal function.
Preferably: the step 4 specifically comprises the following steps: performing concat operation on the text features, the picture features and the fused multi-modal features in the steps 1 and 3 to obtain a representation E containing three features mul As input data:
E mul =concat(X mul ,X L ,X V )
performing feature fusion on the data by using a full-connection network, and performing emotion prediction by using a softmax classifier at the last layer, wherein the emotion prediction calculation formula is as follows:
P=softmax(W m E+b m )
wherein W m Weight representing full connection layer, b m Representing bias and P representing emotion prediction.
Preferably: the specific process of extracting the picture features by adopting the VGG16 network is as follows:
step 11: inputting: inputting 224 x 3 matrix of image pixels;
step 12: convolution pooling: the input image pixel matrix is subjected to 5 rounds of convolution pooling, the size of each convolution kernel is 3 x w, w represents the depth of the matrix, a plurality of feature maps are obtained through an activation function ReLU after convolution, and local features are screened by adopting maximum pooling, wherein the convolution calculation formula is as follows:
f j =R(X i *K j +b)
where R represents the ReLU activation function, b represents the bias term, K j Convolution kernels representing different matrix depths;
step 13: fully connecting: obtaining 1 × 1000 image feature representation vectors through three times of full connection;
step 14: finally, obtaining X for picture feature vector through pre-trained VGG16 network Vp ={X V1 ,X V2 …X Vn Denotes.
Preferably, the following components: in the step 1, a Bert pre-training model is adopted to obtain the picture text characteristics, and the specific process is as follows:
step 21: text preprocessing: preprocessing is carried out aiming at nonsense words and symbols in the network words, and words which do not influence the judgment of the text emotional tendency are used as stop words and deleted;
step 22: extracting a word vector sequence of an input text by adopting a pretrained Bert model, inputting the text, segmenting and labeling the input text, inputting the input text by using the word sequence, performing word embedding, segment embedding and position embedding of the Bert model, flowing layer by layer in a stack, finally generating a word vector of text characteristics, and using X to use Lp ={X L1 ,X L2 …X Ln Denotes.
Preferably: the method for extracting the aspect features of the aspect phrases given in the step 1 comprises the following steps:
given aspect phrase a = { a = { [ a ] 1 ,A 2 …A n Using word embedding to obtain word embedding vector a j Then, a bidirectional LSTM model is adopted to learn hidden representation V of each aspect word embedding vector j
Figure BDA0004003502480000051
Then all hidden tokens V are taken j As the final aspect feature vector V A
Figure BDA0004003502480000052
The invention has the following beneficial effects:
(1) The emotion analysis method uses the modality alignment module and the modality update module together with an attention mechanism to carry out cross-modal interaction, thereby improving the accuracy of multi-modal emotion analysis.
(2) The modality update layer comprises a modality alignment module and modality update modules; the modality alignment module aligns the feature sequences of the different modalities, which benefits the interaction between modalities.
(3) The modality update module uses a multi-head self-attention mechanism and a cross-modal attention mechanism to strengthen the interaction between modalities and fully integrates the shared and private features of the different modalities.
(4) In order to preserve the rich features of each modality, the fused multi-modal features are fused again with the initial modality features before emotion classification is carried out.
(5) The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.
Drawings
FIG. 1 is a flow chart of the emotion analysis method of the present invention.
FIG. 2 is a diagram of the emotion analysis method architecture of the present invention.
FIG. 3 is a modal update block diagram of the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
As shown in Figs. 1 to 3, the invention is a cross-modal emotion analysis method based on an attention network. It provides a cross-modal emotion analysis model built on a visual attention network and strengthens the information interaction between the image and text modalities through modality update layers, thereby improving the robustness and accuracy of the model. Specifically, the cross-modal emotion analysis method comprises the following steps.
Step 1: extract the picture features, text features and aspect features of the given aspect phrase corresponding to the input picture-text pair.
The picture features are extracted with a VGG16 network, which consists of 13 convolutional layers, 5 pooling layers and 3 fully connected layers. The convolutional layers obtain the picture feature maps by convolution: the convolution kernel slides over the image representation matrix with a certain stride and is multiplied element-wise with the corresponding positions of the input matrix, giving the feature map of the image under the current convolution kernel. The pooling layers reduce the dimension of the feature maps after convolution and screen local features with max pooling. Finally, the fully connected layers aggregate the features output by the previous layers.
The specific process of extracting the picture features is as follows (a code sketch is given after this list):
Step 11: input: an image pixel matrix of size 224 x 224 x 3;
Step 12: convolution and pooling: the input image pixel matrix goes through 5 rounds of convolution and pooling; each convolution kernel has size 3 x 3 x w, where w denotes the depth of the matrix; after convolution, a number of feature maps are obtained through the ReLU activation function, and max pooling is adopted to screen local features. The convolution is calculated as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes the convolution kernels of different matrix depths;
Step 13: full connection: a 1 x 1000 image feature representation vector is obtained through three fully connected layers;
Step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2, ..., X_Vn}.
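A minimal sketch of this picture-feature extraction with torchvision's pre-trained VGG16 is given below; the ImageNet normalization constants and the use of the final 1 x 1000 output as the picture feature vector X_Vp are standard-practice assumptions rather than values fixed by the patent.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Resize to the 224 x 224 x 3 input expected by VGG16 and apply ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def extract_picture_features(image_path: str) -> torch.Tensor:
    """Return the 1 x 1000 vector produced after the three fully connected layers."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        return vgg16(x)                                                 # (1, 1000), i.e. X_Vp
```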
The text features of the input picture-text pair are obtained with a BERT pre-trained model; the specific process is as follows (a code sketch is given after this list):
Step 21: text preprocessing: meaningless words and symbols in the internet text are preprocessed, and words that do not influence the judgment of the text's emotional tendency are treated as stop words and deleted;
Step 22: a pre-trained BERT model is adopted to extract the word vector sequence of the input text: the input text is segmented and labeled, the resulting word sequence is fed into BERT, the word embedding, segment embedding and position embedding of the BERT model are applied, the representation flows layer by layer through the Transformer stack, and the word vectors of the text features are finally generated and denoted X_Lp = {X_L1, X_L2, ..., X_Ln}.
The aspect features of a given aspect phrase are extracted as follows (a code sketch is given below):
Given an aspect phrase A = {A_1, A_2, ..., A_n}, word embedding is first used to obtain the word embedding vector a_j of each aspect word; a bidirectional LSTM model is then adopted to learn the hidden representation V_j of each aspect word embedding vector, and all hidden representations V_j are combined to form the final aspect feature vector V_A.
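The following is a minimal sketch of the aspect encoder: word embeddings of the aspect phrase are fed to a bidirectional LSTM and the hidden states are pooled into V_A. Mean pooling and the embedding/hidden sizes are assumptions, since the exact pooling formula is not reproduced in this text.

```python
import torch
import torch.nn as nn

class AspectEncoder(nn.Module):
    def __init__(self, vocab_size=30522, emb_dim=300, hidden=150):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)                      # word embedding a_j
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, aspect_ids: torch.Tensor) -> torch.Tensor:
        # aspect_ids: (B, n) token ids of the aspect phrase A = {A_1 ... A_n}.
        a = self.embedding(aspect_ids)
        v, _ = self.bilstm(a)              # hidden representations V_j, shape (B, n, 2 * hidden)
        return v.mean(dim=1)               # aspect feature vector V_A (mean pooling assumed)

encoder = AspectEncoder()
v_a = encoder(torch.randint(0, 30522, (2, 3)))  # (2, 300)
```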
Step 2: the extracted picture and text features enter the stacked modality update layers; each modality update layer comprises a modality alignment module for aligning the representation spaces and two modality update modules; the modalities are aligned in the modality alignment module and then enter the modality update modules, where the correlations between the different modalities are used to supplement each other step by step, finally obtaining the interacted picture features and text features.
Step 2.1: the modality alignment module aims to align the feature spaces of the different modalities before modality interaction. It first maps the single-modality representations of the modalities into the same memory space: the text (or picture) vector of the n-th modality update layer is transformed by an exchange function f(·), parameterized by θ, between the modality vector and the memory space vector Mem_n, producing the aligned modality vector. The specific calculation of the modality alignment module is as follows: the two modality representations are linearly transformed into Q_* with the parameter W_q, and the memory space is linearly transformed into K = Mem_n · W_K with the parameter W_K; the similarity between the modality vectors and the memory space vectors is then computed, and the weight of the j-th memory vector is obtained by normalizing these similarities; the memory space vectors are linearly transformed into V = Mem_n · W_v, where W_v is a learnable parameter; finally, the query vector is obtained from the memory space vectors and the weights, where * ∈ {L, V} indexes the text and image features and V_*j denotes a memory space vector.
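A hypothetical sketch of the modality alignment module is given below: a shared learnable memory Mem_n is attended to by each modality through the projections W_q, W_K and W_v, so that text and picture are re-expressed in the same memory space. Because the exact formulas are only available as images in the original, the scaled-dot-product similarity and the memory size are assumptions that follow the prose.

```python
import torch
import torch.nn as nn

class ModalityAlignment(nn.Module):
    def __init__(self, dim=768, num_slots=64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))   # Mem_n, shared by both modalities
        self.w_q = nn.Linear(dim, dim, bias=False)                 # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)                 # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)                 # W_v

    def align(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim), a text or picture sequence.
        q = self.w_q(x)                          # Q_*: modality representation after linear transform
        k = self.w_k(self.memory)                # K = Mem_n * W_K
        v = self.w_v(self.memory)                # V = Mem_n * W_v
        sim = q @ k.t() / k.shape[-1] ** 0.5     # similarity between modality and memory vectors
        w = torch.softmax(sim, dim=-1)           # weight of the j-th memory vector
        return w @ v                             # query vector: weighted sum of memory values

    def forward(self, x_l, x_v):
        return self.align(x_l), self.align(x_v)  # both modalities expressed in the memory space

align = ModalityAlignment()
a_l, a_v = align(torch.randn(2, 30, 768), torch.randn(2, 49, 768))
```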
Step 2.2: the aligned multi-modal information enters the modality update modules, which gradually enhance each modality. Each modality update layer contains two modality update modules, namely a text update module and a picture update module. In order to make the text and visual features focus more on the information relevant to the given aspect and suppress the less important parts, an aspect-guided attention method is adopted in the first modality update layer. The specific process is as follows: a hidden representation of the target modality is generated from the target modality vector, the aspect feature vector I_A and learnable parameters (a weight matrix and a bias b^(1)); a normalized attention weight is then calculated from this hidden representation; finally, the attention weights are used to carry out a weighted average of the target modality's feature vectors, obtaining the new target modality vector.
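A hypothetical sketch of the first-layer aspect-guided attention follows: a score for each position of the target modality is computed from the aspect vector I_A and learnable parameters, normalized with a softmax, and used for a weighted average. The additive tanh scoring form and the dimensions are assumptions, since the original formula is only shown as an image.

```python
import torch
import torch.nn as nn

class AspectGuidedAttention(nn.Module):
    def __init__(self, dim=768, aspect_dim=300):
        super().__init__()
        self.w_x = nn.Linear(dim, dim, bias=False)       # weight applied to the modality vectors X_*
        self.w_a = nn.Linear(aspect_dim, dim, bias=True)  # weight and bias b^(1) applied to I_A
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x: torch.Tensor, aspect: torch.Tensor):
        # x: (B, L, dim) target modality sequence, aspect: (B, aspect_dim) aspect vector I_A.
        h = torch.tanh(self.w_x(x) + self.w_a(aspect).unsqueeze(1))     # hidden representation
        beta = torch.softmax(self.score(h).squeeze(-1), dim=-1)         # normalized attention weights
        pooled = (beta.unsqueeze(-1) * x).sum(dim=1)                    # weighted average of X_*
        return pooled, beta                                             # new target-modality vector

attn = AspectGuidedAttention()
vec, weights = attn(torch.randn(2, 30, 768), torch.randn(2, 300))
```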
Step 2.3: in order to capture the bidirectional interaction between different modalities and strengthen the inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality. In the corresponding formulas, * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the picture. SA_mul, CMA_mul and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism and the additive attention mechanism, and a normalization function is applied to the attention outputs. In order to better fuse the image and text modalities, the invention uses the additive attention mechanism, in which G, W_c and b_c are learnable parameters; the weight of each modality update module is obtained by dynamic calculation through the additive attention mechanism, achieving information interaction between the two modalities and finally producing the strengthened multi-modal sequences of the text and picture modalities.
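The following is a hypothetical sketch of one modality update module: the target modality is enhanced with multi-head self-attention (SA_mul) and multi-head cross-modal attention (CMA_mul, queries from the target, keys and values from the supplementary modality), and the two branches are combined by additive attention with learnable parameters standing in for G, W_c and b_c. The residual connections, LayerNorm as the normalization function and the branch-stacking layout are assumptions that only mirror the prose.

```python
import torch
import torch.nn as nn

class ModalityUpdateModule(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)    # SA_mul
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # CMA_mul
        self.norm = nn.LayerNorm(dim)                                            # normalization
        self.w_c = nn.Linear(dim, dim)                                           # W_c, b_c
        self.g = nn.Linear(dim, 1, bias=False)                                   # G

    def forward(self, target: torch.Tensor, supplement: torch.Tensor) -> torch.Tensor:
        # target: (B, Lt, dim) modality to enhance; supplement: (B, Ls, dim) other modality.
        sa, _ = self.self_attn(target, target, target)             # enhancement within the modality
        ca, _ = self.cross_attn(target, supplement, supplement)    # enhancement from the other modality
        branches = torch.stack([self.norm(target + sa), self.norm(target + ca)], dim=2)  # (B, Lt, 2, dim)
        scores = self.g(torch.tanh(self.w_c(branches)))            # additive attention scores
        weights = torch.softmax(scores, dim=2)                     # dynamic weight of each branch
        return (weights * branches).sum(dim=2)                     # strengthened multi-modal sequence

update_text = ModalityUpdateModule()
x_l_new = update_text(torch.randn(2, 30, 768), torch.randn(2, 49, 768))  # text enhanced by picture
```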
Further in step 2.3, in order to learn a deep abstract representation of the multi-modal features, a GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer: in the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and the self-attention mechanism are first used to obtain the enhanced multi-modal sequence, and the GRU is then used to obtain the new text and picture features, where SA_mul denotes the multi-head self-attention mechanism, the other input is the target modality vector, and n denotes the layer index.
Step 3: the picture features and text features obtained in step 2 are fused across modalities with a self-attention mechanism: the interacted picture and text sequences form the multi-modal sequence fed to the self-attention mechanism, and FC denotes the multi-modal fusion function that produces the fused multi-modal feature X_mul.
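A minimal sketch of the step 3 fusion follows: the interacted text and picture sequences are concatenated, passed through multi-head self-attention, and reduced by a fully connected layer standing in for FC. Mean pooling of the attended sequence is an assumption.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)                       # FC: the multi-modal fusion function

    def forward(self, x_l: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        seq = torch.cat([x_l, x_v], dim=1)                  # joint multi-modal sequence
        attended, _ = self.self_attn(seq, seq, seq)
        return self.fc(attended.mean(dim=1))                # fused multi-modal feature X_mul

fusion = MultiModalFusion()
x_mul = fusion(torch.randn(2, 30, 768), torch.randn(2, 49, 768))  # (2, 768)
```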
Step 4: perform a concat operation on the picture features and text features from step 1 and the fused multi-modal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
Feature fusion is then performed on this data with a fully connected network, and a softmax classifier in the last layer performs emotion prediction, calculated as:
P = softmax(W_m E + b_m)
where W_m denotes the weight of the fully connected layer, b_m denotes the bias and P denotes the emotion prediction.
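A minimal sketch of step 4 is given below: E_mul = concat(X_mul, X_L, X_V), a fully connected fusion network, and a softmax classifier computing P = softmax(W_m E + b_m). Pooling the sequence features to single vectors before the concat and the three-class output are assumptions.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, dim=768, num_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())  # fully connected fusion network
        self.out = nn.Linear(dim, num_classes)                         # W_m, b_m

    def forward(self, x_mul, x_l, x_v):
        e_mul = torch.cat([x_mul, x_l, x_v], dim=-1)        # E_mul = concat(X_mul, X_L, X_V)
        e = self.fuse(e_mul)
        return torch.softmax(self.out(e), dim=-1)           # emotion prediction P

clf = EmotionClassifier()
p = clf(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))  # (2, 3)
```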
To preserve the richer features between the modalities, the invention uses an L2 penalty in the loss function, where α denotes the hyper-parameter that weights the penalty term.
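The sketch below adds an L2 penalty weighted by the hyper-parameter alpha to the training objective; pairing it with cross-entropy over pre-softmax logits is an assumption, since the full loss formula is only shown as an image in the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_loss(model: nn.Module, logits: torch.Tensor, labels: torch.Tensor, alpha: float = 1e-4):
    """Cross-entropy task loss (assumed) plus an alpha-weighted L2 penalty on the parameters."""
    task = F.cross_entropy(logits, labels)
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    return task + alpha * l2
```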
The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.

Claims (8)

1. A cross-modal emotion analysis method based on an attention network, characterized in that the cross-modal emotion analysis method comprises the following steps:
Step 1: extracting the picture features corresponding to an input picture-text pair, the text features, and the aspect features of a given aspect phrase;
Step 2: the extracted picture and text features enter stacked modality update layers, each modality update layer comprising a modality alignment module and two modality update modules; the modalities are aligned in the modality alignment module and then enter the modality update modules, where the correlations between the different modalities are used to supplement each other step by step, finally obtaining the interacted picture features and text features;
Step 3: performing multi-modal fusion on the interacted picture features and text features obtained in step 2 by adopting a self-attention mechanism to obtain the multi-modal features;
Step 4: performing a concat operation on the picture features and text features from step 1 and the fused multi-modal features from step 3 to perform emotion prediction.
2. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, wherein step 2 specifically comprises the following steps:
Step 2.1: the modality alignment module aligns the feature spaces of the different modalities before modality interaction to obtain aligned multi-modal information;
Step 2.2: the aligned multi-modal information enters the modality update modules, which gradually enhance each modality; each modality update layer comprises two modality update modules, namely a text update module and a picture update module; the first modality update layer adopts an aspect-guided attention method, whose specific process is as follows: a hidden representation of the target modality is generated from the target modality vector, the aspect feature vector I_A and learnable parameters (a weight matrix and a bias b^(1)); a normalized attention weight is then calculated; and the attention weights are used to carry out a weighted average of the target modality's feature vectors, obtaining the new target modality vector;
Step 2.3: in order to capture the bidirectional interaction between different modalities and strengthen the inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality; in the corresponding formulas, * denotes the target modality to be enhanced and α denotes the supplementary modality, and if the target modality is text, the supplementary modality is the picture; SA_mul, CMA_mul and Att respectively denote a multi-head self-attention mechanism, a multi-head cross-modal attention mechanism and an additive attention mechanism, and a normalization function is also applied; the additive attention mechanism is used, in which G, W_c and b_c are learnable parameters, and the weight of each modality update module is obtained by dynamic calculation through the additive attention mechanism, achieving information interaction between the two modalities and finally obtaining the strengthened multi-modal sequences of the text and picture modalities.
3. The attention-network-based cross-modal emotion analysis method of claim 2, wherein in step 2.3, in order to learn a deep abstract representation of the multi-modal features, a GRU is adopted to combine the result of the interactive attention mechanism with the input of the current layer: in the n-th layer, the cross-modal attention mechanism and the self-attention mechanism are used to obtain the enhanced multi-modal sequence, and the GRU is then used to obtain the new text and picture features, where SA_mul denotes the multi-head self-attention mechanism, the other input is the target modality vector, and n denotes the layer index.
4. The attention-network-based cross-modal emotion analysis method of claim 1, wherein in step 3 the picture features and text features obtained in step 2 are subjected to multi-modal fusion by using an attention mechanism, the inputs being the multi-modal sequences obtained in step 2 and FC being the multi-modal fusion function.
5. The attention-network-based cross-modal emotion analysis method of claim 1, wherein step 4 specifically comprises the following steps: performing a concat operation on the text features and picture features from step 1 and the fused multi-modal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
performing feature fusion on the data by using a fully connected network, and performing emotion prediction by using a softmax classifier in the last layer, the emotion prediction being calculated as:
P = softmax(W_m E + b_m)
where W_m denotes the weight of the fully connected layer, b_m denotes the bias and P denotes the emotion prediction.
6. The attention-network-based cross-modal emotion analysis method of claim 1, wherein the aspect features of the aspect phrase given in step 1 are extracted as follows: given an aspect phrase A = {A_1, A_2, ..., A_n}, word embedding is first used to obtain the word embedding vector a_j of each aspect word, a bidirectional LSTM model is then adopted to learn the hidden representation V_j of each aspect word embedding vector, and all hidden representations V_j are combined as the final aspect feature vector V_A.
7. The attention-network-based cross-modal emotion analysis method of claim 1, wherein in step 1 a VGG16 network is adopted to extract the picture features, the VGG16 network being composed of 13 convolutional layers, 5 pooling layers and 3 fully connected layers, and the specific process of extracting the picture features with the VGG16 network being as follows:
Step 11: input: an image pixel matrix of size 224 x 224 x 3;
Step 12: convolution and pooling: the input image pixel matrix goes through 5 rounds of convolution and pooling, each convolution kernel has size 3 x 3 x w, where w denotes the depth of the matrix; after convolution, a number of feature maps are obtained through the ReLU activation function, and max pooling is adopted to screen local features; the convolution is calculated as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes the convolution kernels of different matrix depths;
Step 13: full connection: a 1 x 1000 image feature representation vector is obtained through three fully connected layers;
Step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2, ..., X_Vn}.
8. The attention-network-based cross-modal emotion analysis method of claim 1, wherein in step 1 a BERT pre-trained model is adopted to obtain the text features, the specific process being as follows:
Step 21: text preprocessing: meaningless words and symbols in the internet text are preprocessed, and words that do not influence the judgment of the text's emotional tendency are treated as stop words and deleted;
Step 22: a pre-trained BERT model is adopted to extract the word vector sequence of the input text: the input text is segmented and labeled, the resulting word sequence is fed into BERT, the word embedding, segment embedding and position embedding of the BERT model are applied, the representation flows layer by layer through the stack, and the word vectors of the text features are finally generated and denoted X_Lp = {X_L1, X_L2, ..., X_Ln}.
CN202211623613.1A (filed 2022-12-16): Cross-modal emotion analysis method based on attention network, Pending, published as CN115982652A (en)

Priority Applications (1)

Application Number: CN202211623613.1A
Priority Date / Filing Date: 2022-12-16
Title: Cross-modal emotion analysis method based on attention network (published as CN115982652A)

Publications (1)

Publication Number: CN115982652A
Publication Date: 2023-04-18

Family

ID=85962068

Family Applications (1)

Application Number: CN202211623613.1A (priority date 2022-12-16, filing date 2022-12-16), status Pending
Title: Cross-modal emotion analysis method based on attention network

Country Status (1)

CN: CN115982652A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number: CN116719930A * (priority date 2023-04-28, published 2023-09-08), assignee 西安工程大学: Multi-mode emotion analysis method based on visual attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination