CN115982652A - Cross-modal emotion analysis method based on attention network - Google Patents
- Publication number
- CN115982652A (application number CN202211623613.1A)
- Authority
- CN
- China
- Prior art keywords
- modal
- text
- picture
- modality
- features
- Prior art date
- Legal status
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Processing Or Creating Images (AREA)
Abstract
The invention belongs to the fields of natural language processing, computer vision and emotion analysis, and discloses a cross-modal emotion analysis method based on an attention network, which comprises the following steps. Step 1: extract picture features, picture-text features and aspect features. Step 2: feed the extracted picture and text features into modality update layers; each modality update layer comprises a modality alignment module and two modality update modules, the modalities are first aligned in the modality alignment module, the aligned modalities then enter the modality update modules, and the correlations between different modalities are used to supplement one another step by step, finally yielding the interacted picture features and text features. Step 3: perform multimodal fusion on the interacted picture features and text features with a self-attention mechanism. Step 4: perform a concat operation on the picture features, picture-text features and multimodal features to predict emotion. The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.
Description
Technical Field
The invention relates to the fields of natural language processing, computer vision and emotion analysis, in particular to a cross-modal emotion analysis method based on an attention network.
Background
With the development of social networking platforms and network technologies, the ways in which users express themselves on the internet have become more diverse, and more and more users choose to express their emotions and opinions through videos, pictures or articles. How to analyze the emotional tendency and public opinion orientation contained in such multimodal information has become a challenge in the field of emotion analysis. However, fusing multimodal information is not easy because of the heterogeneity and asynchrony of multimodal data. In terms of heterogeneity, different modalities live in different feature spaces. In terms of asynchrony, the non-uniform sampling rates of the time-series data of different modalities make it impossible to obtain an optimal mapping between modalities. Existing studies on multimodal analysis can be summarized into two categories. The first models the asynchrony of multimodal data with cross-modal attention to provide a soft mapping between modalities; however, such approaches do not account for the heterogeneity of multimodal data. The second considers the heterogeneity of multimodal data and separates each modality into a modality-shared part and a modality-private part represented by different neural networks; the limitation of these approaches is that they do not account for the asynchrony between modalities.
Disclosure of Invention
In order to solve the problems of multimodal heterogeneity and asynchrony, the invention provides a cross-modal emotion analysis method based on an attention network.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a cross-modal emotion analysis method based on an attention network, which specifically comprises the following steps:
step 1: extracting the picture features and picture-text features corresponding to an input picture-text pair, and the aspect features of a given aspect phrase;
step 2: feeding the extracted picture and text features into modality update layers, each of which comprises a modality alignment module for aligning the representation spaces and two modality update modules; the modalities are first aligned in the modality alignment module, the aligned modalities then enter the modality update modules, and the correlations between different modalities are used to supplement one another step by step, finally yielding the interacted picture features and text features;
step 3: performing multimodal fusion on the interacted picture features and text features obtained in step 2 with a self-attention mechanism to obtain multimodal features;
step 4: performing a concat operation on the picture features and picture-text features from step 1 and the fused multimodal features from step 3, and performing emotion prediction.
Preferably, step 2 specifically comprises the following steps:
Step 2.1: the modality alignment module aligns the feature spaces of the different modalities before modality interaction to obtain multimodal information;
Step 2.2: the aligned multimodal information enters the modality update modules so that each modality is gradually enhanced; each modality update layer comprises two modality update modules, namely a text update module and a picture update module. In order to make the text and visual features focus more on the information related to the given aspect and suppress the less important parts, an aspect-guided attention method is adopted at the first modality update layer, and the specific process is as follows:
wherein the output is the hidden representation generated for the target modality, I_A denotes the aspect feature vector, b^(1) denotes a learnable bias, the associated weight matrix is a learnable parameter, and the remaining symbol denotes the modality vector;
a normalized attention weight is then calculated:
and the attention weights are used to compute a weighted average of the target-modality feature vectors, giving a new target modality vector;
Step 2.3: in order to capture the bidirectional interaction among different modes and enhance the interaction among the modes, the mode updating module introduces a cross-mode attention mechanism and a self-attention mechanism and enhances the target modeThe specific process is as follows:
wherein, represents the target modality to be enhanced, α represents the supplementary modality, and if the target modality is text, the supplementary modality is a picture, and the formula is as follows:
wherein, SA mul ,CMA mul And the Att partial table represents a multi-head self-attention mechanism, a multi-head cross-modal attention mechanism, a normalization function and an additive attention mechanism, and in order to better fuse the image and the text modal, the additive attention mechanism is used in the invention and is specifically represented as follows:
wherein G, W c ,b c Representing learnable parameters, and the weight of each mode updating module is obtained by dynamic calculation through an additive attention mechanism, so that the aim of information interaction between two modes is fulfilled, and finally a strong multi-mode sequence is obtainedAnd &>
In order to learn a deep abstract representation of the multimodal features, a GRU is adopted to combine the result of the interactive attention mechanisms with the input of the current layer: at the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and self-attention mechanism are first used to obtain an enhanced multimodal sequence, and the GRU is then used to obtain the new text and picture features. The specific process is as follows:
wherein SA_mul denotes the multi-head self-attention mechanism, the second argument is the target modality vector, and n denotes the layer index.
Preferably, the following components: in the step 3, the picture features and the text features obtained in the step 2 are subjected to multi-modal fusion by using an attention-oriented mechanism, which is specifically represented as follows:
Preferably: the step 4 specifically comprises the following steps: performing concat operation on the text features, the picture features and the fused multi-modal features in the steps 1 and 3 to obtain a representation E containing three features mul As input data:
E mul =concat(X mul ,X L ,X V )
performing feature fusion on the data by using a full-connection network, and performing emotion prediction by using a softmax classifier at the last layer, wherein the emotion prediction calculation formula is as follows:
P=softmax(W m E+b m )
wherein W m Weight representing full connection layer, b m Representing bias and P representing emotion prediction.
Preferably: the specific process of extracting the picture features by adopting the VGG16 network is as follows:
step 11: inputting: inputting 224 x 3 matrix of image pixels;
step 12: convolution pooling: the input image pixel matrix is subjected to 5 rounds of convolution pooling, the size of each convolution kernel is 3 x w, w represents the depth of the matrix, a plurality of feature maps are obtained through an activation function ReLU after convolution, and local features are screened by adopting maximum pooling, wherein the convolution calculation formula is as follows:
f j =R(X i *K j +b)
where R represents the ReLU activation function, b represents the bias term, K j Convolution kernels representing different matrix depths;
step 13: fully connecting: obtaining 1 × 1000 image feature representation vectors through three times of full connection;
step 14: finally, obtaining X for picture feature vector through pre-trained VGG16 network Vp ={X V1 ,X V2 …X Vn Denotes.
Preferably, the following components: in the step 1, a Bert pre-training model is adopted to obtain the picture text characteristics, and the specific process is as follows:
step 21: text preprocessing: preprocessing is carried out aiming at nonsense words and symbols in the network words, and words which do not influence the judgment of the text emotional tendency are used as stop words and deleted;
step 22: extracting a word vector sequence of an input text by adopting a pretrained Bert model, inputting the text, segmenting and labeling the input text, inputting the input text by using the word sequence, performing word embedding, segment embedding and position embedding of the Bert model, flowing layer by layer in a stack, finally generating a word vector of text characteristics, and using X to use Lp ={X L1 ,X L2 …X Ln Denotes.
Preferably: the method for extracting the aspect features of the aspect phrases given in the step 1 comprises the following steps:
given aspect phrase a = { a = { [ a ] 1 ,A 2 …A n Using word embedding to obtain word embedding vector a j Then, a bidirectional LSTM model is adopted to learn hidden representation V of each aspect word embedding vector j :
Then all hidden tokens V are taken j As the final aspect feature vector V A :
The invention has the beneficial effects that:
(1) The emotion analysis method utilizes the modality alignment module and the modality update module and adopts attention mechanisms to carry out cross-modal interaction, thereby improving the accuracy of multimodal emotion analysis.
(2) The modality update layer comprises a modality alignment module and modality update modules; the modality alignment module aligns the feature sequences of different modalities, which facilitates interaction between the modalities.
(3) The modality update module uses a multi-head self-attention mechanism and a cross-modal attention mechanism to enhance the interaction between modalities and fully integrates the shared and private characteristics of the different modalities.
(4) In order to preserve the rich features of each modality, the fused multimodal features are fused again with the initial modality features before emotion classification is carried out.
(5) The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.
Drawings
FIG. 1 is a flow chart of the emotion analysis method of the present invention.
FIG. 2 is a diagram of the emotion analysis method architecture of the present invention.
FIG. 3 is a block diagram of the modality update module of the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the invention. It should be understood, however, that these implementation details are not to be interpreted as limiting the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
As shown in FIGS. 1 to 3, the present invention is a cross-modal emotion analysis method based on an attention network. It provides a cross-modal emotion analysis model based on a visual attention network and enhances the information interaction between the picture and text modalities through modality update layers, thereby improving the robustness and accuracy of the model. Specifically, the cross-modal emotion analysis method comprises the following steps:
Step 1: extracting the picture features and picture-text features corresponding to the input picture-text pair, and the aspect features of the given aspect phrase.
Picture features are extracted with a VGG16 network, which consists of 13 convolutional layers, 5 pooling layers and 3 fully connected layers. The convolutional layers obtain picture feature maps by convolution: the convolution kernel slides over the representation matrix of the image with a certain stride, and the element-wise products with the corresponding input positions are summed to obtain the feature map of the image under the current kernel. The pooling layers reduce the dimension of the convolved feature maps and screen local features with max pooling. Finally, the fully connected layers combine the features output by the preceding layers.
The specific process of extracting the picture features is as follows:
Step 11: input: a 224 × 224 × 3 matrix of image pixels is input;
Step 12: convolution and pooling: the input image pixel matrix passes through 5 rounds of convolution and pooling; each convolution kernel has size 3 × 3 × w, where w denotes the depth of the matrix; after convolution, several feature maps are obtained through the ReLU activation function, and local features are screened with max pooling. The convolution is calculated as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes the convolution kernels of different matrix depths;
Step 13: full connection: a 1 × 1000 image feature representation vector is obtained through three fully connected layers;
Step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2, …, X_Vn}.
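The following is a minimal sketch (not the patent's exact implementation) of this extraction step, assuming torchvision's pre-trained VGG16 and the usual ImageNet normalisation constants as stand-ins for the network described above.

```python
# Hedged sketch: obtain the 1 x 1000 picture feature vector X_Vp with a pre-trained VGG16.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                       # 224 x 224 x 3 pixel matrix (step 11)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def extract_picture_features(image_path: str) -> torch.Tensor:
    """Return the vector produced by the last fully connected layer, i.e. X_Vp."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg16(x)                                  # shape (1, 1000)
```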
The picture-text features are obtained with a pre-trained BERT model, and the specific process is as follows:
Step 21: text preprocessing: meaningless words and symbols in the online text are preprocessed, and words that do not affect the judgment of the text's emotional tendency are treated as stop words and deleted;
Step 22: a pre-trained BERT model is adopted to extract the word vector sequence of the input text: the text is input, segmented and labelled, the word sequence is taken as input, word embedding, segment embedding and position embedding of the BERT model are performed, the representation flows layer by layer through the stack, and the word vectors of the text features are finally generated, denoted X_Lp = {X_L1, X_L2, …, X_Ln}.
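A minimal sketch of this step, assuming the Hugging Face transformers library and the "bert-base-uncased" checkpoint as stand-ins for the pre-trained BERT model referred to in the patent:

```python
# Hedged sketch: extract the contextual word vectors X_Lp of the input text with BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def extract_text_features(text: str) -> torch.Tensor:
    """Return one contextual word vector per token, i.e. X_Lp = {X_L1, ..., X_Ln}."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)   # segment and label the input
    with torch.no_grad():
        outputs = bert(**inputs)          # word/segment/position embeddings flow through the stack
    return outputs.last_hidden_state.squeeze(0)                      # (sequence length, 768)
```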
The aspect features of the given aspect phrase are extracted as follows:
Given an aspect phrase A = {A_1, A_2, …, A_n}, word embedding is first used to obtain the word embedding vectors a_j, and a bidirectional LSTM model is then adopted to learn a hidden representation V_j for each aspect word embedding vector:
All hidden representations V_j are then taken as the final aspect feature vector V_A:
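A minimal sketch of the aspect encoder under these definitions; averaging the hidden states into a single vector V_A is an assumption, since the patent only states that all V_j form the final feature. The embedding and hidden sizes are illustrative.

```python
# Hedged sketch: embed each aspect word A_j, run a BiLSTM, and pool the hidden states into V_A.
import torch
import torch.nn as nn

class AspectEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, aspect_ids: torch.Tensor) -> torch.Tensor:
        # aspect_ids: (batch, number of aspect words)
        a = self.embedding(aspect_ids)      # word embedding vectors a_j
        v, _ = self.bilstm(a)               # hidden representations V_j, (batch, words, 2 * hidden_dim)
        return v.mean(dim=1)                # pooled aspect feature vector V_A (assumed pooling)
```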
Step 2: and the extracted picture text features enter modality updating layers, each modality updating layer comprises a modality alignment module for aligning the representation space and two modality updating modules, each modality is aligned in the modality alignment module, the images enter the modality updating modules after being aligned, and the image features and the text features after interaction are finally obtained by utilizing the correlation of different modalities to supplement step by step.
Step 2.1: the modality alignment module aims to align feature spaces of different modalities before modality interaction, and firstly maps single modality representations of a plurality of modalities to the same storage space, wherein the single modality representations are specifically represented as follows:
whereinRepresenting a text vector, mem n Represents a storage space vector, theta represents a parameter, <' > v>Representing the aligned modal vector, f (·) representing the exchange function of the modal vector and the storage space vector, n representing the n-th layer modal updating layer, and the specific calculation process of the modal alignment module is as follows:
K=Mem n ·W K
wherein W q And W K Parameters representing linear transformations, Q * Representing the vector representation after linear transformation of two modes, wherein K represents the size of a storage space, and the similarity calculation formula of the mode vector and the storage space vector is as follows:
the weight of the jth memory vector is represented as:
the storage space vector is represented as follows after linear transformation:
V=Mem n ·W v
W v representing learnable parameters, and obtaining a query vector through memory space vector and weight calculation:
wherein: * E { L, V } represents image and text features, V *j Representing a memory space vector.
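The following is a minimal sketch of this alignment module under the definitions above: each unimodal sequence attends over a shared memory Mem^n that plays the role of the common representation space. The number of memory slots and the scaled dot-product form of the similarity are assumptions where the text is silent.

```python
# Hedged sketch: attention of a unimodal sequence over a shared, learnable memory space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAlignment(nn.Module):
    def __init__(self, dim: int, memory_slots: int = 32):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))   # Mem^n
        self.w_q = nn.Linear(dim, dim, bias=False)                   # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)                   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)                   # W_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) unimodal features, * in {L, V}
        q = self.w_q(x)                                   # Q_*
        k = self.w_k(self.memory)                         # K = Mem^n . W_K
        v = self.w_v(self.memory)                         # V = Mem^n . W_v
        scores = q @ k.t() / k.size(-1) ** 0.5            # similarity of modality and memory vectors
        weights = F.softmax(scores, dim=-1)               # weight of each memory vector
        return weights @ v                                # aligned (query) modality representation
```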
Step 2.2: the aligned multi-modal information enters a modal updating module to gradually enhance each modal, and each modal updating layer comprises two modal updating modulesAnd &>Namely a text update module and a picture update module, in order to make text and visual features focus more on the information part of a given aspect and suppress the less important parts, an aspect-guided attention method is adopted at the first layer of the modality update layer, and the specific process is as follows:
whereinHidden representation of the generated target modality, I A Representative facet feature vector, b (1) Represents a learnable parameter, <' > based on>Represents a variable parameter, is selected>Representing a modality vector;
calculating a normalized attention weight:
using attention weightsCarrying out weighted average on the characteristic vectors of the target modes to obtain a new target mode vector->
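A minimal sketch of this aspect-guided attention: every position of the target modality is scored against the aspect vector I_A, the scores are softmax-normalised, and the features are averaged with those weights. The additive (tanh) scoring form is an assumption about the formula not shown here.

```python
# Hedged sketch: aspect-guided attention over the target modality sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectGuidedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_m = nn.Linear(dim, dim)     # projection of the modality vectors (learnable)
        self.w_a = nn.Linear(dim, dim)     # projection of the aspect vector (learnable, bias plays the role of b^(1))
        self.score = nn.Linear(dim, 1)     # scoring vector

    def forward(self, modality: torch.Tensor, aspect: torch.Tensor) -> torch.Tensor:
        # modality: (seq_len, dim), aspect: (dim,) i.e. I_A
        hidden = torch.tanh(self.w_m(modality) + self.w_a(aspect))    # hidden representation of the target modality
        weights = F.softmax(self.score(hidden).squeeze(-1), dim=0)    # normalised attention weights
        return weights @ modality                                     # weighted-average target modality vector
```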
Step 2.3: in order to capture the bidirectional interaction among different modes and enhance the interaction among the modes, the mode updating module introduces a cross-mode attention mechanism and a self-attention mechanism and enhances the target modeThe specific process is as follows:
wherein, represents the target modality to be enhanced, α represents the supplementary modality, and if the target modality is text, the supplementary modality is a picture, and the formula is as follows:
wherein, SA mul ,CMA mul And the Att partial table represents a multi-head self-attention mechanism, a multi-head cross-modal attention mechanism, a normalization function and an additive attention mechanism, and in order to better fuse the image and the text modal, the additive attention mechanism is used in the invention and is specifically represented as follows:
wherein G, W c ,b c Representing learnable parameters, and the weight of each mode updating module is obtained by dynamic calculation through an additive attention mechanism, so that the aim of information interaction between two modes is fulfilled, and finally a strong multi-mode sequence is obtainedAnd &>
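A minimal sketch of one modality update module under these definitions: the target modality is enhanced both by multi-head self-attention on itself (SA_mul) and by multi-head cross-modal attention against the supplementary modality (CMA_mul), and the two normalised branches are mixed with an additive-attention gate. The head count and the exact gating form are illustrative assumptions.

```python
# Hedged sketch: enhance a target modality with self-attention and cross-modal attention,
# then weight the two branches dynamically with an additive-attention gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityUpdate(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)    # SA_mul
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # CMA_mul
        self.norm = nn.LayerNorm(dim)                                           # normalization function
        self.gate = nn.Linear(dim, 1)                                           # additive attention Att

    def forward(self, target: torch.Tensor, supplement: torch.Tensor) -> torch.Tensor:
        # target, supplement: (batch, seq_len, dim); if target is text, supplement is the picture
        sa, _ = self.self_attn(target, target, target)              # intra-modal enhancement
        ca, _ = self.cross_attn(target, supplement, supplement)     # inter-modal enhancement
        branches = torch.stack([self.norm(sa), self.norm(ca)], dim=2)   # (batch, seq_len, 2, dim)
        weights = F.softmax(self.gate(branches), dim=2)                 # dynamic weight of each branch
        return (weights * branches).sum(dim=2)                          # enhanced multimodal sequence
```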
Also in step 2.3, in order to learn a deep abstract representation of the multimodal features, a GRU is adopted to combine the result of the interactive attention mechanisms with the input of the current layer: at the n-th layer (excluding the first layer, which uses the aspect-guided attention mechanism), the cross-modal attention mechanism and self-attention mechanism first produce an enhanced multimodal sequence, and the GRU is then used to obtain the new text and picture features. The specific process is as follows:
wherein SA_mul denotes the multi-head self-attention mechanism, the second argument is the target modality vector, and n denotes the layer index.
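A minimal sketch of this GRU combination at layer n (n > 1). Feeding the attention-enhanced sequence as the GRU input and the current layer's input as its previous hidden state is one plausible reading of "combining" the two; the exact wiring is not spelled out in the text.

```python
# Hedged sketch: combine the attention-enhanced sequence with the current layer's input via a GRU cell.
import torch
import torch.nn as nn

class GRUCombine(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, enhanced: torch.Tensor, layer_input: torch.Tensor) -> torch.Tensor:
        # enhanced, layer_input: (seq_len, dim); positions are combined independently
        return self.cell(enhanced, layer_input)     # new text / picture features for the next layer
```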
And step 3: performing multi-mode fusion on the picture features and the text features obtained in the step 2 by adopting an attention-machine mechanism, and specifically representing as follows:
And 4, step 4: performing concat operation on the picture characteristics and the picture text characteristics in the step 1 and the fused multi-modal characteristics in the step 3 to obtain a representation E comprising three characteristics mul As input data:
E mul =concat(X mul ,X L ,X V )
performing feature fusion on the data by using a full-connection network, and performing emotion prediction by using a softmax classifier at the last layer, wherein the emotion prediction calculation formula is as follows:
P=spftmax(W m E+b m )
wherein W m Weight representing full connection layer, b m Representing bias and P representing emotion prediction.
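A minimal sketch of steps 3 and 4 together: self-attention fusion of the interacted text and picture features, concatenation with pooled unimodal features, a fully connected layer, and a softmax emotion classifier. The mean pooling, head count and number of emotion classes (three polarities) are assumptions made to keep the example self-contained.

```python
# Hedged sketch: self-attention fusion followed by concat and a softmax emotion classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    def __init__(self, dim: int, num_classes: int = 3, heads: int = 4):
        super().__init__()
        self.fuse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # self-attention fusion
        self.fc = nn.Linear(3 * dim, num_classes)                              # W_m, b_m

    def forward(self, x_text: torch.Tensor, x_pic: torch.Tensor) -> torch.Tensor:
        # x_text, x_pic: (batch, seq_len, dim) interacted features from step 2
        joint = torch.cat([x_text, x_pic], dim=1)
        x_mul, _ = self.fuse_attn(joint, joint, joint)                # multimodal features X_mul
        e_mul = torch.cat([x_mul.mean(1), x_text.mean(1), x_pic.mean(1)],
                          dim=-1)                                     # E_mul = concat(X_mul, X_L, X_V)
        return F.softmax(self.fc(e_mul), dim=-1)                      # P = softmax(W_m E_mul + b_m)
```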
To preserve the richer features between the modalities, the present invention adds an L2 penalty to the loss function, as follows:
where α denotes a hyper-parameter.
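A minimal sketch of the training objective, under the assumption that the L2 term is a weight penalty added to a standard cross-entropy classification loss with coefficient α; the patent does not show the exact form of its loss.

```python
# Hedged sketch: cross-entropy emotion loss plus an alpha-weighted L2 penalty (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

def total_loss(logits: torch.Tensor, labels: torch.Tensor,
               model: nn.Module, alpha: float = 1e-4) -> torch.Tensor:
    ce = F.cross_entropy(logits, labels)                      # emotion classification loss
    l2 = sum(p.pow(2).sum() for p in model.parameters())      # L2 penalty over the parameters
    return ce + alpha * l2
```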
The method makes full use of cross-modal information interaction and helps improve the accuracy of emotion prediction.
Claims (8)
1. A cross-modal emotion analysis method based on an attention network, characterized by comprising the following steps:
step 1: extracting the picture features and picture-text features corresponding to an input picture-text pair, and the aspect features of a given aspect phrase;
step 2: feeding the extracted picture and text features into modality update layers, each of which comprises a modality alignment module and two modality update modules; the modalities are aligned in the modality alignment module, the aligned modalities enter the modality update modules, and the correlations between different modalities are used to supplement one another step by step, finally obtaining the interacted picture features and text features;
step 3: performing multimodal fusion on the interacted picture features and text features obtained in step 2 with a self-attention mechanism to obtain multimodal features;
step 4: performing a concat operation on the picture features and picture-text features from step 1 and the fused multimodal features from step 3, and performing emotion prediction.
2. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, characterized in that step 2 specifically comprises the following steps:
step 2.1: the modality alignment module aligns the feature spaces of different modalities before modality interaction to obtain multimodal information;
step 2.2: the aligned multimodal information enters the modality update modules so that each modality is gradually enhanced; each modality update layer comprises two modality update modules, namely a text update module and a picture update module; an aspect-guided attention method is adopted at the first modality update layer, and the specific process is as follows:
wherein the output is the hidden representation generated for the target modality, I_A denotes the aspect feature vector, b^(1) denotes a learnable bias, the associated weight matrix is a learnable parameter, and the remaining symbol denotes the modality vector;
a normalized attention weight is then calculated:
and the attention weights are used to compute a weighted average of the target-modality feature vectors, giving a new target modality vector;
step 2.3: in order to capture the bidirectional interaction between different modalities and enhance inter-modality interaction, the modality update module introduces a cross-modal attention mechanism and a self-attention mechanism to enhance the target modality, and the specific process is as follows:
wherein * denotes the target modality to be enhanced and α denotes the supplementary modality; if the target modality is text, the supplementary modality is the picture, and the formula is as follows:
wherein SA_mul, CMA_mul and Att respectively denote the multi-head self-attention mechanism, the multi-head cross-modal attention mechanism and the additive attention mechanism, and the remaining symbol denotes the normalization function; the additive attention mechanism is specifically expressed as follows:
wherein G, W_c and b_c denote learnable parameters; the weight of each modality update module is computed dynamically by the additive attention mechanism, achieving information interaction between the two modalities and finally yielding the enhanced multimodal sequences for the text and picture modalities.
3. The cross-modal emotion analysis method based on an attention network as claimed in claim 2, characterized in that in step 2.3, in order to learn a deep abstract representation of the multimodal features, a GRU is adopted to combine the result of the interactive attention mechanisms with the input of the current layer: at the n-th layer, the cross-modal attention mechanism and self-attention mechanism are used to obtain an enhanced multimodal sequence, and the GRU is then used to obtain the new text and picture features, specifically as follows:
4. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, characterized in that in step 3 the picture features and text features obtained in step 2 are fused across modalities using an attention mechanism, specifically expressed as follows:
5. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, characterized in that step 4 specifically comprises: performing a concat operation on the text features and picture features from step 1 and the fused multimodal features from step 3 to obtain a representation E_mul containing the three kinds of features as input data:
E_mul = concat(X_mul, X_L, X_V)
performing feature fusion on these data with a fully connected network, and performing emotion prediction with a softmax classifier in the last layer, the emotion prediction being calculated as:
P = softmax(W_m E_mul + b_m)
wherein W_m denotes the weight of the fully connected layer, b_m denotes the bias, and P denotes the emotion prediction.
6. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, characterized in that the aspect features of the aspect phrase given in step 1 are extracted as follows:
given an aspect phrase A = {A_1, A_2, …, A_n}, word embedding is first used to obtain the word embedding vectors a_j, and a bidirectional LSTM model is then adopted to learn a hidden representation V_j for each aspect word embedding vector:
all hidden representations V_j are then taken as the final aspect feature vector V_A:
7. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, characterized in that in step 1 a VGG16 network is adopted to extract the picture features, the VGG16 network consisting of 13 convolutional layers, 5 pooling layers and 3 fully connected layers, and the specific process of extracting the picture features with the VGG16 network is as follows:
step 11: input: a 224 × 224 × 3 matrix of image pixels is input;
step 12: convolution and pooling: the input image pixel matrix passes through 5 rounds of convolution and pooling; each convolution kernel has size 3 × 3 × w, where w denotes the depth of the matrix; after convolution, several feature maps are obtained through the ReLU activation function, and local features are screened with max pooling, the convolution being calculated as:
f_j = R(X_i * K_j + b)
where R denotes the ReLU activation function, * denotes the convolution operation, b denotes the bias term, and K_j denotes the convolution kernels of different matrix depths;
step 13: full connection: a 1 × 1000 image feature representation vector is obtained through three fully connected layers;
step 14: finally, the picture feature vector obtained through the pre-trained VGG16 network is denoted X_Vp = {X_V1, X_V2, …, X_Vn}.
8. The cross-modal emotion analysis method based on an attention network as claimed in claim 1, characterized in that in step 1 a pre-trained BERT model is adopted to obtain the picture-text features, and the specific process is as follows:
step 21: text preprocessing: meaningless words and symbols in the online text are preprocessed, and words that do not affect the judgment of the text's emotional tendency are treated as stop words and deleted;
step 22: a pre-trained BERT model is adopted to extract the word vector sequence of the input text: the text is input, segmented and labelled, the word sequence is taken as input, word embedding, segment embedding and position embedding of the BERT model are performed, the representation flows layer by layer through the stack, and the word vectors of the text features are finally generated, denoted X_Lp = {X_L1, X_L2, …, X_Ln}.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211623613.1A (CN115982652A) | 2022-12-16 | 2022-12-16 | Cross-modal emotion analysis method based on attention network |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211623613.1A (CN115982652A) | 2022-12-16 | 2022-12-16 | Cross-modal emotion analysis method based on attention network |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115982652A | 2023-04-18 |
Family
ID=85962068
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211623613.1A (Pending) | Cross-modal emotion analysis method based on attention network | 2022-12-16 | 2022-12-16 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN115982652A (en) |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116719930A | 2023-04-28 | 2023-09-08 | 西安工程大学 (Xi'an Polytechnic University) | Multi-mode emotion analysis method based on visual attention |
Similar Documents

| Publication | Title |
|---|---|
| Chatterjee et al. | Diverse and coherent paragraph generation from images |
| CN112818861B | Emotion classification method and system based on multi-modal context semantic features |
| Zhang et al. | Rich visual knowledge-based augmentation network for visual question answering |
| CN112860888B | Attention mechanism-based bimodal emotion analysis method |
| CN113065577A | Multi-modal emotion classification method for targets |
| CN110852368A | Global and local feature embedding and image-text fusion emotion analysis method and system |
| CN109992686A | Image-text retrieval system and method based on multi-angle self-attention mechanism |
| CN112651940B | Collaborative visual saliency detection method based on dual-encoder generative adversarial network |
| CN117421591A | Multi-modal representation learning method based on text-guided image block screening |
| CN117033609B | Text visual question-answering method, device, computer equipment and storage medium |
| CN113297370A | End-to-end multi-modal question-answering method and system based on multi-interaction attention |
| CN114662497A | False news detection method based on cooperative neural network |
| CN116975350A | Image-text retrieval method, device, equipment and storage medium |
| CN114419509A | Multi-modal emotion analysis method and device and electronic equipment |
| CN116933051A | Multi-modal emotion recognition method and system for modality-missing scenarios |
| Cheng et al. | Stack-VS: Stacked visual-semantic attention for image caption generation |
| CN109766918A | Salient object detection method based on multi-level contextual information fusion |
| Pande et al. | Development and deployment of a generative model-based framework for text to photorealistic image generation |
| CN118038139A | Multi-modal few-shot image classification method based on large model fine-tuning |
| CN116935170A | Processing method and device of video processing model, computer equipment and storage medium |
| CN115982652A | Cross-modal emotion analysis method based on attention network |
| Tong et al. | ReverseGAN: An intelligent reverse generative adversarial networks system for complex image captioning generation |
| Long et al. | Cross-domain personalized image captioning |
| CN117150069A | Cross-modal retrieval method and system based on global and local semantic contrastive learning |
| CN115758159B | Zero-shot text stance detection method based on hybrid contrastive learning and generative data augmentation |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |