CN115408517A - Knowledge injection-based multi-modal irony recognition method of double-attention network - Google Patents

Knowledge injection-based multi-modal irony recognition method of double-attention network

Info

Publication number
CN115408517A
Authority
CN
China
Prior art keywords
representation
original
context
text
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210863424.5A
Other languages
Chinese (zh)
Inventor
亢良伊
刘杰
叶丹
周志阳
李硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202210863424.5A priority Critical patent/CN115408517A/en
Publication of CN115408517A publication Critical patent/CN115408517A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a knowledge-injection-based multimodal sarcasm recognition method using a dual-attention network, which comprises the following steps: acquiring data content to be identified, wherein the data content to be identified comprises a plurality of text and picture pairs; encoding the words in the text and the objects in the picture to obtain original representations; expanding the original representations based on the implicit context information of the data content to be identified to obtain context-aware representations; obtaining attention calculation results for the original representations and the context-aware representations; calculating an original cross-modal comparison representation and a context-aware cross-modal comparison representation according to the attention calculation results; and calculating a sarcasm recognition result based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation. The invention helps to improve the overall performance of sarcasm recognition, facilitates practical application of the model, and provides interpretability for the prediction results.

Description

Knowledge injection-based multi-modal irony recognition method of double-attention network
Technical Field
The invention relates to a knowledge injection-based multi-modal sarcasm recognition method of a double-attention network, and belongs to the technical field of multi-modal information recognition.
Background
Multimodal irony aims to implicitly express strong emotion through the contrast between the text and the figurative scene depicted in an accompanying picture. Currently, irony based on text and pictures is ubiquitous on social platforms such as microblogs and Twitter. Because irony reverses the polarity of the emotion or opinion expressed in the text, automatic detection of multimodal irony is of great importance for customer service, opinion mining, and other tasks that require understanding a person's true emotion.
Real-world multimodal irony detection is quite complex. The semantics of user-generated content are affected not only by the explicit content but also by the implicit context. Explicit content refers to the scene content observable in the input text or picture; implicit context refers to inferred knowledge about the scene that is not visible in the input, including how the scene develops and the intent of the people in it. Irony recognition requires accurately locating the irony-describing portions of the multimodal information and discriminating their semantic differences on the basis of a complete semantic representation of the text and the picture. However, existing multimodal irony detection methods learn features only from the input text and pictures and ignore the implicit context behind the content. Moreover, they model semantic differences across modalities on the raw, full content of the text and the picture, which easily introduces noise, reduces the accuracy of irony recognition, and hinders practical application of the models. How to inject implicit context information into the multimodal input to obtain better feature representations, and how to accurately locate the irony-describing regions on this basis for semantic-difference recognition, are urgent problems to be solved.
Disclosure of Invention
In order to solve the above problems, the knowledge-injection-based multimodal sarcasm recognition method of a dual-attention network provided by the invention uses a knowledge-enhanced multi-dimensional attention module to inject implicit context knowledge into the multimodal input representation, and divides the implicit context knowledge into two views, the scene state and the emotional state, according to the human reasoning mode, so as to construct a complete semantic representation of the multimodal information. At the same time, the picture and text attention modules are executed cooperatively by a dual-attention network based on a joint memory vector that gathers previous attention results, capturing the sarcasm-related shared semantics in the text and the picture. Finally, based on the joint embedding space, a multi-dimensional cross-modal matching layer is adopted to distinguish the differences between the modalities from multiple dimensions. This helps to improve the overall performance of sarcasm recognition, provides interpretability for the prediction results, and facilitates practical application of the model.
The technical content of the invention comprises:
A knowledge-injection-based multimodal irony recognition method of a dual-attention network, the method comprising:
acquiring data content to be identified, wherein the data content to be identified comprises a number of text-picture pairs, the text containing a number of words i and the picture relating to a plurality of objects j;
respectively encoding the words i in the text and the objects j in the picture to obtain original representations u_i of the words and original representations v_j of the objects;
expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain word context-aware representations and object context-aware representations;
performing attention calculation, using a dual-attention network, on the original representations u_i and v_j and on the word and object context-aware representations, respectively, to obtain attention calculation results for the original representation and the context-aware representation;
obtaining an original cross-modal comparison representation and a context-aware cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation results of the original representation and the context-aware representation;
calculating an irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation.
Further, encoding the objects j in the picture to obtain the original representations v_j of the objects comprises:
for each picture, detecting the region of each object j from the picture by using a pre-trained object detector, and taking the pooled feature before the multi-class classification layer as the visual feature representation r_j of the object j;
projecting the visual feature representation r_j into the space of the text representation;
obtaining a text representation t_j specific to the object j by calculating the relevance of each word i in the text to the object j;
calculating a representation g_j reflecting the text relevance of the object j based on the text representation t_j and the visual feature representation r_j;
inputting the object sequence formed from the visual feature representations r_j, weighted by the representations g_j, into a bidirectional gated recurrent neural network to obtain the original representation v_j of each object j.
Further, expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain the word context-aware representations and the object context-aware representations, comprises:
generating different types of inference knowledge {w_1, w_2, …, w_L} for the event description in each picture or text, and calculating its common-sense inference representation H^{M,R}, wherein w_l denotes a word in the inference knowledge, 1 ≤ l ≤ L, L denotes the length of the inference knowledge, the relation type R ∈ {before, after, intent}, before denoting the pre-event relation type, after denoting the post-event relation type, and intent denoting the person-intention relation type in the scene, and the modality M denotes the text modality or the picture modality;
calculating a correlation matrix C^M between the data content to be identified and the inference knowledge, based on the text feature map H^T composed of the original representations u_i of the words, the picture feature map H^I composed of the original representations v_j of the objects, and the common-sense inference representation H^{M,R};
obtaining, based on the correlation matrix C^M, representations e_i^R of the words with implicit context information and representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j;
learning a correlation weight for each piece of inference knowledge and computing enhanced representations k_i^R and k_j^R from the representations e_i^R and e_j^R;
computing the word context-aware representations and the object context-aware representations based on the enhanced representations k_i^R and k_j^R, wherein the word context-aware vector representation comprises a scene-state context-aware representation u_i^{sc} and an emotional-state context-aware representation u_i^{em}, and the object context-aware vector representation comprises a scene-state context-aware representation v_j^{sc} and an emotional-state context-aware representation v_j^{em}.
Further, obtaining the representations e_i^R of the words with implicit context information and the representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j based on the correlation matrix C^M comprises:
computing, based on the correlation matrix C^M and the common-sense inference representation H^{M,R}, a word-level representation a_i^R and an object-level representation a_j^R of the inference knowledge by using an attention mechanism;
adding the word-level representation a_i^R to the original representation u_i of the word and the object-level representation a_j^R to the original representation v_j of the object, respectively, to obtain the representations e_i^R and e_j^R.
Further, computing the word context-aware representations based on the enhanced representations k_i^R comprises:
dividing the enhanced representations k_i^R into an enhanced representation k_i^{before} of the pre-event relation type, an enhanced representation k_i^{after} of the post-event relation type, and an enhanced representation k_i^{intent} of the intent relation type;
computing the scene-state context-aware representation u_i^{sc} of the word according to the enhanced representation k_i^{before}, the original representation u_i of the word, and the enhanced representation k_i^{after};
obtaining the emotional-state context-aware representation u_i^{em} of the word according to the enhanced representation k_i^{intent}.
Further, performing attention calculation on the word original representations u_i and the object original representations v_j using the dual-attention network to obtain the attention calculation result of the original representation comprises:
summing the original representations u_i of the words to obtain a complete text representation u^(0);
summing the original representations v_j of the objects in the picture to obtain a complete picture representation v^(0);
calculating a joint memory vector m^(0) from the representation u^(0) and the representation v^(0);
performing iterative calculation in which the joint memory vector is updated with the element-wise product u^(k) ⊙ v^(k), and obtaining the joint memory vector m^(K) after the iteration ends, where K denotes the total number of iterations, u^(k) denotes the complete text representation at the k-th iteration, v^(k) denotes the complete picture representation at the k-th iteration, and ⊙ is the element-wise product;
performing attention calculation with the dual-attention network based on the joint memory vector m^(K) to obtain the complete text representation u^(K+1) and the complete picture representation v^(K+1);
taking the complete text representation u^(K+1) and the complete picture representation v^(K+1) as the attention calculation result of the original representation.
Further, computing the complete text representation u^(K+1) based on the joint memory vector m^(K) comprises:
computing the output s_i of a feedforward neural network in the dual-attention network from the joint memory vector m^(K) and the word original representation u_i;
substituting the output s_i into a softmax function to obtain the attention weight α_i;
performing a weighted summation of the word original representations u_i with the attention weights α_i to obtain the complete text representation u^(K+1).
Further, obtaining the original cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation result of the original representation comprises:
computing the original cross-modal comparison representation z_raw = W_raw [u^(K+1) ⊙ v^(K+1) ; |u^(K+1) - v^(K+1)|], wherein W_raw denotes a trainable weight matrix, ⊙ is the element-wise product, | · | is the absolute value of the element difference, and ; denotes the concatenation operation.
Further, calculating the irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation comprises:
concatenating the original cross-modal comparison representation and the context-aware cross-modal comparison representation;
inputting the concatenation result into a fully connected layer and performing binary irony classification with a Sigmoid function to obtain the irony recognition result of the data content to be identified.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program that is loaded and executed by the processor to implement any of the methods described above.
Compared with the prior art, the invention has the advantages that:
(1) The method uses VisualCOMET to provide implicit context information, namely scene-state context and emotional-state context, for the text and picture modality information, injects the implicit context into the multimodal input through a knowledge-enhanced multi-dimensional attention module, and generates context-aware text and picture representations, so that a complete semantic context is constructed for the multimodal information.
(2) The designed dual-attention network can accurately locate the ironic regions in the multimodal information by applying the text and picture attention over multiple iterations. Meanwhile, the dual attention is applied separately to the original representation and to the context-aware representations, capturing multimodal information representations from multiple angles. Based on the multi-dimensional shared representation space, a multi-dimensional cross-modal module is adopted to distinguish the semantic differences between the text and the picture, so that irony is accurately identified.
(3) Compared with existing methods, the method achieves higher performance, and the dual-attention module, combined with the injected knowledge, can provide interpretability for the prediction results.
Drawings
FIG. 1 is a system model flow diagram of the present invention.
FIG. 2 is a system model architecture diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely specific embodiments of the present invention, rather than all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The technical problem addressed by the invention is as follows: aiming at the problem of multimodal sarcasm recognition, a knowledge-injection-based dual-attention-network method is provided. On the one hand, a knowledge-enhanced multi-dimensional attention model is adopted to construct a complete semantic representation of the multimodal information; on the other hand, a dual-attention mechanism maintains a shared vector, the sarcasm-related shared semantics in the multimodal information are extracted through the text and picture attention modules, and the differences of the multimodal scene containing the sarcastic context are modeled through a multi-dimensional cross-modal matching layer, so that the overall performance of sarcasm recognition is improved and the prediction results are given a certain interpretability.
The technical scheme of the invention is as follows: a knowledge-injection-based dual-attention-network multimodal irony recognition method injects implicit context knowledge into the multimodal representation to construct the complete semantics of the multimodal information, and uses a dual-attention network to capture the irony-describing regions in the multimodal representation for multi-dimensional semantic comparison. In the model, the input text and pictures are first encoded into vector representations, and the objects in the text and the picture are aligned with an attention mechanism so as to filter irrelevant information in the picture. Then, to supplement the implicit context information lacking in the text and the picture, an event knowledge graph is used to generate a scene context and an emotional context for the text and the picture, and the acquired knowledge is injected into the multimodal input through a knowledge-enhanced multi-dimensional attention module to construct a complete semantic encoding of the multimodal information. To attend to the irony regions in the text and the picture, a dual-attention module that executes the text and picture attention cooperatively is proposed, which captures the shared semantics across modalities in both the original encoding and the complete semantic encoding of the multimodal information by maintaining a joint memory vector. Based on the joint embedding space, multi-dimensional cross-modal matching is adopted to distinguish the multimodal differences in multiple dimensions. Finally, the multiple comparison results are concatenated and input to a classifier for multimodal irony detection.
Fig. 1 is a flow chart of the system model of the present invention. As shown in Fig. 1, the system of the present invention comprises an input encoding module, a knowledge injection module, a dual-attention interaction module, a multi-dimensional cross-modal matching module, and a classification prediction module.
First, in the encoding module, for word preprocessing, text sentences are tokenized with the NLTK toolkit, and the resulting words are embedded using 200-dimensional vectors generated by the GloVe algorithm as initialization; for picture preprocessing, picture objects and their features are extracted with Faster R-CNN and refined in the training stage. The hidden size of the bi-GRU is 512 dimensions.
In the knowledge injection module, for event inference knowledge, inference knowledge of the three relation types "before", "after", and "intent" is generated for the text and the picture with the VisualCOMET event inference generator. For example, the input text is "this is why I want to be mom" and the picture depicts "a woman standing in the room with a broom". The three types of relation inferences generated by VisualCOMET for the text are as follows: before: [be in a family room, put on her school uniform, …]; after: [play scales with friends, tell the wrong, …]; intent: [stay at home cozy, make her mom happy, …]. The three types of relation inferences generated for the picture are as follows: before: [put on an apron, be from a school, …]; after: [clean the room, finish her housework, …]; intent: [cleaning the pole with a broom, play with friends, …]. In practical application, 15 inference-knowledge candidates are kept for the text and the picture on each relation type. The knowledge injection module is then used to obtain a multimodal information representation enhanced with implicit context knowledge, constructing the complete semantic information of the text and the picture.
In the dual-attention interaction module, the text and picture attention mechanisms obtain representations of the text and the picture focused on the irony-describing regions through 3 interaction iterations.
In the multi-dimensional cross-modal matching module, the text and the picture are compared on the original representation, the scene-state context-aware representation, and the emotional-state context-aware representation.
In the classification module, the multimodal comparison results of the three dimensions are concatenated and input into an output layer consisting of a two-layer fully connected network and Softmax for irony classification. In training, the batch size is 32, the learning rate is 0.0005, and Adam is used as the optimizer.
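For illustration only and not as part of the original disclosure, the following Python sketch gathers the training and preprocessing hyperparameters stated in this embodiment into one configuration object; all class, field, and function names are assumptions introduced for the example.

```python
from dataclasses import dataclass

import torch


@dataclass
class TrainConfig:
    # Values stated in this embodiment; the field names are illustrative.
    word_embed_dim: int = 200          # GloVe word-vector dimension
    gru_hidden_size: int = 512         # bi-GRU hidden size
    knowledge_candidates: int = 15     # VisualCOMET candidates kept per relation type
    dual_attention_iters: int = 3      # text/picture attention interaction rounds
    batch_size: int = 32
    learning_rate: float = 0.0005


def build_optimizer(model: torch.nn.Module, cfg: TrainConfig) -> torch.optim.Optimizer:
    # Adam is the optimizer named in the embodiment.
    return torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)
```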
FIG. 2 is a system model architecture diagram of the present invention, as shown in FIG. 2:
Input encoding module
The input encoding module performs feature extraction on the input text and picture modality information and encodes them into a unified vector space. Its input is a sentence containing a series of words and a picture containing a plurality of objects; its output is the vector representations produced by the following operations. The input encoding module comprises the following two parts:
(1) Text encoding module:
Given a sequence of text words w_1, w_2, …, w_N, a bidirectional gated recurrent neural network (bi-GRU) is adopted to learn a sequential semantic representation of the words in the text and encode it in vector form:

u_i = bi-GRU(x_i), 1 ≤ i ≤ N,

where x_i is the embedding of the word w_i, u_i is the hidden state of the i-th word output by the bi-GRU unit, and N is the number of words in the sentence, i.e., the sentence length. After bi-GRU encoding, the original representation of the word sequence is H^T = [u_1, u_2, …, u_N].
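As an illustrative sketch only, the bi-GRU text encoder described above could be written in PyTorch roughly as follows; the class name, the per-direction hidden size, and the vocabulary handling are assumptions rather than details given in the disclosure.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Runs a bi-GRU over GloVe-initialized word embeddings and returns one
    hidden-state vector u_i per word (the original word representations)."""

    def __init__(self, vocab_size: int, embed_dim: int = 200, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # load GloVe weights in practice
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, N) -> output: (batch, N, 2 * hidden)
        emb = self.embed(token_ids)
        states, _ = self.bigru(emb)
        return states


# Usage: a batch of 2 sentences with 12 tokens each.
encoder = TextEncoder(vocab_size=30000)
u = encoder(torch.randint(0, 30000, (2, 12)))  # shape (2, 12, 1024)
```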
(2) Picture encoding module
A picture involves a plurality of objects. In order to filter irrelevant information in the picture and to avoid the incomplete object semantics caused by dividing the picture into equal regions, the objects related to the text are directly extracted from the picture and represented by features.
For each picture I, D salient objects are detected from the picture with a pre-trained object detector, Faster R-CNN, and the pooled feature before the multi-class classification layer is taken as the feature representation of each object. The visual feature of each extracted object is then projected into the space of the text representation:

r_j = ReLU(W_v r_j + b_v),

where r_j is the visual feature representation of the j-th detected object, W_v is a weight matrix, and b_v is a bias parameter.
The picture serves as background information for the input text, and the text relates to only some of the target objects in the picture. To suppress the negative impact of irrelevant information in the picture, a gated attention mechanism is used to align the text and the picture by computing word- and region-level relevance. For each object in the picture, the gated attention mechanism uses soft attention to compute the relevance β_ij of each word i in the text to the object j and to form a text representation specific to the object,

t_j = Σ_{i=1}^{N} β_ij · u_i,

and then t_j and the visual feature representation r_j are multiplied element-wise to obtain a text-correlated representation of each target object,

g_j = t_j ⊙ r_j.

Because the picture regions lack a natural order, the scattered information in the picture is connected into a complete semantic expression by the bidirectional gated recurrent neural network bi-GRU:

v_j = bi-GRU(g_j), 1 ≤ j ≤ D.

After bi-GRU encoding, the original representation of the objects in the picture is H^I = [v_1, v_2, …, v_D], where D is the number of objects identified by the object detector in the picture.
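A hedged PyTorch-style sketch of the picture encoding module described above: projecting the detected-object features, aligning each object with the text by soft attention, gating with the element-wise product, and running a bi-GRU over the object sequence. The bilinear relevance score W_a and every name here are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectEncoder(nn.Module):
    """Aligns Faster R-CNN object features with the text, then encodes the
    object sequence with a bi-GRU to obtain original object representations v_j."""

    def __init__(self, obj_dim: int = 2048, text_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(obj_dim, text_dim)                   # r_j = ReLU(W_v r_j + b_v)
        self.W_a = nn.Parameter(torch.empty(text_dim, text_dim))   # assumed relevance scorer
        nn.init.xavier_uniform_(self.W_a)
        self.bigru = nn.GRU(text_dim, text_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, words: torch.Tensor, objects: torch.Tensor) -> torch.Tensor:
        # words:   (B, N, text_dim) word representations u_i
        # objects: (B, D, obj_dim)  pooled object features from the detector
        r = F.relu(self.proj(objects))                               # (B, D, text_dim)
        scores = torch.einsum("bnh,hk,bdk->bnd", words, self.W_a, r)
        beta = F.softmax(scores, dim=1)                              # relevance of word i to object j
        t = torch.einsum("bnd,bnh->bdh", beta, words)                # object-specific text t_j
        g = t * r                                                    # gate: g_j = t_j ⊙ r_j
        v, _ = self.bigru(g)                                         # (B, D, text_dim)
        return v
```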
Knowledge injection module
In order to construct a complete semantic representation of the text and the picture, the multimodal information is expanded with implicit context information, so that a multimodal feature representation enriched with multi-view knowledge is formed. The knowledge injection module comprises the following two parts:
(1) Knowledge acquisition module:
A visual-textual event reasoner, VisualCOMET, is used to provide common-sense knowledge inferences along two dimensions, the scene-state context and the emotional-state context, for the input text and pictures. VisualCOMET uses the pre-trained auto-regressive language model GPT-2 as its generation model and, given a picture or an event description, can generate inference knowledge about three relation types, before, after, and intent, i.e., what happens before and after the event (scene-state context) and the intention of the person in the scene (emotional-state context). A common-sense inference is usually a short phrase composed of a sequence of words, and the inference knowledge of relation type R for modality M is defined as a word sequence {w_1, w_2, …, w_L}, where R ∈ {before, after, intent} and M ∈ {T, I} (text or picture).
A bidirectional gated recurrent network (bi-GRU) is used to process these short phrases and obtain their representation H^{M,R}, where L is the sentence length of the inference knowledge.
(2) Multi-dimensional knowledge injection module:
A knowledge-aware attention layer is designed on the basis of the common-sense inferences from the different views, so as to form a multi-dimensional knowledge-aware multimodal representation. First, each element of the text or the picture is queried with each knowledge inference and their relevance is calculated, so that the elements of the text or the picture are aligned with the knowledge inference. Specifically, given a multimodal feature representation H^M (the text feature map H^T = [u_1, …, u_N] or the picture feature map H^I = [v_1, …, v_D]) and the common-sense inference representation H^{M,R}, the correlation between the input and the inference knowledge, i.e., the correlation matrix C^M, is calculated as

C^M = tanh(H^M W^M (H^{M,R})^T),

where W^M is a weight matrix.
Then an attention mechanism is used to form a word-level (object-level) representation A^{M,R} of the inference knowledge with respect to the input features, which is added to the original representation of the input features to obtain the representation E^{T,R} of the text with implicit context information and the representation E^{I,R} of the picture with implicit context information:

A^{M,R} = softmax(C^M) H^{M,R},
E^{M,R} = H^M + A^{M,R}.

Since the event reasoner VisualCOMET generates multiple candidate inferences for the text and the picture, a relevance weight α_q is learned for each knowledge inference in order to focus on the inferences that are more relevant to the input scene, and the weighted sum of the candidates yields the knowledge-enhanced representation of the multimodal information:

K^{M,R} = Σ_{q=1}^{Q} α_q E_q^{M,R},

where the weight α_q is computed from E_q^{M,R} with the weight matrix W^{M,R} and normalized with a softmax, and Q is the number of candidate inference knowledge items.
The scene-state context consists of the pre-scene, the current scene, and the post-scene, so averaging the three states yields the scene-state context-aware word vector representation:

u_i^{sc} = (k_i^{before} + u_i + k_i^{after}) / 3,

where k_i^R denotes the i-th row of K^{T,R}. Accordingly, the scene-state context-aware picture vector representation v_j^{sc} is obtained. Similarly, the emotional-state context-aware text representation u_i^{em} = k_i^{intent} and the emotional-state context-aware picture representation v_j^{em} are obtained.
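The knowledge-aware attention layer described above can be sketched as follows for one modality and one relation type. The correlation matrix and the add-then-weight structure follow the formulas given above; the per-candidate relevance scorer and all names are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeInjection(nn.Module):
    """Injects one relation type of inference knowledge into the representation
    H_m of one modality (text word vectors or picture object vectors)."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.W_m = nn.Parameter(torch.empty(dim, dim))   # correlation weight W^M
        nn.init.xavier_uniform_(self.W_m)
        self.cand_score = nn.Linear(dim, 1)              # assumed per-candidate relevance scorer

    def forward(self, H_m: torch.Tensor, H_know: torch.Tensor) -> torch.Tensor:
        # H_m:    (B, S, dim)     original word/object representations
        # H_know: (B, Q, L, dim)  Q candidate inferences, each a phrase of L words
        Q = H_know.shape[1]
        enhanced = []
        for q in range(Q):
            K = H_know[:, q]                                        # (B, L, dim)
            C = torch.tanh(H_m @ self.W_m @ K.transpose(1, 2))      # C^M = tanh(H^M W^M K^T)
            A = F.softmax(C, dim=-1) @ K                            # knowledge summary per element
            enhanced.append(H_m + A)                                # add to the original features
        enhanced = torch.stack(enhanced, dim=1)                     # (B, Q, S, dim)
        # learn a relevance weight per candidate and take the weighted sum over candidates
        w = F.softmax(self.cand_score(enhanced.mean(dim=2)), dim=1)  # (B, Q, 1)
        return (w.unsqueeze(-1) * enhanced).sum(dim=1)               # (B, S, dim)


# Assumed usage for the scene-state and emotional-state views:
#   H_sc = (inject_before(H, K_before) + H + inject_after(H, K_after)) / 3
#   H_em = inject_intent(H, K_intent)
```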
Dual attention module
In order to locate the irony-describing portions of the text and the picture, a joint memory vector is created that collects the irony-related shared information of the two modalities by executing the text attention and picture attention mechanisms over multiple iterations. Based on the dual-attention mechanism, representations of the text and the picture that focus on specific regions are obtained. The dual-attention mechanism is executed on three views of the multimodal information: the original representation, the scene-state context-aware representation, and the emotional-state context-aware representation. For simplicity of expression, the view superscript is omitted in the following description, i.e., the word representations are written as u_i and the object representations as v_j.
The dual attention module contains the following three sub-modules:
(1) Shared vector
The key to identifying irony in multimodal information is to find the joint space that describes the same thing, i.e., the region that describes the irony. To this end, a joint memory vector is designed to collect the information that has been identified in the text and the picture over k iterations:

m^(k) = m^(k-1) + u^(k) ⊙ v^(k),

where u^(k) and v^(k) are the complete representations of the text and the picture, and the initial memory representation m^(0) is defined as the element-wise product of u^(0) and v^(0):

u^(0) = Σ_{i=1}^{N} u_i, v^(0) = Σ_{j=1}^{D} v_j, m^(0) = u^(0) ⊙ v^(0).
(2) Text attention mechanism
The text attention mechanism identifies the region describing the irony and measures the irony-related relevance of each part of the text by computing an attention weight between each word in the text and the joint memory vector. Specifically, the attention weight α_i^(k) is calculated by a two-layer feedforward neural network and a softmax function:

s_i^(k) = w_2^T tanh(W_1 [u_i ; m^(k)] + b_1) + b_2,
α_i^(k) = softmax(s_i^(k)),

where W_1 and w_2 are model parameters and b_1 and b_2 are bias parameters.
Finally, the complete representation of the text is obtained by weighted summation:

u^(k+1) = Σ_{i=1}^{N} α_i^(k) u_i.
(3) Picture attention mechanism
The calculation process is the same as that of the text attention mechanism. First, the relevance of each region in the picture to the joint memory vector (i.e., the irony-related shared semantics identified after k iterations) is calculated with a two-layer feedforward neural network and a softmax function, and the complete representation of the picture is obtained by weighted summation:

s_j^(k) = w_4^T tanh(W_3 [v_j ; m^(k)] + b_3) + b_4,
α_j^(k) = softmax(s_j^(k)),
v^(k+1) = Σ_{j=1}^{D} α_j^(k) v_j,

where W_3 and w_4 are model parameters and b_3 and b_4 are bias parameters.
After K iterations, the dual attention module obtains representations of the text and the picture that highlight the ironic portions, denoted u and v.
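A hedged sketch of the dual-attention interaction applied to one view of the multimodal information. The two-layer feed-forward scorer follows the description above; the exact memory-update rule (accumulating the element-wise product of the attended text and picture vectors) and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAttention(nn.Module):
    """Two-layer feed-forward scorer plus softmax over the elements of one
    modality, conditioned on the joint memory vector m."""

    def __init__(self, dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, elems: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # elems: (B, S, dim) word reps u_i or object reps v_j; m: (B, dim)
        s = self.ff(torch.cat([elems, m.unsqueeze(1).expand_as(elems)], dim=-1))
        alpha = F.softmax(s, dim=1)                 # attention weights over the S elements
        return (alpha * elems).sum(dim=1)           # weighted sum -> complete representation


def dual_attention(u_elems, v_elems, text_att, img_att, num_iters: int = 3):
    """Co-attention over text and picture driven by a joint memory vector."""
    u = u_elems.sum(dim=1)          # u^(0): sum of the word representations
    v = v_elems.sum(dim=1)          # v^(0): sum of the object representations
    m = u * v                       # m^(0): element-wise product
    for _ in range(num_iters):
        u = text_att(u_elems, m)    # text attention re-reads the words
        v = img_att(v_elems, m)     # picture attention re-reads the objects
        m = m + u * v               # assumed update: accumulate the shared information
    return u, v
```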
Multi-dimensional cross-modal matching module
To capture the semantic differences between the text and the picture, the differences between them are compared on the multimodal original representation and the context-aware representations using the following deep comparison mechanism:

z = W_z [u ⊙ v ; |u - v|],

where ⊙ is the element-wise product, | · | is the absolute value of the element-wise difference, ; is the concatenation operation, and W_z is a trainable weight matrix.
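For illustration, a minimal sketch of the comparison layer above: the element-wise product and the absolute difference of the attended text and picture vectors are concatenated and passed through a trainable projection. The output dimension and names are assumptions.

```python
import torch
import torch.nn as nn


class CrossModalMatch(nn.Module):
    """z = W_z [u ⊙ v ; |u - v|] for one view (raw, scene-state, or emotional-state)."""

    def __init__(self, dim: int = 1024, out_dim: int = 512):
        super().__init__()
        self.W_z = nn.Linear(2 * dim, out_dim)

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.W_z(torch.cat([u * v, (u - v).abs()], dim=-1))
```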
Prediction module
The multi-dimensional cross-modal comparison representations (z_raw, z_sc, z_em) obtained above are concatenated and input into the fully connected layer, and binary classification is performed with the Sigmoid function:

H = fc([z_raw ; z_sc ; z_em]),
y = Sigmoid(H).
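A minimal sketch of the prediction head described above, assuming a two-layer fully connected network before the Sigmoid output; the hidden size and all names are illustrative.

```python
import torch
import torch.nn as nn


class IronyClassifier(nn.Module):
    """Concatenates the three comparison vectors (raw, scene-state, emotional-state)
    and predicts a binary irony label with a Sigmoid output."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z_raw, z_sc, z_em):
        return torch.sigmoid(self.fc(torch.cat([z_raw, z_sc, z_em], dim=-1)))
```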
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A knowledge-injection-based multimodal irony recognition method of a dual-attention network, the method comprising:
acquiring data content to be identified, wherein the data content to be identified comprises a number of text-picture pairs, the text containing a number of words i and the picture relating to a plurality of objects j;
respectively encoding the words i in the text and the objects j in the picture to obtain original representations u_i of the words and original representations v_j of the objects;
expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain word context-aware representations and object context-aware representations;
performing attention calculation, using a dual-attention network, on the original representations u_i and v_j and on the word and object context-aware representations, respectively, to obtain attention calculation results for the original representation and the context-aware representation;
obtaining an original cross-modal comparison representation and a context-aware cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation results of the original representation and the context-aware representation;
calculating an irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation.
2. The method of claim 1, wherein encoding the objects j in the picture to obtain the original representations v_j of the objects comprises:
for each picture, detecting the region of each object j from the picture by using a pre-trained object detector, and taking the pooled feature before the multi-class classification layer as the visual feature representation r_j of the object j;
projecting the visual feature representation r_j into the space of the text representation;
obtaining a text representation t_j specific to the object j by calculating the relevance of each word i in the text to the object j;
calculating a representation g_j reflecting the text relevance of the object j based on the text representation t_j and the visual feature representation r_j;
inputting the object sequence formed from the visual feature representations r_j, weighted by the representations g_j, into a bidirectional gated recurrent neural network to obtain the original representation v_j of each object j.
3. The method of claim 1, wherein expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain the word context-aware representations and the object context-aware representations, comprises:
generating different types of inference knowledge {w_1, w_2, …, w_L} for the event description in each picture or text, and calculating its common-sense inference representation H^{M,R}, wherein w_l denotes a word in the inference knowledge, 1 ≤ l ≤ L, L denotes the length of the inference knowledge, the relation type R ∈ {before, after, intent}, before denoting the pre-event relation type, after denoting the post-event relation type, and intent denoting the person-intention relation type in the scene, and the modality M denotes the text modality or the picture modality;
calculating a correlation matrix C^M between the data content to be identified and the inference knowledge, based on the text feature map H^T composed of the original representations u_i of the words, the picture feature map H^I composed of the original representations v_j of the objects, and the common-sense inference representation H^{M,R};
obtaining, based on the correlation matrix C^M, representations e_i^R of the words with implicit context information and representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j;
learning a correlation weight for each piece of inference knowledge and computing enhanced representations k_i^R and k_j^R from the representations e_i^R and e_j^R;
computing the word context-aware representations and the object context-aware representations based on the enhanced representations k_i^R and k_j^R, wherein the word context-aware vector representation comprises a scene-state context-aware representation u_i^{sc} and an emotional-state context-aware representation u_i^{em}, and the object context-aware vector representation comprises a scene-state context-aware representation v_j^{sc} and an emotional-state context-aware representation v_j^{em}.
4. The method of claim 3, wherein obtaining the representations e_i^R of the words with implicit context information and the representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j based on the correlation matrix C^M comprises:
computing, based on the correlation matrix C^M and the common-sense inference representation H^{M,R}, a word-level representation a_i^R and an object-level representation a_j^R of the inference knowledge by using an attention mechanism;
adding the word-level representation a_i^R to the original representation u_i of the word and the object-level representation a_j^R to the original representation v_j of the object, respectively, to obtain the representations e_i^R and e_j^R.
5. The method of claim 3, wherein computing the word context-aware representations based on the enhanced representations k_i^R comprises:
dividing the enhanced representations k_i^R into an enhanced representation k_i^{before} of the pre-event relation type, an enhanced representation k_i^{after} of the post-event relation type, and an enhanced representation k_i^{intent} of the intent relation type;
computing the scene-state context-aware representation u_i^{sc} of the word according to the enhanced representation k_i^{before}, the original representation u_i of the word, and the enhanced representation k_i^{after};
obtaining the emotional-state context-aware representation u_i^{em} of the word according to the enhanced representation k_i^{intent}.
6. The method of claim 1, wherein performing attention calculation on the word original representations u_i and the object original representations v_j using the dual-attention network to obtain the attention calculation result of the original representation comprises:
summing the original representations u_i of the words to obtain a complete text representation u^(0);
summing the original representations v_j of the objects in the picture to obtain a complete picture representation v^(0);
calculating a joint memory vector m^(0) from the representation u^(0) and the representation v^(0);
performing iterative calculation in which the joint memory vector is updated with the element-wise product u^(k) ⊙ v^(k), and obtaining the joint memory vector m^(K) after the iteration ends, where K denotes the total number of iterations, u^(k) denotes the complete text representation at the k-th iteration, v^(k) denotes the complete picture representation at the k-th iteration, and ⊙ is the element-wise product;
performing attention calculation with the dual-attention network based on the joint memory vector m^(K) to obtain the complete text representation u^(K+1) and the complete picture representation v^(K+1);
taking the complete text representation u^(K+1) and the complete picture representation v^(K+1) as the attention calculation result of the original representation.
7. The method of claim 6, wherein computing the complete text representation u^(K+1) based on the joint memory vector m^(K) comprises:
computing the output s_i of a feedforward neural network in the dual-attention network from the joint memory vector m^(K) and the word original representation u_i;
substituting the output s_i into a softmax function to obtain the attention weight α_i;
performing a weighted summation of the word original representations u_i with the attention weights α_i to obtain the complete text representation u^(K+1).
8. The method of claim 6, wherein obtaining the original cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation result of the original representation comprises:
computing the original cross-modal comparison representation z_raw = W_raw [u^(K+1) ⊙ v^(K+1) ; |u^(K+1) - v^(K+1)|], wherein W_raw denotes a trainable weight matrix, ⊙ is the element-wise product, | · | is the absolute value of the element difference, and ; denotes the concatenation operation.
9. The method of claim 1, wherein calculating the irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation comprises:
concatenating the original cross-modal comparison representation and the context-aware cross-modal comparison representation;
inputting the concatenation result into a fully connected layer and performing binary irony classification with a Sigmoid function to obtain the irony recognition result of the data content to be identified.
10. An electronic device comprising a memory and a processor, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any of claims 1-9.
CN202210863424.5A 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network Pending CN115408517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210863424.5A CN115408517A (en) 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210863424.5A CN115408517A (en) 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network

Publications (1)

Publication Number Publication Date
CN115408517A true CN115408517A (en) 2022-11-29

Family

ID=84157770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210863424.5A Pending CN115408517A (en) 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network

Country Status (1)

Country Link
CN (1) CN115408517A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402063A (en) * 2023-06-09 2023-07-07 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium
CN116402063B (en) * 2023-06-09 2023-08-15 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium
CN116702091A (en) * 2023-06-21 2023-09-05 中南大学 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN116702091B (en) * 2023-06-21 2024-03-08 中南大学 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN117633516A (en) * 2024-01-25 2024-03-01 华南师范大学 Multi-mode cynics detection method, device, computer equipment and storage medium
CN117633516B (en) * 2024-01-25 2024-04-05 华南师范大学 Multi-mode cynics detection method, device, computer equipment and storage medium
CN118093896A (en) * 2024-04-12 2024-05-28 中国科学技术大学 Ironic detection method, ironic detection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN115408517A (en) Knowledge injection-based multi-modal irony recognition method of double-attention network
CN111897913B (en) Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
CN111639252A (en) False news identification method based on news-comment relevance analysis
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN115293170A (en) Aspect-level multi-modal emotion analysis method based on cooperative attention fusion
CN116402066A (en) Attribute-level text emotion joint extraction method and system for multi-network feature fusion
CN112115131A (en) Data denoising method, device and equipment and computer readable storage medium
CN113326384A (en) Construction method of interpretable recommendation model based on knowledge graph
CN114461821A (en) Cross-modal image-text inter-searching method based on self-attention reasoning
Khan et al. A deep neural framework for image caption generation using gru-based attention mechanism
CN117746078B (en) Object detection method and system based on user-defined category
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN115099234A (en) Chinese multi-mode fine-grained emotion analysis method based on graph neural network
CN117633516B (en) Multi-mode cynics detection method, device, computer equipment and storage medium
Chauhan et al. Analysis of Intelligent movie recommender system from facial expression
CN114330482A (en) Data processing method and device and computer readable storage medium
CN115098646A (en) Multilevel relation analysis and mining method for image-text data
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
WO2022099063A1 (en) Systems and methods for categorical representation learning
CN113821610A (en) Information matching method, device, equipment and storage medium
Wang et al. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
Yaoxian et al. Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination