CN115408517A - Knowledge injection-based multi-modal irony recognition method of a dual-attention network
- Publication number: CN115408517A
- Application number: CN202210863424.5A
- Authority: CN (China)
- Prior art keywords: representation, original, context, text, picture
- Prior art date: 2022-07-21
- Legal status: Pending
Classifications
- G06F16/35 (G - Physics; G06 - Computing; G06F - Electric digital data processing) - Information retrieval; database structures; clustering and classification of unstructured textual data
- G06N3/08 (G - Physics; G06 - Computing; G06N - Computing arrangements based on specific computational models) - Neural networks; learning methods
Abstract
The invention provides a knowledge injection-based multi-modal irony recognition method of a dual-attention network, comprising the following steps: acquiring data content to be identified, wherein the data content comprises a number of text-picture pairs; encoding the words in each text and the objects in each picture to obtain original representations; expanding the original representations based on the implicit context information of the data content to obtain context-aware representations; performing attention calculation on the original and context-aware representations; computing an original cross-modal contrast representation and a context-aware cross-modal contrast representation from the attention results; and calculating an irony recognition result based on the original and context-aware cross-modal contrast representations. The invention helps to improve the overall performance of irony recognition, facilitates practical application of the model, and provides interpretability for the prediction results.
Description
Technical Field
The invention relates to a knowledge injection-based multi-modal irony recognition method of a dual-attention network, and belongs to the technical field of multi-modal information recognition.
Background
Multimodal irony implicitly expresses strong emotion through the opposition between a text and the scene depicted in an accompanying picture. Text-and-picture irony is now ubiquitous on social platforms such as microblogs and Twitter. Because irony reverses the polarity of the emotion or viewpoint expressed in the text, automatic detection of multimodal irony matters greatly for customer service, opinion mining, and any task that requires understanding a person's true emotion.
Real-world multimodal irony detection is quite complex. The semantics of user-posted content are shaped not only by explicit content but also by implicit context. Explicit content is what can be observed directly in the input text or picture; implicit context is inferred knowledge about the scene that is not visible in the input, including how the scene develops over time and the intent of the people in it. Irony recognition therefore requires, on top of a complete semantic representation of text and picture, accurately locating the irony-describing portions of the multimodal information and discriminating their semantic differences. Existing multimodal irony detection methods, however, only learn features from the input text and pictures and ignore the implicit context behind the content. They also model semantic differences across modalities on the raw, full content of text and picture, which easily introduces noise, reduces recognition accuracy, and hinders practical deployment. How to inject implicit context information into the multimodal input to obtain better feature representations, and how to accurately locate the irony-describing regions for semantic-difference recognition based on that information, are therefore urgent problems to be solved.
Disclosure of Invention
In order to solve the above problems, the knowledge injection-based multi-modal irony recognition method of a dual-attention network provided by the invention uses a knowledge-enhanced multi-dimensional attention module to inject implicit context knowledge into the multimodal input representations, dividing that knowledge into two perspectives, scene state and emotional state, according to the way humans reason, so as to construct a complete semantic representation of the multimodal information. At the same time, a dual-attention network executes the picture and text attention modules cooperatively around a joint memory vector that accumulates previous attention results, capturing the irony-related shared semantics in text and pictures. Finally, based on the joint embedding space, a multi-dimensional cross-modal matching layer distinguishes the differences between the modalities along multiple dimensions. This improves the overall performance of irony recognition, provides interpretability for the prediction results, and facilitates practical application of the model.
The technical content of the invention comprises:
A knowledge injection-based multi-modal irony recognition method of a dual-attention network, the method comprising:
acquiring data content to be identified, wherein the data content comprises a number of <text, picture> pairs, each text containing a number of words i and each picture involving a plurality of objects j;
encoding the words i in the text and the objects j in the picture respectively, obtaining an original representation u_i of each word and an original representation v_j of each object;
expanding the original word representations u_i and object representations v_j based on the implicit context information of the data content, obtaining context-aware word representations and context-aware object representations;
performing attention calculation with the dual-attention network on the original representations u_i and v_j and on the context-aware word and object representations, obtaining the attention results of the original and context-aware representations;
obtaining an original cross-modal contrast representation and a context-aware cross-modal contrast representation by comparing the differences between text and picture according to those attention results;
calculating the irony recognition result of the data content based on the original and context-aware cross-modal contrast representations.
Further, encoding the objects j in the picture to obtain the original object representations v_j comprises:
for each picture, detecting the region of each object j with a pre-trained object detector and taking the pooled features before the multi-class classification layer as the visual feature representation r_j of object j;
projecting the visual features r_j into the space of the text representations;
obtaining an object-specific text representation t_j by calculating the relevance of each word i in the text to object j;
computing a text-relevance representation g_j of object j based on the text representation t_j and the visual feature representation r_j;
feeding the sequence of object representations g_j into a bidirectional gated recurrent neural network to obtain the original representation v_j of each object j.
Further, expanding the original word representations u_i and object representations v_j based on the implicit context information of the data content to obtain the context-aware word and object representations comprises:
generating different types of inference knowledge for the event description in each picture or text and computing the representation H^{M,R} of the inference knowledge, where w_l (1 ≤ l ≤ L) denotes a word of the inference knowledge, L is the length of the inference knowledge, the relationship type R ∈ {before, after, intent} (before denotes the pre-event relationship type, after the post-event relationship type, and intent the relationship type of the intent of the people in the scene), and the modality M denotes the text or picture modality;
computing a correlation matrix C_M between the data content and the inference knowledge, based on the text feature map H_T composed of the word representations u_i, the picture feature map H_I composed of the object representations v_j, and the commonsense inference representation H^{M,R};
obtaining, based on the correlation matrix C_M, word and object representations that carry implicit context information;
learning a correlation weight for each piece of inference knowledge and computing enhanced word and object representations;
computing, based on the enhanced representations, the context-aware word representations and object representations, wherein the context-aware word representation comprises a scene-state context-aware representation u_i^sc and an emotional-state context-aware representation u_i^em, and the context-aware object representation comprises a scene-state context-aware representation v_j^sc and an emotional-state context-aware representation v_j^em.
Further, obtaining, based on the correlation matrix C_M, the word and object representations carrying implicit context information comprises:
computing, with an attention mechanism based on the correlation matrix C_M and the commonsense inference representation H^{M,R}, a word-level representation and an object-level representation of the inference knowledge;
adding the word-level representation to the original word representations u_i and the object-level representation to the original object representations v_j, respectively, to obtain the word and object representations with implicit context information.
Further, computing the context-aware word representations based on the enhanced representations comprises:
splitting the enhanced word representation by relationship type into an enhanced representation of the pre-event (before) type, an enhanced representation of the post-event (after) type, and an enhanced representation of the intent type;
computing the scene-state context-aware word representation u_i^sc from the before-type and after-type enhanced representations together with the original word representation u_i;
obtaining the emotional-state context-aware word representation u_i^em from the intent-type enhanced representation.
Further, performing attention calculation with the dual-attention network on the original word representations u_i and object representations v_j to obtain the attention result of the original representations comprises:
summing the original word representations to obtain a complete text representation u^(0), and summing the original object representations in the picture to obtain a complete picture representation v^(0);
calculating a joint memory vector m^(0) from the representations u^(0) and v^(0);
iteratively updating the joint memory vector from its previous value and the element-wise product of the text and picture representations, obtaining the joint memory vector m^(K) when the iteration ends, where K denotes the total number of iterations, u^(k) denotes the complete text representation at the k-th iteration, v^(k) denotes the complete picture representation at the k-th iteration, and ⊙ is the element-wise product;
performing attention calculation with the dual-attention network based on the joint memory vector m^(K), obtaining the complete text representation u^(K+1) and the complete picture representation v^(K+1);
taking the complete text representation u^(K+1) and the complete picture representation v^(K+1) as the attention result of the original representations.
Further, computing the complete text representation u^(K+1) based on the joint memory vector m^(K) comprises:
computing the output of a feed-forward neural network in the dual-attention network from the joint memory vector m^(K) and the original word representations u_i;
substituting that output into a softmax function to obtain the attention weights;
performing a weighted sum of the original word representations u_i with the attention weights to obtain the complete text representation u^(K+1).
Further, obtaining the original cross-modal contrast representation by comparing the differences between text and picture according to the attention result of the original representations comprises:
computing the original cross-modal contrast representation z_raw = W_raw([u^(K+1); v^(K+1); |u^(K+1) - v^(K+1)|; u^(K+1) ⊙ v^(K+1)]), where W_raw denotes a trainable weight matrix, |·| is the element-wise absolute difference, ⊙ is the element-wise product, and [;] denotes the concatenation operation.
Further, calculating the irony recognition result of the data content based on the original cross-modal contrast representation and the context-aware cross-modal contrast representations comprises:
concatenating the original cross-modal contrast representation and the context-aware cross-modal contrast representations;
feeding the concatenation into a fully connected layer and performing binary irony classification with a Sigmoid function to obtain the irony recognition result of the data content.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program that is loaded and executed by the processor to implement any of the methods described above.
Compared with the prior art, the invention has the advantages that:
(1) The method adopts VisualCOMET to provide implicit context information, namely scene-state context and emotional-state context, for the text and picture modalities; a knowledge-enhanced multi-dimensional attention module injects the implicit context into the multimodal input and generates context-aware text and picture representations, so that the multimodal information constructs a complete semantic context.
(2) The designed dual-attention network accurately locates the ironic regions in the multimodal information by applying text and picture attention over multiple iterations. At the same time, the dual attention is applied separately to the original representations and the context-aware representations, capturing multimodal information representations from multiple angles. Based on the multi-dimensional shared representation space, a multi-dimensional cross-modal module distinguishes the semantic differences between text and pictures, so that irony is identified accurately.
(3) Compared with existing methods, the method achieves higher performance; moreover, the dual-attention module, combined with the injected knowledge, provides interpretability for the prediction results.
Drawings
FIG. 1 is a system model flow diagram of the present invention.
FIG. 2 is a system model architecture diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It should be understood that the described embodiments are merely some embodiments of the present invention rather than all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
The technical problem solved by the invention: aiming at the multimodal irony recognition problem, the knowledge injection-based multi-modal irony recognition method of a dual-attention network, on the one hand, adopts a knowledge-enhanced multi-dimensional attention model to construct a complete semantic representation of the multimodal information; on the other hand, it adopts the dual-attention mechanism to maintain a shared vector, extracts the irony-related shared semantics in the multimodal information through the text and picture attention modules, and models the differences of the irony-bearing multimodal scene through a multi-dimensional cross-modal matching layer, thereby improving the overall performance of irony recognition and giving the prediction results a degree of interpretability.
The technical scheme of the invention is as follows: a knowledge injection-based multi-modal irony recognition method of a dual-attention network injects implicit context knowledge into the multimodal representations to construct the complete semantics of the multimodal information, and uses a dual-attention network to capture the irony-describing regions in those representations for multi-dimensional semantic comparison. In the model, the input text and pictures are first encoded into vector representations, and an attention mechanism aligns the text with the objects in the pictures, filtering out irrelevant picture information. Then, to supplement the implicit context missing from text and picture, an event knowledge graph generates scene context and emotion context for them, and a knowledge-enhanced multi-dimensional attention module injects the acquired knowledge into the multimodal input to construct a complete semantic encoding of the multimodal information. To attend to the ironic regions in text and picture, a dual-attention module that executes text and picture attention cooperatively is proposed; by maintaining a joint memory vector, it captures the shared semantics spanning the modalities in both the original encoding and the complete semantic encoding. Based on the joint embedding space, multi-dimensional cross-modal matching distinguishes the multimodal differences along multiple dimensions. Finally, the multiple comparison results are concatenated as input to the classifier for multimodal irony detection.
Fig. 1 is a flow chart of the system model of the present invention. As shown in Fig. 1, the system comprises: an input encoding module, a knowledge injection module, a dual-attention interaction module, a multi-dimensional cross-modal matching module, and a classification prediction module.
First, in the encoding module: for word preprocessing, text sentences are tokenized into words with the NLTK toolkit, and the words are embedded into 200-dimensional vectors generated with the GloVe algorithm as initialization; for picture preprocessing, Faster R-CNN is used to extract the picture objects and their features, which are kept fixed during the training stage. The hidden size of the bi-GRU is 512.
In the knowledge injection module, for event inference knowledge, inference knowledge of the three relationship types "before", "after", and "intent" is generated for text and pictures with the VisualCOMET event-inference generator. For example, the input text is "this is why I want to be a mom" and the picture shows "a woman standing in the room with a broom". The three types of inference generated by VisualCOMET for the text are: before: [be in a family room, put on her school uniform, …]; after: [play scales with friends, tell the wrong, …]; intent: [stay at home cozy, make her mom happy, …]. The three types generated for the picture are: before: [put on an apron, be from a school, …]; after: [clean the room, finish her housework, …]; intent: [clean the pole with a broom, play with friends, …]. In practical application, 15 candidate inferences are kept for each relationship type for both text and pictures. The knowledge injection module is then used to obtain multimodal representations enhanced with implicit context knowledge, constructing the complete semantic information of text and pictures.
In the dual-attention interaction module, the text and picture attention mechanisms obtain, through 3 interaction iterations, representations of the text and picture attending to the irony-describing regions.
In the multi-dimensional cross-modal matching module, text and picture are compared on the original representations, the scene-state context-aware representations, and the emotional-state context-aware representations.
In the classification module, the multimodal comparison results of the three dimensions are concatenated and fed into an output layer consisting of a two-layer fully connected network and a Sigmoid function for irony classification, as sketched below. During training, the batch size is 32, the model learning rate is 0.0005, and Adam is used as the optimizer.
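As an illustration of this training configuration, the following is a minimal sketch assuming a PyTorch implementation; the model class KIDAN and the dataset object train_dataset are hypothetical stand-ins, not names from the patent:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

# KIDAN and train_dataset are illustrative placeholders for the system described above.
model = KIDAN(hidden_size=512, attention_iters=3, knowledge_candidates=15)

loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # batch size 32
optimizer = optim.Adam(model.parameters(), lr=0.0005)            # Adam, learning rate 0.0005
criterion = nn.BCELoss()                                         # binary irony classification

for text_batch, image_batch, labels in loader:
    optimizer.zero_grad()
    probs = model(text_batch, image_batch)        # Sigmoid probability of irony
    loss = criterion(probs.squeeze(-1), labels.float())
    loss.backward()
    optimizer.step()
```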
FIG. 2 is a system model architecture diagram of the present invention. As shown in FIG. 2:
input coding Module
And performing characteristic extraction on the input text and picture modal information, and encoding the input text and picture modal information into a unified vector space. Its input is a sentence containing a series of words and a picture containing a plurality of objects; the output is a vector representation that is subjected to the following operations. The input coding module comprises the following two parts:
(1) A text encoding module:
Given a sequence of text words w_1, w_2, …, w_N, a bidirectional gated recurrent neural network (bi-GRU) is adopted to learn the sequential semantic representation of the words in the text and encode it in vector form:
u_i = bi-GRU(w_i), i = 1, …, N,
where u_i is the hidden state of the i-th word output by the bi-GRU unit and N is the number of words in the sentence, i.e., the sentence length. After bi-GRU encoding, the original representation of the word sequence is {u_1, …, u_N}.
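A minimal sketch of this text encoder, assuming PyTorch and the 200-dimensional GloVe initialization described earlier; names and dimensions other than those stated in the text are illustrative:

```python
import torch
from torch import nn

class TextEncoder(nn.Module):
    """Encodes a word sequence into per-word original representations u_i."""
    def __init__(self, glove_weights: torch.Tensor, hidden_size: int = 512):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)  # 200-d GloVe init
        # Bidirectional GRU; hidden_size per direction, so outputs are 2*hidden_size wide.
        self.bigru = nn.GRU(glove_weights.size(1), hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, N) -> u: (batch, N, 2*hidden_size), one u_i per word
        u, _ = self.bigru(self.embed(token_ids))
        return u
```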
(2) A picture encoding module:
A picture involves multiple objects. To filter out irrelevant information in the picture and avoid the incomplete object semantics caused by partitioning the picture into equal regions, the objects relevant to the text are extracted directly from the picture and given feature representations.
For each picture I, D salient objects are detected with a pre-trained object detector, Faster R-CNN, and the pooled features before the multi-class classification layer are taken as the object feature representations. The extracted visual features are then projected into the space of the text representations:
r_j = ReLU(W_v r_j + b_v),
where r_j is the visual feature representation of the j-th detected object, W_v is a weight matrix, and b_v is a bias parameter.
The picture serves as background information for the input text, and the text relates to only some of the target objects in the picture. To suppress the negative impact of irrelevant picture information, a gated attention mechanism aligns text and picture by computing word- and region-level relevance. For each object in the picture, the gated attention mechanism uses soft attention to compute the relevance of each word in the text to the object and forms an object-specific text representation t_j; then t_j and the visual feature representation r_j are multiplied element-wise, giving a text-relevance representation of each target object:
g_j = t_j ⊙ r_j.
Because the picture regions lack a natural order, the scattered information in the picture is strung together into a complete semantic expression by the bidirectional gated recurrent neural network bi-GRU. The original representation of the objects in the picture after bi-GRU encoding is
v_j = bi-GRU(g_j), j = 1, …, D,
where D is the number of objects identified by the object detector in the picture.
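A sketch of this object encoding path under the same PyTorch assumptions; the soft-attention weighting is one plausible realization of the gated attention described above, and the module and variable names are illustrative:

```python
import torch
from torch import nn
import torch.nn.functional as F

class ObjectEncoder(nn.Module):
    """Projects Faster R-CNN object features, gates them by text relevance, then runs a bi-GRU."""
    def __init__(self, visual_dim: int, text_dim: int, hidden_size: int = 512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, text_dim)    # r_j = ReLU(W_v r_j + b_v)
        self.bigru = nn.GRU(text_dim, hidden_size, batch_first=True, bidirectional=True)

    def forward(self, obj_feats: torch.Tensor, word_reps: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, D, visual_dim); word_reps: (batch, N, text_dim)
        r = F.relu(self.proj(obj_feats))                   # project into the text space
        scores = torch.bmm(r, word_reps.transpose(1, 2))   # word-object relevance (batch, D, N)
        alpha = F.softmax(scores, dim=-1)                  # soft attention over the words
        t = torch.bmm(alpha, word_reps)                    # object-specific text reps t_j
        g = t * r                                          # element-wise gating g_j = t_j ⊙ r_j
        v, _ = self.bigru(g)                               # original object reps v_j
        return v
```

Here word_reps are the per-word outputs of the text encoder, so text_dim must equal their width.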
Knowledge injection module
To construct a complete semantic representation of text and pictures, the multimodal information is naturally extended with implicit context information, forming multimodal feature representations rich in multi-view knowledge. The knowledge injection module comprises the following two parts:
(1) A knowledge acquisition module:
The visual-text event reasoner VisualCOMET provides commonsense knowledge inference along two dimensions, scene-state context and emotional-state context, for the input text and pictures. VisualCOMET uses the pre-trained autoregressive language model GPT-2 as its generation model and, given a picture or an event description, can generate inference knowledge for the three relationship types before, after, and intent, i.e., what happens before and after the event (scene-state context) and the intent of the people in the scene (emotional-state context). A commonsense inference is usually a short sentence composed of a sequence of words; the inference knowledge for the different modalities is defined as s^{M,R} = {w_1, …, w_L}, where R denotes the three relationship types, R ∈ {before, after, intent}, and M denotes the text and picture modalities, M ∈ {T, I}.
The short sentences are processed with a bidirectional gated recurrent network (bi-GRU) to obtain their representations H^{M,R}, where L is the sentence length of the inference knowledge.
(2) A multidimensional knowledge injection module:
A knowledge-aware attention layer is designed on top of the commonsense inferences at different view angles, forming multi-dimensional knowledge-aware multimodal representations. First, each knowledge inference is used to query every element of the text or picture and compute its relevance, aligning the elements of the text or picture with the knowledge inference. Specifically, given a multimodal feature map H_M (the text feature map H_T or the picture feature map H_I) and a commonsense inference representation H^{M,R}, the correlation between the input and the inference knowledge, i.e., the correlation matrix C_M, is computed as:
C_M = tanh(H_M W_M (H^{M,R})^T),
where W_M is a weight matrix.
Then an attention mechanism forms a word-level representation of the inference knowledge with respect to the input features, and this representation is added to the original representation of the input features to obtain the text representation with implicit context information and the picture representation with implicit context information.
Since the event reasoner VisualCOMET generates multiple candidate inferences for each text and picture, a correlation weight is learned for each knowledge inference in order to focus on the inferences most relevant to the input scene, and the candidates are weighted and summed to generate the knowledge-enhanced multimodal representation, where W^{M,R} is the weight matrix and Q is the number of candidate inferences.
The scene-state context consists of the pre-scene, mid-scene, and post-scene states, so averaging the three states yields the scene-state context-aware word vector representation u_i^sc; correspondingly, the scene-state context-aware picture vector representation is v_j^sc. Similarly, the emotional-state context-aware text representation u_i^em and picture representation v_j^em are obtained.
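A sketch of one knowledge-injection step for a single relationship type, assuming PyTorch; the per-candidate scoring via a small linear layer is an assumed realization of the learned correlation weights, and all shapes and names are illustrative:

```python
import torch
from torch import nn
import torch.nn.functional as F

class KnowledgeInjection(nn.Module):
    """Injects Q candidate inference representations into a text or picture feature map."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_m = nn.Parameter(torch.randn(dim, dim) * 0.01)  # W_M in C_M = tanh(H_M W_M H_kn^T)
        self.cand_scorer = nn.Linear(dim, 1)                   # learns a weight per candidate

    def forward(self, H_m: torch.Tensor, H_kn: torch.Tensor) -> torch.Tensor:
        # H_m: (batch, n, dim) input elements; H_kn: (batch, Q, L, dim) candidate inference words.
        enhanced = []
        for q in range(H_kn.size(1)):
            K = H_kn[:, q]                                     # one candidate: (batch, L, dim)
            C = torch.tanh(H_m @ self.W_m @ K.transpose(1, 2)) # correlation matrix C_M: (batch, n, L)
            attn = F.softmax(C, dim=-1)                        # align input elements with knowledge words
            enhanced.append(H_m + attn @ K)                    # add knowledge to the original reps
        E = torch.stack(enhanced, dim=1)                       # (batch, Q, n, dim)
        w = F.softmax(self.cand_scorer(E.mean(dim=2)), dim=1)  # relevance weight per candidate
        return (w.unsqueeze(-1) * E).sum(dim=1)                # weighted sum over the Q candidates
```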
Dual-attention module
To locate the irony-describing portions of text and pictures, a joint memory vector is created that collects the irony-related shared information of the two modalities by executing the text attention and picture attention mechanisms over multiple iterations. Based on the dual-attention mechanism, representations of the text and picture focused on specific regions are obtained. The dual-attention mechanism runs on three views of the multimodal input: the original representations and the two context-aware representations. For simplicity, the relationship-type superscripts are omitted in the following, i.e., the word representation is written u_i and the object representation v_j.
The dual-attention module contains the following three sub-modules:
(1) Shared memory vector
The key to identifying irony in multimodal information is to find the joint space in which both modalities describe the same thing, i.e., the irony-describing region. To this end, a joint memory vector m^(k) is designed that collects the information identified in text and picture over k iterations, accumulated from m^(k-1) and the element-wise product of the complete picture representation v^(k) and the complete text representation u^(k); the initial memory m^(0) is defined as the element-wise product of v^(0) and u^(0).
(2) Text attention mechanism
The text attention mechanism identifies the irony-describing region by computing an attention weight between each word in the text and the joint memory vector, measuring the relevance of each part of the text to the irony. Specifically, the attention weights are computed by a two-layer feed-forward neural network followed by a softmax function, whose weight matrices and bias parameters are model parameters. Finally, the complete text representation is obtained by a weighted sum of the word representations.
(3) Picture attention mechanism
The picture attention mechanism follows the same computation as the text attention mechanism. The relevance of each picture region to the joint memory vector (i.e., to the irony-related shared semantics identified after k iterations) is computed with a two-layer feed-forward neural network and a softmax function, whose weight matrices and bias parameters are likewise model parameters, and the complete picture representation is obtained by a weighted sum.
Over K iterations, the dual-attention module obtains representations of the text and picture that highlight their ironic portions, defined as u and v; a sketch of the whole loop follows.
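A sketch of this dual-attention loop, assuming PyTorch; the multiplicative memory update in the loop is an assumed reading of the accumulation described above (the patent's exact update formula is not reproduced in this text), and the two-layer scorer follows the attention descriptions:

```python
import torch
from torch import nn
import torch.nn.functional as F

class ModalityAttention(nn.Module):
    """Two-layer feed-forward scorer + softmax over the elements of one modality."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc_x = nn.Linear(dim, dim)   # transforms word/object representations
        self.fc_m = nn.Linear(dim, dim)   # transforms the joint memory vector
        self.score = nn.Linear(dim, 1)    # second layer producing scalar scores

    def forward(self, X: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # X: (batch, n, dim) word or object reps; m: (batch, dim) joint memory.
        h = torch.tanh(self.fc_x(X)) * torch.tanh(self.fc_m(m)).unsqueeze(1)
        alpha = F.softmax(self.score(h), dim=1)        # attention weight per element
        return (alpha * X).sum(dim=1)                  # weighted sum -> complete representation

def dual_attention(U, V, text_attn, img_attn, K: int = 3):
    """Runs K cooperative iterations of text and picture attention over a joint memory."""
    u, v = U.sum(dim=1), V.sum(dim=1)                  # u^(0), v^(0): sums of element reps
    m = u * v                                          # m^(0): element-wise product
    for _ in range(K):
        u = text_attn(U, m)                            # u^(k+1) focused by the shared memory
        v = img_attn(V, m)                             # v^(k+1) focused by the shared memory
        m = m * (u * v)                                # assumed memory update (see lead-in)
    return u, v
```

Here text_attn and img_attn are two ModalityAttention instances of the same width, and K = 3 matches the three interaction iterations used earlier.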
Multi-dimensional cross-modal matching module
To capture the semantic differences between text and picture, the differences are compared on the multimodal original representations and on the context-aware representations with the following deep comparison mechanism:
z = W_z([u; v; |u - v|; u ⊙ v]),
where ⊙ is the element-wise product, |·| is the element-wise absolute difference, [;] is the concatenation operation, and W_z is a trainable weight matrix. Applied to the three views, this yields the contrast representations z_raw, z_sc, and z_em.
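A minimal sketch of one matching head under the same PyTorch assumptions, combining concatenation, absolute difference, and element-wise product as listed above:

```python
import torch
from torch import nn

class CrossModalMatch(nn.Module):
    """Deep comparison of a text vector u and a picture vector v into a contrast vector z."""
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.W_z = nn.Linear(4 * dim, out_dim)   # trainable weight matrix over the 4 components

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # [u; v; |u - v|; u ⊙ v] -> z
        feats = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.W_z(feats)
```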
Prediction module
The multi-dimensional cross-modal contrast representations (z_raw, z_sc, z_em) are concatenated and fed into the fully connected layers, and binary classification is performed with a Sigmoid function:
H = fc([z_raw; z_sc; z_em]),
y = Sigmoid(H).
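A sketch of this prediction head, assuming PyTorch; the two-layer fully connected network mirrors the classification-module description earlier, and the hidden width is an illustrative choice:

```python
import torch
from torch import nn

class PredictionHead(nn.Module):
    """Concatenates the three contrast vectors and outputs an irony probability."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))   # two-layer fully connected network

    def forward(self, z_raw, z_sc, z_em):
        H = self.fc(torch.cat([z_raw, z_sc, z_em], dim=-1))
        return torch.sigmoid(H)                          # y = Sigmoid(H)
```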
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art may modify or substitute the technical solution of the present invention without departing from its spirit and scope, and the scope of protection of the present invention shall be determined by the claims.
Claims (10)
1. A knowledge injection-based multi-modal irony recognition method of a dual-attention network, the method comprising:
acquiring data content to be identified, wherein the data content comprises a number of <text, picture> pairs, each text containing a number of words i and each picture involving a plurality of objects j;
encoding the words i in the text and the objects j in the picture respectively, obtaining an original representation u_i of each word and an original representation v_j of each object;
expanding the original word representations u_i and object representations v_j based on the implicit context information of the data content, obtaining context-aware word representations and context-aware object representations;
performing attention calculation with the dual-attention network on the original representations u_i and v_j and on the context-aware word and object representations, obtaining the attention results of the original and context-aware representations;
obtaining an original cross-modal contrast representation and a context-aware cross-modal contrast representation by comparing the differences between text and picture according to those attention results;
calculating the irony recognition result of the data content based on the original and context-aware cross-modal contrast representations.
2. The method of claim 1, wherein encoding the objects j in the picture to obtain the original object representations v_j comprises:
for each picture, detecting the region of each object j with a pre-trained object detector and taking the pooled features before the multi-class classification layer as the visual feature representation r_j of object j;
projecting the visual features r_j into the space of the text representations;
obtaining an object-specific text representation t_j by calculating the relevance of each word i in the text to object j;
computing a text-relevance representation g_j of object j based on the text representation t_j and the visual feature representation r_j.
3. The method of claim 1, wherein expanding the original word representations u_i and object representations v_j based on the implicit context information of the data content to obtain the context-aware word and object representations comprises:
generating different types of inference knowledge for the event description in each picture or text and computing the representation H^{M,R} of the inference knowledge, where w_l (1 ≤ l ≤ L) denotes a word of the inference knowledge, L is the length of the inference knowledge, the relationship type R ∈ {before, after, intent} (before denotes the pre-event relationship type, after the post-event relationship type, and intent the relationship type of the intent of the people in the scene), and the modality M denotes the text or picture modality;
computing a correlation matrix C_M between the data content and the inference knowledge, based on the text feature map H_T composed of the word representations u_i, the picture feature map H_I composed of the object representations v_j, and the commonsense inference representation H^{M,R};
obtaining, based on the correlation matrix C_M, word and object representations that carry implicit context information;
learning a correlation weight for each piece of inference knowledge and computing enhanced word and object representations;
computing, based on the enhanced representations, the context-aware word representations and object representations, wherein the context-aware word representation comprises a scene-state context-aware representation u_i^sc and an emotional-state context-aware representation u_i^em, and the context-aware object representation comprises a scene-state context-aware representation v_j^sc and an emotional-state context-aware representation v_j^em.
4. The method of claim 3, wherein obtaining, based on the correlation matrix C_M, the word and object representations carrying implicit context information comprises:
computing, with an attention mechanism based on the correlation matrix C_M and the commonsense inference representation H^{M,R}, a word-level representation and an object-level representation of the inference knowledge;
adding the word-level representation to the original word representations u_i and the object-level representation to the original object representations v_j, respectively, to obtain the word and object representations with implicit context information.
5. The method of claim 3, wherein computing the context-aware word representations based on the enhanced representations comprises:
splitting the enhanced word representation by relationship type into an enhanced representation of the pre-event (before) type, an enhanced representation of the post-event (after) type, and an enhanced representation of the intent type;
computing the scene-state context-aware word representation u_i^sc from the before-type and after-type enhanced representations together with the original word representation u_i;
obtaining the emotional-state context-aware word representation u_i^em from the intent-type enhanced representation.
6. The method of claim 1, wherein performing attention calculation with the dual-attention network on the original word representations u_i and object representations v_j to obtain the attention result of the original representations comprises:
summing the original word representations to obtain a complete text representation u^(0), and summing the original object representations in the picture to obtain a complete picture representation v^(0);
calculating a joint memory vector m^(0) from the representations u^(0) and v^(0);
iteratively updating the joint memory vector from its previous value and the element-wise product of the text and picture representations, obtaining the joint memory vector m^(K) when the iteration ends, where K denotes the total number of iterations, u^(k) denotes the complete text representation at the k-th iteration, v^(k) denotes the complete picture representation at the k-th iteration, and ⊙ is the element-wise product;
performing attention calculation with the dual-attention network based on the joint memory vector m^(K), obtaining the complete text representation u^(K+1) and the complete picture representation v^(K+1);
taking the complete text representation u^(K+1) and the complete picture representation v^(K+1) as the attention result of the original representations.
7. The method of claim 6, wherein computing the complete text representation u^(K+1) based on the joint memory vector m^(K) comprises:
computing the output of a feed-forward neural network in the dual-attention network from the joint memory vector m^(K) and the original word representations u_i;
substituting that output into a softmax function to obtain the attention weights;
performing a weighted sum of the original word representations u_i with the attention weights to obtain the complete text representation u^(K+1).
8. The method of claim 6, wherein obtaining the original cross-modal contrast representation by comparing the differences between text and picture according to the attention result of the original representations comprises:
computing the original cross-modal contrast representation z_raw = W_raw([u^(K+1); v^(K+1); |u^(K+1) - v^(K+1)|; u^(K+1) ⊙ v^(K+1)]), where W_raw denotes a trainable weight matrix, |·| is the element-wise absolute difference, ⊙ is the element-wise product, and [;] denotes the concatenation operation.
9. The method of claim 1, wherein calculating the irony recognition result of the data content based on the original cross-modal contrast representation and the context-aware cross-modal contrast representations comprises:
concatenating the original cross-modal contrast representation and the context-aware cross-modal contrast representations;
feeding the concatenation into a fully connected layer and performing binary irony classification with a Sigmoid function to obtain the irony recognition result of the data content.
10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program that is loaded and executed by the processor to implement the method of any one of claims 1-9.
Priority application
- CN202210863424.5A, filed 2022-07-21: Knowledge injection-based multi-modal irony recognition method of a dual-attention network
Publication
- CN115408517A, published 2022-11-29; status: pending
Legal events
- PB01: Publication
- SE01: Entry into force of request for substantive examination