CN115408517A - Knowledge injection-based multi-modal irony recognition method of double-attention network - Google Patents

Knowledge injection-based multi-modal irony recognition method of double-attention network

Info

Publication number
CN115408517A
Authority
CN
China
Prior art keywords
representation
original
context
text
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210863424.5A
Other languages
Chinese (zh)
Inventor
亢良伊
刘杰
叶丹
周志阳
李硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202210863424.5A priority Critical patent/CN115408517A/en
Publication of CN115408517A publication Critical patent/CN115408517A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a knowledge-injection-based multimodal sarcasm recognition method using a dual-attention network, which comprises the following steps: acquiring data content to be identified, wherein the data content to be identified comprises a plurality of text and picture pairs; encoding the words in the text and the objects in the picture to obtain original representations; expanding the original representations based on the implicit context information of the data content to be identified to obtain context-aware representations; obtaining attention calculation results for the original representations and the context-aware representations; calculating an original cross-modal comparison representation and a context-aware cross-modal comparison representation according to the attention calculation results; and calculating a sarcasm recognition result based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation. The invention helps to improve the overall performance of sarcasm recognition, facilitates practical application of the model, and provides interpretability for the prediction results.

Description

Knowledge injection-based multi-modal irony recognition method of double-attention network
Technical Field
The invention relates to a knowledge injection-based multi-modal sarcasm recognition method of a double-attention network, and belongs to the technical field of multi-modal information recognition.
Background
Multimodal irony aims to implicitly express strong emotion through the contrast between the text and the figurative scene depicted in an accompanying picture. Currently, irony based on text and pictures is ubiquitous on social platforms such as microblogs and Twitter. Because irony reverses the polarity of the emotion or opinion expressed in the text, automatic detection of multimodal irony is of great importance for customer service, opinion mining, and other tasks that require understanding a person's true emotion.
Real-world multimodal irony detection is quite complex. The semantics of user-generated content are affected not only by the explicit content but also by the implicit context. Explicit content refers to the scene content observable in the input text or picture; implicit context refers to inferred knowledge about the scene that is not visible in the input, including how the scene develops and the intent of the people in it. Irony recognition requires accurately locating the irony-describing portions of the multimodal information and discriminating their semantic differences on the basis of a complete semantic representation of the text and the picture. However, existing multimodal irony detection methods learn features only from the input text and pictures and ignore the implicit context behind the content. Moreover, they model semantic differences across modalities on the raw, full content of the text and the picture, which easily introduces noise, reduces the accuracy of irony recognition, and hinders practical application of the models. How to inject implicit context information into the multimodal input to obtain better feature representations, and how to accurately locate the irony-describing regions on this basis for semantic-difference recognition, are urgent problems to be solved.
Disclosure of Invention
In order to solve the above problems, the knowledge-injection-based multimodal sarcasm recognition method of a dual-attention network provided by the invention uses a knowledge-enhanced multi-dimensional attention module to inject implicit context knowledge into the multimodal input representation, and divides the implicit context knowledge into two views, the scene state and the emotional state, according to the human reasoning mode, so as to construct a complete semantic representation of the multimodal information. At the same time, the picture and text attention modules are executed cooperatively by a dual-attention network based on a joint memory vector that gathers previous attention results, capturing the sarcasm-related shared semantics in the text and the picture. Finally, based on the joint embedding space, a multi-dimensional cross-modal matching layer is adopted to distinguish the differences between the modalities from multiple dimensions. This helps to improve the overall performance of sarcasm recognition, provides interpretability for the prediction results, and facilitates practical application of the model.
The technical content of the invention comprises:
A knowledge-injection-based multimodal irony recognition method of a dual-attention network, the method comprising:
acquiring data content to be identified, wherein the data content to be identified comprises a number of text-picture pairs, the text containing a number of words i and the picture relating to a plurality of objects j;
respectively encoding the words i in the text and the objects j in the picture to obtain original representations u_i of the words and original representations v_j of the objects;
expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain word context-aware representations and object context-aware representations;
performing attention calculation, using a dual-attention network, on the original representations u_i and v_j and on the word and object context-aware representations, respectively, to obtain attention calculation results for the original representation and the context-aware representation;
obtaining an original cross-modal comparison representation and a context-aware cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation results of the original representation and the context-aware representation;
calculating an irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation.
Further, encoding the objects j in the picture to obtain the original representations v_j of the objects comprises:
for each picture, detecting the region of each object j from the picture by using a pre-trained object detector, and taking the pooled feature before the multi-class classification layer as the visual feature representation r_j of the object j;
projecting the visual feature representation r_j into the space of the text representation;
obtaining a text representation t_j specific to the object j by calculating the relevance of each word i in the text to the object j;
calculating a representation g_j reflecting the text relevance of the object j based on the text representation t_j and the visual feature representation r_j;
inputting the object sequence formed from the visual feature representations r_j, weighted by the representations g_j, into a bidirectional gated recurrent neural network to obtain the original representation v_j of each object j.
Further, expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain the word context-aware representations and the object context-aware representations, comprises:
generating different types of inference knowledge {w_1, w_2, …, w_L} for the event description in each picture or text, and calculating its common-sense inference representation H^{M,R}, wherein w_l denotes a word in the inference knowledge, 1 ≤ l ≤ L, L denotes the length of the inference knowledge, the relation type R ∈ {before, after, intent}, before denoting the pre-event relation type, after denoting the post-event relation type, and intent denoting the person-intention relation type in the scene, and the modality M denotes the text modality or the picture modality;
calculating a correlation matrix C^M between the data content to be identified and the inference knowledge, based on the text feature map H^T composed of the original representations u_i of the words, the picture feature map H^I composed of the original representations v_j of the objects, and the common-sense inference representation H^{M,R};
obtaining, based on the correlation matrix C^M, representations e_i^R of the words with implicit context information and representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j;
learning a correlation weight for each piece of inference knowledge and computing enhanced representations k_i^R and k_j^R from the representations e_i^R and e_j^R;
computing the word context-aware representations and the object context-aware representations based on the enhanced representations k_i^R and k_j^R, wherein the word context-aware vector representation comprises a scene-state context-aware representation u_i^{sc} and an emotional-state context-aware representation u_i^{em}, and the object context-aware vector representation comprises a scene-state context-aware representation v_j^{sc} and an emotional-state context-aware representation v_j^{em}.
Further, obtaining the representations e_i^R of the words with implicit context information and the representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j based on the correlation matrix C^M comprises:
computing, based on the correlation matrix C^M and the common-sense inference representation H^{M,R}, a word-level representation a_i^R and an object-level representation a_j^R of the inference knowledge by using an attention mechanism;
adding the word-level representation a_i^R to the original representation u_i of the word and the object-level representation a_j^R to the original representation v_j of the object, respectively, to obtain the representations e_i^R and e_j^R.
Further, computing the word context-aware representations based on the enhanced representations k_i^R comprises:
dividing the enhanced representations k_i^R into an enhanced representation k_i^{before} of the pre-event relation type, an enhanced representation k_i^{after} of the post-event relation type, and an enhanced representation k_i^{intent} of the intent relation type;
computing the scene-state context-aware representation u_i^{sc} of the word according to the enhanced representation k_i^{before}, the original representation u_i of the word, and the enhanced representation k_i^{after};
obtaining the emotional-state context-aware representation u_i^{em} of the word according to the enhanced representation k_i^{intent}.
Further, performing attention calculation on the word original representations u_i and the object original representations v_j using the dual-attention network to obtain the attention calculation result of the original representation comprises:
summing the original representations u_i of the words to obtain a complete text representation u^(0);
summing the original representations v_j of the objects in the picture to obtain a complete picture representation v^(0);
calculating a joint memory vector m^(0) from the representation u^(0) and the representation v^(0);
performing iterative calculation in which the joint memory vector is updated with the element-wise product u^(k) ⊙ v^(k), and obtaining the joint memory vector m^(K) after the iteration ends, where K denotes the total number of iterations, u^(k) denotes the complete text representation at the k-th iteration, v^(k) denotes the complete picture representation at the k-th iteration, and ⊙ is the element-wise product;
performing attention calculation with the dual-attention network based on the joint memory vector m^(K) to obtain the complete text representation u^(K+1) and the complete picture representation v^(K+1);
taking the complete text representation u^(K+1) and the complete picture representation v^(K+1) as the attention calculation result of the original representation.
Further, computing the complete text representation u^(K+1) based on the joint memory vector m^(K) comprises:
computing the output s_i of a feedforward neural network in the dual-attention network from the joint memory vector m^(K) and the word original representation u_i;
substituting the output s_i into a softmax function to obtain the attention weight α_i;
performing a weighted summation of the word original representations u_i with the attention weights α_i to obtain the complete text representation u^(K+1).
Further, obtaining the original cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation result of the original representation comprises:
computing the original cross-modal comparison representation z_raw = W_raw [u^(K+1) ⊙ v^(K+1) ; |u^(K+1) - v^(K+1)|], wherein W_raw denotes a trainable weight matrix, ⊙ is the element-wise product, | · | is the absolute value of the element difference, and ; denotes the concatenation operation.
Further, calculating the irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation comprises:
concatenating the original cross-modal comparison representation and the context-aware cross-modal comparison representation;
inputting the concatenation result into a fully connected layer and performing binary irony classification with a Sigmoid function to obtain the irony recognition result of the data content to be identified.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program that is loaded and executed by the processor to implement any of the methods described above.
Compared with the prior art, the invention has the advantages that:
(1) The method uses VisualCOMET to provide implicit context information, namely scene-state context and emotional-state context, for the text and picture modality information, injects the implicit context into the multimodal input through a knowledge-enhanced multi-dimensional attention module, and generates context-aware text and picture representations, so that a complete semantic context is constructed for the multimodal information.
(2) The designed dual-attention network can accurately locate the ironic regions in the multimodal information by applying the text and picture attention over multiple iterations. Meanwhile, the dual attention is applied separately to the original representation and to the context-aware representations, capturing multimodal information representations from multiple angles. Based on the multi-dimensional shared representation space, a multi-dimensional cross-modal module is adopted to distinguish the semantic differences between the text and the picture, so that irony is accurately identified.
(3) Compared with existing methods, the method achieves higher performance, and the dual-attention module, combined with the injected knowledge, can provide interpretability for the prediction results.
Drawings
FIG. 1 is a system model flow diagram of the present invention.
FIG. 2 is a system model architecture diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely specific embodiments of the present invention, rather than all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The technical problem addressed by the invention is as follows: aiming at the problem of multimodal sarcasm recognition, a knowledge-injection-based dual-attention-network method is provided. On the one hand, a knowledge-enhanced multi-dimensional attention model is adopted to construct a complete semantic representation of the multimodal information; on the other hand, a dual-attention mechanism maintains a shared vector, the sarcasm-related shared semantics in the multimodal information are extracted through the text and picture attention modules, and the differences of the multimodal scene containing the sarcastic context are modeled through a multi-dimensional cross-modal matching layer, so that the overall performance of sarcasm recognition is improved and the prediction results are given a certain interpretability.
The technical scheme of the invention is as follows: a knowledge-injection-based dual-attention-network multimodal irony recognition method injects implicit context knowledge into the multimodal representation to construct the complete semantics of the multimodal information, and uses a dual-attention network to capture the irony-describing regions in the multimodal representation for multi-dimensional semantic comparison. In the model, the input text and pictures are first encoded into vector representations, and the objects in the text and the picture are aligned with an attention mechanism so as to filter irrelevant information in the picture. Then, to supplement the implicit context information lacking in the text and the picture, an event knowledge graph is used to generate a scene context and an emotional context for the text and the picture, and the acquired knowledge is injected into the multimodal input through a knowledge-enhanced multi-dimensional attention module to construct a complete semantic encoding of the multimodal information. To attend to the irony regions in the text and the picture, a dual-attention module that executes the text and picture attention cooperatively is proposed, which captures the shared semantics across modalities in both the original encoding and the complete semantic encoding of the multimodal information by maintaining a joint memory vector. Based on the joint embedding space, multi-dimensional cross-modal matching is adopted to distinguish the multimodal differences in multiple dimensions. Finally, the multiple comparison results are concatenated and input to a classifier for multimodal irony detection.
Fig. 1 is a flow chart of the system model of the present invention. As shown in Fig. 1, the system of the present invention comprises an input encoding module, a knowledge injection module, a dual-attention interaction module, a multi-dimensional cross-modal matching module, and a classification prediction module.
First, in the encoding module, for word preprocessing, text sentences are tokenized with the NLTK toolkit, and the resulting words are embedded using 200-dimensional vectors generated by the GloVe algorithm as initialization; for picture preprocessing, picture objects and their features are extracted with Faster R-CNN and refined in the training stage. The hidden size of the bi-GRU is 512 dimensions.
In the knowledge injection module, for event inference knowledge, inference knowledge of the three relation types "before", "after", and "intent" is generated for the text and the picture with the VisualCOMET event inference generator. For example, the input text is "this is why I want to be mom" and the picture depicts "a woman standing in the room with a broom". The three types of relation inferences generated by VisualCOMET for the text are as follows: before: [be in a family room, put on her school uniform, …]; after: [play scales with friends, tell the wrong, …]; intent: [stay at home cozy, make her mom happy, …]. The three types of relation inferences generated for the picture are as follows: before: [put on an apron, be from a school, …]; after: [clean the room, finish her housework, …]; intent: [cleaning the pole with a broom, play with friends, …]. In practical application, 15 inference-knowledge candidates are kept for the text and the picture on each relation type. The knowledge injection module is then used to obtain a multimodal information representation enhanced with implicit context knowledge, constructing the complete semantic information of the text and the picture.
In the dual-attention interaction module, the text and picture attention mechanisms obtain representations of the text and the picture focused on the irony-describing regions through 3 interaction iterations.
In the multi-dimensional cross-modal matching module, the text and the picture are compared on the original representation, the scene-state context-aware representation, and the emotional-state context-aware representation.
In the classification module, the multimodal comparison results of the three dimensions are concatenated and input into an output layer consisting of a two-layer fully connected network and Softmax for irony classification. In training, the batch size is 32, the learning rate is 0.0005, and Adam is used as the optimizer.
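For illustration only and not as part of the original disclosure, the following Python sketch gathers the training and preprocessing hyperparameters stated in this embodiment into one configuration object; all class, field, and function names are assumptions introduced for the example.

```python
from dataclasses import dataclass

import torch


@dataclass
class TrainConfig:
    # Values stated in this embodiment; the field names are illustrative.
    word_embed_dim: int = 200          # GloVe word-vector dimension
    gru_hidden_size: int = 512         # bi-GRU hidden size
    knowledge_candidates: int = 15     # VisualCOMET candidates kept per relation type
    dual_attention_iters: int = 3      # text/picture attention interaction rounds
    batch_size: int = 32
    learning_rate: float = 0.0005


def build_optimizer(model: torch.nn.Module, cfg: TrainConfig) -> torch.optim.Optimizer:
    # Adam is the optimizer named in the embodiment.
    return torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)
```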
FIG. 2 is a system model architecture diagram of the present invention, as shown in FIG. 2:
Input encoding module
The input encoding module performs feature extraction on the input text and picture modality information and encodes them into a unified vector space. Its input is a sentence containing a series of words and a picture containing a plurality of objects; its output is the vector representations produced by the following operations. The input encoding module comprises the following two parts:
(1) Text encoding module:
Given a sequence of text words w_1, w_2, …, w_N, a bidirectional gated recurrent neural network (bi-GRU) is adopted to learn a sequential semantic representation of the words in the text and encode it in vector form:

u_i = bi-GRU(x_i), 1 ≤ i ≤ N,

where x_i is the embedding of the word w_i, u_i is the hidden state of the i-th word output by the bi-GRU unit, and N is the number of words in the sentence, i.e., the sentence length. After bi-GRU encoding, the original representation of the word sequence is H^T = [u_1, u_2, …, u_N].
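As an illustrative sketch only, the bi-GRU text encoder described above could be written in PyTorch roughly as follows; the class name, the per-direction hidden size, and the vocabulary handling are assumptions rather than details given in the disclosure.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Runs a bi-GRU over GloVe-initialized word embeddings and returns one
    hidden-state vector u_i per word (the original word representations)."""

    def __init__(self, vocab_size: int, embed_dim: int = 200, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # load GloVe weights in practice
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, N) -> output: (batch, N, 2 * hidden)
        emb = self.embed(token_ids)
        states, _ = self.bigru(emb)
        return states


# Usage: a batch of 2 sentences with 12 tokens each.
encoder = TextEncoder(vocab_size=30000)
u = encoder(torch.randint(0, 30000, (2, 12)))  # shape (2, 12, 1024)
```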
(2) Picture encoding module
A picture involves a plurality of objects. In order to filter irrelevant information in the picture and to avoid the incomplete object semantics caused by dividing the picture into equal regions, the objects related to the text are directly extracted from the picture and represented by features.
For each picture I, D salient objects are detected from the picture with a pre-trained object detector, Faster R-CNN, and the pooled feature before the multi-class classification layer is taken as the feature representation of each object. The visual feature of each extracted object is then projected into the space of the text representation:

r_j = ReLU(W_v r_j + b_v),

where r_j is the visual feature representation of the j-th detected object, W_v is a weight matrix, and b_v is a bias parameter.
The picture serves as background information for the input text, and the text relates to only some of the target objects in the picture. To suppress the negative impact of irrelevant information in the picture, a gated attention mechanism is used to align the text and the picture by computing word- and region-level relevance. For each object in the picture, the gated attention mechanism uses soft attention to compute the relevance β_ij of each word i in the text to the object j and to form a text representation specific to the object,

t_j = Σ_{i=1}^{N} β_ij · u_i,

and then t_j and the visual feature representation r_j are multiplied element-wise to obtain a text-correlated representation of each target object,

g_j = t_j ⊙ r_j.

Because the picture regions lack a natural order, the scattered information in the picture is connected into a complete semantic expression by the bidirectional gated recurrent neural network bi-GRU:

v_j = bi-GRU(g_j), 1 ≤ j ≤ D.

After bi-GRU encoding, the original representation of the objects in the picture is H^I = [v_1, v_2, …, v_D], where D is the number of objects identified by the object detector in the picture.
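A hedged PyTorch-style sketch of the picture encoding module described above: projecting the detected-object features, aligning each object with the text by soft attention, gating with the element-wise product, and running a bi-GRU over the object sequence. The bilinear relevance score W_a and every name here are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectEncoder(nn.Module):
    """Aligns Faster R-CNN object features with the text, then encodes the
    object sequence with a bi-GRU to obtain original object representations v_j."""

    def __init__(self, obj_dim: int = 2048, text_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(obj_dim, text_dim)                   # r_j = ReLU(W_v r_j + b_v)
        self.W_a = nn.Parameter(torch.empty(text_dim, text_dim))   # assumed relevance scorer
        nn.init.xavier_uniform_(self.W_a)
        self.bigru = nn.GRU(text_dim, text_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, words: torch.Tensor, objects: torch.Tensor) -> torch.Tensor:
        # words:   (B, N, text_dim) word representations u_i
        # objects: (B, D, obj_dim)  pooled object features from the detector
        r = F.relu(self.proj(objects))                               # (B, D, text_dim)
        scores = torch.einsum("bnh,hk,bdk->bnd", words, self.W_a, r)
        beta = F.softmax(scores, dim=1)                              # relevance of word i to object j
        t = torch.einsum("bnd,bnh->bdh", beta, words)                # object-specific text t_j
        g = t * r                                                    # gate: g_j = t_j ⊙ r_j
        v, _ = self.bigru(g)                                         # (B, D, text_dim)
        return v
```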
Knowledge injection module
In order to construct a complete semantic representation of the text and the picture, the multimodal information is expanded with implicit context information, so that a multimodal feature representation enriched with multi-view knowledge is formed. The knowledge injection module comprises the following two parts:
(1) Knowledge acquisition module:
A visual-textual event reasoner, VisualCOMET, is used to provide common-sense knowledge inferences along two dimensions, the scene-state context and the emotional-state context, for the input text and pictures. VisualCOMET uses the pre-trained auto-regressive language model GPT-2 as its generation model and, given a picture or an event description, can generate inference knowledge about three relation types, before, after, and intent, i.e., what happens before and after the event (scene-state context) and the intention of the person in the scene (emotional-state context). A common-sense inference is usually a short phrase composed of a sequence of words, and the inference knowledge of relation type R for modality M is defined as a word sequence {w_1, w_2, …, w_L}, where R ∈ {before, after, intent} and M ∈ {T, I} (text or picture).
A bidirectional gated recurrent network (bi-GRU) is used to process these short phrases and obtain their representation H^{M,R}, where L is the sentence length of the inference knowledge.
(2) Multi-dimensional knowledge injection module:
A knowledge-aware attention layer is designed on the basis of the common-sense inferences from the different views, so as to form a multi-dimensional knowledge-aware multimodal representation. First, each element of the text or the picture is queried with each knowledge inference and their relevance is calculated, so that the elements of the text or the picture are aligned with the knowledge inference. Specifically, given a multimodal feature representation H^M (the text feature map H^T = [u_1, …, u_N] or the picture feature map H^I = [v_1, …, v_D]) and the common-sense inference representation H^{M,R}, the correlation between the input and the inference knowledge, i.e., the correlation matrix C^M, is calculated as

C^M = tanh(H^M W^M (H^{M,R})^T),

where W^M is a weight matrix.
Then an attention mechanism is used to form a word-level (object-level) representation A^{M,R} of the inference knowledge with respect to the input features, which is added to the original representation of the input features to obtain the representation E^{T,R} of the text with implicit context information and the representation E^{I,R} of the picture with implicit context information:

A^{M,R} = softmax(C^M) H^{M,R},
E^{M,R} = H^M + A^{M,R}.

Since the event reasoner VisualCOMET generates multiple candidate inferences for the text and the picture, a relevance weight α_q is learned for each knowledge inference in order to focus on the inferences that are more relevant to the input scene, and the weighted sum of the candidates yields the knowledge-enhanced representation of the multimodal information:

K^{M,R} = Σ_{q=1}^{Q} α_q E_q^{M,R},

where the weight α_q is computed from E_q^{M,R} with the weight matrix W^{M,R} and normalized with a softmax, and Q is the number of candidate inference knowledge items.
The scene-state context consists of the pre-scene, the current scene, and the post-scene, so averaging the three states yields the scene-state context-aware word vector representation:

u_i^{sc} = (k_i^{before} + u_i + k_i^{after}) / 3,

where k_i^R denotes the i-th row of K^{T,R}. Accordingly, the scene-state context-aware picture vector representation v_j^{sc} is obtained. Similarly, the emotional-state context-aware text representation u_i^{em} = k_i^{intent} and the emotional-state context-aware picture representation v_j^{em} are obtained.
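The knowledge-aware attention layer described above can be sketched as follows for one modality and one relation type. The correlation matrix and the add-then-weight structure follow the formulas given above; the per-candidate relevance scorer and all names are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeInjection(nn.Module):
    """Injects one relation type of inference knowledge into the representation
    H_m of one modality (text word vectors or picture object vectors)."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.W_m = nn.Parameter(torch.empty(dim, dim))   # correlation weight W^M
        nn.init.xavier_uniform_(self.W_m)
        self.cand_score = nn.Linear(dim, 1)              # assumed per-candidate relevance scorer

    def forward(self, H_m: torch.Tensor, H_know: torch.Tensor) -> torch.Tensor:
        # H_m:    (B, S, dim)     original word/object representations
        # H_know: (B, Q, L, dim)  Q candidate inferences, each a phrase of L words
        Q = H_know.shape[1]
        enhanced = []
        for q in range(Q):
            K = H_know[:, q]                                        # (B, L, dim)
            C = torch.tanh(H_m @ self.W_m @ K.transpose(1, 2))      # C^M = tanh(H^M W^M K^T)
            A = F.softmax(C, dim=-1) @ K                            # knowledge summary per element
            enhanced.append(H_m + A)                                # add to the original features
        enhanced = torch.stack(enhanced, dim=1)                     # (B, Q, S, dim)
        # learn a relevance weight per candidate and take the weighted sum over candidates
        w = F.softmax(self.cand_score(enhanced.mean(dim=2)), dim=1)  # (B, Q, 1)
        return (w.unsqueeze(-1) * enhanced).sum(dim=1)               # (B, S, dim)


# Assumed usage for the scene-state and emotional-state views:
#   H_sc = (inject_before(H, K_before) + H + inject_after(H, K_after)) / 3
#   H_em = inject_intent(H, K_intent)
```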
Dual attention module
In order to locate the irony-describing portions of the text and the picture, a joint memory vector is created that collects the irony-related shared information of the two modalities by executing the text attention and picture attention mechanisms over multiple iterations. Based on the dual-attention mechanism, representations of the text and the picture that focus on specific regions are obtained. The dual-attention mechanism is executed on three views of the multimodal information: the original representation, the scene-state context-aware representation, and the emotional-state context-aware representation. For simplicity of expression, the view superscript is omitted in the following description, i.e., the word representations are written as u_i and the object representations as v_j.
The dual attention module contains the following three sub-modules:
(1) Shared vector
The key to identifying irony in multimodal information is to find the joint space that describes the same thing, i.e., the region that describes the irony. To this end, a joint memory vector is designed to collect the information that has been identified in the text and the picture over k iterations:

m^(k) = m^(k-1) + u^(k) ⊙ v^(k),

where u^(k) and v^(k) are the complete representations of the text and the picture, and the initial memory representation m^(0) is defined as the element-wise product of u^(0) and v^(0):

u^(0) = Σ_{i=1}^{N} u_i, v^(0) = Σ_{j=1}^{D} v_j, m^(0) = u^(0) ⊙ v^(0).
(2) Text attention mechanism
The text attention mechanism identifies the region describing the irony and measures the irony-related relevance of each part of the text by computing an attention weight between each word in the text and the joint memory vector. Specifically, the attention weight α_i^(k) is calculated by a two-layer feedforward neural network and a softmax function:

s_i^(k) = w_2^T tanh(W_1 [u_i ; m^(k)] + b_1) + b_2,
α_i^(k) = softmax(s_i^(k)),

where W_1 and w_2 are model parameters and b_1 and b_2 are bias parameters.
Finally, the complete representation of the text is obtained by weighted summation:

u^(k+1) = Σ_{i=1}^{N} α_i^(k) u_i.
(3) Picture attention mechanism
The calculation process is the same as that of the text attention mechanism. First, the relevance of each region in the picture to the joint memory vector (i.e., the irony-related shared semantics identified after k iterations) is calculated with a two-layer feedforward neural network and a softmax function, and the complete representation of the picture is obtained by weighted summation:

s_j^(k) = w_4^T tanh(W_3 [v_j ; m^(k)] + b_3) + b_4,
α_j^(k) = softmax(s_j^(k)),
v^(k+1) = Σ_{j=1}^{D} α_j^(k) v_j,

where W_3 and w_4 are model parameters and b_3 and b_4 are bias parameters.
After K iterations, the dual attention module obtains representations of the text and the picture that highlight the ironic portions, denoted u and v.
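A hedged sketch of the dual-attention interaction applied to one view of the multimodal information. The two-layer feed-forward scorer follows the description above; the exact memory-update rule (accumulating the element-wise product of the attended text and picture vectors) and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAttention(nn.Module):
    """Two-layer feed-forward scorer plus softmax over the elements of one
    modality, conditioned on the joint memory vector m."""

    def __init__(self, dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, elems: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # elems: (B, S, dim) word reps u_i or object reps v_j; m: (B, dim)
        s = self.ff(torch.cat([elems, m.unsqueeze(1).expand_as(elems)], dim=-1))
        alpha = F.softmax(s, dim=1)                 # attention weights over the S elements
        return (alpha * elems).sum(dim=1)           # weighted sum -> complete representation


def dual_attention(u_elems, v_elems, text_att, img_att, num_iters: int = 3):
    """Co-attention over text and picture driven by a joint memory vector."""
    u = u_elems.sum(dim=1)          # u^(0): sum of the word representations
    v = v_elems.sum(dim=1)          # v^(0): sum of the object representations
    m = u * v                       # m^(0): element-wise product
    for _ in range(num_iters):
        u = text_att(u_elems, m)    # text attention re-reads the words
        v = img_att(v_elems, m)     # picture attention re-reads the objects
        m = m + u * v               # assumed update: accumulate the shared information
    return u, v
```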
Multi-dimensional cross-modal matching module
To capture the semantic differences between the text and the picture, the differences between them are compared on the multimodal original representation and the context-aware representations using the following deep comparison mechanism:

z = W_z [u ⊙ v ; |u - v|],

where ⊙ is the element-wise product, | · | is the absolute value of the element-wise difference, ; is the concatenation operation, and W_z is a trainable weight matrix.
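For illustration, a minimal sketch of the comparison layer above: the element-wise product and the absolute difference of the attended text and picture vectors are concatenated and passed through a trainable projection. The output dimension and names are assumptions.

```python
import torch
import torch.nn as nn


class CrossModalMatch(nn.Module):
    """z = W_z [u ⊙ v ; |u - v|] for one view (raw, scene-state, or emotional-state)."""

    def __init__(self, dim: int = 1024, out_dim: int = 512):
        super().__init__()
        self.W_z = nn.Linear(2 * dim, out_dim)

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.W_z(torch.cat([u * v, (u - v).abs()], dim=-1))
```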
Prediction module
The multi-dimensional cross-modal comparison representations (z_raw, z_sc, z_em) obtained above are concatenated and input into the fully connected layer, and binary classification is performed with the Sigmoid function:

H = fc([z_raw ; z_sc ; z_em]),
y = Sigmoid(H).
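A minimal sketch of the prediction head described above, assuming a two-layer fully connected network before the Sigmoid output; the hidden size and all names are illustrative.

```python
import torch
import torch.nn as nn


class IronyClassifier(nn.Module):
    """Concatenates the three comparison vectors (raw, scene-state, emotional-state)
    and predicts a binary irony label with a Sigmoid output."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z_raw, z_sc, z_em):
        return torch.sigmoid(self.fc(torch.cat([z_raw, z_sc, z_em], dim=-1)))
```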
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A knowledge-injection-based multimodal irony recognition method of a dual-attention network, the method comprising:
acquiring data content to be identified, wherein the data content to be identified comprises a number of text-picture pairs, the text containing a number of words i and the picture relating to a plurality of objects j;
respectively encoding the words i in the text and the objects j in the picture to obtain original representations u_i of the words and original representations v_j of the objects;
expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain word context-aware representations and object context-aware representations;
performing attention calculation, using a dual-attention network, on the original representations u_i and v_j and on the word and object context-aware representations, respectively, to obtain attention calculation results for the original representation and the context-aware representation;
obtaining an original cross-modal comparison representation and a context-aware cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation results of the original representation and the context-aware representation;
calculating an irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation.
2. The method of claim 1, wherein encoding the objects j in the picture to obtain the original representations v_j of the objects comprises:
for each picture, detecting the region of each object j from the picture by using a pre-trained object detector, and taking the pooled feature before the multi-class classification layer as the visual feature representation r_j of the object j;
projecting the visual feature representation r_j into the space of the text representation;
obtaining a text representation t_j specific to the object j by calculating the relevance of each word i in the text to the object j;
calculating a representation g_j reflecting the text relevance of the object j based on the text representation t_j and the visual feature representation r_j;
inputting the object sequence formed from the visual feature representations r_j, weighted by the representations g_j, into a bidirectional gated recurrent neural network to obtain the original representation v_j of each object j.
3. The method of claim 1, wherein expanding the original representations u_i of the words and v_j of the objects based on implicit context information of the data content to be identified, to obtain the word context-aware representations and the object context-aware representations, comprises:
generating different types of inference knowledge {w_1, w_2, …, w_L} for the event description in each picture or text, and calculating its common-sense inference representation H^{M,R}, wherein w_l denotes a word in the inference knowledge, 1 ≤ l ≤ L, L denotes the length of the inference knowledge, the relation type R ∈ {before, after, intent}, before denoting the pre-event relation type, after denoting the post-event relation type, and intent denoting the person-intention relation type in the scene, and the modality M denotes the text modality or the picture modality;
calculating a correlation matrix C^M between the data content to be identified and the inference knowledge, based on the text feature map H^T composed of the original representations u_i of the words, the picture feature map H^I composed of the original representations v_j of the objects, and the common-sense inference representation H^{M,R};
obtaining, based on the correlation matrix C^M, representations e_i^R of the words with implicit context information and representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j;
learning a correlation weight for each piece of inference knowledge and computing enhanced representations k_i^R and k_j^R from the representations e_i^R and e_j^R;
computing the word context-aware representations and the object context-aware representations based on the enhanced representations k_i^R and k_j^R, wherein the word context-aware vector representation comprises a scene-state context-aware representation u_i^{sc} and an emotional-state context-aware representation u_i^{em}, and the object context-aware vector representation comprises a scene-state context-aware representation v_j^{sc} and an emotional-state context-aware representation v_j^{em}.
4. The method of claim 3, wherein obtaining the representations e_i^R of the words with implicit context information and the representations e_j^R of the picture objects with implicit context information from the original representations u_i and v_j based on the correlation matrix C^M comprises:
computing, based on the correlation matrix C^M and the common-sense inference representation H^{M,R}, a word-level representation a_i^R and an object-level representation a_j^R of the inference knowledge by using an attention mechanism;
adding the word-level representation a_i^R to the original representation u_i of the word and the object-level representation a_j^R to the original representation v_j of the object, respectively, to obtain the representations e_i^R and e_j^R.
5. The method of claim 3, wherein computing the word context-aware representations based on the enhanced representations k_i^R comprises:
dividing the enhanced representations k_i^R into an enhanced representation k_i^{before} of the pre-event relation type, an enhanced representation k_i^{after} of the post-event relation type, and an enhanced representation k_i^{intent} of the intent relation type;
computing the scene-state context-aware representation u_i^{sc} of the word according to the enhanced representation k_i^{before}, the original representation u_i of the word, and the enhanced representation k_i^{after};
obtaining the emotional-state context-aware representation u_i^{em} of the word according to the enhanced representation k_i^{intent}.
6. The method of claim 1, wherein performing attention calculation on the word original representations u_i and the object original representations v_j using the dual-attention network to obtain the attention calculation result of the original representation comprises:
summing the original representations u_i of the words to obtain a complete text representation u^(0);
summing the original representations v_j of the objects in the picture to obtain a complete picture representation v^(0);
calculating a joint memory vector m^(0) from the representation u^(0) and the representation v^(0);
performing iterative calculation in which the joint memory vector is updated with the element-wise product u^(k) ⊙ v^(k), and obtaining the joint memory vector m^(K) after the iteration ends, where K denotes the total number of iterations, u^(k) denotes the complete text representation at the k-th iteration, v^(k) denotes the complete picture representation at the k-th iteration, and ⊙ is the element-wise product;
performing attention calculation with the dual-attention network based on the joint memory vector m^(K) to obtain the complete text representation u^(K+1) and the complete picture representation v^(K+1);
taking the complete text representation u^(K+1) and the complete picture representation v^(K+1) as the attention calculation result of the original representation.
7. The method of claim 6, wherein computing the complete text representation u^(K+1) based on the joint memory vector m^(K) comprises:
computing the output s_i of a feedforward neural network in the dual-attention network from the joint memory vector m^(K) and the word original representation u_i;
substituting the output s_i into a softmax function to obtain the attention weight α_i;
performing a weighted summation of the word original representations u_i with the attention weights α_i to obtain the complete text representation u^(K+1).
8. The method of claim 6, wherein obtaining the original cross-modal comparison representation by comparing the differences between the text and the picture according to the attention calculation result of the original representation comprises:
computing the original cross-modal comparison representation z_raw = W_raw [u^(K+1) ⊙ v^(K+1) ; |u^(K+1) - v^(K+1)|], wherein W_raw denotes a trainable weight matrix, ⊙ is the element-wise product, | · | is the absolute value of the element difference, and ; denotes the concatenation operation.
9. The method of claim 1, wherein calculating the irony recognition result of the data content to be identified based on the original cross-modal comparison representation and the context-aware cross-modal comparison representation comprises:
concatenating the original cross-modal comparison representation and the context-aware cross-modal comparison representation;
inputting the concatenation result into a fully connected layer and performing binary irony classification with a Sigmoid function to obtain the irony recognition result of the data content to be identified.
10. An electronic device comprising a memory and a processor, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any of claims 1-9.
CN202210863424.5A 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network Pending CN115408517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210863424.5A CN115408517A (en) 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210863424.5A CN115408517A (en) 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network

Publications (1)

Publication Number Publication Date
CN115408517A true CN115408517A (en) 2022-11-29

Family

ID=84157770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210863424.5A Pending CN115408517A (en) 2022-07-21 2022-07-21 Knowledge injection-based multi-modal irony recognition method of double-attention network

Country Status (1)

Country Link
CN (1) CN115408517A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402063A (en) * 2023-06-09 2023-07-07 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium
CN116402063B (en) * 2023-06-09 2023-08-15 华南师范大学 Multi-modal irony recognition method, apparatus, device and storage medium
CN116702091A (en) * 2023-06-21 2023-09-05 中南大学 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN116702091B (en) * 2023-06-21 2024-03-08 中南大学 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN117633516A (en) * 2024-01-25 2024-03-01 华南师范大学 Multi-mode cynics detection method, device, computer equipment and storage medium
CN117633516B (en) * 2024-01-25 2024-04-05 华南师范大学 Multi-mode cynics detection method, device, computer equipment and storage medium
CN118093896A (en) * 2024-04-12 2024-05-28 中国科学技术大学 Ironic detection method, ironic detection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN115408517A (en) Knowledge injection-based multi-modal irony recognition method of double-attention network
CN111897913B (en) Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
CN111639252A (en) False news identification method based on news-comment relevance analysis
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN115293170A (en) Aspect-level multi-modal emotion analysis method based on cooperative attention fusion
CN116402066A (en) Attribute-level text emotion joint extraction method and system for multi-network feature fusion
CN112115131A (en) Data denoising method, device and equipment and computer readable storage medium
CN113326384A (en) Construction method of interpretable recommendation model based on knowledge graph
CN114461821A (en) Cross-modal image-text inter-searching method based on self-attention reasoning
Khan et al. A deep neural framework for image caption generation using gru-based attention mechanism
CN117746078B (en) Object detection method and system based on user-defined category
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN115099234A (en) Chinese multi-mode fine-grained emotion analysis method based on graph neural network
CN117633516B (en) Multi-mode cynics detection method, device, computer equipment and storage medium
Chauhan et al. Analysis of Intelligent movie recommender system from facial expression
CN114330482A (en) Data processing method and device and computer readable storage medium
CN115098646A (en) Multilevel relation analysis and mining method for image-text data
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
WO2022099063A1 (en) Systems and methods for categorical representation learning
CN113821610A (en) Information matching method, device, equipment and storage medium
Wang et al. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
Yaoxian et al. Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination