CN117892205A

CN117892205A - Multi-modal irony detection method, apparatus, device and storage medium

Info

Publication number: CN117892205A
Application number: CN202410295722.8A
Authority: CN
Inventors: 吴乔峰; 薛云; 方文泷; 钟玮瑜
Original assignee: South China Normal University
Current assignee: South China Normal University
Priority date: 2024-03-15
Filing date: 2024-03-15
Publication date: 2024-04-16
Anticipated expiration: 2044-03-15

Abstract

The invention relates to the technical field of natural language processing, in particular to a multi-mode ironmaking detection method, device, equipment and storage medium, which are used for carrying out global feature extraction and text feature enhancement in a hierarchical manner for each mode content based on image and text dual-mode data, carrying out inconsistency evaluation in a cross-mode interaction mode based on obtained text global feature representation, image global feature representation and text enhancement feature representation to obtain inconsistency scores, carrying out cooperative identification by utilizing the image global feature representation, the text enhancement feature representation and the inconsistency scores, fully utilizing inter-mode information to carry out ironmaking detection, and improving the ironmaking detection accuracy.

Description

Multi-modal irony detection method, apparatus, device and storage medium

Technical Field

The present invention relates to the field of natural language processing technology, and in particular, to a method, an apparatus, a device, and a storage medium for multi-modal irony detection.

Background

Irony means that people or things are revealed, criticized or jeer by metaphors, exaggeration and other techniques, and is a special emotion expression mode, the text characteristics of the method are usually opposite to real/apparent emotion information, the emotion or intention is , in the viewpoint development of users of the current social platform, the viewpoint attitude of the users is greatly limited only by the traditional emotion analysis and viewpoint development methods, and inconsistent information implicit in the viewpoints is difficult to effectively identify. Therefore, the detection method for irony recognition is beneficial to analyzing the true viewpoint attitude of the user, and improves the accuracy of emotion analysis and viewpoint mining tasks.

Meanwhile, with the continuous development of social media and network culture, people are not limited to using text forms as the only way for expressing own views, and multi-mode texts such as texts, pictures and the like are widely distributed on various social platforms. Wherein, the combination of the characters and the images can achieve ironic effect, and the number of the push texts is not small. In a multimodal context, irony is no longer a pure linguistic phenomenon, and due to the nature of social media short text, text information is not always sufficient, and antisense relationships are more represented by way of cross-modality. Therefore, it is not sufficient to determine irony in multi-modal information from the viewpoint of text analysis alone.

Disclosure of Invention

Based on the above, the invention provides a multi-modal irony detection method, a device, equipment and a storage medium, global feature extraction and text feature enhancement are carried out for each modal content in a layering way based on image and text bimodal data, inconsistency evaluation is carried out by adopting a cross-modal interaction mode based on obtained text global feature representation, image global feature representation and text enhancement feature representation, inconsistency score is obtained, collaborative recognition is carried out by utilizing the image global feature representation, the text enhancement feature representation and the inconsistency score, irony detection is carried out by fully utilizing inter-modal information, and the irony detection accuracy is improved, and the technical method comprises the following steps:

in a first aspect, embodiments of the present application provide a multi-modal irony detection method comprising the steps of:

Obtaining document data to be detected and a preset ironic detection model, wherein the document data to be detected comprises a text to be detected and an image to be detected, and the ironic detection model comprises a feature extraction module, a text feature enhancement module, a cross-modal interaction module and an ironic detection module;

Inputting the document data to be tested into the feature extraction module for feature extraction, and obtaining text global feature representation corresponding to the text to be tested and image global feature representation corresponding to the image to be tested;

inputting the text global feature representation and the image global feature representation into the text feature enhancement module for feature enhancement to obtain text enhancement feature representation;

inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into the cross-modal interaction module for inconsistent evaluation to obtain an inconsistent score;

And inputting the text enhancement feature representation, the image global feature representation and the inconsistency score into the ironic detection module to perform ironic detection, and obtaining ironic detection results of the document data to be detected.

In a second aspect, embodiments of the present application provide a multi-modal ironic detection device comprising:

The system comprises a data acquisition module, a sarcasm detection module and a data processing module, wherein the data acquisition module is used for acquiring document data to be detected and a preset sarcasm detection model, the document data to be detected comprises a text to be detected and an image to be detected, and the sarcasm detection model comprises a feature extraction module, a text feature enhancement module, a cross-mode interaction module and a sarcasm detection module;

The global feature extraction module is used for inputting the document data to be detected into the feature extraction module for feature extraction, and obtaining text global feature representation corresponding to the text to be detected and image global feature representation corresponding to the image to be detected;

The feature enhancement module is used for inputting the text global feature representation and the image global feature representation into the text feature enhancement module for feature enhancement to obtain text enhancement feature representation;

The inconsistency score calculation module is used for inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into the cross-modal interaction module to perform inconsistency evaluation so as to obtain an inconsistency score;

The ironic detection processing module is used for inputting the text enhancement feature representation, the image global feature representation and the inconsistency score into the ironic detection module for ironic detection, and obtaining ironic detection results of the document data to be detected.

In a third aspect, an embodiment of the present application provides a computer apparatus, including: a processor, a memory, and a computer program stored on the memory and executable on the processor; the computer program when executed by the processor implements the steps of the method for multimodal irony detection as described in the first aspect.

In a fourth aspect, embodiments of the present application provide a storage medium storing a computer program which, when executed by a processor, implements the steps of the multimodal irony detection method of the first aspect.

In this embodiment, a method, an apparatus, a device, and a storage medium for multi-mode ironmaking detection are provided, global feature extraction and text feature enhancement are performed hierarchically for each mode content based on image and text dual-mode data, inconsistency evaluation is performed by using a cross-mode interaction mode based on obtained text global feature representation, image global feature representation, and text enhancement feature representation, inconsistency scores are obtained, collaborative recognition is performed by using the image global feature representation, the text enhancement feature representation, and the inconsistency scores, ironmaking detection is performed by fully using inter-mode information, and ironmaking detection accuracy is improved.

For a better understanding and implementation, the present invention is described in detail below with reference to the drawings.

Drawings

FIG. 1 is a flow chart of a method for multi-modal irony detection according to one embodiment of the application;

FIG. 2 is a schematic flow chart of S2 in a multi-modal irony detection method according to an embodiment of the application;

FIG. 3 is a schematic flow chart of S3 in a multi-modal irony detection method according to an embodiment of the application;

FIG. 4 is a schematic flow chart of S4 in a multi-modal irony detection method according to an embodiment of the application;

FIG. 5 is a schematic flow chart of a method for multi-modal irony detection S6 according to an embodiment of the application;

FIG. 6 is a schematic flow chart of S6 in a multi-modal irony detection method according to another embodiment of the application;

FIG. 7 is a schematic flow chart of a method for multi-modal irony detection S62 according to another embodiment of the application;

FIG. 8 is a schematic flow chart of a method for multi-modal irony detection S63 according to another embodiment of the application;

FIG. 9 is a schematic flow chart of a method for multi-modal irony detection S64 according to another embodiment of the application;

FIG. 10 is a schematic diagram of a multi-modal irony detection device according to one embodiment of the application;

Fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if"/"if" as used herein may be interpreted as "at " or "at " or "in response to a determination", depending on the context.

Referring to fig. 1, fig. 1 is a flowchart of a method for detecting multi-modal irony according to an embodiment of the application, comprising the steps of:

s1: and obtaining document data to be detected and a preset ironic detection model.

The execution subject of the multi-modal irony detection method of the present application is a detection apparatus (hereinafter referred to as detection apparatus) of the multi-modal irony detection method.

In an alternative embodiment, the detecting device may be a computer device, may be a server, or a server cluster formed by combining multiple computer devices.

The detection device obtains to-be-detected document data, and in an alternative embodiment, the to-be-detected document data is derived from social media, wherein the social media is a content production and exchange platform based on user relations on the internet and mainly comprises social websites, microblogs, weChats, facebooks, twitter, blogs, forums, podcasts and the like. The document data includes opinion, insight, experience, and views shared by users on social media, and the like.

Specifically, the detection device may be connected with the social media to obtain document data published on the social media as document data to be detected, where the document data to be detected includes a text to be detected and an image to be detected corresponding to the text to be detected, and the text to be detected includes a plurality of words as follows:

Where s is a sentence sequence of the text to be tested, is a word vector of the i-th word.

The detection device acquires a preset ironic detection model, wherein the ironic detection model comprises a feature extraction module, a text feature enhancement module, a cross-modal interaction module and an ironic detection module.

S2: and inputting the document data to be tested into the feature extraction module for feature extraction, and obtaining text global feature representation corresponding to the text to be tested and image global feature representation corresponding to the image to be tested.

In this embodiment, the detection device inputs the document data to be detected into the feature extraction module to perform feature extraction, so as to obtain a text global feature representation corresponding to the text to be detected and an image global feature representation corresponding to the image to be detected, where the text global feature representation includes state vectors of a plurality of words, and the image global feature representation includes state vectors of a plurality of image sub-areas.

The feature extraction module comprises a word embedding module, a target detection module and a dimension transformation module; referring to fig. 2, fig. 2 is a schematic flow chart of step S2 in the multi-mode irony detection method according to an embodiment of the application, including steps S21 to S23, specifically as follows:

s21: and inputting the text to be tested into the word embedding module for coding processing to obtain the initial text characteristic representation of the text to be tested.

The word embedding module adopts BERT (Bidirectional Encoder Representation from Transformers) word embedding models and is used for converting vectors of a plurality of words in the text to be detected into corresponding state vectors.

In this embodiment, the detection device inputs the text to be detected into the word embedding module to perform encoding processing, so as to obtain an initial text feature representation of the text to be detected.

Specifically, the detection device inputs the text to be detected into a BERT word embedding model, maps each word in the text to be detected into a low-dimensional vector space, obtains hidden layer vectors of a plurality of words of the text to be detected output by the BERT word embedding model through inquiring a pretrained BERT matrix, and carries out coding processing to obtain the initial text feature representation, wherein the initial text feature representation is as follows:

Wherein X is the initial text feature representation, is the hidden layer vector of the i-th word, and/() is the hidden layer vector of the n-1-th word.

Wherein T is the initial text feature representation, is the hidden layer vector of the ith word,/() is the hidden layer vector of the n-1 th word, and n is the total number of words.

S22: dividing the image to be detected into a plurality of image subareas, inputting the plurality of image subareas into the target detection module for target detection, and obtaining an initial image characteristic representation of the image to be detected.

The target detection module adopts a Swin transducer model for multi-target detection, and can determine a predefined object related to an entity in a picture.

In this embodiment, the detection device divides the image to be detected into a plurality of first image sub-areas according to preset first area size data, inputs the plurality of first image sub-areas to the target detection module for target detection, and obtains an initial image feature representation of the image to be detected, where the initial image feature representation is:

Wherein V is the initial image characteristic representation, is the detection vector of the j-th first image subarea,/() is the detection vector of the m-1-th first image subarea, and m is the total number of image subareas.

S23: and inputting the initial text feature representation and the initial image feature representation into the dimension transformation module to perform dimension transformation and feature extraction, and obtaining the text global feature representation and the image global feature representation.

The dimension transformation module adopts MLP (Multilayer Perceptron) multi-layer perceptron model, which is a feedforward artificial neural network model, and maps a plurality of input data sets to a single output data set.

In this embodiment, the detection device maps the initial text feature representation and the initial image feature representation to the same dimension space by using a multi-layer perceptron model with two different parameters according to the initial text feature representation and the initial image feature representation, and performs dimension transformation and feature extraction to obtain the text global feature representation and the image global feature representation, where the text global feature representation is:

Where is the text global feature representation,/> is the normalization function and/> is the multi-layer perceptron function.

The image global features are expressed as:

wherein is the image global feature representation.

S3: and inputting the text global feature representation and the image global feature representation into the text feature enhancement module for feature enhancement to obtain text enhancement feature representation.

In this embodiment, the detection device inputs the text global feature representation and the image global feature representation into the text feature enhancement module to perform feature enhancement, so as to obtain a text enhancement feature representation.

Referring to fig. 3, fig. 3 is a schematic flow chart of step S3 in the multi-mode irony detection method according to an embodiment of the application, including steps S31 to S32, specifically as follows:

s31: and obtaining attention characteristic representations corresponding to a plurality of multi-head attention according to the text global characteristic representation, the image global characteristic representation and a preset attention extraction algorithm by adopting a multi-head attention mechanism.

In order to obtain text sequence expression closer to global features of an image, and better understand the relationship between the image and the text, in this embodiment, the detection device adopts a multi-head attention mechanism, obtains attention feature expressions corresponding to a plurality of multi-head attention according to the global feature expression of the text, the global feature expression of the image and a preset attention extraction algorithm, relieves noise caused by differences between two different modes of the image and the text, strengthens association between the two modes, and captures cross-mode association information, where the attention extraction algorithm is as follows:

wherein is an attention feature representation corresponding to the ith multi-head attention, softmax () is a normalization function,/> is the text global feature representation,/> is a first weight parameter of the ith multi-head attention,/> is the image global feature representation,/> is a second weight parameter of the ith multi-head attention,/> is a third weight parameter of the ith multi-head attention, d is a dimension parameter, h is the number of heads of the multi-head attention, and/> is a transpose symbol.

S32: and obtaining the text enhancement feature representation according to a plurality of attention feature representations corresponding to the multiple attention points, the text global feature representation and a preset text enhancement feature algorithm.

The text enhancement feature algorithm is as follows:

Where is the text enhancement feature representation,/> is a normalization function,/> is the text global feature representation,/> is a multi-layer perceptron function,/> is the attention feature representation corresponding to the h-th multi-head attention.

In this embodiment, the detection device obtains the text enhancement feature representation according to the attention feature representations corresponding to the multiple attention points, the text global feature representation and a preset text enhancement feature algorithm, and enhances the text feature by adopting a manner that the text global feature representation is aligned with the graphics context of the image global feature representation, thereby realizing the extraction of cross-modal associated information and improving the irony detection accuracy.

S4: and inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into the cross-modal interaction module for inconsistent evaluation to obtain an inconsistent score.

Since different words have different effects on the ironic detection task, especially nouns, verbs, neighboring words, etc., are often of importance for understanding ironic utterances.

In this embodiment, the detection device inputs the text global feature representation, the text enhancement feature representation and the image global feature representation into the cross-modal interaction module to perform inconsistency evaluation, so as to obtain an inconsistency score, where the cross-modal interaction module includes a fully connected network and a multi-layer graph attention network, the inconsistency score includes a first inconsistency score and a second inconsistency score, the first inconsistency score is evaluation data obtained through processing of the fully connected network, and the second inconsistency score is evaluation data obtained through processing of the multi-layer graph attention network.

Referring to fig. 4, fig. 4 is a schematic flow chart of step S4 in the multi-mode irony detection method according to an embodiment of the application, including steps S41 to S43, specifically as follows:

s41: and inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into a fully connected network, and obtaining a first inconsistency score according to the text enhancement feature representation, the image global feature representation and a preset first inconsistency evaluation algorithm.

The first inconsistency evaluation algorithm is:

Where is the first inconsistency score,/> is a first weight parameter,/> is a first bias parameter, and/> is a first cross-modal joint weight parameter.

In this embodiment, the detection device inputs the text global feature representation, the text enhancement feature representation, and the image global feature representation into a fully connected network, and obtains a first inconsistency score according to the text enhancement feature representation, the image global feature representation, and a preset first inconsistency evaluation algorithm.

S42: and constructing a text adjacency matrix corresponding to the text enhancement feature representation and an image adjacency matrix corresponding to the image global feature representation, respectively taking the text adjacency matrix and the image adjacency matrix as first-layer input data of the multi-layer graph attention network, and obtaining feature vectors of all layers of the multi-layer graph attention network according to a preset graph convolution algorithm.

In this embodiment, the detecting device constructs, according to the text enhancement feature representation and the image global feature representation, a text graph corresponding to the text enhancement feature representation and a visual graph corresponding to the image global feature representation, where the constructing step includes node construction and edge construction.

Specifically, for node construction, the detection device constructs a number of text nodes of the text graph according to the state vectors of a number of words in the text enhancement feature representation, each text node corresponding to a state vector of a word in the text global feature representation.

The detection device constructs a plurality of visual nodes of the visual map according to the state vectors of a plurality of image subregions in the image global feature representation, and each visual node corresponds to the state vector of one image subregion in the image global feature representation.

For edge construction, the detection equipment constructs an edge set of the text graph and an edge set of the visual graph according to a plurality of text nodes and image nodes by adopting nodes of the same mode to be connected in pairs through edges in the modes and nodes of different modes to be connected through edges between the modes.

And the detection equipment converts the text graph and the visual graph to construct a text adjacency matrix corresponding to the text enhancement feature representation and an image adjacency matrix corresponding to the image global feature representation.

The multi-layer graph attention network graph attention networks (GAT) utilizes masked self-attention layers (MASKED SELF-attentional layers) to enable learning of the relative importance among nodes. By adopting GAT to extract attention from the first and second adjacency matrices, deeper information in text and visual modalities can be learned, thereby improving the accuracy of sarcasm detection. In this embodiment, the detection device uses the text adjacency matrix and the image adjacency matrix as first-layer input data of the multi-layer graph attention network respectively, and obtains feature vectors of each layer of the multi-layer graph attention network according to a preset graph convolution algorithm, where the feature vectors include text feature vectors corresponding to the text adjacency matrix and image feature vectors corresponding to the image adjacency matrix, and the graph volume integration algorithm is:

Wherein is an attention score between an ith node and a jth neighbor node of a first layer of the multi-layer graph attention network,/> is an activation function,/> is a bias of a learnable parameter of a first layer of the multi-layer graph attention network,/> is a weight parameter of the first layer of the multi-layer graph attention network,/> is a feature vector of the ith node of the first layer of the multi-layer graph attention network, j, k represents neighbor nodes,/> and/> are feature vectors of the jth and k neighbor nodes of the first layer of the multi-layer graph attention network, respectively, k represents neighbor nodes,/> is a set of neighbor nodes of the same layer of the ith node, and/> is an attention score between the ith node of the first layer of the multi-layer graph attention network and itself.

S43: combining text feature vectors of a plurality of nodes of each layer of the multi-layer graph attention network to obtain text mode embedded representations corresponding to the text to be detected, combining image feature vectors of a plurality of nodes of each layer of the multi-layer graph attention network to obtain image mode embedded representations corresponding to the image to be detected, and obtaining second inconsistency scores according to the text global feature representations, the text enhancement feature representations, the text mode embedded representations, the image mode embedded representations and a preset second inconsistency evaluation algorithm.

In this embodiment, the detection device combines text feature vectors of a plurality of nodes of each layer of the multi-layer graph attention network to obtain a text mode embedded representation corresponding to the text to be detected, and combines image feature vectors of a plurality of nodes of each layer of the multi-layer graph attention network to obtain an image mode embedded representation corresponding to the image to be detected.

The detection device obtains a second inconsistency score according to the text global feature representation, the text enhancement feature representation, the text modality embedded representation, the image modality embedded representation and a preset second inconsistency evaluation algorithm, wherein the second inconsistency evaluation algorithm is as follows:

Wherein is the second inconsistency score,/> is the text modality embedded representation,/> is the image modality embedded representation, c is the text semantic feature representation,/> is the second weight parameter,/> is the third weight parameter,/> is the second bias parameter, is the third bias parameter, and/> is the second cross-modality joint weight parameter.

S5: and inputting the text enhancement feature representation, the image global feature representation and the inconsistency score into the ironic detection module to perform ironic detection, and obtaining ironic detection results of the document data to be detected.

In this embodiment, the detecting device inputs the text enhancement feature representation, the image global feature representation and the inconsistency score into the irony detecting module to perform irony detection, and obtains irony detection results of the document data to be detected.

Referring to fig. 5, fig. 5 is a schematic flow chart of step S6 in the multi-mode irony detection method according to an embodiment of the application, including steps S51 to S52, specifically as follows:

S51: obtaining image weight parameters according to the image global feature representation and a preset image weight parameter calculation algorithm, and respectively carrying out dot product processing on the image weight parameters and first inconsistent scores and second inconsistent scores in the inconsistent scores to obtain first cross-modal joint information and second cross-modal joint information.

The image weight parameter calculation algorithm is as follows:

Where is the image weight parameter,/> is the fourth weight parameter, and/> is the fourth bias parameter.

In this embodiment, the detection device obtains an image weight parameter according to the image global feature representation and a preset image weight parameter calculation algorithm, and performs dot product processing on the image weight parameter and a first inconsistent score and a second inconsistent score in the inconsistent scores respectively to obtain first cross-modal joint information and second cross-modal joint information.

S52: and obtaining a predicted ironic probability distribution vector of the document data to be detected as the ironic detection result according to the first cross-modal joint information, the second cross-modal joint information, the image global feature representation, the text enhancement feature representation and a preset ironic probability distribution vector calculation algorithm.

The ironic probability distribution vector calculation algorithm is as follows:

Where y is the predicted ironic probability distribution vector, is the first cross-modal joint information,/> is the second cross-modal joint information,/> is the fifth weight parameter, and/> is the fifth bias parameter.

In this embodiment, the detection device obtains, as the ironic detection result, a predicted ironic probability distribution vector of the document data to be detected according to the first cross-modal joint information, the second cross-modal joint information, the image global feature representation, the text enhancement feature representation, and a preset ironic probability distribution vector calculation algorithm.

Based on image and text bimodal data, global feature extraction and text feature enhancement are carried out in a layering manner on each modal content, inconsistency evaluation is carried out by adopting a cross-modal interaction mode based on the obtained text global feature representation, image global feature representation and text enhancement feature representation, inconsistency scores are obtained, collaborative recognition is carried out by utilizing the image global feature representation, the text enhancement feature representation and the inconsistency scores, ironmaking detection is carried out by fully utilizing inter-modal information, and ironmaking detection accuracy is improved.

In another embodiment of the present application, the method for detecting multi-modal irony further includes step S6: before step S2, please refer to fig. 6, fig. 6 is a schematic flow chart of step S6 in the multi-modal irony detection method according to another embodiment of the application, including steps S61 to S64, specifically as follows:

s61: a training dataset is obtained.

In this embodiment, the detection device obtains a training data set, wherein the training data set comprises several sets of training document data, the training document data comprising training text and training images.

S62: obtaining a plurality of groups of reconstructed images corresponding to training images of the training document data, and a plurality of groups of training text global feature representations, training image global feature representations and reconstructed image global feature representations of the training document data, and obtaining a first total loss value according to the training text global feature representations, the training image global feature representations and the reconstructed image global feature representations of the training document data by adopting a contrast learning method.

In this embodiment, the detection device divides the image to be detected into a plurality of second image sub-areas according to the preset second area size data, performs non-overlapping random sampling on the plurality of first image sub-areas and the plurality of second image sub-areas of the image to be detected, obtains a plurality of non-overlapping image sub-areas, and combines the plurality of non-overlapping image sub-areas to construct a reconstructed image so as to extract local information in the image.

The detection device inputs a plurality of groups of training texts and training images of the training document data and reconstruction images corresponding to the training images into the feature extraction module of the ironic detection model to obtain a plurality of groups of training text global feature representations, training image global feature representations and reconstruction image global feature representations of the training document data, and specific embodiments can refer to steps S21-S23 and are not repeated here.

The detection equipment adopts a contrast learning method, and obtains a first total loss value according to a plurality of groups of training text global feature representations, training image global feature representations and reconstructed image global feature representations of the training document data.

Referring to fig. 7, fig. 7 is a schematic flow chart of step S62 in the multi-mode irony detection method according to another embodiment of the application, including steps S621 to S625, which are specifically as follows:

S621: and obtaining first text transposition feature representations and first image transposition feature representations of the training document data according to the training text global feature representations of the training texts of the training document data, the training image global feature representations of the training images and a preset first transposition multiplication algorithm.

The first transfer multiplication algorithm is as follows:

Where is the first text transposed feature representation,/> is the first learnable parameter,/> is the training text global feature representation,/> is the training image global feature representation, and/> is the first image transposed feature representation.

In this embodiment, the detection device obtains the first text transposition feature representations and the first image transposition feature representations of the training document data according to the training text global feature representations of the training texts of the training document data, the training image global feature representations of the training images and the preset first transposition multiplication algorithm, and performs contrast learning on a plurality of modes in the same scene, so that the model can pay attention to global and local inconsistencies between the images and the texts, and therefore, related information can be better understood and utilized, and the accuracy of ironic detection is improved.

S622: and respectively taking a group of training document data in the training data set as a first positive sample, obtaining first sub-loss values of a plurality of groups of training document data represented by the first text transposition feature and first sub-loss values of a plurality of groups of training document data represented by the first image transposition feature according to a first text transposition feature representation, a first image transposition feature representation and a preset first contrast learning loss function of the plurality of groups of training document data, and respectively combining the first sub-loss values of the plurality of groups of training document data represented by the first text transposition feature and the first sub-loss values of the plurality of groups of training document data represented by the first image transposition feature to construct a first loss value represented by the first text transposition feature and a first loss value represented by the first image transposition feature.

In this embodiment, the detection device respectively uses a set of training document data in the training data set as a first positive sample, and obtains a first sub-loss value of the plurality of sets of training document data based on the first text transposition feature representation and a first sub-loss value of the plurality of sets of training document data based on the first image transposition feature representation according to a first text transposition feature representation, a first image transposition feature representation and a preset first contrast learning loss function of the plurality of sets of training document data, where the first contrast learning loss function is:

wherein is a first sub-loss value of training document data represented based on the first text transposition feature, is a first text transposition feature representation of the first positive sample,/> is a first positive sample, is a temperature coefficient, K is training document data in the training dataset,/> is a first text transposition feature representation of an i-th set of training document data in the training dataset,/> is an i-th set of training document data in the training dataset,/> is a first sub-loss value of training document data represented based on the first image transposition feature,/> is a first image transposition feature representation of the first positive sample, is a first image transposition feature representation of an i-th set of training document data in the training dataset.

The detection device respectively combines the first sub-penalty values of the plurality of sets of training document data represented by the first text transposition feature and the first sub-penalty values of the plurality of sets of training document data represented by the first image transposition feature to construct a first penalty value represented by the first text transposition feature and a first penalty value represented by the first image transposition feature.

S623: and combining the training texts of the plurality of groups of training document data with the reconstructed images to construct an enhancement data set, wherein the enhancement data set comprises a plurality of groups of enhancement document data, and the second text transposition feature representation and the second image transposition feature representation of the plurality of groups of enhancement document data are obtained according to the training text global feature representation, the reconstructed image global feature representation and a preset second transposition multiplication algorithm of the plurality of groups of enhancement document data.

The second transposed multiplication algorithm is:

Where is the second text transposed feature representation,/> is the reconstructed image global feature representation, and/> is the second image transposed feature representation.

In this embodiment, the detection device combines training texts and reconstructed images of the training document data in several groups to construct an enhanced data set, where the enhanced data set includes several groups of enhanced document data, and obtains second text transposed feature representations and second image transposed feature representations of the several groups of enhanced document data according to the training text global feature representations, the reconstructed image global feature representations, and a preset second transposed multiplication algorithm of the several groups of enhanced document data.

S624: and respectively taking a group of enhanced document data in the enhanced data set as a second positive sample, according to a second text transposition feature representation, a second image transposition feature representation and a preset second contrast learning loss function of a plurality of groups of enhanced document data, obtaining first enhancement sub-loss values of a plurality of groups of enhanced document data represented by the second text transposition feature and first enhancement sub-loss values of a plurality of groups of enhanced document data represented by the second image transposition feature, and respectively combining the first enhancement sub-loss values of a plurality of groups of enhanced document data represented by the second text transposition feature and the first enhancement sub-loss values of a plurality of groups of enhanced document data represented by the second image transposition feature to construct a first enhancement loss value represented by the second text transposition feature and a first enhancement loss value represented by the second image transposition feature.

In this embodiment, the detection device respectively uses a group of enhanced document data in the enhanced data set as a second positive sample, and obtains a first enhancer loss value of a plurality of groups of enhanced document data represented by a second text transposition feature and a first enhancer loss value of a plurality of groups of enhanced document data represented by a second image transposition feature according to the second text transposition feature representation, the second image transposition feature representation and a preset second contrast learning loss function, where the second contrast learning loss function is:

/>

Where is a first enhancer loss value for enhanced document data based on the second text transposition feature representation,/> is a second text transposition feature representation of the second positive sample,/> is a second positive sample, M is enhanced document data in the enhancement dataset,/> is a second text transposition feature representation of an i-th set of enhanced document data in the enhancement dataset,/> is an i-th set of enhanced document data in the enhancement dataset,/> is a first enhancer loss value for enhanced document data based on the second image transposition feature representation,/> is a second image transposition feature representation of the second positive sample, is a second image transposition feature representation of an i-th set of enhanced document data in the enhancement dataset.

The detection device combines the first enhancer loss values of the plurality of sets of enhanced document data represented by the second text transposition feature and the first enhancer loss values of the plurality of sets of enhanced document data represented by the second image transposition feature, respectively, to construct a first enhancement loss value represented by the second text transposition feature and a first enhancement loss value represented by the second image transposition feature.

S625: and obtaining a first total loss value according to a first loss value expressed based on the first text transposition feature, a first loss value expressed based on the first image transposition feature, a first enhancement loss value expressed based on the second text transposition feature, a first enhancement loss value expressed based on the second image transposition feature and a preset first total loss value calculation algorithm.

In this embodiment, the detection device obtains a first total loss value according to a first loss value represented based on the first text transposition feature, a first loss value represented based on the first image transposition feature, a first enhancement loss value represented based on the second text transposition feature, a first enhancement loss value represented based on the second image transposition feature, and a preset first total loss value calculation algorithm, where the first total loss value calculation algorithm is:

Where is a first total loss value,/> is a first loss value represented by the first text transposition feature,/> is a first loss value represented by the first image transposition feature,/> is a first enhancement loss value represented by the second text transposition feature,/> is a first enhancement loss value represented by the second image transposition feature, and m is the number of reconstructed images corresponding to each training image.

The text global feature representation and the image global feature representation are optimized through contrast learning, so that the alignment of the text global feature representation and the image global feature representation is better realized, the semantic gap is reduced, and the accuracy of the sarcasm detection is improved.

S63: obtaining training text mode embedded representations and training image mode embedded representations of the training document data, and obtaining a second total loss value according to the training text mode embedded representations and the training image mode embedded representations of the training document data by adopting a contrast learning method.

In this embodiment, the detection device obtains a plurality of sets of training text mode embedded representations and training image mode embedded representations of the training document data, and obtains the second total loss value according to the plurality of sets of training text mode embedded representations and training image mode embedded representations of the training document data by adopting a contrast learning method.

Referring to fig. 8, fig. 8 is a schematic flow chart of step S63 in the multi-mode irony detection method according to another embodiment of the application, including steps S631 to S634, which are specifically as follows:

S631: respectively taking training text mode embedded representations and training image mode embedded representations of a plurality of groups of training document data as input mode embedded representations, and obtaining negative pairing similarity weight matrixes corresponding to the input mode embedded representations of the plurality of groups of training document data according to feature vectors of a plurality of nodes of each layer of a multi-layer graph attention network in the input mode embedded representations of the plurality of groups of training document data and a preset negative pairing similarity weight calculation algorithm.

In this embodiment, the detection device respectively uses training text mode embedded representations and training image mode embedded representations of a plurality of groups of training document data as input mode embedded representations, and obtains negative pairing similarity weight matrixes corresponding to the input mode embedded representations of the plurality of groups of training document data according to feature vectors of a plurality of nodes of each layer of a multi-layer graph attention network in the input mode embedded representations of the plurality of groups of training document data and a preset negative pairing similarity weight calculation algorithm, where the negative pairing similarity weight matrixes include negative pairing similarity weight parameters among the plurality of nodes, and the negative pairing similarity weight calculation algorithm is as follows:

Wherein is a negative pairing similarity weight parameter of the ith node of the first layer and the jth node of the first+1 layer of the multi-layer graph attention network,/> is a second leachable parameter,/> is a cosine similarity function,/> is a feature vector of the ith node of the first layer of the multi-layer graph attention network, and/> is a feature vector of the jth node of the first layer of the multi-layer graph attention network.

S632: and obtaining second sub-loss values of a plurality of nodes of a plurality of layers of the multi-layer graph attention network corresponding to the plurality of groups of training document data according to the input mode embedded representation of the plurality of groups of training document data, the negative pairing similarity weight matrix corresponding to the input mode embedded representation and a preset third contrast loss function.

In this embodiment, the detection device uses a plurality of nodes of a plurality of layers in the input modality embedded representation as anchor points according to the input modality embedded representation of a plurality of groups of training document data, uses a intra-layer opposite mode to take neighbor nodes of the same layer of the multi-layer graph attention network corresponding to the anchor points as positive samples, uses an inter-layer opposite mode to take nodes of the same position indexes of different layers of the multi-layer graph attention network corresponding to the anchor points as positive samples, and takes other nodes as negative samples.

The detection device obtains second sub-loss values of a plurality of nodes of a plurality of layers of the multi-layer graph attention network corresponding to a plurality of groups of training document data according to the negative pairing similarity weight matrix corresponding to the input modality embedding representation and a preset third comparison loss function, wherein the second sub-loss values comprise second sub-loss values of a plurality of nodes based on the training text modality embedding representation and second sub-loss values of a plurality of nodes based on the training image modality embedding representation, and the third comparison loss function is as follows:

Wherein is a second sub-loss value of an ith node of a first layer of the multi-layer graph attention network,/> is a correlation coefficient between the ith node of the first layer and an ith node of a first+1th layer of the multi-layer graph attention network,/> is a correlation coefficient between the ith node of the first layer and a jth node of the first layer of the multi-layer graph attention network, and/> is a correlation coefficient between the ith node of the first layer and a jth node of the first+1th layer of the multi-layer graph attention network.

S633: and obtaining second loss values of the plurality of groups of training document data based on the input modality embedded representation according to second sub-loss values of the plurality of nodes of the plurality of layers of the multi-layer graph attention network corresponding to the plurality of groups of training document data and a preset second loss value calculation algorithm.

In this embodiment, the detection device obtains, according to a second sub-loss value of a plurality of nodes of a plurality of layers of the multi-layer graph attention network corresponding to a plurality of sets of the training document data and a preset second loss value calculation algorithm, a second loss value of a plurality of sets of the training document data based on the input modality embedded representation, where the second loss value includes a second loss value of a plurality of sets of the training document data based on the training text modality embedded representation and a second loss value of a plurality of sets of the training document data based on the training image modality embedded representation, and the second loss value calculation algorithm is:

Wherein is a second loss value of the training document data based on the input modality embedded representation, L is the number of layers of the multi-layer graph attention network, and N is the number of nodes in each layer of the multi-layer graph attention network.

S634: and accumulating second loss values of the plurality of groups of training document data based on the training text mode embedded representation and the plurality of groups of training document data based on the training image mode embedded representation to obtain the second total loss value.

In this embodiment, the detection device adds up the second loss values of the plurality of sets of training document data based on the training text modality embedded representation and the second loss values of the plurality of sets of training document data based on the training image modality embedded representation to obtain the second total loss value, and fully considers all layer information of the attention network of the multi-layer diagram, including the mutual correlation information between the present layer information and the next layer, so as to improve the accuracy of irony detection.

S64: obtaining real ironic probability distribution vectors of a plurality of groups of training document data, obtaining a third total loss value according to the predicted ironic probability distribution vectors and the real ironic probability distribution vectors of the plurality of groups of training document data by adopting a cross entropy method, and training the ironic detection model according to the first total loss value, the second total loss value and the third total loss value to obtain a target ironic detection model.

In this embodiment, the detecting device obtains real-ironic probability distribution vectors of the plurality of sets of the training document data, and obtains the third total loss value based on the predicted-ironic probability distribution vectors and the real-ironic probability distribution vectors of the plurality of sets of the training document data by using a cross entropy method.

The detection equipment trains the ironic detection model according to the first total loss value, the second total loss value and the third total loss value to obtain a target ironic detection model, specifically, the detection equipment acquires an overall loss value according to the first total loss value, the second total loss value, the third total loss value and a preset overall loss function, trains the ironic detection model according to the overall loss value to obtain the target ironic detection model, wherein the overall loss function is as follows:

Where is the overall loss value,/> is the first overall loss value,/> is the second overall loss value,/> is the third overall loss value,/> is the first hyper-parameter and/> is the second hyper-parameter.

Referring to fig. 9, fig. 9 is a flowchart of step S64 in a multi-modal irony detection method according to another embodiment of the application, including step S641, which is specifically as follows:

S641: according to the predicted ironic probability distribution vector, the real ironic probability distribution vector and the preset cross entropy loss function of the training document data, a plurality of third loss values of the training document data are obtained, and the third loss values of the training document data are accumulated to obtain the third total loss value.

The cross entropy loss function is:

Wherein is a second loss value, K is the number of training document data in the training dataset, v > is a true irony probability distribution vector of the ith training document data, v > is a predicted irony probability distribution vector of the ith training document data.

In this embodiment, the detection device obtains a plurality of third loss values of the training document data according to the predicted ironic probability distribution vector, the real ironic probability distribution vector and the preset cross entropy loss function of the plurality of training document data, and accumulates the plurality of third loss values of the training document data to obtain the third total loss value.

Referring now to fig. 10, fig. 10 is a schematic structural diagram of a multi-modal irony detection device according to one embodiment of the application, the device implementing all or a portion of the multi-modal irony detection method by software, hardware, or a combination of both, the device 10 comprising:

The data acquisition module 101 is configured to obtain document data to be detected and a preset ironic detection model, where the document data to be detected includes a text to be detected and an image to be detected, and the ironic detection model includes a feature extraction module, a text feature enhancement module, a cross-mode interaction module, and an ironic detection module;

The global feature extraction module 102 is configured to input the document data to be tested into the feature extraction module to perform feature extraction, so as to obtain a text global feature representation corresponding to the text to be tested and an image global feature representation corresponding to the image to be tested;

the feature enhancement module 103 is configured to input the text global feature representation and the image global feature representation into the text feature enhancement module for feature enhancement, so as to obtain a text enhancement feature representation;

the inconsistency score calculation module 104 is configured to input the text global feature representation, the text enhancement feature representation, and the image global feature representation into the cross-modal interaction module for inconsistency evaluation, so as to obtain an inconsistency score;

The ironic detection processing module 105 is configured to input the text enhancement feature representation, the image global feature representation and the inconsistency score into the ironic detection module to perform ironic detection, so as to obtain ironic detection results of the document data to be detected.

In the embodiment of the application, the document data to be detected and a preset ironic detection model are obtained through a data acquisition module, wherein the document data to be detected comprises a text to be detected and an image to be detected, and the ironic detection model comprises a feature extraction module, a text feature enhancement module, a cross-mode interaction module and an ironic detection module; inputting the document data to be tested into the feature extraction module through a global feature extraction module to perform feature extraction, and obtaining text global feature representation corresponding to the text to be tested and image global feature representation corresponding to the image to be tested; inputting the text global feature representation and the image global feature representation into the text feature enhancement module through a feature enhancement module to perform feature enhancement, so as to obtain text enhancement feature representation; inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into the cross-modal interaction module through an inconsistency score calculation module to perform inconsistency evaluation to obtain an inconsistency score; and inputting the text enhancement feature representation, the image global feature representation and the inconsistency score into the irony detection module through the irony detection processing module for irony detection, and obtaining irony detection results of the document data to be detected. Based on image and text bimodal data, global feature extraction and text feature enhancement are carried out in a layering manner on each modal content, inconsistency evaluation is carried out by adopting a cross-modal interaction mode based on the obtained text global feature representation, image global feature representation and text enhancement feature representation, inconsistency scores are obtained, collaborative recognition is carried out by utilizing the image global feature representation, the text enhancement feature representation and the inconsistency scores, ironmaking detection is carried out by fully utilizing inter-modal information, and ironmaking detection accuracy is improved.

Referring to fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device 11 includes: a processor 111, a memory 112, and a computer program 113 stored on the memory 112 and executable on the processor 111; the computer device may store a plurality of instructions adapted to be loaded by the processor 111 and to execute the steps of the method according to the embodiment of fig. 1 to 9, and the specific execution process may refer to the specific description of the embodiment of fig. 1 to 9, which is not repeated here.

Wherein processor 111 may include one or more processing cores. The processor 111 performs various functions of the multi-modal irony detection device 10 and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 112 and invoking data in the memory 112 using various interfaces and various parts within the wired connection server, alternatively, the processor 111 may be implemented in at least one hardware form of digital signal Processing (DIGITAL SIGNAL Processing, DSP), field-Programmable gate array (fieldprogrammable GATE ARRAY, FPGA), programmable logic array (Programble Logic Array, PLA). The processor 111 may integrate one or a combination of several of a central processor 111 (Central Processing Unit, CPU), an image processor 111 (Graphics Processing Unit, GPU), and a modem, etc. The CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing the content required to be displayed by the touch display screen; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 111 and may be implemented by a single chip.

The memory 112 may include a random access memory 112 (Random Access Memory, RAM), or may include a read-only memory 112. Optionally, the memory 112 includes a non-transitory computer readable medium (non-transitory computer-readable storage medium). Memory 112 may be used to store instructions, programs, code sets, or instruction sets. The memory 112 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions, etc.), instructions for implementing the various method embodiments described above, etc.; the storage data area may store data or the like referred to in the above respective method embodiments. The memory 112 may also optionally be at least one storage device located remotely from the aforementioned processor 111.

The embodiment of the present application further provides a storage medium, where the storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executed to perform the method steps of the first embodiment to the third embodiment, and the specific execution process may refer to the specific description of the embodiments described in fig. 1 to fig. 9, which are not repeated herein.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc.

The present invention is not limited to the above-described embodiments, but, if various modifications or variations of the present invention are not departing from the spirit and scope of the present invention, the present invention is intended to include such modifications and variations as fall within the scope of the claims and the equivalents thereof.

Claims

1. A method of multimodal irony detection comprising the steps of:

2. The method of claim 1, wherein the feature extraction module comprises a word embedding module, a target detection module, and a dimension transformation module;

Inputting the document data to be tested into the feature extraction module for feature extraction to obtain a text global feature representation corresponding to the text to be tested and an image global feature representation corresponding to the image to be tested, wherein the method comprises the following steps:

inputting the text to be tested into the word embedding module for coding processing to obtain an initial text characteristic representation of the text to be tested;

dividing the image to be detected into a plurality of image subareas, inputting the image subareas into the target detection module for target detection, and obtaining an initial image characteristic representation of the image to be detected;

And inputting the initial text feature representation and the initial image feature representation into the dimension transformation module to perform dimension transformation and feature extraction, and obtaining the text global feature representation and the image global feature representation.

3. A multi-modal irony detection method according to claim 2, characterized in that said inputting of said text global feature representation and image global feature representation into said text feature enhancement module for feature enhancement, obtaining a text enhanced feature representation, comprises the steps of:

and obtaining attention characteristic representations corresponding to a plurality of multi-head attention according to the text global characteristic representation, the image global characteristic representation and a preset attention extraction algorithm by adopting a multi-head attention mechanism, wherein the attention extraction algorithm is as follows:

Wherein is an attention feature representation corresponding to the ith multi-head attention, softmax () is a normalization function,/> is the text global feature representation,/> is a first weight parameter of the ith multi-head attention,/> is the image global feature representation,/> is a second weight parameter of the ith multi-head attention,/> is a third weight parameter of the ith multi-head attention, d is a dimension parameter, h is the number of heads of the multi-head attention and/> is a transpose symbol;

Obtaining the text enhancement feature representation according to a plurality of attention feature representations corresponding to the multiple attention points, a text global feature representation and a preset text enhancement feature algorithm, wherein the text enhancement feature algorithm is as follows:

Where is the text enhancement feature representation,/> is the normalization function,/> is the multi-layer perceptron function,/> is the attention feature representation for the h-th multi-head attention.

4. A method of multi-modal irony detection according to claim 3, characterised in that: the inconsistency score comprises a first inconsistency score and a second inconsistency score, and the cross-modal interaction module comprises a fully connected network and a multi-layer graph attention network;

The step of inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into the cross-modal interaction module for performing inconsistency evaluation to obtain an inconsistency score comprises the following steps:

Inputting the text global feature representation, the text enhancement feature representation and the image global feature representation into a fully connected network, and obtaining a first inconsistency score according to the text enhancement feature representation, the image global feature representation and a preset first inconsistency evaluation algorithm, wherein the first inconsistency evaluation algorithm is as follows:

Wherein is the first inconsistency score,/> is a first weight parameter,/> is a first bias parameter,/> is a first cross-modal joint weight parameter;

Constructing a text adjacency matrix corresponding to the text enhancement feature representation and an image adjacency matrix corresponding to the image global feature representation, respectively taking the text adjacency matrix and the image adjacency matrix as first-layer input data of the multi-layer graph attention network, and obtaining feature vectors of each layer of the multi-layer graph attention network according to a preset graph convolution algorithm, wherein the feature vectors comprise text feature vectors corresponding to the text adjacency matrix and image feature vectors corresponding to the image adjacency matrix, and the graph convolution algorithm is as follows:

Wherein is an attention score between an ith node and a jth neighbor node of a first layer of the multi-layer graph attention network,/> is an activation function,/> is a bias of a learnable parameter of a first layer of the multi-layer graph attention network,/> is a weight parameter of the first layer of the multi-layer graph attention network,/> is a feature vector of the ith node of the first layer of the multi-layer graph attention network, j, k represents neighbor nodes,/> and/> are feature vectors of the jth and k neighbor nodes of the first layer of the multi-layer graph attention network, respectively, k represents neighbor nodes,/> is a set of neighbor nodes of the same layer of the ith node, and/> is an attention score between the ith node of the first layer of the multi-layer graph attention network and itself;

Combining text feature vectors of a plurality of nodes of each layer of the multi-layer graph attention network to obtain text mode embedded representations corresponding to the text to be detected, combining image feature vectors of a plurality of nodes of each layer of the multi-layer graph attention network to obtain image mode embedded representations corresponding to the image to be detected, and obtaining second inconsistency scores according to the text global feature representations, the text enhancement feature representations, the text mode embedded representations, the image mode embedded representations and a preset second inconsistency evaluation algorithm, wherein the second inconsistency evaluation algorithm is as follows:

Wherein is the second inconsistency score,/> is the text modality embedded representation,/> is the image modality embedded representation, c is the text semantic feature representation,/> is the second weight parameter,/> is the third weight parameter,/> is the second bias parameter,/> is the third bias parameter, and/> is the second cross-modality joint weight parameter.

5. The method for detecting the irony of multiple modes according to claim 4, wherein the irony detection method for inputting the text enhancement feature representation, the image global feature representation and the inconsistency score into the irony detection module to perform irony detection, and obtaining irony detection results of the document data to be detected comprises the steps of:

Obtaining an image weight parameter according to the image global feature representation and a preset image weight parameter calculation algorithm, and respectively carrying out dot product processing on the image weight parameter and a first inconsistent fraction and a second inconsistent fraction in the inconsistent fractions to obtain first cross-modal joint information and second cross-modal joint information, wherein the image weight parameter calculation algorithm is as follows:

Wherein is the image weight parameter,/> is the fourth weight parameter, and/> is the fourth bias parameter;

Obtaining a predicted ironic probability distribution vector of the document data to be detected as the ironic detection result according to the first cross-modal joint information, the second cross-modal joint information, the image global feature representation, the text enhancement feature representation and a preset ironic probability distribution vector calculation algorithm, wherein the ironic probability distribution vector calculation algorithm is as follows:

6. A method of multi-modal irony detection according to claim 5 further comprising the step of: training the ironic detection model, the training the ironic detection model comprising the steps of:

obtaining a training data set, wherein the training data set comprises a plurality of groups of training document data, and the training document data comprises training texts and training images;

obtaining a plurality of groups of reconstructed images corresponding to training images of the training document data, and a plurality of groups of training text global feature representations, training image global feature representations and reconstructed image global feature representations of the training document data, and obtaining a first total loss value according to the training text global feature representations, the training image global feature representations and the reconstructed image global feature representations of the training document data by adopting a contrast learning method;

Obtaining training text mode embedded representations and training image mode embedded representations of a plurality of groups of training document data, and obtaining a second total loss value according to the training text mode embedded representations and the training image mode embedded representations of the plurality of groups of training document data by adopting a contrast learning method;

Obtaining real ironic probability distribution vectors of a plurality of groups of training document data, obtaining a third total loss value according to the predicted ironic probability distribution vectors and the real ironic probability distribution vectors of the plurality of groups of training document data by adopting a cross entropy method, and training the ironic detection model according to the first total loss value, the second total loss value and the third total loss value to obtain a target ironic detection model.

7. A multi-modal irony detection method according to claim 6 wherein said employing a contrast learning method obtains a first total loss value based on sets of training text global feature representations, training image global feature representations, and reconstructed image global feature representations of said training document data, comprising the steps of:

Obtaining a first text transposition feature representation and a first image transposition feature representation of a plurality of groups of training document data according to training text global feature representations of training texts of the plurality of groups of training document data, training image global feature representations of training images and a preset first transposition multiplication algorithm, wherein the first transposition multiplication algorithm is as follows:

Wherein is a first text transposed feature representation,/> is a first learnable parameter,/> is a training text global feature representation, is a training image global feature representation, and/> is a first image transposed feature representation;

Respectively taking a group of training document data in the training data set as a first positive sample, according to a first text transposition feature representation, a first image transposition feature representation and a preset first contrast learning loss function of a plurality of groups of training document data, obtaining first sub-loss values of a plurality of groups of training document data represented by the first text transposition feature and first sub-loss values of a plurality of groups of training document data represented by the first image transposition feature, respectively combining the first sub-loss values of a plurality of groups of training document data represented by the first text transposition feature and the first sub-loss values of a plurality of groups of training document data represented by the first image transposition feature, and constructing a first loss value represented by the first text transposition feature and a first loss value represented by the first image transposition feature, wherein the first contrast learning loss function is as follows:

Wherein is a first sub-loss value of training document data represented based on the first text transposition feature,/> is a first text transposition feature representation of the first positive sample,/> is a first positive sample, is a temperature coefficient, K is training document data in the training dataset,/> is a first text transposition feature representation of an i-th set of training document data in the training dataset,/> is an i-th set of training document data in the training dataset,/> is a first sub-loss value of training document data represented based on the first image transposition feature,/> is a first image transposition feature representation of the first positive sample, and/> is a first image transposition feature representation of an i-th set of training document data in the training dataset;

Combining training texts and reconstruction images of the plurality of groups of training document data to construct an enhancement data set, wherein the enhancement data set comprises a plurality of groups of enhancement document data, and a second text transposition feature representation and a second image transposition feature representation of the plurality of groups of enhancement document data are obtained according to a training text global feature representation, a reconstruction image global feature representation and a preset second transposition multiplication algorithm of the plurality of groups of enhancement document data, wherein the second transposition multiplication algorithm is as follows:

Wherein is a second text transposed feature representation,/> is a reconstructed image global feature representation, and/> is a second image transposed feature representation;

Respectively taking a group of enhanced document data in the enhanced data set as a second positive sample, according to a second text transposition feature representation, a second image transposition feature representation and a preset second contrast learning loss function of a plurality of groups of enhanced document data, obtaining a first enhancement sub-loss value of a plurality of groups of enhanced document data based on the second text transposition feature representation and a first enhancement sub-loss value of a plurality of groups of enhanced document data based on the second image transposition feature representation, and respectively combining the first enhancement sub-loss value of a plurality of groups of enhanced document data based on the second text transposition feature representation and the first enhancement sub-loss value of a plurality of groups of enhanced document data based on the second image transposition feature representation to construct a first enhancement loss value based on the second text transposition feature representation and a first enhancement loss value based on the second image transposition feature representation, wherein the second contrast learning loss function is as follows:

Wherein is a first enhancer loss value of enhanced document data represented based on the second text transposition feature, is a second text transposition feature representation of the second positive sample,/> is a second positive sample, M is enhanced document data in the enhanced dataset,/> is a second text transposition feature representation of an i-th set of enhanced document data in the enhanced dataset,/> is an i-th set of enhanced document data in the enhanced dataset,/> is a first enhancer loss value of enhanced document data represented based on the second image transposition feature,/> is a second image transposition feature representation of the second positive sample, is a second image transposition feature representation of an i-th set of enhanced document data in the enhanced dataset;

Obtaining a first total loss value according to a first loss value represented by the first text transposition feature, a first loss value represented by the first image transposition feature, a first enhancement loss value represented by the second text transposition feature, a first enhancement loss value represented by the second image transposition feature and a preset first total loss value calculation algorithm, wherein the first total loss value calculation algorithm is as follows:

8. A method of multimodal irony detection according to claim 6 wherein said obtaining a second total loss value from training text modality embedded representations and training image modality embedded representations of sets of said training document data using a contrast learning method comprises the steps of:

Respectively taking training text mode embedded representations and training image mode embedded representations of a plurality of groups of training document data as input mode embedded representations, and obtaining negative pairing similarity weight matrixes corresponding to the input mode embedded representations of the plurality of groups of training document data according to feature vectors of a plurality of nodes of each layer of a multi-layer graph attention network in the input mode embedded representations of the plurality of groups of training document data and a preset negative pairing similarity weight calculation algorithm, wherein the negative pairing similarity weight matrixes comprise negative pairing similarity weight parameters among the plurality of nodes, and the negative pairing similarity weight calculation algorithm comprises the following steps:

Wherein is a negative pairing similarity weight parameter of the ith node of the first layer and the jth node of the first+1 layer of the multi-layer graph attention network,/> is a second leachable parameter,/> is a cosine similarity function,/> is a feature vector of the ith node of the first layer of the multi-layer graph attention network, and/> is a feature vector of the jth node of the first layer of the multi-layer graph attention network;

Obtaining second sub-loss values of a plurality of nodes of a plurality of layers of the multi-layer graph attention network corresponding to the plurality of groups of training document data according to input modality embedded representations of the plurality of groups of training document data, negative pairing similarity weight matrixes corresponding to the input modality embedded representations and a preset third contrast loss function, wherein the second sub-loss values comprise second sub-loss values of the plurality of nodes based on the training text modality embedded representations and second sub-loss values of the plurality of nodes based on the training image modality embedded representations, and the third contrast loss function is as follows:

Wherein is a second sub-loss value of an ith node of a first layer of the multi-layer graph attention network,/> is a correlation coefficient between the ith node of the first layer and an ith node of a first+1th layer of the multi-layer graph attention network,/> is a correlation coefficient between the ith node of the first layer and a jth node of the first layer of the multi-layer graph attention network,/> is a correlation coefficient between the ith node of the first layer and a jth node of the first+1th layer of the multi-layer graph attention network;

Obtaining second loss values of a plurality of groups of training document data based on the input modality embedded representation according to second sub-loss values of a plurality of nodes of a plurality of layers of the multi-layer graph attention network corresponding to the plurality of groups of training document data and a preset second loss value calculation algorithm, wherein the second loss values comprise the second loss values of the plurality of groups of training document data based on the training text modality embedded representation and the second loss values of the plurality of groups of training document data based on the training image modality embedded representation, and the second loss value calculation algorithm is as follows:

Wherein is a second loss value of the training document data based on the input modality embedded representation, L is the number of layers of the multi-layer graph attention network, and N is the number of nodes in each layer of the multi-layer graph attention network;

and accumulating second loss values of the plurality of groups of training document data based on the training text mode embedded representation and the plurality of groups of training document data based on the training image mode embedded representation to obtain the second total loss value.

9. A multi-modal irony detection method according to claim 6 wherein said cross entropy method is used to obtain a third total loss value based on a predicted irony probability distribution vector and a true irony probability distribution vector for sets of said training document data, comprising the steps of:

Obtaining a plurality of third loss values of the training document data according to the predicted ironic probability distribution vector, the real ironic probability distribution vector and a preset cross entropy loss function of the training document data, and accumulating the third loss values of the training document data to obtain a third total loss value, wherein the cross entropy loss function is as follows:

10. A multi-modal ironic inspection device, comprising: