CN116702091A - Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP - Google Patents

Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Info

Publication number
CN116702091A
Authority
CN
China
Prior art keywords
text
image
view
ironic
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310737347.3A
Other languages
Chinese (zh)
Other versions
CN116702091B (en)
Inventor
覃立波
周璟轩
黄仕爵
陈麒光
蔡晨冉
张钰迪
梁斌
车万翔
徐睿峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310737347.3A priority Critical patent/CN116702091B/en
Publication of CN116702091A publication Critical patent/CN116702091A/en
Application granted granted Critical
Publication of CN116702091B publication Critical patent/CN116702091B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a multi-modal ironic intention recognition method, device and equipment based on multi-view CLIP, wherein the method comprises the following steps: sequentially encoding and decoding the text information and the image information in a data tuple, wherein a CLIP model is adopted for encoding to obtain the respective vector representations of the text and the image, which are decoded to obtain ironic score distributions based on the text view and the image view respectively; splicing the encoded vector representations of the text and the image, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights of text and image with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view; and aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view, and obtaining the ironic intention recognition result of the data tuple according to the aggregation result. The application improves ironic intention recognition accuracy and has good interpretability.

Description

Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
Technical Field
The application belongs to the technical field of data identification, and particularly relates to a multi-mode ironic intention identification method, device and equipment based on a multi-view CLIP.
Background
Irony is a rhetorical device that expresses strong discontent, opposition or mockery through techniques such as double entendre, exaggeration and metaphor. Irony has a long history in human society: from drama in the ancient Greek period to modern cartoons and internet jokes, it has always been an important way for people to voice criticism. However, since the true emotion an ironic utterance intends to express may be contrary to its surface wording, conventional emotion analysis methods may misclassify the emotion of ironic text, which affects their accuracy. Ironic intent recognition can therefore help identify the true emotion contained in the information, benefiting tasks such as emotion analysis and opinion mining.
The meaning of irony must often be understood from context, which tends to be multi-layered, implicit and ambiguous. This makes irony difficult to understand and identify accurately. In addition, the rhetorical techniques commonly employed in irony further increase the difficulty of identification. In recent years, owing to the rapid development of social media, multimodal irony recognition, which aims to recognize ironic emotion in multimodal scenes, has attracted increasing research attention. Unlike traditional text-based ironic recognition methods, multimodal ironic recognition comprehensively utilizes information from multiple modalities for feature fusion, adapts to the various manifestations of irony, and achieves more accurate and comprehensive performance in ironic recognition tasks.
With the rapid development of deep neural networks, multimodal irony recognition has achieved significant results. Numerous multimodal irony recognition techniques exist, including explicitly concatenating text features and image features, implicitly employing an attention mechanism to merge features from different modalities, graph-based approaches, and the like. However, whether the results of these models faithfully reflect their multimodal understanding capabilities remains questionable. In fact, when a text-only model is applied to multimodal irony recognition, its performance can significantly exceed that of the current most advanced multimodal models. This suggests that the performance of current multimodal irony recognition models may rely heavily on spurious cues in the text data rather than on genuinely capturing the relationships between the different modalities that constitute the essential features of irony.
Disclosure of Invention
The application provides a multi-modal ironic intention recognition method, device and equipment based on multi-view CLIP, which use the information provided by the text view, the image view and the text-image interaction view to capture the interaction relations between text and image, complete multi-modal ironic intention recognition, and achieve high recognition accuracy.
In order to achieve the technical purpose, the application adopts the following technical scheme:
a multi-modal irony intent recognition method based on multi-view CLIP, comprising:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple, wherein a CLIP model is adopted for encoding to obtain a text information vector representation and an image information vector representation respectively, which are decoded to obtain ironic score distributions based on the text view and the image view respectively;
step 2, splicing the text information vector representation and the image information vector representation obtained by the encoding in step 1, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
step 3, aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view obtained in steps 1 and 2, and obtaining the ironic intention recognition result of the data tuple according to the aggregation result.
Further, the text information is encoded using the CLIP model to obtain a vector representation T of the text information, as shown in formula (1):
$T = (t_1, t_2, \ldots, t_n, t_{CLS}) = \mathcal{E}_t(x)$ (1)
where x is the text information in the data tuple, $\mathcal{E}_t$ denotes the text encoder in the CLIP model, n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in x, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x;
a linear classifier is used to linearly transform $t_{CLS}$ and map it to the ironic score distribution $y_t$ based on the text view, as shown in formula (2):
$y_t = \mathrm{softmax}(W_t t_{CLS} + b_t)$ (2)
where $W_t$ and $b_t$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the text semantic information $t_{CLS}$.
Further, the image information is encoded using the CLIP model to obtain a vector representation I of the image information, as shown in formula (3):
$I = (v_{CLS}, v_1, v_2, \ldots, v_m) = \mathcal{E}_v(y)$ (3)
where y is the image information in the data tuple, $\mathcal{E}_v$ denotes the image encoder in the CLIP model, $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
a linear classifier is used to linearly transform $v_{CLS}$ and map it to the ironic score distribution $y_v$ based on the image view, as shown in formula (4):
$y_v = \mathrm{softmax}(W_v v_{CLS} + b_v)$ (4)
where $W_v$ and $b_v$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the image semantic information $v_{CLS}$.
Further, step 2 includes:
firstly, splicing the text information vector representation and the image information vector representation obtained by encoding to obtain the composite image-text feature vector F, namely:
$F = (v_{CLS}, v_1, \ldots, v_m, t_1, \ldots, t_n, t_{CLS}) = \mathrm{Concat}(T, I)$
where T and I are respectively the text information vector representation and the image information vector representation, and Concat(T, I) denotes the splicing operation; n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in the text information, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x; $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
then, performing feature fusion on the composite feature vector F with a Transformer: the internal self-attention mechanism passes F through different linear transformations to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated composite vector $\tilde{F}$ is obtained, as shown in formula (5):
$\tilde{F} = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (5)
where $d_k$ is the dimension to which K and V are mapped by the linear transformations;
after the updated composite feature vector $\tilde{F} = (\tilde{v}_{CLS}, \tilde{v}_1, \ldots, \tilde{v}_m, \tilde{t}_1, \ldots, \tilde{t}_n, \tilde{t}_{CLS})$ of the image and text information is obtained, a key-less attention mechanism is used to further fuse $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ into the feature vector f of the text-image interaction, as shown in formulas (6) and (7):
$(p_t, p_v) = \mathrm{softmax}\big(W\tilde{t}_{CLS} + b,\ W\tilde{v}_{CLS} + b\big)$ (6)
$f = p_t\,\tilde{t}_{CLS} + p_v\,\tilde{v}_{CLS}$ (7)
where $p_t$ and $p_v$ are the attention weights corresponding to $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ respectively, and W and b are respectively the weight matrix and bias parameters of the linear classifier;
finally, decoding the feature vector f of the text-image interaction, i.e., linearly transforming f and mapping it to obtain the ironic score distribution $y_f$ based on the text-image interaction view, as shown in formula (8):
$y_f = \mathrm{softmax}(W_f f + b_f)$ (8)
where $W_f$ and $b_f$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the feature vector f.
Further, in step 3, a late-fusion method is adopted to aggregate the 3 ironic score distributions into the multi-view ironic score distribution $y_o$, as shown in formula (9):
$y_o = y_t + y_v + y_f$ (9)
where $y_t$, $y_v$ and $y_f$ are respectively the ironic score distributions based on the text view, the image view and the text-image interaction view;
the index with the higher probability in the ironic score distribution $y_o$ is then taken as the ironic intention recognition result.
A multi-modal ironic intent recognition device based on multi-view CLIP, comprising:
a text view identification module for: sequentially encoding and decoding text information in the acquired data tuples, wherein a text encoder adopting a CLIP model is used for encoding to obtain text information vector representation, and decoding to obtain ironic score distribution based on a text view;
an image view identification module for: sequentially encoding and decoding the image information in the acquired data tuples, wherein an image encoder adopting a CLIP model is used for encoding to obtain an image information vector representation, and decoding to obtain ironic score distribution based on image views;
the text and image interaction view identification module is used for: splicing the text information vector representation and the image information vector representation obtained by encoding, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
an aggregation module for: and aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained by the recognition modules, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the multi-modal ironic intent recognition method based on multi-view CLIP described above.
Advantageous effects
Compared with the prior art, the method has the following advantages:
(1) On the MMSD2.0 dataset, the accuracy of the method is improved by more than 5.6% and the F1 value by more than 7.0% compared with the prior art, demonstrating its effectiveness in integrating features from different modality views while simplifying the network architecture.
(2) The method does not require any image preprocessing step, which simplifies the training process.
(3) The method does not require a complex network structure; it naturally utilizes the knowledge in the CLIP model to perform multimodal irony recognition and naturally fuses the information provided by the different views to improve performance, and therefore possesses better interpretability.
(4) Experiments with different training scales show that the method can extract ironic cues even when the size of the training corpus is limited, exhibiting strong low-resource learning capability.
Drawings
FIG. 1 is a diagram of a system model architecture of the present application.
Fig. 2 is a real example of the embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application in detail. The embodiments are developed on the basis of the technical solution of the present application and provide detailed implementations and specific operating procedures that further explain the technical solution.
Experiments were performed on the MMSD2.0 dataset, in which ironic intent is labeled as a binary ironic/non-ironic classification. One example from the dataset is shown in FIG. 2: it carries an "ironic" label, and the corresponding text modality is "What a successful toast, it looks so delicious!"
The MMSD2.0 dataset is divided into training, validation and test sets; the test set contains 2409 data tuples of text and picture information, comprising 1037 positive examples (ironic) and 1372 negative examples (non-ironic).
The multi-modal ironic intent recognition method based on multi-view CLIP is applied to a given test set $\mathcal{D}$, where $|\mathcal{D}|$ denotes the number of samples in the test set, $|\mathcal{D}| = 2409$ in this example. Each sample in $\mathcal{D}$ is a data tuple (x, y) comprising text information and image information and is processed by the following steps, as shown in FIG. 1:
Step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple, wherein the CLIP model is adopted for encoding to obtain the text information vector representation and the image information vector representation respectively, which are decoded to obtain the ironic score distributions based on the text view and the image view respectively.
(1) Ironic recognition based on text views
The text encoder $\mathcal{E}_t$ of the CLIP model encodes the text information to obtain the vector representation T of the text information, as shown in formula (1):
$T = (t_1, t_2, \ldots, t_n, t_{CLS}) = \mathcal{E}_t(x)$ (1)
where x is the text information in the data tuple, $\mathcal{E}_t$ denotes the text encoder in the CLIP model, n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in x, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x.
A linear classifier is used to linearly transform $t_{CLS}$ and map it to the ironic score distribution $y_t$ based on the text view, as shown in formula (2):
$y_t = \mathrm{softmax}(W_t t_{CLS} + b_t)$ (2)
where $W_t$ and $b_t$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the text semantic information $t_{CLS}$.
(2) Ironic recognition based on image views
The image encoder $\mathcal{E}_v$ of the CLIP model encodes the image information to obtain the vector representation I of the image information, as shown in formula (3):
$I = (v_{CLS}, v_1, v_2, \ldots, v_m) = \mathcal{E}_v(y)$ (3)
where y is the image information in the data tuple, $\mathcal{E}_v$ denotes the image encoder in the CLIP model, $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image.
A linear classifier is used to linearly transform $v_{CLS}$ and map it to the ironic score distribution $y_v$ based on the image view, as shown in formula (4):
$y_v = \mathrm{softmax}(W_v v_{CLS} + b_v)$ (4)
where $W_v$ and $b_v$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the image semantic information $v_{CLS}$.
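For illustration, the two single-modality views above can be sketched in PyTorch as follows. This is a minimal sketch, assuming the Hugging Face transformers implementation of CLIP; the checkpoint name openai/clip-vit-base-patch32, the use of pooler_output as a stand-in for $t_{CLS}$ and $v_{CLS}$, and the two-class heads are illustrative assumptions rather than details fixed by the application:

```python
# Minimal sketch of the text view and image view (formulas (1)-(4)).
# Assumptions: Hugging Face `transformers` CLIP; `pooler_output` as t_CLS / v_CLS.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One linear classifier per view: (W_t, b_t) of formula (2), (W_v, b_v) of formula (4).
text_head = nn.Linear(clip.config.text_config.hidden_size, 2)
image_head = nn.Linear(clip.config.vision_config.hidden_size, 2)

def single_view_scores(text, image):
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    # Formula (1): T = (t_1, ..., t_n, t_CLS); pooler_output stands in for t_CLS.
    t_out = clip.text_model(input_ids=inputs["input_ids"],
                            attention_mask=inputs["attention_mask"])
    # Formula (3): I = (v_CLS, v_1, ..., v_m); pooler_output stands in for v_CLS.
    v_out = clip.vision_model(pixel_values=inputs["pixel_values"])
    y_t = torch.softmax(text_head(t_out.pooler_output), dim=-1)   # formula (2)
    y_v = torch.softmax(image_head(v_out.pooler_output), dim=-1)  # formula (4)
    # The token-level sequences feed the interaction view of step 2.
    return y_t, y_v, t_out.last_hidden_state, v_out.last_hidden_state
```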
Step 2, ironic recognition based on text-image interactive views: splicing the text information vector representation and the image information vector representation obtained by the encoding in the step 1, feeding the spliced vectors into a transformer for modal fusion, determining the attention weight of the spliced vectors by adopting a key-less attention mechanism, and decoding to obtain irony score distribution based on the text and image interaction view.
Firstly, the text information vector representation and the image information vector representation obtained by encoding are spliced to obtain the composite image-text feature vector F, namely:
$F = (v_{CLS}, v_1, \ldots, v_m, t_1, \ldots, t_n, t_{CLS}) = \mathrm{Concat}(T, I)$
where Concat(T, I) denotes the splicing operation.
Then, feature fusion is performed on the composite feature vector F with a Transformer, specifically:
(1) F is passed through different linear transformations by the self-attention mechanism in the Transformer to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated composite vector $\tilde{F}$ is obtained, as shown in formula (5):
$\tilde{F} = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (5)
where $d_k$ is the dimension to which K and V are mapped by the linear transformations.
(2) After the updated composite feature vector $\tilde{F} = (\tilde{v}_{CLS}, \tilde{v}_1, \ldots, \tilde{v}_m, \tilde{t}_1, \ldots, \tilde{t}_n, \tilde{t}_{CLS})$ of the image and text information is obtained, a key-less attention mechanism is used to fuse $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ into the feature vector f of the text-image interaction, as shown in formulas (6) and (7):
$(p_t, p_v) = \mathrm{softmax}\big(W\tilde{t}_{CLS} + b,\ W\tilde{v}_{CLS} + b\big)$ (6)
$f = p_t\,\tilde{t}_{CLS} + p_v\,\tilde{v}_{CLS}$ (7)
where $p_t$ and $p_v$ are the attention weights corresponding to $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ respectively, and W and b are respectively the weight matrix and bias parameters of the linear classifier.
Finally, the feature vector f of the text-image interaction is decoded, i.e., f is linearly transformed and mapped to obtain the ironic score distribution $y_f$ based on the text-image interaction view, as shown in formula (8):
$y_f = \mathrm{softmax}(W_f f + b_f)$ (8).
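The interaction view of step 2 can likewise be sketched as a small PyTorch module, taking the token-level CLIP sequences T and I (e.g., the last_hidden_state outputs of the sketch above) as input. The single nn.TransformerEncoderLayer, the shared one-dimensional linear scorer used as the key-less attention, and the projection width d = 512 are illustrative assumptions, not hyperparameters fixed by the application:

```python
# Sketch of the text-image interaction view (formulas (5)-(8)).
import torch
import torch.nn as nn

class InteractionView(nn.Module):
    def __init__(self, d_text: int, d_image: int, d: int = 512):
        super().__init__()
        self.proj_t = nn.Linear(d_text, d)    # map word vectors into a shared space
        self.proj_v = nn.Linear(d_image, d)   # map image-block vectors into the same space
        # One encoder layer plays the role of the fusion Transformer of formula (5).
        self.fusion = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.score = nn.Linear(d, 1)          # key-less attention scorer (W, b)
        self.head = nn.Linear(d, 2)           # (W_f, b_f) of formula (8)

    def forward(self, T: torch.Tensor, I: torch.Tensor) -> torch.Tensor:
        # F = Concat(T, I) = (v_CLS, v_1, ..., v_m, t_1, ..., t_n, t_CLS)
        F = torch.cat([self.proj_v(I), self.proj_t(T)], dim=1)
        F_tilde = self.fusion(F)              # formula (5): self-attention update
        v_cls = F_tilde[:, 0]                 # updated v_CLS (first token)
        t_cls = F_tilde[:, -1]                # updated t_CLS (last token)
        # Formulas (6)-(7): key-less attention over the two updated [CLS] vectors.
        p = torch.softmax(torch.cat([self.score(t_cls), self.score(v_cls)], dim=-1), dim=-1)
        f = p[:, 0:1] * t_cls + p[:, 1:2] * v_cls
        return torch.softmax(self.head(f), dim=-1)   # formula (8)
```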
Step 3, aggregated multi-view ironic recognition: the 3 ironic score distributions based on the text view, the image view and the text-image interaction view obtained in steps 1 and 2 are aggregated, and the ironic intention recognition result of the data tuple is obtained according to the aggregation result.
Specifically, a late-fusion method is adopted to aggregate the 3 ironic score distributions into the multi-view ironic score distribution $y_o$, as shown in formula (9):
$y_o = y_t + y_v + y_f$ (9)
Then argmax is applied to the ironic score distribution $y_o$, and the index with the higher probability is taken as the ironic intention recognition result: a result of 0 indicates non-ironic and a result of 1 indicates ironic.
The embodiment of the application implements the multi-modal ironic intention recognition method based on multi-view CLIP described above. In the training stage, a joint optimization strategy is adopted to optimize the whole model simultaneously. Specifically, a standard binary cross-entropy loss, i.e., a measure of the difference between the model prediction and the true label $\hat{y}$, is computed for the image view, the text view and the image-text interaction view respectively, and the total loss $\mathcal{L}$ is obtained by accumulation, as shown in formula (10):
$\mathcal{L} = \mathrm{BCE}(y_t, \hat{y}) + \mathrm{BCE}(y_v, \hat{y}) + \mathrm{BCE}(y_f, \hat{y})$ (10)
Then, by minimizing the loss function $\mathcal{L}$, the parameters of the model are optimized using a back-propagation algorithm.
The ironic recognition accuracy of this example reached 85.64% and the F1 value reached 84.10%, with 833 true positives, 1211 true negatives, 204 false positives and 161 false negatives. FIG. 2 shows a true-positive example, in which the method identifies the ironic intent from the text and image in the sample by aggregating the text view, the image view and the image-text interaction view.
TABLE 1
As shown in Table 1, the accuracy, precision, recall and F1 values of the method on the MMSD2.0 dataset are all improved compared with the prior art, demonstrating the effectiveness of integrating features from different modality views while simplifying the network architecture.
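For reference, the accuracy, precision, recall and F1 values reported above follow the standard definitions over true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$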
The above embodiments are preferred embodiments of the present application. Various changes or modifications may be made to them by those skilled in the art without departing from the general inventive concept, and such changes and modifications should be construed as falling within the scope of protection claimed by the present application.

Claims (7)

1. A multi-modal ironic intent recognition method based on multi-view CLIP, comprising:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple, wherein a CLIP model is adopted for encoding to obtain a text information vector representation and an image information vector representation respectively, which are decoded to obtain ironic score distributions based on the text view and the image view respectively;
step 2, splicing the text information vector representation and the image information vector representation obtained by the encoding in step 1, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
step 3, aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view obtained in steps 1 and 2, and obtaining the ironic intention recognition result of the data tuple according to the aggregation result.
2. The ironic intent recognition method of claim 1, wherein the text information is encoded using the CLIP model to obtain a vector representation T of the text information, as shown in formula (1):
$T = (t_1, t_2, \ldots, t_n, t_{CLS}) = \mathcal{E}_t(x)$ (1)
where x is the text information in the data tuple, $\mathcal{E}_t$ denotes the text encoder in the CLIP model, n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in x, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x;
a linear classifier is used to linearly transform $t_{CLS}$ and map it to the ironic score distribution $y_t$ based on the text view, as shown in formula (2):
$y_t = \mathrm{softmax}(W_t t_{CLS} + b_t)$ (2)
where $W_t$ and $b_t$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the text semantic information $t_{CLS}$.
3. The ironic intent recognition method of claim 1, wherein the image information is encoded using the CLIP model to obtain a vector representation I of the image information, as shown in formula (3):
$I = (v_{CLS}, v_1, v_2, \ldots, v_m) = \mathcal{E}_v(y)$ (3)
where y is the image information in the data tuple, $\mathcal{E}_v$ denotes the image encoder in the CLIP model, $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
a linear classifier is used to linearly transform $v_{CLS}$ and map it to the ironic score distribution $y_v$ based on the image view, as shown in formula (4):
$y_v = \mathrm{softmax}(W_v v_{CLS} + b_v)$ (4)
where $W_v$ and $b_v$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the image semantic information $v_{CLS}$.
4. The ironic intent recognition method of claim 1, wherein step 2 comprises:
firstly, splicing the text information vector representation and the image information vector representation obtained by encoding to obtain the composite image-text feature vector F, namely:
$F = (v_{CLS}, v_1, \ldots, v_m, t_1, \ldots, t_n, t_{CLS}) = \mathrm{Concat}(T, I)$
where T and I are respectively the text information vector representation and the image information vector representation, and Concat(T, I) denotes the splicing operation; n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in the text information, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x; $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
then, performing feature fusion on the composite feature vector F with a Transformer: the internal self-attention mechanism passes F through different linear transformations to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated composite vector $\tilde{F}$ is obtained, as shown in formula (5):
$\tilde{F} = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (5)
where $d_k$ is the dimension to which K and V are mapped by the linear transformations;
after the updated composite feature vector $\tilde{F} = (\tilde{v}_{CLS}, \tilde{v}_1, \ldots, \tilde{v}_m, \tilde{t}_1, \ldots, \tilde{t}_n, \tilde{t}_{CLS})$ of the image and text information is obtained, a key-less attention mechanism is used to further fuse $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ into the feature vector f of the text-image interaction, as shown in formulas (6) and (7):
$(p_t, p_v) = \mathrm{softmax}\big(W\tilde{t}_{CLS} + b,\ W\tilde{v}_{CLS} + b\big)$ (6)
$f = p_t\,\tilde{t}_{CLS} + p_v\,\tilde{v}_{CLS}$ (7)
where $p_t$ and $p_v$ are the attention weights corresponding to $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ respectively, and W and b are respectively the weight matrix and bias parameters of the linear classifier;
finally, decoding the feature vector f of the text-image interaction, i.e., linearly transforming f and mapping it to obtain the ironic score distribution $y_f$ based on the text-image interaction view, as shown in formula (8):
$y_f = \mathrm{softmax}(W_f f + b_f)$ (8)
where $W_f$ and $b_f$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the feature vector f.
5. The ironic intent recognition method as claimed in claim 1, wherein in step 3 a late-fusion method is used to aggregate the 3 ironic score distributions into the multi-view ironic score distribution $y_o$, as shown in formula (9):
$y_o = y_t + y_v + y_f$ (9)
where $y_t$, $y_v$ and $y_f$ are respectively the ironic score distributions based on the text view, the image view and the text-image interaction view;
the index with the higher probability in the ironic score distribution $y_o$ is then taken as the ironic intention recognition result.
6. A multi-modal ironic intent recognition device based on multi-view CLIP, comprising:
a text view identification module for: sequentially encoding and decoding text information in the acquired data tuples, wherein a text encoder adopting a CLIP model is used for encoding to obtain text information vector representation, and decoding to obtain ironic score distribution based on a text view;
an image view identification module for: sequentially encoding and decoding the image information in the acquired data tuples, wherein an image encoder adopting a CLIP model is used for encoding to obtain an image information vector representation, and decoding to obtain ironic score distribution based on image views;
the text and image interaction view identification module is used for: splicing the text information vector representation and the image information vector representation obtained by encoding, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
an aggregation module for: and aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained by the recognition modules, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the method of any of claims 1-5.
CN202310737347.3A 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP Active CN116702091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310737347.3A CN116702091B (en) 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310737347.3A CN116702091B (en) 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Publications (2)

Publication Number Publication Date
CN116702091A (en) 2023-09-05
CN116702091B (en) 2024-03-08

Family

ID=87823664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310737347.3A Active CN116702091B (en) 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Country Status (1)

Country Link
CN (1) CN116702091B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021232589A1 (en) * 2020-05-21 2021-11-25 平安国际智慧城市科技股份有限公司 Intention identification method, apparatus and device based on attention mechanism, and storage medium
CN113837083A (en) * 2021-09-24 2021-12-24 焦点科技股份有限公司 Video segment segmentation method based on Transformer
US20230022550A1 (en) * 2021-10-12 2023-01-26 Beijing Baidu Netcom Science Technology Co., Ltd. Image processing method, method for training image processing model devices and storage medium
CN115408517A (en) * 2022-07-21 2022-11-29 中国科学院软件研究所 Knowledge injection-based multi-modal irony recognition method of double-attention network
CN115661594A (en) * 2022-10-19 2023-01-31 海南港航控股有限公司 Image-text multi-mode feature representation method and system based on alignment and fusion
CN115661713A (en) * 2022-11-01 2023-01-31 华南农业大学 Suckling piglet counting method based on self-attention spatiotemporal feature fusion
CN116028846A (en) * 2022-12-20 2023-04-28 北京信息科技大学 Multi-mode emotion analysis method integrating multi-feature and attention mechanisms
CN116259075A (en) * 2023-01-16 2023-06-13 安徽大学 Pedestrian attribute identification method based on prompt fine tuning pre-training large model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BIN LIANG et al.: "Multimodal sarcasm detection via cross-modal graph convolutional network", Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 31 December 2022 (2022-12-31), page 1767 *
SUN Yuchong et al.: "Research on Text Embedding Differences between Multimodal and Text Pre-trained Models" (in Chinese), Journal of Peking University (Natural Science Edition), 24 August 2022 (2022-08-24), pages 1-11 *
ZHANG Pengfei; LI Guanyu; JIA Caiyan: "Truncated Gaussian Distance-based Self-attention Mechanism for Natural Language Inference" (in Chinese), Computer Science, no. 04, 30 April 2020 (2020-04-30), pages 178-183 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701637A (en) * 2023-06-29 2023-09-05 中南大学 Zero sample text classification method, system and medium based on CLIP
CN116701637B (en) * 2023-06-29 2024-03-08 中南大学 Zero sample text classification method, system and medium based on CLIP
CN117371456A (en) * 2023-10-10 2024-01-09 国网江苏省电力有限公司南通供电分公司 Multi-mode irony detection method and system based on feature fusion
CN117892205A (en) * 2024-03-15 2024-04-16 华南师范大学 Multi-modal irony detection method, apparatus, device and storage medium
CN117892205B (en) * 2024-03-15 2024-07-09 华南师范大学 Multi-modal irony detection method, apparatus, device and storage medium
CN118093896A (en) * 2024-04-12 2024-05-28 中国科学技术大学 Ironic detection method, ironic detection device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116702091B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN110888980B (en) Knowledge enhancement-based implicit chapter relation recognition method for attention neural network
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN110866129A (en) Cross-media retrieval method based on cross-media uniform characterization model
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN113392265A (en) Multimedia processing method, device and equipment
CN114764566B (en) Knowledge element extraction method for aviation field
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN114691864A (en) Text classification model training method and device and text classification method and device
Sargar et al. Image captioning methods and metrics
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN115408488A (en) Segmentation method and system for novel scene text
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN116822513A (en) Named entity identification method integrating entity types and keyword features
CN116756363A (en) Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN114996442A (en) Text abstract generation system combining abstract degree judgment and abstract optimization
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant