CN116702091A - Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP - Google Patents

Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Info

Publication number
CN116702091A
Authority
CN
China
Prior art keywords
text
image
view
ironic
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310737347.3A
Other languages
Chinese (zh)
Other versions
CN116702091B (en)
Inventor
覃立波
周璟轩
黄仕爵
陈麒光
蔡晨冉
张钰迪
梁斌
车万翔
徐睿峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310737347.3A priority Critical patent/CN116702091B/en
Publication of CN116702091A publication Critical patent/CN116702091A/en
Application granted granted Critical
Publication of CN116702091B publication Critical patent/CN116702091B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a multi-modal ironic intention recognition method, device and equipment based on multi-view CLIP, wherein the method comprises the following steps: sequentially encoding and decoding the text information and the image information in a data tuple, wherein a CLIP model is adopted for encoding to obtain the respective vector representations of the text and the image, which are decoded to obtain ironic score distributions based on the text view and the image view respectively; splicing the encoded vector representations of the text and the image, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights of text and image with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view; and aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view, and obtaining the ironic intention recognition result of the data tuple according to the aggregation result. The application improves ironic intention recognition accuracy and has good interpretability.

Description

Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
Technical Field
The application belongs to the technical field of data identification, and particularly relates to a multi-mode ironic intention identification method, device and equipment based on a multi-view CLIP.
Background
Irony is a rhetorical device that expresses strong discontent, opposition or mockery through techniques such as double entendre, exaggeration and metaphor. Irony has a long history in human society: from drama in the ancient Greek period to modern cartoons and internet jokes, it has always been an important way for people to voice criticism. However, since the true emotion an ironic utterance intends to express may be contrary to its surface wording, conventional emotion analysis methods may misclassify the emotion of ironic text, which affects their accuracy. Ironic intent recognition can therefore help identify the true emotion contained in the information, benefiting tasks such as emotion analysis and opinion mining.
The meaning of irony must often be understood from context, which tends to be multi-layered, implicit and ambiguous. This makes irony difficult to understand and identify accurately. In addition, the rhetorical techniques commonly employed in irony further increase the difficulty of identification. In recent years, owing to the rapid development of social media, multimodal irony recognition, which aims to recognize ironic emotion in multimodal scenes, has attracted increasing research attention. Unlike traditional text-based ironic recognition methods, multimodal ironic recognition comprehensively utilizes information from multiple modalities for feature fusion, adapts to the various manifestations of irony, and achieves more accurate and comprehensive performance in ironic recognition tasks.
With the rapid development of deep neural networks, multimodal irony recognition has achieved significant results. Numerous multimodal irony recognition techniques exist, including explicitly concatenating text features and image features, implicitly employing an attention mechanism to merge features from different modalities, graph-based approaches, and the like. However, whether the results of these models faithfully reflect their multimodal understanding capabilities remains questionable. In fact, when a text-only model is applied to multimodal irony recognition, its performance can significantly exceed that of the current most advanced multimodal models. This suggests that the performance of current multimodal irony recognition models may rely heavily on spurious cues in the text data rather than on genuinely capturing the relationships between the different modalities that constitute the essential features of irony.
Disclosure of Invention
The application provides a multi-modal ironic intention recognition method, device and equipment based on multi-view CLIP, which use the information provided by the text view, the image view and the text-image interaction view to capture the interaction relations between text and image, complete multi-modal ironic intention recognition, and achieve high recognition accuracy.
In order to achieve the technical purpose, the application adopts the following technical scheme:
a multi-modal irony intent recognition method based on multi-view CLIP, comprising:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple, wherein a CLIP model is adopted for encoding to obtain a text information vector representation and an image information vector representation respectively, which are decoded to obtain ironic score distributions based on the text view and the image view respectively;
step 2, splicing the text information vector representation and the image information vector representation obtained by the encoding in step 1, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
step 3, aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view obtained in steps 1 and 2, and obtaining the ironic intention recognition result of the data tuple according to the aggregation result.
Further, the text information is encoded using the CLIP model to obtain a vector representation T of the text information, as shown in formula (1):
$T = (t_1, t_2, \ldots, t_n, t_{CLS}) = \mathcal{E}_t(x)$ (1)
where x is the text information in the data tuple, $\mathcal{E}_t$ denotes the text encoder in the CLIP model, n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in x, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x;
a linear classifier is used to linearly transform $t_{CLS}$ and map it to the ironic score distribution $y_t$ based on the text view, as shown in formula (2):
$y_t = \mathrm{softmax}(W_t t_{CLS} + b_t)$ (2)
where $W_t$ and $b_t$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the text semantic information $t_{CLS}$.
Further, the image information is encoded using the CLIP model to obtain a vector representation I of the image information, as shown in formula (3):
$I = (v_{CLS}, v_1, v_2, \ldots, v_m) = \mathcal{E}_v(y)$ (3)
where y is the image information in the data tuple, $\mathcal{E}_v$ denotes the image encoder in the CLIP model, $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
a linear classifier is used to linearly transform $v_{CLS}$ and map it to the ironic score distribution $y_v$ based on the image view, as shown in formula (4):
$y_v = \mathrm{softmax}(W_v v_{CLS} + b_v)$ (4)
where $W_v$ and $b_v$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the image semantic information $v_{CLS}$.
Further, step 2 includes:
firstly, splicing the text information vector representation and the image information vector representation obtained by encoding to obtain the composite image-text feature vector F, namely:
$F = (v_{CLS}, v_1, \ldots, v_m, t_1, \ldots, t_n, t_{CLS}) = \mathrm{Concat}(T, I)$
where T and I are respectively the text information vector representation and the image information vector representation, and Concat(T, I) denotes the splicing operation; n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in the text information, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x; $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
then, performing feature fusion on the composite feature vector F with a Transformer: the internal self-attention mechanism passes F through different linear transformations to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated composite vector $\tilde{F}$ is obtained, as shown in formula (5):
$\tilde{F} = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (5)
where $d_k$ is the dimension to which K and V are mapped by the linear transformations;
after the updated composite feature vector $\tilde{F} = (\tilde{v}_{CLS}, \tilde{v}_1, \ldots, \tilde{v}_m, \tilde{t}_1, \ldots, \tilde{t}_n, \tilde{t}_{CLS})$ of the image and text information is obtained, a key-less attention mechanism is used to further fuse $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ into the feature vector f of the text-image interaction, as shown in formulas (6) and (7):
$(p_t, p_v) = \mathrm{softmax}\big(W\tilde{t}_{CLS} + b,\ W\tilde{v}_{CLS} + b\big)$ (6)
$f = p_t\,\tilde{t}_{CLS} + p_v\,\tilde{v}_{CLS}$ (7)
where $p_t$ and $p_v$ are the attention weights corresponding to $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ respectively, and W and b are respectively the weight matrix and bias parameters of the linear classifier;
finally, decoding the feature vector f of the text-image interaction, i.e., linearly transforming f and mapping it to obtain the ironic score distribution $y_f$ based on the text-image interaction view, as shown in formula (8):
$y_f = \mathrm{softmax}(W_f f + b_f)$ (8)
where $W_f$ and $b_f$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the feature vector f.
Further, in step 3, a late-fusion method is adopted to aggregate the 3 ironic score distributions into the multi-view ironic score distribution $y_o$, as shown in formula (9):
$y_o = y_t + y_v + y_f$ (9)
where $y_t$, $y_v$ and $y_f$ are respectively the ironic score distributions based on the text view, the image view and the text-image interaction view;
the index with the higher probability in the ironic score distribution $y_o$ is then taken as the ironic intention recognition result.
A multi-modal ironic intent recognition device based on multi-view CLIP, comprising:
a text view identification module for: sequentially encoding and decoding text information in the acquired data tuples, wherein a text encoder adopting a CLIP model is used for encoding to obtain text information vector representation, and decoding to obtain ironic score distribution based on a text view;
an image view identification module for: sequentially encoding and decoding the image information in the acquired data tuples, wherein an image encoder adopting a CLIP model is used for encoding to obtain an image information vector representation, and decoding to obtain ironic score distribution based on image views;
the text and image interaction view identification module is used for: splicing the text information vector representation and the image information vector representation obtained by encoding, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
an aggregation module for: and aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained by the recognition modules, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the multi-modal ironic intent recognition method based on multi-view CLIP described above.
Advantageous effects
Compared with the prior art, the method has the following advantages:
(1) On the MMSD2.0 dataset, the accuracy of the method is improved by more than 5.6% and the F1 value by more than 7.0% compared with the prior art, demonstrating its effectiveness in integrating features from different modality views while simplifying the network architecture.
(2) The method does not require any image preprocessing step, which simplifies the training process.
(3) The method does not require a complex network structure; it naturally utilizes the knowledge in the CLIP model to perform multimodal irony recognition and naturally fuses the information provided by the different views to improve performance, and therefore possesses better interpretability.
(4) Experiments with different training scales show that the method can extract ironic cues even when the size of the training corpus is limited, exhibiting strong low-resource learning capability.
Drawings
FIG. 1 is a diagram of a system model architecture of the present application.
Fig. 2 is a real example of the embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application in detail. The embodiments are developed on the basis of the technical solution of the present application and provide detailed implementations and specific operating procedures that further explain the technical solution.
Experiments were performed on the MMSD2.0 dataset, in which ironic intent is labeled as a binary ironic/non-ironic classification. One example from the dataset is shown in FIG. 2: it carries an "ironic" label, and the corresponding text modality is "What a successful toast, it looks so delicious!"
The MMSD2.0 dataset is divided into training, validation and test sets; the test set contains 2409 data tuples of text and picture information, comprising 1037 positive examples (ironic) and 1372 negative examples (non-ironic).
The multi-modal ironic intent recognition method based on multi-view CLIP is applied to a given test set $\mathcal{D}$, where $|\mathcal{D}|$ denotes the number of samples in the test set, $|\mathcal{D}| = 2409$ in this example. Each sample in $\mathcal{D}$ is a data tuple (x, y) comprising text information and image information and is processed by the following steps, as shown in FIG. 1:
Step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple, wherein the CLIP model is adopted for encoding to obtain the text information vector representation and the image information vector representation respectively, which are decoded to obtain the ironic score distributions based on the text view and the image view respectively.
(1) Ironic recognition based on text views
The text encoder $\mathcal{E}_t$ of the CLIP model encodes the text information to obtain the vector representation T of the text information, as shown in formula (1):
$T = (t_1, t_2, \ldots, t_n, t_{CLS}) = \mathcal{E}_t(x)$ (1)
where x is the text information in the data tuple, $\mathcal{E}_t$ denotes the text encoder in the CLIP model, n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in x, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x.
A linear classifier is used to linearly transform $t_{CLS}$ and map it to the ironic score distribution $y_t$ based on the text view, as shown in formula (2):
$y_t = \mathrm{softmax}(W_t t_{CLS} + b_t)$ (2)
where $W_t$ and $b_t$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the text semantic information $t_{CLS}$.
(2) Ironic recognition based on image views
The image encoder $\mathcal{E}_v$ of the CLIP model encodes the image information to obtain the vector representation I of the image information, as shown in formula (3):
$I = (v_{CLS}, v_1, v_2, \ldots, v_m) = \mathcal{E}_v(y)$ (3)
where y is the image information in the data tuple, $\mathcal{E}_v$ denotes the image encoder in the CLIP model, $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image.
A linear classifier is used to linearly transform $v_{CLS}$ and map it to the ironic score distribution $y_v$ based on the image view, as shown in formula (4):
$y_v = \mathrm{softmax}(W_v v_{CLS} + b_v)$ (4)
where $W_v$ and $b_v$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the image semantic information $v_{CLS}$.
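For illustration, the two single-modality views above can be sketched in PyTorch as follows. This is a minimal sketch, assuming the Hugging Face transformers implementation of CLIP; the checkpoint name openai/clip-vit-base-patch32, the use of pooler_output as a stand-in for $t_{CLS}$ and $v_{CLS}$, and the two-class heads are illustrative assumptions rather than details fixed by the application:

```python
# Minimal sketch of the text view and image view (formulas (1)-(4)).
# Assumptions: Hugging Face `transformers` CLIP; `pooler_output` as t_CLS / v_CLS.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One linear classifier per view: (W_t, b_t) of formula (2), (W_v, b_v) of formula (4).
text_head = nn.Linear(clip.config.text_config.hidden_size, 2)
image_head = nn.Linear(clip.config.vision_config.hidden_size, 2)

def single_view_scores(text, image):
    inputs = processor(text=[text], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    # Formula (1): T = (t_1, ..., t_n, t_CLS); pooler_output stands in for t_CLS.
    t_out = clip.text_model(input_ids=inputs["input_ids"],
                            attention_mask=inputs["attention_mask"])
    # Formula (3): I = (v_CLS, v_1, ..., v_m); pooler_output stands in for v_CLS.
    v_out = clip.vision_model(pixel_values=inputs["pixel_values"])
    y_t = torch.softmax(text_head(t_out.pooler_output), dim=-1)   # formula (2)
    y_v = torch.softmax(image_head(v_out.pooler_output), dim=-1)  # formula (4)
    # The token-level sequences feed the interaction view of step 2.
    return y_t, y_v, t_out.last_hidden_state, v_out.last_hidden_state
```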
Step 2, ironic recognition based on text-image interactive views: splicing the text information vector representation and the image information vector representation obtained by the encoding in the step 1, feeding the spliced vectors into a transformer for modal fusion, determining the attention weight of the spliced vectors by adopting a key-less attention mechanism, and decoding to obtain irony score distribution based on the text and image interaction view.
Firstly, the text information vector representation and the image information vector representation obtained by encoding are spliced to obtain the composite image-text feature vector F, namely:
$F = (v_{CLS}, v_1, \ldots, v_m, t_1, \ldots, t_n, t_{CLS}) = \mathrm{Concat}(T, I)$
where Concat(T, I) denotes the splicing operation.
Then, feature fusion is performed on the composite feature vector F with a Transformer, specifically:
(1) F is passed through different linear transformations by the self-attention mechanism in the Transformer to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated composite vector $\tilde{F}$ is obtained, as shown in formula (5):
$\tilde{F} = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (5)
where $d_k$ is the dimension to which K and V are mapped by the linear transformations.
(2) After the updated composite feature vector $\tilde{F} = (\tilde{v}_{CLS}, \tilde{v}_1, \ldots, \tilde{v}_m, \tilde{t}_1, \ldots, \tilde{t}_n, \tilde{t}_{CLS})$ of the image and text information is obtained, a key-less attention mechanism is used to fuse $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ into the feature vector f of the text-image interaction, as shown in formulas (6) and (7):
$(p_t, p_v) = \mathrm{softmax}\big(W\tilde{t}_{CLS} + b,\ W\tilde{v}_{CLS} + b\big)$ (6)
$f = p_t\,\tilde{t}_{CLS} + p_v\,\tilde{v}_{CLS}$ (7)
where $p_t$ and $p_v$ are the attention weights corresponding to $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ respectively, and W and b are respectively the weight matrix and bias parameters of the linear classifier.
Finally, the feature vector f of the text-image interaction is decoded, i.e., f is linearly transformed and mapped to obtain the ironic score distribution $y_f$ based on the text-image interaction view, as shown in formula (8):
$y_f = \mathrm{softmax}(W_f f + b_f)$ (8).
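The interaction view of step 2 can likewise be sketched as a small PyTorch module, taking the token-level CLIP sequences T and I (e.g., the last_hidden_state outputs of the sketch above) as input. The single nn.TransformerEncoderLayer, the shared one-dimensional linear scorer used as the key-less attention, and the projection width d = 512 are illustrative assumptions, not hyperparameters fixed by the application:

```python
# Sketch of the text-image interaction view (formulas (5)-(8)).
import torch
import torch.nn as nn

class InteractionView(nn.Module):
    def __init__(self, d_text: int, d_image: int, d: int = 512):
        super().__init__()
        self.proj_t = nn.Linear(d_text, d)    # map word vectors into a shared space
        self.proj_v = nn.Linear(d_image, d)   # map image-block vectors into the same space
        # One encoder layer plays the role of the fusion Transformer of formula (5).
        self.fusion = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.score = nn.Linear(d, 1)          # key-less attention scorer (W, b)
        self.head = nn.Linear(d, 2)           # (W_f, b_f) of formula (8)

    def forward(self, T: torch.Tensor, I: torch.Tensor) -> torch.Tensor:
        # F = Concat(T, I) = (v_CLS, v_1, ..., v_m, t_1, ..., t_n, t_CLS)
        F = torch.cat([self.proj_v(I), self.proj_t(T)], dim=1)
        F_tilde = self.fusion(F)              # formula (5): self-attention update
        v_cls = F_tilde[:, 0]                 # updated v_CLS (first token)
        t_cls = F_tilde[:, -1]                # updated t_CLS (last token)
        # Formulas (6)-(7): key-less attention over the two updated [CLS] vectors.
        p = torch.softmax(torch.cat([self.score(t_cls), self.score(v_cls)], dim=-1), dim=-1)
        f = p[:, 0:1] * t_cls + p[:, 1:2] * v_cls
        return torch.softmax(self.head(f), dim=-1)   # formula (8)
```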
Step 3, aggregated multi-view ironic recognition: the 3 ironic score distributions based on the text view, the image view and the text-image interaction view obtained in steps 1 and 2 are aggregated, and the ironic intention recognition result of the data tuple is obtained according to the aggregation result.
Specifically, a late-fusion method is adopted to aggregate the 3 ironic score distributions into the multi-view ironic score distribution $y_o$, as shown in formula (9):
$y_o = y_t + y_v + y_f$ (9)
Then argmax is applied to the ironic score distribution $y_o$, and the index with the higher probability is taken as the ironic intention recognition result: a result of 0 indicates non-ironic and a result of 1 indicates ironic.
The embodiment of the application implements the multi-modal ironic intention recognition method based on multi-view CLIP described above. In the training stage, a joint optimization strategy is adopted to optimize the whole model simultaneously. Specifically, a standard binary cross-entropy loss, i.e., a measure of the difference between the model prediction and the true label $\hat{y}$, is computed for the image view, the text view and the image-text interaction view respectively, and the total loss $\mathcal{L}$ is obtained by accumulation, as shown in formula (10):
$\mathcal{L} = \mathrm{BCE}(y_t, \hat{y}) + \mathrm{BCE}(y_v, \hat{y}) + \mathrm{BCE}(y_f, \hat{y})$ (10)
Then, by minimizing the loss function $\mathcal{L}$, the parameters of the model are optimized using a back-propagation algorithm.
The ironic recognition accuracy of this example reached 85.64% and the F1 value reached 84.10%, with 833 true positives, 1211 true negatives, 204 false positives and 161 false negatives. FIG. 2 shows a true-positive example, in which the method identifies the ironic intent from the text and image in the sample by aggregating the text view, the image view and the image-text interaction view.
TABLE 1
As shown in Table 1, the accuracy, precision, recall and F1 values of the method on the MMSD2.0 dataset are all improved compared with the prior art, demonstrating the effectiveness of integrating features from different modality views while simplifying the network architecture.
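For reference, the accuracy, precision, recall and F1 values reported above follow the standard definitions over true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$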
The above embodiments are preferred embodiments of the present application. Various changes or modifications may be made to them by those skilled in the art without departing from the general inventive concept, and such changes and modifications should be construed as falling within the scope of protection claimed by the present application.

Claims (7)

1. A multi-modal ironic intent recognition method based on multi-view CLIP, comprising:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple, wherein a CLIP model is adopted for encoding to obtain a text information vector representation and an image information vector representation respectively, which are decoded to obtain ironic score distributions based on the text view and the image view respectively;
step 2, splicing the text information vector representation and the image information vector representation obtained by the encoding in step 1, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
step 3, aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view obtained in steps 1 and 2, and obtaining the ironic intention recognition result of the data tuple according to the aggregation result.
2. The ironic intent recognition method of claim 1, wherein the text information is encoded using the CLIP model to obtain a vector representation T of the text information, as shown in formula (1):
$T = (t_1, t_2, \ldots, t_n, t_{CLS}) = \mathcal{E}_t(x)$ (1)
where x is the text information in the data tuple, $\mathcal{E}_t$ denotes the text encoder in the CLIP model, n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in x, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x;
a linear classifier is used to linearly transform $t_{CLS}$ and map it to the ironic score distribution $y_t$ based on the text view, as shown in formula (2):
$y_t = \mathrm{softmax}(W_t t_{CLS} + b_t)$ (2)
where $W_t$ and $b_t$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the text semantic information $t_{CLS}$.
3. The ironic intent recognition method of claim 1, wherein the image information is encoded using the CLIP model to obtain a vector representation I of the image information, as shown in formula (3):
$I = (v_{CLS}, v_1, v_2, \ldots, v_m) = \mathcal{E}_v(y)$ (3)
where y is the image information in the data tuple, $\mathcal{E}_v$ denotes the image encoder in the CLIP model, $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
a linear classifier is used to linearly transform $v_{CLS}$ and map it to the ironic score distribution $y_v$ based on the image view, as shown in formula (4):
$y_v = \mathrm{softmax}(W_v v_{CLS} + b_v)$ (4)
where $W_v$ and $b_v$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the image semantic information $v_{CLS}$.
4. The ironic intent recognition method of claim 1, wherein step 2 comprises:
firstly, splicing the text information vector representation and the image information vector representation obtained by encoding to obtain the composite image-text feature vector F, namely:
$F = (v_{CLS}, v_1, \ldots, v_m, t_1, \ldots, t_n, t_{CLS}) = \mathrm{Concat}(T, I)$
where T and I are respectively the text information vector representation and the image information vector representation, and Concat(T, I) denotes the splicing operation; n is the sequence length of the text x, $t_i$ is the vector representation of the i-th word in the text information, and $t_{CLS}$ is the vector representation of the semantic information of the entire text x; $v_{CLS}$ is the vector representation of the whole image, m is the number of blocks of the image y, and $v_i$ is the vector representation of the i-th block of the image;
then, performing feature fusion on the composite feature vector F with a Transformer: the internal self-attention mechanism passes F through different linear transformations to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated composite vector $\tilde{F}$ is obtained, as shown in formula (5):
$\tilde{F} = \mathrm{softmax}\!\left(QK^{T}/\sqrt{d_k}\right)V$ (5)
where $d_k$ is the dimension to which K and V are mapped by the linear transformations;
after the updated composite feature vector $\tilde{F} = (\tilde{v}_{CLS}, \tilde{v}_1, \ldots, \tilde{v}_m, \tilde{t}_1, \ldots, \tilde{t}_n, \tilde{t}_{CLS})$ of the image and text information is obtained, a key-less attention mechanism is used to further fuse $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ into the feature vector f of the text-image interaction, as shown in formulas (6) and (7):
$(p_t, p_v) = \mathrm{softmax}\big(W\tilde{t}_{CLS} + b,\ W\tilde{v}_{CLS} + b\big)$ (6)
$f = p_t\,\tilde{t}_{CLS} + p_v\,\tilde{v}_{CLS}$ (7)
where $p_t$ and $p_v$ are the attention weights corresponding to $\tilde{t}_{CLS}$ and $\tilde{v}_{CLS}$ respectively, and W and b are respectively the weight matrix and bias parameters of the linear classifier;
finally, decoding the feature vector f of the text-image interaction, i.e., linearly transforming f and mapping it to obtain the ironic score distribution $y_f$ based on the text-image interaction view, as shown in formula (8):
$y_f = \mathrm{softmax}(W_f f + b_f)$ (8)
where $W_f$ and $b_f$ are respectively the weight matrix and bias parameters of the linear classifier that decodes the feature vector f.
5. The ironic intent recognition method as claimed in claim 1, wherein in step 3 a late-fusion method is used to aggregate the 3 ironic score distributions into the multi-view ironic score distribution $y_o$, as shown in formula (9):
$y_o = y_t + y_v + y_f$ (9)
where $y_t$, $y_v$ and $y_f$ are respectively the ironic score distributions based on the text view, the image view and the text-image interaction view;
the index with the higher probability in the ironic score distribution $y_o$ is then taken as the ironic intention recognition result.
6. A multi-modal ironic intent recognition device based on multi-view CLIP, comprising:
a text view identification module for: sequentially encoding and decoding text information in the acquired data tuples, wherein a text encoder adopting a CLIP model is used for encoding to obtain text information vector representation, and decoding to obtain ironic score distribution based on a text view;
an image view identification module for: sequentially encoding and decoding the image information in the acquired data tuples, wherein an image encoder adopting a CLIP model is used for encoding to obtain an image information vector representation, and decoding to obtain ironic score distribution based on image views;
the text and image interaction view identification module is used for: splicing the text information vector representation and the image information vector representation obtained by encoding, feeding the spliced vectors into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
an aggregation module for: and aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained by the recognition modules, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the method of any of claims 1-5.
CN202310737347.3A 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP Active CN116702091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310737347.3A CN116702091B (en) 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310737347.3A CN116702091B (en) 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Publications (2)

Publication Number Publication Date
CN116702091A (en) 2023-09-05
CN116702091B (en) 2024-03-08

Family

ID=87823664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310737347.3A Active CN116702091B (en) 2023-06-21 2023-06-21 Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP

Country Status (1)

Country Link
CN (1) CN116702091B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021232589A1 (en) * 2020-05-21 2021-11-25 平安国际智慧城市科技股份有限公司 Intention identification method, apparatus and device based on attention mechanism, and storage medium
CN113837083A (en) * 2021-09-24 2021-12-24 焦点科技股份有限公司 Video segment segmentation method based on Transformer
US20230022550A1 (en) * 2021-10-12 2023-01-26 Beijing Baidu Netcom Science Technology Co., Ltd. Image processing method, method for training image processing model devices and storage medium
CN115408517A (en) * 2022-07-21 2022-11-29 中国科学院软件研究所 Knowledge injection-based multi-modal irony recognition method of double-attention network
CN115661594A (en) * 2022-10-19 2023-01-31 海南港航控股有限公司 Image-text multi-mode feature representation method and system based on alignment and fusion
CN115661713A (en) * 2022-11-01 2023-01-31 华南农业大学 Suckling piglet counting method based on self-attention spatiotemporal feature fusion
CN116028846A (en) * 2022-12-20 2023-04-28 北京信息科技大学 Multi-mode emotion analysis method integrating multi-feature and attention mechanisms
CN116259075A (en) * 2023-01-16 2023-06-13 安徽大学 Pedestrian attribute identification method based on prompt fine tuning pre-training large model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BIN LIANG et al.: "Multimodal sarcasm detection via cross-modal graph convolutional network", Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 31 December 2022 (2022-12-31), page 1767 *
SUN Yuchong et al.: "Research on Text Embedding Differences between Multimodal and Text Pre-trained Models" (in Chinese), Journal of Peking University (Natural Science Edition), 24 August 2022 (2022-08-24), pages 1-11 *
ZHANG Pengfei; LI Guanyu; JIA Caiyan: "Truncated Gaussian Distance-based Self-attention Mechanism for Natural Language Inference" (in Chinese), Computer Science, no. 04, 30 April 2020 (2020-04-30), pages 178-183 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701637A (en) * 2023-06-29 2023-09-05 中南大学 Zero sample text classification method, system and medium based on CLIP
CN116701637B (en) * 2023-06-29 2024-03-08 中南大学 Zero sample text classification method, system and medium based on CLIP
CN117371456A (en) * 2023-10-10 2024-01-09 国网江苏省电力有限公司南通供电分公司 Multi-mode irony detection method and system based on feature fusion
CN117892205A (en) * 2024-03-15 2024-04-16 华南师范大学 Multi-modal irony detection method, apparatus, device and storage medium
CN117892205B (en) * 2024-03-15 2024-07-09 华南师范大学 Multi-modal irony detection method, apparatus, device and storage medium
CN118093896A (en) * 2024-04-12 2024-05-28 中国科学技术大学 Ironic detection method, ironic detection device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116702091B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
CN111581405A (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
CN110888980B (en) Knowledge enhancement-based implicit chapter relation recognition method for attention neural network
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN110866129A (en) Cross-media retrieval method based on cross-media uniform characterization model
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN113392265A (en) Multimedia processing method, device and equipment
CN114764566B (en) Knowledge element extraction method for aviation field
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN114691864A (en) Text classification model training method and device and text classification method and device
Sargar et al. Image captioning methods and metrics
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN115408488A (en) Segmentation method and system for novel scene text
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN116822513A (en) Named entity identification method integrating entity types and keyword features
CN116756363A (en) Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN114996442A (en) Text abstract generation system combining abstract degree judgment and abstract optimization
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant