CN116702091B - Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP - Google Patents
- Publication number
- CN116702091B CN116702091B CN202310737347.3A CN202310737347A CN116702091B CN 116702091 B CN116702091 B CN 116702091B CN 202310737347 A CN202310737347 A CN 202310737347A CN 116702091 B CN116702091 B CN 116702091B
- Authority
- CN
- China
- Prior art keywords
- text
- image
- view
- ironic
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal ironic intention recognition method, device and equipment based on multi-view CLIP, wherein the method comprises the following steps: sequentially encoding and decoding the text information and image information in a data tuple, wherein a CLIP model is used for encoding to obtain vector representations of the text and the image respectively, and decoding yields ironic score distributions based on the text view and the image view respectively; splicing the encoded vector representations of the text and the image, feeding the spliced vector into a Transformer for modal fusion, determining the attention weights of text and image with a key-less attention mechanism, and decoding to obtain an ironic score distribution based on the text-image interaction view; aggregating the 3 ironic score distributions based on the text view, the image view and the text-image interaction view, and obtaining the ironic intention recognition result of the data tuple from the aggregation result. The invention improves ironic intention recognition accuracy and has good interpretability.
Description
Technical Field
The invention belongs to the technical field of data identification, and particularly relates to a multi-mode ironic intention identification method, device and equipment based on a multi-view CLIP.
Background
Irony is a rhetorical device for expressing strong discontent, opposition, or mockery through techniques such as puns, hyperbole, and metaphor. Irony has a long history in human society: from drama in the ancient Greek period to modern cartoons and internet jokes, irony has always been an important way for people to express criticism. However, since the true emotion an ironic utterance intends to express may be contrary to its surface wording, conventional emotion analysis methods may misclassify emotion when analyzing ironic text, affecting their accuracy. Thus, ironic intent recognition can help identify the true emotion contained in information, facilitating tasks such as emotion analysis and opinion mining.
The meaning of irony is often understood through context, which is frequently multi-layered, implicit, and ambiguous. This makes irony difficult to understand and identify accurately. In addition, the rhetorical devices commonly used in irony further increase the difficulty of identifying it. In recent years, with the rapid development of social media, multimodal irony recognition, which aims to recognize ironic emotion in multimodal scenes, has attracted increasing research attention. Unlike traditional text-based ironic recognition methods, multimodal ironic recognition comprehensively utilizes information from multiple modalities for feature fusion, adapts to various manifestations of irony, and achieves more accurate and comprehensive performance on ironic recognition tasks.
With the rapid development of deep neural networks, multimodal irony recognition has achieved significant results. A number of multimodal irony recognition techniques exist, including explicitly concatenating text features and image features, implicitly employing attention mechanisms to merge features from different modalities, graph-based approaches, and the like. However, whether the results of these models faithfully reflect their multimodal understanding capability remains questionable. In fact, when a text-only model is applied to multimodal irony recognition, its performance significantly exceeds the current most advanced multimodal models. This suggests that the performance of current multimodal irony recognition models may rely heavily on spurious cues in the text data rather than genuinely capturing the relationships between the different modalities and the essential features of irony.
Disclosure of Invention
The invention provides a multi-modal ironic intention recognition method, device and equipment based on multi-view CLIP, which utilize the information provided by the text view, the image view and the text-image interaction view to capture the interaction relations between text and image, complete multi-modal ironic intention recognition, and achieve high recognition accuracy.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
a multi-modal irony intent recognition method based on multi-view CLIP, comprising:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple; the method comprises the steps of adopting CLIP model coding to respectively obtain text information vector representation and image information vector representation, and decoding to respectively obtain irony score distribution based on a text view and an image view;
step 2, splicing the text information vector representation and the image information vector representation obtained by the encoding in the step 1, feeding the spliced vectors into a transformer for modal fusion, determining the attention weight by adopting a key-less attention mechanism, and decoding to obtain irony score distribution based on the text and image interaction view;
and 3, aggregating the 3 irony score distributions based on the text view, the image view and the text-image interaction view obtained in the steps 1 and 2, and obtaining irony intention recognition results of the data tuples according to the aggregation results.
Further, the text information is encoded by using the CLIP model, and a vector representation T of the text information is obtained, as shown in formula (1):

T = (t_1, t_2, …, t_n, t_CLS) = E_text(x)    (1)

where x is the text information in the data tuple, E_text(·) represents the text encoder in the CLIP model, n is the sequence length of text x, t_i is the vector representation of the i-th word in x, and t_CLS is the vector representation of the semantic information of the entire text x;

t_CLS is then linearly transformed and mapped with a linear classifier to obtain the ironic score distribution y_t based on the text view, as shown in formula (2):

y_t = softmax(W_t·t_CLS + b_t)    (2)

where W_t and b_t are respectively the weight matrix and bias parameter of the linear classifier for decoding the text semantic information t_CLS.
Further, the image information is encoded by using the CLIP model, and a vector representation I of the image information is obtained, as shown in formula (3):

I = (v_CLS, v_1, …, v_m) = E_image(y)    (3)

where y is the entire image information in the data tuple, E_image(·) represents the image encoder in the CLIP model, v_CLS is the vector representation of the entire image, m is the number of blocks of image y, and v_i is the vector representation of the i-th block of the image;

v_CLS is then linearly transformed and mapped with a linear classifier to obtain the ironic score distribution y_v based on the image view, as shown in formula (4):

y_v = softmax(W_v·v_CLS + b_v)    (4)

where W_v and b_v are respectively the weight matrix and bias parameter of the linear classifier for decoding the image semantic information v_CLS.
Further, step 2 includes:

First, splicing the encoded text information vector representation and image information vector representation to obtain a composite image-text feature vector F, namely:

F = (v_CLS, v_1, …, v_m, t_1, …, t_n, t_CLS) = Concat(T, I)

where T and I are respectively the text information vector representation and the image information vector representation, and Concat(T, I) denotes the splicing operation; n is the sequence length of the text x, t_i is the vector representation of the i-th word in the text information, and t_CLS is the vector representation of the semantic information of the entire text x; v_CLS is the vector representation of the entire image, m is the number of blocks of image y, and v_i is the vector representation of the i-th block of the image;

Then, feature fusion is performed on the composite feature vector F with a Transformer: F undergoes different linear transformations via the internal self-attention mechanism to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated vector F̂ is obtained, as shown in formula (5):

F̂ = softmax(QK^T / √d_k)·V    (5)

where d_k is the dimension of K and V obtained by the linear transformations;

After obtaining the updated composite image-text feature vector F̂, the key-less attention mechanism is used to further fuse the updated text representation t̂_CLS and image representation v̂_CLS in F̂, obtaining the text-image interaction feature vector f, as shown in formulas (6) and (7):

(p_t, p_v) = softmax([W·t̂_CLS + b, W·v̂_CLS + b])    (6)

f = p_t·t̂_CLS + p_v·v̂_CLS    (7)

where p_t and p_v are the attention weights corresponding to t̂_CLS and v̂_CLS respectively, and W and b are respectively the weight matrix and bias parameter of the linear classifier;

Finally, the text-image interaction feature vector f is decoded: f is linearly transformed and mapped to obtain the ironic recognition result y_f based on the text-image interaction view, as shown in formula (8):

y_f = softmax(W_f·f + b_f)    (8)

where W_f and b_f are respectively the weight matrix and bias parameter of the linear classifier for decoding the feature vector f.
Further, step 3 adopts a post-fusion method to aggregate the 3 ironic score distributions, obtaining an ironic score distribution y_o that takes multiple views into account, as shown in formula (9):

y_o = y_t + y_v + y_f    (9)

where y_t, y_v, y_f are respectively the ironic score distributions based on the text view, the image view, and the text-image interaction view;

the index with the higher probability in y_o is then taken as the ironic intention recognition result.
A multi-modal ironic intent recognition device based on multi-view CLIP, comprising:
a text view identification module for: sequentially encoding and decoding text information in the acquired data tuples, wherein a text encoder adopting a CLIP model is used for encoding to obtain text information vector representation, and decoding to obtain ironic score distribution based on a text view;
an image view identification module for: sequentially encoding and decoding the image information in the acquired data tuples, wherein an image encoder adopting a CLIP model is used for encoding to obtain an image information vector representation, and decoding to obtain ironic score distribution based on image views;
the text and image interaction view identification module is used for: splicing the encoded text information vector representation and image information vector representation, feeding the spliced vector into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
an aggregation module for: and aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained by the recognition modules, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the multi-modal irony intent recognition method based on multi-view CLIP as claimed in any one of the preceding claims.
Advantageous effects
Compared with the prior art, the method has the advantages that:
(1) Compared with the prior art, the method improves accuracy by more than 5.6% and the F1 value by more than 7.0% on the MMSD2.0 dataset, demonstrating its effectiveness in integrating features from different modality views while simplifying the network architecture.
(2) The method does not need any image preprocessing step, and simplifies the training process.
(3) The method does not require a complex network structure; it can naturally exploit the knowledge in the CLIP model for multi-modal ironic recognition and naturally fuses the information provided by different views to improve performance, giving it better interpretability.
(4) Experiments with different training scales show that the method can extract ironic cues even when the size of the training corpus is limited, demonstrating strong low-resource learning capability.
Drawings
FIG. 1 is a diagram of a system model architecture of the present invention.
Fig. 2 is a real example of the embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention in detail. The embodiments are developed on the basis of the technical solution of the invention and provide detailed implementation modes and specific operation procedures to further explain the technical solution.
Experiments were performed on the MMSD2.0 dataset, in which ironic intent is labeled as an ironic/non-ironic binary classification. One sample of the dataset is shown in FIG. 2: it carries an "ironic" label, and the corresponding text modality is "What a successful toast, it looks so delicious!"
The MMSD2.0 dataset is divided into training, validation and test sets; the test set contains 2409 data tuples of text and picture information, comprising 1037 positive examples (ironic) and 1372 negative examples (non-ironic).
The multi-view CLIP based multi-modal ironic intent recognition method is applied to the given test set D, where |D| denotes the number of samples in the test set, 2409 in this example. Each sample in D is a data tuple (x, y) comprising text information and image information, and the method comprises the following steps, as shown in FIG. 1:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple; wherein, adopt the CLIP model to encode and obtain text information vector representation and image information vector representation respectively, decode and obtain the irony score distribution based on text view and image view respectively.
(1) Ironic recognition based on text views

The text encoder E_text of the CLIP model encodes the text information to obtain the vector representation T of the text information, as shown in formula (1):

T = (t_1, t_2, …, t_n, t_CLS) = E_text(x)    (1)

where x is the text information in the data tuple, E_text(·) represents the text encoder in the CLIP model, n is the sequence length of text x, t_i is the vector representation of the i-th word in x, and t_CLS is the vector representation of the semantic information of the entire text x;

t_CLS is then linearly transformed and mapped with a linear classifier to obtain the ironic score distribution y_t based on the text view, as shown in formula (2):

y_t = softmax(W_t·t_CLS + b_t)    (2)

where W_t and b_t are respectively the weight matrix and bias parameter of the linear classifier for decoding the text semantic information t_CLS.
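As an illustration only (not the patented implementation), the view-level decoding of formulas (1) and (2) can be sketched as follows. The embedding size `d`, the random stand-in for the CLIP text encoder output, and the names `W_t`, `b_t` are assumptions of the sketch; the same pattern applies to the image view of formula (4).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 512                           # assumed CLIP embedding size
rng = np.random.default_rng(0)

# Random stand-in for t_CLS, the vector the CLIP text encoder would return for x.
t_cls = rng.standard_normal(d)

# Linear classifier of formula (2): W_t (2 x d) and b_t map t_CLS to 2 class scores.
W_t = 0.01 * rng.standard_normal((2, d))
b_t = np.zeros(2)

y_t = softmax(W_t @ t_cls + b_t)  # ironic score distribution for the text view
```

The resulting `y_t` is a 2-way probability distribution over (non-ironic, ironic), matching the role y_t plays in the aggregation of formula (9).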
(2) Ironic recognition based on image views

The image encoder E_image of the CLIP model encodes the image information to obtain the vector representation I of the image information, as shown in formula (3):

I = (v_CLS, v_1, …, v_m) = E_image(y)    (3)

where y is the entire image information in the data tuple, v_CLS is the vector representation of the entire image, m is the number of blocks of image y, and v_i is the vector representation of the i-th block of the image;

v_CLS is then linearly transformed and mapped with a linear classifier to obtain the ironic score distribution y_v based on the image view, as shown in formula (4):

y_v = softmax(W_v·v_CLS + b_v)    (4)

where W_v and b_v are respectively the weight matrix and bias parameter of the linear classifier for decoding the image semantic information v_CLS.
Step 2, ironic recognition based on the text-image interaction view: the text information vector representation and the image information vector representation obtained in step 1 are spliced, the spliced vector is fed into a Transformer for modal fusion, the attention weights of text and image are determined with the key-less attention mechanism, and decoding yields the ironic score distribution based on the text-image interaction view.
Firstly, splicing the text information vector representation obtained by encoding and the image information vector representation to obtain a vector F of a composite image and text information, namely:
F = (v_CLS, v_1, …, v_m, t_1, …, t_n, t_CLS) = Concat(T, I)
wherein Concat (T, I) represents a splicing operation;
Then, feature fusion is performed on the composite feature vector F using a Transformer, specifically:
(1) F undergoes different linear transformations via the self-attention mechanism in the Transformer to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated vector F̂ is obtained, as shown in formula (5):

F̂ = softmax(QK^T / √d_k)·V    (5)

where d_k is the dimension of K and V obtained by the linear transformations.
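For concreteness, formula (5) can be sketched with random stand-ins for the spliced sequence F and the projection matrices; the sizes `seq_len`, `d` and `d_k` are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d, d_k = 10, 512, 64     # assumed: m+n+2 tokens, model dim, projected dim

F = rng.standard_normal((seq_len, d))        # stand-in for the spliced image-text sequence
W_q = 0.01 * rng.standard_normal((d, d_k))   # three different linear transformations of F
W_k = 0.01 * rng.standard_normal((d, d_k))
W_v = 0.01 * rng.standard_normal((d, d_k))

Q, K, V = F @ W_q, F @ W_k, F @ W_v
F_hat = softmax(Q @ K.T / np.sqrt(d_k)) @ V  # formula (5): updated composite vector
```

Each row of `F_hat` is the attention-weighted mixture of all value vectors, so text tokens are updated with image information and vice versa.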
(2) After obtaining the updated composite image-text feature vector F̂, the key-less attention mechanism is used to fuse the updated text representation t̂_CLS and image representation v̂_CLS in F̂, obtaining the text-image interaction feature vector f, as shown in formulas (6) and (7):

(p_t, p_v) = softmax([W·t̂_CLS + b, W·v̂_CLS + b])    (6)

f = p_t·t̂_CLS + p_v·v̂_CLS    (7)

where p_t and p_v are the attention weights corresponding to t̂_CLS and v̂_CLS respectively, and W and b are respectively the weight matrix and bias parameter of the linear classifier.
Finally, decoding the characteristic vector f of the text and image interaction, namely, linearly transforming f, and mapping to obtain irony recognition result y based on the text and image interaction view f As shown in formula (8):
y_f = softmax(W_f·f + b_f)    (8).
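A minimal sketch of formulas (6) to (8), using random stand-ins for the updated vectors t̂_CLS and v̂_CLS; the scalar-score form of the key-less attention (one shared scoring vector `W` and bias `b` producing one score per view) is an assumption consistent with the formulas above, not a verbatim reproduction of the patented implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d_k = 64                                  # assumed dimension of the updated vectors

t_hat = rng.standard_normal(d_k)          # stand-in for the updated text representation
v_hat = rng.standard_normal(d_k)          # stand-in for the updated image representation

# Key-less attention, formula (6): one scalar score per view, softmax-normalized.
W = 0.01 * rng.standard_normal(d_k)
b = 0.0
p_t, p_v = softmax(np.array([W @ t_hat + b, W @ v_hat + b]))

f = p_t * t_hat + p_v * v_hat             # formula (7): fused interaction vector

# Formula (8): a linear classifier decodes f into the interaction-view distribution.
W_f = 0.01 * rng.standard_normal((2, d_k))
b_f = np.zeros(2)
y_f = softmax(W_f @ f + b_f)
```

Because the attention has no separate key projection, the two weights p_t and p_v come directly from the views' own representations and always sum to 1.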
step 3, aggregate multiview irony recognition: and (3) aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained in the steps 1 and 2, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
Step 3, adopting a post-fusion method to aggregate 3 ironic score distributions to obtain ironic score distribution y considering multiple views o As shown in formula (9):
y_o = y_t + y_v + y_f    (9)

Then argmax is applied to the ironic score distribution y_o, and the index with the higher probability is taken as the ironic intention recognition result: a result of 0 indicates non-ironic and a result of 1 indicates ironic.
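The aggregation of formula (9) and the argmax decision can be sketched as follows; the three per-view score distributions below are hypothetical values chosen for illustration, not outputs of the patented model.

```python
import numpy as np

# Hypothetical per-view distributions over (non-ironic, ironic).
y_t = np.array([0.30, 0.70])   # text view
y_v = np.array([0.55, 0.45])   # image view
y_f = np.array([0.20, 0.80])   # text-image interaction view

y_o = y_t + y_v + y_f              # formula (9): element-wise sum of the 3 views
prediction = int(np.argmax(y_o))   # 0 = non-ironic, 1 = ironic
```

In this example two of the three views lean toward "ironic", so the aggregated decision is 1 even though the image view alone leans the other way, which is the point of the multi-view aggregation.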
In the training stage, this embodiment adopts a joint optimization strategy that optimizes the entire model simultaneously. Specifically, the standard binary cross-entropy loss, i.e. the difference between a model prediction y and the true label ŷ, is computed for the text view, the image view and the image-text interaction view respectively, and the total loss L is accumulated from them, as shown in formula (10):

L = L_t + L_v + L_f    (10)

where L_t, L_v and L_f are respectively the cross-entropy losses of the text view, the image view and the text-image interaction view. The parameters of the model are then optimized by minimizing the loss L with the back-propagation algorithm.
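The joint objective of formula (10) can be sketched as the sum of three per-view cross-entropy terms; the prediction values and the label below are hypothetical, chosen only to make the arithmetic concrete.

```python
import numpy as np

def binary_ce(y_pred, label):
    """Cross-entropy of a 2-way softmax output against a 0/1 label."""
    return float(-np.log(y_pred[label] + 1e-12))

label = 1                      # hypothetical true label: ironic
y_t = np.array([0.40, 0.60])   # hypothetical text-view prediction
y_v = np.array([0.50, 0.50])   # hypothetical image-view prediction
y_f = np.array([0.25, 0.75])   # hypothetical interaction-view prediction

# Formula (10): the total loss accumulates the three view losses, so all three
# branches receive gradient and are optimized simultaneously.
total_loss = binary_ce(y_t, label) + binary_ce(y_v, label) + binary_ce(y_f, label)
```

Summing the losses (rather than training each head separately) is what lets a single backward pass update the shared encoders and all three classifiers at once.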
The ironic recognition accuracy of this example reaches 85.64% and the F1 value reaches 84.10%, with 833 true positives, 1211 true negatives, 204 false positives and 161 false negatives. For example, FIG. 2 is a true-positive sample: the method identifies the ironic intent from the text and the image in the sample by aggregating the text view, the image view and the text-image interaction view.
TABLE 1
As shown in Table 1, compared with the prior art, the accuracy, precision, recall and F1 values of the method on the MMSD2.0 dataset are all improved, demonstrating the effectiveness of integrating features from different modality views and simplifying the network architecture.
The above embodiments are preferred embodiments of the present application, and various changes or modifications may be made on the basis thereof by those skilled in the art, and such changes or modifications should be included within the scope of the present application without departing from the general inventive concept.
Claims (7)
1. A multi-modal ironic intent recognition method based on multi-view CLIP, comprising:
step 1, acquiring a data tuple comprising text information and image information, and sequentially encoding and decoding the text information and the image information in the data tuple; the method comprises the steps of adopting CLIP model coding to respectively obtain text information vector representation and image information vector representation, and decoding to respectively obtain irony score distribution based on a text view and an image view;
step 2, splicing the text information vector representation and the image information vector representation obtained by encoding in the step 1, feeding the spliced vectors into a transformer for modal fusion, determining the attention weight by adopting a key-less attention mechanism, and decoding to obtain irony score distribution based on text and image interaction views;
and 3, aggregating the 3 irony score distributions based on the text view, the image view and the text-image interaction view obtained in the steps 1 and 2, and obtaining irony intention recognition results of the data tuples according to the aggregation results.
2. The ironic intent recognition method of claim 1, wherein the text information is encoded using the CLIP model to obtain a vector representation T of the text information, as shown in formula (1):

T = (t_1, t_2, …, t_n, t_CLS) = E_text(x)    (1)

where x is the text information in the data tuple, E_text(·) represents the text encoder in the CLIP model, n is the sequence length of text x, t_i is the vector representation of the i-th word in x, and t_CLS is the vector representation of the semantic information of the entire text x;

t_CLS is then linearly transformed and mapped with a linear classifier to obtain the ironic score distribution y_t based on the text view, as shown in formula (2):

y_t = softmax(W_t·t_CLS + b_t)    (2)

where W_t and b_t are respectively the weight matrix and bias parameter of the linear classifier for decoding the text semantic information t_CLS.
3. The ironic intent recognition method of claim 1, wherein the image information is encoded using the CLIP model to obtain a vector representation I of the image information, as shown in formula (3):

I = (v_CLS, v_1, …, v_m) = E_image(y)    (3)

where y is the entire image information in the data tuple, v_CLS is the vector representation of the entire image, m is the number of blocks of image y, and v_i is the vector representation of the i-th block of the image;

v_CLS is then linearly transformed and mapped with a linear classifier to obtain the ironic score distribution y_v based on the image view, as shown in formula (4):

y_v = softmax(W_v·v_CLS + b_v)    (4)

where W_v and b_v are respectively the weight matrix and bias parameter of the linear classifier for decoding the image semantic information v_CLS.
4. The ironic intent recognition method of claim 1, wherein step 2 comprises:

first, splicing the encoded text information vector representation and image information vector representation to obtain a composite image-text feature vector F, namely:

F = (v_CLS, v_1, …, v_m, t_1, …, t_n, t_CLS) = Concat(T, I)

where T and I are respectively the text information vector representation and the image information vector representation, and Concat(T, I) denotes the splicing operation; n is the sequence length of the text x, t_i is the vector representation of the i-th word in the text information, and t_CLS is the vector representation of the semantic information of the entire text x; v_CLS is the vector representation of the entire image, m is the number of blocks of image y, and v_i is the vector representation of the i-th block of the image;

then, performing feature fusion on the composite feature vector F with a Transformer: F undergoes different linear transformations via the internal self-attention mechanism to obtain the corresponding query matrix Q, key matrix K and value matrix V, from which the updated vector F̂ is obtained, as shown in formula (5):

F̂ = softmax(QK^T / √d_k)·V    (5)

where d_k is the dimension of K and V obtained by the linear transformations;

after obtaining the updated composite image-text feature vector F̂, using the key-less attention mechanism to further fuse the updated text representation t̂_CLS and image representation v̂_CLS in F̂, obtaining the text-image interaction feature vector f, as shown in formulas (6) and (7):

(p_t, p_v) = softmax([W·t̂_CLS + b, W·v̂_CLS + b])    (6)

f = p_t·t̂_CLS + p_v·v̂_CLS    (7)

where p_t and p_v are the attention weights corresponding to t̂_CLS and v̂_CLS respectively, and W and b are respectively the weight matrix and bias parameter of the linear classifier;

finally, decoding the text-image interaction feature vector f, namely linearly transforming f and mapping it to obtain the ironic recognition result y_f based on the text-image interaction view, as shown in formula (8):

y_f = softmax(W_f·f + b_f)    (8)

where W_f and b_f are respectively the weight matrix and bias parameter of the linear classifier for decoding the feature vector f.
5. The method for ironic intent recognition as claimed in claim 1, wherein step 3 uses a post-fusion method to aggregate the 3 ironic score distributions to obtain an ironic score distribution y_o that takes multiple views into account, as shown in formula (9):

y_o = y_t + y_v + y_f    (9)

where y_t, y_v, y_f are respectively the ironic score distributions based on the text view, the image view, and the text-image interaction view;

the index with the higher probability in the ironic score distribution y_o is then taken as the ironic intention recognition result.
6. A multi-modal ironic intent recognition device based on multi-view CLIP, comprising:
a text view identification module for: sequentially encoding and decoding text information in the acquired data tuples, wherein a text encoder adopting a CLIP model is used for encoding to obtain text information vector representation, and decoding to obtain ironic score distribution based on a text view;
an image view identification module for: sequentially encoding and decoding the image information in the acquired data tuples, wherein an image encoder adopting a CLIP model is used for encoding to obtain an image information vector representation, and decoding to obtain ironic score distribution based on image views;
the text and image interaction view identification module is used for: splicing the encoded text information vector representation and image information vector representation, feeding the spliced vector into a Transformer for modal fusion, determining the attention weights with a key-less attention mechanism, and decoding to obtain the ironic score distribution based on the text-image interaction view;
an aggregation module for: and aggregating the 3 ironic score distributions based on the text view, the image view and the text and image interaction view obtained by the recognition modules, and obtaining ironic intention recognition results of the data tuples according to the aggregation results.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to implement the method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310737347.3A CN116702091B (en) | 2023-06-21 | 2023-06-21 | Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310737347.3A CN116702091B (en) | 2023-06-21 | 2023-06-21 | Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116702091A CN116702091A (en) | 2023-09-05 |
CN116702091B true CN116702091B (en) | 2024-03-08 |
Family
ID=87823664
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310737347.3A Active CN116702091B (en) | 2023-06-21 | 2023-06-21 | Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116702091B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701637B (en) * | 2023-06-29 | 2024-03-08 | 中南大学 | Zero sample text classification method, system and medium based on CLIP |
CN117371456B (en) * | 2023-10-10 | 2024-07-16 | 国网江苏省电力有限公司南通供电分公司 | Multi-mode irony detection method and system based on feature fusion |
CN117892205B (en) * | 2024-03-15 | 2024-07-09 | 华南师范大学 | Multi-modal irony detection method, apparatus, device and storage medium |
CN118093896B (en) * | 2024-04-12 | 2024-07-26 | 中国科学技术大学 | Ironic detection method, ironic detection device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021232589A1 (en) * | 2020-05-21 | 2021-11-25 | 平安国际智慧城市科技股份有限公司 | Intention identification method, apparatus and device based on attention mechanism, and storage medium |
CN113837083A (en) * | 2021-09-24 | 2021-12-24 | 焦点科技股份有限公司 | Video segment segmentation method based on Transformer |
CN115408517A (en) * | 2022-07-21 | 2022-11-29 | 中国科学院软件研究所 | Knowledge injection-based multi-modal irony recognition method of double-attention network |
CN115661594A (en) * | 2022-10-19 | 2023-01-31 | 海南港航控股有限公司 | Image-text multi-mode feature representation method and system based on alignment and fusion |
CN115661713A (en) * | 2022-11-01 | 2023-01-31 | 华南农业大学 | Suckling piglet counting method based on self-attention spatiotemporal feature fusion |
CN116028846A (en) * | 2022-12-20 | 2023-04-28 | 北京信息科技大学 | Multi-mode emotion analysis method integrating multi-feature and attention mechanisms |
CN116259075A (en) * | 2023-01-16 | 2023-06-13 | 安徽大学 | Pedestrian attribute identification method based on prompt fine tuning pre-training large model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113963087B (en) * | 2021-10-12 | 2023-10-27 | 北京百度网讯科技有限公司 | Image processing method, image processing model training method, device and storage medium |
2023-06-21: CN202310737347.3A filed; patent CN116702091B (en) active
Non-Patent Citations (3)
Title |
---|
Bin Liang, et al. Multimodal sarcasm detection via cross-modal graph convolutional network. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022, 1767-1777. *
Sun Yuchong, et al. A study of text embedding differences between multimodal and text-only pre-trained models. Acta Scientiarum Naturalium Universitatis Pekinensis (Journal of Peking University, Natural Science Edition). 2022-08-24, 1-11. *
Zhang Pengfei; Li Guanyu; Jia Caiyan. A self-attention mechanism based on truncated Gaussian distance for natural language inference. Computer Science. 2020-04 (No. 4), 178-183. *
Also Published As
Publication number | Publication date |
---|---|
CN116702091A (en) | 2023-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116702091B (en) | Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP | |
CN111581961B (en) | Automatic description method for image content constructed by Chinese visual vocabulary | |
CN111581405B (en) | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning | |
CN110888980B (en) | Knowledge enhancement-based implicit chapter relation recognition method for attention neural network | |
WO2023065617A1 (en) | Cross-modal retrieval system and method based on pre-training model and recall and ranking | |
Cornia et al. | Explaining digital humanities by aligning images and textual descriptions | |
CN114139551A (en) | Method and device for training intention recognition model and method and device for recognizing intention | |
CN112699686B (en) | Semantic understanding method, device, equipment and medium based on task type dialogue system | |
CN111581943A (en) | Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph | |
CN113032601A (en) | Zero sample sketch retrieval method based on discriminant improvement | |
CN116611024A (en) | Multimodal Transformer sarcasm detection method based on facts and sentiment opposition | |
CN113239159A (en) | Cross-modal retrieval method of videos and texts based on relational inference network | |
CN114764566B (en) | Knowledge element extraction method for aviation field | |
CN114004220A (en) | Text emotion reason identification method based on CPC-ANN | |
CN115408488A (en) | Segmentation method and system for novel scene text | |
Sargar et al. | Image captioning methods and metrics | |
CN117217277A (en) | Pre-training method, device, equipment, storage medium and product of language model | |
CN114722798A (en) | Ironic recognition model based on convolutional neural network and attention system | |
CN114020871A (en) | Multi-modal social media emotion analysis method based on feature fusion | |
CN116822513A (en) | Named entity identification method integrating entity types and keyword features | |
CN116756363A (en) | Strong-correlation non-supervision cross-modal retrieval method guided by information quantity | |
CN115659242A (en) | Multimode emotion classification method based on mode enhanced convolution graph | |
CN115712869A (en) | Multi-modal rumor detection method and system based on layered attention network | |
CN114996442A (en) | Text abstract generation system combining abstract degree judgment and abstract optimization | |
CN114282537A (en) | Social text-oriented cascade linear entity relationship extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||