CN117454873A - Ironic detection method and system based on knowledge-enhanced neural network model - Google Patents
Ironic detection method and system based on knowledge-enhanced neural network model Download PDFInfo
- Publication number
- CN117454873A CN117454873A CN202311374400.4A CN202311374400A CN117454873A CN 117454873 A CN117454873 A CN 117454873A CN 202311374400 A CN202311374400 A CN 202311374400A CN 117454873 A CN117454873 A CN 117454873A
- Authority
- CN
- China
- Prior art keywords
- model
- text
- knowledge
- ironic
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 60
- 238000003062 neural network model Methods 0.000 title claims abstract description 27
- 238000012549 training Methods 0.000 claims abstract description 64
- 230000015654 memory Effects 0.000 claims abstract description 34
- 230000007246 mechanism Effects 0.000 claims abstract description 29
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 15
- 238000005457 optimization Methods 0.000 claims abstract description 15
- 230000002457 bidirectional effect Effects 0.000 claims abstract description 7
- 238000012216 screening Methods 0.000 claims abstract description 7
- 238000000034 method Methods 0.000 claims description 26
- 230000006870 function Effects 0.000 claims description 25
- 230000004913 activation Effects 0.000 claims description 11
- 238000012360 testing method Methods 0.000 claims description 11
- 238000010276 construction Methods 0.000 claims description 7
- 239000011159 matrix material Substances 0.000 claims description 6
- 238000007476 Maximum Likelihood Methods 0.000 claims description 4
- 238000011156 evaluation Methods 0.000 claims description 3
- 230000010354 integration Effects 0.000 claims description 3
- 230000011218 segmentation Effects 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 230000007787 long-term memory Effects 0.000 description 3
- 238000003058 natural language processing Methods 0.000 description 3
- 230000000306 recurrent effect Effects 0.000 description 3
- 230000006403 short-term memory Effects 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 230000008451 emotion Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 238000013519 translation Methods 0.000 description 2
- 238000003066 decision tree Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 238000007637 random forest analysis Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an irony detection method and system based on a knowledge-enhanced neural network model, comprising the following steps: S1, screening context information highly related to the text to be detected from an external knowledge source and integrating it with the original text; S2, performing word embedding on the integrated text data with the pre-trained language model RoBERTa and initializing the weights of a bidirectional long short-term memory (BiLSTM) network; S3, constructing a coding model composed of an 8-layer bidirectional LSTM (BiLSTM) network with an embedded multi-head self-attention mechanism, capturing long-distance dependencies and local semantic features in the text; S4, performing classification training and improving the model through an optimization algorithm to obtain the final knowledge-enhanced irony detection model; S5, acquiring the text to be detected, inputting it into the knowledge-enhanced irony detection model, and outputting the irony detection result. The invention enhances the model's understanding of irony, enables it to capture more complex language patterns effectively, and significantly improves the accuracy and robustness of irony detection.
Description
Technical Field
The invention relates to the technical field of natural language processing and text mining, and in particular to an irony detection method and system based on a knowledge-enhanced neural network model.
Background
With the popularity of social media and online platforms, the volume of text data (especially user-generated content) has grown exponentially. Such text frequently contains sarcasm and veiled mockery, which poses challenges for tasks such as sentiment analysis, public opinion monitoring, and natural language understanding; sarcasm detection has therefore become an important research direction in the field of natural language processing.
Early research methods relied primarily on manually extracted features, such as word frequency and sentiment lexicons, together with traditional machine learning algorithms such as Support Vector Machines (SVMs), decision trees, and random forests.
With the rapid development of deep learning, models based on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks have shown superior performance in text representation learning, automatic feature extraction, and classification accuracy.
However, most current irony detection models still focus mainly on analyzing the semantic and structural features within the text itself, largely ignoring the close associations that may exist between the text and external knowledge sources. To complicate matters further, irony and veiled mockery are often highly context-dependent, so a single text-analysis method alone often fails to achieve satisfactory detection performance.
Therefore, how to integrate external knowledge sources effectively and achieve high accuracy and robustness in irony detection tasks is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides an irony detection method and system based on a knowledge-enhanced neural network model to solve the problems mentioned in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An irony detection method based on a knowledge-enhanced neural network model, comprising the following steps:
S1, screening context information highly related to the text to be detected from an external knowledge source, and integrating the screened context information with the original text;
S2, performing word embedding on the integrated text data by using the pre-trained language model RoBERTa, and initializing the weights of a bidirectional long short-term memory (BiLSTM) network;
S3, constructing a coding model composed of an 8-layer bidirectional long short-term memory (BiLSTM) network, embedding a multi-head self-attention mechanism in the BiLSTM network, and capturing long-distance dependencies and local semantic features in the text;
S4, inputting the RoBERTa word embeddings into the coding model, performing multi-class training through a fully connected layer, the multi-head self-attention mechanism, and a classifier formed by a softmax activation function, and improving the model through an optimization algorithm to obtain the final knowledge-enhanced irony detection model;
S5, acquiring the text to be detected, inputting it into the knowledge-enhanced irony detection model, and outputting the irony detection result.
Preferably, the specific content of step S1 is:
S11, acquiring the sentences most relevant to the original text from different external knowledge sources as candidate contexts;
S12, ranking all candidate contexts by using the BERTScore text similarity algorithm, and selecting the candidate context that best matches the original text, according to the text similarity score, as the context of the original text;
S13, connecting the extracted context with the original text by using an EOS label representing the end of the sequence.
Preferably, the BERTScore text similarity algorithm is specifically:
wherein A and B are the text to be detected and the text from the external knowledge source, respectively.
Preferably, the specific steps for obtaining the pre-trained language model RoBERTa are:
collecting a large amount of unlabeled text data D = (d_1, d_2, …, d_n);
for each document d_i, performing word segmentation to obtain a word sequence T = {t_1, t_2, …, t_{n_i}};
initializing a word embedding matrix E = {e_1, e_2, …, e_V}, where e_i is the embedding vector of the i-th word in the vocabulary and V is the size of the vocabulary;
training the pre-trained language model based on the Transformer architecture and its self-attention mechanism, optimizing the model parameters by maximum likelihood estimation, and obtaining the pre-trained language model RoBERTa after training is completed.
Preferably, the specific content of the model training in step S4 is as follows:
S41, dividing the integrated text data into a training set and a test set, converting the words of the training set into a word embedding matrix X = {x_1, x_2, …, x_N} by using the pre-trained language model RoBERTa, and inputting it into the coding model;
S42, obtaining a hidden state sequence through the 8-layer bidirectional long short-term memory (BiLSTM) network of the coding model;
S43, introducing a multi-head attention mechanism to obtain the attention weight and the average weight of each layer;
S44, feeding the average-weighted representation into the fully connected layer, constructing a classifier using the softmax activation function, and classifying the output of the fully connected layer;
S45, performing model optimization using a cross entropy loss function and the Adam optimizer according to the classification results;
S46, updating the weights of the model according to the gradient of the loss function so as to improve the performance of the model.
Preferably, the self-attention mechanism is specifically:
wherein Q, K, and V are the query, key, and value matrices respectively, and d_k is the dimension of the key.
Preferably, step S4 further includes model testing and evaluation: evaluating, analyzing, and summarizing the test results of the model through accuracy, recall rate, and F value, and optimizing and improving the performance of the model.
Preferably, the 8-layer bidirectional long short-term memory (BiLSTM) network is constructed specifically as follows:
wherein H_t is the hidden state of the t-th time step and X_t is the input of the t-th time step;
the attention weight α_l of each layer is:
wherein T is the length of the sequence, and the similarity term denotes the similarity between the target time step t and the source time step j in the l-th BiLSTM layer;
wherein the first hidden-state term is the hidden state of the target sequence in the l-th layer at time t-1, the second is the hidden state of the source sequence in the l-th layer at time j, and a is a learnable function;
the average weight is then:
the inputs to the fully connected layer and the softmax classifier are expressed as:
the fully connected layer F is:
F(D) = W_f · D + b_f
wherein W_f and b_f are the weight and bias of the fully connected layer, respectively;
the classifier C constructed with the softmax activation function is:
C(F) = softmax(W_c · F + b_c)
wherein W_c and b_c are the weight and bias of the classifier, respectively;
the cross entropy loss function L is:
wherein N is the number of labeled samples, y_i is the true label of the i-th sample, and C(x_i) is the model's prediction for the i-th sample, taking values in [0, 1].
Preferably, the update rules for optimization using the Adam optimizer are specifically:
m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
wherein m_t and v_t are the estimates of the first and second moments respectively, β_1 and β_2 are decay factors, g_t is the gradient of the loss function L with respect to the model parameters σ, α denotes the learning rate, ε is a small constant that prevents division by zero, and σ_t denotes the model parameters at time t.
An irony detection system based on a knowledge-enhanced neural network model, which applies the above irony detection method and comprises: a text acquisition module, a text integration module, a knowledge-enhanced irony detection model, and a model construction and training module;
the knowledge-enhanced irony detection model includes the pre-trained language model RoBERTa and a bidirectional long short-term memory (BiLSTM) model;
the text acquisition module is used for acquiring the text to be detected and an external knowledge source;
the text integration module is used for screening context information highly related to the text to be detected from the external knowledge source and integrating the screened context information with the original text;
the pre-trained language model RoBERTa is used for word embedding of the integrated text data and for initializing the weights of the bidirectional long short-term memory (BiLSTM) network;
the model construction and training module is used for constructing a coding model composed of an 8-layer bidirectional long short-term memory (BiLSTM) network, embedding a multi-head self-attention mechanism in the BiLSTM network, and capturing long-distance dependencies and local semantic features in the text; it inputs the RoBERTa word embeddings into the coding model, performs multi-class training through a fully connected layer, the multi-head self-attention mechanism, and a classifier formed by a softmax activation function, and improves the model through an optimization algorithm to obtain the bidirectional long short-term memory (BiLSTM) model and the final knowledge-enhanced irony detection model;
the knowledge-enhanced irony detection model takes the acquired text to be detected as input and outputs the irony detection result.
Compared with the prior art, the irony detection method and system based on the knowledge-enhanced neural network model provide context information by incorporating external knowledge sources, which enhances the model's understanding of irony and gives it greater versatility and robustness than models that do not use external information;
unlike prior work that mainly used simpler neural network architectures, the invention adopts a multi-layer approach comprising the pre-trained language model RoBERTa, an 8-layer bidirectional long short-term memory (BiLSTM) network, and a multi-head attention mechanism, enabling the model to capture more complex language patterns effectively and significantly improving the accuracy and robustness of irony detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the overall framework topology of a method for ironic detection based on a knowledge-enhanced neural network model provided by the present invention;
FIG. 2 is a schematic diagram of the topology of the pre-trained language model RoBERTa provided by the present invention;
fig. 3 is a schematic diagram of a bidirectional long-short-term memory BiLSTM network model and a multi-head attention mechanism according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses an irony detection method based on a knowledge-enhanced neural network model, comprising the following steps:
S1, screening context information highly related to the text to be detected from an external knowledge source, and integrating the screened context information with the original text;
S2, performing word embedding on the integrated text data by using the pre-trained language model RoBERTa, and initializing the weights of a bidirectional long short-term memory (BiLSTM) network;
S3, constructing a coding model composed of an 8-layer bidirectional long short-term memory (BiLSTM) network, embedding a multi-head self-attention mechanism in the BiLSTM network, and capturing long-distance dependencies and local semantic features in the text;
S4, inputting the RoBERTa word embeddings into the coding model, performing multi-class training through a fully connected layer, the multi-head self-attention mechanism, and a classifier formed by a softmax activation function, and improving the model through an optimization algorithm to obtain the final knowledge-enhanced irony detection model;
S5, acquiring the text to be detected, inputting it into the knowledge-enhanced irony detection model, and outputting the irony detection result.
In order to further implement the above technical solution, the specific content of step S1 is:
S11, acquiring the sentences most relevant to the original text from different external knowledge sources as candidate contexts;
S12, ranking all candidate contexts by using the BERTScore text similarity algorithm, and selecting the candidate context that best matches the original text, according to the text similarity score, as the context of the original text;
S13, connecting the extracted context with the original text by using an EOS label representing the end of the sequence.
In this embodiment, the external knowledge sources include Wikipedia, the New York Times, and British Broadcasting Corporation (BBC) data;
for Wikipedia, the sentence most relevant to the original text is found as a candidate context;
for the New York Times, named entities in the original text are first identified using the natural language processing tool spaCy, and the most relevant sentences are then retrieved through the NYT API as candidate contexts;
for the BBC data, the GDELT DOC API is used to retrieve the ten headlines most relevant to the entities in the original text as candidate contexts.
In order to further implement the above technical solution, the BERTScore text similarity algorithm is specifically:
wherein A and B are the text to be detected and the text from the external knowledge source, respectively.
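The patent gives only a textual description of the candidate ranking and EOS concatenation in steps S11-S13; a minimal sketch of how they could be realized with the open-source bert-score package is shown below. The function name, the choice of RoBERTa's "</s>" token as the EOS label, and the use of the F1 component of BERTScore are assumptions of this sketch, not part of the disclosure.

```python
# Illustrative sketch of steps S11-S13: rank candidate contexts with BERTScore
# and splice the best-matching one onto the original text with an EOS label.
# Assumes the open-source bert-score package; all names here are illustrative.
from bert_score import score

def attach_best_context(original_text, candidate_contexts, eos_token="</s>"):
    # BERTScore compares each candidate against the original text;
    # candidates and references must be lists of equal length.
    refs = [original_text] * len(candidate_contexts)
    _, _, f1 = score(candidate_contexts, refs, lang="en", verbose=False)
    best = candidate_contexts[int(f1.argmax())]   # highest similarity score
    # The EOS label marks the end of the retrieved context (step S13).
    return best + f" {eos_token} " + original_text
```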
In order to further implement the above technical solution, the specific steps for obtaining the pre-trained language model RoBERTa are as follows:
collecting a large amount of unlabeled text data D = (d_1, d_2, …, d_n);
for each document d_i, performing word segmentation to obtain a word sequence T = {t_1, t_2, …, t_{n_i}};
initializing a word embedding matrix E = {e_1, e_2, …, e_V}, where e_i is the embedding vector of the i-th word in the vocabulary and V is the size of the vocabulary;
training the pre-trained language model based on the Transformer architecture and its self-attention mechanism, optimizing the model parameters by maximum likelihood estimation, and obtaining the pre-trained language model RoBERTa after training is completed.
In this embodiment, the model parameters are optimized by maximum likelihood estimation (MLE) as follows:
wherein θ denotes the model parameters of the pre-trained language model RoBERTa and N is the size of the training set.
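As a hedged illustration of the word-embedding step S2, the pre-trained RoBERTa model can be loaded through the Hugging Face transformers library; the checkpoint name "roberta-base", the maximum sequence length, and the use of the last hidden states are assumptions of this sketch, since the patent does not specify them.

```python
# Illustrative sketch of step S2: RoBERTa word embeddings for the
# knowledge-augmented text. Checkpoint and sequence length are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
roberta = AutoModel.from_pretrained("roberta-base")

def embed(text, max_len=128):
    inputs = tokenizer(text, truncation=True, max_length=max_len,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        out = roberta(**inputs)
    # Token-level embeddings of shape (batch, seq_len, hidden); these feed
    # the BiLSTM coding model described in step S3.
    return out.last_hidden_state
```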
In order to further implement the above technical solution, the specific content of the model training in step S4 is:
S41, dividing the integrated text data into a training set and a test set, converting the words of the training set into a word embedding matrix X = {x_1, x_2, …, x_N} by using the pre-trained language model RoBERTa, and inputting it into the coding model;
S42, constructing an 8-layer BiLSTM model, which can be expressed as:
wherein H_t is the hidden state of the t-th time step and X_t is the input of the t-th time step;
S43, introducing a multi-head attention mechanism into the training model so as to capture key information in the text more effectively:
for each layer l of the 8-layer BiLSTM there is a sequence of hidden states, and the attention weight α_l of each layer is:
wherein T is the length of the sequence, and the similarity term denotes the similarity between the target time step t and the source time step j in the l-th BiLSTM layer;
wherein the first hidden-state term is the hidden state of the target sequence in the l-th layer at time t-1, the second is the hidden state of the source sequence in the l-th layer at time j, and a is a learnable function;
after obtaining the attention weight α_l of each layer, the average weight can be calculated;
S44, these average-weighted representations are then combined as the input to the fully connected layer and softmax classifier, which can be expressed as:
to further integrate the output of the bidirectional long short-term memory (BiLSTM) network and prepare the data for the classifier, a fully connected layer F is used:
F(D) = W_f · D + b_f
wherein W_f and b_f are the weight and bias of the fully connected layer, respectively;
next, the classifier C constructed with the softmax activation function is used to classify the output of the fully connected layer:
C(F) = softmax(W_c · F + b_c)
wherein W_c and b_c are the weight and bias of the classifier, respectively;
S45, in order to quantify the performance of the model and improve it through an optimization algorithm, a cross entropy loss function L is used:
wherein N is the number of labeled samples, y_i is the true label of the i-th sample, and C(x_i) is the model's prediction for the i-th sample, taking values in [0, 1];
optimization is performed using the Adam optimizer, with the following update rules:
m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
wherein m_t and v_t are the estimates of the first and second moments respectively, β_1 and β_2 are decay factors usually set to 0.9 and 0.999, g_t is the gradient of the loss function L with respect to the BiLSTM model parameters σ, α denotes the learning rate, ε is a small constant preventing division by zero, usually set to 1×10^-8, and σ_t denotes the model parameters at time t;
S46, finally, after each training epoch, updating the weights of the model according to the gradient of the loss function so as to improve the performance of the model.
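Steps S41-S46 above are described only in prose; the following PyTorch sketch shows one way the described architecture (an 8-layer bidirectional LSTM over RoBERTa word embeddings, a multi-head attention layer, a fully connected layer, a softmax classifier trained with cross entropy, and the Adam optimizer) could be assembled. The hidden size of 512 and dropout of 0.5 follow the embodiment; the embedding dimension, head count, and class count are assumptions of this sketch.

```python
# Minimal sketch of the encoder/classifier described in steps S41-S46.
# Hidden size 512 and dropout 0.5 follow the embodiment; embedding dimension,
# head count and class count are assumptions of this sketch.
import torch
import torch.nn as nn

class KnowledgeEnhancedSarcasmModel(nn.Module):
    def __init__(self, embed_dim=768, hidden=512, num_layers=8,
                 num_heads=8, num_classes=2, dropout=0.5):
        super().__init__()
        # 8-layer bidirectional LSTM over RoBERTa word embeddings
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=num_layers,
                              bidirectional=True, batch_first=True,
                              dropout=dropout)
        # Multi-head attention over the BiLSTM hidden states
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads,
                                          batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)   # fully connected layer F

    def forward(self, word_embeddings):                # (B, T, embed_dim)
        h, _ = self.bilstm(word_embeddings)            # (B, T, 2*hidden)
        a, _ = self.attn(h, h, h)                      # attention-weighted states
        pooled = a.mean(dim=1)                         # averaged representation D
        return self.fc(pooled)                         # logits; softmax is in the loss

model = KnowledgeEnhancedSarcasmModel()
criterion = nn.CrossEntropyLoss()                      # cross entropy loss L
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```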
In order to further implement the above technical solution, the self-attention mechanism is specifically:
wherein Q, K, and V are the query, key, and value matrices respectively, and d_k is the dimension of the key.
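Since the attention formula itself is not reproduced in this text, a small sketch of the standard scaled dot-product self-attention, consistent with the Q, K, V, and d_k definitions above, is given for reference; tensor shapes are assumed.

```python
# Standard scaled dot-product self-attention softmax(Q K^T / sqrt(d_k)) V,
# consistent with the Q, K, V and d_k definitions above; shapes are assumed.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    weights = torch.softmax(scores, dim=-1)            # attention weights
    return weights @ V
```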
In order to further implement the above technical solution, step S4 further includes model testing and evaluation: evaluating, analyzing, and summarizing the test results of the model through accuracy, recall rate, and F value, and optimizing and improving the performance of the model.
In this embodiment, the performance indicators used are the accuracy, the recall rate, and the F1 score.
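The evaluation described here corresponds to standard classification metrics; the following sketch computes them with scikit-learn, which is an assumed choice since the patent names only the metrics themselves.

```python
# Standard evaluation metrics for the test phase; scikit-learn is an assumed
# choice, since the patent names only the metrics themselves.
from sklearn.metrics import accuracy_score, f1_score, recall_score

def evaluate(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),  # Macro-F1 as reported in Tables 1-3
    }
```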
In this embodiment, context information is generated from external knowledge sources (such as Wikipedia, the New York Times, and BBC data) and then fused with the original text data; the text is then converted into numeric vectors in the word embedding layer and processed by the coding layer and the multi-head attention mechanism layer.
At the coding layer, a knowledge-enhanced neural network model is used, which can more effectively capture complex patterns in text; the multi-headed attention mechanism layer further extracts key information in the text and uses it in the classification layer to predict whether the text is ironic.
In another embodiment, a plurality of sets of comparison experiments were performed to verify model validity:
first, the performance of different network models (including BERT-GRU-Softmax based on a pre-trained language model and gated recurrent neural network GRU, BERT-LSTM-Softmax based on a pre-trained language model and underlying long and short term memory LSTM network, three benchmark models, and the model of the present invention) on ironic detection tasks was experimentally studied on the semval-2018 Task3 dataset.
The invention adopts a bi-directional long-short-term memory BiLSTM model as shown in figure 3, a pre-trained RoBERTa neural network model as shown in figure 2, and Twitter comment text from a Semeval-2018Task3 data set which is randomly divided into a training set, a test set and a verification set according to a ratio of 8:1:1 is mainly used for training and evaluating ironic detection models.
For model construction, the invention uses a knowledge-enhanced neural network model that acquires context information from external knowledge sources (such as Wikipedia, the New York Times, and BBC data); the model mainly comprises an embedding layer, a coding layer, a multi-head attention mechanism layer, and a classification layer, where the embedding layer uses the pre-trained RoBERTa model and the coding layer uses an 8-layer bidirectional long short-term memory (BiLSTM) network;
for optimization, the Adam optimizer is used with the weight decay set to 1.0e-2; OneCycleLR is used as the learning rate scheduler with the maximum learning rate set to 1.0e-5; the number of training epochs is set to 20, each comprising 256 steps; the hidden layer size is set to 512 and the dropout rate to 0.5; during training, the micro-batch size is set to 8 and the maximum number of training epochs to 20; word embeddings are stored on the GPU, and 500 warm-up steps are added;
finally, the model performs classification through a softmax layer, outputting the probabilities that a sample belongs to the ironic and non-ironic categories, and the performance of the model is evaluated by comparison with the ground-truth labels.
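The hyper-parameters listed in this embodiment (Adam with weight decay 1.0e-2, OneCycleLR with a maximum learning rate of 1.0e-5, 20 epochs of 256 steps each, and a micro-batch size of 8) map onto PyTorch roughly as follows. This is a sketch that reuses the model and loss from the earlier sketch together with an assumed DataLoader, not the authors' actual training script; the 500 warm-up steps would be covered by OneCycleLR's built-in warm-up phase rather than set explicitly here.

```python
# Sketch of the optimization setup stated in this embodiment; `model` and
# `criterion` are taken from the earlier sketch and `train_loader` is an
# assumed DataLoader yielding (embeddings, labels) batches of size 8.
import torch

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1.0e-5, weight_decay=1.0e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1.0e-5,       # maximum learning rate
    epochs=20,           # 20 training periods
    steps_per_epoch=256, # 256 steps per period
)

for epoch in range(20):
    for embeddings, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(embeddings), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR advances once per optimization step
```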
Other comparison models were constructed and comparative tests performed; the experimental results on the SemEval-2018 dataset are shown in Table 1:
Network model method | Macro-F1 (%) |
---|---|
Pre-trained language model + gated recurrent neural network | 79.50 |
Pre-trained language model + basic long short-term memory network | 80.65 |
Singh et al. (2019): attention mechanism and emoticon textualization | 80.31 |
Potamias et al. (2020): Transformer model and recurrent convolutional network | 80.00 |
Turbo et al. (2022): multimodal integrated training network | 79.80 |
The invention (Wikipedia) | 82.97 |
Table 1 shows the performance of each model on the SemEval-2018 Task 3 dataset. The experimental results show that, after introducing Wikipedia context information, the invention achieves a Macro-F1 score of 82.97%, clearly surpassing the other baseline models and demonstrating its superiority in irony recognition.
Experimental results with different knowledge sources as context are shown in Table 2:
External knowledge source | Macro-F1 (%) |
---|---|
Without external knowledge | 80.65 |
New York Times | 80.77 |
BBC data | 81.39 |
Wikipedia | 82.97 |
Table 2 shows the experimental results for different knowledge sources. Even without introducing external knowledge, the proposed model achieves an F1 score of 80.65%, which fully demonstrates the strength of a neural network that integrates the pre-trained language model RoBERTa, long short-term memory (LSTM) networks, and an attention mechanism. After introducing context information for the original text, every knowledge source (Wikipedia, the New York Times, and BBC data) improves the performance of the model. Specifically, the model using Wikipedia as the knowledge source achieves an F1 score of 82.97%, a significant improvement over the 80.65% obtained without an external knowledge source, while the New York Times and BBC data raise the F1 score to 80.77% and 81.39%, respectively. These results clearly indicate that context information from external knowledge sources is critical for predicting irony.
The experimental results of the proposed model under different data enhancement strategies are shown in Table 3:
data enhancement policy | Macro-F1(%) |
Reverse translation | 77.58 |
Synonym replacement | 80.69 |
Word order replacement | 79.21 |
The invention is that | 82.97 |
Table 3 shows the experimental results for different data enhancement strategies. As can be seen, the three strategies affect the irony detection task differently. Synonym replacement brings a slight improvement, reaching an F1 score of 80.69% and slightly exceeding the 80.65% obtained without data enhancement, which indicates that synonym replacement as a data enhancement strategy can improve the generalization ability of the model to some extent. With back-translation and word-order replacement, the model only reaches F1 scores of 77.58% and 79.21%, slightly lower than the model without data enhancement, probably because both strategies introduce some noise while increasing data diversity, thereby affecting the performance of the model. When Wikipedia is introduced into the model as context information, performance improves significantly and the F1 score reaches 82.97%, further showing that context information from external knowledge sources provides richer semantic information and helps the model better understand the semantics of the text.
An irony detection system based on a knowledge-enhanced neural network model, comprising: a text acquisition module, a text integration module, a knowledge-enhanced irony detection model, and a model construction and training module;
the knowledge-enhanced irony detection model includes the pre-trained language model RoBERTa and a bidirectional long short-term memory (BiLSTM) model;
the text acquisition module is used for acquiring the text to be detected and an external knowledge source;
the text integration module is used for screening context information highly related to the text to be detected from the external knowledge source and integrating the screened context information with the original text;
the pre-trained language model RoBERTa is used for word embedding of the integrated text data and for initializing the weights of the bidirectional long short-term memory (BiLSTM) network;
the model construction and training module is used for constructing a coding model composed of an 8-layer bidirectional long short-term memory (BiLSTM) network, embedding a multi-head self-attention mechanism in the BiLSTM network, and capturing long-distance dependencies and local semantic features in the text; it inputs the RoBERTa word embeddings into the coding model, performs multi-class training through a fully connected layer, the multi-head self-attention mechanism, and a classifier formed by a softmax activation function, and improves the model through an optimization algorithm to obtain the bidirectional long short-term memory (BiLSTM) model and the final knowledge-enhanced irony detection model;
the knowledge-enhanced irony detection model takes the acquired text to be detected as input and outputs the irony detection result.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts between the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for irony detection based on a knowledge-enhanced neural network model, comprising the steps of:
S1, screening context information highly related to the text to be detected from an external knowledge source, and integrating the screened context information with the original text;
S2, performing word embedding on the integrated text data by using the pre-trained language model RoBERTa, and initializing the weights of a bidirectional long short-term memory (BiLSTM) network;
S3, constructing a coding model composed of an 8-layer bidirectional long short-term memory (BiLSTM) network, embedding a multi-head self-attention mechanism in the BiLSTM network, and capturing long-distance dependencies and local semantic features in the text;
S4, inputting the RoBERTa word embeddings into the coding model, performing multi-class training through a fully connected layer, the multi-head self-attention mechanism, and a classifier formed by a softmax activation function, and improving the model through an optimization algorithm to obtain the final knowledge-enhanced irony detection model;
S5, acquiring the text to be detected, inputting it into the knowledge-enhanced irony detection model, and outputting the irony detection result.
2. The method for irony detection based on a knowledge-enhanced neural network model according to claim 1, wherein the specific content of step S1 is:
S11, acquiring the sentences most relevant to the original text from different external knowledge sources as candidate contexts;
S12, ranking all candidate contexts by using the BERTScore text similarity algorithm, and selecting the candidate context that best matches the original text, according to the text similarity score, as the context of the original text;
S13, connecting the extracted context with the original text by using an EOS label representing the end of the sequence.
3. The method for irony detection based on a knowledge-enhanced neural network model according to claim 2, characterized in that the BERTScore text similarity algorithm is specifically:
wherein A and B are the text to be detected and the text from the external knowledge source, respectively.
4. The method for irony detection based on a knowledge-enhanced neural network model according to claim 1, wherein the specific steps of obtaining the pre-trained language model RoBERTa are:
collecting a large amount of unlabeled text data D = (d_1, d_2, …, d_n);
for each document d_i, performing word segmentation to obtain a word sequence T = {t_1, t_2, …, t_{n_i}};
initializing a word embedding matrix E = {e_1, e_2, …, e_V}, where e_i is the embedding vector of the i-th word in the vocabulary and V is the size of the vocabulary;
training the pre-trained language model based on the Transformer architecture and its self-attention mechanism, optimizing the model parameters by maximum likelihood estimation, and obtaining the pre-trained language model RoBERTa after training is completed.
5. The method for irony detection based on a knowledge-enhanced neural network model according to claim 1, wherein the specific content of the multi-class training of the model in step S4 is:
S41, dividing the integrated text data into a training set and a test set, converting the words of the training set into a word embedding matrix X = {x_1, x_2, …, x_N} by using the pre-trained language model RoBERTa, and inputting it into the coding model;
S42, obtaining a hidden state sequence through the 8-layer bidirectional long short-term memory (BiLSTM) network of the coding model;
S43, introducing a multi-head attention mechanism to obtain the attention weight and the average weight of each layer;
S44, feeding the average-weighted representation into the fully connected layer, constructing a classifier using the softmax activation function, and classifying the output of the fully connected layer;
S45, performing model optimization using a cross entropy loss function and the Adam optimizer according to the classification results;
S46, updating the weights of the model according to the gradient of the loss function so as to improve the performance of the model.
6. The method for irony detection based on a knowledge-enhanced neural network model according to claim 4 or 5, wherein the self-attention mechanism is specifically:
wherein Q, K, and V are the query, key, and value matrices respectively, and d_k is the dimension of the key.
7. The method for irony detection based on a knowledge-enhanced neural network model according to claim 1, further comprising the step of model testing and evaluation after step S4: evaluating, analyzing, and summarizing the test results of the model through accuracy, recall rate, and F value, and optimizing and improving the performance of the model.
8. The method for irony detection based on a knowledge-enhanced neural network model according to claim 5, wherein the 8-layer bidirectional long short-term memory (BiLSTM) network is constructed specifically as follows:
wherein H_t is the hidden state of the t-th time step and X_t is the word embedding of the t-th time step;
the attention weight α_l of each layer is:
wherein T is the length of the word sequence, and the similarity term denotes the similarity between the target time step t and the source time step j in the l-th BiLSTM layer;
wherein the first hidden-state term is the hidden state of the target sequence in the l-th layer at time t-1, the second is the hidden state of the source sequence in the l-th layer at time j, and a is a learnable function;
the average weight is then:
the inputs to the fully connected layer and the softmax classifier are expressed as:
the fully connected layer F is:
F(D) = W_f · D + b_f
wherein W_f and b_f are the weight and bias of the fully connected layer, respectively;
the classifier C constructed with the softmax activation function is:
C(F) = softmax(W_c · F + b_c)
wherein W_c and b_c are the weight and bias of the classifier, respectively;
the cross entropy loss function L is:
wherein N is the number of labeled samples, y_i is the true label of the i-th sample, and C(x_i) is the model's prediction for the i-th sample, taking values in [0, 1].
9. The method for irony detection based on a knowledge-enhanced neural network model according to claim 8, characterized in that the update rules for optimization using the Adam optimizer are specifically:
m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
wherein m_t and v_t are the estimates of the first and second moments respectively, β_1 and β_2 are decay factors, g_t is the gradient of the loss function L with respect to the model parameters σ, α denotes the learning rate, ε is a small constant that prevents division by zero, and σ_t denotes the model parameters at time t.
10. An irony detection system based on a knowledge-enhanced neural network model, characterized in that it applies the irony detection method based on a knowledge-enhanced neural network model according to any one of claims 1-9 and comprises: a text acquisition module, a text integration module, a knowledge-enhanced irony detection model, and a model construction and training module;
the knowledge-enhanced irony detection model includes the pre-trained language model RoBERTa and a bidirectional long short-term memory (BiLSTM) model;
the text acquisition module is used for acquiring the text to be detected and an external knowledge source;
the text integration module is used for screening context information highly related to the text to be detected from the external knowledge source and integrating the screened context information with the original text;
the pre-trained language model RoBERTa is used for word embedding of the integrated text data and for initializing the weights of the bidirectional long short-term memory (BiLSTM) network;
the model construction and training module is used for constructing a coding model composed of an 8-layer bidirectional long short-term memory (BiLSTM) network, embedding a multi-head self-attention mechanism in the BiLSTM network, and capturing long-distance dependencies and local semantic features in the text; it inputs the RoBERTa word embeddings into the coding model, performs multi-class training through a fully connected layer, the multi-head self-attention mechanism, and a classifier formed by a softmax activation function, and improves the model through an optimization algorithm to obtain the bidirectional long short-term memory (BiLSTM) model and the final knowledge-enhanced irony detection model;
the knowledge-enhanced irony detection model takes the acquired text to be detected as input and outputs the irony detection result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311374400.4A CN117454873B (en) | 2023-10-23 | 2023-10-23 | Ironic detection method and system based on knowledge-enhanced neural network model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311374400.4A CN117454873B (en) | 2023-10-23 | 2023-10-23 | Ironic detection method and system based on knowledge-enhanced neural network model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117454873A true CN117454873A (en) | 2024-01-26 |
CN117454873B CN117454873B (en) | 2024-04-23 |
Family
ID=89584733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311374400.4A Active CN117454873B (en) | 2023-10-23 | 2023-10-23 | Ironic detection method and system based on knowledge-enhanced neural network model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117454873B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118155077A (en) * | 2024-04-17 | 2024-06-07 | 中国科学院地理科学与资源研究所 | Automatic open-pit mining area identification method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
CN112307745A (en) * | 2020-11-05 | 2021-02-02 | 浙江大学 | Relationship enhanced sentence ordering method based on Bert model |
CN115510841A (en) * | 2022-09-16 | 2022-12-23 | 武汉大学 | Text matching method based on data enhancement and graph matching network |
-
2023
- 2023-10-23 CN CN202311374400.4A patent/CN117454873B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
CN112307745A (en) * | 2020-11-05 | 2021-02-02 | 浙江大学 | Relationship enhanced sentence ordering method based on Bert model |
CN115510841A (en) * | 2022-09-16 | 2022-12-23 | 武汉大学 | Text matching method based on data enhancement and graph matching network |
Non-Patent Citations (3)
Title |
---|
YAN MENGXIANG ET.AL: "Deceptive review detection via hierarchical neural network model with attention mechanism", JOURNAL OF COMPUTER APPLICATIONS, vol. 39, no. 7, 8 May 2020 (2020-05-08), pages 1925 - 1930 * |
- LI TONG: "Research on Stock Prediction Based on Sentiment Analysis and Attention Mechanism", Wanfang Dissertation Full-text Database, 8 June 2022 (2022-06-08), pages 1 - 20 *
- YAN MENGXIANG; JI DONGHONG; REN YAFENG: "Deceptive review detection based on hierarchical attention mechanism neural network model", Journal of Computer Applications, vol. 39, no. 7, 28 February 2019 (2019-02-28), pages 1925 - 1930 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118155077A (en) * | 2024-04-17 | 2024-06-07 | 中国科学院地理科学与资源研究所 | Automatic open-pit mining area identification method and system |
Also Published As
Publication number | Publication date |
---|---|
CN117454873B (en) | 2024-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111897908B (en) | Event extraction method and system integrating dependency information and pre-training language model | |
CN109471938B (en) | Text classification method and terminal | |
WO2021135193A1 (en) | Visual object guidance-based social media short text named entity identification method | |
CN108319666B (en) | Power supply service assessment method based on multi-modal public opinion analysis | |
CN111126386B (en) | Sequence domain adaptation method based on countermeasure learning in scene text recognition | |
CN111738004A (en) | Training method of named entity recognition model and named entity recognition method | |
CN111310476B (en) | Public opinion monitoring method and system using aspect-based emotion analysis method | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN111506732B (en) | Text multi-level label classification method | |
CN112199501B (en) | Scientific and technological information text classification method | |
CN111782807B (en) | Self-bearing technology debt detection classification method based on multiparty integrated learning | |
CN113657115B (en) | Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion | |
CN117454873B (en) | Ironic detection method and system based on knowledge-enhanced neural network model | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN111966827A (en) | Conversation emotion analysis method based on heterogeneous bipartite graph | |
CN112732910B (en) | Cross-task text emotion state evaluation method, system, device and medium | |
CN113806547A (en) | Deep learning multi-label text classification method based on graph model | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
Zhi et al. | Financial fake news detection with multi fact CNN-LSTM model | |
CN111429184A (en) | User portrait extraction method based on text information | |
CN112287197A (en) | Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases | |
CN111428513A (en) | False comment analysis method based on convolutional neural network | |
CN112417132A (en) | New intention recognition method for screening negative samples by utilizing predicate guest information | |
CN111460100A (en) | Criminal legal document and criminal name recommendation method and system | |
CN113051886B (en) | Test question duplicate checking method, device, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |