CN111966786A

CN111966786A - A Weibo Rumor Detection Method

Info

Publication number: CN111966786A
Application number: CN202010757089.1A
Authority: CN
Inventors: 宋玉蓉; 潘德宇
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2020-11-20
Anticipated expiration: 2040-07-31
Also published as: CN111966786B

Abstract

The invention provides a microblog rumor detection method, which considers the attention mechanism and comprises the following steps: collecting microblog events and corresponding comment data sets as sample data; preprocessing the sample data, and respectively extracting text contents of the original microblog and the comment; pre-training the text by adopting a BERT pre-training model, and generating a sentence vector with a fixed length for each sentence of the text; constructing a dictionary, and extracting an original microblog and a plurality of corresponding comments to form a microblog event vector matrix; training the vector matrix by adopting a deep learning method Text CNN-Attention, and constructing a multi-level training model; and carrying out classification detection on the vector matrix according to the multi-level training model to obtain a rumor detection result corresponding to the social network data. Compared with the traditional rumor detection method, the method improves the accuracy.

Description

A Weibo Rumor Detection Method

技术领域technical field

本发明属于自然语言处理技术领域，尤其涉及一种微博谣言检测方法。The invention belongs to the technical field of natural language processing, and in particular relates to a microblog rumor detection method.

背景技术Background technique

谣言一般是指未经核实的陈述或说明，往往与某一事件相关。随着社交媒体的迅速发展，谣言可以通过社交媒体以核裂变的速度迅速传播。社交媒体之一的微博,即微型博客，是 Web2.0时代新兴的一类开放互联网社交服务。用户可以借助于互联网或手机等传播媒介，随时随地的用简短的文字更新自己的微博，同更多的用户分享信息。微博与传统博客相比，在传播特性上表现出：即时的博文分享、创新的交互方式、生动的现场演绎。在传播效应上表现出：人气积累、经济快捷的品牌营销。但是，多元化的传播中，自由化的传播内容、平民化的传播者和广泛的受众、多样化的传播渠道，推动了谣言在微博上的传播与扩散。微博上谣言的传播多通过用户与用户之间关于信息的评论与转发来进行，若虚假谣言被广泛传播，则对社会产生一定的负面影响。Rumors generally refer to unverified statements or statements, often related to an event. With the rapid development of social media, rumors can spread rapidly through social media at the speed of nuclear fission. Microblog, one of the social media, is a new type of open Internet social service in the Web2.0 era. Users can update their Weibo with short text anytime and anywhere with the help of communication media such as the Internet or mobile phones, and share information with more users. Compared with traditional blogs, microblogs have the following characteristics: instant blog sharing, innovative interactive methods, and lively on-site interpretation. In terms of communication effect, it shows: popularity accumulation, economical and fast brand marketing. However, in the diversified communication, liberalized communication content, popular disseminators and wide audiences, and diversified communication channels have promoted the spread and spread of rumors on Weibo. The spread of rumors on Weibo is mostly carried out through comments and forwarding of information between users. If false rumors are widely spread, it will have a certain negative impact on society.

关于谣言检测的办法一般分为两类：一类是机器学习基于传统的人工提取特征的方法，从谣言内容、谣言用户、谣言传播三个方面搭配情感极性、用户影响力等因素挖掘特征并通过贝叶斯、决策树等分类器进行谣言检测；另一类是基于深度学习方法通过构造神经网络并搭配非线性函数学习文本中的潜在特征，通过CNN、RNN等神经网络模型对文本序列进行特征表示学习，最后通过非线性分类器进行谣言检测。目前通过深度学习构造神经网络对谣言检测的研究中预训练模型大多采用的是word2vec词向量或ELMo，但前者中得出的词向量无法解决多义词的问题使得训练出的每个词只能对应一个向量表示，而后者可以根据上下文动态调整词嵌入，但是使用LSTM进行特征抽取而不是Transformer，并且ELMo使用上下文向量拼接作为当前向量，这样融合出的向量特征较差。训练模型多采用CNN或RNN网络，但CNN网络虽然可以提取句义特征却忽略了上下文语序特征，并且CNN网络经过全连接操作后将池化所得特征拼接时无法对影响较明显的特征进行区分。本发明针对目前存在的挑战提出一种新的考虑注意力机制的谣言检测模型，在文本预处理方面选用能够提取文本潜在特征的BERT预训练模型，训练模型上在CNN模型中引入了注意力机制，能够自动根据事件影响力不同分配不同的权重，最后使用Softmax分类器进行谣言检测。The methods of rumor detection are generally divided into two categories: one is the method of machine learning based on traditional manual extraction of features, which mines features from the three aspects of rumor content, rumor users, and rumor dissemination, with factors such as emotional polarity and user influence. Rumor detection is carried out through classifiers such as Bayesian and decision trees; the other is based on deep learning methods by constructing neural networks and matching nonlinear functions to learn potential features in texts, and through neural network models such as CNN, RNN and other neural network models. Feature representation learning, and finally rumor detection through a nonlinear classifier. At present, most of the pre-training models in the research on rumor detection by constructing neural networks through deep learning use word2vec word vectors or ELMo, but the word vectors obtained in the former cannot solve the problem of polysemy, so that each word trained can only correspond to one Vector representation, which can dynamically adjust word embeddings according to context, but uses LSTM for feature extraction instead of Transformer, and ELMo uses context vector splicing as the current vector, so the fused vector features are poor. Most of the training models use CNN or RNN networks, but although the CNN network can extract semantic features, it ignores the contextual word order features, and the CNN network cannot distinguish the more obvious features when the pooled features are spliced after the full connection operation. In view of the existing challenges, the present invention proposes a new rumor detection model that considers the attention mechanism, selects the BERT pre-training model that can extract the latent features of the text in the text preprocessing, and introduces the attention mechanism into the CNN model in the training model. , which can automatically assign different weights according to the influence of events, and finally use the Softmax classifier to detect rumors.

有鉴于此，有必要设计一种微博谣言检测方法，以解决上述问题。In view of this, it is necessary to design a microblog rumor detection method to solve the above problems.

发明内容SUMMARY OF THE INVENTION

本发明的目的是提供一种准确率较高的微博谣言检测方法。The purpose of the present invention is to provide a microblog rumor detection method with high accuracy.

为实现以上目的，本发明提供了一种微博谣言检测方法，包含如下步骤：In order to achieve the above purpose, the present invention provides a microblog rumor detection method, comprising the following steps:

A、收集微博事件和相应评论数据集作为样本数据；A. Collect Weibo events and corresponding comment datasets as sample data;

B、对样本数据进行预处理，分别提取原微博与评论的文本内容；B. Preprocess the sample data, and extract the text content of the original Weibo and comments respectively;

C、采用BERT预训练模型对文本进行预训练，每句文本生成固定长度的句向量；C. The BERT pre-training model is used to pre-train the text, and each sentence of text generates a fixed-length sentence vector;

D、构建字典，提取原微博与对应数条评论组成微博事件向量矩阵；D. Construct a dictionary, extract the original microblog and the corresponding comments to form a microblog event vector matrix;

E、采用深度学习方法Text CNN-Attention对向量矩阵进行训练，构建多层次训练模型；E. Use the deep learning method Text CNN-Attention to train the vector matrix to build a multi-level training model;

F、根据多层次训练模型对向量矩阵进行分类检测，得到对应社交网络数据的谣言检测结果。F. Classify and detect the vector matrix according to the multi-level training model, and obtain the rumor detection result corresponding to the social network data.

作为本发明的进一步改进，所述样本数据包括谣言样本数据和非谣言样本数据。As a further improvement of the present invention, the sample data includes rumor sample data and non-rumor sample data.

作为本发明的进一步改进，所述步骤B中，使用正则表达式清除json文件中的噪声。As a further improvement of the present invention, in the step B, regular expressions are used to remove noise in the json file.

作为本发明的进一步改进，所述进行完预训练的全部文本按照训练数据与测试数据按照4：1的比例用于后续模型的处理。As a further improvement of the present invention, all the pre-trained texts are used for subsequent model processing according to the training data and the test data in a ratio of 4:1.

作为本发明的进一步改进，预训练的BERT模型与代码能够实现词向量的嵌入。As a further improvement of the present invention, the pre-trained BERT model and code can implement the embedding of word vectors.

作为本发明的进一步改进，所述BERT模型作为词向量模型，能够充分描述字符级、词级、句子级以至于句子间关系特征，将NLP任务逐渐移到预训练产生句向量上。As a further improvement of the present invention, as a word vector model, the BERT model can fully describe the character-level, word-level, sentence-level and even the relationship between sentences, and gradually move the NLP task to the sentence vector generated by pre-training.

作为本发明的进一步改进，所述BERT模型提出预训练目标：遮蔽语言模型(maskedlanguage model，MLM)，克服传统的单向性局限，MLM目标允许表征融合左右两侧的语境，从而可以预训练一个深度双向的Transformer。As a further improvement of the present invention, the BERT model proposes a pre-training target: a masked language model (MLM), which overcomes the traditional one-way limitation. The MLM target allows the representation of the context of fusion of the left and right sides, so that pre-training is possible A deep bidirectional Transformer.

作为本发明的进一步改进，所述BERT模型引入了“下一句预测”任务，可以和MLM共同训练文本对的表示。As a further improvement of the present invention, the BERT model introduces the task of "predicting the next sentence", which can jointly train the representation of text pairs with MLM.

作为本发明的进一步改进，所述BERT模型运用句子级负采样，预测输入BERT 的两端文本是否连续；在训练过程中，输入模型的第二段将从所有文本中随机选择，概率为50％，其余50％将选择第一段的后续文本。As a further improvement of the present invention, the BERT model uses sentence-level negative sampling to predict whether the texts at both ends of the input BERT are continuous; during the training process, the second paragraph of the input model will be randomly selected from all texts with a probability of 50% , the remaining 50% will select the subsequent text of the first paragraph.

作为本发明的进一步改进，所述构建多层次训练模型由Text CNN和注意力机制两部分组成；其中，Text CNN模型使用三个卷积尺寸分别为3,4,5的卷积核对待测向量矩阵进行卷积操作，得到关于不同卷积核基于向量矩阵的不同的特征表示，通过池化操作每个卷积核对应输入矩阵只产生一个最大特征，再通过全连接操作将不同尺寸卷积核所得特征表示相连；注意力机制对全连接后产生的特征表示根据每个特征按照对输出影响力的不同赋予不同的权重，使得影响力大的特征在进行谣言检测时会拥有更大的影响力。As a further improvement of the present invention, the construction of the multi-level training model is composed of Text CNN and an attention mechanism; wherein, the Text CNN model uses three convolution kernels with convolution sizes of 3, 4, and 5 to be tested. The matrix performs the convolution operation to obtain different feature representations based on the vector matrix for different convolution kernels. Through the pooling operation, each convolution kernel corresponds to the input matrix to generate only one maximum feature, and then the convolution kernels of different sizes are connected through the full connection operation. The obtained feature representations are connected; the attention mechanism assigns different weights to the feature representations generated after full connection according to the influence of each feature on the output, so that the features with greater influence will have greater influence in rumor detection. .

本发明的有益效果如下：本发明微博谣言检测方法，在文本预处理阶段运用了BERT预训练模型，使用Transformer能更高效的捕捉更长距离的依赖，可以挖掘深层的上下文信息，使得预训练出来的句向量具有更好的潜在特征；训练模型引入了注意力机制通过给不同的特征根据其影响力赋予不同的权重，这样对输出结果影响较大的特征就会被赋予更多的权重，从而对结果产生更重要的影响，有利于进行谣言检测，提高检测的准确率。The beneficial effects of the present invention are as follows: the microblog rumor detection method of the present invention uses the BERT pre-training model in the text preprocessing stage, using the Transformer can more efficiently capture longer-distance dependencies, and can mine deep context information, so that pre-training The resulting sentence vector has better latent features; the training model introduces an attention mechanism by assigning different weights to different features according to their influence, so that features that have a greater impact on the output result will be given more weights, Thus, it has a more important impact on the results, which is conducive to the detection of rumors and improves the accuracy of detection.

附图说明Description of drawings

为了更清楚地说明本发明实施例的技术方案，下面将对实施例描述中所需要使用的附图作简单的介绍，显而易见地，下面描述中的附图仅仅只是本发明的一些实施例。In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the drawings used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention.

其中：in:

图1为谣言检测的通用流程图；Figure 1 is a general flow chart of rumor detection;

图2为BERT模型的结构示意图；Figure 2 is a schematic diagram of the structure of the BERT model;

图3为本发明考虑注意力机制的微博谣言检测方法的流程图；Fig. 3 is the flow chart of the microblog rumor detection method considering the attention mechanism of the present invention;

图4为神经网络Text CNN模型的结构示意图；Figure 4 is a schematic diagram of the structure of the neural network Text CNN model;

图5为引入注意力机制的结构示意图；Figure 5 is a schematic structural diagram of introducing an attention mechanism;

图6为实施例一实验结果MATLAB仿真图；Fig. 6 is the MATLAB simulation diagram of the experimental result of embodiment one;

图7为实施例二实验结果MATLAB仿真图。FIG. 7 is a MATLAB simulation diagram of the experimental result of the second embodiment.

具体实施方式Detailed ways

为了使本发明的目的、技术方案和优点更加清楚，下面结合附图和具体实施例对本发明进行详细描述。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

本发明一种微博谣言检测方法，考虑了注意力机制，该方法整体流程如图1所示，主要包含以下步骤：A microblog rumor detection method of the present invention takes into account the attention mechanism. The overall flow of the method is shown in Figure 1, which mainly includes the following steps:

步骤1，收集微博事件及相应评论数据作为样本数据；Step 1, collect Weibo events and corresponding comment data as sample data;

这里的样本数据包含谣言样本数据和非谣言样本数据；The sample data here includes rumor sample data and non-rumor sample data;

谣言样本数据标签为“1”，非谣言样本数据标签为“0”。The label of rumor sample data is "1", and the label of non-rumor sample data is "0".

步骤2，对样本数据进行预处理，使用正则表达式提取相应文本内容；Step 2: Preprocess the sample data, and use regular expressions to extract corresponding text content;

预处理主要目的是去除文本中的噪声，包括非中文字符、标点、停用词等。由于样本数据都是以json格式的文件进行储存；json文件是以“键值对”的形式储存数据，将数据名称作为json文件中的键，爬取到的数据值作为json文件中的值，例如“text:早餐。不许联想，以免跨省。”；The main purpose of preprocessing is to remove noise in the text, including non-Chinese characters, punctuation, stop words, etc. Since the sample data is stored in the json format file; the json file stores the data in the form of "key-value pairs", the data name is used as the key in the json file, and the crawled data value is used as the value in the json file. For example "text: breakfast. Do not associate, so as not to cross the province.";

单个微博原事件的全部数据为一个json文件，单个事件的所有评论的全部数据为一个json文件；All the data of a single Weibo original event is a json file, and all the data of all comments of a single event is a json file;

使用正则表达式去除json文件中的噪声，对应提取微博原事件及其所有评论的text 文本内容储存下来；Use regular expressions to remove noise in the json file, and store the text content corresponding to the original microblog event and all its comments;

全部文本按照训练数据与测试数据4：1的比例用于后续模型的处理。All texts are used for subsequent model processing according to the ratio of training data to test data 4:1.

步骤3，下载BERT预训练模型，把文本转化成相应句向量；Step 3, download the BERT pre-training model, and convert the text into the corresponding sentence vector;

BERT模型通过下载谷歌的BERT预训练模型可以获得，预训练的中文BERT模型与代码均来自于Google Research的BERT，能够实现词向量的嵌入，基本结构模型如图2所示；The BERT model can be obtained by downloading Google's BERT pre-training model. The pre-trained Chinese BERT model and code are all from Google Research's BERT, which can implement word vector embedding. The basic structure model is shown in Figure 2;

BERT：全称是Bidirectional Encoder Representation from Transformers，即Transformer的双向编码表示来改进基于架构微调的方法。BERT模型作为词向量模型，能够充分描述字符级、词级、句子级以至于句子间关系特征，目的是将下游的NLP任务逐渐移到预训练产生句向量上；BERT: The full name is Bidirectional Encoder Representation from Transformers, which is the bidirectional encoding representation of Transformer to improve methods based on architecture fine-tuning. As a word vector model, the BERT model can fully describe the character-level, word-level, sentence-level and even the relationship between sentences. The purpose is to gradually move the downstream NLP tasks to the sentence vectors generated by pre-training;

BERT模型包括以下特征：BERT模型提出了一种新的预训练目标：遮蔽语言模型(masked language model，MLM)，克服传统的单向性局限，MLM目标允许表征融合左右两侧的语境，从而可以预训练一个深度双向的Transformer；BERT模型引入了“下一句预测”任务，可以和MLM共同训练文本对的表示；BERT模型运用了句子级负采样，对于句子级的连续性预测，就是预测输入BERT的两端文本是否连续。在训练过程中，输入模型的第二段将从所有文本中随机选择，概率为50％，其余50％将选择第一段的后续文本。The BERT model includes the following features: The BERT model proposes a new pre-training objective: a masked language model (MLM), which overcomes the traditional one-way limitation. The MLM objective allows the representation to fuse the left and right contexts, thereby A deep bidirectional Transformer can be pre-trained; the BERT model introduces the "next sentence prediction" task, which can jointly train the representation of text pairs with MLM; the BERT model uses sentence-level negative sampling, and for sentence-level continuity prediction, it is the prediction input Whether the text at both ends of BERT is continuous. During training, the second paragraph of the input model will be randomly selected from all texts with a probability of 50%, and the remaining 50% will choose the subsequent text of the first paragraph.

步骤4，按照选用句子长度和句向量维度构建相应输入矩阵；Step 4, construct the corresponding input matrix according to the selected sentence length and sentence vector dimension;

本文采用BERT base模型，网络层数为12层，训练出的句向量维度为768维；This paper adopts the BERT base model, the number of network layers is 12, and the dimension of the trained sentence vector is 768 dimensions;

从微博原文和对应所有评论的句向量中选取固定条数句向量组成输入矩阵。A fixed number of sentence vectors are selected from the original text of Weibo and sentence vectors corresponding to all comments to form an input matrix.

步骤5，采用深度学习方法，构建Text CNN-Attention的多层次训练模型。In step 5, a deep learning method is used to construct a multi-level training model of Text CNN-Attention.

图3所示为本发明提出的考虑注意力机制的谣言检测方法的详细流程图，模型的第一层为输入层，主要是输入经过BERT预训练模型生成的句向量组成，这里的整件微博事件构成是原微博加取出的对应随机数条评论；紧接着是卷积层，这里分别运用不同尺寸的过滤器来进行卷积对输入层的句向量进行学习，可以得到基于不同过滤器的特征表示。将属于同一窗口的特征进行拼接，就可以得到窗口的特征向量，根据先后顺序的不同就可以得到特征序列；第三层是在特征序列中引入注意力机制，可以根据注意力分配的不同，对每一个特征都赋予不同的权重，这样对输出结果影响较大的特征就会被赋予更多的权重，从而对结果产生更重要的影响，最后将输出传入分类器进行事件谣言与否的判别。Figure 3 shows a detailed flow chart of the rumor detection method considering the attention mechanism proposed by the present invention. The first layer of the model is the input layer, which is mainly composed of sentence vectors generated by the BERT pre-training model. The composition of the blog event is the corresponding random number of comments extracted from the original Weibo; followed by the convolution layer, where filters of different sizes are used to perform convolution to learn the sentence vectors of the input layer, and the results can be obtained based on different filters. feature representation. By splicing the features belonging to the same window, the feature vector of the window can be obtained, and the feature sequence can be obtained according to the different order. Each feature is given a different weight, so that the features that have a greater impact on the output result will be given more weight, which will have a more important impact on the result, and finally the output will be passed to the classifier to judge whether the event is rumored or not. .

图4所示为Text CNN模型结构说明，详细过程如下：Figure 4 shows the structure of the Text CNN model. The detailed process is as follows:

(1)对于数据集中所有的谣言与非谣言事件及其相应的评论，通过BERT预处理模型都训练成了句向量。对于每一个微博事件，选取其事件下的相应数条评论和原微博一起作为输入传入输入层，输入层为一个m×n的矩阵，m为选取的事件的总数量， n则为单条句向量的长度。(1) For all rumor and non-rumor events in the dataset and their corresponding comments, the BERT preprocessing model is trained into sentence vectors. For each microblog event, select the corresponding number of comments under the event and the original microblog as input to the input layer, where the input layer is an m×n matrix, m is the total number of selected events, and n is The length of a single sentence vector.

(2)通过使用尺寸不同的三种过滤器进行卷积，分别得到对应不同过滤器的特征，过滤器会在m×n的输入矩阵中不停的滑动，为了方便提取特征，设定过滤器的长度为k，宽度与输入矩阵宽度一样为n，一个过滤器提取出的特征就可以表示为h∈R^k×n，那么对应m中的任意一条u所获得的特征为：(2) By using three types of filters with different sizes for convolution, the features corresponding to different filters are obtained respectively. The filter will keep sliding in the m×n input matrix. In order to facilitate feature extraction, set the filter The length of is k, and the width is the same as the width of the input matrix. The feature extracted by a filter can be expressed as h∈R ^k×n , then the feature obtained corresponding to any u in m is:

w_u＝(x_u,x_u+1,…,x_u-k+1)w _u =(x _u ,x _u+1 ,...,x _u-k+1 )

在对输入矩阵卷积完之后就会生成一个特征列表c，每一次卷积生成的特征都会对应c：c_u＝f(w_u*h+b)，式中的f为ReLU函数，b为偏置项。After the input matrix is convolved, a feature list c will be generated, and the features generated by each convolution will correspond to c: c _u =f(w _u *h+b), where f is the ReLU function, and b is Bias term.

(3)当过滤器在长度为m的输入上滑过时，特征列表的长度为(m-k+1)，假设存在q个过滤器，则会产生q个特征列表，将q通过拼接得到矩阵：(3) When the filter slides over the input of length m, the length of the feature list is (m-k+1). Assuming that there are q filters, q feature lists will be generated, and q is spliced to obtain a matrix :

W₁＝[c₁,c₂,…,c_q]W ₁ =[c ₁ ,c ₂ ,...,c _q ]

c_q代表第q个过滤器产生的特征列表。而本文一共运用了三种不同尺寸的过滤器，最后产生的总的矩阵即为：c _q represents the list of features produced by the qth filter. In this paper, a total of three filters of different sizes are used, and the final total matrix is:

W＝[W₁,W₂,W₃]＝[c₁,c₂,…,c_q,c_q+1,…,c_2q,c_2q+1,…,c_3q]W=[W ₁ ,W ₂ ,W ₃ ]=[c ₁ ,c ₂ ,...,c _q ,c _q+1 ,...,c _2q ,c _2q+1 ,...,c _3q ]

(4)对每种过滤器获得的特征采取最大池化操作得到输出特征，将不同过滤器输出特征进行全连接得到CNN输出：(4) The maximum pooling operation is performed on the features obtained by each filter to obtain the output features, and the output features of different filters are fully connected to obtain the CNN output:

W'＝[c₁₁,c₂₂,…,c_kk]。W'=[c ₁₁ ,c ₂₂ ,...,c _kk ].

(5)采用注意力层对CNN层的输出进行加权求和，以获取微博序列的隐层表示，引入注意力机制的结构图如图5所示。对CNN网络引入注意力机制能给CNN网络输出的隐状态序列W'赋予不同的权重，这样在学习微博序列的表示时模型能够有侧重的利用微博序列信息。该注意力层将CNN网络的输出c_kk作为输入，输出微博序列对应的表示v_kk，(5) The output of the CNN layer is weighted and summed by the attention layer to obtain the hidden layer representation of the microblog sequence. The structure diagram of introducing the attention mechanism is shown in Figure 5. Introducing the attention mechanism to the CNN network can give different weights to the hidden state sequence W' output by the CNN network, so that the model can focus on the use of the microblog sequence information when learning the representation of the microblog sequence. The attention layer takes the output c _kk of the CNN network as input, and outputs the corresponding representation v _kk of the microblog sequence,

h_i＝tanh(W_A*c_kk+b_A)h _i =tanh(W _A *c _kk +b _A )

组成矩阵V＝[v₁₁,v₂₂,…,v_kk]，W_A为权重矩阵，b_A为偏置值，h_i为c_kk的隐层表示，α_i为h_i与上下文h_A的相似度，v_i为输出向量。Composition matrix V=[v ₁₁ , v ₂₂ ,...,v _kk ], W _A is the weight matrix, b _A is the bias value, _hi is the hidden layer representation of c _kk , α _i is the relationship between _hi and context h _A Similarity, v _i is the output vector.

(6)将输出送入全连接层，通过Softmax得到谣言与非谣言的概率输出，从而达到判断谣言事件的目的。(6) The output is sent to the fully connected layer, and the probability output of rumor and non-rumor is obtained through Softmax, so as to achieve the purpose of judging rumor events.

步骤6，用多层次训练模型对输入矩阵进行训练并测试，得到相应谣言检测结果。Step 6: Use the multi-level training model to train and test the input matrix to obtain the corresponding rumor detection result.

实施例一：Example 1:

为了证明本发明的有效性，我们选取了Ma等人整理并用于论文里的一系列基于微博平台的事件数据，该数据集是通过微博API捕获的原始信息以及给定事件的所有转发和回复，还抓取了未经报道为谣言的一般主题帖子并收集类似数量的谣言事件，详细的统计情况如下表所列：In order to demonstrate the effectiveness of the present invention, we selected a series of event data based on the microblogging platform organized by Ma et al. In response, we also grabbed general topic posts that were not reported as rumors and collected a similar number of rumor incidents. The detailed statistics are listed in the following table:

我们将所有数据按照训练集与测试集4：1的比例进行划分，具体划分情况如下表所列：We divide all data according to the ratio of training set and test set 4:1. The specific division is listed in the following table:

我们采用的用来评估模型有效性的评价指标分别是准确率、精确率、召回率和F1值四个，预测结果与实际结果所产生的情况如下表所列：The evaluation indicators we use to evaluate the effectiveness of the model are the accuracy rate, precision rate, recall rate and F1 value. The predicted results and the actual results are listed in the following table:

我们用作对比的基线方法有四种，分别是SVM-TS、CNN-1、CNN-2、CNN-GRU，关于我们方法与基线方法在谣言检测中的效果比较详细数据如下表所列，实验结果 MATLAB仿真图如图6所示：There are four baseline methods we use for comparison, namely SVM-TS, CNN-1, CNN-2, and CNN-GRU. The detailed data about the comparison between our method and the baseline method in rumor detection are listed in the following table. The resulting MATLAB simulation diagram is shown in Figure 6:

由表可知传统的SVM-TS方法使用分类器进行谣言检测最后的准确率仅仅只有85.7％，效果并不是特别优秀，对比GRU-1、GRU-2、CNN-GRU三种模型的最后结果可以发现训练模型中加入卷积神经网络后因为可以通过过滤器提取到输入中不同的潜在特征，所以在准确率上有更好的表现达到了95.7％，而我们的模型在引入注意力机制后，将CNN的输出作为输入赋予不同的权重，这样对输出结果影响较大的特征就会被赋予更多的权重，从而对结果产生更重要的影响有助于进行谣言检测，结果表明我们的模型准确率达到了96.8％，并且在召回率和F1值上也有不错的提升。It can be seen from the table that the final accuracy rate of the traditional SVM-TS method using the classifier for rumor detection is only 85.7%, and the effect is not particularly good. Comparing the final results of the three models of GRU-1, GRU-2, and CNN-GRU, it can be found that After adding the convolutional neural network to the training model, because different latent features in the input can be extracted through the filter, it has a better performance of 95.7% in accuracy. After introducing the attention mechanism, our model will The output of CNN is given different weights as input, so that the features that have a greater impact on the output results will be given more weights, which will have a more important impact on the results, which is helpful for rumor detection. The results show that the accuracy of our model is It reaches 96.8%, and also has a good improvement in recall and F1 value.

实施例二：Embodiment 2:

为了证明我们方法的可行性，我们还选用了另一个微博数据集CED_Data set[23]进行试验，通过使用相同预训练模型获得的句向量在不同训练模型上训练得到准确率进行比较。该数据集包含了1538条谣言事件和1849条非谣言事件，我们按照训练集与测试集4：1的比例进行实验，实验数据如下表所列，实验结果MATLAB仿真图如图7所示：In order to demonstrate the feasibility of our method, we also selected another microblog dataset CED_Data set [23] for experiments, and compared the accuracy obtained by training on different training models using sentence vectors obtained from the same pre-training model. The data set contains 1538 rumor events and 1849 non-rumor events. We conduct experiments according to the ratio of training set to test set of 4:1. The experimental data is listed in the following table. The MATLAB simulation diagram of the experimental results is shown in Figure 7:

实验结果表明，通过BERT预训练模型获得的句向量在不同的训练模型上进行训练在准确率方面仍然会有偏差，但是偏差幅度对比之前使用不同预训练模型要小。通过实验可以得出，SVM-TS的准确率大概为86.7％，其次依次是GRU-1、CNN-GRU、 GRU-2模型，效果最好的是我们提出的CNN-Attention模型，准确率达到了95.3％，并且在召回率和F1值上体现出的效果也是众多模型中最好的。The experimental results show that the sentence vectors obtained by the BERT pre-training model will still have deviations in accuracy when trained on different training models, but the deviation is smaller than that of different pre-training models. Through experiments, it can be concluded that the accuracy rate of SVM-TS is about 86.7%, followed by GRU-1, CNN-GRU, and GRU-2 models. The best effect is our proposed CNN-Attention model, and the accuracy rate has reached 95.3%, and the effect on recall and F1 value is also the best among many models.

综上所述，我们的模型在两个不同的数据集上都表现出了最好的效果，通过使用BERT预训练模型可以大幅度提高预处理出的句向量的特征表现效果，搭配融入了注意力机制的CNN模型可以更有效的提取出文本中的潜在特征，对谣言检测任务意义重大。In summary, our model has shown the best results on two different datasets. By using the BERT pre-training model, the feature performance of the preprocessed sentence vectors can be greatly improved. The CNN model of the force mechanism can more effectively extract the latent features in the text, which is of great significance to the task of rumor detection.

本发明主要从预训练模型和训练模型两个方面阐述微博谣言事件检测问题，主要说明了预训练模型一样会对实验结果产生影响，当把部分下游NLP任务转移到预训练模型进行时可以取得更好的效果；在训练模型上，基于传统的Text CNN模型本文提出了一种新的引入注意力机制的谣言检测模型，可以对输入的句向量根据其对输入的影响程度赋予不同的权重，从而对预测事件谣言与否产生积极影响。本方法在真实微博数据集上经过实验验证具有较好的谣言检测效果。The present invention mainly expounds the problem of microblog rumor event detection from the two aspects of the pre-training model and the training model, and mainly illustrates that the pre-training model will also affect the experimental results, and when some downstream NLP tasks are transferred to the pre-training model, it can be obtained Better effect; in the training model, based on the traditional Text CNN model, this paper proposes a new rumor detection model that introduces an attention mechanism, which can assign different weights to the input sentence vector according to its influence on the input, This will have a positive impact on predicting whether the event is rumored or not. This method has been experimentally verified on the real Weibo dataset and has a good rumor detection effect.

以上实施例仅用以说明本发明的技术方案而非限制，尽管参照较佳实施例对本发明进行了详细说明，本领域的普通技术人员应当理解，可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention can be modified or equivalently replaced. Without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. a microblog rumor detection method, is characterized in that, comprises the steps:

A. Collect Weibo events and corresponding comment datasets as sample data;

B. Preprocess the sample data, and extract the text content of the original microblog and comments respectively;

C. The BERT pre-training model is used to pre-train the text, and each sentence of text generates a fixed-length sentence vector;

D. Construct a dictionary, extract the original microblog and the corresponding comments to form a microblog event vector matrix;

E. Use the deep learning method Text CNN-Attention to train the vector matrix to build a multi-level training model;

F. Classify and detect the vector matrix according to the multi-level training model, and obtain the rumor detection result corresponding to the social network data.

2 . The microblog rumor detection method according to claim 1 , wherein the sample data includes rumor sample data and non-rumor sample data. 3 .

3. microblog rumor detection method according to claim 1, is characterized in that: in step B, use regular expression to clear the noise in json file.

4 . The microblog rumor detection method according to claim 3 , wherein the pre-trained all texts are used for subsequent model processing according to the training data and the test data in a ratio of 4:1. 5 .

5. The microblog rumor detection method according to claim 4, wherein the pre-trained BERT model and code can realize the embedding of word vectors.

6. The microblog rumor detection method according to claim 5, characterized in that: the BERT model, as a word vector model, can fully describe character level, word level, sentence level and even the relationship features between sentences, and the NLP task is gradually Move to pre-training to generate sentence vectors.

7. microblog rumor detection method according to claim 1, is characterized in that: described BERT model proposes pre-training target: masked language model (masked language model, MLM), overcomes traditional one-way limitation, MLM target allows The representation fuses the left and right contexts so that a deep bidirectional Transformer can be pre-trained.

8 . The microblog rumor detection method according to claim 7 , wherein the BERT model introduces a task of “predicting the next sentence”, which can jointly train the representation of text pairs with MLM. 9 .

9. microblog rumor detection method according to claim 8, is characterized in that: described BERT model uses sentence-level negative sampling, predicts whether the two ends text of input BERT is continuous; In training process, the second paragraph of input model It will be randomly selected from all texts with a probability of 50%, and the remaining 50% will choose the subsequent text of the first paragraph.

10. microblog rumor detection method according to claim 1, is characterized in that: described building multi-level training model is made up of Text CNN and attention mechanism two parts; Wherein, Text CNN model uses three convolution sizes to be respectively. The convolution kernels of 3, 4, and 5 perform the convolution operation on the vector matrix to be tested, and obtain different feature representations based on the vector matrix for different convolution kernels. Through the pooling operation, each convolution kernel corresponding to the input matrix only produces one maximum feature. , and then connect the feature representations obtained by convolution kernels of different sizes through the full connection operation; the attention mechanism assigns different weights to the feature representations generated after full connection according to the influence of each feature on the output, so that the features with large influence Will have more influence when conducting rumor detection.