CN111966786B - Microblog rumor detection method - Google Patents

Microblog rumor detection method

Info

Publication number
CN111966786B
CN111966786B
Authority
CN
China
Prior art keywords
microblog
model
training
text
sentence
Prior art date
Legal status
Active
Application number
CN202010757089.1A
Other languages
Chinese (zh)
Other versions
CN111966786A (en)
Inventor
宋玉蓉
潘德宇
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010757089.1A
Publication of CN111966786A
Application granted
Publication of CN111966786B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The invention provides a microblog rumor detection method that incorporates an attention mechanism and comprises the following steps: collecting microblog events and corresponding comment data sets as sample data; preprocessing the sample data and extracting the text contents of the original microblog and the comments; pre-training the text with a BERT pre-training model to generate a fixed-length sentence vector for each sentence of the text; constructing a dictionary and extracting the original microblog and a plurality of corresponding comments to form a microblog event vector matrix; training the vector matrix with the deep learning method Text CNN-Attention to construct a multi-level training model; and classifying the vector matrix with the multi-level training model to obtain the rumor detection result corresponding to the social network data. Compared with traditional rumor detection methods, the method improves accuracy.

Description

Microblog rumor detection method
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a microblog rumor detection method.
Background
Rumors generally refer to unverified statements or descriptions, often concerning an event. With the rapid development of social media, rumors can spread through it at fission-like speed. The microblog, a form of social media, is an emerging class of open Internet social service of the Web 2.0 era. Users can update their microblogs with short texts anytime and anywhere via the Internet or mobile devices, and share information with other users. Compared with the traditional blog, the microblog exhibits the following propagation characteristics: instant sharing, innovative interaction, and vivid real-time presentation; and the following propagation effects: rapid accumulation of attention, low cost, and speed. However, the freedom to publish content, the low barrier for publishers, the wide audience and the diverse distribution channels also promote the publication and diffusion of rumors on microblogs. Rumors propagate on microblogs mostly through comments and the forwarding of information between users, and widely propagated false rumors can have negative effects on society.
Approaches to rumor detection generally fall into two categories. The first is machine learning based on traditional manual feature extraction, which mines features from factors such as rumor content, rumor users, rumor propagation, emotional polarity and user influence, and performs rumor detection with classifiers such as naive Bayes and decision trees. The second is based on deep learning: latent features in the text are learned by constructing a neural network with nonlinear functions, feature representations of the text sequence are learned through neural network models such as CNN and RNN, and rumor detection is finally performed by a nonlinear classifier. At present, research that constructs neural networks for rumor detection through deep learning mostly adopts word2vec word vectors or ELMo as the pre-training model. The word vectors obtained by the former cannot resolve polysemy, so each trained word corresponds to only one vector representation; the latter can adjust word embeddings dynamically according to context, but it uses LSTM rather than the Transformer for feature extraction, and because ELMo splices context vectors to form the current vector, the fused vector features are weaker. The training model mostly adopts a CNN or an RNN; although the CNN can extract sentence-meaning features, it ignores context and word-order features, and after the pooled features are spliced by the fully connected operation, the CNN cannot distinguish the features with the most pronounced influence. Aiming at these existing challenges, the invention provides a new rumor detection model that considers the attention mechanism: for text preprocessing it selects the BERT pre-training model, which can extract latent features of the text; on the training model it introduces the attention mechanism into the CNN model, which can automatically assign different weights according to the influence of different events; and finally a Softmax classifier performs rumor detection.
In view of the above, a method for detecting microblog rumors is needed to solve the above problems.
Disclosure of Invention
The invention aims to provide a microblog rumor detection method with high accuracy.
In order to achieve the above object, the present invention provides a microblog rumor detection method, comprising the following steps:
A. collecting microblog events and corresponding comment data sets as sample data;
B. preprocessing sample data, and respectively extracting text contents of an original microblog and a comment;
C. pre-training the text by adopting a BERT pre-training model, and generating a sentence vector with a fixed length for each sentence of the text;
D. constructing a dictionary, and extracting an original microblog and a plurality of corresponding comments to form a microblog event vector matrix;
E. training the vector matrix by adopting a deep learning method Text CNN-Attention, and constructing a multi-level training model;
F. carrying out classification detection on the vector matrix according to the multi-level training model to obtain a rumor detection result corresponding to the social network data.
As a further improvement of the invention, the sample data comprises rumor sample data and non-rumor sample data.
As a further improvement of the present invention, in the step B, a regular expression is used to remove noise in the json file.
As a further improvement of the present invention, the whole pre-trained text is divided into training data and test data at a ratio of 4:1 for subsequent model processing.
As a further improvement of the present invention, the pre-trained BERT model and code enable the embedding of word vectors.
As a further improvement of the invention, the BERT model serves as the word vector model and can fully characterize character-level, word-level, sentence-level and inter-sentence relationship features, gradually moving the NLP task onto the pre-training generated sentence vectors.
As a further improvement of the invention, the BERT model proposes a pre-training objective, the Masked Language Model (MLM), which overcomes the traditional unidirectional limitation; the MLM objective allows representations that fuse left and right context, so that a deep bidirectional Transformer can be pre-trained.
As a further improvement of the present invention, the BERT model introduces a "next sentence prediction" task, which can train representations of text pairs jointly with the MLM.
As a further improvement of the invention, the BERT model uses sentence-level negative sampling to predict whether the two text segments input to BERT are consecutive; during training, the second segment input to the model is selected at random from the whole corpus with a probability of 50%, and with the remaining 50% probability the text that actually follows the first segment is selected.
As a further improvement of the invention, the constructed multi-level training model consists of Text CNN and an attention mechanism. The Text CNN model performs convolution on the vector matrix to be detected using three convolution kernels of sizes 3, 4 and 5, obtaining different feature representations of the vector matrix for the different kernels; the pooling operation keeps only the single maximum feature produced by each convolution kernel over the input matrix, and the feature representations obtained by the differently sized kernels are connected through the fully connected operation. The attention mechanism assigns different weights to the features produced after full connection according to each feature's influence on the output, so that features with greater influence carry more weight when rumor detection is performed.
The invention has the following beneficial effects. In the microblog rumor detection method, a BERT pre-training model is applied in the text preprocessing stage: the Transformer captures longer-distance dependencies more efficiently and can mine deep context information, so the sentence vectors obtained through pre-training carry better latent features. The training model introduces an attention mechanism that assigns different weights to different features according to their influence, so that features with greater influence on the output result receive more weight and contribute more to the result, which facilitates rumor detection and improves detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention. Wherein:
FIG. 1 is a general flow chart for rumor detection;
FIG. 2 is a schematic diagram of the structure of the BERT model;
FIG. 3 is a flow chart of a microblog rumor detection method in consideration of attention mechanism according to the present invention;
FIG. 4 is a schematic structural diagram of a neural network Text CNN model;
FIG. 5 is a schematic diagram of a drawing attention mechanism;
FIG. 6 is a MATLAB simulation chart of the experimental results of embodiment one;
FIG. 7 is a MATLAB simulation chart of the experimental results of embodiment two.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention discloses a microblog rumor detection method that considers the attention mechanism; the overall flow of the method is shown in FIG. 1 and mainly comprises the following steps:
step 1, collecting microblog events and corresponding comment data as sample data;
sample data here includes rumor sample data and non-rumor sample data;
the rumor sample data label is "1" and the non-rumor sample data label is "0".
Step 2, preprocessing the sample data and extracting the corresponding text content with regular expressions;
The main purpose of preprocessing is to remove noise from the text, including non-Chinese characters, punctuation, stop words, etc. The sample data is stored in json-format files; a json file stores data as "key-value pairs", with the data name as the key and the crawled data value as the value, e.g., "text": "breakfast. No association is allowed to avoid crossing provinces.";
All data of a single original microblog event forms one json file, and all data of all comments on that event forms another json file;
Regular expressions remove the noise in the json files, and the text contents of the original microblog events and of all their comments are extracted and stored correspondingly;
All texts are divided into training data and test data at a ratio of 4:1 for subsequent model processing.
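As an illustration of this preprocessing step, the following is a minimal sketch; the file layout, the "text" field name and the cleaning pattern are assumptions for illustration rather than the exact implementation:

import json
import re

# Keep only Chinese characters; this one pattern removes the noise named
# above (non-Chinese characters and punctuation). Stop-word removal would
# need an additional word list and is omitted here.
CLEAN_PATTERN = re.compile(r"[^\u4e00-\u9fa5]")

def load_texts(json_path):
    """Read one event json file (assumed to hold a list of key-value
    records) and return the cleaned text contents."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    texts = []
    for record in records:
        cleaned = CLEAN_PATTERN.sub("", record.get("text", ""))
        if cleaned:
            texts.append(cleaned)
    return texts

def split_4_to_1(items):
    """Divide a list into training and test portions at the 4:1 ratio."""
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]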
Step 3, downloading a BERT pre-training model, and converting the text into corresponding sentence vectors;
The BERT model can be obtained by downloading Google's pre-trained BERT; the pre-trained Chinese BERT model and code come from Google Research's BERT release and enable word vector embedding. The basic structure of the model is shown in FIG. 2;
BERT, short for Bidirectional Encoder Representations from Transformers, is a method that improves on the fine-tuning-based approach. The BERT model serves as the word vector model, can fully characterize character-level, word-level, sentence-level and inter-sentence relationship features, and aims to gradually move downstream NLP tasks onto the pre-training generated sentence vectors;
The BERT model includes the following features. It proposes a new pre-training objective, the Masked Language Model (MLM), which overcomes the traditional unidirectional limitation: the MLM objective allows representations that fuse left and right context, so that a deep bidirectional Transformer can be pre-trained. It introduces a "next sentence prediction" task, which can train representations of text pairs jointly with the MLM. It applies sentence-level negative sampling for sentence-level continuity prediction, predicting whether the two text segments input to BERT are consecutive: during training, the second segment input to the model is selected at random from the whole corpus with a probability of 50%, and with the remaining 50% probability the text that actually follows the first segment is selected.
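To make the 50% sampling scheme concrete, the following is a minimal sketch of how such next-sentence-prediction pairs can be constructed (the function and the corpus layout are illustrative assumptions; the method itself uses the already pre-trained BERT and does not repeat this pre-training step):

import random

def make_nsp_pair(documents, doc_idx, sent_idx):
    """Build one next-sentence-prediction example: with probability 50% the
    second segment is the sentence that actually follows the first (label 1),
    otherwise a random sentence from the whole corpus (label 0).
    `documents` is a list of sentence lists."""
    first = documents[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(documents[doc_idx]):
        second, is_next = documents[doc_idx][sent_idx + 1], 1
    else:
        rand_doc = random.randrange(len(documents))
        rand_sent = random.randrange(len(documents[rand_doc]))
        second, is_next = documents[rand_doc][rand_sent], 0
    return first, second, is_next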
Step 4, constructing a corresponding input matrix according to the selected sentence length and the sentence vector dimension;
The BERT-base model is adopted, with 12 network layers; the trained sentence vectors are 768-dimensional;
A fixed number of sentence vectors is selected from the original microblog text and the sentence vectors corresponding to all its comments to form the input matrix.
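The two steps above can be sketched as follows. For convenience the sketch uses the Hugging Face transformers port of the Chinese BERT-base checkpoint, while the patent uses Google Research's released model and code; the library choice, the use of the [CLS] vector as the sentence vector, the event size m = 50 and the zero-padding of short events are assumptions for illustration:

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def sentence_vector(text):
    """Return a fixed-length 768-dim sentence vector (the [CLS] representation)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[0, 0].numpy()  # [CLS] token, 768-dim

def event_matrix(original_post, comments, m=50):
    """Stack the original microblog and up to m-1 comments into an m x 768
    input matrix, zero-padding events that have fewer comments."""
    sentences = [original_post] + comments[: m - 1]
    rows = [sentence_vector(s) for s in sentences]
    rows += [np.zeros(768)] * (m - len(rows))
    return np.stack(rows)  # shape (m, 768)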
Step 5, constructing a Text CNN-Attention multi-level training model by adopting a deep learning method.
FIG. 3 is a detailed flowchart of the rumor detection method considering the attention mechanism proposed by the present invention. The first layer of the model is the input layer, consisting mainly of the sentence vectors generated by the BERT pre-training model; a whole microblog event is formed from the original microblog plus a selected number of corresponding comments. Next come the convolution layers, where the sentence vectors of the input layer are convolved with filters of different sizes to learn feature representations based on the different filters. Features belonging to the same window are spliced to obtain the feature vector of that window, and a feature sequence is obtained from the different windows in order. The third layer introduces an attention mechanism over the feature sequence: each feature can be given a different weight according to its attention distribution, so that features with greater influence on the output result receive more weight and contribute more to the result. Finally the output is passed to a classifier to judge whether the event is a rumor.
FIG. 4 shows the structure of the Text CNN model; the detailed process is as follows:
(1) For all rumor and non-rumor events in the dataset and their corresponding comments, sentence vectors are trained by the BERT pre-training model. For each microblog event, the original microblog and a number of corresponding comments under the event are selected as input and passed to the input layer, which is an m×n matrix, where m is the number of selected sentences and n is the length of a single sentence vector.
(2) Convolution is performed with three filters of different sizes to obtain the features corresponding to each filter. A filter slides continuously over the m×n input matrix; to facilitate feature extraction, the length of a filter is set to k and its width to n, the width of the input matrix, so a filter can be expressed as h ∈ R^{k×n}. The window starting at any position u in m is then:

w_u = (x_u, x_{u+1}, …, x_{u+k-1})

After the convolution over the input matrix is completed, a feature list c is generated, with each convolution producing one feature of c:

c_u = f(w_u * h + b)

where f is the ReLU function and b is a bias term.
(3) When a filter slides over an input of length m, the feature list has length (m-k+1). With q filters, q feature lists are generated, which are spliced into a matrix:

W_1 = [c_1, c_2, …, c_q]

where c_q denotes the feature list generated by the q-th filter. Since three filter sizes are used in total, the final overall matrix is:

W = [W_1, W_2, W_3] = [c_1, c_2, …, c_q, c_{q+1}, …, c_{2q}, c_{2q+1}, …, c_{3q}]
(4) A maximum pooling operation is applied to the features obtained from each filter to produce the output features, and the output features of the different filters are fully connected to obtain the CNN output:

W' = [c_11, c_22, …, c_kk].
(5) An attention layer performs a weighted summation over the output of the CNN layer to obtain a hidden-layer representation of the microblog sequence; the structure with the introduced attention mechanism is shown in FIG. 5. Introducing an attention mechanism over the CNN assigns different weights to the hidden state sequence W' output by the CNN, so that the model can draw on the microblog sequence information with emphasis when learning the representation of the microblog sequence. The attention layer takes the CNN output c_kk as input and outputs the corresponding representation v_kk of the microblog sequence:

h_i = tanh(W_A · c_kk + b_A)

a_i = exp(h_i^T · h_A) / Σ_i exp(h_i^T · h_A)

v_kk = a_i · c_kk

These compose the matrix V = [v_11, v_22, …, v_kk], where W_A is a weight matrix, b_A is a bias value, h_i is the hidden representation of c_kk, a_i is the similarity of h_i to the context vector h_A, and v_i is the output vector.
(6) The output is sent to a fully connected layer, and Softmax yields the probability outputs for rumor and non-rumor, so that rumor events can be judged.
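Steps (1) to (6) can be gathered into one model. The following PyTorch sketch follows the structure described above, with three kernel sizes 3, 4 and 5, one max-pooled feature per filter, attention weighting of the fully connected features, and a Softmax output; the number of filters per size (q = 100) and the attention dimension are assumptions, and the sketch illustrates the technique rather than reproducing the patented implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNAttention(nn.Module):
    """Sketch of the Text CNN-Attention model of steps (1)-(6)."""

    def __init__(self, embed_dim=768, num_filters=100, kernel_sizes=(3, 4, 5),
                 attn_dim=64, num_classes=2):
        super().__init__()
        # One Conv2d per kernel size; each filter spans the full embedding width n.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes
        )
        total = num_filters * len(kernel_sizes)          # length of W'
        self.W_A = nn.Linear(1, attn_dim)                # h_i = tanh(W_A c + b_A)
        self.h_A = nn.Parameter(torch.randn(attn_dim))   # context vector
        self.fc = nn.Linear(total, num_classes)

    def forward(self, x):                      # x: (batch, m, embed_dim)
        x = x.unsqueeze(1)                     # (batch, 1, m, embed_dim)
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)     # (batch, q, m-k+1): feature lists
            pooled.append(F.max_pool1d(c, c.size(2)).squeeze(2))  # one max per filter
        w = torch.cat(pooled, dim=1)           # W' = [c_1, ..., c_3q], (batch, 3q)
        h = torch.tanh(self.W_A(w.unsqueeze(2)))   # (batch, 3q, attn_dim)
        alpha = F.softmax(h @ self.h_A, dim=1)     # similarity to context h_A
        v = alpha * w                          # weighted features V
        return F.softmax(self.fc(v), dim=1)    # rumor / non-rumor probabilities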
Step 6, training and testing the input matrix with the multi-level training model to obtain the corresponding rumor detection result.
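Assuming the TextCNNAttention class and the event matrices from the sketches above, step 6 could look as follows; the batch size, learning rate and epoch count are illustrative assumptions, and NLLLoss over the log of the probabilities is used because the sketched model already ends in a Softmax:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def train_and_test(model, X_train, y_train, X_test, y_test, epochs=10):
    """X_* are float32 tensors of shape (N, m, 768); y_* hold the 1/0 labels."""
    loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.NLLLoss()  # model already outputs Softmax probabilities
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            probs = model(xb)
            loss = criterion(torch.log(probs + 1e-9), yb)
            loss.backward()
            optimizer.step()
    with torch.no_grad():
        preds = model(X_test).argmax(dim=1)
    return (preds == y_test).float().mean().item()  # test accuracy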
Embodiment one:
To demonstrate the effectiveness of the present invention, we selected a series of microblog-platform event data collated by Ma et al. and used in their paper. The data set consists of raw information captured through the microblog API, including all forwards and replies to a given event; general subject posts that were not reported as rumors were also captured, giving a number of non-rumor events similar to the number of rumor events. Detailed statistics are listed in the following table:
[Table omitted: statistics of the collected rumor and non-rumor events]
We divided all data into training and test sets at a ratio of 4:1; the specific division is listed in the following table:
[Table omitted: 4:1 division into training and test sets]
The evaluation indexes used to evaluate the effectiveness of the model are the four values of accuracy, precision, recall and F1; the conditions generated by the predicted versus the actual results are listed in the following table:
[Table omitted: confusion matrix of predicted versus actual results]
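For reference, the four evaluation indexes follow from the confusion-matrix counts in the table above in the standard way; a minimal sketch:

def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from the TP/FP/FN/TN counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1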
four baseline methods, SVM-TS, CNN-1, CNN-2, CNN-GRU, were used for comparison, and the detailed data for the effect of our method on rumor testing compared to the baseline method are shown in the following table, and the MATLAB simulation graph of the experimental results is shown in fig. 6:
[Table omitted: detection results of the proposed method versus the baseline methods]
As can be seen from the table, the conventional SVM-TS method, which uses a classifier, reaches a final rumor-detection accuracy of only 85.7%, which is not particularly strong. Comparing the GRU-1, GRU-2 and CNN-GRU models shows that after a convolutional neural network is added to the training model, accuracy improves to 95.7%, because different latent features in the input can be extracted through the filters. After the attention mechanism is introduced into the model, different weights are assigned to the CNN outputs, so that features with greater influence on the output result receive more weight and contribute more to the result, which facilitates rumor detection; the results show that the model reaches an accuracy of 96.8%, with the recall rate and the F1 value improved as well.
Embodiment two:
In order to prove the feasibility of the method, another microblog data set, CED_Dataset [23], was selected for testing; sentence vectors obtained with the same pre-training model were trained on different training models and the resulting accuracies were compared. The data set contains 1538 rumor events and 1849 non-rumor events, divided into training and test sets at a ratio of 4:1. The experimental data are listed in the following table, and the MATLAB simulation chart of the experimental results is shown in FIG. 7:
[Table omitted: accuracy of the different training models on CED_Dataset]
The experimental results show that sentence vectors obtained through the BERT pre-training model still deviate somewhat in accuracy when trained on different training models, but the deviation is small compared with that between different pre-training models. The experiments give SVM-TS an accuracy of about 86.7%, followed in turn by the GRU-1, CNN-GRU and GRU-2 models; the CNN-Attention model performs best, reaching an accuracy of 95.3% and showing the best recall rate and F1 value among the compared models.
In conclusion, the model shows the best effect on the two different data sets: using the BERT pre-training model greatly improves the feature-expression effect of the preprocessed sentence vectors, and pairing it with the CNN model incorporating the attention mechanism extracts the latent features in the text more effectively, which is of great significance for rumor detection tasks.
The microblog rumor event detection problem has been explained mainly from two aspects, the pre-training model and the training model. Regarding the pre-training model, its influence on the experimental results is explained, and moving part of the downstream NLP tasks onto the pre-training model yields better results. On the training model side, a novel rumor detection model introducing an attention mechanism is proposed on the basis of the traditional Text CNN model; it can assign different weights to the input sentence vectors according to their degree of influence on the output, thereby positively influencing the prediction of whether an event is a rumor. Experimental verification on real microblog data sets shows that the method has a good rumor detection effect.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (9)

1. A microblog rumor detection method is characterized by comprising the following steps:
A. collecting microblog events and corresponding comment data sets as sample data;
B. preprocessing the sample data, and respectively extracting text contents of the original microblog and the comment;
C. pre-training the text by adopting a BERT pre-training model, and generating a sentence vector with a fixed length for each sentence of the text;
D. constructing a dictionary, and extracting an original microblog and a plurality of corresponding comments to form a microblog event vector matrix;
E. training the vector matrix by adopting a deep learning method Text CNN-Attention, and constructing a multi-level training model;
the constructed multi-level training model consists of Text CNN and an attention mechanism; the Text CNN model performs convolution on the vector matrix to be detected using three convolution kernels of sizes 3, 4 and 5, obtaining different feature representations of the vector matrix for the different kernels; the pooling operation keeps only the single maximum feature produced by each convolution kernel over the input matrix, and the feature representations obtained by the differently sized kernels are connected through the fully connected operation; the attention mechanism assigns different weights to the features produced after full connection according to each feature's influence on the output, so that features with greater influence carry more weight when rumor detection is performed;
F. carrying out classification detection on the vector matrix according to the multi-level training model to obtain a rumor detection result corresponding to the social network data.
2. The microblog rumor detection method according to claim 1, wherein: the sample data includes rumor sample data and non-rumor sample data.
3. The microblog rumor detection method of claim 1, wherein: in the step B, the noise in the json file is eliminated by using a regular expression.
4. The microblog rumor detection method of claim 3, wherein: all the pre-trained texts are divided into training data and test data at a ratio of 4:1 for subsequent model processing.
5. The microblog rumor detection method according to claim 4, wherein: the pre-trained BERT model and code enable the embedding of word vectors.
6. The microblog rumor detection method according to claim 5, wherein: the BERT model is used as a word vector model, can fully describe character level, word level, sentence level and sentence-sentence relation characteristics, and gradually moves NLP tasks to pre-training generated sentence vectors.
7. The microblog rumor detection method according to claim 1, wherein: the BERT model proposes a pre-training objective, the Masked Language Model (MLM), which overcomes the traditional unidirectional limitation; the MLM objective allows representations that fuse left and right context, so that a deep bidirectional Transformer can be pre-trained.
8. The microblog rumor detection method according to claim 7, wherein: the BERT model introduces a "next sentence prediction" task that can be used to train the representation of text pairs with MLM.
9. The microblog rumor detection method according to claim 8, wherein: the BERT model uses sentence-level negative sampling to predict whether the two text segments input to BERT are consecutive; during training, the second segment input to the model is selected at random from the whole corpus with a probability of 50%, and with the remaining 50% probability the text that actually follows the first segment is selected.
CN202010757089.1A 2020-07-31 2020-07-31 Microblog rumor detection method Active CN111966786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010757089.1A CN111966786B (en) 2020-07-31 2020-07-31 Microblog rumor detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010757089.1A CN111966786B (en) 2020-07-31 2020-07-31 Microblog rumor detection method

Publications (2)

Publication Number Publication Date
CN111966786A CN111966786A (en) 2020-11-20
CN111966786B (en) 2022-10-25

Family

ID=73363172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010757089.1A Active CN111966786B (en) 2020-07-31 2020-07-31 Microblog rumor detection method

Country Status (1)

Country Link
CN (1) CN111966786B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560495B (en) * 2020-12-09 2024-03-15 新疆师范大学 Microblog rumor detection method based on emotion analysis
CN112818011B (en) * 2021-01-12 2022-03-08 南京邮电大学 Improved TextCNN and TextRNN rumor identification method
CN113158075A (en) * 2021-03-30 2021-07-23 昆明理工大学 Comment-fused multitask joint rumor detection method
CN113204641B (en) * 2021-04-12 2022-09-02 武汉大学 Annealing attention rumor identification method and device based on user characteristics
CN113705099B (en) * 2021-05-09 2023-06-13 电子科技大学 Social platform rumor detection model construction method and detection method based on contrast learning
CN113127643A (en) * 2021-05-11 2021-07-16 江南大学 Deep learning rumor detection method integrating microblog themes and comments
CN113326437B (en) * 2021-06-22 2022-06-21 哈尔滨工程大学 Microblog early rumor detection method based on dual-engine network and DRQN
CN113377959B (en) * 2021-07-07 2022-12-09 江南大学 Few-sample social media rumor detection method based on meta learning and deep learning
CN116401339A (en) * 2023-06-07 2023-07-07 北京百度网讯科技有限公司 Data processing method, device, electronic equipment, medium and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280057A (en) * 2017-12-26 2018-07-13 厦门大学 A kind of microblogging rumour detection method based on BLSTM
CN111144131A (en) * 2019-12-25 2020-05-12 北京中科研究院 Network rumor detection method based on pre-training language model
CN111159338A (en) * 2019-12-23 2020-05-15 北京达佳互联信息技术有限公司 Malicious text detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111966786A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111966786B (en) Microblog rumor detection method
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
Chen et al. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection
CN111144131B (en) Network rumor detection method based on pre-training language model
CN113051916B (en) Interactive microblog text emotion mining method based on emotion offset perception in social network
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
Riadi Detection of cyberbullying on social media using data mining techniques
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN105183717A (en) OSN user emotion analysis method based on random forest and user relationship
CN109325125B (en) Social network rumor detection method based on CNN optimization
CN111914553B (en) Financial information negative main body judging method based on machine learning
Zhang et al. Exploring deep recurrent convolution neural networks for subjectivity classification
Rauf et al. Using bert for checking the polarity of movie reviews
Ke et al. A novel approach for cantonese rumor detection based on deep neural network
CN115329085A (en) Social robot classification method and system
Huang A CNN model for SMS spam detection
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN113535960A (en) Text classification method, device and equipment
CN111859955A (en) Public opinion data analysis model based on deep learning
CN116644760A (en) Dialogue text emotion analysis method based on Bert model and double-channel model
CN115659990A (en) Tobacco emotion analysis method, device and medium
Mo et al. Large language model (llm) ai text generation detection based on transformer deep learning algorithm
CN114238738A (en) Rumor detection method based on attention mechanism and bidirectional GRU
Kavatagi et al. A context aware embedding for the detection of hate speech in social media networks
Al Azhar et al. Identifying Author in Bengali Literature by Bi-LSTM with Attention Mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant