CN107180077A

CN107180077A - A social network rumour detection method based on deep learning

Info

Publication number: CN107180077A
Application number: CN201710252179.3A
Authority: CN
Inventors: 解男男; 王星; 刘吉强; 王伟; 韩臻
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2017-04-18
Filing date: 2017-04-18
Publication date: 2017-09-19

Abstract

The present invention discloses a social network rumour detection method based on deep learning, comprising: collecting social network data as sample data; labeling the sample data and segmenting it into words, building a dictionary, and building fixed-length sample word vectors whose elements are the digits of each sample word's numeric code in the dictionary; limiting the number of sample words contained in each sample sentence of the sample data to a fixed value; building a sample sentence matrix using the Word2Vec method, the row vectors of the sample sentence matrix being the sample word vectors of all sample words in the sample sentence; training on the sample sentence matrices using the deep learning method LSTM to build a multi-level training model; building a sentence matrix to be detected using the same method as for the sample sentence matrix; and classifying the sentence matrix to be detected according to the multi-level training model to obtain the rumour detection result for the social network data to be detected. The present invention can detect social network rumours effectively.

Description

A social network rumour detection method based on deep learning

Technical field

The present invention relates to the technical fields of natural language processing and machine learning, and more particularly to a social network rumour detection method based on deep learning.

Background art

In recent years, the spread of rumours on social media has intensified. The China New Media Development Report issued by the Chinese Academy of Social Sciences in 2016 analyzed rumours on Weibo and WeChat, among others. Rumours spread on social media cause social panic and instability and damage the image of the country and the government. How to detect social network rumours effectively and handle them quickly is therefore an urgent problem.

At present, the detection of rumours on social networks relies mainly on manual review and keyword retrieval. Taking Sina Weibo as an example, rumours are currently handled mainly through user reports, official investigation, and manual judgment, which leads to low processing efficiency, time lag, and delayed response. Performing a preliminary classification or early warning of rumours is therefore of great significance for improving the response to social network rumours.

Existing machine learning methods generally take one-dimensional features as input, which poses a challenge for processing large volumes of text data. If all the sentences in a file are flattened into a single vector, the dimensionality explodes and becomes hard to handle. How to represent words and sentences numerically is an important problem in natural language processing, because such a representation helps convert natural language into a form that a computer or algorithm can process. Word2Vec (Word2Vector) is a method proposed by Google. Its idea is that a high-dimensional vector can characterize a word from multiple directions, which extends the expressive range of a number represented in binary form. On this basis, the closeness of two words can be computed from the distance between their vectors.
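The idea of measuring word closeness through vector distance can be sketched as follows. This is a minimal illustration only: the source does not say which distance is used, so cosine similarity is an assumption here, and the three 4-dimensional vectors are hypothetical values rather than trained Word2Vec output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors; real Word2Vec vectors are learned from a corpus
# and typically have hundreds of dimensions.
vec_rumour = [0.9, 0.1, 0.3, 0.0]
vec_hoax = [0.8, 0.2, 0.4, 0.1]
vec_weather = [0.0, 0.9, 0.1, 0.7]

sim_close = cosine_similarity(vec_rumour, vec_hoax)
sim_far = cosine_similarity(vec_rumour, vec_weather)
```

With these made-up values, `sim_close` is larger than `sim_far`, which is the sense in which vectors pointing in similar directions stand for words with similar usage.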

Deep learning is one of the research hotspots in machine learning, and the concept originates from artificial neural networks. A deep learning model usually contains multiple hidden layers, so that attribute categories or features can be represented from a more abstract perspective in order to discover distributed feature representations of the data. The convolutional neural network (CNN) is a widely used deep learning algorithm that uses techniques such as local receptive fields, weight sharing, and sub-sampling to achieve invariance to shift, scaling, and some other deformations. Typical deep learning methods also include the deep belief network (DBN) and the deep Boltzmann machine (DBM). These methods are widely applied in image processing and speech recognition. Whereas image processing is more concerned with relative positions, text processing is more concerned with effectively integrating or reconstructing information from adjacent positions, so recurrent neural networks (RNN) are used more often in this field. One implementation of the recurrent neural network is the long short-term memory network (LSTM). In its simplest form, a neuron consists of three gates. The gating mechanism lets the neuron retain information for a period of time during operation and protects the internal gradient from adverse changes during training. This makes it possible to learn long-term dependencies, so LSTM is well suited to time series and variable-length sequences, natural language in particular.
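To make the gating mechanism concrete, the sketch below implements one step of a scalar LSTM cell using the commonly published formulation (input, forget, and output gates plus a candidate value). The parameter values are hypothetical and chosen only to show the memory effect: with the forget gate close to 1 and the input gate close to 0, the cell state survives several time steps almost unchanged.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps a name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new information to write
    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: input gate ~0, forget gate ~1, output gate ~1.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 20.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.8
for x in [5.0, -3.0, 2.0]:
    h, c = lstm_step(x, h, c, params)
```

After the three steps the cell state `c` is still approximately 0.8 regardless of the inputs, which illustrates how a gated cell can carry information across a sequence.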

Accordingly, it is desirable to provide a social network rumour detection method based on deep learning.

Summary of the invention

An object of the present invention is to provide a social network rumour detection method based on deep learning that performs multi-level training with deep learning techniques and thereby improves the accuracy of social network rumour detection.

To achieve the above object, the present invention adopts the following technical solution:

A social network rumour detection method based on deep learning, comprising the following steps:

collecting social network data as sample data;

labeling the sample data and segmenting it into words, building a dictionary in which each sample word is represented by a fixed-length numeric code, and building sample word vectors whose elements are the individual digits of each sample word's numeric code in the dictionary;

limiting the number of sample words contained in each sample sentence of the sample data to a fixed value;

building a sample sentence matrix using the Word2Vec method, the row vectors of the sample sentence matrix being the sample word vectors of all sample words in the sample sentence, and obtaining the sample sentence matrices of all sample sentences in the sample data;

training on the sample sentence matrices using the deep learning method LSTM to build a multi-level training model;

labeling the social network data to be detected and segmenting it into words, looking up the numeric code of each word to be detected in the dictionary, building word vectors to be detected whose elements are the individual digits of each word's numeric code, limiting the number of words to be detected contained in each sentence to be detected to the fixed value, and building a sentence matrix to be detected using the Word2Vec method, the row vectors of the sentence matrix to be detected being the word vectors of all words to be detected in the sentence to be detected;

classifying the sentence matrix to be detected according to the multi-level training model to obtain the rumour detection result for the social network data to be detected.

Preferably, the sample data includes normal sample data and rumour sample data.

Preferably, the step of limiting the number of sample words contained in each sample sentence to a fixed value further includes: if a sample sentence contains more sample words than the fixed value, discarding the surplus words at the end of the sample sentence; if a sample sentence contains fewer sample words than the fixed value, completing the sample sentence by appending a specific word at the end of the sentence.

The beneficial effects of the present invention are as follows:

The technical solution of the present invention applies deep learning to the detection of social network rumours. It represents the words and sentences of natural language in a data format that deep learning can process and, through multi-level training, effectively improves the handling of social network rumours in the current natural language processing field, while also providing a feasible method for analyzing social network rumours. The technical solution can perform a preliminary classification of social network rumours and evaluate how dangerous a rumour is, thereby reducing the complexity of the current, largely manual handling, improving the efficiency of responding to rumours, and reducing the lag of existing methods.

Brief description of the drawings

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.

Fig. 1 shows the flow chart of the social networks rumour detection method based on deep learning.

Detailed description of the embodiments

To illustrate the present invention more clearly, it is further described below with reference to preferred embodiments and the accompanying drawing. Similar parts are indicated by the same reference numerals in the drawing. Those skilled in the art will appreciate that the content specifically described below is illustrative rather than restrictive and should not limit the scope of protection of the present invention.

As shown in Fig. 1, the social network rumour detection method based on deep learning disclosed by the present invention comprises the following steps:

Social network data is collected as sample data; the sample data includes normal sample data and rumour sample data;

The number of sample words contained in each sample sentence of the sample data is limited to a fixed value. If a sample sentence contains more sample words than the fixed value, the surplus words at the end of the sample sentence are discarded; if a sample sentence contains fewer sample words than the fixed value, the sample sentence is completed by appending a specific word at the end of the sentence. The specific word is a word with no rumour tendency, such as " " or "etc.";

A sample sentence matrix is built using the Word2Vec method. The row vectors of the sample sentence matrix are the sample word vectors of all sample words in the sample sentence, and the sample sentence matrices of all sample sentences in the sample data are obtained. For example, if the sample sentence matrix is an m*n matrix, then m is the number of sample words contained in the sample sentence and n is the length of the sample word vector, where the length of the sample word vector is the length of the sample word's numeric code in the dictionary;

The social network rumour detection method based on deep learning disclosed by the present invention is further described below using a specific set of social network data.

The social network rumour detection method based on deep learning comprises the following steps:

Social network data is collected as sample data: the sample data uses public data from Sina Weibo, comprising 20228 records in total. The rumour sample data, 9097 records in total, was published in the document "Chinese social media rumour statistical semantic analysis". The normal sample data used for comparison comprises 11131 records collected from Weibo. In this implementation, the order of the public Sina Weibo data is shuffled randomly, the first 8000 records are used as sample data, and all remaining records are used as the social network data to be detected;
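The shuffle-and-split step described above can be sketched as follows. The record contents are placeholders, and the function name `shuffle_and_split` and the fixed seed are assumptions made for reproducibility of the illustration; only the counts (9097 rumour records, 11131 normal records, first 8000 used as samples) come from the text.

```python
import random

def shuffle_and_split(records, n_train, seed=0):
    """Shuffle a copy of records and return (first n_train, remainder)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder records as (text, label) pairs; label 1 = rumour, 0 = normal.
records = [("rumour text", 1)] * 9097 + [("normal text", 0)] * 11131

sample_data, data_to_detect = shuffle_and_split(records, n_train=8000)
```

With 20228 records in total, this leaves 12228 records as the data to be detected.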

The sample data is labeled and segmented into words, a dictionary is built in which each sample word is represented by a fixed-length numeric code, and sample word vectors are built whose elements are the individual digits of each sample word's numeric code in the dictionary: an open-source word segmentation tool is called to label the sample data and cut it into individual words, and each numeric code in the dictionary represents one sample word;
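One possible reading of the dictionary step is sketched below. The text does not name the segmentation tool or the code length, so whitespace splitting stands in for a real Chinese word segmenter, and the code length of 5 and the helper names `build_dictionary` and `code_to_vector` are assumptions for illustration.

```python
def build_dictionary(tokenized_sentences, code_length=5):
    """Assign each distinct word a zero-padded numeric code of fixed length.

    Codes start at 1 in order of first appearance. The sketch assumes the
    vocabulary fits within code_length digits.
    """
    dictionary = {}
    for sentence in tokenized_sentences:
        for word in sentence:
            if word not in dictionary:
                dictionary[word] = str(len(dictionary) + 1).zfill(code_length)
    return dictionary

def code_to_vector(code):
    """Turn a numeric code such as '00012' into the digit vector [0, 0, 0, 1, 2]."""
    return [int(digit) for digit in code]

# Whitespace splitting is a stand-in for an open-source word segmentation tool.
raw_sentences = ["free phones given away today", "phones recalled today"]
tokenized = [sentence.split() for sentence in raw_sentences]

dictionary = build_dictionary(tokenized)
recalled_vector = code_to_vector(dictionary["recalled"])
```

Here every word receives a code of the same length, so every digit vector has the same number of elements.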

The number of sample words contained in each sample sentence of the sample data is limited to the fixed value 50. If a sample sentence contains more than 50 sample words after segmentation, the sample words after the 50th are discarded; if a sample sentence contains fewer than 50 sample words after segmentation, it is completed by appending a specific word at the end of the sentence;
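The truncation and completion rule can be sketched as a small helper. The function name `fix_length` and the padding token `"<pad>"` are hypothetical; the text only requires that the appended word carry no rumour tendency.

```python
def fix_length(words, length=50, pad_word="<pad>"):
    """Return exactly `length` words: drop trailing extras or append pad_word."""
    if len(words) >= length:
        return words[:length]
    return words + [pad_word] * (length - len(words))

long_sentence = ["w%d" % i for i in range(60)]
short_sentence = ["free", "phones", "today"]

fixed_long = fix_length(long_sentence)
fixed_short = fix_length(short_sentence)
```

Both results contain 50 words: the long sentence keeps its first 50 words, and the short one keeps its 3 words followed by 47 padding words.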

A sample sentence matrix is built using the Word2Vec method. The row vectors of the sample sentence matrix are the sample word vectors of all sample words in the sample sentence, and the sample sentence matrices of all sample sentences in the sample data are obtained; each sample sentence matrix is a 50*256 matrix;
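Assembling a 50*256 sentence matrix from per-word vectors might look like the sketch below. The embedding table is filled with random numbers purely as a stand-in for trained Word2Vec vectors, and mapping unknown words to a zero vector is an assumption the text does not address.

```python
import random

def build_sentence_matrix(words, embeddings, dim=256):
    """Stack the vector of each word as one row, giving a len(words) x dim matrix."""
    zero_vector = [0.0] * dim  # assumption: words missing from the table become zeros
    return [list(embeddings.get(word, zero_vector)) for word in words]

# Stand-in embedding table; a trained Word2Vec model would supply real vectors.
rng = random.Random(0)
vocabulary = ["free", "phones", "today", "<pad>"]
embeddings = {word: [rng.uniform(-1.0, 1.0) for _ in range(256)] for word in vocabulary}

# A sentence already fixed to 50 words (3 real words plus 47 padding words).
words = ["free", "phones", "today"] + ["<pad>"] * 47
matrix = build_sentence_matrix(words, embeddings)
```

Each row of `matrix` corresponds to one word position, so a sentence of 50 words with 256-dimensional vectors yields the 50*256 shape named in the text.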

The sample sentence matrices are trained using the deep learning method LSTM to build the multi-level training model: the open-source Keras 1.0 platform is used as the implementation environment, and multi-level training is realized with the Sequential model. The Sequential model is a typical way of implementing deep learning. It connects multiple training stages in order, so that the output of the previous layer serves as the input of the current layer, and the output of the current layer serves as the input of the next layer. In this way, training with more layers can in theory be realized. In practice, considering factors such as the amount of training data and over-fitting, a suitable number of layers is usually chosen on the basis of expert experience. In this implementation, five groups of experimental results with one, two, three, four, and five layers are analyzed. The two-layer LSTM is set up as follows (output dimension (output_dim): 128; dropout rate (dropout): 0.5):

model.add(LSTM(output_dim=128, return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(output_dim=128, return_sequences=True))
model.add(Dropout(0.5))
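The ordering principle behind a sequential model, where each layer consumes the output of the one before it, can be illustrated without any deep learning library. The class below is a toy stand-in and not the Keras implementation; the two layer functions are arbitrary placeholders for LSTM and Dropout layers.

```python
class Sequential:
    """Toy sequential container: layers run in the order they were added."""

    def __init__(self):
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def forward(self, values):
        for layer in self.layers:
            # The output of the previous layer is the input of this layer.
            values = layer(values)
        return values

# Placeholder layers operating on a list of numbers.
def double(values):
    return [2.0 * v for v in values]

def add_one(values):
    return [v + 1.0 for v in values]

model = Sequential()
model.add(double)
model.add(add_one)

output = model.forward([1.0, 2.0])
```

Adding further layers only extends the chain, which is why deeper models can in principle be built the same way.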

Using the same method as for building the sample sentence matrix, the social network data to be detected is labeled and segmented into words, the numeric code of each word to be detected is looked up in the dictionary, word vectors to be detected are built whose elements are the individual digits of each word's numeric code, the number of words to be detected contained in each sentence to be detected is limited to the fixed value 50, and a 50*256 sentence matrix to be detected is built using the Word2Vec method; the row vectors of the sentence matrix to be detected are the word vectors of all words to be detected in the sentence to be detected;

The effect of rumour detection is measured below using three evaluation metrics. The first is classification accuracy: the ratio of the number of correctly detected normal and malicious items to the total number of items. The second is the false alarm rate: the ratio of the number of normal items identified as malicious to the total number of normal items. The third is the detection rate: the ratio of the number of correctly identified malicious items to the total number of malicious items.
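The three metrics follow directly from the counts of correct and incorrect predictions in each class. The sketch below computes them from label lists; the function name `evaluate`, the label convention (1 = rumour, 0 = normal), and the ten example labels are illustrative assumptions, not data from the experiments.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, false_alarm_rate, detection_rate).

    Labels: 1 = rumour (malicious), 0 = normal. The sketch assumes both
    classes are present, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (true_pos + true_neg) / len(pairs)
    false_alarm_rate = false_pos / (false_pos + true_neg)
    detection_rate = true_pos / (true_pos + false_neg)
    return accuracy, false_alarm_rate, detection_rate

# Made-up example: 4 rumours (3 caught) and 6 normal items (1 false alarm).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

accuracy, false_alarm_rate, detection_rate = evaluate(y_true, y_pred)
```

For this example the accuracy is 8/10, the false alarm rate is 1/6, and the detection rate is 3/4.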

In this implementation, LSTM models with one to five layers are used to verify the effect of deep learning. The experimental results for the one-layer LSTM are shown in Table 1; the 10 groups of results correspond to 10, 20, ..., 100 iterations respectively. As Table 1 shows, all 10 groups reach a classification accuracy of between 90% and 98%. The order of the samples is shuffled randomly in each iteration, so the accuracy differs across iteration counts. The highest accuracy reached is 98.72%, with a false alarm rate of 0.21% at that point.

Table 1. Accuracy of the one-layer model over ten groups of iterations

In view of the characteristics of deep learning, different numbers of training layers are compared: models with one to five training layers are each evaluated at 10, 30, 50, 70, and 90 iterations. Tables 2, 3, and 4 present the accuracy, false alarm rate, and detection rate respectively. For the same number of training layers, the three metrics differ somewhat across iteration counts. Judging by the averages, however, as the number of iterations increases, accuracy and detection rate generally rise at first, then level off and decline slightly. Among the five groups, the best result appears with the three-layer training model, whose accuracy reaches 98.34%, with a false alarm rate of 0.22% and a detection rate of 96.59%.

Table 2. Accuracy of the five groups of embodiments

Table 3. False alarm rate of the five groups of embodiments

Table 4. Detection rate of the five groups of embodiments

The above is merely an example of rumour detection on one specific set of social network data. In specific applications, under different datasets and application environments, the social network rumour detection method based on deep learning disclosed by the present invention can be applied directly and, with suitably selected algorithm parameters, achieves similar results. The present invention applies deep learning to the processing of social media information: the raw data obtained from social media is processed so that it meets the format required by deep learning. The present invention can effectively improve the detection rate of rumour information, can be applied to current public-opinion management work, and can be extended to detect information other than rumours, such as violent, pornographic, or reactionary content.

Obviously, the above embodiments are merely examples given to illustrate the present invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is not possible to list all embodiments exhaustively here; any obvious change or variation derived from the technical solution of the present invention remains within the scope of protection of the present invention.

Claims

1. A social network rumour detection method based on deep learning, characterized in that the method comprises the following steps:

collecting social network data as sample data;

labeling the sample data and segmenting it into words, building a dictionary in which each sample word is represented by a fixed-length numeric code, and building sample word vectors whose elements are the digits of each sample word's numeric code in the dictionary;

building a sample sentence matrix using the Word2Vec method, the row vectors of the sample sentence matrix being the sample word vectors of all sample words in the sample sentence, and obtaining the sample sentence matrices of all sample sentences in the sample data;

labeling the social network data to be detected and segmenting it into words, looking up the numeric code of each word to be detected in the dictionary, building word vectors to be detected whose elements are the individual digits of each word's numeric code, limiting the number of words to be detected contained in each sentence to be detected to a fixed value, and building a sentence matrix to be detected using the Word2Vec method, the row vectors of the sentence matrix to be detected being the word vectors of all words to be detected in the sentence to be detected;

classifying the sentence matrix to be detected according to the multi-level training model to obtain the rumour detection result for the social network data to be detected.

2. The social network rumour detection method based on deep learning according to claim 1, characterized in that the sample data includes normal sample data and rumour sample data.

3. The social network rumour detection method based on deep learning according to claim 1, characterized in that the step of limiting the number of sample words contained in each sample sentence of the sample data to a fixed value further includes: if a sample sentence contains more sample words than the fixed value, discarding the surplus words at the end of the sample sentence; if a sample sentence contains fewer sample words than the fixed value, completing the sample sentence by appending a specific word at the end of the sentence.