CN108090046A

CN108090046A - A microblog rumor identification method based on LDA and random forest

Info

Publication number: CN108090046A
Application number: CN201711483228.0A
Authority: CN
Inventors: 曾子明; 王婧
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2018-05-29
Anticipated expiration: 2037-12-29
Also published as: CN108090046B

Abstract

The invention discloses a microblog rumor identification method based on LDA and random forest. Microblog data are collected from the official microblog platform with a web crawler and labeled manually. The text content is preprocessed and the numeric fields are z-score standardized, and a user reliability feature and a microblog influence feature are calculated from the standardized data. An LDA topic model yields the probability distribution between the optimized text content and the topics and between the topics and the optimized text content words, from which the perplexity is calculated. A microblog feature vector is then constructed. The user reliability feature, the microblog influence feature, and the LDA text-topic distribution probability serve as input features of a random forest model to build a microblog rumor classifier. The invention mines the semantic information of microblog text in depth and achieves high rumor classification accuracy.

Description

A microblog rumor identification method based on LDA and random forest

Technical field

The present invention relates to the fields of social networks and text analysis, and in particular to a social network rumor identification method based on LDA and random forest.

Background art

With the rapid development of the internet and mobile communication devices, online social platforms have become an important channel through which people publish and obtain information and develop and maintain social relationships. Microblogs have attracted a large number of users through their convenient interaction model, friendly interactive experience, and the appeal of the celebrities who have joined them. According to the China Unicom Wo Index for August 2017, microblog monthly active users reached 330 million. As one of the most active social platforms in China, the microblog aggregates a large volume of fragmented user-generated information. Because information on social platforms is severely disordered, and driven by the uncertainty of individual cognition, online rumors breed. Research has found that rumors with a large social impact largely originate from microblog platforms. When official channels are absent, rumors can to some extent relieve people's cognitive anxiety. However, rampant rumors often trigger negative disturbances in online public opinion and pose a potential threat to social stability and citizens' safety, so identifying online rumors is particularly critical.

Current research on rumor identification mainly studies rumor text features, the characteristics of users who publish rumors, and the properties of the propagation network in order to analyze how online rumors arise and spread.

In the above methods, the deep semantic features of rumor content and the reliability and behavioral features of propagating users have not yet been put to good use.

Summary of the invention

To overcome the deficiencies of the prior art, the technical solution of the present invention is a microblog rumor identification method based on LDA and random forest, comprising the following steps:

Step 1: collect microblog data from the official microblog platform using a web crawler, the microblog data comprising text content, like count, repost count, comment count, post count, following count, follower count, verification status, and rumor status; and manually label the microblog data according to rumor information published by the official microblog platform and by government departments;

Step 2: apply irrelevant-character filtering, word segmentation, stop-word removal, and data conversion to the text content of step 1 to obtain the optimized text content and the optimized text content words, and count the number of optimized text content words; z-score standardize the like count, repost count, comment count, post count, following count, and follower count of step 1; combine the results with the optimized text content, the optimized text content words, and the number of optimized text content words to obtain the z-score standardized microblog data; and calculate the user reliability feature and the microblog influence feature from the z-score standardized microblog data;

Step 3: model the optimized text content and the optimized text content words of step 2 with an LDA topic model to obtain the LDA topic distribution probability, the probability distribution between the optimized text content and the topics, and the probability distribution between the topics and the optimized text content words; use the probability distribution between the optimized text content and the topics as the deep semantic text feature for rumor identification; and calculate the perplexity from the text-topic and topic-word distribution probabilities;

Step 4: construct the microblog feature vector from the user reliability feature of step 2, the microblog influence feature of step 2, and the LDA topic distribution probability of step 3;

Step 5: use the user reliability feature of step 2, the microblog influence feature of step 2, and the LDA text-topic distribution probability of step 3 as input features of a random forest model; calculate the optimal parameters of the random forest model based on CART decision trees with a grid search algorithm under 10-fold cross-validation; design the microblog rumor classifier from the optimal parameters combined with the microblog feature vector of step 4; and train it on the manually labeled microblog data of step 1 to obtain the final microblog rumor classifier, which is applied to rumor screening.
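Step 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent names no library, so the use of scikit-learn's `RandomForestClassifier` (whose trees follow a CART-style algorithm) and `GridSearchCV` is an assumption, as are the parameter grid, the synthetic features, and the synthetic labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
n_posts, n_topics = 200, 7  # synthetic sizes, chosen only for the demo

# Synthetic stand-ins for the three feature groups of step 4.
topic_probs = rng.dirichlet(np.ones(n_topics), size=n_posts)  # p_i,1 .. p_i,K
reliability = rng.normal(size=(n_posts, 1))                   # Reliability_i
influence = rng.normal(size=(n_posts, 1))                     # Influence_i
X = np.hstack([topic_probs, reliability, influence])

# Synthetic rumor labels (fake_i); real labels come from manual annotation.
y = (influence[:, 0] + rng.normal(scale=0.5, size=n_posts) > 0).astype(int)

# Hypothetical grid: the patent does not list the parameters it searches.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

# Grid search with 10-fold cross-validation over the random forest.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(X, y)

classifier = search.best_estimator_  # refit on all data with the best parameters
print(search.best_params_)
```

`GridSearchCV` refits the best configuration on the full training set by default, which corresponds to the final training pass described in step 5.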

Preferably, the microblog data of step 1 are:

weibo_i={ doc_i,like_i,repost_i,comment_i,num_i,following_i,follower_i,verify_i, fake_i}

(1≤i≤M)

where M is the number of microblog data items, i is the index of a microblog data item, doc_i is the text content, like_i is the like count, repost_i is the repost count, comment_i is the comment count, num_i is the post count, following_i is the following count, follower_i is the follower count, verify_i is the verification status, and fake_i is the rumor status;

The manual labeling of step 1 is:

User status is verified by the official microblog platform: verify_i indicates whether the user who published weibo_i has passed Sina Weibo personal verification, so verify_i is 1 if the user has passed and 0 otherwise. The microblog data are labeled as rumors according to the rumor information published by government departments: fake_i is 1 if weibo_i is a rumor microblog and 0 otherwise;
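The record weibo_i and its two binary labels can be represented as a small data structure. The sketch below is illustrative only: the class name `Weibo`, the helper `label`, and the sample values are hypothetical, and reading num_i as the author's post count is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Weibo:
    doc: str        # doc_i: text content
    like: int       # like_i: like count
    repost: int     # repost_i: repost count
    comment: int    # comment_i: comment count
    num: int        # num_i: post count (assumed to be the author's total posts)
    following: int  # following_i: following count
    follower: int   # follower_i: follower count
    verify: int     # verify_i: 1 if the author is verified, else 0
    fake: int       # fake_i: 1 if the post is labeled a rumor, else 0


def label(fields: dict, is_verified: bool, is_rumor: bool) -> Weibo:
    """Attach the two binary labels to the crawled fields."""
    return Weibo(**fields, verify=int(is_verified), fake=int(is_rumor))


# Made-up sample values, used only to exercise the structure.
crawled = dict(doc="example text", like=12, repost=3, comment=5,
               num=480, following=150, follower=2300)
sample = label(crawled, is_verified=True, is_rumor=False)
print(sample.verify, sample.fake)
```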

Preferably, the z-score standardized microblog data of step 2 are:

z_weibo_i={ op_doc_i,op_word_i,op_n_i,z_like_i,z_repost_i,z_comment_i,z_num_i,z_following_i,z_follower_i,verify_i,fake_i}(1≤i≤M)

where op_doc_i is the optimized text content, op_word_i is the optimized text content words, op_n_i is the number of optimized text content words, z_like_i is the z-score standardized like count, z_repost_i is the z-score standardized repost count, z_comment_i is the z-score standardized comment count, z_num_i is the z-score standardized post count, z_following_i is the z-score standardized following count, and z_follower_i is the z-score standardized follower count;
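The z-score standardization applied to each count field can be sketched as below. The patent does not say whether the population or the sample standard deviation is used; the population form is assumed here, and the function name `z_score` and the sample counts are illustrative.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize a column: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


likes = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up like counts
z_likes = z_score(likes)
print(z_likes)
```

The same function would be applied column by column to the repost, comment, post, following, and follower counts.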

The user reliability feature of step 2 is:

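The microblog influence feature of step 2 is:

$$\mathrm{Influence}_i = \log\left(e^{\mathrm{z\_follower}_i} + e^{\mathrm{z\_repost}_i} + e^{\mathrm{z\_comment}_i}\right) + \mathrm{z\_like}_i$$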
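The claims give the influence feature as Influence_i = log(e^{z_follower_i} + e^{z_repost_i} + e^{z_comment_i}) + z_like_i. A minimal sketch of that formula follows. The natural logarithm is assumed because the patent does not state the base, and the max-shift is a standard numerical-stability rewrite that leaves the value unchanged.

```python
import math


def influence(z_follower, z_repost, z_comment, z_like):
    """Influence_i = log(e^z_follower + e^z_repost + e^z_comment) + z_like."""
    terms = (z_follower, z_repost, z_comment)
    shift = max(terms)
    # log(sum(e^t)) == shift + log(sum(e^(t - shift))); avoids overflow for large t.
    log_sum_exp = shift + math.log(sum(math.exp(t - shift) for t in terms))
    return log_sum_exp + z_like


print(influence(0.0, 0.0, 0.0, 0.0))  # equals log(3) when all inputs are zero
```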

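Preferably, the perplexity of step 3 is:

$$\mathrm{perplexity}(D) = \exp\left\{-\frac{\sum_{i=1}^{M} \log p(\mathrm{op\_word}_i)}{\sum_{i=1}^{M} \mathrm{op\_n}_i}\right\}$$

D={ op_word₁,...,op_word_M}

$$p(\mathrm{op\_word}_i) = \sum_{j=1}^{K} p(z_j \mid \mathrm{op\_doc}_i)\, p(\mathrm{op\_word}_i \mid z_j)$$

pweibo_i=(p_i,1,...,p_i,K)(1≤i≤M)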

where M is the number of microblog data items of step 1, op_n_i is the number of optimized text content words of step 2, op_word_i is the optimized text content words of step 2, p(op_word_i) is the probability of the optimized text content words in the optimized text content, D is the set of all optimized text content words, p(z_j|op_doc_i) is the probability that the j-th topic occurs in the optimized text content of the i-th z-score standardized microblog data item of step 2, p(op_word_i|z_j) is the probability that the optimized text content words of the i-th z-score standardized microblog data item of step 2 occur in the j-th topic, K is the number of topics at which the perplexity is minimal, pweibo_i is the LDA topic distribution probability of the i-th z-score standardized microblog data item of step 2, and p_i,1~p_i,K are the probabilities of topics z₁~z_K respectively;
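The perplexity computation can be sketched from the two LDA distributions. The sketch assumes the usual reading of the formula in the claims: log p(op_word_i) is taken as the sum of the log-probabilities of the individual words of post i, and each word probability is the mixture over topics of p(z_j|op_doc_i) p(word|z_j). The names `theta` and `phi` and the toy inputs are illustrative.

```python
import math


def perplexity(docs, theta, phi):
    """exp(-sum_i log p(words of doc i) / sum_i n_i).

    docs  : list of documents, each a list of word ids
    theta : theta[i][j] = p(topic j | document i)
    phi   : phi[j][w]   = p(word w | topic j)
    """
    total_log_prob = 0.0
    total_words = 0
    for i, words in enumerate(docs):
        for w in words:
            # Mixture over topics: sum_j p(z_j | doc_i) * p(w | z_j).
            p_w = sum(theta[i][j] * phi[j][w] for j in range(len(phi)))
            total_log_prob += math.log(p_w)
        total_words += len(words)
    return math.exp(-total_log_prob / total_words)


# Toy model: two topics that both spread mass evenly over a two-word vocabulary.
docs = [[0, 0], [1, 1]]
theta = [[1.0, 0.0], [0.0, 1.0]]
phi = [[0.5, 0.5], [0.5, 0.5]]
print(perplexity(docs, theta, phi))
```

With every word assigned probability 0.5, the toy model is as uncertain as a fair coin per word, so its perplexity equals the vocabulary size of 2.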

Preferably, the microblog feature vector of step 4 is:

cweibo_i=(p_i,1,...,p_i,K,Reliability_i,Influence_i)(1≤i≤M)

where M is the number of microblog data items of step 1, p_i,1~p_i,K are the probabilities of topics z₁~z_K of step 3, Reliability_i is the user reliability feature of step 2, and Influence_i is the microblog influence feature of step 2.
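Assembling cweibo_i amounts to concatenating the K topic probabilities with the two scalar features. In the sketch below, the function name `build_feature_vector`, the sanity check on the topic probabilities, and the sample values are assumptions added for illustration.

```python
import math


def build_feature_vector(topic_probs, reliability, influence):
    """Return (p_i,1, ..., p_i,K, Reliability_i, Influence_i) as a flat list."""
    if not math.isclose(sum(topic_probs), 1.0, abs_tol=1e-6):
        raise ValueError("topic probabilities should sum to 1")
    return [*topic_probs, reliability, influence]


# Made-up values for a post with K = 3 topics.
cweibo = build_feature_vector([0.2, 0.5, 0.3], reliability=0.8, influence=-0.1)
print(cweibo)
```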

Compared with the prior art, the present invention mines the semantic information of microblog text in depth with an LDA topic model to obtain the LDA text-topic distribution probability, and uses it together with the defined user reliability feature and microblog influence feature as input variables for training a random forest classifier. The rumor identification effect of the invention is notable, and the model's rumor identification accuracy is high.

Description of the drawings

Fig. 1 is a flow chart of the method of an embodiment of the present invention.

Detailed description of the embodiments

To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawing and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

Referring to Fig. 1, the flow chart of the method of an embodiment of the present invention, the invention provides a microblog rumor identification method based on LDA and random forest, comprising the following steps:

Step 1: collect microblog data from 2016 on the Sina Weibo platform using a web crawler, the microblog data comprising text content, like count, repost count, comment count, post count, following count, follower count, verification status, and rumor status; and manually label the microblog data, taking as the rumor benchmark the rumor-refuting posts published by the official Sina Weibo rumor-refuting account and the 2016 haze rumors jointly exposed on December 30 by the Publicity and Education Center of the Ministry of Environmental Protection and the Beijing Environmental Protection Publicity and Education Center;

Preferably, the microblog data of step 1 are:

weibo_i={ doc_i,like_i,repost_i,comment_i,num_i,following_i,follower_i,verify_i, fake_i}(1)

(1≤i≤M)

where M=872 is the number of microblog data items, i is the index of a microblog data item, doc_i is the text content, like_i is the like count, repost_i is the repost count, comment_i is the comment count, num_i is the post count, following_i is the following count, follower_i is the follower count, verify_i is the verification status, and fake_i is the rumor status;

The manual labeling of step 1 is:

The z-score standardized microblog data of step 2 are:

z_weibo_i={ op_doc_i,op_word_i,op_n_i,z_like_i,z_repost_i,z_comment_i,z_num_i,z_following_i,z_follower_i,verify_i,fake_i}(1≤i≤M) (2)

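The user reliability feature of step 2 is:

The microblog influence feature of step 2 is:

$$\mathrm{Influence}_i = \log\left(e^{\mathrm{z\_follower}_i} + e^{\mathrm{z\_repost}_i} + e^{\mathrm{z\_comment}_i}\right) + \mathrm{z\_like}_i$$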

Step 3: model the optimized text content and the optimized text content words of step 2 with an LDA topic model to obtain the LDA topic distribution probability, the probability distribution between the optimized text content and the topics, and the probability distribution between the topics and the optimized text content words; use the probability distribution between the optimized text content and the topics as the deep semantic text feature for rumor identification; and calculate the perplexity from the text-topic and topic-word distribution probabilities;

Preferably, the perplexity of step 3 is:

$$\mathrm{perplexity}(D) = \exp\left\{-\frac{\sum_{i=1}^{M} \log p(\mathrm{op\_word}_i)}{\sum_{i=1}^{M} \mathrm{op\_n}_i}\right\}$$

D={ op_word₁,...,op_word_M} (6)

$$p(\mathrm{op\_word}_i) = \sum_{j=1}^{K} p(z_j \mid \mathrm{op\_doc}_i)\, p(\mathrm{op\_word}_i \mid z_j)$$

pweibo_i=(p_i,1,...,p_i,K)(1≤i≤M) (8)

where M=872 is the number of microblog data items of step 1, op_n_i is the number of optimized text content words of step 2, op_word_i is the optimized text content words of step 2, p(op_word_i) is the probability of the optimized text content words in the optimized text content, D is the set of optimized text content words, p(z_j|op_doc_i) is the probability that the j-th topic occurs in the optimized text content of the i-th z-score standardized microblog data item of step 2, p(op_word_i|z_j) is the probability that the optimized text content words of the i-th z-score standardized microblog data item of step 2 occur in the j-th topic, K=7 is the number of topics at which the perplexity is minimal, pweibo_i is the LDA topic distribution probability of the i-th z-score standardized microblog data item of step 2, and p_i,1~p_i,K are the probabilities of topics z₁~z_K respectively;
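Choosing the number of topics at which the perplexity is minimal can be sketched as a simple search over candidate values. The helper `select_topic_count`, the candidate range, and the synthetic curve are hypothetical. The curve is not measured data: its minimum at 5 is arbitrary and unrelated to the K=7 reported for this embodiment. In practice `perplexity_of` would train an LDA model with k topics and return its perplexity.

```python
def select_topic_count(candidate_ks, perplexity_of):
    """Return the candidate K with the lowest perplexity, plus all scores."""
    scores = {k: perplexity_of(k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


def toy_perplexity(k):
    # Synthetic convex curve used only to exercise the search.
    return 300.0 + (k - 5) ** 2


best_k, scores = select_topic_count(range(2, 16), toy_perplexity)
print(best_k)
```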

Preferably, the microblog feature vector of step 4 is:

cweibo_i=(p_i,1,...,p_i,K,Reliability_i,Influence_i)(1≤i≤M) (9)

where M=872 is the number of microblog data items of step 1, p_i,1~p_i,K are the probabilities of topics z₁~z_K of step 3, Reliability_i is the user reliability feature of step 2, and Influence_i is the microblog influence feature of step 2.

It should be understood that the above description of the preferred embodiments is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the present invention. Under the teaching of the present invention and without departing from the scope protected by the claims, those of ordinary skill in the art may make substitutions or variations, all of which fall within the protection scope of the invention. The claimed scope of the invention is determined by the appended claims.

Claims

1. A microblog rumor identification method based on LDA and random forest, characterized by comprising the following steps:

Step 1: collect microblog data from the official microblog platform using a web crawler, the microblog data comprising text content, like count, repost count, comment count, post count, following count, follower count, verification status, and rumor status; and manually label the microblog data according to rumor information published by the official microblog platform and by government departments;

Step 2: apply irrelevant-character filtering, word segmentation, stop-word removal, and data conversion to the text content of step 1 to obtain the optimized text content and the optimized text content words, and count the number of optimized text content words; z-score standardize the like count, repost count, comment count, post count, following count, and follower count of step 1; combine the results with the optimized text content, the optimized text content words, and the number of optimized text content words to obtain the z-score standardized microblog data; and calculate the user reliability feature and the microblog influence feature from the z-score standardized microblog data;

Step 3: model the optimized text content and the optimized text content words of step 2 with an LDA topic model to obtain the LDA topic distribution probability, the probability distribution between the optimized text content and the topics, and the probability distribution between the topics and the optimized text content words; use the probability distribution between the optimized text content and the topics as the deep semantic text feature for rumor identification; and calculate the perplexity from the text-topic and topic-word distribution probabilities;

Step 4: construct the microblog feature vector from the user reliability feature of step 2, the microblog influence feature of step 2, and the LDA topic distribution probability of step 3;

Step 5: use the user reliability feature of step 2, the microblog influence feature of step 2, and the LDA text-topic distribution probability of step 3 as input features of a random forest model; calculate the optimal parameters of the random forest model based on CART decision trees with a grid search algorithm under 10-fold cross-validation; design the microblog rumor classifier from the optimal parameters combined with the microblog feature vector of step 4; and train it on the manually labeled microblog data of step 1 to obtain the final microblog rumor classifier, which is applied to rumor screening.

2. The microblog rumor identification method based on LDA and random forest according to claim 1, characterized in that the microblog data of step 1 are:

weibo_i={ doc_i,like_i,repost_i,comment_i,num_i,following_i,follower_i,verify_i,fake_i} (1≤i≤M)

The manual labeling of step 1 is:

User status is verified by the official microblog platform: verify_i indicates whether the user who published weibo_i has passed Sina Weibo personal verification, so verify_i is 1 if the user has passed and 0 otherwise. The microblog data are labeled as rumors according to the rumor information published by government departments: fake_i is 1 if weibo_i is a rumor microblog and 0 otherwise;

The z-score standardized microblog data of step 2 are:

z_weibo_i={ op_doc_i,op_word_i,op_n_i,z_like_i,z_repost_i,z_comment_i,z_num_i,z_following_i,z_follower_i,verify_i,fake_i}(1≤i≤M)

where op_doc_i is the optimized text content, op_word_i is the optimized text content words, op_n_i is the number of optimized text content words, z_like_i is the z-score standardized like count, z_repost_i is the z-score standardized repost count, z_comment_i is the z-score standardized comment count, z_num_i is the z-score standardized post count, z_following_i is the z-score standardized following count, and z_follower_i is the z-score standardized follower count;

The user reliability feature of step 2 is:

The microblog influence feature of step 2 is:

$$\mathrm{Influence}_i = \log\left(e^{\mathrm{z\_follower}_i} + e^{\mathrm{z\_repost}_i} + e^{\mathrm{z\_comment}_i}\right) + \mathrm{z\_like}_i$$

The perplexity of step 3 is:

$$\mathrm{perplexity}(D) = \exp\left\{-\frac{\sum_{i=1}^{M} \log p(\mathrm{op\_word}_i)}{\sum_{i=1}^{M} \mathrm{op\_n}_i}\right\}$$

D={ op_word₁,...,op_word_M}

$$p(\mathrm{op\_word}_i) = \sum_{j=1}^{K} p(z_j \mid \mathrm{op\_doc}_i)\, p(\mathrm{op\_word}_i \mid z_j)$$

pweibo_i=(p_i,1,...,p_i,K)(1≤i≤M)

where M is the number of microblog data items of step 1, op_n_i is the number of optimized text content words of step 2, op_word_i is the optimized text content words of step 2, p(op_word_i) is the probability of the optimized text content words in the optimized text content, D is the set of all optimized text content words, p(z_j|op_doc_i) is the probability that the j-th topic occurs in the optimized text content of the i-th z-score standardized microblog data item of step 2, p(op_word_i|z_j) is the probability that the optimized text content words of the i-th z-score standardized microblog data item of step 2 occur in the j-th topic, K is the number of topics at which the perplexity is minimal, pweibo_i is the LDA topic distribution probability of the i-th z-score standardized microblog data item of step 2, and p_i,1~p_i,K are the probabilities of topics z₁~z_K respectively;

The microblog feature vector of step 4 is:

cweibo_i=(p_i,1,...,p_i,K,Reliability_i,Influence_i)(1≤i≤M)

where M is the number of microblog data items of step 1, p_i,1~p_i,K are the probabilities of topics z₁~z_K of step 3, Reliability_i is the user reliability feature of step 2, and Influence_i is the microblog influence feature of step 2.