CN108090046B

CN108090046B - Microblog rumor identification method based on LDA and random forest

Info

Publication number: CN108090046B
Application number: CN201711483228.0A
Authority: CN
Inventors: 曾子明; 王婧
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2021-05-04
Anticipated expiration: 2037-12-29
Also published as: CN108090046A

Abstract

The invention discloses a microblog rumor identification method based on LDA and random forests. Microblog data are collected from the official microblog platform with a crawler and labeled manually. The text content is preprocessed and the numeric fields are z-score standardized, and the user credibility feature and the microblog influence feature are calculated from the standardized microblog data. An LDA topic model yields the document-topic distribution probability of the optimized text content and the topic-word distribution probability of the optimized text content words, from which the perplexity is calculated. A microblog feature vector is then constructed, and a microblog rumor classifier is built by taking the user credibility feature, the microblog influence feature and the LDA document-topic distribution probability as input features of a random forest model. The method mines the semantic information of microblog text in depth and achieves high rumor classification precision.

Description

Microblog rumor identification method based on LDA and random forest

Technical Field

The invention relates to the fields of social networks, text analysis and the like, in particular to a social network rumor identification method based on LDA and random forests.

Background

With the rapid development of the internet and mobile communication devices, online social platforms have become an important channel for people to publish and obtain information and to develop and maintain social relationships. The microblog attracts a large number of users by virtue of its convenient interaction mode, friendly interaction experience and the influence of resident celebrities. According to the UnionWare index for August 2017, the microblog had 330 million monthly active users. As one of the active social platforms in China, the microblog gathers a large amount of fragmented user-generated information. Because information on social platforms is severely disordered, the uncertainty of individual cognition increases and network rumors breed. Research has found that the rumors with the greatest social impact mostly originate from the microblog platform. When official channels are silent, rumors can relieve people's cognitive anxiety to a certain extent. However, the unchecked spread of rumors often triggers storms of negative online public opinion, which pose potential threats to social stability and citizen safety, so identifying network rumors is particularly critical.

Current research on rumor identification centers mainly on three areas: the text features of rumors, the features of the users who publish rumors, and the features of the propagation network, the last of which analyzes how network rumors are generated and spread.

In the above methods, the deep semantic features of rumor content, the credibility of the users who spread it, and their behavior features have not been well utilized.

Disclosure of Invention

In order to overcome the defects in the prior art, the technical scheme of the invention is a microblog rumor identification method based on LDA and random forests. The method comprises the following steps:

step 1, collecting microblog data from the official microblog platform with a crawler, wherein the microblog data comprise the text content, number of likes, number of reposts, number of comments, number of microblogs posted, number of accounts followed, number of followers, verification status and rumor status, and manually labeling the microblog data according to rumor information issued by the official microblog platform and national departments;

step 2, performing irrelevant-character filtering, word segmentation, stop-word removal and data conversion on the text content of step 1 to obtain the optimized text content and the optimized text content words; counting the number of optimized text content words; applying z-score standardization to the numbers of likes, reposts, comments, microblogs posted, accounts followed and followers of step 1, which together with the optimized text content, the optimized text content words and their number form the z-score standardized microblog data; and calculating the user credibility feature and the microblog influence feature from the z-score standardized microblog data;

step 3, modeling the optimized text content and the optimized text content words of step 2 with an LDA topic model to obtain the LDA topic distribution probability, the LDA document-topic distribution probability of the optimized text content, and the LDA topic-word distribution probability of the optimized text content words; using the LDA document-topic distribution probability as the deep semantic text feature for rumor identification; and calculating the perplexity from the LDA document-topic distribution probability and the LDA topic-word distribution probability;

step 4, constructing a microblog feature vector from the user credibility feature of step 2, the microblog influence feature of step 2 and the LDA topic distribution probability of step 3;

step 5, taking the user credibility feature of step 2, the microblog influence feature of step 2 and the LDA document-topic distribution probability of step 3 as input features of a random forest model; calculating the optimal parameters of the random forest model, which is based on CART decision trees, with a grid search algorithm using 10-fold cross-validation; designing a microblog rumor classifier by combining the optimal parameters with the microblog feature vectors of step 4; training it on the manually labeled microblog data of step 1 to obtain the final microblog rumor classifier; and applying the final microblog rumor classifier to rumor screening.
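The parameter search and training in step 5 can be sketched with scikit-learn, whose `RandomForestClassifier` builds CART-style trees and whose `GridSearchCV` performs a cross-validated grid search. This is only an illustrative sketch: the synthetic feature matrix, the labels, the parameter grid and the random seeds are assumptions, since the method does not list the values that were searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Hypothetical stand-in for the feature vectors cweibo_i: K topic
# probabilities followed by Reliability_i and Influence_i.
M, K = 200, 7
topics = rng.dirichlet(np.ones(K), size=M)
reliability = rng.normal(size=(M, 1))
influence = rng.normal(size=(M, 1))
X = np.hstack([topics, reliability, influence])

# Hypothetical labels fake_i (1 = rumor), loosely tied to two features so
# that the classifier has something to learn.
y = (topics[:, 0] + 0.3 * reliability[:, 0] > 0.2).astype(int)

# Illustrative parameter grid (an assumption, not taken from the method).
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,               # 10-fold cross-validation, as in step 5
    scoring="accuracy",
)
search.fit(X, y)

# With the default refit=True, the best parameter combination is refit
# on all labeled data and exposed as best_estimator_.
classifier = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

The refit estimator plays the role of the final microblog rumor classifier; new microblogs would be screened with `classifier.predict` on their feature vectors.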

Preferably, the microblog data in the step 1 are:

weibo_i＝{doc_i,like_i,repost_i,comment_i,num_i,following_i,follower_i,verify_i,fake_i} (1≤i≤M)

wherein M is the number of microblog data records, i is the index of a microblog data record, doc_i is the text content, like_i is the number of likes, repost_i is the number of reposts, comment_i is the number of comments, num_i is the number of microblogs posted, following_i is the number of accounts followed, follower_i is the number of followers, verify_i is the verification status, and fake_i is the rumor status;

in step 1, the manual labeling is:

the user verification status is labeled through the official microblog platform: verify_i indicates whether the user who published weibo_i has passed the personal verification of Sina Weibo; if so, verify_i is 1, otherwise verify_i is 0; the microblog data are labeled as rumors according to the rumor information issued by national departments: if the microblog weibo_i is a rumor microblog, fake_i is 1, otherwise fake_i is 0;

preferably, the z-score normalized microblog data in the step 2 are as follows:

z_weibo_i＝{op_doc_i,op_word_i,op_n_i,z_like_i,z_repost_i,z_comment_i,z_num_i,z_following_i,z_follower_i,verify_i,fake_i} (1≤i≤M)

wherein op_doc_i is the optimized text content, op_word_i is the optimized text content words, op_n_i is the number of optimized text content words, z_like_i is the z-score standardized number of likes, z_repost_i is the z-score standardized number of reposts, z_comment_i is the z-score standardized number of comments, z_num_i is the z-score standardized number of microblogs posted, z_following_i is the z-score standardized number of accounts followed, and z_follower_i is the z-score standardized number of followers;
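The z-score standardization of a count field subtracts the column mean and divides by the column standard deviation. A minimal sketch follows; the sample values are hypothetical, and the use of the population standard deviation is an assumption because the text does not say which variant is used.

```python
from statistics import mean, pstdev


def z_score(values):
    """Return (x - mean) / std for each x; all zeros if the column is constant."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical like counts like_i for five microblogs.
likes = [3, 10, 0, 250, 12]
z_like = z_score(likes)
```

After this transformation each field has zero mean and unit variance, so fields with very different scales, such as follower counts and comment counts, become comparable.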

the user credibility characteristics in the step 2 are as follows:

the microblog influence characteristic in the step 2 is as follows:

preferably, the perplexity in step 3 is:

D＝{op_word₁,...,op_word_M}

pweibo_i＝(p_i,1,...,p_i,K)(1≤i≤M)

wherein M is the number of microblog data records in step 1, op_n_i is the number of optimized text content words in step 2, op_word_i is the optimized text content words in step 2, p(op_word_i) is the probability of the optimized text content words occurring in the optimized text content, D is the set of all optimized text content words, p(z_j|op_doc_i) is the probability of the j-th topic occurring in the optimized text content of the i-th z-score standardized microblog data record in step 2, p(op_word_i|z_j) is the probability of the optimized text content words of the i-th z-score standardized microblog data record in step 2 occurring in the j-th topic, K is the number of topics that minimizes the perplexity, pweibo_i is the LDA topic distribution probability of the i-th z-score standardized microblog data record in step 2, and p_i,1 to p_i,K are the probabilities of topics z₁ to z_K respectively;
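For orientation, the standard LDA perplexity can be written with the symbols defined above as follows. This is the textbook form, and it is an assumption that the expressions of the method match it term for term; the index w, ranging over the individual words of op_word_i, is introduced here only for illustration.

```latex
\mathrm{perplexity}(D)
  = \exp\!\left( - \frac{\sum_{i=1}^{M} \log p(\mathrm{op\_word}_i)}
                        {\sum_{i=1}^{M} \mathrm{op\_n}_i} \right),
\qquad
p(\mathrm{op\_word}_i)
  = \prod_{w \in \mathrm{op\_word}_i} \sum_{j=1}^{K}
    p(z_j \mid \mathrm{op\_doc}_i)\, p(w \mid z_j)
```

A lower perplexity means the topic model predicts the observed words better, which is why the topic number K is chosen where the perplexity is smallest.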

preferably, the microblog feature vector in step 4 is:

cweibo_i＝(p_i,1,...,p_i,K,Reliability_i,Influence_i)(1≤i≤M)

wherein M is the number of microblog data records in step 1, p_i,1 to p_i,K are the probabilities of topics z₁ to z_K in step 3 respectively, Reliability_i is the user credibility feature in step 2, and Influence_i is the microblog influence feature in step 2.
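Assembling cweibo_i amounts to concatenating the K topic probabilities with the two scalar features. A minimal sketch, in which the helper name `build_feature_vector` and the sample values are hypothetical:

```python
def build_feature_vector(topic_probs, reliability, influence):
    """Concatenate (p_i,1 ... p_i,K, Reliability_i, Influence_i)."""
    if abs(sum(topic_probs) - 1.0) > 1e-6:
        raise ValueError("topic probabilities should sum to 1")
    return [*topic_probs, reliability, influence]


# Hypothetical values for one microblog with K = 3 topics.
pweibo = [0.7, 0.2, 0.1]
cweibo = build_feature_vector(pweibo, reliability=0.35, influence=-1.2)
```

Each vector therefore has K + 2 components, and the M vectors stacked row by row form the input matrix of the classifier.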

Compared with the prior art, the invention mines the semantic information of microblog text in depth with the LDA topic model to obtain the LDA document-topic distribution probability of the optimized text content, and uses it together with the user credibility feature and the microblog influence feature as input variables of the random forest for classification training.

Drawings

FIG. 1 is a flowchart of the method according to an embodiment of the invention.

Detailed Description

To help those of ordinary skill in the art understand and implement the invention, the invention is described in further detail below with reference to the accompanying drawing and an embodiment. It is to be understood that the embodiment described herein merely illustrates and explains the invention and does not limit it.

Referring to FIG. 1, the flowchart of an embodiment of the invention, a microblog rumor identification method based on LDA and random forest is provided, comprising the following steps:

step 1, collecting 2016 microblog data from the Sina Weibo platform with a crawler, wherein the microblog data comprise the text content, number of likes, number of reposts, number of comments, number of microblogs posted, number of accounts followed, number of followers, verification status and rumor status, and manually labeling the microblog data, taking as the rumor evaluation benchmark the 2016 haze rumors jointly exposed on December 30 by the official microblog rumor account of Sina Weibo, the Publicity and Education Center of the national Ministry of Environmental Protection, and the Beijing Environmental Protection Publicity Center;

preferably, the microblog data in the step 1 are:

weibo_i＝{doc_i,like_i,repost_i,comment_i,num_i,following_i,follower_i,verify_i,fake_i} (1≤i≤M) (1)

wherein M = 872 is the number of microblog data records, i is the index of a microblog data record, doc_i is the text content, like_i is the number of likes, repost_i is the number of reposts, comment_i is the number of comments, num_i is the number of microblogs posted, following_i is the number of accounts followed, follower_i is the number of followers, verify_i is the verification status, and fake_i is the rumor status;

in step 1, the manual labeling is:

the user verification status is labeled through the official microblog platform: verify_i indicates whether the user who published weibo_i has passed the personal verification of Sina Weibo; if so, verify_i is 1, otherwise verify_i is 0; the microblog data are labeled as rumors according to the rumor information issued by national departments: if the microblog weibo_i is a rumor microblog, fake_i is 1, otherwise fake_i is 0;

preferably, the z-score normalized microblog data in the step 2 are as follows:

z_weibo_i＝{op_doc_i,op_word_i,op_n_i,z_like_i,z_repost_i,z_comment_i,z_num_i,z_following_i,z_follower_i,verify_i,fake_i} (1≤i≤M) (2)
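The fields op_doc_i, op_word_i and op_n_i in the record above come from cleaning the raw text. The sketch below shows one possible pipeline under stated assumptions: the regular expressions, the tiny stop-word list and the default whitespace tokenizer are all hypothetical. For Chinese microblog text, a segmenter such as `jieba.lcut` could be passed as the `segment` argument instead.

```python
import re

# Hypothetical stop-word list; a real run would load a full list.
STOP_WORDS = {"the", "a", "is", "of"}


def optimize_text(doc, segment=str.split, stop_words=STOP_WORDS):
    """Return (op_doc, op_word, op_n) for one microblog text."""
    text = re.sub(r"https?://\S+", " ", doc)    # drop URLs
    text = re.sub(r"@\S+", " ", text)           # drop @mentions
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation and symbols
    words = [w.lower() for w in segment(text)]  # word segmentation
    op_word = [w for w in words if w not in stop_words]
    op_doc = " ".join(op_word)
    return op_doc, op_word, len(op_word)


op_doc, op_word, op_n = optimize_text(
    "The haze is toxic!! @user http://t.cn/abc #haze#"
)
```

Keeping the tokenizer pluggable separates the language-specific segmentation from the filtering logic, which stays the same for any language.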

the user credibility characteristics in the step 2 are as follows:

the microblog influence characteristic in the step 2 is as follows:

step 3, modeling the optimized text content and the optimized text content words of step 2 with an LDA topic model to obtain the LDA topic distribution probability, the LDA document-topic distribution probability of the optimized text content, and the LDA topic-word distribution probability of the optimized text content words; using the LDA document-topic distribution probability as the deep semantic text feature for rumor identification; and calculating the perplexity from the LDA document-topic distribution probability and the LDA topic-word distribution probability;

preferably, the perplexity in step 3 is:

D＝{op_word₁,...,op_word_M} (6)

pweibo_i＝(p_i,1,...,p_i,K)(1≤i≤M) (8)

wherein M = 872 is the number of microblog data records in step 1, op_n_i is the number of optimized text content words in step 2, op_word_i is the optimized text content words in step 2, p(op_word_i) is the probability of the optimized text content words occurring in the optimized text content, D is the set of optimized text content words, p(z_j|op_doc_i) is the probability of the j-th topic occurring in the optimized text content of the i-th z-score standardized microblog data record in step 2, p(op_word_i|z_j) is the probability of the optimized text content words of the i-th z-score standardized microblog data record in step 2 occurring in the j-th topic, K = 7 is the number of topics that minimizes the perplexity, pweibo_i is the LDA topic distribution probability of the i-th z-score standardized microblog data record in step 2, and p_i,1 to p_i,K are the probabilities of topics z₁ to z_K respectively;
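The perplexity computation can be illustrated with a small function that takes the two LDA outputs described above. This is a sketch of the standard LDA perplexity, assumed here to correspond to the quantity used in step 3; the function name, the word-id encoding and the toy probabilities are hypothetical.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Standard LDA perplexity.

    docs[i]          : list of word ids of document i (op_word_i)
    doc_topic[i][j]  : p(z_j | op_doc_i)
    topic_word[j][w] : p(w | z_j)
    """
    log_likelihood = 0.0
    n_words = 0
    for i, words in enumerate(docs):
        for w in words:
            # Probability of word w in document i, summed over all topics.
            p_w = sum(doc_topic[i][j] * topic_word[j][w]
                      for j in range(len(topic_word)))
            log_likelihood += math.log(p_w)
        n_words += len(words)  # op_n_i
    return math.exp(-log_likelihood / n_words)


# Toy example: 2 documents, 2 topics, vocabulary of 2 words.
docs = [[0, 0, 1], [1, 1]]
doc_topic = [[0.9, 0.1], [0.2, 0.8]]
topic_word = [[0.8, 0.2], [0.1, 0.9]]
value = perplexity(docs, doc_topic, topic_word)
```

To choose the topic number, one would fit an LDA model for each candidate K, evaluate this function on its outputs, and keep the K with the smallest value.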

preferably, the microblog feature vector in step 4 is:

cweibo_i＝(p_i,1,...,p_i,K,Reliability_i,Influence_i)(1≤i≤M) (9)

wherein M = 872 is the number of microblog data records in step 1, p_i,1 to p_i,K are the probabilities of topics z₁ to z_K in step 3 respectively, Reliability_i is the user credibility feature in step 2, and Influence_i is the microblog influence feature in step 2.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A microblog rumor identification method based on LDA and random forests, characterized by comprising the following steps:

step 2, performing irrelevant-character filtering, word segmentation, stop-word removal and data conversion on the text content of step 1 to obtain the optimized text content and the optimized text content words; counting the number of optimized text content words; applying z-score standardization to the numbers of likes, reposts, comments, microblogs posted, accounts followed and followers of step 1, which together with the optimized text content, the optimized text content words and their number form the z-score standardized microblog data; and calculating the user credibility feature and the microblog influence feature from the z-score standardized microblog data;

step 3, modeling the optimized text content and the optimized text content words of step 2 with an LDA topic model to obtain the LDA topic distribution probability, the LDA document-topic distribution probability of the optimized text content, and the LDA topic-word distribution probability of the optimized text content words; using the LDA document-topic distribution probability as the deep semantic text feature for rumor identification; and calculating the perplexity from the LDA document-topic distribution probability and the LDA topic-word distribution probability;

step 5, taking the user credibility feature of step 2, the microblog influence feature of step 2 and the LDA document-topic distribution probability of step 3 as input features of a random forest model; calculating the optimal parameters of the random forest model, which is based on CART decision trees, with a grid search algorithm using 10-fold cross-validation; designing a microblog rumor classifier by combining the optimal parameters with the microblog feature vectors of step 4; training it on the manually labeled microblog data of step 1 to obtain the final microblog rumor classifier; and applying the final microblog rumor classifier to rumor screening;

in the step 1, the microblog data are as follows:

weibo_i＝{doc_i,like_i,repost_i,comment_i,num_i,following_i,follower_i,verify_i,fake_i}，1≤i≤M；

in step 1, the manual labeling is:

the z-score standardized microblog data in the step 2 are as follows:

z_weibo_i＝{op_doc_i,op_word_i,op_n_i,z_like_i,z_repost_i,z_comment_i,z_num_i,z_following_i,z_follower_i,verify_i,fake_i}，1≤i≤M；

wherein op_doc_i is the optimized text content, op_word_i is the optimized text content words, op_n_i is the number of optimized text content words, z_like_i is the z-score standardized number of likes, z_repost_i is the z-score standardized number of reposts, z_comment_i is the z-score standardized number of comments, z_num_i is the z-score standardized number of microblogs posted, z_following_i is the z-score standardized number of accounts followed, and z_follower_i is the z-score standardized number of followers;

the user credibility characteristics in the step 2 are as follows:

the microblog influence characteristic in the step 2 is as follows:

the perplexity in step 3 is as follows:

D＝{op_word₁,...,op_word_M}

pweibo_i＝(p_i,1,...,p_i,K)，1≤i≤M；

wherein M is the number of microblog data records in step 1, op_n_i is the number of optimized text content words in step 2, op_word_i is the optimized text content words in step 2, p(op_word_i) is the probability of the optimized text content words occurring in the optimized text content, D is the set of all optimized text content words, p(z_j|op_doc_i) is the probability of the j-th topic occurring in the optimized text content of the i-th z-score standardized microblog data record in step 2, p(op_word_i|z_j) is the probability of the optimized text content words of the i-th z-score standardized microblog data record in step 2 occurring in the j-th topic, K is the number of topics that minimizes the perplexity, pweibo_i is the LDA topic distribution probability of the i-th z-score standardized microblog data record in step 2, and p_i,1 to p_i,K are the probabilities of topics z₁ to z_K respectively;

in step 4, the microblog feature vector is as follows:

cweibo_i＝(p_i,1,...,p_i,K,Reliability_i,Influence_i)，1≤i≤M；

wherein M is the number of microblog data records in step 1, Reliability_i is the user credibility feature in step 2, and Influence_i is the microblog influence feature in step 2.