CN105045857A

CN105045857A - Social network rumor recognition method and system

Info

Publication number: CN105045857A
Application number: CN201510401458.2A
Authority: CN
Inventors: 熊锦华; 张巧; 程学旗; 张水源; 许洪波; 余智华
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2015-07-09
Filing date: 2015-07-09
Publication date: 2015-11-11

Abstract

The invention discloses a social network rumor recognition method and system. The method comprises the steps of: obtaining a microblog information case, obtaining microblog information and user information of the microblog information case, and according to the microblog information and the user information, extracting microblog content features of the microblog information case, wherein the microblog content features include a shallow text feature and a deeply implied microblog feature; extracting basic attribute features of a user and deeply implied features of the user according to the user information, and extracting microblog popularity features according to the microblog information, wherein the microblog popularity features include a volatility feature based on popularity and popularity trend, a difference feature and a forwarding feature; and establishing a feature vector and a training classifier according to the shallow text feature, the deeply implied microblog feature, the basic attribute features, the deeply implied features of the user and the microblog popularity features, inputting the feature vector into the classifier, and outputting a result.

Description

Social network rumor recognition method and system

Technical field

The present invention relates to the field of social network analysis, and in particular to a social network rumor recognition method and system.

Background art

With the popularity and spread of social networks, the volume of information on them has grown explosively, but the quality of that information has not improved accordingly. Junk information of all kinds, and especially false information such as rumors, floods social networks, and the propagation and diffusion of rumors do great harm to people's lives and to the development of society.

Identifying rumor messages in social networks promptly and accurately not only helps build a healthy Internet environment, helps people better judge whether information is true, and stops the serious harm caused by malicious rumors in time, but can also play a positive role in areas such as public-opinion monitoring and information guidance.

Existing rumor recognition methods fall into two main classes. The first class is manual: messages are judged mainly through user reports and official announcements. Such methods cannot contain the propagation and diffusion of a rumor at the initial stage of its emergence, so their timeliness is poor, and they require a great deal of labor and money, so their cost is high. The second class is based on machine learning: deciding whether a microblog post is a rumor is treated as a classification problem, and a classification learning algorithm is applied to the various features of the post to identify rumors.

Classification features are currently drawn from three sources: the content of the post, its publisher, and its propagation. For content, existing work mainly uses shallow text features (for example, whether the content contains a link or a picture, or mentions other users) and does not analyze the text more deeply to mine implicit features such as semantics, topic and sentiment. For the publisher, existing work mainly selects static features, including basic attributes such as the publisher's follower count and friend count, and does not take the publisher's credibility, influence and the like into consideration. For propagation, related work concentrates on propagation models of microblog rumors, building a retweet graph rooted at the rumor to simulate its spread, or is confined to a few simple retweet attributes, without further analyzing other features of the rumor during propagation.

The features selected in this prior work discriminate poorly and have limitations, so the final rumor recognition performance is unsatisfactory. In summary, existing methods lack an automatic method that can accurately identify microblog rumors.

Summary of the invention

In view of the defects of the prior art and the distinctive characteristics of microblog rumors, the object of the present invention is to use features from three aspects, namely the content of a microblog post, the publishing user, and the popularity of the post, together with classification methods from machine learning, to identify microblog rumors automatically and to effectively improve the accuracy and recall of that identification. To this end, the present invention proposes a social network rumor recognition method and system.

The invention provides a social network rumor recognition method, comprising:

Step 1: obtaining a microblog information instance, obtaining the microblog information and user information of the instance, and extracting, according to the microblog information and the user information, the microblog content features of the instance, the microblog content features comprising shallow text features and deep implicit microblog features;

Step 2: extracting, according to the user information, the basic attribute features and deep implicit user features of the user, and extracting, according to the microblog information, the microblog popularity features of the post, the microblog popularity features comprising a volatility feature and a difference feature based on popularity and popularity trend, and a retweet feature;

Step 3: constructing a feature vector from the shallow text features, the deep implicit microblog features, the basic attribute features, the deep implicit user features and the microblog popularity features, training a classifier, inputting the feature vector into the classifier and outputting a result, thereby completing the identification of social network rumors.

In the social network rumor recognition method, the deep implicit microblog features comprise a hot-topic tendency feature, an internal-external consistency feature, a sentiment polarity feature, and an opinion tendency feature of the comments.

In the social network rumor recognition method, the deep implicit user features comprise a social feature, an opinion retweet feature, and a microblog matching degree feature.

In the social network rumor recognition method, extracting the hot-topic tendency feature and the internal-external consistency feature comprises performing word segmentation and part-of-speech tagging on the microblog text, extracting the nouns and verbs that carry meaning as keywords, using TF-IDF from text feature extraction as the weight for ranking the keywords, and taking the K highest-weighted words as the keywords of the microblog text.

In the social network rumor recognition method, extracting the sentiment polarity feature comprises performing word segmentation and part-of-speech tagging on the microblog text and extracting keywords by means of a sentiment dictionary, an emoticon dictionary, a punctuation dictionary and a sensitive-word dictionary, the extracted words being the content words in the post and the words that match a dictionary entry.

The present invention also proposes a social network rumor recognition system, comprising:

a microblog content feature extraction module, for obtaining a microblog information instance, obtaining the microblog information and user information of the instance, and extracting, according to the microblog information and the user information, the microblog content features of the instance, the microblog content features comprising shallow text features and deep implicit microblog features;

a microblog popularity feature extraction module, for extracting, according to the user information, the basic attribute features and deep implicit user features of the user, and extracting, according to the microblog information, the microblog popularity features of the post, the microblog popularity features comprising a volatility feature and a difference feature based on popularity and popularity trend, and a retweet feature;

a rumor identification module, for constructing a feature vector from the shallow text features, the deep implicit microblog features, the basic attribute features, the deep implicit user features and the microblog popularity features, training a classifier, inputting the feature vector into the classifier and outputting a result, thereby completing the identification of social network rumors.

In the social network rumor recognition system, the deep implicit microblog features comprise a hot-topic tendency feature, an internal-external consistency feature, a sentiment polarity feature, and an opinion tendency feature of the comments.

In the social network rumor recognition system, the deep implicit user features comprise a social feature, an opinion retweet feature, and a microblog matching degree feature.

In the social network rumor recognition system, extracting the hot-topic tendency feature and the internal-external consistency feature comprises performing word segmentation and part-of-speech tagging on the microblog text, extracting the nouns and verbs that carry meaning as keywords, using TF-IDF from text feature extraction as the weight for ranking the keywords, and taking the K highest-weighted words as the keywords of the microblog text.

In the social network rumor recognition system, extracting the sentiment polarity feature comprises performing word segmentation and part-of-speech tagging on the microblog text and extracting keywords by means of a sentiment dictionary, an emoticon dictionary, a punctuation dictionary and a sensitive-word dictionary, the extracted words being the content words in the post and the words that match a dictionary entry.

As the above scheme shows, the advantages of the invention are as follows:

Targeting the distinctive characteristics of microblog rumors, the invention introduces deep implicit features of the microblog content and of the publishing user, which effectively distinguish rumor posts from ordinary posts during identification. It also incorporates the popularity and popularity-trend features that change as a post propagates, which significantly improves the accuracy and recall of rumor identification during classification.

Brief description of the drawings

Fig. 1 is the overall flowchart of an embodiment of the invention;

Fig. 2 is a flowchart of extracting the hot-topic tendency feature and the internal-external consistency feature of the content in an embodiment of the invention;

Fig. 3 is a flowchart of extracting the sentiment polarity feature of the content in an embodiment of the invention;

Fig. 4 is a flowchart of extracting the opinion tendency feature of the comments in an embodiment of the invention;

Fig. 5 is a flowchart of extracting the user's social feature, opinion retweet feature and historical microblog matching degree feature in an embodiment of the invention;

Fig. 6 is a flowchart of extracting the volatility and difference features of popularity in an embodiment of the invention.

Detailed description

The present invention is described below with reference to the drawings and specific embodiments, which are not to be taken as limiting the invention.

The present invention proposes a social network rumor recognition method, whose overall steps are as follows:

A microblog information instance is obtained, together with the microblog information and user information of the instance. According to the microblog information and the user information, the microblog content features of the instance are extracted; the microblog content features comprise shallow text features and deep implicit microblog features.

According to the user information, the basic attribute features and deep implicit user features of the user are extracted. According to the microblog information, the microblog popularity features of the post are extracted; the microblog popularity features comprise a volatility feature and a difference feature based on popularity and popularity trend, and a retweet feature.

A feature vector is constructed from the shallow text features, the deep implicit microblog features, the basic attribute features, the deep implicit user features and the microblog popularity features, and a classifier is trained. The feature vector is input into the classifier and a result is output, completing the identification of social network rumors.

The deep implicit microblog features comprise a hot-topic tendency feature, an internal-external consistency feature, a sentiment polarity feature, and an opinion tendency feature of the comments.

The deep implicit user features comprise a social feature, an opinion retweet feature, and a microblog matching degree feature.

Extracting the hot-topic tendency feature and the internal-external consistency feature comprises performing word segmentation and part-of-speech tagging on the microblog text, extracting the nouns and verbs that carry meaning as keywords, using TF-IDF from text feature extraction as the weight for ranking the keywords, and taking the K highest-weighted words as the keywords of the microblog text.

Extracting the sentiment polarity feature comprises performing word segmentation and part-of-speech tagging on the microblog text and extracting keywords by means of a sentiment dictionary, an emoticon dictionary, a punctuation dictionary and a sensitive-word dictionary, the extracted words being the content words in the post and the words that match a dictionary entry.

As shown in Fig. 1, a specific embodiment of the social network rumor recognition method comprises the following steps:

(1) Acquisition of a microblog information instance. According to the input unique identifier of the microblog post, the microblog information and the corresponding user information are obtained. The microblog information comprises the text of the post, the historical retweet-count vector of the post, and all comment texts of the post. The user information comprises the user's basic attributes (follower count, friend count, mutual-follow count), the user's historical posts within one month, and the corresponding retweet-count vectors. Step (1) corresponds to steps 101 and 102 in Fig. 1.

(2) Extraction of microblog content features, comprising the shallow text features of the content and the deep implicit features (the hot-topic tendency feature, the internal-external consistency feature, the sentiment polarity feature, and the opinion tendency feature of the comments).

Step (2) corresponds to step 103 in Fig. 1.

In step 103, the shallow text features of the microblog content are extracted first, including: the interval between the time the post was published and the time the microblog account was registered; whether the text contains an external link; whether it contains pictures, audio, video and the like; and whether it mentions other users. Then the hot-topic dictionary 104 and the external page 105 pointed to by the link in the text are used to extract the hot-topic tendency feature and the internal-external consistency feature, as detailed in Fig. 2, and various dictionaries 106 (sentiment, symbols and so on) are used to extract the sentiment polarity feature and the opinion tendency feature of the comments, as detailed in Fig. 3 and Fig. 4 respectively.

In extracting the hot-topic tendency feature and the internal-external consistency feature, a tool is first used to perform word segmentation and part-of-speech tagging on the microblog text. Only the nouns and verbs that carry meaning are extracted as keywords, TF-IDF (term frequency-inverse document frequency) from text feature extraction is used as the weight for ranking the keywords, and the K highest-weighted words after ranking are selected as the keywords of the microblog text. This process is shown in steps 201 and 202.
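The keyword selection of steps 201 and 202 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the input is assumed to be already segmented and tagged, the tag names `"n"` and `"v"`, the reference `corpus`, and the smoothed IDF variant are all assumptions introduced here.

```python
import math
from collections import Counter

def top_k_keywords(doc_tokens, corpus, k=5):
    """Rank the nouns and verbs of one post by TF-IDF and return the k best.

    doc_tokens: list of (word, pos) pairs, already segmented and tagged.
    corpus: list of token lists used to estimate document frequency
            (hypothetical; the source does not say which corpus is used).
    """
    # Keep only nouns and verbs; the tags "n" and "v" are an assumed tagset.
    candidates = [word for word, pos in doc_tokens if pos in ("n", "v")]
    tf = Counter(candidates)
    n_docs = len(corpus)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption) so that unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (freq / len(candidates)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]
```

In practice the value of K and the segmentation tool would be chosen for the target platform; the source leaves both unspecified.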

In step 203, the hot-topic dictionary 204 is used to extract the hot-topic tendency feature of the content.

Let $W=\{w_1, w_2, \ldots, w_n\}$ be the keyword set extracted from microblog post $T$, where $w_i$ is one of the keywords, and let $HotTopicWordBase=\{T_1, T_2, \ldots, T_m\}$ be the hot-topic dictionary organized by topic, where $T_i$ is the set of words under one hot topic. The hot-topic tendency of the content is then computed as follows:

$$hot\_feature(W) = \max(simi(W, T_1), simi(W, T_2), \ldots, simi(W, T_m))$$

In the above formula, $simi(W, T_i)$ is the Jaccard similarity between the keyword set $W$ of post $T$ and the word set $T_i$ under a given hot topic.
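As a small illustration of the formula above, the following sketch computes the Jaccard similarity against each topic and takes the maximum. The function names and the example word sets are hypothetical; returning 0 for an empty dictionary is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two word collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hot_feature(keywords, hot_topic_word_base):
    """Maximum Jaccard similarity between the post's keywords and any hot topic.

    hot_topic_word_base: iterable of word sets, one per hot topic.
    Returns 0.0 when the dictionary is empty (an assumption).
    """
    return max((jaccard(keywords, topic) for topic in hot_topic_word_base),
               default=0.0)
```

For example, the keywords `{"earthquake", "rescue", "city"}` share two of five distinct words with the topic `{"earthquake", "rescue", "aid", "tsunami"}`, so the feature is 0.4 if no other topic matches better.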

In step 205, the external page 206 pointed to by the link in the microblog text is used to extract the internal-external consistency feature of the content, where the external page is described by its page title (title), page description (description) and page keywords (keywords). The formula is as follows:

$$con\_feature(T, url\_page) = \begin{cases} 0, & T \text{ does not contain a URL} \\ \max(Rel(T, title), Rel(T, description), Rel(T, keywords)), & T \text{ contains a URL} \end{cases}$$

In the above formula, $Rel(T, title)$, $Rel(T, description)$ and $Rel(T, keywords)$ are the relevance of the microblog text to the title, the description and the keywords of the external page respectively, each measured by Jaccard similarity.
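The piecewise definition above can be sketched as below. The representation of the external page as a dictionary of token lists is an assumption made for illustration; how the page is fetched and tokenized is outside this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def con_feature(post_keywords, page):
    """Internal-external consistency of a post with its linked page.

    page: None when the post contains no URL; otherwise a dict holding
          token lists under "title", "description" and "keywords"
          (a hypothetical layout, not specified by the source).
    """
    if page is None:
        return 0.0
    return max(jaccard(post_keywords, page.get(field, []))
               for field in ("title", "description", "keywords"))
```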

In extracting the sentiment polarity feature, a tool is first used to perform word segmentation and part-of-speech tagging 301 on the microblog text, and keywords are extracted by means of a sentiment dictionary, an emoticon dictionary, a punctuation dictionary and a sensitive-word dictionary: the content words (verbs, nouns, adjectives) in the post and the words that match a dictionary entry are extracted. This process is shown in step 302.

An improved TF-IDF is used to compute the term weights 304 of the extracted words, as follows:

$$weight_k = \frac{(\log(f_k) + 1.0) \times \log_2(level_k + 1)}{\sqrt{\sum_{j=1}^{l} \left[(\log(f_j) + 1.0) \times \log_2(level_j + 1)\right]^2}}$$

In the above formula, $weight_k$ is the weight of the $k$-th term in the post, $f_k$ is the frequency of that term in the post, $l$ is the number of terms in the post, and $level_k$ is the level of that term. Term levels are set as shown in the following table:

In step 305, the term weights obtained in the above step are used to build the feature vector of the microblog text, which is input to a sentiment polarity classifier to obtain the sentiment polarity feature of the post. The present invention uses a support vector machine (SVM) and divides sentiment polarity into three types: positive, negative and neutral.
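The term-weight computation of step 304 can be sketched as follows. The source does not state the base of $\log(f_k)$, so the natural logarithm is assumed here, and the level values in the usage example are hypothetical because the level table is not reproduced in this text.

```python
import math

def term_weights(freqs, levels):
    """L2-normalized weights (log(f) + 1) * log2(level + 1) for each term.

    freqs: term frequencies in the post (each >= 1).
    levels: the level assigned to each term (values are hypothetical here).
    """
    # Natural log is an assumption; the source writes only "log".
    raw = [(math.log(f) + 1.0) * math.log2(level + 1)
           for f, level in zip(freqs, levels)]
    norm = math.sqrt(sum(r * r for r in raw))
    # Guard against an all-zero vector (for instance, every level equal to 0).
    return [r / norm if norm else 0.0 for r in raw]
```

With two terms that each occur once and have levels 1 and 3, the raw scores are 1 and 2, so the normalized weights are 1/√5 and 2/√5.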

In step 401, a tool is first used to perform word segmentation and part-of-speech tagging on the microblog text.

In step 402, in addition to the dictionaries used in sentiment polarity feature extraction, an opinion polarity dictionary is used to extract the opinion tendency feature of the comments. First, based on the segmentation and tagging results and on the sentiment dictionary, emoticon dictionary, punctuation dictionary, sensitive-word dictionary and opinion polarity dictionary, the content words (verbs, nouns, adjectives) in the text and the words that match a dictionary entry are extracted, and the keywords are supplemented with the 2-grams, 3-grams and 4-grams of the microblog text.
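The n-gram supplement in step 402 can be illustrated with a short sketch. The source does not say whether the n-grams are taken over characters or over words; character-level n-grams are assumed here, which is a common choice for unsegmented Chinese text.

```python
def char_ngrams(text, sizes=(2, 3, 4)):
    """Return all character n-grams of the given sizes, in order of size.

    Character-level n-grams are an assumption; the source says only
    "2-gram, 3-gram and 4-gram" of the text.
    """
    grams = []
    for n in sizes:
        # range() is empty when the text is shorter than n.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams
```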

In step 405, the improved TF-IDF is used to compute the term weights of the extracted words, in the same way as in step 304.

The term weights obtained in the above step are used to build a feature vector for each comment on the post, which is input to an opinion polarity classifier to obtain the opinion polarity of each comment, as shown in step 406. In the present invention, the opinion polarity of a comment is divided into three classes: support, oppose and other.

In step 407, the numbers of comments on the post whose opinion polarity is support and oppose are counted, and the opinion tendency feature of the comments is computed as follows:

$$view\_senti\_feature = \log\left(\frac{N_{pos}}{N_{neg}}\right)$$

where $N_{pos}$ is the number of comments on the post whose opinion polarity is support, and $N_{neg}$ is the number of comments whose opinion polarity is oppose.
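A minimal sketch of step 407 follows. The label strings and the optional `smoothing` parameter are assumptions: the formula as stated is undefined when either count is zero, and the source does not say how that case is handled. Natural log is also assumed.

```python
import math

def view_senti_feature(polarities, smoothing=0.0):
    """log(N_pos / N_neg) over the predicted polarity of every comment.

    polarities: list of labels, each "support", "oppose" or "other"
                (label strings are hypothetical).
    smoothing: added to both counts; 0.0 reproduces the source formula,
               which raises an error if either count is zero.
    """
    n_pos = sum(1 for p in polarities if p == "support")
    n_neg = sum(1 for p in polarities if p == "oppose")
    return math.log((n_pos + smoothing) / (n_neg + smoothing))
```

A positive value means supporting comments outnumber opposing ones; a negative value means the reverse.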

(3) Extraction of the features of the publishing user, comprising the basic attribute features of the publishing user and the deep implicit features (the social feature, the opinion retweet feature and the microblog matching degree feature).

Step (3) corresponds to step 107 in Fig. 1.

In step 107, the basic attribute information of the publishing user is first used to extract the user's basic attribute features, including: the user's verification type, gender, whether personal information is provided, the number of posts published, the follower count, the friend count, and so on. Then the user's deep implicit features (the social feature, the opinion retweet feature and the microblog matching degree feature) are extracted from the basic attribute features together with the user's historical microblog data, as detailed in Fig. 5.

In step 501, the user's information is obtained first, comprising the user's basic attribute information and the historical posts the user published in the most recent month.

In step 502, the social feature of the user is computed from the user's follower count, friend count and mutual-follow count, as follows:

$$social\_inf\_feature = \log\left(\frac{fol\_num - bi\_fol\_num}{fri\_num + 1}\right)$$

where $fol\_num$ is the follower count of the user, $fri\_num$ is the friend count of the user (that is, the number of users this user follows), and $bi\_fol\_num$ is the number of users who follow this user and are followed back.
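The social feature of step 502 can be sketched as below. The explicit error for a non-positive numerator is an assumption added here; the source formula is simply undefined when every follower is a mutual follow. Natural log is assumed.

```python
import math

def social_inf_feature(fol_num, fri_num, bi_fol_num):
    """log((followers - mutual follows) / (friends + 1))."""
    one_way_followers = fol_num - bi_fol_num
    # The source does not cover this case; raising is an assumption.
    if one_way_followers <= 0:
        raise ValueError("formula undefined when all followers are mutual")
    return math.log(one_way_followers / (fri_num + 1))
```

For a user with 1100 followers, 100 mutual follows and 9 friends, the ratio is 1000 / 10 = 100, so the feature is log(100), about 4.605.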

In step 503, the opinion retweet feature of the user is computed from the user's historical posts, as follows:

$$view\_retweet\_feature = \frac{retweets\_num}{statuses\_num}$$

In the above formula, $statuses\_num$ is the number of posts the user has published, and $retweets\_num$ is the total number of times the user's posts have been retweeted.

In step 504, the microblog matching degree feature of the user is computed from the user's historical posts. A topic model is first used to obtain the topic distributions of the user's historical posts and of the current post, denoted $his\_topic=\{his\_p_1, his\_p_2, \ldots, his\_p_k\}$ and $now\_topic=\{now\_p_1, now\_p_2, \ldots, now\_p_k\}$, where $k$ is the specified number of topics. The matching degree feature of the user is the cosine similarity between the topic distribution of the historical posts and that of the current post, as follows:

$$weibo\_match\_feature = cosin\_simi(his\_topic, now\_topic) = \frac{his\_topic \cdot now\_topic}{|his\_topic| \times |now\_topic|}$$
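The cosine similarity of the two topic distributions can be sketched as follows. The topic model that produces the distributions is outside this sketch, and returning 0 for a zero vector is an assumption.

```python
import math

def cosin_simi(his_topic, now_topic):
    """Cosine similarity of two equal-length topic distributions."""
    dot = sum(h * n for h, n in zip(his_topic, now_topic))
    norm = (math.sqrt(sum(h * h for h in his_topic))
            * math.sqrt(sum(n * n for n in now_topic)))
    # A zero vector has no direction; returning 0.0 is an assumption.
    return dot / norm if norm else 0.0
```

A value near 1 means the current post is on the topics the user usually writes about; a value near 0 means it departs from them.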

(4) Extraction of microblog popularity features, comprising the volatility feature and the difference feature based on popularity and popularity trend, and simple retweet features.

Step (4) corresponds to step 108 in Fig. 1, and the detailed process is shown in Fig. 6.

In step 601, the historical retweet-count vector of the current post is first obtained according to the unique identifier of the post.

Step 602 obtains the volatility feature of popularity. A period of time after post $T$ is published is first divided into $n$ equally spaced time intervals, denoted $Interval=\{I_1, I_2, \ldots, I_n\}$, where $I_k$ is the time elapsed between the publication of the post and the $k$-th time interval. The retweet counts of the post at the moments corresponding to the $n$ time intervals can be expressed as a vector:

$$retweets\_vector = \{count_1, count_2, \ldots, count_n\}$$

where $count_i$ is the retweet count of the post at the moment corresponding to the $i$-th time interval. The curve formed by the retweet counts over the time intervals is treated here as an approximate popularity trend curve. The fluctuation rate between two adjacent points on this curve can be computed as the slope of the straight line joining the two points, as follows:

$$wave\_rate_i = \frac{count_{i+1} - count_i}{\overline{retweet\_count} \times (I_{i+1} - I_i)}$$

In the above formula, $wave\_rate_i$ is the fluctuation rate between two adjacent points on the curve, $count_i$ is the $i$-th component of the retweet vector, and $\overline{retweet\_count}$ is a normalization factor: the average retweet count of the user's historical posts within one month, which measures the average popularity of the posts the user publishes.

The final popularity volatility feature of the post is represented by the maximum fluctuation rate of the approximate popularity trend curve, that is, the maximum of the fluctuation rates between adjacent points in the retweet vector, computed as follows:

$$wave\_feature = \max(wave\_rate_1, wave\_rate_2, \ldots, wave\_rate_{n-1})$$
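The volatility feature of step 602 can be sketched as below. The parameter name `avg_retweet_count` stands for the normalization factor (the user's one-month average retweet count) and is a name chosen here for illustration.

```python
def wave_feature(intervals, counts, avg_retweet_count):
    """Maximum normalized slope between adjacent points of the retweet curve.

    intervals: elapsed times I_1..I_n since publication (strictly increasing).
    counts: retweet counts count_1..count_n at those times.
    avg_retweet_count: the user's one-month average retweet count (> 0).
    """
    rates = [
        (counts[i + 1] - counts[i])
        / (avg_retweet_count * (intervals[i + 1] - intervals[i]))
        for i in range(len(counts) - 1)
    ]
    return max(rates)
```

For intervals `[1, 2, 3, 4]`, counts `[10, 30, 130, 150]` and an average of 50, the rates are 0.4, 2.0 and 0.4, so the feature is 2.0: the sharpest burst was twice the user's usual popularity per unit of time.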

In step 603, the difference feature of popularity is computed using the average retweet count of the user's historical posts within one month, as follows:

$$idf\_feature = \frac{retweet\_count_T - \overline{retweet\_count}}{\overline{retweet\_count}}$$

where $retweet\_count_T$ is the retweet count the post has obtained, which is also the current popularity of the post, and $\overline{retweet\_count}$ is the average retweet count of the user's historical posts within one month.
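The difference feature of step 603 is a relative deviation from the user's usual popularity. A minimal sketch follows; the function and parameter names are chosen here for illustration, and the function corresponds to the quantity the source labels `idf_feature`.

```python
def difference_feature(retweet_count_t, avg_retweet_count):
    """Relative deviation of a post's retweet count from the user's average.

    avg_retweet_count: the user's one-month average retweet count (> 0).
    """
    return (retweet_count_t - avg_retweet_count) / avg_retweet_count
```

A post retweeted 150 times by a user whose posts average 50 retweets yields 2.0; a post at exactly the average yields 0.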

(5) Classification-based microblog rumor identification. A feature vector is constructed from the content features of the post, the features of the publishing user and the popularity features of the post described above, as in step 109 in Fig. 1; each dimension of the feature vector is shown in the table below.

Then, using the constructed classifier and the content features, publishing-user features and popularity features obtained in steps (2), (3) and (4) above, whether the post is a rumor is predicted and the corresponding recognition result is output; this step corresponds to step 110 in Fig. 1. The classifier may be a common classifier such as an SVM or a decision tree. To build the classifier, the feature vectors of the training data are first obtained (in the same way as in step 109) and used as the classifier's input to tune its parameters; once tuning is complete, the trained classifier is applied to rumor prediction.

The present invention proposes a microblog rumor recognition method and system based on multi-feature classification. According to the distinctive characteristics of microblog rumors, the system introduces new implicit features of the content and of the publishing user that have not previously been proposed, innovatively incorporates the popularity and popularity-trend features of the post, and combines them with classification learning methods from machine learning to identify microblog rumors accurately and automatically.

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the scope of protection of the claims appended to the invention.

Claims

1. A social network rumor recognition method, characterized by comprising:

2. The social network rumor recognition method of claim 1, characterized in that the deep implicit microblog features comprise a hot-topic tendency feature, an internal-external consistency feature, a sentiment polarity feature, and an opinion tendency feature of the comments.

3. The social network rumor recognition method of claim 1, characterized in that the deep implicit user features comprise a social feature, an opinion retweet feature, and a microblog matching degree feature.

4. The social network rumor recognition method of claim 2, characterized in that extracting the hot-topic tendency feature and the internal-external consistency feature comprises performing word segmentation and part-of-speech tagging on the microblog text, extracting the nouns and verbs that carry meaning as keywords, using TF-IDF from text feature extraction as the weight for ranking the keywords, and taking the K highest-weighted words as the keywords of the microblog text.

5. The social network rumor recognition method of claim 2, characterized in that extracting the sentiment polarity feature comprises performing word segmentation and part-of-speech tagging on the microblog text and extracting keywords by means of a sentiment dictionary, an emoticon dictionary, a punctuation dictionary and a sensitive-word dictionary, the extracted words being the content words in the post and the words that match a dictionary entry.

6. A social network rumor recognition system, characterized by comprising:

7. The social network rumor recognition system of claim 6, characterized in that the deep implicit microblog features comprise a hot-topic tendency feature, an internal-external consistency feature, a sentiment polarity feature, and an opinion tendency feature of the comments.

8. The social network rumor recognition system of claim 6, characterized in that the deep implicit user features comprise a social feature, an opinion retweet feature, and a microblog matching degree feature.

9. The social network rumor recognition system of claim 7, characterized in that extracting the hot-topic tendency feature and the internal-external consistency feature comprises performing word segmentation and part-of-speech tagging on the microblog text, extracting the nouns and verbs that carry meaning as keywords, using TF-IDF from text feature extraction as the weight for ranking the keywords, and taking the K highest-weighted words as the keywords of the microblog text.

10. The social network rumor recognition system of claim 7, characterized in that extracting the sentiment polarity feature comprises performing word segmentation and part-of-speech tagging on the microblog text and extracting keywords by means of a sentiment dictionary, an emoticon dictionary, a punctuation dictionary and a sensitive-word dictionary, the extracted words being the content words in the post and the words that match a dictionary entry.