CN110032733A

CN110032733A - Rumour detection method and system for long news texts

Info

Publication number: CN110032733A
Application number: CN201910184862.7A
Authority: CN
Inventors: 曹娟; 钟雷; 郭俊波; 李锦涛; 谢添; 刘浩远
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2019-03-12
Filing date: 2019-03-12
Publication date: 2019-07-19

Abstract

The present invention relates to a rumour detection method and system for long news texts, comprising: obtaining texts longer than a preset number of characters from a specified news platform as long texts, extracting keywords from the paragraphs of a long text, searching a social platform with the keywords to obtain social data, and using a text relevance algorithm to obtain the data related to each paragraph; obtaining a labelled data set comprising multiple items of social data annotated with rumour information, training multiple classification models on the labelled data set, combining the trained classification models into a fusion model, and using the fusion model to obtain a credibility score for the related data, which represents the probability that the paragraph is not a rumour. By using a cross-source (heterologous) detection method, the invention solves the problem that long texts are difficult to judge directly.

Description

Rumour detection method and system for long news texts

Technical field

The present invention relates to the field of rumour detection in big data analysis, and in particular to a rumour detection method and system for long news texts.

Background technique

As online news media platforms become closely tied to everyday life, they have become one of the main sources from which people obtain news. However, these platforms contain a large amount of false information. In particular, some news media platforms have admitted self-media accounts in order to broaden their information sources, so the quality of the long texts published there is uneven and the platforms easily become a source of rumours. Such information seriously affects the normal functioning of society and the lives of citizens, so rumour detection for media platforms has become particularly important. The long-text data addressed in this patent are texts on news media platforms whose length exceeds 140 characters. Traditional manual rumour identification consumes a great deal of manpower and resources and can hardly meet real-time requirements; moreover, the semantic information in long texts is more dispersed, which further raises the cost of manual labelling. At present, work on rumour detection with machine learning mainly studies short texts on platforms such as Weibo and Twitter, while long-text data on news media platforms such as "Daily Flash Report" have received little study. Short texts on Weibo, Twitter and similar platforms give learning algorithms more features to learn from, such as content features, user features, propagation features and temporal features. Combined with popular machine learning or deep learning algorithms, rumour detection methods for short texts have already reached high accuracy. Because the general public cannot take part in producing content on news media platforms, long texts on such platforms lack the rich data features of social media data, and common detection algorithms usually have only the text content to work with. Observation also shows that long texts are usually weakly discriminative in features such as semantics, sentiment and punctuation, which makes it difficult for classification algorithms to guarantee accuracy. The present invention therefore proposes a new rumour detection method for long-text data on news media platforms.

Content-based rumour detection methods mainly use explicit syntactic features and implicit semantic features. For explicit features, one prior-art approach uses word features, symbol features and simple sentiment features of the text content; another uses features such as string length, word count, whether punctuation marks are present, and posting time. For implicit features, one prior-art approach uses the hidden-layer representation of a message learned by a recurrent neural network, which improves experimental results; another takes word vectors as input and uses a convolutional neural network to obtain a semantic representation of the text. Because texts on platforms such as Weibo are short, their information is concentrated and their styles vary, content features allow rumour detection to perform well. Texts on news media platforms, by contrast, are longer, semantically dispersed and plain in sentence structure, so content features alone rarely yield good classification results.

There is currently little research on rumour detection for long texts. One prior-art approach identifies rumours in long texts from the two domains of "food health" and "medical health": based on the observation that rumours exhibit abnormal sentiment features, it proposes detecting rumours through sentiment analysis. However, this method is not general and is effective only for certain types of rumour.

To address the above problems, the present invention proposes a rumour detection method for long-text data on news media platforms. Observation shows that a rumour in a long text usually appears in only a few paragraphs. Drawing on the relatively mature rumour detection methods for short microblog texts, the method first treats the long text paragraph by paragraph, extracts keywords from each paragraph, and searches the microblog platform with them to obtain microblog data. Provided that the microblog data are relevant to the content of the paragraph, a fusion model computes the credibility of the microblog data, which in turn gives a credibility score for each paragraph of the long text.

Summary of the invention

In view of the above problems, the present invention proposes a rumour detection method for long-text data on news media platforms. It mainly solves the problem of finding data with similar content on the microblog platform for assessment, and it provides a credibility score for each paragraph of the long text.

Specifically, the present invention relates to a rumour detection method for long news texts, comprising:

Step 1: obtain texts longer than a preset number of characters from a specified news platform as long texts, extract keywords from the paragraphs of the long text, search a social platform with the keywords to obtain social data, and use a text relevance algorithm to obtain the data related to each paragraph;

Step 2: obtain a labelled data set comprising multiple items of social data annotated with rumour information, train multiple classification models on the labelled data set, combine the trained classification models into a fusion model, and use the fusion model to obtain the credibility score of the related data, which represents the probability that the paragraph is not a rumour.

In the rumour detection method for long news texts, step 1 comprises: for each paragraph, extracting the keywords of the paragraph using the TF-IDF method.

In the rumour detection method for long news texts, step 1 comprises: computing the similarity between the social data containing the keywords and the paragraph, and collecting the social data whose similarity exceeds a threshold as the related data.

In the rumour detection method for long news texts, the multiple classification models in step 2 comprise: a support vector machine, a random forest, extremely randomized trees, a gradient boosting decision tree, extreme gradient boosting and a logistic regression model.

In the rumour detection method for long news texts, training the multiple classification models in step 2 specifically comprises:

Dividing the labelled data set into a training set and a test set, and dividing the training set into 5 folds of equal size; for each of the support vector machine, the random forest, the extremely randomized trees, the gradient boosting decision tree and the extreme gradient boosting model, training on 4 folds of the training set and predicting on the remaining fold; combining the respective prediction results of these five models into a first intermediate training set; during each round of training, also predicting on the test set, denoting each prediction on the test set as b_i, and averaging the respective prediction results of the five models to obtain a second intermediate training set; and training and testing the logistic regression model on the first intermediate training set and the second intermediate training set to obtain the final fusion model.

The invention also discloses a rumour detection system for long news texts, comprising:

Module 1: obtains texts longer than a preset number of characters from a specified news platform as long texts, extracts keywords from the paragraphs of the long text, searches a social platform with the keywords to obtain social data, and uses a text relevance algorithm to obtain the data related to each paragraph;

Module 2: obtains a labelled data set comprising multiple items of social data annotated with rumour information, trains multiple classification models on the labelled data set, combines the trained classification models into a fusion model, and uses the fusion model to obtain the credibility score of the related data, which represents the probability that the paragraph is not a rumour.

In the rumour detection system for long news texts, module 1 comprises: for each paragraph, extracting the keywords of the paragraph using the TF-IDF method.

In the rumour detection system for long news texts, module 1 comprises: computing the similarity between the social data containing the keywords and the paragraph, and collecting the social data whose similarity exceeds a threshold as the related data.

In the rumour detection system for long news texts, the multiple classification models in module 2 comprise: a support vector machine, a random forest, extremely randomized trees, a gradient boosting decision tree, extreme gradient boosting and a logistic regression model.

In the rumour detection system for long news texts, training the multiple classification models in module 2 specifically comprises:

The technical effects of the present invention include: the cross-source detection method solves the problem that long texts are difficult to judge directly; the paragraph-level detection method locates the specific paragraphs of a long text that contain rumours; and the method is generally applicable to rumour detection for long-text data on news media platforms.

Description of the drawings

Fig. 1 is a block diagram of the method for obtaining long-text paragraphs;

Fig. 2 is a schematic diagram of the stacking method for model fusion;

Fig. 3 is a block diagram of the overall method of the present invention.

Specific implementation details

The key points of the present invention include:

1. Paragraph-level detection. Unlike traditional short texts, which are rumour or non-rumour as a whole, rumours in long-text data on news media platforms usually appear in only a few paragraphs. In view of this, the invention performs rumour detection on long texts paragraph by paragraph. Paragraph-level detection provides a credibility score for every paragraph of the long text and locates the specific paragraphs in which rumours occur, making the results more interpretable.

2. Cross-source (heterologous) detection. Long-text data offer few usable features, so they are difficult to assess directly with machine learning methods. The invention instead performs rumour detection on microblog data from a different source, obtained by search, and, provided the content is similar, derives the credibility score of the corresponding long-text paragraph. Cross-source detection enriches the data features and improves detection accuracy.

3. Feature construction and scoring of microblog data with a fusion model. The invention crawled officially certified rumour data from the Sina Weibo platform and obtained non-rumour data by manually labelling normal microblog posts. It then extracts 22 data features from the microblog data, such as the number of comments, likes and reposts. Six models are used in total: base models such as a support vector machine, a random forest and a gradient boosting decision tree first perform initial training on the data, the stacking method of model fusion then constructs new training and test data sets, and a logistic regression model is finally trained on the newly constructed data set. The fusion model makes rumour detection on microblog data more accurate.

To make the above features and effects of the invention clearer and easier to understand, embodiments are given below and described in detail with reference to the accompanying drawings.

The method of the invention provides a credibility score for every paragraph of a long text and is generally applicable to rumour detection for long-text data on news media platforms. The invention is described below with reference to the drawings and a specific implementation.

Step 1: segment the long text and obtain the related microblog data.

Long-text data on news media platforms are weakly discriminative, so it is difficult to classify them as rumour or non-rumour from the text alone. The invention therefore proposes using the rich data features of related microblog data for rumour detection, compensating for the weak discriminative power of long-text features and thereby obtaining an assessment score for the long text. The flow for obtaining related microblog data is shown in Fig. 1.

Statistics show that rumours in long-text data on news media platforms exist only in some paragraphs of the text, while the other paragraphs remain normal text. The invention first merges paragraphs with few words (for example, fewer than 25 words) and then assesses credibility paragraph by paragraph. For each paragraph, keywords are extracted with the TF-IDF (term frequency-inverse document frequency) method. TF is the normalized frequency with which a word occurs in a document: the more often the word occurs, the larger its TF value. It is calculated as:

$$\mathrm{TF} = \frac{\text{number of occurrences of the word in the document}}{\text{total number of words in the document}}$$

IDF is the inverse document frequency of a word in the document collection: the fewer documents contain the word, the larger its IDF value. It is calculated as:

$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing the word} + 1}\right)$$

TF-IDF is the product of TF and IDF and represents the importance of a word in a document. The keywords extracted by the invention are the 4 words with the highest TF-IDF scores.
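The keyword extraction described above can be sketched as follows. This is a minimal illustration that applies the TF and IDF formulas exactly as stated, not the patent's actual implementation. It assumes the text has already been split into words by some upstream tokenizer, and the function name `tf_idf_keywords` and its parameters are hypothetical.

```python
import math
from collections import Counter


def tf_idf_keywords(paragraph_tokens, corpus_tokens, top_k=4):
    """Return the top_k words of one paragraph ranked by TF-IDF.

    paragraph_tokens: list of words of the paragraph being analysed.
    corpus_tokens: list of word lists, one per document in the collection.
    Both inputs are assumed to be tokenized already (hypothetical upstream step).
    """
    counts = Counter(paragraph_tokens)
    total_words = len(paragraph_tokens)
    total_docs = len(corpus_tokens)
    doc_sets = [set(doc) for doc in corpus_tokens]

    scores = {}
    for word, count in counts.items():
        # TF = occurrences of the word / total words in the document
        tf = count / total_words
        # IDF = log(total documents / (documents containing the word + 1))
        docs_with_word = sum(1 for doc in doc_sets if word in doc)
        idf = math.log(total_docs / (docs_with_word + 1))
        scores[word] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

Note that with the "+1" in the denominator, a word that occurs in every document receives a negative IDF, so very common words naturally fall to the bottom of the ranking.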

The microblog platform provides a search interface through which a user can enter keywords and obtain the corresponding microblog data. Using this interface, the invention developed a data acquisition program that takes the extracted keywords and crawls the data on the first page of the returned search results. Further, to ensure that the microblog data are relevant to the content of the long-text paragraph, the invention uses the word embedding method Word2vec to obtain vector representations of the words in the microblog data and in the long-text paragraph respectively, and takes the average of the word vectors, weighted by each word's TF-IDF weight, as the vector representation of the text. The degree of relevance between the two is then computed with cosine similarity. Let the vector representations of the microblog data and of the long-text paragraph be $\mathbf{v}_m$ and $\mathbf{v}_p$ respectively; the text relevance is then calculated as:

$$\mathrm{rel}(\mathbf{v}_m, \mathbf{v}_p) = \frac{\mathbf{v}_m \cdot \mathbf{v}_p}{\lVert \mathbf{v}_m \rVert \, \lVert \mathbf{v}_p \rVert}$$

A relevance threshold is set, and microblog data with high relevance to the long-text paragraph are retained. In this way the microblog data (related data) corresponding to each paragraph of the long text are obtained.
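The relevance filtering described above might look like the sketch below. It assumes the word vectors have already been produced by a Word2vec model and are supplied as a plain dictionary, and that per-word TF-IDF weights are available. All function names and the threshold value 0.8 are hypothetical, since the text does not state the threshold that was used.

```python
import math


def weighted_text_vector(tokens, word_vectors, weights):
    """TF-IDF-weighted average of the word vectors of one text.

    word_vectors: dict mapping word -> list of floats (assumed Word2vec output).
    weights: dict mapping word -> TF-IDF weight.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    used = 0
    for token in tokens:
        if token not in word_vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(token, 0.0)
        for i, x in enumerate(word_vectors[token]):
            total[i] += w * x
        used += 1
    if used == 0:
        return total  # zero vector when no word is known
    return [x / used for x in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_related(paragraph_vec, post_vecs, threshold=0.8):
    """Indices of posts whose similarity to the paragraph exceeds the threshold."""
    return [
        i for i, vec in enumerate(post_vecs)
        if cosine_similarity(paragraph_vec, vec) > threshold
    ]
```

Because cosine similarity is unchanged by positive rescaling of a vector, dividing the weighted sum by the number of words rather than by the sum of the weights does not affect which posts pass the threshold.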

Step 2: analyse the credibility of the microblog data.

For the microblog data corresponding to a long-text paragraph, the invention analyses their credibility with a fusion of 6 models, including a support vector machine, a random forest and a gradient boosting decision tree.

Microblog credibility analysis can be regarded as a binary classification problem. The rumour data in the training data all come from rumours officially certified by the microblog platform, and the non-rumour data come from manual labelling. For each microblog post, 22 social features are extracted, such as the number of likes, comments and reposts. The stacking method first uses multiple upper-layer models to construct new training and test data sets and then trains a lower-layer model on them. The method is shown in Fig. 2:

The upper layer of the invention uses 5 models: a support vector machine (SVM), a random forest (RF), extremely randomized trees (ET), a gradient boosting decision tree (GBDT) and extreme gradient boosting (XGBoost). The data set is first divided into a training set and a test set, and the training set is divided into 5 folds of equal size. For each model, 4 folds of the training set are used for training and the remaining fold for prediction (so that each round predicts on a different fold). Denoting the result of each round of prediction as a_i, the results of the 5 rounds are combined into a matrix A, which becomes the new training data set. During each round of training, the test set is also predicted; denoting each prediction on the test set as b_i, the 5 predictions are averaged to give a matrix B, which becomes the new test data set. Finally, a logistic regression model is trained on the new training data set A and tested on the new test data set B, giving the final evaluation model.
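The construction of the new data sets A and B can be sketched as below. This is an illustration under stated assumptions rather than the patent's code: base models are assumed to expose `fit(X, y)` and `predict_proba(X)` returning one score per sample, and `ThresholdModel` is a hypothetical stand-in for the five real classifiers so that the example stays self-contained. The final logistic regression step is not shown.

```python
class ThresholdModel:
    """Hypothetical stand-in base learner that looks at a single feature.

    fit() learns the midpoint between the class means of that feature;
    predict_proba() returns 1.0 above the midpoint and 0.0 otherwise.
    """

    def __init__(self, feature):
        self.feature = feature
        self.cut = 0.0

    def fit(self, X, y):
        pos = [row[self.feature] for row, label in zip(X, y) if label == 1]
        neg = [row[self.feature] for row, label in zip(X, y) if label == 0]
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict_proba(self, X):
        return [1.0 if row[self.feature] > self.cut else 0.0 for row in X]


def build_stacked_sets(model_factories, X_train, y_train, X_test, n_folds=5):
    """Build matrix A (out-of-fold predictions) and matrix B (averaged test predictions).

    model_factories: list of zero-argument callables, each returning a fresh model.
    Column j of A and B holds the outputs of base model j.
    """
    n = len(X_train)
    # Interleaved folds: fold k holds samples k, k + n_folds, k + 2 * n_folds, ...
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    A = [[0.0] * len(model_factories) for _ in range(n)]
    B = [[0.0] * len(model_factories) for _ in range(len(X_test))]

    for j, make_model in enumerate(model_factories):
        for held_out in folds:
            held = set(held_out)
            fit_idx = [i for i in range(n) if i not in held]
            model = make_model().fit(
                [X_train[i] for i in fit_idx],
                [y_train[i] for i in fit_idx],
            )
            # a_i: predictions on the held-out fold fill column j of A
            preds = model.predict_proba([X_train[i] for i in held_out])
            for i, p in zip(held_out, preds):
                A[i][j] = p
            # b_i: predictions on the test set, summed here and averaged below
            for t, p in enumerate(model.predict_proba(X_test)):
                B[t][j] += p
        for t in range(len(X_test)):
            B[t][j] /= n_folds
    return A, B
```

In a fuller version, the rows of A together with the training labels would be used to fit the logistic regression model, and the rows of B would be passed to that model for evaluation.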

The fusion model reduces the bias that a single model exhibits during classification and achieves better results in rumour detection. For each long-text paragraph, the model fusion method yields the credibility score of the related microblog data, which represents the probability that the paragraph is not a rumour.
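The text does not say how the scores of several related microblog posts are combined into one paragraph score. The sketch below assumes a simple mean, which is only one plausible choice. The function names and the 0.5 cutoff are hypothetical and serve only to show how paragraph-level scores could be used to locate suspicious paragraphs.

```python
def paragraph_scores(related_scores):
    """Map paragraph index -> mean non-rumour probability of its related posts.

    related_scores: dict mapping paragraph index -> list of fusion-model scores.
    Averaging is an assumption; paragraphs without related posts get None.
    """
    result = {}
    for idx, scores in related_scores.items():
        result[idx] = sum(scores) / len(scores) if scores else None
    return result


def flag_suspicious(scores, cutoff=0.5):
    """Indices of paragraphs whose score falls below the (hypothetical) cutoff."""
    return sorted(
        idx for idx, s in scores.items()
        if s is not None and s < cutoff
    )
```

Returning None for paragraphs with no related posts keeps "no evidence" distinct from "low credibility".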

The following is a system embodiment corresponding to the above method embodiment; the two embodiments can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and, to avoid repetition, are not described again here. Correspondingly, the technical details mentioned in this embodiment also apply to the above embodiment.

Claims

1. A rumour detection method for long news texts, characterized by comprising:

Step 1: obtaining texts longer than a preset number of characters from a specified news platform as long texts, extracting keywords from the paragraphs of the long text, searching a social platform with the keywords to obtain social data, and using a text relevance algorithm to obtain the data related to each paragraph;

Step 2: obtaining a labelled data set comprising multiple items of social data annotated with rumour information, training multiple classification models on the labelled data set, combining the trained classification models into a fusion model, and using the fusion model to obtain the credibility score of the related data, which represents the probability that the paragraph is not a rumour.

2. The rumour detection method for long news texts according to claim 1, characterized in that step 1 comprises: for each paragraph, extracting the keywords of the paragraph using the TF-IDF method.

3. The rumour detection method for long news texts according to claim 1, characterized in that step 1 comprises: computing the similarity between the social data containing the keywords and the paragraph, and collecting the social data whose similarity exceeds a threshold as the related data.

4. The rumour detection method for long news texts according to claim 1, characterized in that the multiple classification models in step 2 comprise: a support vector machine, a random forest, extremely randomized trees, a gradient boosting decision tree, extreme gradient boosting and a logistic regression model.

5. The rumour detection method for long news texts according to claim 1, characterized in that training the multiple classification models in step 2 specifically comprises:

Dividing the labelled data set into a training set and a test set, and dividing the training set into 5 folds of equal size; for each of the support vector machine, the random forest, the extremely randomized trees, the gradient boosting decision tree and the extreme gradient boosting model, training on 4 folds of the training set and predicting on the remaining fold; combining the respective prediction results of these five models into a first intermediate training set; during each round of training, also predicting on the test set, denoting each prediction on the test set as b_i, and averaging the respective prediction results of the five models to obtain a second intermediate training set; and training and testing the logistic regression model on the first intermediate training set and the second intermediate training set to obtain the final fusion model.

6. A rumour detection system for long news texts, characterized by comprising:

Module 1: obtains texts longer than a preset number of characters from a specified news platform as long texts, extracts keywords from the paragraphs of the long text, searches a social platform with the keywords to obtain social data, and uses a text relevance algorithm to obtain the data related to each paragraph;

Module 2: obtains a labelled data set comprising multiple items of social data annotated with rumour information, trains multiple classification models on the labelled data set, combines the trained classification models into a fusion model, and uses the fusion model to obtain the credibility score of the related data, which represents the probability that the paragraph is not a rumour.

7. The rumour detection system for long news texts according to claim 6, characterized in that module 1 comprises: for each paragraph, extracting the keywords of the paragraph using the TF-IDF method.

8. The rumour detection system for long news texts according to claim 6, characterized in that module 1 comprises: computing the similarity between the social data containing the keywords and the paragraph, and collecting the social data whose similarity exceeds a threshold as the related data.

9. The rumour detection system for long news texts according to claim 6, characterized in that the multiple classification models in module 2 comprise: a support vector machine, a random forest, extremely randomized trees, a gradient boosting decision tree, extreme gradient boosting and a logistic regression model.

10. The rumour detection system for long news texts according to claim 6, characterized in that training the multiple classification models in module 2 specifically comprises: