CN108763574A - Microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set - Google Patents


Info

Publication number
CN108763574A
Authority
CN
China
Prior art keywords
rumour
microblogging
detection
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810576095.XA
Other languages
Chinese (zh)
Inventor
杨波
熊枭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201810576095.XA
Publication of CN108763574A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01: Social networking

Abstract

The invention discloses a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set. The feature set provided for rumour detection comprises 23 features. The rumour detection algorithm based on gradient boosted trees first constructs training samples according to the features in the feature set; these samples are used to train the microblog rumour detection model. Training is then repeated on the training sample set to obtain multiple regression tree models, each of which provides a predicted value, and the final microblog rumour detection model combines the predicted values of these regression trees. When performing rumour detection, the features of the microblog to be predicted are extracted according to the feature set, the detection model computes a predicted value for that microblog, and the microblog is judged to be a rumour or a non-rumour according to the predicted value. Compared with existing microblog rumour detection algorithms, the algorithm and feature set provided by the invention achieve higher rumour detection accuracy; in particular, in the early period after a rumour is published, the detection accuracy is significantly higher than that of existing algorithms.

Description

A microblog rumour detection algorithm based on gradient boosted trees, and a rumour detection feature set
Technical field
The invention relates to the technical field of microblog rumour detection, and in particular to a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set.
Background
The diversity of microblog content, freedom of speech, and explosive propagation speed encourage the generation and spread of rumours, making microblogs an ideal place for false information to spread. Rumour detection algorithms emerged in order to detect rumours and stop their spread in time.
The detection accuracy of existing microblog rumour detection algorithms is not high enough, and it is especially low in the early period after a rumour is published. This is an important deficiency of existing microblog rumour detection algorithms.
Summary of the invention
To address the deficiency of existing microblog rumour detection algorithms, the invention provides a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set. Compared with existing microblog rumour detection algorithms, the algorithm and feature set provided by the invention achieve higher rumour detection accuracy; in particular, in the early period after a rumour is published, the detection accuracy is significantly higher than that of existing algorithms.
The invention comprises the following:
1. A feature set for rumour detection, which is used to perform rumour detection. Compared with existing microblog rumour detection algorithms, the detection features constructed by the invention help improve the accuracy of early rumour detection. The specific feature set is shown in Table 1.
Table 1
2. A rumour detection algorithm based on gradient boosted trees. The gradient boosted tree algorithm is used (i.e. S1 and S2 in Fig. 1, specifically S1.1-S1.2 and S2.1-S2.6). First, features are extracted according to Table 1 to obtain the training sample set. Then regression tree models are trained on the training data set; the weight of each regression tree is computed according to formula (1), and the labels of the samples are updated according to formula (2). Label updating and regression tree training are repeated to obtain multiple regression trees, which are combined according to formula (3) into the final detection model. Finally, formula (4) is used to predict the label of a microblog post whose label is unknown.
Description of the drawings
Fig. 1 is a flow chart of the microblog rumour detection algorithm based on gradient boosted trees provided by the invention.
Fig. 2 is the flow chart of S1 in Fig. 1.
Fig. 3 is the flow chart of S2 in Fig. 1.
Symbols used in the invention:
$x_i$: features of sample $i$
$y_i$: label of sample $i$
$N$: number of training samples
$\gamma_m$: weight of the $m$-th regression tree
$\alpha$: smoothing parameter
$h_0$: initial predicted value
$L$: cost (loss) function
$\tilde{y}_i^{(m)}$: label of sample $x_i$ in the $m$-th iteration
$F_M(x)$: final prediction model
$\theta$: threshold used to decide the output label
Detailed description
The microblog rumour detection algorithm based on gradient boosted trees and the rumour detection feature set disclosed by the invention comprise two parts: the rumour detection algorithm based on gradient boosted trees, and the feature set used for rumour detection.
The overall flow chart of the microblog rumour detection algorithm based on gradient boosted trees is shown in Fig. 1. The specific embodiments of the invention are described in detail below with reference to the drawings.
I. Data processing
This part corresponds to S1 in Fig. 1; the detailed flow chart is shown in Fig. 2.
S1.1: Extract features
Perform feature extraction on the raw data, extracting the values of the features in the rumour detection feature set shown in Table 1.
S1.2: Set labels
For a sample $x_i$ ($1 \le i \le N$), set its label $y_i$ to 1 if it is a rumour; otherwise set $y_i$ to 0.
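A few of the text-surface features named in the feature set (microblog length, question mark count, exclamation mark count, bracket count), together with the labelling rule of S1.2, can be sketched as follows. This is a minimal illustration rather than the patent's extractor: the function names, the dictionary keys, and the exact characters counted (ASCII and full-width variants) are assumptions, and the remaining features are omitted.

```python
# Hypothetical helpers; the source does not specify how each count is computed.
QUESTION_MARKS = "?？"        # ASCII and full-width forms (assumed)
EXCLAMATION_MARKS = "!！"
BRACKETS = "()（）[]【】"


def count_chars(text: str, chars: str) -> int:
    """Count how many characters of `text` belong to `chars`."""
    return sum(1 for ch in text if ch in chars)


def extract_text_features(text: str) -> dict:
    """Return a small, illustrative subset of the surface features of one microblog."""
    return {
        "length": len(text),
        "question_marks": count_chars(text, QUESTION_MARKS),
        "exclamation_marks": count_chars(text, EXCLAMATION_MARKS),
        "brackets": count_chars(text, BRACKETS),
    }


def make_sample(text: str, is_rumour: bool) -> tuple:
    """Build (x_i, y_i): features plus label 1 for a rumour, 0 otherwise (S1.2)."""
    return extract_text_features(text), 1 if is_rumour else 0
```

A labelled sample is then obtained with, for example, `make_sample("Really?! (source unknown)", True)`.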
II. Detection model construction and rumour detection
This part corresponds to S2 in Fig. 1; the detailed flow chart is shown in Fig. 3.
S2: Model construction
This stage obtains the detection model $F_M(x)$ from the data set and uses $F_M(x)$ to perform rumour detection.
S2.1: Initialization
Set the number of regression trees $M$, the maximum depth of a regression tree $P$, the smoothing parameter $\alpha$, and the decision threshold $\theta$.
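The four quantities set at initialization can be grouped into one configuration object. The sketch below is only an illustration: the class and field names are hypothetical, the range check on the smoothing parameter follows the constraint $0 < \alpha \le 1$ stated in claim 3, and the other checks are assumptions about sensible values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoostingConfig:
    """Hyperparameters named in S2.1 (field names are illustrative)."""
    num_trees: int    # M, number of regression trees
    max_depth: int    # P, maximum depth of each regression tree
    alpha: float      # smoothing parameter, 0 < alpha <= 1
    theta: float      # decision threshold applied to F_M(x)

    def __post_init__(self):
        if self.num_trees < 1:
            raise ValueError("M must be a positive integer")
        if self.max_depth < 1:
            raise ValueError("P must be at least 1")
        if not 0.0 < self.alpha <= 1.0:
            raise ValueError("alpha must satisfy 0 < alpha <= 1")
```

For example, `BoostingConfig(num_trees=50, max_depth=3, alpha=0.1, theta=0.5)` builds a valid configuration; these particular values are placeholders, not values given in the source.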
S2.2: Traverse the features and their values
1) Traverse the feature set $x_j \in \{x_1, x_2, x_3, \ldots, x_{23}\}$ ($j = 1, 2, \ldots, 23$) and all values of each feature $x_j$.
S2.3: Compute the feature loss
1) A split $(j, s)$ divides the training data into two regions $R_{left}$ and $R_{right}$, where $R_{left}(j, s) = \{x \mid x_j \le s\}$ and $R_{right}(j, s) = \{x \mid x_j > s\}$. Compute the predicted value in each region:
Compute the loss $L(j, s)$ of the split:
2) Find the optimal split $(j, s)^*$ that minimizes the loss $L(j, s)$, and use it as the final split to divide the training data into two regions.
3) Recursively split the data regions until the depth of the regression tree reaches $P$, obtaining the regression tree $h_m(x)$.
S2.4: Compute the weight of the current regression tree
1) Compute the weight of the regression tree:
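In standard gradient boosting, the weight $\gamma_m$ of the tree $h_m$ is obtained by a one-dimensional line search over the loss. Assuming that convention applies here, and additionally assuming a squared-error loss (the loss function is an assumption, not something stated in this text), the weight has a closed form:

```latex
\begin{align}
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{N}
            L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\
% Assumed loss: L(y, F) = (y - F)^2 / 2, with residual r_i = y_i - F_{m-1}(x_i).
% Setting the derivative with respect to \gamma to zero gives:
\gamma_m &= \frac{\sum_{i=1}^{N} r_i\, h_m(x_i)}{\sum_{i=1}^{N} h_m(x_i)^2},
\qquad r_i = y_i - F_{m-1}(x_i)
\end{align}
```

Under these assumptions, if each leaf of $h_m$ predicts the mean residual of the samples it contains, the numerator and denominator coincide and $\gamma_m = 1$; other losses generally give a different weight.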
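S2.5: Update the prediction target
1) Update the value of $y$ for each sample in the training set:
Update the overall model:
$F_m(x) = F_{m-1}(x) + \alpha \gamma_m h_m(x)$.
2) Finally, obtain the detection model $F_M(x)$:
$F_M(x) = h_0 + \alpha \sum_{m=1}^{M} \gamma_m h_m(x)$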
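Steps S2.2 to S2.5 can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the patent's implementation: it assumes a squared-error loss, so the updated labels are the residuals $y_i - F_{m-1}(x_i)$, each region predicts the mean of its targets, and the split loss is the sum of squared errors of the two regions. All function names and the default `h0=0.5` are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (assumed split loss)."""
    if not values:
        return 0.0
    c = mean(values)
    return sum((v - c) ** 2 for v in values)


def best_split(X, r):
    """Scan every feature j and value s; return (loss, j, s) with minimal loss."""
    best = None
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [ri for row, ri in zip(X, r) if row[j] <= s]
            right = [ri for row, ri in zip(X, r) if row[j] > s]
            if not left or not right:
                continue  # a split must put samples on both sides
            loss = sse(left) + sse(right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best


def build_tree(X, r, depth, max_depth):
    """Split recursively until depth P; a leaf is the mean target of its region."""
    split = best_split(X, r) if depth < max_depth else None
    if split is None:
        return mean(r)
    _, j, s = split
    left = [(row, ri) for row, ri in zip(X, r) if row[j] <= s]
    right = [(row, ri) for row, ri in zip(X, r) if row[j] > s]
    return (j, s,
            build_tree([a for a, _ in left], [b for _, b in left], depth + 1, max_depth),
            build_tree([a for a, _ in right], [b for _, b in right], depth + 1, max_depth))


def tree_predict(tree, x):
    """Walk from the root to a leaf and return the leaf value h_m(x)."""
    while isinstance(tree, tuple):
        j, s, low, high = tree
        tree = low if x[j] <= s else high
    return tree


def fit(X, y, num_trees, max_depth, alpha, h0=0.5):
    """Run M boosting rounds; return (h0, [(gamma_m, tree_m), ...])."""
    F = [h0] * len(X)                                   # F_0(x) = h0
    trees = []
    for _ in range(num_trees):
        r = [yi - fi for yi, fi in zip(y, F)]           # updated labels (assumed residuals)
        tree = build_tree(X, r, 0, max_depth)
        h = [tree_predict(tree, x) for x in X]
        denom = sum(v * v for v in h)
        gamma = sum(ri * hi for ri, hi in zip(r, h)) / denom if denom else 0.0
        F = [fi + alpha * gamma * hi for fi, hi in zip(F, h)]   # F_m = F_{m-1} + alpha*gamma_m*h_m
        trees.append((gamma, tree))
    return h0, trees


def score(model, x, alpha):
    """Evaluate F_M(x) = h0 + alpha * sum_m gamma_m * h_m(x)."""
    h0, trees = model
    return h0 + alpha * sum(g * tree_predict(t, x) for g, t in trees)
```

On a toy data set such as `X = [[0.0], [1.0], [2.0], [3.0]]` with `y = [0, 0, 1, 1]`, the score of the last sample rises above that of the first as rounds accumulate. The exhaustive split scan is quadratic in the number of samples per feature, which is acceptable for illustration but not for large data sets.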
S2.6: Rumour detection
Steps S2.1 to S2.5 yield the detection model $F_M(x)$.
1) For a microblog message whose status (rumour or non-rumour) is unknown, extract the values of its 23 features according to the feature set in Table 1.
2) Compute $F_M(x)$ and, given a decision threshold $\theta$, predict the label of $x$ as follows:
$\hat{y}(x) = 1$ (rumour) if $F_M(x) > \theta$; $\hat{y}(x) = 0$ (non-rumour) if $F_M(x) \le \theta$.
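The decision rule, as stated in claim 3 (a post is a rumour when $F_M(x) > \theta$ and a non-rumour when $F_M(x) \le \theta$), amounts to a single comparison. In the sketch below the function name and the example scores are hypothetical; the model output is passed in as a plain number.

```python
def predict_label(model_score: float, theta: float) -> int:
    """Return 1 (rumour) if F_M(x) > theta, otherwise 0 (non-rumour)."""
    return 1 if model_score > theta else 0


# Example with made-up scores and threshold; note that a score equal to
# theta is classified as non-rumour because the comparison is strict.
labels = [predict_label(s, theta=0.5) for s in (0.8, 0.5, 0.1)]
```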

Claims (3)

1. A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set, characterized by comprising: the microblog rumour detection algorithm based on gradient boosted trees provided by the invention, and the rumour detection feature set provided by the invention.
2. The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set according to claim 1, characterized in that the detection features included in the rumour detection feature set are: time interval, microblog length, question mark count, exclamation mark count, reference count, bracket count, first-person word count, second-person word count, third-person word count, quantity, topic count, date count, numeral count, emoticon count, friend count, follower count, mutual-follow count, total microblog count, user influence, comment count, repost count, like count, and registration time; wherein user influence is calculated as follows:
3. The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set according to claim 1, characterized in that the microblog rumour detection algorithm based on gradient boosted trees is as follows: on the data set obtained after feature extraction, which contains $N$ samples $(x_i, y_i)$, $1 \le i \le N$, where $x_i$ is the feature vector of a sample and $y_i$ is the label of the sample, a detection model $F_M(x)$ is obtained by training, and the model $F_M(x)$ is then used for rumour detection; the specific steps are as follows:
Step 1: Feature extraction
1) For the data set containing $N$ samples, extract the values of the 23 rumour detection features of claim 2; then, for a rumour sample $x_i$, set its label $y_i$ to 1, and for a non-rumour sample $x_i$, set its label $y_i$ to 0; finally obtain $N$ labelled training samples $(x_i, y_i)$, $1 \le i \le N$;
Step 2: Obtain the detection model $F_M(x)$ by training
1) Given a positive integer $M$ whose value is the number of training iterations; initialize $F_0(x)$ to a constant, i.e. $F_0(x) = h_0$ ($0 < h_0 < 1$); let $m = 1$;
2) While $1 \le m < M$:
2.1) For each sample $x_i$ ($1 \le i \le N$), update its label $y_i$ to its label for the $m$-th iteration, computed as follows:
2.2) On the sample set with updated labels, construct a regression tree $h_m(x)$; the construction of $h_m(x)$ proceeds as follows:
2.2.1) For each feature in the feature set and each value of that feature, form a split $(j, s)$, where $j$ denotes the feature $x_j$ and $s$ is one of its values; the split divides the sample set into two regions $R_{left}$ and $R_{right}$, where:
$R_{left}(j, s) = \{x \mid x_j \le s\}$,
$R_{right}(j, s) = \{x \mid x_j > s\}$
Then compute the predicted value in each region:
2.2.2) Compute the loss $L(j, s)$ of the split:
2.2.3) Among all splits, find the optimal split $(j, s)^*$ that minimizes the loss $L(j, s)$, and use it to divide the data into two regions;
2.2.4) Recursively split the data regions until the depth of the regression tree reaches $P$, at which point the regression tree $h_m(x)$ is obtained;
2.3) Compute the weight $\gamma_m$ of $h_m(x)$ as follows:
where $L$ is the loss function, defined as follows:
2.4) Obtain $F_m(x)$ as follows:
$F_m(x) = F_{m-1}(x) + \alpha \gamma_m h_m(x)$
where $\alpha$ is the smoothing parameter ($0 < \alpha \le 1$);
2.5) Increase $m$ by 1 and return to 2) of step 2;
3) When $m = M$, the detection model $F_M(x)$ is obtained, as shown below:
Step 3: Rumour detection
For an unlabelled microblog post $x$, compute $F_M(x)$; given a decision threshold $\theta$, if $F_M(x) > \theta$, then $x$ is a rumour post; if $F_M(x) \le \theta$, then $x$ is a non-rumour post.
CN201810576095.XA 2018-06-06 2018-06-06 A microblog rumour detection algorithm based on gradient boosted trees, and a rumour detection feature set Pending CN108763574A (en)

Publications (1)

Publication Number Publication Date
CN108763574A (en) 2018-11-06

Family

ID=64000204

Country Status (1)

Country Link
CN (1) CN108763574A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181106