CN108763574A - A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set - Google Patents
- Publication number: CN108763574A
- Application number: CN201810576095.XA
- Authority: CN (China)
- Prior art keywords: rumour, microblog, detection, sample, feature
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set. A feature set for rumour detection is provided, comprising 23 features. A rumour detection algorithm based on gradient boosted trees is also provided. The algorithm first constructs training samples from the features in the feature set; the training samples are used to train the microblog rumour detection model. Training is then repeated on the training sample set to obtain multiple regression tree models, each of which gives a predicted value, and the final microblog rumour detection model combines the predicted values of the regression trees. To detect a rumour, the features of the microblog post to be predicted are extracted according to the feature set, the detection model computes a predicted value for the post, and the post is classified as a rumour or a non-rumour according to that value. Compared with existing microblog rumour detection algorithms, the algorithm and feature set provided by the invention achieve higher rumour detection accuracy; in particular, in the early period after a rumour is published, the detection accuracy is significantly higher than that of existing algorithms.
Description
Technical field
The present invention relates to the technical field of microblog rumour detection, and in particular to a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set.
Background art
The diversity of microblog information, freedom of expression, and explosive propagation speed encourage the creation and spread of rumours, making microblogs an ideal place for false information to spread. Rumour detection algorithms emerged in order to detect rumours and stop their spread in time.
Existing microblog rumour detection algorithms are not accurate enough; accuracy is especially low in the early period after a rumour is published. This is an important deficiency of existing microblog rumour detection algorithms.
Summary of the invention
To address the deficiency of existing microblog rumour detection algorithms, the present invention provides a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set. Compared with existing microblog rumour detection algorithms, the algorithm and feature set provided by the invention achieve higher rumour detection accuracy; in particular, in the early period after a rumour is published, the detection accuracy is significantly higher than that of existing algorithms.
The present invention comprises the following:
1. A feature set for rumour detection, which is used to perform rumour detection. Compared with existing microblog rumour detection algorithms, the detection features constructed by the invention help improve the accuracy of early rumour detection. The specific feature set is shown in Table 1.
Table 1
2. A rumour detection algorithm based on gradient boosted trees. The gradient boosted tree algorithm is used (i.e. S1 and S2 in Fig. 1, detailed in S1.1-S1.2 and S2.1-S2.6). Features are first extracted according to Table 1 to obtain the training sample set. Regression tree models are then trained on the training set; the weight of each regression tree is computed by formula (1), and the sample labels are updated by formula (2). Label updating and regression tree training are repeated until multiple regression trees are obtained, and formula (3) combines them into the final detection model. Finally, formula (4) predicts the label of a microblog post whose label is unknown.
Brief description of the drawings
Fig. 1 is a flow chart of the microblog rumour detection algorithm based on gradient boosted trees provided by the invention.
Fig. 2 is the flow chart of S1 in Fig. 1.
Fig. 3 is the flow chart of S2 in Fig. 1.
Symbols used in the present invention:
$x_i$: features of sample $i$
$y_i$: label of sample $i$
$N$: number of training samples
$\gamma_m$: weight of the $m$-th regression tree
$\alpha$: smoothing parameter
$h_0$: initial predicted value
$L$: cost (loss) function
[symbol not reproduced]: label of sample $x_i$ in the $m$-th iteration
$F_M(x)$: final prediction model
$\theta$: threshold used to decide the output label
Detailed description
The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set disclosed by the invention comprises two parts: the rumour detection algorithm based on gradient boosted trees, and the feature set used for rumour detection.
The overall flow chart of the microblog rumour detection algorithm based on gradient boosted trees is shown in Fig. 1. Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
I. Data processing
This part corresponds to S1 in Fig. 1; the detailed flow chart is shown in Fig. 2.
S1.1: Extract features
Features are extracted from the raw data: the values of the features in the rumour detection feature set are extracted. The feature set is shown in Table 1.
S1.2: Set labels
For a sample $x_i$ ($1 \le i \le N$), if it is a rumour, its label $y_i$ is set to 1; otherwise, its label $y_i$ is set to 0.
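As an illustrative sketch of S1.1 and S1.2, the Python below computes a handful of text features for one post and attaches the 0/1 label. The feature names, the counting rules, and the `#...#` topic pattern are assumptions made for illustration only; the full feature set has 23 features, and their exact definitions are not given in this text.

```python
import re


def extract_text_features(text):
    """Compute a few illustrative text features of one microblog post.

    The names and counting rules are assumptions, not the definitions
    used in Table 1.
    """
    return {
        "length": len(text),
        # Count both ASCII and full-width punctuation.
        "question_marks": text.count("?") + text.count("？"),
        "exclamation_marks": text.count("!") + text.count("！"),
        # Assumed topic convention: text wrapped in a pair of '#'.
        "topics": len(re.findall(r"#[^#]+#", text)),
    }


def build_sample(text, is_rumour):
    """Return (features, label) with label 1 for rumour and 0 otherwise."""
    features = extract_text_features(text)
    label = 1 if is_rumour else 0
    return features, label
```

Calling `build_sample(post_text, True)` for each annotated post would give the labelled pairs $(x_i, y_i)$ that the training stage consumes; user-level features such as follower count would have to come from account data, which this sketch does not cover.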
II. Detection model construction and rumour detection
This part corresponds to S2 in Fig. 1; the detailed flow chart is shown in Fig. 3.
S2: Model construction
This stage obtains the detection model $F_M(x)$ from the data set and uses $F_M(x)$ to perform rumour detection.
S2.1: Initialization
Set the number of regression trees $M$, the maximum depth of a regression tree $P$, the smoothing parameter $\alpha$, and the decision threshold $\theta$.
S2.2: Traverse features and feature values
1) Traverse the feature set $x_j \in \{x_1, x_2, x_3, \dots, x_{23}\}$ ($j = 1, 2, \dots, 23$) and all values $s$ of each feature $x_j$.
S2.3: Compute the split loss
1) A split $(j, s)$ divides the training data into two regions $R_{left}$ and $R_{right}$, where $R_{left}(j, s) = \{x \mid x_j \le s\}$ and $R_{right}(j, s) = \{x \mid x_j > s\}$. Compute the predicted value of each region:
[formula image not reproduced]
Compute the loss $L(j, s)$ of the split:
[formula image not reproduced]
2) Find the optimal split $(j, s)^*$ that minimizes the loss $L(j, s)$, and use it as the final split to divide the training data into two regions.
3) Recursively divide the data regions until the depth of the regression tree reaches $P$, obtaining the regression tree $h_m(x)$.
S2.4: Compute the weight of the current regression tree
1) Compute the weight $\gamma_m$ of the regression tree:
[formula (1) image not reproduced]
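As a hedged sketch of what this step computes, the weight can be written as a line search over the loss. Both the line-search form and the squared-error loss below are assumptions borrowed from the usual gradient boosting formulation; the definition of $L$ used by the invention is not given in this text.

```latex
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr),
\qquad
L(y, F) = \tfrac{1}{2}(y - F)^2
\;\Longrightarrow\;
\gamma_m = \frac{\sum_{i=1}^{N} \bigl(y_i - F_{m-1}(x_i)\bigr)\, h_m(x_i)}
                {\sum_{i=1}^{N} h_m(x_i)^2}
```

The closed form follows from setting the derivative with respect to $\gamma$ to zero. Under the further assumption that each leaf of $h_m$ predicts the mean residual $y_i - F_{m-1}(x_i)$ of its region, the numerator and denominator coincide and $\gamma_m = 1$, so the smoothing parameter $\alpha$ alone sets the step size.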
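S2.5: Update the prediction target
1) Update the value of $y$ for each sample in the training set:
[formula (2) image not reproduced]
Update the overall model:
$F_m(x) = F_{m-1}(x) + \alpha \gamma_m h_m(x)$.
2) Finally, obtain the detection model $F_M(x)$; unrolling the update above over $m = 1, \dots, M$ gives:
$F_M(x) = F_0(x) + \alpha \sum_{m=1}^{M} \gamma_m h_m(x)$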
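Steps S2.1 to S2.5 can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: it assumes a squared-error loss, leaf values equal to the mean label of each region, and residuals $y_i - F_{m-1}(x_i)$ as the updated labels, none of which is confirmed by the text. The function names and the default values of `M`, `P`, `alpha`, and `h0` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def best_split(X, r):
    """Scan every feature j and value s; return (loss, j, s) with least squared error."""
    best = None
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [r[i] for i, row in enumerate(X) if row[j] <= s]
            right = [r[i] for i, row in enumerate(X) if row[j] > s]
            if not left or not right:
                continue  # a split must leave samples on both sides
            cl, cr = mean(left), mean(right)  # assumed region predictions
            loss = sum((v - cl) ** 2 for v in left) + sum((v - cr) ** 2 for v in right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best


def build_tree(X, r, depth, max_depth):
    """Recursively split until max_depth; a leaf is a float, a node is (j, s, left, right)."""
    split = best_split(X, r) if depth < max_depth else None
    if split is None:
        return mean(r)
    _, j, s = split
    li = [i for i, row in enumerate(X) if row[j] <= s]
    ri = [i for i, row in enumerate(X) if row[j] > s]
    return (j, s,
            build_tree([X[i] for i in li], [r[i] for i in li], depth + 1, max_depth),
            build_tree([X[i] for i in ri], [r[i] for i in ri], depth + 1, max_depth))


def tree_predict(tree, x):
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if x[j] <= s else right
    return tree


def train(X, y, M=20, P=2, alpha=0.5, h0=0.5):
    """Fit M regression trees; returns (h0, alpha, [(gamma_m, tree_m), ...])."""
    F = [h0] * len(X)  # F_0(x) = h0
    trees = []
    for _ in range(M):
        r = [yi - fi for yi, fi in zip(y, F)]  # assumed label update: residuals
        tree = build_tree(X, r, 0, P)
        h = [tree_predict(tree, x) for x in X]
        denom = sum(v * v for v in h)
        # Closed-form line search for squared loss (an assumption).
        gamma = sum(ri * hi for ri, hi in zip(r, h)) / denom if denom else 0.0
        F = [fi + alpha * gamma * hi for fi, hi in zip(F, h)]  # F_m = F_{m-1} + alpha*gamma*h_m
        trees.append((gamma, tree))
    return h0, alpha, trees


def predict_score(model, x):
    """Evaluate F_M(x) = h0 + alpha * sum(gamma_m * h_m(x))."""
    h0, alpha, trees = model
    return h0 + sum(alpha * g * tree_predict(t, x) for g, t in trees)
```

The exhaustive split search is quadratic in the number of samples per feature, which is fine for a sketch but would usually be replaced by sorted scans or histogram binning on real data.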
S2.6: Perform rumour detection
Steps S2.1 to S2.5 yield the detection model $F_M(x)$.
1) For a microblog post not yet known to be a rumour or a non-rumour, extract the values of its 23 features according to the feature set in Table 1.
2) Compute $F_M(x)$ and, given the decision threshold $\theta$, predict the label of $x$ as follows: the label is 1 (rumour) if $F_M(x) > \theta$, and 0 (non-rumour) if $F_M(x) \le \theta$.
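The decision rule of S2.6 amounts to a single comparison, sketched below. The default threshold of 0.5 and the example scores are hypothetical values chosen for illustration; the text does not state a value for $\theta$.

```python
def classify(score, theta=0.5):
    """Return 1 (rumour) if the model score F_M(x) exceeds theta, else 0 (non-rumour)."""
    return 1 if score > theta else 0


# Hypothetical model scores F_M(x) for three microblog posts.
scores = [0.91, 0.50, 0.12]
labels = [classify(s) for s in scores]
```

A score exactly equal to the threshold falls on the non-rumour side, matching the $F_M(x) \le \theta$ branch.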
Claims (3)
1. A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set, characterized by comprising the microblog rumour detection algorithm based on gradient boosted trees provided by the invention and the rumour detection feature set provided by the invention.
2. The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set according to claim 1, characterized in that the detection features in the rumour detection feature set are: time interval, microblog length, question mark count, exclamation mark count, reference count, bracket count, first-person word count, second-person word count, third-person word count, […] count, topic count, date count, numeral count, emoticon count, friend count, follower count, mutual-follow count, total microblog count, user influence, comment count, repost count, like count, and registration time; wherein the user influence is calculated as follows:
[formula image not reproduced]
3. The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set according to claim 1, characterized in that the microblog rumour detection algorithm based on gradient boosted trees is as follows: on the data set obtained after feature extraction, which contains $N$ samples $(x_i, y_i)$, $1 \le i \le N$, where $x_i$ is the feature vector of a sample and $y_i$ is its label, a detection model $F_M(x)$ is obtained by training, and the model $F_M(x)$ is then used to perform rumour detection. The specific steps are as follows:
Step 1: Feature extraction
1) For the data set containing $N$ samples, extract the values of the 23 rumour detection features of claim 2; then, for a rumour sample $x_i$, set its label $y_i$ to 1, and for a non-rumour sample $x_i$, set its label $y_i$ to 0; finally, $N$ labelled training samples $(x_i, y_i)$, $1 \le i \le N$, are obtained;
Step 2: Obtain the detection model $F_M(x)$ by training
1) Given a positive integer $M$, whose value is the number of training iterations; initialize $F_0(x)$ to a constant, i.e. let $F_0(x) = h_0$ ($0 < h_0 < 1$); let $m = 1$;
2) While $1 \le m < M$:
2.1) For each sample $x_i$ ($1 \le i \le N$), update its label $y_i$ to the label for the $m$-th iteration, computed as follows:
[formula image not reproduced]
2.2) On the sample set with updated labels, construct a regression tree $h_m(x)$; the construction process of $h_m(x)$ is:
2.2.1) For each feature in the feature set and each value of that feature, i.e. each pair $(j, s)$ (where $j$ denotes a feature $x_j$ and $s$ is one of its values), a split divides the sample set into two regions $R_{left}$ and $R_{right}$, where:
$R_{left}(j, s) = \{x \mid x_j \le s\}$,
$R_{right}(j, s) = \{x \mid x_j > s\}$
Then, compute the predicted value of each region:
[formula image not reproduced]
2.2.2) Compute the loss $L(j, s)$ of the split:
[formula image not reproduced]
2.2.3) Among all splits, find the optimal split $(j, s)^*$ that minimizes the loss $L(j, s)$, and use it to divide the data into two regions;
2.2.4) Recursively divide the data regions until the depth of the regression tree reaches $P$, at which point the regression tree $h_m(x)$ is obtained;
2.3) Compute the weight $\gamma_m$ of $h_m(x)$ as follows:
[formula image not reproduced]
where $L$ is the loss function, defined as follows:
[formula image not reproduced]
2.4) Obtain $F_m(x)$ as follows:
$F_m(x) = F_{m-1}(x) + \alpha \gamma_m h_m(x)$
where $\alpha$ is the smoothing parameter ($0 < \alpha \le 1$);
2.5) Add 1 to the value of $m$ and go to 2) of Step 2;
3) When $m = M$, the detection model $F_M(x)$ is obtained, as shown in the following formula:
[formula image not reproduced]
Step 3: Perform rumour detection
For an unlabelled microblog post $x$, compute $F_M(x)$; given a decision threshold $\theta$, if $F_M(x) > \theta$, then $x$ is a rumour post; if $F_M(x) \le \theta$, then $x$ is a non-rumour post.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576095.XA CN108763574A (en) | 2018-06-06 | 2018-06-06 | A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108763574A (en) | 2018-11-06 |
Family
ID=64000204
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763574A (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013178606A1 (en) * | 2012-05-30 | 2013-12-05 | Gemalto S.A. | Smart card graphical and electrical personalization module and method of graphical and electrical personalization for smart cards |
US20140016510A1 (en) * | 2012-07-16 | 2014-01-16 | Cisco Technology, Inc. | Methods and apparatus for efficient decentralized information dissemination in a network |
US20140278992A1 (en) * | 2013-03-15 | 2014-09-18 | Nfluence Media, Inc. | Ad blocking tools for interest-graph driven personalization |
CN104361231A (en) * | 2014-11-11 | 2015-02-18 | 电子科技大学 | Method for controlling rumor propagation in complicated network |
CN106776528A (en) * | 2015-11-19 | 2017-05-31 | 中国移动通信集团公司 | A kind of information processing method and device |
CN106919579A (en) * | 2015-12-24 | 2017-07-04 | 腾讯科技(深圳)有限公司 | A kind of information processing method and device, equipment |
CN106940732A (en) * | 2016-05-30 | 2017-07-11 | 国家计算机网络与信息安全管理中心 | A kind of doubtful waterborne troops towards microblogging finds method |
CN106097111A (en) * | 2016-06-20 | 2016-11-09 | 重庆房慧科技有限公司 | A kind of public opinion prediction method based on the big data of intelligence community network |
CN106202211A (en) * | 2016-06-27 | 2016-12-07 | 四川大学 | A kind of integrated microblogging rumour recognition methods based on microblogging type |
CN107256245A (en) * | 2017-06-02 | 2017-10-17 | 河海大学 | Improved and system of selection towards the off-line model that refuse messages are classified |
Non-Patent Citations (6)
Title |
---|
HUIRU YUAN等: "On predicting event propagation on Weibo", 《2017 INTERNATIONAL CONFERENCE ON SERVICE SYSTEMS AND SERVICE MANAGEMENT》 * |
QIAO ZHANG等: "Automatic Detection of Rumor on Social Network", 《NLPCC 2015: NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING》 * |
刘建平PINARD: "梯度提升树(GBDT)原理小结", 《HTTPS://WWW.CNBLOGS.COM/PINARD/P/6140514.HTML》 * |
段大高等: "基于梯度提升决策树的微博虚假消息检测", 《计算机应用》 * |
熊枭: "基于集成分类器的微博谣言检测算法研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
霍良安: "突发事件发生后不实信息的传播问题研究", 《中国博士学位论文全文数据库信息科技辑》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670484A (en) * | 2019-01-16 | 2019-04-23 | 电子科技大学 | A kind of mobile phone individual discrimination method based on bispectrum feature and boosted tree |
CN109670484B (en) * | 2019-01-16 | 2022-03-25 | 电子科技大学 | Mobile phone individual identification method based on bispectrum characteristics and lifting tree |
CN110807556A (en) * | 2019-11-05 | 2020-02-18 | 重庆邮电大学 | Method and device for predicting propagation trend of microblog rumors or/and dagger rumors |
CN110807556B (en) * | 2019-11-05 | 2022-05-31 | 重庆邮电大学 | Method and device for predicting propagation trend of microblog rumors or/and dagger topics |
CN112749559A (en) * | 2021-01-19 | 2021-05-04 | 北京邮电大学 | Microblog rumor detection model training method, microblog rumor detection method and microblog rumor detection device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181106 |