CN108763574A

CN108763574A - A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set

Info

Publication number: CN108763574A
Application number: CN201810576095.XA
Authority: CN
Inventors: 杨波; 熊枭
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2018-11-06

Abstract

The invention discloses a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set. A feature set for rumour detection is provided, comprising 23 features. A rumour detection algorithm based on gradient boosted trees is also provided. The algorithm first constructs training samples according to the features in the feature set; the training samples are used to train the microblog rumour detection model. Training is then repeated on the training sample set to obtain multiple regression tree models, each of which gives a predicted value, and the final microblog rumour detection model combines the predicted values of the regression trees. To perform rumour detection, the features of the microblog post to be predicted are extracted according to the feature set, the detection model computes a predicted value for that post, and the post is judged to be a rumour or a non-rumour according to the predicted value. Compared with existing microblog rumour detection algorithms, the algorithm and feature set provided by the invention achieve higher rumour detection accuracy; in particular, in the early stage after a rumour is published, the detection accuracy is significantly higher than that of existing algorithms.

Description

A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set

Technical field

The present invention relates to the technical field of microblog rumour detection, and in particular to a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set.

Background

Features of microblogs such as diverse information, freedom of speech and explosive propagation speed encourage the generation and spread of rumours, making microblogs an ideal place for false information to spread. Rumour detection algorithms have emerged to detect rumours and stop their spread in time.

The detection accuracy of existing microblog rumour detection algorithms is not high enough, and it is especially low in the early stage after a rumour is published. This is an important deficiency of existing microblog rumour detection algorithms.

Summary of the invention

To address the deficiency of existing microblog rumour detection algorithms, the present invention provides a microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set. Compared with existing microblog rumour detection algorithms, the algorithm and feature set provided by the invention achieve higher rumour detection accuracy; in particular, in the early stage after a rumour is published, the detection accuracy is significantly higher than that of existing algorithms.

The invention comprises the following:

1. A feature set for rumour detection, which is used to perform rumour detection. Compared with existing microblog rumour detection algorithms, the detection features constructed by the invention help to improve the accuracy of early rumour detection. The specific feature set is shown in Table 1.

Table 1

2. A rumour detection algorithm based on gradient boosted trees. The gradient boosted tree algorithm is used (S1 and S2 in Fig. 1, specifically S1.1-S1.2 and S2.1-S2.6). First, feature extraction is performed according to the features in Table 1 to obtain the training sample set. Regression tree models are then trained on the training data set: the weight of each regression tree is computed according to formula (1), and the label values of the samples are updated according to formula (2). Label updating and regression tree training are repeated until multiple regression trees are obtained, and the final detection model is obtained by combining them according to formula (3). Finally, formula (4) is used to predict the label of a microblog post whose label is unknown.

Description of the drawings

Fig. 1 is a flowchart of the microblog rumour detection algorithm based on gradient boosted trees provided by the invention.

Fig. 2 is the flow chart of S1 in Fig. 1.

Fig. 3 is the flow chart of S2 in Fig. 1.

Symbols used in the present invention:

x_i - features of sample i

y_i - label of sample i

N - number of training samples

γ_m - weight of the m-th regression tree

α - smoothing parameter

h₀ - initial prediction

L - cost (loss) function

- label of sample x_i in the m-th round of iteration

F_M(x) - final prediction model

θ - threshold used to decide the output label

Detailed description

The microblog rumour detection algorithm based on gradient boosted trees and the rumour detection feature set disclosed by the invention comprise two parts: the rumour detection algorithm based on gradient boosted trees, and the feature set used for rumour detection.

The overall flowchart of the microblog rumour detection algorithm based on gradient boosted trees is shown in Fig. 1. Specific embodiments of the invention are described in detail below with reference to the drawings.

I. Data processing

This part corresponds to S1 in Fig. 1; the detailed flowchart is shown in Fig. 2.

S1.1: Extract features

Perform feature extraction on the raw data, extracting the values of the features in the rumour detection feature set; the feature set is shown in Table 1.

S1.2: Set labels

For a sample x_i (1 ≤ i ≤ N), set its label y_i to 1 if it is a rumour; otherwise, set its label y_i to 0.
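Steps S1.1 and S1.2 can be illustrated with a minimal Python sketch. It covers only four of the content features named in the feature set (microblog length, question mark count, exclamation mark count, bracket count); the function names `extract_text_features` and `build_training_set`, the choice to count full-width punctuation as well, and the choice to count individual bracket characters are assumptions made for illustration, not details taken from the patent.

```python
def extract_text_features(text):
    """Return a small feature vector for one microblog post (illustrative subset)."""
    return [
        len(text),                                    # microblog length
        sum(text.count(c) for c in "?？"),             # question mark count
        sum(text.count(c) for c in "!！"),             # exclamation mark count
        sum(text.count(c) for c in "()（）[]【】"),      # bracket count (characters)
    ]


def build_training_set(posts):
    """posts: iterable of (text, is_rumour) pairs.

    Returns (X, y) where X holds feature vectors and y holds labels,
    with 1 for a rumour and 0 for a non-rumour, as in S1.2.
    """
    X = [extract_text_features(text) for text, _ in posts]
    y = [1 if is_rumour else 0 for _, is_rumour in posts]
    return X, y


X, y = build_training_set([
    ("Is it true?!", True),
    ("Nice day (really)", False),
])
```

A full implementation would extend the vector to all 23 features, including the user and propagation features, which need account metadata rather than text alone.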

II. Detection model construction and rumour detection

This part corresponds to S2 in Fig. 1; the detailed flowchart is shown in Fig. 3.

S2: Model construction

This stage obtains the detection model F_M(x) from the data set and uses F_M(x) to perform rumour detection.

S2.1: Initialization

Set the number of regression trees M, the maximum depth of a regression tree P, the smoothing parameter α and the decision threshold θ.

S2.2: Traverse the features and their values

1) Traverse the feature set x^j ∈ {x¹, x², x³, …, x²³} (j = 1, 2, …, 23) and all values of each feature x^j.

S2.3: Compute the feature loss

1) A split (j, s) divides the training data into two regions R_left and R_right, where R_left(j, s) = {x | x^j ≤ s} and R_right(j, s) = {x | x^j > s}. Compute the predicted value of each region:

Compute the loss L(j, s) of the split:

2) Find the optimal split (j, s)* that minimises the loss L(j, s), use it as the final split, and divide the training data into two regions.

3) Recursively split the data regions until the depth of the regression tree reaches P, yielding the regression tree h_m(x).

S2.4: Compute the weight of the current regression tree

1) Compute the weight of the regression tree:

S2.5: Update the prediction target

1) Update the value of y for each sample in the training set:

Update the overall model:

F_m(x) = F_{m-1}(x) + α γ_m h_m(x).

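2) Finally, the detection model F_M(x) is obtained. Starting from the initial prediction h₀ and applying the update above for m = 1, …, M gives:

F_M(x) = h₀ + α Σ_{m=1..M} γ_m h_m(x).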
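The training procedure in S2.2 to S2.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's own formulas, which are not reproduced in this text: it assumes a squared-error loss, takes the predicted value of a region to be the mean of the current labels in it, takes the updated label of a sample to be the residual y_i - F_{m-1}(x_i), and obtains γ_m by a least-squares line search. The names `best_split`, `build_tree`, `tree_predict`, `fit` and `predict_score` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def best_split(X, r):
    """Return (loss, j, s) for the split with the smallest squared error, or None."""
    best = None
    for j in range(len(X[0])):
        # Skip the largest value: it would leave R_right empty.
        for s in sorted({x[j] for x in X})[:-1]:
            left = [ri for x, ri in zip(X, r) if x[j] <= s]
            right = [ri for x, ri in zip(X, r) if x[j] > s]
            c_left, c_right = mean(left), mean(right)
            loss = (sum((v - c_left) ** 2 for v in left)
                    + sum((v - c_right) ** 2 for v in right))
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best


def build_tree(X, r, depth):
    """Recursively split until the depth budget is used up (S2.3, step 3)."""
    split = best_split(X, r) if depth > 0 else None
    if split is None:
        return mean(r)  # leaf: constant prediction for this region
    _, j, s = split
    left = [(x, ri) for x, ri in zip(X, r) if x[j] <= s]
    right = [(x, ri) for x, ri in zip(X, r) if x[j] > s]
    return (j, s,
            build_tree([x for x, _ in left], [ri for _, ri in left], depth - 1),
            build_tree([x for x, _ in right], [ri for _, ri in right], depth - 1))


def tree_predict(tree, x):
    while isinstance(tree, tuple):
        j, s, low, high = tree
        tree = low if x[j] <= s else high
    return tree


def fit(X, y, M=20, P=2, alpha=0.5, h0=0.5):
    """Train M regression trees of maximum depth P; returns (h0, alpha, trees)."""
    trees = []
    F = [h0] * len(X)                                   # F_0(x) = h0
    for _ in range(M):
        r = [yi - Fi for yi, Fi in zip(y, F)]           # assumed label update
        tree = build_tree(X, r, P)
        h = [tree_predict(tree, x) for x in X]
        denom = sum(hi * hi for hi in h)
        # Assumed weight: least-squares line search for gamma_m.
        gamma = sum(ri * hi for ri, hi in zip(r, h)) / denom if denom else 0.0
        trees.append((gamma, tree))
        F = [Fi + alpha * gamma * hi for Fi, hi in zip(F, h)]
    return h0, alpha, trees


def predict_score(model, x):
    """Evaluate F_M(x) = h0 + alpha * sum(gamma_m * h_m(x))."""
    h0, alpha, trees = model
    return h0 + alpha * sum(g * tree_predict(t, x) for g, t in trees)


model = fit([[1], [2], [8], [9]], [0, 0, 1, 1], M=20, P=1)
```

Under these assumptions the line search returns γ_m = 1 whenever the tree output is non-zero, because each leaf already holds the mean residual of its region; a different loss function would generally give other weights.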

S2.6: Perform rumour detection

The detection model F_M(x) is obtained through steps S2.1 to S2.5.

1) For a microblog post whose status as rumour or non-rumour is unknown, extract the values of the 23 features of the post to be predicted according to the feature set in Table 1.

2) Compute the value of F_M(x) and set a decision threshold θ. The label of x is predicted as follows: if F_M(x) > θ, x is labelled 1 (rumour); if F_M(x) ≤ θ, x is labelled 0 (non-rumour).
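The detection step can be illustrated with a short Python sketch. It assumes the combined model has the form F_M(x) = h₀ + α Σ γ_m h_m(x), which follows from starting at the initial prediction h₀ and applying the update rule of S2.5; the function name `detect`, the representation of each tree as a callable, and the toy values are hypothetical.

```python
def detect(x, h0, alpha, weighted_trees, theta):
    """Return (score, label), where label 1 means rumour and 0 means non-rumour.

    weighted_trees is a list of (gamma_m, h_m) pairs, each h_m being a
    callable that maps a feature vector to a predicted value.
    """
    score = h0 + alpha * sum(gamma * tree(x) for gamma, tree in weighted_trees)
    return score, (1 if score > theta else 0)


# Toy model with a single depth-1 tree on feature 0 (values are made up).
toy_trees = [(1.0, lambda x: 0.4 if x[0] > 3 else -0.4)]
score, label = detect([5], h0=0.5, alpha=0.5, weighted_trees=toy_trees, theta=0.5)
```

Raising θ makes the detector more conservative about flagging a post as a rumour, while lowering it flags more posts; the patent leaves the value of θ as a parameter set in S2.1.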

Claims

1. A microblog rumour detection algorithm based on gradient boosted trees and a rumour detection feature set, characterized by comprising: the microblog rumour detection algorithm based on gradient boosted trees provided by the invention, and the rumour detection feature set provided by the invention.

2. The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set according to claim 1, characterized in that the detection features included in the rumour detection feature set are: time interval, microblog length, question mark count, exclamation mark count, quotation count, bracket count, first-person word count, second-person word count, third-person word count, quantity, topic count, date count, number count, emoticon count, friend count, follower count, mutual-follow count, total microblog count, user influence, comment count, repost count, like count, registration time; wherein the user influence is calculated as follows:

3. The microblog rumour detection algorithm based on gradient boosted trees and rumour detection feature set according to claim 1, characterized in that the microblog rumour detection algorithm based on gradient boosted trees is as follows: on the data set obtained after feature extraction, which contains N samples (x_i, y_i), 1 ≤ i ≤ N, where x_i is the feature vector of a sample and y_i is the label of the sample, a detection model F_M(x) is obtained by training, and the model F_M(x) is then used to perform rumour detection; the specific steps are as follows:

Step 1: Feature extraction

1) In the data set containing N samples, extract the values of the 23 rumour detection features of claim 2; then, for a rumour sample x_i, set its label y_i to 1, and for a non-rumour sample x_i, set its label y_i to 0; this finally yields N labelled training samples (x_i, y_i), 1 ≤ i ≤ N;

Step 2: Obtain the detection model F_M(x) by training

1) Given a positive integer M, whose value is the number of training iterations; initialize F₀(x) to a constant, i.e. let F₀(x) = h₀ (0 < h₀ < 1); let m = 1;

2) While 1 ≤ m < M:

2.1) For each sample x_i (1 ≤ i ≤ N), update its label y_i; the updated label is computed as follows:

2.2) On the sample set with updated labels, construct a regression tree h_m(x); the construction procedure of h_m(x) is:

2.2.1) For each feature in the feature set and each value of that feature, form a pair (j, s), where j denotes a feature x^j and s is one value of that feature; each pair defines a split that divides the sample set into two regions R_left and R_right, where:

R_left(j, s) = {x | x^j ≤ s},

R_right(j, s) = {x | x^j > s}

Then compute the predicted value of each region:

2.2.2) Compute the loss L(j, s) of the split:

2.2.3) Among all splits, find the optimal split (j, s)* that minimises the loss L(j, s);

and use this split to divide the data into two regions;

2.2.4) Recursively split the data regions until the depth of the regression tree reaches P, at which point the regression tree h_m(x) is obtained;

2.3) Compute the weight γ_m of h_m(x), as follows:

where L is the loss function, defined as follows:

2.4) Obtain F_m(x), computed as follows:

F_m(x) = F_{m-1}(x) + α γ_m h_m(x)

where α is the smoothing parameter (0 < α ≤ 1);

2.5) Increase the value of m by 1 and go to 2) of Step 2;

3) When m = M, the detection model F_M(x) is obtained, as shown below:

Step 3: Perform rumour detection

For an unlabelled microblog post x, compute F_M(x); given a decision threshold θ, if F_M(x) > θ, x is a rumour post; if F_M(x) ≤ θ, x is a non-rumour post.