CN108009249A - Spam comment filtering method for unbalanced data and fusing user behavior rules - Google Patents

Spam comment filtering method for unbalanced data and fusing user behavior rules

Info

Publication number
CN108009249A
CN108009249A
Authority
CN
China
Prior art keywords
data
sample
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711247021.3A
Other languages
Chinese (zh)
Other versions
CN108009249B (en)
Inventor
丁茜
武琼
孙剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Television Information Technology (Beijing) Co Ltd
Original Assignee
China Television Information Technology (Beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Television Information Technology (Beijing) Co Ltd
Priority to CN201711247021.3A priority Critical patent/CN108009249B/en
Publication of CN108009249A publication Critical patent/CN108009249A/en
Application granted granted Critical
Publication of CN108009249B publication Critical patent/CN108009249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, including: reconstructing the minority-class sample data and undersampling the majority-class sample data in the majority-class sample data set, and recomposing a balanced training sample corpus; using the training sample corpus to extract classification feature words and train the constructed Bayesian model; invoking a user behavior rule model to detect whether the comment data to be classified is a spam comment and to preliminarily classify it; extracting features from the comment data to be classified and classifying it with the trained Bayesian classification model; and using the Adaboosting algorithm to perform ensemble learning on the user behavior rule model and the Bayesian classification model, training with labeled training sample data to obtain the final classification result of the comment data to be classified. Advantage: the efficiency of spam comment filtering is comprehensively improved.

Description

Spam comment filtering method for unbalanced data and fusing user behavior rules
Technical field
The invention belongs to the technical field of spam comment filtering, and in particular relates to a spam comment filtering method for unbalanced data fusing user behavior rules.
Background art
With the rapid development of the Internet, more and more people communicate and state their viewpoints and attitudes by posting comments of all kinds. At the same time, this also provides convenience for some malicious users, who post large numbers of advertising, promotional, and abusive comments on platforms intended for normal commenting, so that users cannot obtain useful information and the mining of information is hindered. Spam comment filtering is therefore extremely important.
From a technical point of view, the filtering of such spam comments belongs to the category of text classification. Existing spam comment filtering techniques generally fall into two kinds: supervised learning methods and unsupervised learning methods. Unsupervised learning requires no manually pre-labeled corpus to filter comments by category; however, for the various nonstandard, colloquial comments produced on the Internet, unsupervised learning methods cannot accurately identify and filter out spam comments. Compared with unsupervised learning, supervised learning can to a certain extent handle heavily colloquial text classification, but it requires a large manually labeled corpus in advance, which costs considerable manpower and material resources. Some experts and scholars have attempted to solve this problem with machine learning methods, such as the existing Bayesian classification algorithm for spam filtering. Although that method works well for filtering spam e-mail, its premise is that the training corpus is balanced. In real network media, however, users' comment forms vary and negative examples of spam comments are relatively difficult to collect, so the number of samples of one class in the training data set easily becomes far smaller than that of the other class, causing classification performance to decline sharply. Moreover, in the network media the expression forms of user comments are changeable, and advertisements in particular vary widely, which increases the difficulty of new word recognition and advertisement identification.
Summary of the invention
In view of the defects existing in the prior art, the present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, which can effectively solve the above problems.
The technical solution adopted by the present invention is as follows:
The present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, including the following steps:
Step 1, crawl comment sample data on network media channels to form a comment sample data set, and label the comment data type of each comment sample; wherein the comment data type includes a positive example sample type and a negative example sample type.
Step 2, count the numbers of positive example samples and negative example samples in the comment sample data set. If the number of positive example samples is greater than the number of negative example samples, take the positive example samples as the majority-class sample data and the negative example samples as the minority-class sample data; conversely, if the number of positive example samples is smaller than the number of negative example samples, take the positive example samples as the minority-class sample data and the negative example samples as the majority-class sample data. The comment sample data set is thereby divided into a minority-class sample data set and a majority-class sample data set.
Step 3, reconstruct the minority-class samples in the minority-class sample data set, undersample the majority-class samples in the majority-class sample data set, and recompose a balanced training sample corpus. Each training sample in the corpus carries a labeled comment data type, so the training samples are distinguished into positive example training samples and negative example training samples; the corpus includes a positive example training sample library and a negative example training sample library.
Step 4, using the training sample corpus, extract classification feature words and train the constructed Bayesian model. This step specifically includes:
Step 4.1, calculate the prior probability P(c1) of the positive example training samples and the prior probability P(c2) of the negative example training samples according to the formulas:
P(c1) = n1/N
P(c2) = n2/N
wherein: N denotes the total number of training samples in the training sample corpus; n1 denotes the number of positive example training samples in the corpus; n2 denotes the number of negative example training samples in the corpus.
Step 4.2, segment each training sample into words and, based on the following formula, use the information gain method to calculate the contribution value IG(w) of each word w to the classification model:
IG(w) = -[P(c1)·log P(c1) + P(c2)·log P(c2)] + P(w)·[P(c1|w)·log P(c1|w) + P(c2|w)·log P(c2|w)] + P(w̄)·[P(c1|w̄)·log P(c1|w̄) + P(c2|w̄)·log P(c2|w̄)]
wherein:
P(w) denotes the occurrence probability of training samples containing word w in the corpus, i.e.: P(w) = (number of training samples containing word w) / N;
P(c1|w) denotes the conditional probability that a training sample containing word w is a positive example, i.e.: P(c1|w) = (number of positive example training samples containing w) / (number of training samples containing w);
P(c2|w) denotes the conditional probability that a training sample containing word w is a negative example, i.e.: P(c2|w) = (number of negative example training samples containing w) / (number of training samples containing w);
P(w̄) denotes the occurrence probability of training samples not containing word w in the corpus, i.e.: P(w̄) = 1 - P(w);
P(c1|w̄) denotes the conditional probability that a training sample not containing word w is a positive example, i.e.: P(c1|w̄) = (number of positive example training samples not containing w) / (number of training samples not containing w);
P(c2|w̄) denotes the conditional probability that a training sample not containing word w is a negative example, i.e.: P(c2|w̄) = (number of negative example training samples not containing w) / (number of training samples not containing w).
Step 4.3, preset a contribution threshold and extract the words whose contribution value exceeds the threshold as feature words. Supposing m feature words are extracted in total, denoted w1, w2, ..., wm, this gives the feature word set F = {w1, w2, ..., wm}.
Step 4.4, for any feature word wi in the feature word set, i = 1, 2, ..., m, calculate with the following formulas the conditional probability P(wi|c1) that wi belongs to the positive example training sample class and the conditional probability P(wi|c2) that wi belongs to the negative example training sample class:
P(wi|c1) = n(c1,wi) / n(c1)
P(wi|c2) = n(c2,wi) / n(c2)
wherein:
n(c1,wi) denotes the number of positive example training samples in the corpus in which feature word wi appears;
n(c1) denotes the number of positive example training samples in the corpus;
n(c2,wi) denotes the number of negative example training samples in the corpus in which feature word wi appears;
n(c2) denotes the number of negative example training samples in the corpus.
Step 5, for the comment data to be classified, first perform sensitive word detection. If a sensitive word is present, the comment is directly regarded as a spam comment; if no sensitive word is present, proceed to step 6.
Step 6, perform new word discovery detection on the comment data that passed the detection of step 5. If it contains new words, label the new words with a comment data type and add them to the training sample corpus, extending and updating it; if it contains no new words, proceed to step 7.
Step 7, invoke the user behavior rule model to detect whether the comment data to be classified is a spam comment, and preliminarily classify the comment data; then proceed to step 8.
Step 8, extract features from the comment data to be classified and classify it with the trained Bayesian classification model. This step specifically includes:
Step 8.1, perform word segmentation preprocessing on the comment data to be classified, obtaining several segmented words. From these, extract as feature words the segments that belong to the feature word set F obtained in step 4.3. Supposing s feature words are extracted in total, denoted x1, x2, ..., xs, this gives the feature word vector X = {x1, x2, ..., xs} that reflects the class tendency of the comment.
Step 8.2, calculate with the following formulas the prior conditional probability P(X|c1) that the comment data to be classified occurs in the positive example training sample library and the prior conditional probability P(X|c2) that it occurs in the negative example training sample library:
P(X|c1) = ∏_{j=1..s} [uj·p(xj|c1) + (1 - uj)·(1 - p(xj|c1))]
P(X|c2) = ∏_{j=1..s} [uj·p(xj|c2) + (1 - uj)·(1 - p(xj|c2))]
wherein:
uj indicates whether the comment data to be classified contains feature word xj: if it contains xj, uj is 1; otherwise uj is 0;
p(xj|c1) denotes the conditional probability that feature word xj belongs to the positive example training sample class, equal to the value P(wi|c1) calculated in step 4.4 for the same feature word;
p(xj|c2) denotes the conditional probability that feature word xj belongs to the negative example training sample class, equal to the value P(wi|c2) calculated in step 4.4 for the same feature word.
Step 8.3, calculate with the following formula the probability P(X) that the comment data to be classified occurs in the training sample corpus:
P(X) = P(X|c1)·P(c1) + P(X|c2)·P(c2)
Step 8.4, calculate with the following formulas the probability P(c1|X) that the comment data to be classified belongs to the positive example comment class and the probability P(c2|X) that it belongs to the negative example comment class:
P(c1|X) = P(X|c1)·P(c1) / P(X)
P(c2|X) = P(X|c2)·P(c2) / P(X)
Step 9, using the Adaboosting algorithm, perform ensemble learning on the user behavior rule model and the Bayesian classification model, train with the labeled training sample data, and obtain the final classification result Qk(X) of the comment data to be classified with the following formula:
Qk(X) = Σ_{t=1..K} μt·ft(X)
wherein: K is the number of classification models, here the constant 2;
μt denotes the weight of the t-th classification model in the final combined classifier and is always greater than 0;
ft(X) denotes the classification judgment of the t-th classification model in the X-dimensional feature word vector space, where f1(X) denotes the classification judgment of the Bayesian classification model and f2(X) denotes the classification judgment of the user behavior rule model, i.e., the preliminary classification result obtained in step 7.
Preferably, in step 1, the positive example sample type is the normal comment sample type, and the negative example sample type is the spam comment sample type.
Preferably, in step 3, the reconstruction of the minority-class samples in the minority-class sample data set is specifically:
Step 3.1, suppose the minority-class sample data set contains r minority-class samples in total, denoted: minority-class sample L1, minority-class sample L2, ..., minority-class sample Lr.
For each minority-class sample Lh, h = 1, 2, ..., r, calculate the Euclidean distance from Lh to each of the other r - 1 minority-class samples in the set.
Step 3.2, preset a neighbor selection number d, and select the d minority-class samples nearest to Lh as the neighbors of Lh.
Step 3.3, set a sampling multiplier e according to the actual sample ratio and the number of balanced training samples required by the training model.
For each minority-class sample Lh, perform e samplings; in each sampling, randomly select one neighbor from its d neighbors. Supposing the neighbor selected in this sampling is Lq, q = 1, 2, ..., r, the new minority-class sample Lnew constructed by this sampling is:
Lnew = Lh + rand(0,1) × (Lq - Lh)
wherein rand(0,1) denotes a random number generated between 0 and 1, so that the new minority-class sample Lnew lies between the minority-class sample Lh and its neighbor.
New minority-class samples are thus constructed, extending the set.
Preferably, step 9 further includes: according to the misclassification rate error(t) of each classification model trained on the training data, automatically adjusting the weight of the corresponding model with the following formula:
μt = log[(1 - error(t)) / error(t)]
The spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention has the following advantages:
The invention constructs new minority-class samples with a specific method and performs multiple undersamplings of the majority-class samples, recomposing a balanced training sample corpus so that the training sample data reach an equilibrium state; data features are extracted with the characteristics of unbalanced data taken into account, and comments are then classified with the Bayesian classification algorithm. For changeable data such as advertisements and pyramid-scheme promotions, the behavior rules of users are used to make the judgment, and finally the Adaboosting method is used to perform unified ensemble-learning classification of the comment data. The whole spam comment filtering system recognizes newly produced words on the network with the CRF algorithm and continuously updates the corpus, so that the method is better suited to the spam comment filtering scenario of the network media, thereby comprehensively improving the efficiency of spam comment filtering.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention.
Embodiment
In order to make the technical problems solved by the invention, the technical solutions, and the beneficial effects clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
The present invention provides a spam comment filtering method for unbalanced data fusing user behavior rules, belonging to text classification technology in the natural language processing direction of the artificial intelligence field. The invention can be applied to filtering systems for spam comments on media television programs or comments posted by netizens, solving the filtering problem of unbalanced, changeable spam comment data with a method based on integrated sample updating and user behavior rules. The central idea of the invention is: in unbalanced classification based on supervised learning, in order to make full use of the information in the majority-class samples, a training corpus for the ensemble learning algorithm is built by performing multiple undersamplings of the majority-class samples together with resampling of the minority-class samples, and the positive and negative comment sample data fed back under the actual scenario are continuously updated, thereby solving the imbalance problem in spam comment filtering. For junk data such as the various advertisements and pyramid-scheme promotions produced online, new word discovery and corpus updating are performed with conditional random fields (CRF), and user behavior rules are combined to identify spam comments; at the same time, unified classification learning of the comment data is performed in an ensemble-learning manner based on the Bayesian classification algorithm, filtering out spam comment data.
With reference to Fig. 1, the spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention addresses the unbalanced comment tendencies in real network media and the changeable, short, and nonstandard characteristics of user comment data, and designs an ensemble-learning spam comment filtering system fusing user behavior rules for unbalanced data. It specifically includes the following steps:
Step 1, crawl comment sample data on network media channels to form a comment sample data set, and label the comment data type of each comment sample; wherein the comment data type includes a positive example sample type and a negative example sample type. The positive example sample type is the normal comment sample type, and the negative example sample type is the spam comment sample type.
Step 2, count the numbers of positive example samples and negative example samples in the comment sample data set. If the number of positive example samples is far greater than the number of negative example samples, take the positive example samples as the majority-class sample data and the negative example samples as the minority-class sample data; conversely, if the number of positive example samples is far smaller than the number of negative example samples, take the positive example samples as the minority-class sample data and the negative example samples as the majority-class sample data. The comment sample data set is thereby divided into a minority-class sample data set and a majority-class sample data set.
Step 3, reconstruct the minority-class samples in the minority-class sample data set, undersample the majority-class samples in the majority-class sample data set, and recompose a balanced training sample corpus. Each training sample in the corpus carries a labeled comment data type, so the training samples are distinguished into positive example training samples and negative example training samples; the corpus includes a positive example training sample library and a negative example training sample library.
In this step, the reconstruction of the minority-class samples in the minority-class sample data set is specifically:
Step 3.1, suppose the minority-class sample data set contains r minority-class samples in total, denoted: minority-class sample L1, minority-class sample L2, ..., minority-class sample Lr.
For each minority-class sample Lh, h = 1, 2, ..., r, calculate the Euclidean distance from Lh to each of the other r - 1 minority-class samples in the set.
Step 3.2, preset a neighbor selection number d, and select the d minority-class samples nearest to Lh as the neighbors of Lh.
Step 3.3, set a sampling multiplier e according to the actual sample ratio and the number of balanced training samples required by the training model.
For each minority-class sample Lh, perform e samplings; in each sampling, randomly select one neighbor from its d neighbors. Supposing the neighbor selected in this sampling is Lq, q = 1, 2, ..., r, the new minority-class sample Lnew constructed by this sampling is:
Lnew = Lh + rand(0,1) × (Lq - Lh)
wherein rand(0,1) denotes a random number generated between 0 and 1, so that the new minority-class sample Lnew lies between the minority-class sample Lh and its neighbor.
New minority-class samples are thus constructed, extending the set.
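The following is a minimal Python sketch of this SMOTE-style reconstruction (steps 3.1 to 3.3, consistent with the smote algorithm named later in the description); the function name, the array layout (one feature vector per row), and the default parameter values are illustrative assumptions rather than part of the patent:

    import numpy as np

    def smote_like_oversample(minority, d=5, e=2, rng=None):
        """Construct new minority-class samples by interpolating toward
        randomly chosen nearest neighbors (steps 3.1-3.3)."""
        rng = rng or np.random.default_rng()
        minority = np.asarray(minority, dtype=float)  # r samples, one row each
        r = len(minority)
        synthetic = []
        for h in range(r):
            # Step 3.1: Euclidean distances from L_h to the other r-1 samples
            dists = np.linalg.norm(minority - minority[h], axis=1)
            dists[h] = np.inf                         # exclude L_h itself
            # Step 3.2: the d nearest neighbors of L_h
            neighbors = np.argsort(dists)[:d]
            # Step 3.3: e samplings, each interpolating toward a random neighbor
            for _ in range(e):
                q = rng.choice(neighbors)
                gap = rng.random()                    # rand(0,1)
                synthetic.append(minority[h] + gap * (minority[q] - minority[h]))
        return np.vstack([minority, np.array(synthetic)])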
Step 4, using the training sample corpus, extract classification feature words and train the constructed Bayesian model. This step specifically includes:
Step 4.1, calculate the prior probability P(c1) of the positive example training samples and the prior probability P(c2) of the negative example training samples according to the formulas:
P(c1) = n1/N
P(c2) = n2/N
wherein: N denotes the total number of training samples in the training sample corpus; n1 denotes the number of positive example training samples in the corpus; n2 denotes the number of negative example training samples in the corpus.
Step 4.2, segment each training sample into words and, based on the following formula, use the information gain method to calculate the contribution value IG(w) of each word w to the classification model:
IG(w) = -[P(c1)·log P(c1) + P(c2)·log P(c2)] + P(w)·[P(c1|w)·log P(c1|w) + P(c2|w)·log P(c2|w)] + P(w̄)·[P(c1|w̄)·log P(c1|w̄) + P(c2|w̄)·log P(c2|w̄)]
wherein:
P(w) denotes the occurrence probability of training samples containing word w in the corpus, i.e.: P(w) = (number of training samples containing word w) / N;
P(c1|w) denotes the conditional probability that a training sample containing word w is a positive example, i.e.: P(c1|w) = (number of positive example training samples containing w) / (number of training samples containing w);
P(c2|w) denotes the conditional probability that a training sample containing word w is a negative example, i.e.: P(c2|w) = (number of negative example training samples containing w) / (number of training samples containing w);
P(w̄) denotes the occurrence probability of training samples not containing word w in the corpus, i.e.: P(w̄) = 1 - P(w);
P(c1|w̄) denotes the conditional probability that a training sample not containing word w is a positive example, i.e.: P(c1|w̄) = (number of positive example training samples not containing w) / (number of training samples not containing w);
P(c2|w̄) denotes the conditional probability that a training sample not containing word w is a negative example, i.e.: P(c2|w̄) = (number of negative example training samples not containing w) / (number of training samples not containing w).
Step 4.3, preset a contribution threshold and extract the words whose contribution value exceeds the threshold as feature words. Supposing m feature words are extracted in total, denoted w1, w2, ..., wm, this gives the feature word set F = {w1, w2, ..., wm}.
Step 4.4, for any feature word wi in the feature word set, i = 1, 2, ..., m, calculate with the following formulas the conditional probability P(wi|c1) that wi belongs to the positive example training sample class and the conditional probability P(wi|c2) that wi belongs to the negative example training sample class:
P(wi|c1) = n(c1,wi) / n(c1)
P(wi|c2) = n(c2,wi) / n(c2)
wherein:
n(c1,wi) denotes the number of positive example training samples in the corpus in which feature word wi appears;
n(c1) denotes the number of positive example training samples in the corpus;
n(c2,wi) denotes the number of negative example training samples in the corpus in which feature word wi appears;
n(c2) denotes the number of negative example training samples in the corpus.
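A compact Python sketch of steps 4.1 to 4.4, assuming each training sample has already been segmented into a set of words; all function and variable names here are illustrative, and the 0·log 0 = 0 convention is an assumption made to keep the information gain well defined:

    import math
    from collections import Counter

    def train_bayes(pos_docs, neg_docs, threshold):
        """pos_docs/neg_docs: lists of word sets. Returns the priors, the
        feature word set, and per-class conditional probabilities."""
        n1, n2 = len(pos_docs), len(neg_docs)
        N = n1 + n2
        p_c1, p_c2 = n1 / N, n2 / N                  # step 4.1: priors

        docs_with = Counter()                        # word -> #samples containing it
        pos_with, neg_with = Counter(), Counter()
        for doc in pos_docs:
            for w in doc:
                docs_with[w] += 1
                pos_with[w] += 1
        for doc in neg_docs:
            for w in doc:
                docs_with[w] += 1
                neg_with[w] += 1

        def xlogx(p):                                # convention: 0 * log(0) = 0
            return p * math.log(p) if p > 0 else 0.0

        def info_gain(w):                            # step 4.2
            pw = docs_with[w] / N
            ig = -(xlogx(p_c1) + xlogx(p_c2))
            ig += pw * (xlogx(pos_with[w] / docs_with[w])
                        + xlogx(neg_with[w] / docs_with[w]))
            n_not = N - docs_with[w]                 # samples not containing w
            if n_not:
                ig += (1 - pw) * (xlogx((n1 - pos_with[w]) / n_not)
                                  + xlogx((n2 - neg_with[w]) / n_not))
            return ig

        features = [w for w in docs_with if info_gain(w) > threshold]       # step 4.3
        cond = {w: (pos_with[w] / n1, neg_with[w] / n2) for w in features}  # step 4.4
        return p_c1, p_c2, features, cond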
Step 5, for the comment data to be classified, first perform sensitive word detection. If a sensitive word is present, the comment is directly regarded as a spam comment; if no sensitive word is present, proceed to step 6.
The invention maintains a sensitive word dictionary locally (for example, gambling-related terms); if a sensitive word is present in a comment, the comment is regarded as a spam comment. If no sensitive word is contained, new word discovery detection is performed on the comment with the CRF++ algorithm; if the comment contains new words, they are added to the training corpus after manual labeling, extending and updating the corpus so that it better suits actual network media application scenarios. If there are no new words, the comment is passed to the user behavior rule model and the Bayesian classification model to be learned separately.
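A minimal sketch of this sensitive word gate follows; the dictionary contents are illustrative assumptions, and the CRF++-based new word discovery is not reproduced here since its feature templates are not specified in the patent:

    SENSITIVE_WORDS = {"gambling", "casino"}  # illustrative local dictionary

    def sensitive_word_gate(comment: str) -> bool:
        """Step 5: True if the comment contains any sensitive word and
        should be regarded as spam immediately."""
        return any(word in comment for word in SENSITIVE_WORDS)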
Step 6, perform new word discovery detection on the comment data that passed the detection of step 5. If it contains new words, label the new words with a comment data type and add them to the training sample corpus, extending and updating it; if it contains no new words, proceed to step 7.
Step 7, invoke the user behavior rule model to detect whether the comment data to be classified is a spam comment, and preliminarily classify the comment data; then proceed to step 8.
User behavior rule model: this model is based mainly on the behavioral habits of users who post spam comments; for example, advertisement comments generally contain "QQ" plus digits or "WeChat" plus digits. A number of comment classification rules are formulated accordingly, and a locally generated deformation dictionary (covering deformations of words such as QQ, WeChat, and official-account names) is then used to classify the comment data to be classified.
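Such a rule model can be sketched as regular expressions, as below; the specific patterns, the deformation dictionary entries, and the minimum digit count are illustrative assumptions:

    import re

    # Illustrative deformation dictionary: common disguises of "QQ"/"WeChat"
    DEFORMATIONS = {
        "qq": ["qq", "Q Q", "扣扣"],
        "wechat": ["wechat", "微信", "V信", "vx"],
    }

    # Rules of the form "contact word followed by a digit string" ("QQ"+digits)
    RULES = [
        re.compile(r"(?:%s)\s*[:：]?\s*\d{5,}" % "|".join(map(re.escape, forms)),
                   re.IGNORECASE)
        for forms in DEFORMATIONS.values()
    ]

    def rule_model(comment: str) -> int:
        """Step 7: preliminary classification by behavior rules.
        Returns 1 (spam) if any rule fires, otherwise 0 (normal)."""
        return int(any(rule.search(comment) for rule in RULES))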
Step 8, extract features from the comment data to be classified and classify it with the trained Bayesian classification model. This step specifically includes:
Step 8.1, perform word segmentation preprocessing on the comment data to be classified, obtaining several segmented words. From these, extract as feature words the segments that belong to the feature word set F obtained in step 4.3. Supposing s feature words are extracted in total, denoted x1, x2, ..., xs, this gives the feature word vector X = {x1, x2, ..., xs} that reflects the class tendency of the comment.
Step 8.2, calculate with the following formulas the prior conditional probability P(X|c1) that the comment data to be classified occurs in the positive example training sample library and the prior conditional probability P(X|c2) that it occurs in the negative example training sample library:
P(X|c1) = ∏_{j=1..s} [uj·p(xj|c1) + (1 - uj)·(1 - p(xj|c1))]
P(X|c2) = ∏_{j=1..s} [uj·p(xj|c2) + (1 - uj)·(1 - p(xj|c2))]
wherein:
uj indicates whether the comment data to be classified contains feature word xj: if it contains xj, uj is 1; otherwise uj is 0;
p(xj|c1) denotes the conditional probability that feature word xj belongs to the positive example training sample class, equal to the value P(wi|c1) calculated in step 4.4 for the same feature word;
p(xj|c2) denotes the conditional probability that feature word xj belongs to the negative example training sample class, equal to the value P(wi|c2) calculated in step 4.4 for the same feature word.
Step 8.3, calculate with the following formula the probability P(X) that the comment data to be classified occurs in the training sample corpus:
P(X) = P(X|c1)·P(c1) + P(X|c2)·P(c2)
Step 8.4, calculate with the following formulas the probability P(c1|X) that the comment data to be classified belongs to the positive example comment class and the probability P(c2|X) that it belongs to the negative example comment class:
P(c1|X) = P(X|c1)·P(c1) / P(X)
P(c2|X) = P(X|c2)·P(c2) / P(X)
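A sketch of this scoring in Python, reusing the output of the train_bayes sketch above and interpreting steps 8.2 to 8.4 as a Bernoulli-style naive Bayes over the feature word set; the handling of the degenerate P(X) = 0 case is an illustrative assumption:

    def bayes_classify(comment_words, p_c1, p_c2, features, cond):
        """Steps 8.1-8.4. comment_words: set of segmented words of the
        comment to classify. Returns (P(c1|X), P(c2|X))."""
        p_x_c1 = p_x_c2 = 1.0
        for w in features:                  # u_j = 1 iff the comment contains w
            p1, p2 = cond[w]
            if w in comment_words:
                p_x_c1 *= p1
                p_x_c2 *= p2
            else:
                p_x_c1 *= (1.0 - p1)
                p_x_c2 *= (1.0 - p2)
        p_x = p_x_c1 * p_c1 + p_x_c2 * p_c2          # step 8.3
        if p_x == 0.0:
            return 0.5, 0.5                          # no evidence either way
        return (p_x_c1 * p_c1 / p_x,                 # P(c1|X), step 8.4
                p_x_c2 * p_c2 / p_x)                 # P(c2|X)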
Step 9, using the Adaboosting algorithm, perform ensemble learning on the user behavior rule model and the Bayesian classification model, train with the labeled training sample data, and obtain the final classification result Qk(X) of the comment data to be classified with the following formula:
Qk(X) = Σ_{t=1..K} μt·ft(X)
wherein: K is the number of classification models, here the constant 2;
μt denotes the weight of the t-th classification model in the final combined classifier and is always greater than 0;
ft(X) denotes the classification judgment of the t-th classification model in the X-dimensional feature word vector space. Since in practice only two classification models are involved in the invention, f1(X) denotes the classification judgment of the Bayesian classification model in the X-dimensional feature word vector space, and f2(X) denotes the classification judgment of the user behavior rule model, i.e., the preliminary classification result obtained in step 7.
The misclassification rate error(t) of each classification model trained on the training data is used to automatically adjust the weight of the corresponding model with the following formula:
μt = log[(1 - error(t)) / error(t)]
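The weighted combination can be sketched as below, building on the bayes_classify and rule_model sketches above; the spam decision threshold (a weighted majority vote) is an illustrative assumption:

    import math

    def model_weight(error_rate: float) -> float:
        """mu_t = log((1 - error(t)) / error(t)); positive whenever the
        model does better than chance (0 < error(t) < 0.5)."""
        return math.log((1.0 - error_rate) / error_rate)

    def ensemble_classify(comment_words, comment_text, bayes_args, errors):
        """Step 9: combine the Bayesian model (f1) and the rule model (f2)."""
        f1 = bayes_classify(comment_words, *bayes_args)[1]  # P(c2|X), spam prob.
        f2 = rule_model(comment_text)                       # 1 if a rule fired
        mu1, mu2 = model_weight(errors[0]), model_weight(errors[1])
        score = mu1 * f1 + mu2 * f2                         # Q_k(X)
        return "spam" if score > 0.5 * (mu1 + mu2) else "normal"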
The main features of the invention include:
1. The invention reconstructs the minority-class sample data with the smote algorithm and undersamples the majority-class sample data, recomposing a balanced training sample corpus;
2. New word discovery detection is performed on comment data with the CRF++ algorithm, and the training corpus is continuously and automatically updated;
3. The invention constructs a user behavior rule model to detect whether comment data is a spam comment such as an advertisement or pyramid-scheme promotion;
4. The information gain method is used to extract features from the comment data to be classified, and the Bayesian model is used for classification;
5. The Adaboosting ensemble learning method is used to learn from the user behavior rule model and the Bayesian model and to classify the comment data.
The spam comment filtering method for unbalanced data fusing user behavior rules provided by the invention has the following advantages:
The invention constructs new minority-class samples with a specific method and performs multiple undersamplings of the majority-class samples, recomposing a balanced training sample corpus so that the training sample data reach an equilibrium state; data features are extracted with the characteristics of unbalanced data taken into account, and comments are then classified with the Bayesian classification algorithm. For changeable data such as advertisements and pyramid-scheme promotions, the behavior rules of users are used to make the judgment, and finally the Adaboosting method is used to perform unified ensemble-learning classification of the comment data. The whole spam comment filtering system recognizes newly produced words on the network with the CRF algorithm and continuously updates the corpus, so that the method is better suited to the spam comment filtering scenario of the network media, thereby comprehensively improving the efficiency of spam comment filtering.
The above is only the preferred embodiment of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

  1. A spam comment filtering method for unbalanced data fusing user behavior rules, characterized by including the following steps:
    Step 1, crawling comment sample data on network media channels to form a comment sample data set, and labeling the comment data type of each comment sample; wherein the comment data type includes a positive example sample type and a negative example sample type;
    Step 2, counting the numbers of positive example samples and negative example samples in the comment sample data set; if the number of positive example samples is greater than the number of negative example samples, taking the positive example samples as the majority-class sample data and the negative example samples as the minority-class sample data; conversely, if the number of positive example samples is smaller than the number of negative example samples, taking the positive example samples as the minority-class sample data and the negative example samples as the majority-class sample data; the comment sample data set is thereby divided into a minority-class sample data set and a majority-class sample data set;
    Step 3, reconstructing the minority-class samples in the minority-class sample data set, undersampling the majority-class samples in the majority-class sample data set, and recomposing a balanced training sample corpus; wherein each training sample in the corpus carries a labeled comment data type, so the training samples are distinguished into positive example training samples and negative example training samples, and the corpus includes a positive example training sample library and a negative example training sample library;
    Step 4, using the training sample corpus, extracting classification feature words and training the constructed Bayesian model; this step specifically includes:
    Step 4.1, calculating the prior probability P(c1) of the positive example training samples and the prior probability P(c2) of the negative example training samples according to the formulas:
    P(c1) = n1/N
    P(c2) = n2/N
    wherein: N denotes the total number of training samples in the training sample corpus; n1 denotes the number of positive example training samples in the corpus; n2 denotes the number of negative example training samples in the corpus;
    Step 4.2, segmenting each training sample into words and, based on the following formula, calculating with the information gain method the contribution value IG(w) of each word w to the classification model:
    IG(w) = -[P(c1)·log P(c1) + P(c2)·log P(c2)] + P(w)·[P(c1|w)·log P(c1|w) + P(c2|w)·log P(c2|w)] + P(w̄)·[P(c1|w̄)·log P(c1|w̄) + P(c2|w̄)·log P(c2|w̄)]
    wherein:
    P(w) denotes the occurrence probability of training samples containing word w in the corpus, i.e.: P(w) = (number of training samples containing word w) / N;
    P(c1|w) denotes the conditional probability that a training sample containing word w is a positive example, i.e.: P(c1|w) = (number of positive example training samples containing w) / (number of training samples containing w);
    P(c2|w) denotes the conditional probability that a training sample containing word w is a negative example, i.e.: P(c2|w) = (number of negative example training samples containing w) / (number of training samples containing w);
    P(w̄) denotes the occurrence probability of training samples not containing word w in the corpus, i.e.: P(w̄) = 1 - P(w);
    P(c1|w̄) denotes the conditional probability that a training sample not containing word w is a positive example, i.e.: P(c1|w̄) = (number of positive example training samples not containing w) / (number of training samples not containing w);
    P(c2|w̄) denotes the conditional probability that a training sample not containing word w is a negative example, i.e.: P(c2|w̄) = (number of negative example training samples not containing w) / (number of training samples not containing w);
    Step 4.3, presetting a contribution threshold and extracting the words whose contribution value exceeds the threshold as feature words; supposing m feature words are extracted in total, denoted w1, w2, ..., wm, this gives the feature word set F = {w1, w2, ..., wm};
    Step 4.4, for any feature word wi in the feature word set, i = 1, 2, ..., m, calculating with the following formulas the conditional probability P(wi|c1) that wi belongs to the positive example training sample class and the conditional probability P(wi|c2) that wi belongs to the negative example training sample class:
    P(wi|c1) = n(c1,wi) / n(c1)
    P(wi|c2) = n(c2,wi) / n(c2)
    wherein:
    n(c1,wi) denotes the number of positive example training samples in the corpus in which feature word wi appears;
    n(c1) denotes the number of positive example training samples in the corpus;
    n(c2,wi) denotes the number of negative example training samples in the corpus in which feature word wi appears;
    n(c2) denotes the number of negative example training samples in the corpus;
    Step 5, for the comment data to be classified, first performing sensitive word detection; if a sensitive word is present, the comment is directly regarded as a spam comment; if no sensitive word is present, proceeding to step 6;
    Step 6, performing new word discovery detection on the comment data that passed the detection of step 5; if it contains new words, labeling the new words with a comment data type and adding them to the training sample corpus, extending and updating it; if it contains no new words, proceeding to step 7;
    Step 7, invoking the user behavior rule model to detect whether the comment data to be classified is a spam comment, and preliminarily classifying the comment data; then proceeding to step 8;
    Step 8, extracting features from the comment data to be classified and classifying it with the trained Bayesian classification model; this step specifically includes:
    Step 8.1, performing word segmentation preprocessing on the comment data to be classified, obtaining several segmented words; from these, extracting as feature words the segments that belong to the feature word set F obtained in step 4.3; supposing s feature words are extracted in total, denoted x1, x2, ..., xs, this gives the feature word vector X = {x1, x2, ..., xs} that reflects the class tendency of the comment;
    Step 8.2, calculating with the following formulas the prior conditional probability P(X|c1) that the comment data to be classified occurs in the positive example training sample library and the prior conditional probability P(X|c2) that it occurs in the negative example training sample library:
    P(X|c1) = ∏_{j=1..s} [uj·p(xj|c1) + (1 - uj)·(1 - p(xj|c1))]
    P(X|c2) = ∏_{j=1..s} [uj·p(xj|c2) + (1 - uj)·(1 - p(xj|c2))]
    wherein:
    uj indicates whether the comment data to be classified contains feature word xj: if it contains xj, uj is 1; otherwise uj is 0;
    p(xj|c1) denotes the conditional probability that feature word xj belongs to the positive example training sample class, equal to the value P(wi|c1) calculated in step 4.4 for the same feature word;
    p(xj|c2) denotes the conditional probability that feature word xj belongs to the negative example training sample class, equal to the value P(wi|c2) calculated in step 4.4 for the same feature word;
    Step 8.3, calculating with the following formula the probability P(X) that the comment data to be classified occurs in the training sample corpus:
    P(X) = P(X|c1)·P(c1) + P(X|c2)·P(c2)
    Step 8.4, calculating with the following formulas the probability P(c1|X) that the comment data to be classified belongs to the positive example comment class and the probability P(c2|X) that it belongs to the negative example comment class:
    P(c1|X) = P(X|c1)·P(c1) / P(X)
    P(c2|X) = P(X|c2)·P(c2) / P(X)
    Step 9, using the Adaboosting algorithm, performing ensemble learning on the user behavior rule model and the Bayesian classification model, training with the labeled training sample data, and obtaining the final classification result Qk(X) of the comment data to be classified with the following formula:
    Qk(X) = Σ_{t=1..K} μt·ft(X)
    wherein: K is the number of classification models, here the constant 2;
    μt denotes the weight of the t-th classification model in the final combined classifier and is always greater than 0;
    ft(X) denotes the classification judgment of the t-th classification model in the X-dimensional feature word vector space, where f1(X) denotes the classification judgment of the Bayesian classification model in the X-dimensional feature word vector space and f2(X) denotes the classification judgment of the user behavior rule model, i.e., the preliminary classification result obtained in step 7.
  2. The spam comment filtering method for unbalanced data fusing user behavior rules according to claim 1, characterized in that, in step 1, the positive example sample type is the normal comment sample type, and the negative example sample type is the spam comment sample type.
  3. The spam comment filtering method for unbalanced data fusing user behavior rules according to claim 1, characterized in that, in step 3, the reconstruction of the minority-class samples in the minority-class sample data set is specifically:
    Step 3.1, supposing the minority-class sample data set contains r minority-class samples in total, denoted: minority-class sample L1, minority-class sample L2, ..., minority-class sample Lr;
    for each minority-class sample Lh, h = 1, 2, ..., r, calculating the Euclidean distance from Lh to each of the other r - 1 minority-class samples in the set;
    Step 3.2, presetting a neighbor selection number d, and selecting the d minority-class samples nearest to Lh as the neighbors of Lh;
    Step 3.3, setting a sampling multiplier e according to the actual sample ratio and the number of balanced training samples required by the training model;
    for each minority-class sample Lh, performing e samplings; in each sampling, randomly selecting one neighbor from its d neighbors; supposing the neighbor selected in this sampling is Lq, q = 1, 2, ..., r, the new minority-class sample Lnew constructed by this sampling is:
    Lnew = Lh + rand(0,1) × (Lq - Lh)
    wherein rand(0,1) denotes a random number generated between 0 and 1, so that the new minority-class sample Lnew lies between the minority-class sample Lh and its neighbor;
    new minority-class samples are thus constructed, extending the set.
  4. The spam comment filtering method for unbalanced data fusing user behavior rules according to claim 1, characterized in that step 9 further includes: according to the misclassification rate error(t) of the classification model trained on the training data, automatically adjusting the weight of the corresponding model with the following formula:
    μt = log[(1 - error(t)) / error(t)].
CN201711247021.3A 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules Active CN108009249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711247021.3A CN108009249B (en) 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711247021.3A CN108009249B (en) 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules

Publications (2)

Publication Number Publication Date
CN108009249A true CN108009249A (en) 2018-05-08
CN108009249B CN108009249B (en) 2020-08-18

Family

ID=62055753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711247021.3A Active CN108009249B (en) 2017-12-01 2017-12-01 Spam comment filtering method for unbalanced data and fusing user behavior rules

Country Status (1)

Country Link
CN (1) CN108009249B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090055412A1 (en) * 2007-08-24 2009-02-26 Shaun Cooley Bayesian Surety Check to Reduce False Positives in Filtering of Content in Non-Trained Languages
CN105306296A (en) * 2015-10-21 2016-02-03 北京工业大学 Data filter processing method based on LTE (Long Term Evolution) signaling
CN106973057A (en) * 2017-03-31 2017-07-21 浙江大学 A kind of sorting technique suitable for intrusion detection

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717408A (en) * 2018-05-11 2018-10-30 杭州排列科技有限公司 A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
CN108717408B (en) * 2018-05-11 2023-08-22 杭州排列科技有限公司 Sensitive word real-time monitoring method, electronic equipment, storage medium and system
CN108710614A (en) * 2018-05-31 2018-10-26 校宝在线(杭州)科技股份有限公司 A kind of composition evaluating method based on user behavior
CN109783586B (en) * 2019-01-21 2022-10-21 福州大学 Water army comment detection method based on clustering resampling
CN109783586A (en) * 2019-01-21 2019-05-21 福州大学 Waterborne troops's comment detection system and method based on cluster resampling
CN109858805A (en) * 2019-01-29 2019-06-07 浙江力嘉电子科技有限公司 Peasant household's rubbish based on interval estimation harvests quantity computation method
CN109858805B (en) * 2019-01-29 2022-12-16 浙江力嘉电子科技有限公司 Farmer garbage collection quantity calculation method based on interval estimation
CN110162621A (en) * 2019-02-22 2019-08-23 腾讯科技(深圳)有限公司 Disaggregated model training method, abnormal comment detection method, device and equipment
CN110162621B (en) * 2019-02-22 2023-05-23 腾讯科技(深圳)有限公司 Classification model training method, abnormal comment detection method, device and equipment
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110516058A (en) * 2019-08-27 2019-11-29 出门问问(武汉)信息科技有限公司 The training method and training device of a kind of pair of garbage classification problem
CN110688484A (en) * 2019-09-24 2020-01-14 北京工商大学 Microblog sensitive event speech detection method based on unbalanced Bayesian classification
CN113127640A (en) * 2021-03-12 2021-07-16 嘉兴职业技术学院 Malicious spam comment attack identification method based on natural language processing
CN113630336A (en) * 2021-07-19 2021-11-09 上海德衡数据科技有限公司 Data distribution method and system based on optical interconnection
CN113569111B (en) * 2021-09-24 2021-12-21 腾讯科技(深圳)有限公司 Object attribute identification method and device, storage medium and computer equipment
CN113569111A (en) * 2021-09-24 2021-10-29 腾讯科技(深圳)有限公司 Object attribute identification method and device, storage medium and computer equipment
CN114629871A (en) * 2022-02-28 2022-06-14 杭州趣链科技有限公司 Junk mail filtering method and device based on unbalanced dynamic flow data classification and storage medium

Also Published As

Publication number Publication date
CN108009249B (en) 2020-08-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant