CN103389995B

CN103389995B - A kind of trash content recognition method and device

Info

Publication number: CN103389995B
Application number: CN201210144007.1A
Authority: CN
Inventors: 王帅
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2012-05-10
Filing date: 2012-05-10
Publication date: 2016-11-23
Anticipated expiration: 2032-05-10
Also published as: CN103389995A

Abstract

The embodiment of the present application provides a kind of trash content recognition method and device, including: when receiving sample data to be identified, classify based on Naive Bayes Classifier relative to prior art utilizes, Naive Bayes Classifier is improved by the embodiment of the present application, carries out the classification identification of sample data to be identified in conjunction with the first trust-factor and the second trust-factor.Owing to introducing the first trust-factor and the second trust-factor, naive Bayesian conditional is made independently to become softer, category classification can be better achieved, thus improve the accuracy of identification of rubbish contents, and owing to introducing the first trust-factor and the second trust-factor, the feature negligible amounts introduced in when making to classify so that utilize relatively short period of time i.e. can realize the identification of rubbish contents.

Description

A kind of trash content recognition method and device

Technical field

The application relates to the communications field, particularly relates to a kind of trash content recognition method and device.

Background technology

Along with popularizing of network, network information is increasing.The substantial amounts of network information is necessarily deposited At some invalid informations (invalid information can be referred to as rubbish contents), in order to ensure the network information health, Legal, the automatic identification of rubbish contents also becomes more and more important.

The most conventional anti-spam (the automatic identification of rubbish contents) scheme mainly has two big classes, and a class is base In the anti-spam scheme of user behavior, a class is anti-spam scheme based on user content.

Anti-spam scheme based on user behavior is mainly classified according to the operation behavior of user, will issue Frequency is defined as junk user higher than the user of setting value, thus identifies which user is normal users, which User is junk user, it is possible to the content that junk user is issued is defined as rubbish contents.

Anti-spam scheme of based on user behavior is in machine software user high-volume, high-frequency issue content Time relatively effective, because junk user can be identified with being apparent from by issuing frequency.But, if User suitably reduces issue frequency, will be unable to identify junk user with being apparent from by issue frequency.Such as A user utilizes an account within one day, to carry out 1000 contents issues when, classification is distinguished brighter Aobvious, can be easily identified out this user is junk user, and then identifies rubbish contents.But, as Really a user has 100 accounts, the when that an account issuing 10 times in one day, and the district of this mode Dividing just seems less proves effective, it is impossible to effectively identifying this user is junk user, also goes out this with regard to None-identified The rubbish contents that user issues.So publishing policy low-frequency for many accounts, based on user behavior anti- Rubbish scheme cannot effectively identify junk user, the most just cannot effectively identify rubbish contents.

Anti-spam scheme based on content mainly includes rule-based anti-spam scheme and based on grader Two kinds of methods of anti-spam scheme.

The way of rule-based anti-spam scheme, usually preset rules filter (fiter), by key word Carry out accurately or fuzzy matching with user input content, the content meeting preset rules is defined as in rubbish Hold, to reach the effect of rubbish contents identification.Rule-based anti-spam scheme, it needs to be determined that key word, is closed The difficulty that keyword determines is relatively big, and falsely dropping and the problem leaking choosing easily occurs in the key word determined.It addition, with One key word, in different context environmentals, representative meaning is the most different, is difficult to accomplish universality, The precision of rubbish contents identification is difficult to be guaranteed.The content such as with " U.S. pupil " this key word is being washed in a pan Maiden's product is exactly rubbish contents, but, it not the most rubbish contents in cosmetics.At key word When quantity is more, the ambiguity of key word can show to become apparent from, and the precision of rubbish contents identification can be subject to Bigger impact.

In anti-spam scheme based on grader, anti-spam scheme based on naive Bayesian, is current main flow Anti-spam method.But, due to the restriction of the conditional independence of Naive Bayes Classifier so that trained Cheng Feichang is time-consuming, and the precision improvement of grader there is also bottleneck.

Summary of the invention

The embodiment of the present application provides a kind of trash content recognition method and device, for shortening the knowledge of rubbish contents The other time, and improve the accuracy of identification of rubbish contents.

A kind of trash content recognition method, described method includes:

Receive the sample data to be identified that user provides；

Sample number to be identified is determined according to Naive Bayes Classifier, the first trust-factor and the second trust-factor According to classification, when described sample data to be identified is classified as rubbish contents classification, represent this sample to be identified Notebook data belongs to rubbish contents, when described sample data to be identified is classified as non-junk content type, and table Show that this sample data to be identified belongs to non-junk content,

Wherein, described first trust-factor is for each feature in the sample data of the rubbish contents classification configured The conditional probability occurred is respectively provided with, and described second trust-factor is the non-junk content type for configuration The conditional probability that in sample data, each feature occurs is respectively provided with.

A kind of rubbish contents identification device, described device includes:

Receiver module, for receiving the sample data to be identified that user provides；

Prediction module, for true according to Naive Bayes Classifier, the first trust-factor and the second trust-factor The classification of fixed sample data to be identified, when described sample data to be identified is classified as rubbish contents classification, Represent that this sample data to be identified belongs to rubbish contents, be classified as non-junk in described sample data to be identified During content type, represent that this sample data to be identified belongs to non-junk content, wherein, described first trust because of Son is that the conditional probability occurred for each feature in the sample data of the rubbish contents classification configured is respectively provided with , described second trust-factor is to occur for each feature in the sample data of the non-junk content type configured Conditional probability be respectively provided with.

The scheme provided according to the embodiment of the present application, when receiving the sample data to be identified that user provides, Classifying based on Naive Bayes Classifier relative to utilizing in prior art, the embodiment of the present application is to simplicity Bayes classifier improves, and carries out sample number to be identified in conjunction with the first trust-factor and the second trust-factor According to classification identification.Owing to introducing the first trust-factor and the second trust-factor, adjusted by trust-factor Joint conditional independence so that naive Bayesian conditional independently becomes softer, and class can be better achieved Do not classify, thus improve the accuracy of identification of rubbish contents, and owing to introducing the first trust-factor and the second letter Appoint the factor so that the feature negligible amounts introduced in during classification so that utilize relatively short period of time i.e. can realize The identification of rubbish contents.

Accompanying drawing explanation

The flow chart of steps of the trash content recognition method that Fig. 1 provides for the embodiment of the present application one；

The flow chart of steps of the characteristic model method after the determination training that Fig. 2 provides for the embodiment of the present application two；

The data flow diagram that Fig. 3 provides for the embodiment of the present application three；

The training process schematic that Fig. 4 provides for the embodiment of the present application four；

The test process schematic diagram that Fig. 5 provides for the embodiment of the present application five；

The structural representation of the rubbish contents identification device that Fig. 6 provides for the embodiment of the present application six.

Detailed description of the invention

Low in order to solve existing trash content recognition method accuracy of identification and the longest problem, this Shen Characteristic model based on Naive Bayes Classifier please be improved embodiment, introduce letter for each feature Appoint the factor, weaken the conditional independence of each feature so that through fewer number training can so that The accuracy of identification of the characteristic model that must train meets requirement so that the training time shortens.And due to training time Number is less, and the feature quantity introduced in characteristic model is the most fewer so that the characteristic model after training is to rubbish The recognition time of content can also shorten.Meanwhile, by introducing trust-factor, it is also possible to make each sample The classification of data definitely, thus promotes nicety of grading.The rubbish contents identification that the embodiment of the present application provides Scheme can apply to the application scenarios of various rubbish contents identification, such as identification or the refuse messages of spam Identification etc..

Below in conjunction with Figure of description and each embodiment, the application scheme is illustrated.

Embodiment one

The embodiment of the present application one provides a kind of trash content recognition method, the steps flow chart of the method such as Fig. 1 institute Show, including:

Step 001, receive sample data to be identified.

In this step, can receive that user provides is to be identified for be made whether as rubbish contents identification Sample data.This sample data to be identified can be that any one needs such as mail data or note data are carried out The sample data of rubbish contents identification.

Step 002, to the sample identified data are classified.

In this step, can be according to Naive Bayes Classifier, the first trust-factor and the second trust-factor Determine the classification of sample data to be identified.Wherein, described first trust-factor is the rubbish contents class for configuration The conditional probability that in other sample data, each feature occurs is respectively provided with, described second trust-factor be for The conditional probability that in the sample data of the non-junk content type of configuration, each feature occurs is respectively provided with.? Determine when the classification of described sample data to be identified is rubbish contents classification, i.e. may determine that this sample to be identified Data belong to rubbish contents, otherwise, are non-junk content class in the classification determining described sample data to be identified Time other, i.e. may determine that this sample data to be identified belongs to non-junk content.

Concrete, in this step, can be according to Naive Bayes Classifier, the first trust-factor and second Trust-factor determines characteristic model, it is possible to utilize the sample to be identified that user is provided by the characteristic model after training Notebook data is categorized as rubbish contents classification or non-junk content type.

Further, the sample data to be identified that user is provided by the characteristic model after utilizing training is classified Time, the sample data to be identified that user provides can be decomposed, it is thus achieved that the spy of this sample data to be identified Levy subset, the statistics knot of the characteristic model after utilizing each feature in this feature subset to determine for training Really, it is possible to when statistical result is more than the second setting threshold value, this sample data to be identified is categorized as rubbish Content type, otherwise, is categorized as non-junk content type by this sample data to be identified.

Below by embodiment two, the process of the characteristic model after determining training is described in detail.

Embodiment two

The embodiment of the present application two provides a kind of method determining the characteristic model after training, the step stream of the method Journey is as in figure 2 it is shown, include:

Step 101, determine sample set, sample data classification and the characteristic model of improvement.

In the scheme that the present embodiment provides, characteristic model based on Naive Bayes Classifier is improved, It is trained based on the characteristic model improved, it is possible to after utilizing training, user's input is waited to know by characteristic model Very notebook data is classified, it is judged that whether this sample data to be identified belongs to rubbish contents classification, it is achieved rubbish Rubbish content recognition.

The most in this step, in order to realize the training to characteristic model, it is also desirable to user provides sample set. Concrete, training sample subset can be determined from the sample set that user provides, for characteristic model is carried out Training.

And for the ease of follow-up to the test of characteristic model nicety of grading after training, can be further by user The sample set provided is divided into training sample subset and test sample subset, the sample number in training sample subset According to for being trained characteristic model, the sample data in test sample subset is for the feature after training Model carries out nicety of grading test.

Due to the present embodiment is in order to realize the identification of rubbish contents it can be understood as the most at last user provide Sample data to be identified be divided into two classifications: rubbish contents classification and non-junk content type.Therefore exist In this step, according to the feature of each sample data, the sample data in sample set can be categorized as two Classification, it is possible to represent with rubbish contents classification and non-junk content type respectively.Such that it is able to utilize mark For the sample data of rubbish contents classification be designated the sample data of non-junk content type to characteristic model It is trained and tests.

Characteristic model based on Naive Bayes Classifier can be multi-form, to divide based on naive Bayesian As a example by the characteristic model of class device is represented by below equation:

\frac{P (spam | χ)}{1 - P (spam | χ)} = \frac{P (spam) Π_{i = 1}^{n} P (X_{i} | spam)}{P (ham) Π_{i = 1}^{n} P (X_{i} | ham)} - - - (1)

Then, the characteristic model of improvement can be represented by below equation so that follow-up to this system based on formula The span of meter result is between 0 ~ 1, it is simple to follow-up this statistical result based on formula is carried out value model The restriction enclosed.Certainly, the characteristic model of improvement can also be represented by the formula of other forms, the present embodiment Formula form is not specifically limited:

\frac{P (spam | χ)}{1 - P (spam | χ)} = \frac{P (spam) Π_{i = 1}^{n} P (X_{i} | spam) θ (spam, X_{i})}{P (ham) Π_{i = 1}^{n} P (X_{i} | ham) θ (ham, X_{i})} - - - (2)

Wherein, in formula (1) and formula (2):

X={X₁,X₂,...X_n, represent feature X_i, i=1,2 ... the set of n；

Under conditions of P (spam | x) represents that characteristic set x occurs, sample data is the condition of rubbish contents classification Probability；

P (spam) represents occur being designated the probability of the sample data of rubbish contents classification；

P (ham) represents occur being designated the probability of the sample data of non-junk content type；

P(X_i| spam) represent feature X in the sample data being designated rubbish contents classification_iThe condition occurred is general Rate；

P(X_i| ham) represent feature X in the sample data being designated non-junk content type_iThe condition occurred is general Rate.

In formula (2), relative to formula (1), introduce the sample for being designated rubbish contents classification The first trust-factor that the conditional probability that in data, each feature occurs is respectively provided with, and for being designated non-rubbish The second trust-factor that the conditional probability that in the sample data of rubbish content type, each feature occurs is respectively provided with:

Wherein, θ (spam, X_i) represent feature X in the sample data being designated rubbish contents classification_iThe bar occurred First trust-factor of part probability；

θ(ham,X_i) represent feature X in the sample data being designated non-junk content type_iThe condition occurred is general Second trust-factor of rate.

Step 102, utilize sample data that characteristic model is trained.

In this step, it may be determined that the sample data of the setting quantity in training sample subset, to character modules Type is trained.Wherein it is determined that the sample data of the setting quantity gone out is not for instructing characteristic model Practice.Characteristic model is trained by each sample data in the sample data utilizing described setting quantity After, i.e. can obtain the characteristic model after epicycle training and (i.e. can be understood as taking turns training to the one of characteristic model Instruction characteristic model carried out including each sample data in the sample data utilizing described setting quantity Practice).

Owing to, after a sample data training, the partial parameters in characteristic model is it may happen that change, therefore, The character modules obtained after different sample data training can be identified with the characteristic model of different labels Type.Wherein, the 1st characteristic model can be designated without the characteristic model of sample data training.Often through one Individual sample data is trained, and can add 1 by the label of characteristic model, as will be trained through a sample data 1st characteristic model is designated the 2nd characteristic model.By the 2nd characteristic model through a sample data training It is designated the 3rd characteristic model, by that analogy.

To utilize a sample data in training sample subset that kth characteristic model is trained, obtain (assume that k is positive integer) as a example by k+1 characteristic model, the training process of kth characteristic model specifically included:

Decomposing this sample data, it is thus achieved that belong to the fisrt feature subset of described characteristic set, utilizing should Each feature in fisrt feature subset determines the statistical result for kth characteristic model；

It is identified as non-junk content type in this sample data, and described statistical result is more than the first setting door During limit value, in kth characteristic model, for each feature in this fisrt feature subset, it is reduced to be designated The second trust-factor that the conditional probability that in the sample data of non-junk content type, this feature occurs is arranged, will It is designated this feature during the quantity of this feature increases this sample data in the sample data of non-junk content type Quantity, and redefine statistical result；Otherwise, it determines the training of this sample data terminates, will be through this sample Kth characteristic model after data training is defined as kth+1 characteristic model；Or,

It is identified as rubbish contents classification in this sample data, and this statistical result is not more than the second setting thresholding During value, in kth characteristic model, for each feature in this fisrt feature subset, it is reduced to be designated rubbish The first trust-factor that the conditional probability that in the sample data of rubbish content type, this feature occurs is arranged, will mark For the quantity of this feature during the quantity of this feature increases this sample data in the sample data of rubbish contents classification, And redefine statistical result；Otherwise, it determines the training of this sample data terminates, will train through this sample data After kth characteristic model be defined as kth+1 characteristic model.

It should be noted that when a sample data is identified as non-junk content type, this sample data May be after the repeatedly training to kth characteristic model, just can be for kth characteristic model, it is thus achieved that the least In the first statistical result setting threshold value.It is similar to, is identified as rubbish contents class a sample data Time other, this sample data is likely to after the repeatedly training to kth characteristic model, just can be for kth Characteristic model, it is thus achieved that set the statistical result of threshold value more than second.

Step 103, the nicety of grading of characteristic model after training is tested.

This step is a preferred steps.Utilize the sample data of the setting quantity determined in a step 102 In each sample data take turns after training terminates for the one of characteristic model, can be to the characteristic model after training Nicety of grading test, so that it is determined that whether the characteristic model after Xun Lian can classify accurately, And can be when the characteristic model after determining training can be classified accurately, after utilizing the training determined Characteristic model carry out rubbish contents identification, otherwise, it determines need characteristic model is carried out the training of a new round, Return and perform step 102.

Concrete, it is assumed that in a step 102 last training be kth characteristic model is trained after Arrive kth+1 characteristic model, kth+1 characteristic model can be utilized in this step by described test specimens book The sample data concentrated is categorized as rubbish contents classification and non-junk content type.

Owing to the most the classification of each sample data in test sample subset being marked Know, therefore, in this step, can be according to the classification essence of the characteristic model after mark test training before Degree.The classification that each sample data is divided in described test sample subset is identified as with this sample data The identical number of classification not less than setting value time, when the nicety of grading i.e. tested out is not less than setting value, can Carry out rubbish contents identification with the characteristic model after utilizing the training determined, otherwise, jump to step 102, Redefine the sample data setting quantity, continue with sample data and characteristic model is trained.

Assuming in step 103, it is (the least that the measuring accuracy of kth+1 characteristic model can meet requirement In setting value), then when the characteristic model after utilizing the training determined carries out rubbish contents identification, permissible The sample data to be identified utilizing kth+1 characteristic model user to be provided is categorized as rubbish contents classification or non-rubbish Rubbish content type, when the classification of this sample data to be identified is rubbish contents classification, it is determined that this is to be identified Sample data comprises rubbish contents, it is achieved the identification to rubbish contents.Certainly, if sample data is to spy The frequency of training levying model is abundant, it is understood that the measuring accuracy for the characteristic model after training can be expired Foot requirement, after step 102, it is not necessary to perform step 103, utilizes the character modules after the training determined When type carries out rubbish contents identification, if it is assumed that last training is to kth character modules in a step 102 When type has obtained kth+1 characteristic model after being trained, it is also possible to directly utilizing kth+1 characteristic model will use The sample data to be identified that family provides is categorized as rubbish contents classification or non-junk content type.

Below by embodiment three, the method for the characteristic model after the determinations training of two offers in conjunction with the embodiments, The data flow relating to embodiment one in the trash content recognition method provide illustrates.

Embodiment three

As it is shown on figure 3, its data flow diagram provided for the embodiment of the present application three.Wherein, user provides The sub-centralised identity of training sample have the sample data of rubbish contents classification or non-junk content type to claim For language material, language material can be saved in corpus.Training airplane can be utilized according to language material, characteristic model to be carried out Training, it is possible to the characteristic model obtained after training is saved in feature model library.Input to be identified user After sample data, it is possible to use the characteristic model that prediction machine obtains after obtaining training from feature model library, and The classification of the sample data to be identified of user's input is predicted, it is possible to will predict the outcome and be stored in result Storehouse.Further, user manually can judge predicting the outcome, and determines the accuracy predicted the outcome, And can will be input in corpus as language material, for follow-up instruction through the artificial sample data judged Practice.And due to corpus it may happen that update, therefore, it can utilize intervalometer to carry out clocked flip training airplane weight Newly being trained, the characteristic model after the language material after updating with utility is trained again can more be as the criterion Really classify.

Below by embodiment four to embodiment two utilizing a sample data characteristic model is trained Process be described in detail.

Embodiment four

The embodiment of the present application four provides a kind of sample data training method to characteristic model, the step of the method Flow process as shown in Figure 4, including:

Step 201, sample data is decomposed.

When needing to utilize the sample data determined that characteristic model is trained, first to this sample data Decompose, it is thus achieved that the character subset of this sample data.This feature subset can be understood as each for spy Levy a part of feature that the sample data that model is trained carries out decomposing in the characteristic set obtained or all Feature.Concrete, it is possible to use language model N-GRAM pair conventional in large vocabulary continuous speech recognition Sample data is decomposed.

Step 202, determine statistical result.

Sample data can be understood as the value to characteristic model relevant parameter to the once training of characteristic model The process being updated, it is possible to the parameter obtained after preserving training every time.Concrete, spy can be saved in Levy in model library.Such that it is able to when determining statistical result, it is possible to use train the parameter, right obtained before The quantity that described sample data carries out each feature and each feature decomposing in the character subset obtained is true The fixed statistical result for characteristic model.

Such as, a sample data is identified as rubbish contents classification, decomposes this sample data, Character subset (X to this sample data₁, X₃, X₅, X₆), and X in this sample data₁Quantity be 5 Individual, X₃Quantity be 7, X₅Quantity be 10, X₆Quantity be 4, and sample will be trained in advance The sample data that book is concentrated is designated the sample size known (for example, 300) of rubbish contents classification, then may be used To determine:

It is designated feature X in the sample data of rubbish contents classification₁The conditional probability occurred P(X₁|spam)=(5+M₁)/300, are designated feature X in the sample data of rubbish contents classification₃The bar occurred Part probability P (X₃|spam)=(7+M₃)/300, are designated feature X in the sample data of rubbish contents classification₅Go out Existing conditional probability P (X₅|spam)=(10+M₅)/300, are designated in the sample data of rubbish contents classification spy Levy X₆Conditional probability P (the X occurred₆|spam)=(4+M₆)/300。

And for example, a sample data is identified as non-junk content type, decomposes this sample data, Obtain the character subset (X of this sample data₁, X₃, X₅, X₆), and X in this sample data₁Quantity be 5, X₃Quantity be 6, X₅Quantity be 7, X₆Quantity be 8, and in advance will training Sample data in sample set is designated the sample size known (being assumed to be 100) of non-junk content type, Then may determine that:

It is designated feature X in the sample data of non-junk content type₁The conditional probability occurred P(X₁| ham=(5+N₁)/100, are designated feature X in the sample data of non-junk content type₃The bar occurred Part probability P (X₃|ham)=(6+N₃)/100, are designated feature X in the sample data of non-junk content type₅Go out Existing conditional probability P (X₅|ham)=(7+N₅)/100, are designated in the sample data of non-junk content type spy Levy X₆Conditional probability P (the X occurred₆|ham)=(8+N₆)/100。

And can utilize formula:

\frac{P (spam | χ)}{1 - P (spam | χ)} = \frac{P (spam) Π_{i = 1}^{n} P (X_{i} | spam) θ (spam, X_{i})}{P (ham) Π_{i = 1}^{n} P (X_{i} | ham) θ (ham, X_{i})},

Determine system Meter result.

Wherein, owing in advance the sample data in training sample subset to be designated rubbish contents classification and non-rubbish The sample size of rubbish content type is it is known that there is being designated the sample data of rubbish contents classification in characteristic model Probability, and occur that the probability being designated the sample data of non-junk content type all can obtain.

For example, it is assumed that training sample subset includes 400 sample datas, and wherein it is designated rubbish contents The sample data of type is 300, and the sample data being designated non-junk content type is 100, then special Levying the probability occurring being designated the sample data of rubbish contents classification in model is P (spam)=300/400, occurs The probability of the sample data being designated non-junk content type is P (ham)=100/400.

And when this determines statistical result, train the sample number being designated rubbish contents classification obtained before First trust-factor θ (spam, X of the conditional probability that each feature occurs according to_i), be designated non-junk content Second trust-factor θ (ham, X of the conditional probability that each feature occurs in the sample data of classification_i), for mark Know the sample data for rubbish contents classification, feature X in characteristic model_iOccurrence number sum M_i, and, For being designated the sample data of non-junk content type, feature X in characteristic model_iOccurrence number sum N_i It is all confirmable.

Assume θ (spam, X_i) represent that characteristic model is trained by each sample data every time after, be designated rubbish Feature X in the sample data of rubbish content type_iFirst trust-factor of the conditional probability occurred.With θ(spam,X₁) the example that is defined as illustrate: assume, when this determines statistical result, to have utilized 3 Characteristic model is trained by the sample data of mark rubbish contents classification, wherein, is designated rubbish contents The sample data 1 of classification has carried out 3 training to characteristic model, is designated the sample number of rubbish contents classification Carry out 2 training according to 2 pairs of characteristic models, be designated this enforcement of sample data 3(of rubbish contents classification Example is used for the sample data being trained characteristic model) characteristic model has been carried out 1 training, sample Data 1 are decomposed in the character subset obtained and do not comprise feature X₁, sample data 2 is decomposed the feature obtained Subset comprises feature X₁, sample data 3 is decomposed in the character subset obtained and comprises feature X₁, and false If every time to θ (spam, X₁) adjustment amplitude be and be reduced to original α times, θ (spam, X₁) initial value be 1, then when this determines statistical result, θ (spam, X₁)=α³。

Assume θ (ham, X_i) represent that characteristic model is trained by each sample data every time after, be designated non-rubbish Feature X in the sample data of rubbish content type_iSecond trust-factor of the conditional probability occurred.With θ (ham, X₁) The example that is defined as illustrate: assume when this determines statistical result, utilized 3 to be designated non-rubbish Characteristic model is trained by the sample data of rubbish content type, wherein, is designated non-junk content type Sample data 1 characteristic model has been carried out 3 times training, be designated the sample data of non-junk content type 2 pairs of characteristic models have carried out 2 training, are designated sample data 3(the present embodiment of rubbish contents classification In for sample data that characteristic model is trained) characteristic model carried out 1 training, sample number According to decomposing in 1, the character subset obtained does not comprises feature X₁, sample data 2 is decomposed feature obtained Concentration comprises feature X₁, sample data 3 is decomposed in the character subset obtained and comprises feature X₁, and assume Every time to θ (ham, X₁) adjustment amplitude be and be reduced to original β times, θ (ham, X₁) initial value be 1, Then when this determines statistical result, θ (ham, X₁)=β³。

Assume M_iAfter representing that characteristic model is trained by each sample data, for being designated rubbish every time The sample data of content type, feature X in characteristic model_iOccurrence number sum.With M₁The example that is defined as enter Row explanation: assume when this determines statistical result, utilized 3 samples being designated rubbish contents classification Characteristic model is trained by notebook data, wherein, is designated the sample data 1 of rubbish contents classification to spy Levying model and carried out 3 training, characteristic model is carried out by the sample data 2 being designated rubbish contents classification 2 training, are designated in sample data 3(the present embodiment of rubbish contents classification for entering characteristic model The sample data of row training) characteristic model has been carried out 1 training, sample data 1 is decomposed the spy obtained Levy and subset does not comprise feature X₁, sample data 2 is decomposed in the character subset obtained and comprises feature X₁, And feature X₁The number of times occurred in sample data 2 is 3 times, decomposes the feature obtained in sample data 3 Subset comprises feature X₁, and feature X₁The number of times occurred in sample data 3 is 5 times, then at this When determining statistical result, M1=3+3+5=11.

Assume N_iAfter representing that characteristic model is trained by each sample data, for being designated non-junk every time The sample data of content type, feature X in characteristic model_iAdding up of occurrence number.With N₁Be defined as example Illustrate: assume when this determines statistical result, utilized 3 to be designated non-junk content type Sample data characteristic model is trained, wherein, be designated the sample of non-identifying rubbish contents classification Data 1 have carried out 3 training to characteristic model, are designated the sample data 2 of non-junk content type to spy Levy model carried out 2 times training, be designated in sample data 3(the present embodiment of rubbish contents classification for The sample data that characteristic model is trained) characteristic model is carried out 1 training, in sample data 1 Decompose in the character subset obtained and do not comprise feature X₁, sample data 2 is decomposed in the character subset obtained and wraps Containing feature X₁, and feature X₁The number of times occurred in sample data 2 is 2 times, decomposes in sample data 3 The character subset obtained comprises feature X₁, and feature X₁The number of times occurred in sample data 3 is 5 times, Then when this determines statistical result, N1=2+2+5=9.

In the present embodiment, the executive agent of each step can be understood as training airplane.In this step, training Machine can obtain each parameter of preservation from feature model library, it is possible to according to each parameter determined, enters one Step determines statistical result.

Step 203, determine the type of sample data for training.

When this sample data is identified as non-junk content type, perform step 2041, in this sample data When being identified as non-junk content type, perform step 2042.

Step 2041, determine the need for continue training.

In this step, can determine the need for utilizing this sample data to characteristic model according to statistical result Proceed training.

Concrete, it is identified as non-junk content type in this sample data, and described statistical result is (permissible Represent with P) more than first setting threshold value time, it is threshold value that this first threshold value can be set as 0.5-e, e Constant, determines that needs proceed training, continues executing with step 2051, otherwise, it determines this sample data pair The training of characteristic model terminates.

Step 2042, determine the need for continue training.

In this step, equally determine the need for utilizing this sample data to feature according to statistical result Model proceeds training.

Concrete, it is identified as rubbish contents classification in this sample data, and this statistical result (can use P Represent) be not more than second setting threshold value time, it is that threshold value is normal that this first threshold value can be set as 0.5+e, e Amount, determines that needs proceed training, continues executing with step 2052, otherwise, it determines this sample data is to spy The training levying model terminates.

Step 2051, characteristic model is trained.

When this sample data is identified as non-junk content type, characteristic model is once instructed by this sample data The content practiced includes:

For each feature in the character subset of this sample data obtained, it is reduced to be designated in non-junk Hold the second trust-factor that the conditional probability that in the sample data of classification, this feature occurs is arranged, and will be designated The quantity of this feature during the quantity of this feature increases this sample data in the sample data of non-junk content type. Concrete, the second trust-factor can be reduced to original α times every time, α more than 0 less than 1 is just Number.

Step 2052, characteristic model is trained.

It is identified as rubbish contents classification in this sample data, time, characteristic model is once instructed by this sample data The content practiced includes:

For each feature in the character subset of this sample data obtained, it is reduced to be designated rubbish contents The first trust-factor that the conditional probability that in the sample data of classification, this feature occurs is arranged, and rubbish will be designated The quantity of this feature during the quantity of this feature increases this sample data in the sample data of rubbish content type.Specifically , it is also possible to the first trust-factor being reduced to original α times, α is the positive number more than 0 less than 1 every time.

The nicety of grading to the characteristic model after training related to embodiment two below by embodiment five is entered The process of row test is described in detail.

Embodiment five

The embodiment of the present application five provides a kind of side testing the nicety of grading of the characteristic model after training Method, the steps flow chart of the method is as it is shown in figure 5, include:

Step 301, sample data is decomposed.

In the present embodiment, it is thus necessary to determine that the classification of each sample data in test sample subset.Therefore, For each sample data in test sample subset, this sample data can be decomposed, it is thus achieved that this sample The character subset of notebook data.This feature subset can be understood as each sample for being trained characteristic model A part of feature that notebook data carries out decomposing in the characteristic set obtained or all feature.Concrete, can be in order to With N-GRAM model, sample data is decomposed.

Step 302, determine sample data classification.

For each sample data in test sample subset, it is possible to use this sample data carries out decomposition and obtains Character subset in each feature determine for training after the statistical result of characteristic model.

Such as, it is (X determining that sample data decomposes the character subset obtained₁, X₂) time, can be in order to Characteristic model with after training:

\frac{P (spam | χ)}{1 - P (spam | χ)} = \frac{P (spam) P (X_{1} | spam) θ (spam, X_{1}) P (x_{2} | spam) θ (spam, X_{2})}{P (ham) P (X_{1} | spam) θ (ham, X_{1}) P (X_{2} | ham) θ (ham, X_{2})},

Determine statistics Result.

When statistical result is more than the first setting threshold value, this sample data is categorized as rubbish contents classification, no Then, this sample data is categorized as non-junk content type.Concrete, in this step, can be in statistics This sample data, more than the first setting threshold value, during such as 0.5, is categorized as rubbish contents classification by result, otherwise, This sample data is categorized as non-junk content type.

Step 303, determine nicety of grading.

The classification being identified as according to each sample data in described test sample subset, and in step 302 In the classification that is determined, the classification that each sample data is divided in described test sample subset and this sample When the identical number of classification that notebook data is identified as is not less than setting value, determine the character modules after described training The nicety of grading of type can meet requirement, may be used for carrying out the sample data to be identified of user's input point Class, otherwise, it determines the nicety of grading of the characteristic model after described training is unsatisfactory for requirement, can be to character modules Type proceeds training, until nicety of grading meets requirement.

To the process classified of sample data to be identified of user's input with in test sample subset The process (step 301 ~ step 302) that individual sample data carries out classifying is similar to, and follow-up no longer carrying out is repeated Bright.

With the embodiment of the present application one ~ embodiment five based on same inventive concept, it is provided that following device.

Embodiment six

The embodiment of the present application six provides a kind of rubbish contents identification device, and the structure of this device can be such as Fig. 6 institute Show, including receiver module 11 and prediction module 12, it was predicted that module 12 i.e. can be understood as embodiment three offer Fig. 3 in prediction machine, wherein:

Receiver module 11 is for receiving the sample data to be identified that user provides；Prediction module 12 is for basis Naive Bayes Classifier, the first trust-factor and the second trust-factor determine the class of sample data to be identified Not, when described sample data to be identified is classified as rubbish contents classification, this sample data to be identified is represented Belong to rubbish contents, when described sample data to be identified is classified as non-junk content type, represent that this is treated Identifying that sample data belongs to non-junk content, wherein, described first trust-factor is the rubbish contents for configuration The conditional probability that in the sample data of classification, each feature occurs is respectively provided with, and described second trust-factor is It is respectively provided with for the conditional probability of each feature appearance in the sample data of the non-junk content type of configuration.

Wherein, described prediction module 12 is specifically for according to Naive Bayes Classifier, the first trust-factor Determine characteristic model with the second trust-factor, utilize the characteristic model after training to determine sample data to be identified Classification.

Described device also includes determining module 13 and training module 14, and training module 14 i.e. can be understood as reality Execute example three provide Fig. 3 in training airplane:

Determine that module 13 is for determining training sample subset, training sample set from the sample set that user provides In each sample data be identified as rubbish contents classification or non-junk content type；

Training module 14, for determining the sample data setting quantity in training sample subset, sets for described Each sample data in the sample data of determined number, utilizes this sample data to be trained characteristic model, After characteristic model is trained by each sample data in the sample data utilizing described setting quantity, Characteristic model after training.

Determine module 13 to be additionally operable to from the sample set that user provides and determine test sample subset, test sample The each sample data concentrated is identified as rubbish contents classification or non-junk content type, test sample subset Do not occur simultaneously with training sample subset；

Described device also includes judge module 15:

Judge module 15 is after training module train after characteristic model, it was predicted that module utilizes trains The sample data to be identified that user is provided by rear characteristic model is categorized as rubbish contents classification or non-junk content Before classification, utilize the characteristic model after training that the sample data in described test sample subset is categorized as rubbish Rubbish content type and non-junk content type, in described test sample subset, each sample data is divided into When the identical number of classification that classification is identified as with this sample data is not less than setting value, trigger prediction mould Block, otherwise, triggers training module.

Training module 14 is specifically for decomposing a sample data, it is thus achieved that belong to described characteristic set Fisrt feature subset, utilize each feature in this fisrt feature subset to determine the statistics for characteristic model Result；It is identified as non-junk content type in this sample data, and described statistical result is more than the first setting During threshold value, in characteristic model, for each feature in this fisrt feature subset, it is reduced to be designated non- The second trust-factor that the conditional probability that in the sample data of rubbish contents classification, this feature occurs is arranged, will mark Know and increase this feature in this sample data for the quantity of this feature in the sample data of non-junk content type Quantity, and redefine statistical result, otherwise, it determines the training of this sample data terminates, or, at this sample Notebook data is identified as rubbish contents classification, and when this statistical result is not more than the second setting threshold value, spy Levy in model, for each feature in this fisrt feature subset, be reduced to be designated the sample of rubbish contents classification The first trust-factor that the conditional probability that in notebook data, this feature occurs is arranged, will be designated rubbish contents classification Sample data in the quantity of this feature increase the quantity of this feature in this sample data, and redefine statistics As a result, otherwise, it determines the training of this sample data terminates.

Prediction module 12 is decomposed specifically for the sample data to be identified providing user, it is thus achieved that belong to The third feature subset of described characteristic set, utilizes each feature in this third feature subset to determine for instruction The statistical result of the characteristic model after white silk, when statistical result is more than the second setting threshold value, by this sample to be identified Notebook data is categorized as rubbish contents classification, otherwise, this sample data to be identified is categorized as non-junk content class Not.

Those skilled in the art are it should be appreciated that embodiments herein can be provided as method, system or meter Calculation machine program product.Therefore, the application can use complete hardware embodiment, complete software implementation or knot The form of the embodiment in terms of conjunction software and hardware.And, the application can use and wherein wrap one or more Computer-usable storage medium containing computer usable program code (include but not limited to disk memory, CD-ROM, optical memory etc.) form of the upper computer program implemented.

The application is with reference to method, equipment (system) and the computer program product according to the embodiment of the present application The flow chart of product and/or block diagram describe.It should be understood that can by computer program instructions flowchart and / or block diagram in each flow process and/or flow process in square frame and flow chart and/or block diagram and/ Or the combination of square frame.These computer program instructions can be provided to general purpose computer, special-purpose computer, embedding The processor of formula datatron or other programmable data processing device is to produce a machine so that by calculating The instruction that the processor of machine or other programmable data processing device performs produces for realizing at flow chart one The device of the function specified in individual flow process or multiple flow process and/or one square frame of block diagram or multiple square frame.

These computer program instructions may be alternatively stored in and computer or the process of other programmable datas can be guided to set In the standby computer-readable memory worked in a specific way so that be stored in this computer-readable memory Instruction produce and include the manufacture of command device, this command device realizes in one flow process or multiple of flow chart The function specified in flow process and/or one square frame of block diagram or multiple square frame.

These computer program instructions also can be loaded in computer or other programmable data processing device, makes Sequence of operations step must be performed to produce computer implemented place on computer or other programmable devices Reason, thus the instruction performed on computer or other programmable devices provides for realizing flow chart one The step of the function specified in flow process or multiple flow process and/or one square frame of block diagram or multiple square frame.

Although having been described for the preferred embodiment of the application, but those skilled in the art once knowing base This creativeness concept, then can make other change and amendment to these embodiments.So, appended right is wanted Ask and be intended to be construed to include preferred embodiment and fall into all changes and the amendment of the application scope.

Obviously, those skilled in the art can carry out various change and modification without deviating from this Shen to the application Spirit and scope please.So, if the application these amendment and modification belong to the application claim and Within the scope of its equivalent technologies, then the application is also intended to comprise these change and modification.

Claims

1. a trash content recognition method, it is characterised in that described method includes:

Receive sample data to be identified；

Sample number to be identified is determined according to Naive Bayes Classifier, the first trust-factor and the second trust-factor According to classification, when described sample data to be identified is classified as rubbish contents classification, represent this sample to be identified Notebook data belongs to rubbish contents, when described sample data to be identified is classified as non-junk content type, and table Show that this sample data to be identified belongs to non-junk content；

Wherein, described first trust-factor is for each feature in the sample data of the rubbish contents classification configured The conditional probability occurred is respectively provided with, and described second trust-factor is the non-junk content type for configuration The conditional probability that in sample data, each feature occurs is respectively provided with；

Sample number to be identified is determined according to Naive Bayes Classifier, the first trust-factor and the second trust-factor According to classification, specifically include:

Characteristic model is determined according to Naive Bayes Classifier, the first trust-factor and the second trust-factor；

The characteristic model after training is utilized to determine the classification of sample data to be identified；

Wherein, described characteristic model is represented by below equation:

\frac{P (s p a m | χ)}{1 - P (s p a m | χ)} = \frac{P (s p a m) Π_{i = 1}^{n} P (X_{i} | s p a m) θ (s p a m, X_{i})}{P (h a m) Π_{i = 1}^{n} P (X_{i} | h a m) θ (h a m, X_{i})};

Wherein, χ={ X₁,X₂,...X_n, represent feature X_i, i=1,2 ... the set of n；

Under conditions of P (spam | χ) represents that characteristic set χ occurs, sample data is the condition of rubbish contents classification Probability；

P(X_i| ham) represent feature X in the sample data being designated non-junk content type_iThe condition occurred is general Rate；

θ(spam,X_i) represent feature X in the sample data being designated rubbish contents classification_iThe conditional probability occurred The first trust-factor；

2. the method for claim 1, it is characterised in that after determining training by the following method Characteristic model:

Determining training sample subset from sample set, each sample data that training sample is concentrated is identified as rubbish Rubbish content type or non-junk content type；

Determine the sample data setting quantity in training sample subset, for the sample data of described setting quantity In each sample data, utilize this sample data that characteristic model is trained, utilizing described setting number After characteristic model is trained by each sample data in the sample data of amount, the character modules after being trained Type.

3. method as claimed in claim 2, it is characterised in that described method also includes:

Determining test sample subset from sample set, each sample data that test sample is concentrated is identified as rubbish Rubbish content type or non-junk content type, test sample subset is not occured simultaneously with training sample subset；

Then, after being trained after characteristic model, utilize training after characteristic model user is provided to be identified Before sample data is categorized as rubbish contents classification or non-junk content type, described method also includes:

The characteristic model after training is utilized the sample data in described test sample subset to be categorized as in rubbish Hold classification and non-junk content type；

The classification that each sample data is divided in described test sample subset is identified with this sample data When the identical number of classification become is not less than setting value, utilize the characteristic model after training by treating that user provides Identify that sample data is categorized as rubbish contents classification or non-junk content type, otherwise, redefine training sample The sample data setting quantity concentrated by book, continues to be trained characteristic model.

4. the method for claim 1, it is characterised in that utilize a sample data to character modules Type is trained, and specifically includes:

This sample data is decomposed, it is thus achieved that belong to the fisrt feature subset of described characteristic set；

The each feature in this fisrt feature subset is utilized to determine the statistical result for characteristic model；

It is identified as non-junk content type in this sample data, and described statistical result is more than the first setting door During limit value, in characteristic model, for each feature in this fisrt feature subset, it is reduced to be designated non-rubbish The second trust-factor that the conditional probability that in the sample data of rubbish content type, this feature occurs is arranged, will mark For the number of this feature during the quantity of this feature increases this sample data in the sample data of non-junk content type Amount, and redefine statistical result, otherwise, it determines the training of this sample data terminates；Or,

It is identified as rubbish contents classification in this sample data, and this statistical result is not more than the second setting thresholding During value, in characteristic model, for each feature in this fisrt feature subset, it is reduced to be designated in rubbish Hold the first trust-factor that the conditional probability that in the sample data of classification, this feature occurs is arranged, rubbish will be designated During in the sample data of rubbish content type, the quantity of this feature increases this sample data, the quantity of this feature, lays equal stress on Newly determined statistical result, otherwise, it determines the training of this sample data terminates.

5. the method for claim 1, it is characterised in that utilize the characteristic model after training to survey The sample data that sample book is concentrated is categorized as rubbish contents classification and non-junk content type, specifically includes:

For each sample data in test sample subset, this sample data is decomposed, it is thus achieved that belong to The second feature subset of described characteristic set, utilizes each feature in this second feature subset to determine for instruction The statistical result of the characteristic model after white silk；

When statistical result is more than the first setting threshold value, this sample data is categorized as rubbish contents classification, no Then, this sample data is categorized as non-junk content type.

6. the method for claim 1, it is characterised in that utilize the characteristic model after training to use The sample data to be identified that family provides is categorized as rubbish contents classification or non-junk content type, specifically includes:

The sample data to be identified providing user is decomposed, it is thus achieved that belong to the 3rd spy of described characteristic set Levy subset, the statistics of the characteristic model after utilizing each feature in this third feature subset to determine for training Result；

When statistical result is more than the second setting threshold value, this sample data to be identified is categorized as rubbish contents class Not, otherwise, this sample data to be identified is categorized as non-junk content type.

7. a rubbish contents identification device, it is characterised in that described device includes:

Receiver module, is used for receiving sample data to be identified；

Prediction module, for true according to Naive Bayes Classifier, the first trust-factor and the second trust-factor The classification of fixed sample data to be identified, when described sample data to be identified is classified as rubbish contents classification, Represent that this sample data to be identified belongs to rubbish contents, be classified as non-junk in described sample data to be identified During content type, represent that this sample data to be identified belongs to non-junk content, wherein, described first trust because of Son is that the conditional probability occurred for each feature in the sample data of the rubbish contents classification configured is respectively provided with , described second trust-factor is to occur for each feature in the sample data of the non-junk content type configured Conditional probability be respectively provided with；

Described prediction module, specifically for according to Naive Bayes Classifier, the first trust-factor and the second letter Appoint the factor to determine characteristic model, utilize the characteristic model after training to determine the classification of sample data to be identified；

Wherein, described characteristic model is represented by below equation:

\frac{P (s p a m | χ)}{1 - P (s p a m | χ)} = \frac{P (s p a m) Π_{i = 1}^{n} P (X_{i} | s p a m) θ (s p a m, X_{i})}{P (h a m) Π_{i = 1}^{n} P (X_{i} | h a m) θ (h a m, X_{i})};