CN106485263B - The processing method and processing device of training sample - Google Patents


Info

Publication number
CN106485263B
CN106485263B (application CN201610826098.5A)
Authority
CN
China
Prior art keywords
training sample
probability
word
training
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610826098.5A
Other languages
Chinese (zh)
Other versions
CN106485263A (en)
Inventor
Sun Hao (孙浩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201610826098.5A
Publication of CN106485263A
Application granted
Publication of CN106485263B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a training sample processing method and processing device, relating to the field of computer application technology, and solves the problem of low efficiency when training an SVM model with existing methods. The method comprises: obtaining an original database, where the original database includes all the words that the training samples may involve, and the training samples involve at least two different categories of training samples; calculating a hypothesis probability for each word in the original database based on a Bayesian algorithm, where the hypothesis probability is the likelihood that, when the word appears in a training sample, the training sample is a training sample of a certain category; extracting the words whose hypothesis probability falls within a preset range to obtain a target database; and converting the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples. The present invention is applied in the process of denoising training samples.

Description

The processing method and processing device of training sample
Technical field
The present invention relates to the field of computer application technology, and more particularly to a processing method and processing device for training samples.
Background art
A support vector machine (SVM) is a learning model used for pattern recognition, classification, and similar tasks. In practical applications, an SVM model works best on two-class classification problems and is therefore commonly used to solve them. For example, when classifying mail, an unknown mail is input into the SVM model as data to be predicted, and through the two-class classification of the SVM model, a classification result is obtained indicating whether the unknown mail is a normal email or a spam email.
In general, before an SVM model is used for classification, it must first be trained with known training samples. However, in the course of training SVM models, the inventors found that training samples of excessively high dimension usually contain a large amount of "noise data". For example, when a large number of pre-collected normal emails and spam emails are used as training samples to train an SVM model, the mail contents usually include many meaningless "noise words", such as neutral function words. If training samples containing a large amount of such "noise data" are used directly, the efficiency of training the SVM model is inevitably low.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a training sample processing method and processing device that overcome, or at least partially solve, the above problems.
In order to solve the above technical problem, in one aspect, the present invention provides a processing method for training samples, comprising:
obtaining an original database, where the original database includes all the words that the training samples may involve, and the training samples involve at least two different categories of training samples;
calculating a hypothesis probability for each word in the original database based on a Bayesian algorithm, where the hypothesis probability is the likelihood that, when the word appears in a training sample, the training sample is a training sample of a certain category;
extracting the words whose hypothesis probability falls within a preset range to obtain a target database;
converting the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples.
Specifically, the training samples include two categories of training samples, namely first-class training samples and second-class training samples, and calculating the hypothesis probability of a word in the original database based on the Bayesian algorithm comprises:
calculating the occurrence probability of the word in each category of training samples;
calculating, for each category, the proportion of that category's sample count in the total sample count of the training samples of all categories;
multiplying the occurrence probability of the word in the first-class training samples by the proportion of the first-class sample count in the total sample count of all categories, to obtain a first probability, where the first probability is the probability that a training sample is a first-class training sample and contains the word;
multiplying the occurrence probability of the word in the second-class training samples by the proportion of the second-class sample count in the total sample count of all categories, to obtain a second probability, where the second probability is the probability that a training sample is a second-class training sample and contains the word;
calculating the hypothesis probability of the word based on the first probability and the second probability.
Specifically, calculating the hypothesis probability of the word based on the first probability and the second probability comprises:
calculating, according to the following formula, the likelihood that a training sample containing the word is a first-class training sample:
Pr(H1|W) = Pa / (Pa + Pb)
or calculating, according to the following formula, the likelihood that a training sample containing the word is a second-class training sample:
Pr(H2|W) = Pb / (Pa + Pb)
where Pa is the first probability and Pb is the second probability.
Specifically, before the words whose hypothesis probability falls within the preset range are extracted to obtain the target database, the method further comprises:
determining the preset range according to a relational expression obtained by linearly combining a preset parameter, the maximum hypothesis probability, and the minimum hypothesis probability, where the preset parameter is used to determine the dimension of the training sample matrix.
Specifically, determining the preset range according to the relational expression obtained by linearly combining the preset parameter, the maximum hypothesis probability, and the minimum hypothesis probability comprises:
determining a first threshold according to the following formula:
θP1+(1-θ)Pn
determining a second threshold according to the following formula:
(1-θ)P1+θPn
where P1 is the maximum hypothesis probability, Pn is the minimum hypothesis probability, θ is the preset parameter, and the first threshold is less than the second threshold;
determining, as the preset range, the numerical range no greater than the first threshold together with the numerical range no less than the second threshold.
Specifically, converting the training samples into training sample matrices based on the words in the target database to obtain the denoised training samples comprises:
setting a multidimensional matrix for each training sample, where the elements of the multidimensional matrix correspond one-to-one to the words in the target database;
matching the words contained in the training sample against the words in the target database;
assigning values to the elements of the multidimensional matrix according to the matching result, to obtain the training sample matrix.
Specifically, assigning values to the elements of the multidimensional matrix according to the matching result comprises:
assigning the value 1 to an element of the multidimensional matrix if the word corresponding to that element appears in the training sample;
assigning the value 0 to an element of the multidimensional matrix if the word corresponding to that element does not appear in the training sample.
Specifically, after the denoised training samples are obtained, the method further comprises:
training a support vector machine (SVM) model with the denoised training samples to obtain a training result vector;
screening the words in the target database according to the non-zero coefficients in the training result vector to obtain a classification match set;
classifying the data to be predicted according to the classification match set.
Specifically, screening the words in the target database according to the non-zero coefficients in the training result vector to obtain the classification match set comprises:
determining each non-zero coefficient in the training result vector as the coefficient of the word in the target database corresponding to that non-zero coefficient;
retaining the words in the target database that have coefficients, to obtain the classification match set.
In another aspect, the present invention provides a processing device for training samples, comprising:
an acquiring unit, configured to obtain an original database, where the original database includes all the words that the training samples may involve, and the training samples involve at least two different categories of training samples;
a computing unit, configured to calculate the hypothesis probability of each word in the original database based on a Bayesian algorithm, where the hypothesis probability is the likelihood that, when the word appears in a training sample, the training sample is a training sample of a certain category;
an extraction unit, configured to extract the words whose hypothesis probability falls within a preset range to obtain a target database;
a converting unit, configured to convert the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples.
Specifically, the computing unit comprises:
a first computing module, configured, where the training samples include two categories of training samples (first-class and second-class training samples), to calculate the occurrence probability of the word in each category of training samples;
a second computing module, configured to calculate, for each category, the proportion of that category's sample count in the total sample count of the training samples of all categories;
a first multiplication module, configured to multiply the occurrence probability of the word in the first-class training samples by the proportion of the first-class sample count in the total sample count of all categories, to obtain a first probability, where the first probability is the probability that a training sample is a first-class training sample and contains the word;
a second multiplication module, configured to multiply the occurrence probability of the word in the second-class training samples by the proportion of the second-class sample count in the total sample count of all categories, to obtain a second probability, where the second probability is the probability that a training sample is a second-class training sample and contains the word;
a third computing module, configured to calculate the hypothesis probability of the word based on the first probability and the second probability.
Specifically, the third computing module is configured to:
calculate, according to the following formula, the likelihood that a training sample containing the word is a first-class training sample:
Pr(H1|W) = Pa / (Pa + Pb)
calculate, according to the following formula, the likelihood that a training sample containing the word is a second-class training sample:
Pr(H2|W) = Pb / (Pa + Pb)
where Pa is the first probability and Pb is the second probability.
Specifically, the device further comprises:
a determination unit, configured to determine, before the words whose hypothesis probability falls within the preset range are extracted to obtain the target database, the preset range according to a relational expression obtained by linearly combining a preset parameter, the maximum hypothesis probability, and the minimum hypothesis probability, where the preset parameter is used to determine the dimension of the training sample matrix.
Specifically, the determination unit comprises:
a first determining module, configured to determine a first threshold according to the following formula:
θP1+(1-θ)Pn
a second determining module, configured to determine a second threshold according to the following formula:
(1-θ)P1+θPn
where P1 is the maximum hypothesis probability, Pn is the minimum hypothesis probability, θ is the preset parameter, and the first threshold is less than the second threshold;
a third determining module, configured to determine, as the preset range, the numerical range no greater than the first threshold together with the numerical range no less than the second threshold.
Specifically, the converting unit comprises:
a setup module, configured to set a multidimensional matrix for each training sample, where the elements of the multidimensional matrix correspond one-to-one to the words in the target database;
a matching module, configured to match the words contained in the training sample against the words in the target database;
an assignment module, configured to assign values to the elements of the multidimensional matrix according to the matching result, to obtain the training sample matrix.
Specifically, the assignment module is configured to:
assign the value 1 to an element of the multidimensional matrix if the word corresponding to that element appears in the training sample;
assign the value 0 to an element of the multidimensional matrix if the word corresponding to that element does not appear in the training sample.
Specifically, the device further comprises:
a training unit, configured to train, after the denoised training samples are obtained, a support vector machine (SVM) model with the denoised training samples to obtain a training result vector;
an obtaining unit, configured to screen the words in the target database according to the non-zero coefficients in the training result vector to obtain a classification match set;
a classification unit, configured to classify the data to be predicted according to the classification match set.
Specifically, the obtaining unit comprises:
a fourth determining module, configured to determine each non-zero coefficient in the training result vector as the coefficient of the word in the target database corresponding to that non-zero coefficient;
an obtaining module, configured to retain the words in the target database that have coefficients, to obtain the classification match set.
Through the above technical solution, the training sample processing method and processing device provided by the present invention can first obtain an original database that includes all the words the training samples may involve, where the training samples involve at least two different categories; second, calculate the hypothesis probability of each word in the original database based on the Bayesian algorithm, where the hypothesis probability is the likelihood that a training sample containing the word is a training sample of a certain category; then extract the words whose hypothesis probability falls within a preset range to obtain a target database; and finally convert the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples. Compared with the prior art, the target database omits, relative to the original database, the words whose hypothesis probability does not fall within the preset range, i.e., the noise words that contribute little to determining the category of a training sample. When the training samples are converted into training sample matrices based on the target database, the noise words contained in the training samples are no longer taken into account, so denoised training samples are obtained, and training the SVM model with the denoised training samples can improve the training efficiency.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features, and advantages of the present invention more comprehensible, specific embodiments of the present invention are set forth below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to a person of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a training sample processing method provided by an embodiment of the present invention;
Fig. 2 shows a flowchart of another training sample processing method provided by an embodiment of the present invention;
Fig. 3 shows a block diagram of a training sample processing device provided by an embodiment of the present invention;
Fig. 4 shows a block diagram of another training sample processing device provided by an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood and so that the scope of the present disclosure can be fully conveyed to those skilled in the art.
To solve the problem that training an SVM model with existing methods is inefficient, an embodiment of the present invention provides a training sample processing method. As shown in Fig. 1, the method comprises:
101. Obtain an original database.
Here, the original database includes all the words that the training samples may involve, and the training samples involve at least two different categories of training samples. To illustrate "all the words that may be involved": assuming the training samples are Chinese text data, the words that may be involved are all Chinese vocabulary items. In this embodiment, the training samples can be used to train an SVM model, so as to obtain an SVM model that classifies data to be predicted. The different categories of training samples may be different categories of mail, including normal emails and spam emails, or different categories of news, such as sports news, financial news, and entertainment news. It should be noted that although SVM is a typical two-class model, it can also be applied to the classification of data of three or more categories. In specific applications, the multi-class problem is decomposed into multiple two-class problems to be solved; common decomposition strategies include the one-versus-rest method, the one-versus-one method, and the directed acyclic graph method. The categories of training samples involved in this embodiment may therefore be two or more. If the training samples involve two categories of training samples, the training sample processing method of this embodiment is used directly; if the training samples involve more than two categories, the multi-class classification problem is first decomposed into multiple two-class classification problems, and the training sample processing method of this embodiment is applied when processing the training samples corresponding to each two-class problem.
The original database is obtained so that the training samples, which are composed of natural language, can be converted, based on the original database, into training samples on which mathematical models such as an SVM model can compute.
Not every word in the original database is meaningful for training the SVM model, i.e., meaningful for classifying the data to be predicted. Meaningless words are typically neutral function words, and such meaningless words are referred to as "noise words". If the training samples composed of natural language are converted on the basis of these "noise words", the converted training samples will inevitably have an adverse effect on the training of the SVM model. The "noise words" in the original database therefore need to be removed.
To remove the "noise words" from the original database, the "noise words" in the original database must first be identified, and the Bayesian algorithm is a method that distinguishes well whether a word is a "noise word". The specific application of the Bayesian algorithm to remove "noise words" from the original database is shown in steps 102 and 103.
102. Calculate the hypothesis probability of each word in the original database based on the Bayesian algorithm.
Here, calculating the hypothesis probability of the words in the original database based on the Bayesian algorithm may mean calculating, based on the Bayesian algorithm, the hypothesis probability corresponding to each word in the original database. The hypothesis probability indicates how likely a training sample is to be a training sample of a certain category when the word appears in it. The hypothesis probability of each word in the original database is calculated so that the words in the original database can subsequently be screened according to the hypothesis probability.
103. Extract the words whose hypothesis probability falls within a preset range, to obtain a target database.
The hypothesis probabilities of all the words in the original database are compared with the preset range, and the words within the preset range are extracted to obtain the target database. The preset range is set according to whether a word is meaningful for classifying the data to be predicted. The words extracted in this embodiment are those that are meaningful for classifying the data to be predicted, i.e., words that contribute to determining the category of the data to be predicted. In practical applications, the definition of "meaningful" determines the specific setting of the preset range.
104. Convert the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples.
When a training sample is converted into a training sample matrix based on the original database (i.e., into a training sample on which mathematical models such as an SVM model can compute), each element of the matrix usually indicates whether the corresponding word of the original database occurs in the training sample; this is how a training sample composed of natural language is turned into one that mathematical models such as an SVM model can process. The resulting training sample matrix therefore records the occurrence both of words that contribute to determining the category of the data to be predicted and of words that contribute nothing. In this embodiment, however, the training samples are converted based on the target database. Since the words in the target database all contribute to determining the category of the data to be predicted, and the non-contributing "noise words" have been removed, the elements of the resulting training sample matrix record only the occurrence of the contributing words. The training sample matrix obtained after conversion based on the target database is therefore the denoised training sample.
Further, since the meaningless noise words are removed, the denoised sample matrix also has a lower dimension than a sample matrix obtained based on the original sample database, which reduces the amount of computation when training the SVM model and thus further increases the rate of SVM model training.
The training sample processing method provided by the embodiment of the present invention can first obtain an original database that includes all the words the training samples may involve, where the training samples involve at least two different categories; second, calculate the hypothesis probability of each word in the original database based on the Bayesian algorithm, where the hypothesis probability is the likelihood that a training sample containing the word is a training sample of a certain category; then extract the words whose hypothesis probability falls within a preset range to obtain a target database; and finally convert the training samples into training sample matrices based on the words in the target database, to obtain the denoised training samples. Compared with the prior art, the target database omits the words whose hypothesis probability does not fall within the preset range, i.e., the noise words that contribute little to determining the category of a training sample. When the training samples are converted into training sample matrices based on the target database, the noise words contained in the training samples are no longer taken into account, so denoised training samples are obtained, and training the SVM model with the denoised training samples can improve the training efficiency.
Further, as a refinement and extension of the method shown in Fig. 1, another embodiment of the present invention provides a training sample processing method. As shown in Fig. 2, the method comprises:
It should first be noted that this embodiment is described with reference to training samples of two categories, first-class training samples and second-class training samples.
201. Obtain an original database.
The implementation of obtaining the original database in this step is the same as in step 101 of Fig. 1. In this step, the training samples involve two different categories of training samples: first-class training samples and second-class training samples.
202. Calculate the hypothesis probability of each word in the original database based on the Bayesian algorithm.
Here, the hypothesis probability indicates how likely a training sample is to be a training sample of a certain category when a word of the original database appears in it. Taking any word in the original database as an example, the process of calculating the hypothesis probability from the training samples comprises the following steps:
First, calculate the occurrence probabilities of the word in the first-class training samples and in the second-class training samples, denoted Pr(W|H1) and Pr(W|H2) respectively.
Specifically, Pr(W|H1) is obtained by dividing the number of first-class training samples in which the word occurs by the number of first-class training samples, and likewise Pr(W|H2) is obtained by dividing the number of second-class training samples in which the word occurs by the number of second-class training samples. A concrete example: suppose there are 2000 first-class training samples and 2000 second-class training samples. If the word occurs in 500 first-class training samples, then Pr(W|H1) = 500/2000 = 0.25; if the word occurs in 400 second-class training samples, then Pr(W|H2) = 400/2000 = 0.2.
Second, calculate the proportions of the first-class sample count and the second-class sample count in the total number of training samples of all categories, denoted Pr(H1) and Pr(H2) respectively.
Specifically, Pr(H1) is obtained by dividing the number of first-class training samples by the total number of training samples (i.e., the sum of the first-class and second-class sample counts); likewise, Pr(H2) is obtained by dividing the number of second-class training samples by the total number of training samples.
Third, multiply the occurrence probability of the word in the first-class training samples by the proportion of the first-class sample count in the total, to obtain the first probability, i.e., the probability that a training sample is a first-class training sample and contains the word; and multiply the occurrence probability of the word in the second-class training samples by the proportion of the second-class sample count in the total, to obtain the second probability, i.e., the probability that a training sample is a second-class training sample and contains the word.
Specifically, the first probability is Pa = Pr(W|H1)·Pr(H1), and the second probability is Pb = Pr(W|H2)·Pr(H2).
Finally, calculate the hypothesis probability of the word based on the first probability and the second probability.
In this embodiment, the hypothesis probability of a word can be defined in two ways. Under the first definition, the hypothesis probability of the word is the likelihood that a training sample containing the word is a first-class training sample, denoted Pr(H1|W), and is calculated as:
Pr(H1|W) = Pa / (Pa + Pb)
Under the second definition, the hypothesis probability of the word is the likelihood that a training sample containing the word is a second-class training sample, denoted Pr(H2|W), and is calculated as:
Pr(H2|W) = Pb / (Pa + Pb)
where Pa is the first probability described above and Pb is the second probability.
It should be noted that in actual application one of the two definitions of the hypothesis probability is chosen and used.
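As an illustration of the calculation in step 202, the following minimal Python sketch computes the hypothesis probability of a word under the second definition; the function name and the representation of each sample as a set of words are hypothetical illustrations, not part of the patent:

    from typing import List, Set

    def hypothesis_probability(word: str,
                               class1_samples: List[Set[str]],
                               class2_samples: List[Set[str]]) -> float:
        """Pr(H2|W): likelihood that a sample containing the word is second-class."""
        n1, n2 = len(class1_samples), len(class2_samples)
        # Occurrence probability of the word within each class: Pr(W|H1), Pr(W|H2)
        pr_w_h1 = sum(word in s for s in class1_samples) / n1
        pr_w_h2 = sum(word in s for s in class2_samples) / n2
        # Class proportions: Pr(H1), Pr(H2)
        pr_h1, pr_h2 = n1 / (n1 + n2), n2 / (n1 + n2)
        pa = pr_w_h1 * pr_h1   # first probability
        pb = pr_w_h2 * pr_h2   # second probability
        if pa + pb == 0:       # word absent from every sample
            return 0.0
        return pb / (pa + pb)

Swapping the numerator to pa yields Pr(H1|W), the first definition.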
203. Determine the preset range according to a relational expression obtained by linearly combining the preset parameter, the maximum hypothesis probability, and the minimum hypothesis probability.
After the hypothesis probabilities of the words in the original database are obtained in step 202, the noise words in the original database need to be removed according to the preset range, so the preset range must first be determined. Determining the preset range comprises the following steps:
First, a first threshold is determined according to the following formula:
θP1+(1-θ)Pn
Second, a second threshold is determined according to the following formula:
(1-θ)P1+θPn
where P1 is the maximum among the hypothesis probabilities of all the words in the original database, Pn is the minimum among the hypothesis probabilities of all the words in the original database, and θ is the preset parameter. The preset parameter is used to determine the dimension of the training sample matrix: the larger the preset parameter, the larger the dimension of the resulting training sample matrix. In this embodiment, the value range of θ is (0, 0.5).
Finally, the numerical range no greater than the first threshold, together with the numerical range no less than the second threshold, is determined as the preset range, where the first threshold is less than the second threshold.
A concrete example of determining the preset range: if P1 = 0.8, Pn = 0.1, and θ = 0.4, then the first threshold is 0.4×0.8 + 0.6×0.1 = 0.38 and the second threshold is 0.6×0.8 + 0.4×0.1 = 0.52, so the preset range is (0, 0.38] ∪ [0.52, 1).
204. Extract the words whose hypothesis probability falls within the preset range, to obtain a target database.
The hypothesis probabilities of the words obtained in step 202 are compared with the preset range, and the words whose hypothesis probability falls within the preset range are extracted to obtain the target database.
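A minimal sketch of steps 203 and 204 under the same hypothetical conventions, where probs maps each word of the original database to its hypothesis probability:

    def preset_range(probs: dict, theta: float = 0.4):
        """First and second thresholds from the max/min hypothesis probabilities."""
        p1, pn = max(probs.values()), min(probs.values())
        first = theta * p1 + (1 - theta) * pn
        second = (1 - theta) * p1 + theta * pn
        return first, second

    def target_database(probs: dict, theta: float = 0.4) -> set:
        """Keep words whose hypothesis probability lies in (0, first] or [second, 1)."""
        first, second = preset_range(probs, theta)
        return {w for w, p in probs.items() if p <= first or p >= second}

With the example values above (P1 = 0.8, Pn = 0.1, θ = 0.4), preset_range returns (0.38, 0.52).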
205. Convert the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples.
Specifically, converting the training samples into training sample matrices based on the words in the target database to obtain the denoised training samples comprises the following steps:
First, a multidimensional matrix is set for each training sample, where the elements of the multidimensional matrix correspond one-to-one to the words in the target database.
Second, the words contained in the training sample are matched against the words in the target database.
Finally, values are assigned to the elements of the multidimensional matrix according to the matching result to obtain the training sample matrix; the training sample matrix is the denoised training sample.
Specifically, assigning values to the elements of the multidimensional matrix according to the matching result comprises: if the word corresponding to an element of the multidimensional matrix appears in the training sample, the element is assigned the value 1; if the word corresponding to an element does not appear in the training sample, the element is assigned the value 0.
The conversion of training samples into training sample matrices based on the words in the target database is illustrated by an example. Suppose the words in the target database are red, orange, yellow, green, cyan, blue, and purple; a certain first-class training sample contains the words red, orange, and yellow; and a certain second-class training sample contains the words green, blue, and purple. The process of converting these two training samples into training sample matrices based on the target database is as follows:
First, a 7-dimensional matrix is set for each of the two training samples, where the elements of each matrix correspond one-to-one to red, orange, yellow, green, cyan, blue, and purple in the target database. Second, the words contained in each of the two training samples are matched against the words in the target database. For the first-class training sample, only the three words red, orange, and yellow among the seven words of the target database appear, so the elements corresponding to red, orange, and yellow in its 7-dimensional matrix are assigned the value 1 and the others are assigned the value 0, giving the training sample matrix [1,1,1,0,0,0,0]^T of the first-class training sample in this example. Similarly, the training sample matrix of the second-class training sample in this example is [0,0,0,0,1,1,1]^T.
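The conversion of step 205 can be sketched as follows, reproducing the example just given (hypothetical Python, one binary element per target-database word):

    def to_matrix(sample_words, vocab):
        """Binary vector: 1 if the vocab word occurs in the sample, else 0."""
        words = set(sample_words)
        return [1 if w in words else 0 for w in vocab]

    vocab = ["red", "orange", "yellow", "green", "cyan", "blue", "purple"]
    print(to_matrix(["red", "orange", "yellow"], vocab))  # [1, 1, 1, 0, 0, 0, 0]
    print(to_matrix(["green", "blue", "purple"], vocab))  # [0, 0, 0, 0, 1, 1, 1]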
206. Train the SVM model with the denoised training samples to obtain a training result vector.
The denoised training samples obtained in step 205 are used to train the SVM model to obtain a training result vector. Specifically, the training result vector obtained for the SVM model is a support vector, i.e., the vector corresponding to the classification boundary that separates different categories of data. In the process of obtaining the support vector, the different categories of training samples also need to be distinguished by different labels; for example, the label of the first-class training samples may be written as 1 and the label of the second-class training samples as -1.
207. Screen the words in the target database according to the non-zero coefficients in the training result vector, to obtain a classification match set.
Specifically, obtaining the classification match set comprises: first, each non-zero coefficient in the training result vector is determined as the coefficient of the word in the target database corresponding to that non-zero coefficient. It should be noted that the training result vector is obtained from the denoised training samples, so there is a correspondence between the dimensions of the training result vector and the dimensions of the denoised training samples; in addition, there is a correspondence between the dimensions of the denoised training samples and the words in the target database, so the dimensions of the training result vector correspond to the words in the target database, and each coefficient in the training result vector is the coefficient of its corresponding dimension. The word in the target database corresponding to each non-zero coefficient in the training result vector can therefore be found. Then, the words in the target database that have coefficients are retained, to obtain the classification match set.
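Steps 206 and 207 can be sketched as below. The patent does not name an SVM implementation; scikit-learn's LinearSVC is used here as a stand-in linear SVM, with its weight vector playing the role of the training result vector (an L1 penalty is chosen so that some coefficients are exactly zero):

    import numpy as np
    from sklearn.svm import LinearSVC

    def classification_match_set(X, y, vocab):
        """Train a linear SVM and keep the words with non-zero coefficients."""
        model = LinearSVC(penalty="l1", dual=False)  # sparse weight vector
        model.fit(X, y)                              # y: labels 1 / -1
        w = model.coef_.ravel()                      # training result vector
        return {vocab[i]: w[i] for i in np.flatnonzero(w)}

Here X is the list of denoised training sample matrices (one row per sample) and vocab is the ordered list of words in the target database.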
208. Classify the data to be predicted according to the classification match set.
The specific process of classifying the data to be predicted according to the classification match set is as follows:
First, perform multi-pattern matching of the data to be predicted against the classification match set. Multi-pattern matching is the process of finding multiple pattern strings within one string; in this embodiment it specifically means finding, in the data to be predicted, the words of the classification match set. There are many multi-pattern matching algorithms; common ones include the dictionary tree (Trie tree), the Aho-Corasick (AC) automaton algorithm, and the Wu-Manber (WM) algorithm. This embodiment does not restrict which specific multi-pattern matching algorithm is used for the matching. The final result of the matching is the determination of the classification-match-set words contained in the data to be predicted.
Second, accumulate the coefficients, in the classification match set, of the words that appear both in the classification match set and in the data to be predicted.
Third, classify the data to be predicted according to the accumulated result.
Classifying the data to be predicted according to the accumulated result means determining the category of the data to be predicted. The specific process is: a threshold range is set for each category, with no intersection between the threshold ranges of different categories; the accumulated result is then compared with all the threshold ranges, and the data to be predicted belongs to the category whose threshold range contains the accumulated result.
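A sketch of step 208 under the same assumptions; a plain substring scan stands in for the trie, Aho-Corasick, or Wu-Manber matching the embodiment mentions, and the threshold ranges shown are purely illustrative:

    def classify(text: str, match_set: dict, ranges: dict):
        """Accumulate coefficients of match-set words found in the text,
        then pick the category whose (lo, hi) range contains the score."""
        score = sum(coef for word, coef in match_set.items() if word in text)
        for category, (lo, hi) in ranges.items():
            if lo <= score <= hi:
                return category
        return None  # score fell outside every configured range

    # hypothetical, non-overlapping threshold ranges
    ranges = {"normal": (0.0, 10.0), "spam": (-10.0, -0.001)}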
Further, the above training sample processing method is described in detail below by taking mail samples as an example.
Suppose the mail samples include 20000 normal email samples and 20000 spam email samples.
First, obtain the original database corresponding to the mail samples. Assuming the mail samples are all Chinese emails, the corresponding original database is the database composed of all Chinese words.
Second, calculate the hypothesis probability of each Chinese word. In this embodiment, suppose the hypothesis probability is defined as the likelihood that a mail sample containing the Chinese word is a spam email. The calculation process is the same for every Chinese word, so the calculation is illustrated for one Chinese word:
Suppose a Chinese word occurred in 10000 of the normal email samples and in 4000 of the spam samples. Then the probability that the Chinese word appears in a normal email sample is Pr(W|H1) = 10000/20000 = 0.5, and the probability that it appears in a spam sample is Pr(W|H2) = 4000/20000 = 0.2.
The proportion of normal email samples in the total number of normal email and spam samples is Pr(H1) = 20000/40000 = 0.5, and the proportion of spam samples in the total is Pr(H2) = 20000/40000 = 0.5.
The hypothesis probability of this Chinese word is then:
Pr(H2|W) = Pr(W|H2)Pr(H2) / (Pr(W|H1)Pr(H1) + Pr(W|H2)Pr(H2)) = (0.2×0.5) / (0.5×0.5 + 0.2×0.5) ≈ 0.286
Following the above method of calculating the hypothesis probability, the hypothesis probability of each Chinese word is obtained.
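The worked numbers above can be checked directly, or with the hypothesis_probability sketch given earlier (hypothetical Python):

    pa = 0.5 * 0.5                    # Pr(W|H1) * Pr(H1)
    pb = 0.2 * 0.5                    # Pr(W|H2) * Pr(H2)
    print(round(pb / (pa + pb), 3))   # 0.286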
Third, determine the preset range from the preset parameter and the maximum and minimum values among the hypothesis probabilities of all the Chinese words.
Suppose the maximum hypothesis probability among all the Chinese words is P1 = 0.8, the minimum hypothesis probability is Pn = 0.1, and the preset parameter is θ = 0.4. Then the first threshold is calculated as 0.4×0.8 + 0.6×0.1 = 0.38 and the second threshold as 0.6×0.8 + 0.4×0.1 = 0.52, so the preset range is (0, 0.38] ∪ [0.52, 1).
Fourth, extract the Chinese words whose hypothesis probability falls within the preset range (0, 0.38] ∪ [0.52, 1), and form the target database from all the extracted Chinese words. The Chinese word obtained above with hypothesis probability 0.286, for example, is included in the target database.
Fifth, convert each mail in the mail samples into a corresponding training sample matrix based on the obtained target database, to obtain the denoised training samples.
Suppose the obtained target database contains 5000 Chinese words. Then a 5000-dimensional matrix [a1, a2, ..., a4999, a5000]^T is established for each mail in the mail samples, where each element corresponds to one Chinese word in the target database. The training sample matrix corresponding to each mail in the mail samples is obtained as follows:
Each mail is matched against the Chinese words in the target database. If a mail contains a certain Chinese word in the target database, the element ai (i = 1, 2, ..., 5000) corresponding to that Chinese word is assigned the value 1; if a mail does not contain a certain Chinese word in the target database, the element ai corresponding to that Chinese word is assigned the value 0. For example, if a mail contains all the Chinese words in the target database, the resulting training sample matrix is [1, 1, ..., 1, 1]^T, where the omitted middle elements are all 1; if a mail contains none of the Chinese words in the target database, the resulting training sample matrix is [0, 0, ..., 0, 0]^T, where the omitted middle elements are all 0.
Sixth, use the training sample matrices corresponding to all the mails in the mail samples as training samples to train the SVM model, where the training sample matrix corresponding to a normal email is labeled 1 and the training sample matrix corresponding to a spam email is labeled -1. After training, a support vector B is obtained, and the non-zero coefficients (non-zero elements) in B are determined as the coefficients of the corresponding Chinese words in the target database. Suppose the Chinese words corresponding to the non-zero coefficients in the target database are W1, W2, W3, ..., W300; these 300 Chinese words form the classification match set, and their corresponding coefficients are written b1, b2, b3, ..., b300.
Seventh, classify a mail to be predicted according to the obtained classification match set.
If the mail to be predicted contains the first 100 Chinese words of the classification match set, calculate:
b1 + b2 + b3 + ... + b100 (the accumulated coefficients of the matched words), then compare the result of the calculation with the preset threshold ranges: if the result belongs to the preset threshold range of normal emails, the mail to be predicted is determined to be a normal email; if the result belongs to the preset threshold range of spam emails, the mail to be predicted is determined to be a spam email.
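Putting the earlier sketches together, the whole mail pipeline might look like the following, where normal_mails, spam_mails (each mail as a set of words), original_database, and unknown_mail_text are hypothetical inputs, and the threshold ranges are illustrative:

    probs = {w: hypothesis_probability(w, normal_mails, spam_mails)
             for w in original_database}
    vocab = sorted(target_database(probs, theta=0.4))
    X = [to_matrix(mail, vocab) for mail in normal_mails + spam_mails]
    y = [1] * len(normal_mails) + [-1] * len(spam_mails)
    match_set = classification_match_set(X, y, vocab)
    label = classify(unknown_mail_text, match_set,
                     {"normal": (0.0, 10.0), "spam": (-10.0, -0.001)})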
Further, as an implementation of the above embodiments, another embodiment of the present invention also provides a training sample processing device for implementing the methods described in Fig. 1 and Fig. 2. As shown in Fig. 3, the device comprises: an acquiring unit 31, a computing unit 32, an extraction unit 33, and a converting unit 34.
The acquiring unit 31 is configured to obtain an original database, where the original database includes all the words that the training samples may involve, and the training samples involve at least two different categories of training samples.
Here, to illustrate "all the words that may be involved": assuming the training samples are Chinese text data, the words that may be involved are all Chinese vocabulary items. In this embodiment, the training samples can be used to train an SVM model, so as to obtain an SVM model that classifies data to be predicted. The different categories of training samples may be different categories of mail, including normal emails and spam emails, or different categories of news, such as sports news, financial news, and entertainment news.
The computing unit 32 is configured to calculate the hypothesis probability of each word in the original database based on the Bayesian algorithm, where the hypothesis probability is the likelihood that, when the word appears in a training sample, the training sample is a training sample of a certain category.
Here, calculating the hypothesis probability of the words in the original database based on the Bayesian algorithm may mean calculating, based on the Bayesian algorithm, the hypothesis probability corresponding to each word in the original database. The hypothesis probability of each word in the original database is calculated so that the words in the original database can subsequently be screened according to the hypothesis probability.
The extraction unit 33 is configured to extract the words whose hypothesis probability falls within a preset range, to obtain a target database.
The hypothesis probabilities of all the words in the original database are compared with the preset range, and the words within the preset range are extracted to obtain the target database. The preset range is set according to whether a word is meaningful for classifying the data to be predicted. The words extracted in this embodiment are those that are meaningful for classifying the data to be predicted, i.e., words that contribute to determining the category of the data to be predicted. In practical applications, the definition of "meaningful" determines the specific setting of the preset range.
The converting unit 34 is configured to convert the training samples into training sample matrices based on the words in the target database, to obtain denoised training samples.
When a training sample is converted into a training sample matrix based on the original database (i.e., into a training sample on which mathematical models such as an SVM model can compute), each element of the matrix usually indicates whether the corresponding word of the original database occurs in the training sample; this is how a training sample composed of natural language is turned into one that mathematical models such as an SVM model can process. The resulting training sample matrix therefore records the occurrence both of words that contribute to determining the category of the data to be predicted and of words that contribute nothing. In this embodiment, however, the training samples are converted based on the target database. Since the words in the target database all contribute to determining the category of the data to be predicted, and the non-contributing "noise words" have been removed, the elements of the resulting training sample matrix record only the occurrence of the contributing words. The training sample matrix obtained after conversion based on the target database is therefore the denoised training sample.
Further, since the meaningless noise words are removed, the denoised sample matrix also has a lower dimension than a sample matrix obtained based on the original sample database, which reduces the amount of computation when training the SVM model and thus further increases the rate of SVM model training.
Further, as shown in Fig. 4, the computing unit 32 comprises:
a first computing module 321, configured, where the training samples include two categories of training samples (first-class and second-class training samples), to calculate the occurrence probability of the word in each category of training samples;
the occurrence probabilities of the word in the first-class training samples and in the second-class training samples are denoted Pr(W|H1) and Pr(W|H2) respectively;
specifically, Pr(W|H1) is obtained by dividing the number of first-class training samples in which the word occurs by the number of first-class training samples, and likewise Pr(W|H2) is obtained by dividing the number of second-class training samples in which the word occurs by the number of second-class training samples.
a second computing module 322, configured to calculate, for each category, the proportion of that category's sample count in the total sample count of the training samples of all categories;
the proportions of the first-class and second-class sample counts in the total sample count of the training samples of all categories are denoted Pr(H1) and Pr(H2) respectively;
specifically, Pr(H1) is obtained by dividing the number of first-class training samples by the total number of training samples (i.e., the sum of the first-class and second-class sample counts); likewise, Pr(H2) is obtained by dividing the number of second-class training samples by the total number of training samples.
a first multiplication module 323, configured to multiply the occurrence probability of the word in the first-class training samples by the proportion of the first-class sample count in the total sample count of all categories, to obtain a first probability, where the first probability is the probability that a training sample is a first-class training sample and contains the word;
specifically, the first probability is Pa = Pr(W|H1)·Pr(H1);
a second multiplication module 324, configured to multiply the occurrence probability of the word in the second-class training samples by the proportion of the second-class sample count in the total sample count of all categories, to obtain a second probability, where the second probability is the probability that a training sample is a second-class training sample and contains the word;
specifically, the second probability is Pb = Pr(W|H2)·Pr(H2);
a third computing module 325, configured to calculate the hypothesis probability of the word based on the first probability and the second probability.
Further, the third computing module 325 is configured to:
calculate, according to the following formula, the likelihood that a training sample containing the word is a first-class training sample:
Pr(H1|W) = Pa / (Pa + Pb)
calculate, according to the following formula, the likelihood that a training sample containing the word is a second-class training sample:
Pr(H2|W) = Pb / (Pa + Pb)
where Pa is the first probability and Pb is the second probability.
Further, as shown in Fig. 4, the device further comprises:
a determination unit 35, configured to determine, before the words whose hypothesis probability falls within the preset range are extracted to obtain the target database, the preset range according to a relational expression obtained by linearly combining a preset parameter, the maximum hypothesis probability, and the minimum hypothesis probability, where the preset parameter is used to determine the dimension of the training sample matrix.
Further, as shown in Figure 4, the determination unit 35 comprises:
a first determining module 351, configured to determine a first threshold according to the following formula:
θ·P1 + (1-θ)·Pn
a second determining module 352, configured to determine a second threshold according to the following formula:
(1-θ)·P1 + θ·Pn
where P1 is the maximum hypothesis probability, Pn is the minimum hypothesis probability, and θ is the preset parameter; the first threshold is less than the second threshold.
In this embodiment, the value range of θ is (0, 0.5).
A third determining module 353 is configured to determine, as the preset range, the numerical range not greater than the first threshold together with the numerical range not less than the second threshold.
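Assuming the hypothesis probabilities have already been computed for every word, a minimal sketch of how determination unit 35 and the extraction step might be combined (the function name and the default value of θ are illustrative assumptions):

```python
def build_target_database(hypothesis_prob, theta=0.1):
    """Select the words whose hypothesis probability lies in the preset range.

    hypothesis_prob: dict mapping word -> hypothesis probability.
    theta: preset parameter, taken from (0, 0.5) as in the embodiment;
    it indirectly controls the dimension of the training sample matrix.
    """
    p1 = max(hypothesis_prob.values())        # maximum hypothesis probability
    pn = min(hypothesis_prob.values())        # minimum hypothesis probability
    first = theta * p1 + (1 - theta) * pn     # first threshold
    second = (1 - theta) * p1 + theta * pn    # second threshold
    # Preset range: values not greater than the first threshold,
    # or not less than the second threshold.
    return [w for w, p in hypothesis_prob.items() if p <= first or p >= second]
```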
Further, as shown in Figure 4, the converting unit 34 comprises:
a setup module 341, configured to set a multi-dimensional matrix for a training sample, the elements of the multi-dimensional matrix corresponding one-to-one to the words in the target database;
a matching module 342, configured to match the words contained in the training sample against the words in the target database;
an assignment module 343, configured to assign values to the elements of the multi-dimensional matrix according to the matching result, obtaining the training sample matrix.
Further, the assignment module 343 is configured to:
assign the value 1 to an element of the multi-dimensional matrix if the word corresponding to that element appears in the training sample;
assign the value 0 to an element of the multi-dimensional matrix if the word corresponding to that element does not appear in the training sample.
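In other words, the conversion performed by modules 341 through 343 is a binary presence/absence vectorization over the target database; a minimal sketch under the same assumptions as above:

```python
def to_sample_matrix(samples, target_words):
    """Convert word-set samples into a binary training sample matrix.

    Each row has one element per word in the target database, assigned 1
    if the word appears in the sample and 0 otherwise.
    """
    return [[1 if word in sample else 0 for word in target_words]
            for sample in samples]
```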
Further, as shown in Figure 4, the device further comprises:
a training unit 36, configured to train a support vector machine (SVM) model with the denoised training samples after they are obtained, obtaining a training result vector.
Training the SVM model with the denoised training samples yields a training result vector; specifically, the training result vector obtained for the SVM model is a support vector, i.e., the vector corresponding to the classification boundary that separates data of different categories. In addition, while obtaining the support vector, the training samples of different categories must be distinguished by different labels; for example, the label of first-class training samples may be set to 1 and the label of second-class training samples to -1.
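One possible realization of training unit 36, sketched with scikit-learn's LinearSVC; the patent does not prescribe any particular SVM implementation, so the library choice and function name are assumptions. The learned weight vector, with one coefficient per target-database word, plays the role of the training result vector:

```python
from sklearn.svm import LinearSVC

def train_result_vector(sample_matrix, labels):
    """Train a linear SVM on the denoised training sample matrix.

    labels: 1 for first-class samples and -1 for second-class samples,
    matching the labeling convention described above.
    """
    model = LinearSVC()
    model.fit(sample_matrix, labels)
    return model.coef_[0]   # one coefficient per word in the target database
```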
An obtaining unit 37 is configured to screen the words in the target database according to the non-zero coefficients in the training result vector, obtaining a classification matching set.
A classification unit 38 is configured to classify data to be predicted according to the classification matching set.
The process of classifying the data to be predicted according to the classification matching set is as follows:
First, multi-pattern matching is performed on the data to be predicted against the classification matching set. Multi-pattern matching is the process of finding multiple pattern strings within one string; in this embodiment it specifically means finding, in the data to be predicted, the words of the classification matching set. Many multi-pattern matching algorithms exist; common ones include the dictionary tree (Trie tree), the Aho-Corasick automaton (AC automaton) algorithm, and the Wu-Manber (WM) algorithm. This embodiment does not restrict which specific multi-pattern matching algorithm is used. The final result of the matching is the determination of which words of the classification matching set are contained in the data to be predicted.
Second, for each word that appears both in the classification matching set and in the data to be predicted, the corresponding coefficients in the classification matching set are accumulated.
Third, the data to be predicted is classified according to the accumulation result.
Classifying the data to be predicted according to the accumulation result means determining its category. The specific process is: a threshold range is set for each category, with no intersection between the threshold ranges of different categories; the accumulation result is then compared with all the threshold ranges, and whichever threshold range the accumulation result falls into determines the category to which the data to be predicted belongs.
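A minimal sketch of this three-step prediction process; a plain membership test stands in here for a dedicated multi-pattern matcher such as the AC automaton, and the threshold ranges are illustrative assumptions:

```python
def classify(words_to_predict, matching_set, threshold_ranges):
    """Classify data to be predicted by accumulating matched coefficients.

    words_to_predict: set of words extracted from the data to be predicted.
    matching_set: dict mapping word -> coefficient (the classification
    matching set). threshold_ranges: dict mapping category -> (low, high);
    the ranges of different categories must not intersect.
    """
    # First: multi-pattern matching (which matching-set words occur).
    matched = [w for w in matching_set if w in words_to_predict]
    # Second: accumulate the corresponding coefficients.
    score = sum(matching_set[w] for w in matched)
    # Third: the category whose threshold range contains the score.
    for category, (low, high) in threshold_ranges.items():
        if low <= score <= high:
            return category
    return None   # accumulation result fell outside every threshold range
```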
Further, as shown in Figure 4, the obtaining unit 37 comprises:
a fourth determining module 371, configured to determine the non-zero coefficients in the training result vector as the coefficients of the corresponding words in the target database.
It should be noted that the training result vector is obtained from the denoised training samples, so its dimensions correspond to the dimensions of the denoised training samples; the dimensions of the denoised training samples in turn correspond to the words in the target database, so the dimensions of the training result vector correspond to the words in the target database. Each coefficient in the training result vector is the coefficient of the corresponding dimension, so the word corresponding to each non-zero coefficient in the training result vector can be found in the target database.
An obtaining module 372 is configured to retain the words in the target database that have coefficients, obtaining the classification matching set.
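Continuing the sketch, modules 371 and 372 amount to pairing each target-database word with its coefficient and keeping the non-zero ones (the tolerance eps is an assumption to absorb floating-point noise):

```python
def classification_matching_set(target_words, result_vector, eps=1e-12):
    """Keep only the words whose training-result coefficient is non-zero.

    target_words and result_vector are aligned dimension by dimension, so
    zipping them recovers the word behind each coefficient.
    """
    return {word: coef for word, coef in zip(target_words, result_vector)
            if abs(coef) > eps}
```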
The training sample processing device provided by an embodiment of the present invention can first obtain a raw database containing all words that the training samples may involve, the training samples covering at least two different categories; second, calculate the hypothesis probability of each word in the raw database based on a Bayesian algorithm, the hypothesis probability being the likelihood that a training sample belongs to a certain category when the word appears in it; then extract the words whose hypothesis probability falls within a preset range, obtaining a target database; and finally convert the training samples into training sample matrices based on the words in the target database, obtaining denoised training samples. Compared with the prior art, the target database eliminates, relative to the raw database, the words whose hypothesis probability falls outside the preset range, i.e., the noise words that contribute little to determining a training sample's category. When the training samples are converted into training sample matrices based on the target database, the noise words they contain are therefore no longer considered, so denoised training samples are obtained, and training the SVM model with the denoised training samples improves training efficiency.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
It can be understood that related features in the above methods and devices may refer to each other. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not represent the superiority or inferiority of any embodiment.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be realized in various programming languages, and the description of a specific language above is made to disclose the best mode of carrying out the invention.
Numerous specific details are set forth in the specification provided here. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the description of exemplary embodiments of the invention above, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the disclosed method is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the specific embodiments are therefore expressly incorporated into the specific embodiments, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment may be combined into one module, unit, or component, and they may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the components of the device according to embodiments of the invention (such as the training sample processing device). The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for carrying out part or all of the methods described herein. Such a program realizing the invention may be stored on a computer-readable medium, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any ordering; these words may be interpreted as names.

Claims (18)

1. A method for processing training samples, characterized in that the method comprises:
obtaining a raw database, the raw database comprising all words that training samples may involve, the training samples involving at least two different categories of training sample;
calculating a hypothesis probability of each word in the raw database based on a Bayesian algorithm, the hypothesis probability being the likelihood that, when the word appears in a training sample, the training sample is a training sample of a certain category;
extracting the words whose hypothesis probability falls within a preset range, obtaining a target database;
converting the training samples into training sample matrices based on the words in the target database, obtaining denoised training samples.
2. The method according to claim 1, characterized in that the training samples comprise training samples of two categories, first-class training samples and second-class training samples, and calculating the hypothesis probability of each word in the raw database based on the Bayesian algorithm comprises:
calculating the occurrence probability of the word in each category of training sample;
calculating, for each category, the ratio of the number of training samples of that category to the total number of training samples of all categories;
multiplying the occurrence probability of the word in the first-class training samples by the ratio of the number of first-class training samples to the total number of training samples of all categories, obtaining a first probability, the first probability being the probability that a training sample is a first-class training sample and contains the word;
multiplying the occurrence probability of the word in the second-class training samples by the ratio of the number of second-class training samples to the total number of training samples of all categories, obtaining a second probability, the second probability being the probability that a training sample is a second-class training sample and contains the word;
calculating the hypothesis probability of the word based on the first probability and the second probability.
3. The method according to claim 2, characterized in that calculating the hypothesis probability of the word based on the first probability and the second probability comprises:
calculating, according to the following formula, the likelihood that a training sample is a first-class training sample when the word appears in it:
Pa/(Pa+Pb)
or calculating, according to the following formula, the likelihood that a training sample is a second-class training sample when the word appears in it:
Pb/(Pa+Pb)
where Pa is the first probability and Pb is the second probability.
4. The method according to claim 1, characterized in that, before extracting the words whose hypothesis probability falls within the preset range and obtaining the target database, the method further comprises:
determining the preset range according to a preset parameter and a relational expression obtained by linearly combining a maximum hypothesis probability and a minimum hypothesis probability, the preset parameter being used to determine the dimension of the training sample matrix.
5. The method according to claim 4, characterized in that determining the preset range according to the preset parameter and the relational expression obtained by linearly combining the maximum hypothesis probability and the minimum hypothesis probability comprises:
determining a first threshold according to the following formula:
θ·P1 + (1-θ)·Pn
determining a second threshold according to the following formula:
(1-θ)·P1 + θ·Pn
where P1 is the maximum hypothesis probability, Pn is the minimum hypothesis probability, and θ is the preset parameter, the first threshold being less than the second threshold;
determining, as the preset range, the numerical range not greater than the first threshold together with the numerical range not less than the second threshold.
6. The method according to claim 1, characterized in that converting the training samples into training sample matrices based on the words in the target database and obtaining the denoised training samples comprises:
setting a multi-dimensional matrix for a training sample, the elements of the multi-dimensional matrix corresponding one-to-one to the words in the target database;
matching the words contained in the training sample against the words in the target database;
assigning values to the elements of the multi-dimensional matrix according to the matching result, obtaining the training sample matrix.
7. The method according to claim 6, characterized in that assigning values to the elements of the multi-dimensional matrix according to the matching result comprises:
assigning the value 1 to an element of the multi-dimensional matrix if the word corresponding to that element appears in the training sample;
assigning the value 0 to an element of the multi-dimensional matrix if the word corresponding to that element does not appear in the training sample.
8. The method according to claim 1, characterized in that, after the denoised training samples are obtained, the method further comprises:
training a support vector machine (SVM) model with the denoised training samples, obtaining a training result vector;
screening the words in the target database according to the non-zero coefficients in the training result vector, obtaining a classification matching set;
classifying data to be predicted according to the classification matching set.
9. The method according to claim 8, characterized in that screening the words in the target database according to the non-zero coefficients in the training result vector and obtaining the classification matching set comprises:
determining the non-zero coefficients in the training result vector as the coefficients of the corresponding words in the target database;
retaining the words in the target database that have coefficients, obtaining the classification matching set.
10. A device for processing training samples, characterized in that the device comprises:
an acquiring unit, configured to obtain a raw database, the raw database comprising all words that training samples may involve, the training samples involving at least two different categories of training sample;
a computing unit, configured to calculate a hypothesis probability of each word in the raw database based on a Bayesian algorithm, the hypothesis probability being the likelihood that, when the word appears in a training sample, the training sample is a training sample of a certain category;
an extraction unit, configured to extract the words whose hypothesis probability falls within a preset range, obtaining a target database;
a converting unit, configured to convert the training samples into training sample matrices based on the words in the target database, obtaining denoised training samples.
11. The device according to claim 10, characterized in that the computing unit comprises:
a first computing module, configured, for training samples comprising two categories, first-class training samples and second-class training samples, to calculate the occurrence probability of the word in each category of training sample;
a second computing module, configured to calculate, for each category, the ratio of the number of training samples of that category to the total number of training samples of all categories;
a first multiplication module, configured to multiply the occurrence probability of the word in the first-class training samples by the ratio of the number of first-class training samples to the total number of training samples of all categories, obtaining a first probability, the first probability being the probability that a training sample is a first-class training sample and contains the word;
a second multiplication module, configured to multiply the occurrence probability of the word in the second-class training samples by the ratio of the number of second-class training samples to the total number of training samples of all categories, obtaining a second probability, the second probability being the probability that a training sample is a second-class training sample and contains the word;
a third computing module, configured to calculate the hypothesis probability of the word based on the first probability and the second probability.
12. The device according to claim 11, characterized in that the third computing module is configured to:
calculate, according to the following formula, the likelihood that a training sample is a first-class training sample when the word appears in it:
Pa/(Pa+Pb)
calculate, according to the following formula, the likelihood that a training sample is a second-class training sample when the word appears in it:
Pb/(Pa+Pb)
where Pa is the first probability and Pb is the second probability.
13. The device according to claim 10, characterized in that the device further comprises:
a determination unit, configured to determine the preset range, before the words whose hypothesis probability falls within the preset range are extracted to obtain the target database, according to a preset parameter and a relational expression obtained by linearly combining a maximum hypothesis probability and a minimum hypothesis probability, the preset parameter being used to determine the dimension of the training sample matrix.
14. The device according to claim 13, characterized in that the determination unit comprises:
a first determining module, configured to determine a first threshold according to the following formula:
θ·P1 + (1-θ)·Pn
a second determining module, configured to determine a second threshold according to the following formula:
(1-θ)·P1 + θ·Pn
where P1 is the maximum hypothesis probability, Pn is the minimum hypothesis probability, and θ is the preset parameter, the first threshold being less than the second threshold;
a third determining module, configured to determine, as the preset range, the numerical range not greater than the first threshold together with the numerical range not less than the second threshold.
15. The device according to claim 10, characterized in that the converting unit comprises:
a setup module, configured to set a multi-dimensional matrix for a training sample, the elements of the multi-dimensional matrix corresponding one-to-one to the words in the target database;
a matching module, configured to match the words contained in the training sample against the words in the target database;
an assignment module, configured to assign values to the elements of the multi-dimensional matrix according to the matching result, obtaining the training sample matrix.
16. The device according to claim 15, characterized in that the assignment module is configured to:
assign the value 1 to an element of the multi-dimensional matrix if the word corresponding to that element appears in the training sample;
assign the value 0 to an element of the multi-dimensional matrix if the word corresponding to that element does not appear in the training sample.
17. The device according to claim 10, characterized in that the device further comprises:
a training unit, configured to train a support vector machine (SVM) model with the denoised training samples after they are obtained, obtaining a training result vector;
an obtaining unit, configured to screen the words in the target database according to the non-zero coefficients in the training result vector, obtaining a classification matching set;
a classification unit, configured to classify data to be predicted according to the classification matching set.
18. The device according to claim 17, characterized in that the obtaining unit comprises:
a fourth determining module, configured to determine the non-zero coefficients in the training result vector as the coefficients of the corresponding words in the target database;
an obtaining module, configured to retain the words in the target database that have coefficients, obtaining the classification matching set.
CN201610826098.5A 2016-09-14 2016-09-14 The processing method and processing device of training sample Active CN106485263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610826098.5A CN106485263B (en) 2016-09-14 2016-09-14 The processing method and processing device of training sample

Publications (2)

Publication Number Publication Date
CN106485263A CN106485263A (en) 2017-03-08
CN106485263B true CN106485263B (en) 2019-10-11

Family

ID=58267256

Country Status (1)

Country Link
CN (1) CN106485263B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102208030A (en) * 2011-06-03 2011-10-05 天津大学 Bayesian-model-averaging-based model combing method on regularization path of support vector machine
CN105447031A (en) * 2014-08-28 2016-03-30 百度在线网络技术(北京)有限公司 Training sample labeling method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8750573B2 (en) * 2010-08-02 2014-06-10 Sony Corporation Hand gesture detection

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Strategy for Training Set Selection in Text Classification Problems; Maria Luiza C. Passini et al.; (IJACSA) International Journal of Advanced Computer Science and Applications; 30 June 2013; Vol. 4, No. 6; pp. 54-60 *
A spam filtering method based on the SVM algorithm; Fan Jieting et al.; Computer Engineering and Applications; 1 October 2008; Vol. 44, No. 28; pp. 95-97 *
A sample selection algorithm based on generalized confidence; Ren Junling; Journal of Chinese Information Processing; 31 May 2007; Vol. 21, No. 3; pp. 106-110 *

Also Published As

Publication number Publication date
CN106485263A (en) 2017-03-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant