CN105069470A - Classification model training method and device - Google Patents


Info

Publication number
CN105069470A
Authority
CN
China
Prior art keywords
sample
current round
positive example
classification model
negative example
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510456761.2A
Other languages
Chinese (zh)
Inventor
叶幸春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201510456761.2A priority Critical patent/CN105069470A/en
Publication of CN105069470A publication Critical patent/CN105069470A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a classification model training method and device, belonging to the technical field of data processing. The method comprises: carrying out model training according to the current round of positive example samples and the current round of negative example samples to obtain the current round's classification model; if the current round's classification model does not satisfy a specified condition, using it to classify all samples and selecting specified samples from all samples according to the classification result; taking the current round of positive example samples together with the specified samples as the next round of positive example samples, and determining the next round of negative example samples according to the next round of positive example samples; and continuing to execute the above model training and sample processing according to the next round of positive and negative example samples until a classification model satisfying the specified condition is obtained. As the number of positive example samples increases, the potential positive examples hidden among the negative example samples decrease, which effectively improves the purity of the negative example samples; the model trained over multiple rounds on the progressively enlarged positive and negative example sets is therefore more stable and classifies with higher accuracy.

Description

Classification model training method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a classification model training method and device.
Background technology
With the development of information technology, we have entered the era of big data. For example, the various service platforms provided by merchants and enterprises collect massive amounts of user data. Much useful information is hidden in such mass data; this information can be of great help in business management, production control, market analysis, engineering design, scientific exploration and other fields, so data mining technology has attracted great attention across many domains. A basic task of data mining is to classify mass data, and data classification is usually realized on the basis of a trained classification model.
In the prior art, when training a classification model, positive example samples and negative example samples are first chosen for model training. A positive example sample is a sample that has been labeled among all the samples used to train the model; for example, the positive example samples may be a group of people sharing the same requirements or interests. The negative example samples are chosen from the unlabeled portion of all the samples, in a quantity equal to the number of positive example samples. A single round of model training is then carried out on these positive and negative example samples to obtain a classification model.
In the course of realizing the present invention, the inventor found that the prior art has at least the following problem:
When the total number of samples is small, the numbers of positive and negative example samples shrink accordingly, and some positive examples may be hidden inside the negative example set, so the purity of the negative example samples is low. Because positive and negative example samples are then poorly separated, the classification model obtained after a single round of training on such samples is unstable and of low classification accuracy, and may even underfit.
Summary of the invention
In order to solve the problems of the prior art, embodiments of the present invention provide a classification model training method and device. The technical scheme is as follows:
In one aspect, a classification model training method is provided, the method comprising:
carrying out model training according to the current round of positive example samples and the current round of negative example samples to obtain the current round's classification model;
if the current round's classification model does not satisfy a specified condition, using the current round's classification model to classify all samples, and choosing specified samples from all samples according to the classification result, the specified samples being samples that the current round's classification model predicts to be positive examples;
taking the current round of positive example samples together with the specified samples as the next round of positive example samples, and determining the next round of negative example samples according to the next round of positive example samples;
continuing to execute the above model training and sample processing according to the next round of positive example samples and the next round of negative example samples, until a classification model satisfying the specified condition is obtained.
In another aspect, a classification model training device is provided, the device comprising:
a model training module, configured to carry out model training according to the current round of positive example samples and the current round of negative example samples to obtain the current round's classification model;
a sample processing module, configured to: if the current round's classification model does not satisfy a specified condition, use the current round's classification model to classify all samples and choose specified samples from all samples according to the classification result, the specified samples being samples that the current round's classification model predicts to be positive examples; and take the current round of positive example samples together with the specified samples as the next round of positive example samples, and determine the next round of negative example samples according to the next round of positive example samples;
the model training module being further configured to continue the above model training process according to the next round of positive example samples and the next round of negative example samples;
and the sample processing module being further configured to continue the above sample processing according to each new round's classification model, until a classification model satisfying the specified condition is obtained.
The beneficial effects brought by the technical scheme provided by the embodiments of the present invention are:
During the model training process, model training is carried out according to the current round of positive example samples and the current round of negative example samples to obtain the current round's classification model. If this classification model does not satisfy the specified condition, it is used to classify all samples, and specified samples are chosen from all samples according to the classification result. The current round of positive example samples and the specified samples together become the next round of positive example samples, according to which the next round of negative example samples is determined; model training and sample processing then continue on the next round's samples until a classification model satisfying the specified condition is obtained. That is, after each round of training, if the resulting classification model does not satisfy the specified condition, specified samples chosen on the basis of that model are added to the positive example set, and training proceeds through further iterations. As the number of positive example samples keeps growing, the potential positive examples hidden among the negative example samples decline accordingly, effectively improving the purity of the negative example set. Because positive and negative examples are then better separated, the classification model obtained after multiple rounds of training on these progressively enlarged positive and negative example sets is more stable and classifies with higher accuracy.
Brief description of the drawings
In order to illustrate the technical schemes in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of a classification model training method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a classification model training method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a classification model training device provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of a server provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical schemes and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a classification model training method provided by an embodiment of the present invention. Referring to Fig. 1, the method flow provided by the embodiment of the present invention comprises:
101. Carry out model training according to the current round of positive example samples and the current round of negative example samples to obtain the current round's classification model.
102. If the current round's classification model does not satisfy a specified condition, use it to classify all samples and choose specified samples from all samples according to the classification result, the specified samples being samples the model predicts to be positive examples.
103. Take the current round of positive example samples together with the specified samples as the next round of positive example samples, and determine the next round of negative example samples according to the next round of positive example samples.
104. Continue to execute the above model training and sample processing according to the next round of positive example samples and the next round of negative example samples, until a classification model satisfying the specified condition is obtained.
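As an illustration only (not part of the patent text), the iterative procedure of steps 101 to 104 can be sketched in Python. The round of "model training" here is a deliberately simplified stand-in that scores a sample by closeness to the positive mean; all function and variable names are hypothetical:

```python
import random

def train_round(pos, neg):
    # Hypothetical stand-in for one round of model training: the returned
    # "model" scores a sample by similarity to the mean of the positives.
    mu = sum(pos) / len(pos)
    return lambda x: -abs(x - mu)  # higher score = more positive-like

def iterative_training(all_samples, seed_pos, rounds=3, top_k=2):
    # Round 1 positive examples: the manually labelled seed samples.
    pos = list(seed_pos)
    model = None
    for _ in range(rounds):
        # Negative examples: drawn from the remaining, unlabelled samples,
        # matched in size to the current positive set (step 202 below).
        remaining = [s for s in all_samples if s not in pos]
        neg = random.sample(remaining, min(len(pos), len(remaining)))
        model = train_round(pos, neg)
        # Classify every sample; add the top_k unlabelled samples the model
        # is most confident are positive to the next round's positive set.
        scored = sorted(remaining, key=model, reverse=True)
        pos.extend(scored[:top_k])
    return model, pos
```

In a real implementation the loop would additionally stop as soon as the evaluated model satisfies the specified condition, rather than running a fixed number of rounds.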
In the method provided by the embodiment of the present invention, during the model training process, model training is carried out according to the current round of positive example samples and the current round of negative example samples to obtain the current round's classification model. If this classification model does not satisfy the specified condition, it is used to classify all samples, and specified samples are chosen from all samples according to the classification result. The current round of positive example samples and the specified samples together become the next round of positive example samples, according to which the next round of negative example samples is determined; model training and sample processing then continue on the next round's samples until a classification model satisfying the specified condition is obtained. That is, after each round of training, if the resulting classification model does not satisfy the specified condition, specified samples chosen on the basis of that model are added to the positive example set, and training proceeds through further iterations. As the number of positive example samples keeps growing, the potential positive examples hidden among the negative example samples decline accordingly, effectively improving the purity of the negative example set. Because positive and negative examples are then better separated, the classification model obtained after multiple rounds of training on these progressively enlarged positive and negative example sets is more stable and classifies with higher accuracy.
Optionally, before carrying out model training according to the current round of positive example samples and the current round of negative example samples, the method further comprises:
choosing the current round of negative example samples, based on the quantity of the current round of positive example samples, from the samples remaining after the positive example samples are removed from all samples;
choosing first samples from the current round of positive example samples and second samples from the current round of negative example samples, the first samples and the second samples being equal in number;
wherein carrying out model training according to the current round of positive example samples and the current round of negative example samples comprises:
carrying out model training according to the positive example samples remaining after the first samples are removed and the negative example samples remaining after the second samples are removed.
Optionally, before using the current round's classification model to classify all samples, the method further comprises:
evaluating the current round's classification model according to the first samples and the second samples to obtain an evaluation result;
judging, according to the evaluation result, whether the current round's classification model satisfies the specified condition;
when the evaluation result is better than the configured classification performance indexes, determining that the current round's classification model satisfies the specified condition.
Optionally, choosing, according to the classification result, the specified samples classified as positive examples from all samples comprises:
determining predicted positive example samples among all samples according to the classification result;
for each of the predicted positive example samples, determining from the classification result the probability of the sample being classified as a positive example;
choosing, among the predicted positive example samples, the preset number of samples with the highest probability of being classified as positive examples;
determining the preset number of samples as the specified samples.
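A minimal sketch of this selection step (illustrative only, not from the patent). The 0.5 probability cutoff used to decide which samples count as "predicted positive" is an assumption; the patent only says the samples are those the model classifies as positive:

```python
def choose_specified_samples(probs, preset_number):
    # probs: {sample_id: probability of being classified as positive}.
    # Keep only samples predicted positive (assumed cutoff p > 0.5), then
    # take the preset_number of samples with the highest probability.
    predicted_pos = [(sid, p) for sid, p in probs.items() if p > 0.5]
    predicted_pos.sort(key=lambda t: t[1], reverse=True)
    return [sid for sid, _ in predicted_pos[:preset_number]]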
Optionally, carrying out model training according to the current round of positive example samples and the current round of negative example samples comprises:
calculating, on the basis of the model to be trained, the feature vectors of the current round of positive example samples and negative example samples, the model to be trained being the classification model obtained in the previous round of training, and the classification categories of the model to be trained being determined according to configured sample feature data;
classifying the current round of positive example samples and negative example samples according to the feature vector of each sample therein;
optimizing the parameters of the model to be trained according to the sample classification result and the labeling result of the current round of positive example samples, to obtain the current round's classification model.
All the above optional schemes can be combined in any manner to form optional embodiments of the present invention, which will not be described one by one here.
Fig. 2 is a flowchart of a classification model training method provided by an embodiment of the present invention. Referring to Fig. 2, the method flow provided by the embodiment of the present invention comprises:
201. Choose the first round of positive example samples from all samples.
In the embodiments of the present invention, a sample is an individual actually observed or surveyed in a study. For example, all the samples may be all the registered users of a certain application, or all the users belonging to a certain region, which the embodiments of the present invention do not limit. A positive example sample is a labeled sample among the samples used to train a binary classification model; that is, positive example samples are labeled by hand and their category is known. Such a model is called a binary classification model because its classification result takes only two values, "yes" or "no"; binary classification models include logistic regression models, decision tree models, support vector machine models and the like. A seed crowd is a set of positive example samples labeled offline. A seed crowd is normally collected under a specific business scenario and refers to people with the same requirements for and interest in a product or service; the size of a seed crowd is small, usually below 100,000. For example, among all the registered users of a certain application, users who like the same brand of automobile can belong to the same seed crowd.
It should be noted that this step refers to the positive example samples as the first round of positive example samples because the subsequent process may carry out many iterative rounds of model training. The number of positive example samples differs from round to round, and each round of training uses a different positive example set, so to distinguish the positive example samples of each round they are called the first round of positive example samples, the next round of positive example samples, and so on. The same applies to the negative example samples.
When choosing the first round of positive example samples from all samples, since the quality of the sample data used for model training is crucial, labeled data such as a seed crowd is generally used as the positive example samples. For example, a seed crowd sharing the same interest characteristic among all registered users may be used as the first round of positive example samples, or a seed crowd that has used a newly added service or product may be used; the embodiments of the present invention do not limit this, and different seed crowds can be chosen as the first round of positive example samples according to different classification requirements. The samples in each seed crowd are chosen and labeled manually in advance, which the embodiments of the present invention likewise do not limit.
202. Based on the quantity of the first round of positive example samples, choose the first round of negative example samples from the samples remaining after the first round of positive example samples is removed from all samples.
A negative example sample is an unlabeled sample among the samples used to train a binary classification model. That is, the samples in the positive example set are labeled and their category is definite, while the samples in the negative example set are unlabeled and their category is unknown. As a simple example, if the positive example samples are the students in a class who have been marked as girls, the negative example samples are the unmarked students of that class, among whom there may be both girls and boys.
In the embodiments of the present invention, after the first round of positive example samples is chosen, the first round of negative example samples for model training also needs to be chosen from all samples, the quantity of negative examples matching the quantity of the first round of positive example samples. To choose the first round of negative example samples, the first round of positive example samples is first rejected from all samples to obtain the remaining samples; then samples are drawn at random from the remaining samples until their number equals that of the first round of positive example samples, and these samples serve as the first round of negative example samples.
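The negative sampling of step 202 can be sketched as follows (illustrative only; the fixed random seed is an assumption added for reproducibility):

```python
import random

def choose_negatives(all_samples, positives, rng=random.Random(42)):
    # Reject the positive examples from all samples, then draw, uniformly
    # at random, as many negative examples as there are positives.
    pos_set = set(positives)
    remaining = [s for s in all_samples if s not in pos_set]
    return rng.sample(remaining, len(positives))
```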
203. Choose held-out samples from the first round of positive example samples and the first round of negative example samples.
A held-out sample is a sample used in the subsequent process to test and evaluate the trained classification model.
In the embodiments of the present invention, the held-out samples can be chosen from the first round of positive and negative example samples as follows: choose first samples from the first round of positive example samples and second samples from the first round of negative example samples, the first samples and the second samples being equal in number. That is, equal numbers of samples are chosen from the first round of positive example samples and the first round of negative example samples to serve together as the held-out samples. The number of first samples and second samples is generally 30% of the size of the first round of positive and negative example sets respectively: 30% of the first round of positive example samples is chosen as the first samples, 30% of the first round of negative example samples is chosen as the second samples, and the first samples together with the second samples form the held-out samples.
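A sketch of the 30% hold-out split described above (illustrative only; the helper name and the fixed random seed are assumptions):

```python
import random

def split_holdout(pos, neg, frac=0.3, rng=random.Random(7)):
    # Take an equal-sized slice (30% by default) from each of the positive
    # and negative sets; these become the held-out evaluation samples.
    k = int(len(pos) * frac)        # same count from each side
    first = rng.sample(pos, k)      # held-out positives ("first samples")
    second = rng.sample(neg, k)     # held-out negatives ("second samples")
    train_pos = [s for s in pos if s not in set(first)]
    train_neg = [s for s in neg if s not in set(second)]
    return (train_pos, train_neg), (first, second)
```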
204. Carry out model training according to the samples of the first round of positive and negative example samples remaining after the held-out samples are removed, to obtain the first round's classification model.
In the embodiments of the present invention, since held-out samples have been chosen from the first round of positive and negative example samples, these held-out samples must also be rejected when training on the first round of positive and negative example samples. That is, model training is carried out only on the positive example samples remaining after the first samples are removed and on the negative example samples remaining after the second samples are removed.
Model training can be realized with reference to the following steps:
First step: initialize the parameters of the first round's classification model.
Since this is the first time model training is carried out, the parameters of the classification model must first be initialized. The model training mentioned in the embodiments of the present invention may be an iterative process; in any round other than the first, this step need not be performed, and the following second step can be executed directly on the basis of the classification model obtained in the previous round. This step applies only to the first round of model training.
In essence, a classification model is a mapping from inputs to outputs. It can learn the mapping relations between a large number of inputs and outputs without requiring any exact mathematical expression between them; merely by training a preliminary classification model on known patterns, the resulting classification model acquires the mapping ability between input-output pairs. Before training begins, all parameters should be initialized with different small random numbers. During training, stochastic gradient descent or back-propagation can be used to optimize the parameters of the classification model, thereby minimizing the classification error as far as possible; the embodiments of the present invention do not limit this.
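As a concrete instance of the initialization and stochastic-gradient-descent optimization just described (illustrative only, not the patent's specific model), here is a one-feature logistic regression trained by SGD; the learning rate, epoch count and function name are assumptions:

```python
import math
import random

def train_logistic_sgd(samples, labels, epochs=200, lr=0.1, seed=0):
    # Weights start as small random numbers; each example then nudges them
    # against the classification error (the gradient of the log-loss).
    rng = random.Random(seed)
    w = rng.uniform(-0.01, 0.01)
    b = rng.uniform(-0.01, 0.01)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(positive)
            w -= lr * (p - y) * x                     # SGD parameter update
            b -= lr * (p - y)
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
```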
Second step: based on the initialized classification model, calculate the feature vectors of the first round of positive example samples and negative example samples.
When carrying out model training, in order to make the classification categories of the model definite, sample feature data configured in advance can be obtained, and the classification feature of the model to be trained is determined according to this feature data. The sample feature data specifies which classification feature the classifier trained on the positive and negative example samples should have. For example, suppose the positive example samples are the users aged 20 to 30 among the registered users of a certain social application. Because the users in the positive example set are e-commerce users of relatively young age, predicting from these positive examples whether young people aged 20 to 30 like a certain online game is certain to be more accurate than predicting from them whether young people aged 20 to 30 favor a certain fund-type wealth management service. The classification feature of the model can thus be specified through the sample feature data.
In the embodiments of the present invention, after the parameters of the classification model are initialized, since a classification model is in essence a mapping from inputs to outputs, inputting a training sample into the classification model allows it to calculate the feature vector of that training sample. It should be noted that in any round other than the first, the feature vectors of the current round of positive and negative example samples are calculated directly on the basis of the previous round's classification model.
Third step: classify the current round of positive example samples and negative example samples according to the feature vector of each sample therein.
For this step, the closer the feature vectors of any two training samples are in feature space, the more similar the two samples are and the higher the probability that they belong to the same category. A feature vector may have tens or hundreds of dimensions, which the embodiments of the present invention do not limit. Classifying all samples according to their feature vectors can be realized on the basis of the distances between feature vectors; the embodiments of the present invention do not limit this either.
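One simple way to classify by feature-vector distance, consistent with the principle above, is a nearest-centroid rule (a sketch under assumptions; the patent does not prescribe this particular distance-based classifier):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    # Component-wise mean of a list of equal-length feature vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify_by_distance(sample_vec, pos_vecs, neg_vecs):
    # A sample is assigned to whichever class's centroid its feature
    # vector lies closer to in feature space.
    return ("positive"
            if euclidean(sample_vec, centroid(pos_vecs))
            <= euclidean(sample_vec, centroid(neg_vecs))
            else "negative")
```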
Fourth step: optimize the parameters of the model to be trained according to the sample classification result and the labeling result of the first round of positive example samples, to obtain the current round's classification model.
For this step, the training of a classification model is a process of successive parameter optimization. After the samples of the first round of positive and negative example samples remaining outside the held-out set have been classified on the basis of their feature vectors, whether the initially trained classification model classifies samples correctly can be judged against the first round of positive example samples. That is, the parameters of the classification model are continuously adjusted according to the gap between the actual category of a sample and its predicted category, and the parameters are optimized step by step to obtain the classification model.
205. Evaluate the first round's classification model according to the held-out samples.
After the first round's classification model is obtained, in order to examine its classification performance, it must be evaluated on the held-out samples, which comprise the first samples drawn from the positive example set and the second samples drawn from the negative example set.
When evaluating the first round's classification model according to the first samples and the second samples, indexes such as its classification accuracy, recall rate and AUC (Area Under ROC Curve, the area under the receiver operating characteristic curve) can be assessed. Classification accuracy refers to the proportion of correctly classified samples among the samples assigned to a category: for a given category, the numerator is the number of samples correctly predicted as that category and the denominator is the number of all samples predicted as that category. It evaluates how accurately the classification model predicts samples as a certain category; the larger the value, the more accurate the model's predictions. The recall rate, also called the recall ratio, refers to the proportion of samples of a category that are correctly classified among all the held-out samples actually belonging to that category.
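The two per-category indexes just defined can be computed as follows (an illustrative sketch; function name and the binary 0/1 label encoding are assumptions):

```python
def precision_recall(y_true, y_pred):
    # Precision (the "classification accuracy" above): of the samples
    # predicted positive, the fraction that really are positive.
    # Recall: of the truly positive samples, the fraction the model found.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```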
AUC is a standard measure of the quality of a classification model. ROC (Receiver Operating Characteristic) analysis is chiefly carried out through a curve drawn on a two-dimensional plane whose horizontal axis is the FPR (false positive rate) and whose vertical axis is the TPR (true positive rate). For a classification model, an (FPR, TPR) point pair can be obtained from its performance on the held-out samples, so the classifier can be mapped to a point in the ROC plane. By adjusting the threshold the classifier uses, a curve passing through (0, 0) and (1, 1) is obtained; this is the classification model's ROC curve. In general, this curve should lie above the line from (0, 0) to (1, 1), because the ROC curve formed by that line in fact represents a random classifier. Although presenting a classifier's performance with the ROC curve is intuitive and handy, a single numerical value is often desired to grade classifiers, and AUC serves this purpose: the value of AUC is the area under the ROC curve. Usually the AUC lies between 0.5 and 1.0, and a larger AUC represents better performance; that is, the larger the value, the better the model distinguishes between different samples and the more accurately it discriminates them.
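AUC also has an equivalent probabilistic form that avoids drawing the curve: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal O(n*m) sketch of that computation (illustrative only; ties count half):

```python
def auc(scores_pos, scores_neg):
    # Fraction of (positive, negative) pairs the model orders correctly;
    # tied scores contribute half a win.
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```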
The classification performance indexes of the classification model can be set in advance, for example, an accuracy rate greater than 90%, a recall rate greater than 97%, and an AUC value greater than 0.8; the embodiment of the present invention does not specifically limit this. After the first-round model is evaluated, if every performance index in the obtained evaluation result is better than the classification performance index set in advance, it is determined that the first-round classification model satisfies the specified condition. If at least one performance index in the obtained evaluation result is lower than the corresponding classification performance index set in advance, it is determined that the first-round classification model does not satisfy the specified condition. In addition, when judging whether the obtained evaluation result satisfies the specified condition, it may also be judged whether the performance indexes of the evaluation result no longer improve, that is, whether the classification accuracy, recall rate, AUC, and the like remain at a constant value no matter how many more rounds of iteration are performed. The embodiment of the present invention does not specifically limit the type of the specified condition.
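The judgment against preset performance indexes can be sketched in one line; the threshold values in the usage below mirror the example indexes above and are illustrative only:

```python
def meets_specified_condition(metrics, thresholds):
    """True only when every evaluated index is better than its preset
    classification performance index."""
    return all(metrics[name] > bar for name, bar in thresholds.items())
```

For example, `meets_specified_condition(result, {"accuracy": 0.90, "recall": 0.97, "auc": 0.80})` implements the "all indexes better than the set values" branch described above.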
206. If the evaluation result of the first-round classification model does not satisfy the specified condition, classify all samples using the first-round classification model, and select specific samples from all samples according to the classification result.
In the disclosed embodiments, when the evaluation result of the first-round classification model does not satisfy the specified condition, model training needs to be carried out again. Before the next round of model training, next-round positive example samples and next-round negative example samples are first selected from all samples based on the first-round training model. The next-round positive example samples are the superposition of the first-round positive example samples and the specific samples; that is, the number of positive example samples is expanded in the next round of iteration.
Specifically, when selecting specific samples from all samples according to the classification result, the following manner may be adopted:
According to the classification result, determine predicted positive example samples among all samples; for each sample in the predicted positive example samples, determine, according to the classification result, the probability that the sample is classified as a positive example sample; select, from the predicted positive example samples, the preset number of samples with the highest probability of being classified as positive example samples; and determine the preset number of samples as the specific samples.
The predicted positive example samples are the samples selected from all samples according to the first-round classification model. For each of these samples, the first-round classification model predicts that it has features similar or identical to those of the first-round positive example samples. However, the samples in the predicted positive example samples differ in their degree of similarity to the first-round positive example samples. Each sample in the predicted positive example samples corresponds to a probability value of being similar to the first-round positive example samples, and this probability value is included in the classification result output by the first-round classification model. Taking the convention that a value of 1 represents two samples being completely consistent and a value of 0 represents two samples being completely inconsistent as an example, the probability values corresponding to different predicted positive example samples may be 0.6, 0.8, 0.87, 0.95, and so on. The larger the probability value, the closer the features of the predicted positive example sample are to those of the first-round positive example samples. Generally, the number of predicted positive example samples can be several times the number of first-round positive example samples. This also illustrates that the size of the seed crowd is still extremely small compared with the total sample size, so it is still necessary to mine, from the massive crowd, the crowd sharing the same features with the seed crowd, according to the seed crowd and the classification model.
In order to select, from the predicted positive example samples, the specific samples used to expand the first-round positive example samples, the predicted positive example samples also need to be sorted by probability value, for example, arranged in descending order of probability value. After the predicted positive example samples are sorted, the top N samples ranked highest by probability value can be selected, and these top N samples are used as the specific samples.
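The sort-and-take-top-N selection can be sketched as follows; `predictions` is assumed to hold (sample id, probability-of-positive) pairs as output by the current-round model for the predicted positive example samples:

```python
def select_specific_samples(predictions, n):
    """Sort the predicted positives by probability of being a positive example,
    descending, and return the ids of the top-N as the specific samples."""
    ranked = sorted(predictions, key=lambda kv: kv[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:n]]
```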
207. Take the first-round positive example samples and the specific samples as the next-round positive example samples.
After the specific samples are selected according to the above step 206, they are added to the first-round positive example samples to obtain the next-round positive example samples, so as to expand the number of positive example samples and achieve the purpose of collecting more samples that are highly similar in features to the first-round positive example samples.
208. Repeat the above steps 202 to 207 until the evaluation result of the obtained classification model satisfies the specified condition.
After the next-round positive example samples are obtained, the next-round negative example samples continue to be selected according to the method of the above step 202. Since the number of positive example samples has been expanded, the potential positive example samples among the negative example samples decrease, which effectively improves the purity of the negative example samples. Afterwards, model training continues according to the next-round positive example samples and the next-round negative example samples to obtain a next-round classification model, and the next-round classification model continues to be evaluated. If the evaluation result of the next-round classification model does not satisfy the specified condition, specific samples continue to be selected from all samples and added to the next-round positive example samples of the classification model to obtain the positive example samples of the round after that. The above steps 202 to 207 are repeated until the obtained classification model satisfies the specified condition.
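The iterative procedure of steps 202 to 208 can be sketched as a single loop; here `train`, `evaluate`, `predict_proba`, and `meets_condition` stand in for the concrete model operations and stopping test, and are assumptions of this sketch rather than part of the patent disclosure:

```python
import random

def iterative_training(all_samples, seed_positives, top_n, max_rounds,
                       train, evaluate, predict_proba, meets_condition):
    """Sketch of steps 202-208: each round trains on the current positives and
    size-matched random negatives, then expands the positives with the top-N
    most confidently predicted positives until the condition is satisfied."""
    positives = set(seed_positives)
    model = None
    for _ in range(max_rounds):
        remainder = [s for s in all_samples if s not in positives]
        # step 202: negatives drawn from the remaining samples, matched in size
        negatives = set(random.sample(remainder, min(len(positives), len(remainder))))
        model = train(positives, negatives)                  # steps 203-204
        if meets_condition(evaluate(model)):                 # step 205
            break
        # steps 206-207: add the top-N most probable predicted positives
        scored = sorted(((s, predict_proba(model, s)) for s in remainder),
                        key=lambda kv: kv[1], reverse=True)
        positives |= {s for s, _ in scored[:top_n]}
    return model
```

Because each pass enlarges `positives`, the pool from which negatives are drawn contains fewer potential positives, which is exactly the purity argument made above.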
For example, consider determining, among all users registered with a certain social application, all the users interested in a certain mobile game on behalf of a certain e-commerce merchant. Under normal circumstances, the merchant can only learn, through manual labeling and similar means, a very small portion of the users interested in the mobile game, i.e., the seed crowd. Since there are a massive number of registered users, possibly tens of millions or even hundreds of millions, manual selection and labeling are obviously unrealistic, so it is also necessary to carry out data mining among the massive users according to the seed crowd, to excavate the potential crowd having similar features to the seed crowd. The classification model training method provided by the embodiment of the present invention can solve this problem well. Moreover, since the classification model is trained through a multi-round iterative process and the positive example samples are expanded in number in every round, the classification effect is better, and the positive example samples sorted out have high similarity to the seed crowd. Continuing with the above example, the classification method provided by the embodiment of the present invention can accurately determine, among the massive users, the other users interested in the mobile game based on the seed crowd. After this crowd with the same interest feature is excavated, game advertisements can be delivered to this part of the crowd, game products can be recommended to them, and so on. In addition, the crowd interested in mobile games usually consists of young males, so related products such as automobiles and ball sports products can also be recommended accordingly.
According to the method provided by the embodiment of the present invention, in the model training process, model training is carried out according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model. If the current-round classification model does not satisfy a specified condition, the current-round classification model is used to classify all samples, and the specific samples with the highest probability of being predicted as positive example samples are selected from all samples according to the classification result. The current-round positive example samples and the specific samples are taken as next-round positive example samples, and next-round negative example samples are determined according to the next-round positive example samples. Afterwards, the above model training and sample processing processes continue to be performed according to the next-round positive example samples and the next-round negative example samples until a classification model that satisfies the specified condition is obtained. In the present invention, after one round of model training, if the obtained classification model does not satisfy the specified condition, specific samples are selected based on this classification model and added to the positive example samples, and model training is carried out through multiple rounds of iteration. As the number of positive example samples keeps increasing, the potential positive example samples contained in the negative example samples decline accordingly, which effectively improves the sample purity of the negative example samples. Because the discrimination between the positive example samples and the negative example samples is better, the classification model obtained after multiple rounds of model training, based on positive example samples and negative example samples whose quantity has been repeatedly expanded, has better stability and higher classification accuracy.
Fig. 3 shows a classification model training apparatus provided by the embodiment of the present invention. Referring to Fig. 3, the apparatus comprises: a model training module 301 and a sample processing module 302.
The model training module 301 is connected with the sample processing module 302 and is configured to carry out model training according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model. The sample processing module 302 is configured to: if the current-round classification model does not satisfy a specified condition, classify all samples using the current-round classification model, and select specific samples from all samples according to the classification result, the specific samples being predicted as positive example samples by the current-round classification model; and take the current-round positive example samples and the specific samples as next-round positive example samples, and determine next-round negative example samples according to the next-round positive example samples. The model training module 301 is configured to continue to perform the above model training process according to the next-round positive example samples and the next-round negative example samples. The sample processing module 302 is configured to continue to perform the above sample processing process according to the next-round classification model until a classification model that satisfies the specified condition is obtained.
Optionally, the apparatus further comprises:
a first sample selection module, configured to select, based on the number of the current-round positive example samples, the current-round negative example samples from the remaining samples in all samples other than the positive example samples;
a second sample selection module, configured to select a first sample from the current-round positive example samples and select a second sample from the current-round negative example samples, the first sample and the second sample containing a consistent number of samples;
wherein the model training module is configured to carry out model training according to the remaining samples in the positive example samples other than the first sample and the remaining samples in the negative example samples other than the second sample.
Optionally, the apparatus further comprises:
a model evaluation module, configured to: evaluate the current-round classification model according to the first sample and the second sample to obtain an evaluation result; judge, according to the evaluation result, whether the current-round classification model satisfies the specified condition; and when the evaluation result is better than the set classification performance index, determine that the current-round classification model satisfies the specified condition.
Optionally, the sample processing module is configured to: determine predicted positive example samples in all samples according to the classification result; for each sample in the predicted positive example samples, determine, according to the classification result, the probability that the sample is classified as a positive example sample; select, from the predicted positive example samples, the preset number of samples with the highest probability of being classified as positive example samples; and determine the preset number of samples as the specific samples.
Optionally, the model training module is configured to: calculate, based on a model to be trained, the feature vectors of the current-round positive example samples and the current-round negative example samples, the model to be trained being the classification model obtained by the previous round of training, and the classification categories of the model to be trained being determined according to configured sample feature data; classify the current-round positive example samples and the current-round negative example samples according to the feature vector of each sample in the current-round positive example samples and the current-round negative example samples; and optimize the parameters of the model to be trained according to the sample classification result and the labeling result of the current-round positive example samples, to obtain the current-round classification model.
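The per-round training step described above — start from the previous round's model, classify by feature vector, and optimize the parameters against the labeled samples — can be sketched as a plain logistic-regression update. The gradient-descent details (learning rate, epoch count) are assumptions of this sketch, not prescribed by the embodiment:

```python
import math

def train_round(pos_vecs, neg_vecs, init_w=None, lr=0.1, epochs=200):
    """One round of model training: fit a logistic-regression classifier on the
    current-round positive (label 1) and negative (label 0) feature vectors,
    starting from the previous round's weights when available."""
    dim = len(pos_vecs[0])
    w = list(init_w) if init_w else [0.0] * (dim + 1)  # last entry is the bias
    data = [(v, 1.0) for v in pos_vecs] + [(v, 0.0) for v in neg_vecs]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):      # optimize each parameter toward the label
                w[i] += lr * (y - p) * xi
            w[-1] += lr * (y - p)
    return w

def predict_proba(w, x):
    """Probability that feature vector x is a positive example under weights w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))
```

Passing the returned weights back in as `init_w` on the next call gives the "model to be trained is the classification model obtained by the previous round" behavior.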
To sum up, according to the apparatus provided by the embodiment of the present invention, in the model training process, model training is carried out according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model. If the current-round classification model does not satisfy a specified condition, the current-round classification model is used to classify all samples, and specific samples are selected from all samples according to the classification result. The current-round positive example samples and the specific samples are taken as next-round positive example samples, and next-round negative example samples are determined according to the next-round positive example samples. Afterwards, the above model training and sample processing processes continue to be performed according to the next-round positive example samples and the next-round negative example samples until a classification model that satisfies the specified condition is obtained. In the present invention, after one round of model training, if the obtained classification model does not satisfy the specified condition, specific samples are selected based on this classification model and added to the positive example samples, and model training is carried out through multiple rounds of iteration. As the number of positive example samples keeps increasing, the potential positive example samples contained in the negative example samples decline accordingly, which effectively improves the sample purity of the negative example samples. Because the discrimination between the positive example samples and the negative example samples is better, the classification model obtained after multiple rounds of model training, based on positive example samples and negative example samples whose quantity has been repeatedly expanded, has better stability and higher classification accuracy.
It should be noted that the classification model training apparatus provided by the above embodiment is illustrated, when training a classification model, only by the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as required; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the classification model training apparatus provided by the above embodiment and the embodiments of the classification model training method belong to the same concept; for its specific implementation process, refer to the method embodiments, which are not repeated here.
Fig. 4 shows a server according to an exemplary embodiment. The server may be used to implement the classification model training method shown in any of the above exemplary embodiments. Specifically, referring to Fig. 4, the server 400 may vary considerably with configuration or performance, and may comprise one or more central processing units (CPUs) 422 (for example, one or more processors), a memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444. The memory 432 and the storage medium 430 may provide transient or persistent storage. The programs stored in the storage medium 430 may comprise one or more modules (not shown in the figure).
The server 400 may also comprise one or more power supplies 426, one or more wired or wireless network interfaces 440, one or more input/output interfaces 448, and/or one or more operating systems 441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
One or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs containing instructions for carrying out the following operations:
carrying out model training according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model;
if the current-round classification model does not satisfy a specified condition, classifying all samples using the current-round classification model, and selecting specific samples from all samples according to the classification result, the specific samples being predicted as positive example samples by the current-round classification model;
taking the current-round positive example samples and the specific samples as next-round positive example samples, and determining next-round negative example samples according to the next-round positive example samples;
continuing to perform the above model training and sample processing processes according to the next-round positive example samples and the next-round negative example samples until a classification model that satisfies the specified condition is obtained.
Optionally, before the model training is carried out according to the current-round positive example samples and the current-round negative example samples, the method further comprises:
selecting, based on the number of the current-round positive example samples, the current-round negative example samples from the remaining samples in all samples other than the positive example samples;
selecting a first sample from the current-round positive example samples, and selecting a second sample from the current-round negative example samples, the first sample and the second sample containing a consistent number of samples;
wherein carrying out model training according to the current-round positive example samples and the current-round negative example samples comprises:
carrying out model training according to the remaining samples in the positive example samples other than the first sample and the remaining samples in the negative example samples other than the second sample.
Optionally, before all samples are classified using the current-round classification model, the method further comprises:
evaluating the current-round classification model according to the first sample and the second sample to obtain an evaluation result;
judging, according to the evaluation result, whether the current-round classification model satisfies the specified condition; and
when the evaluation result is better than the set classification performance index, determining that the current-round classification model satisfies the specified condition.
Optionally, selecting, from all samples according to the classification result, the specific samples classified as positive example samples comprises:
determining predicted positive example samples in all samples according to the classification result;
for each sample in the predicted positive example samples, determining, according to the classification result, the probability that the sample is classified as a positive example sample;
selecting, from the predicted positive example samples, the preset number of samples with the highest probability of being classified as positive example samples; and
determining the preset number of samples as the specific samples.
Optionally, carrying out model training according to the current-round positive example samples and the current-round negative example samples comprises:
calculating, based on a model to be trained, the feature vectors of the current-round positive example samples and the current-round negative example samples, the model to be trained being the classification model obtained by the previous round of training, and the classification categories of the model to be trained being determined according to configured sample feature data;
classifying the current-round positive example samples and the current-round negative example samples according to the feature vector of each sample in the current-round positive example samples and the current-round negative example samples; and
optimizing the parameters of the model to be trained according to the sample classification result and the labeling result of the current-round positive example samples, to obtain the current-round classification model.
According to the server provided by the embodiment of the present invention, in the model training process, model training is carried out according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model. If the current-round classification model does not satisfy a specified condition, the current-round classification model is used to classify all samples, and specific samples are selected from all samples according to the classification result. The current-round positive example samples and the specific samples are taken as next-round positive example samples, and next-round negative example samples are determined according to the next-round positive example samples. Afterwards, the above model training and sample processing processes continue to be performed according to the next-round positive example samples and the next-round negative example samples until a classification model that satisfies the specified condition is obtained. In the present invention, after one round of model training, if the obtained classification model does not satisfy the specified condition, specific samples are selected based on this classification model and added to the positive example samples, and model training is carried out through multiple rounds of iteration. As the number of positive example samples keeps increasing, the potential positive example samples contained in the negative example samples decline accordingly, which effectively improves the sample purity of the negative example samples. Because the discrimination between the positive example samples and the negative example samples is better, the classification model obtained after multiple rounds of model training, based on positive example samples and negative example samples whose quantity has been repeatedly expanded, has better stability and higher classification accuracy.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the above embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A classification model training method, characterized in that the method comprises:
carrying out model training according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model;
if the current-round classification model does not satisfy a specified condition, classifying all samples using the current-round classification model, and selecting specific samples from all samples according to the classification result, the specific samples being predicted as positive example samples by the current-round classification model;
taking the current-round positive example samples and the specific samples as next-round positive example samples, and determining the next-round negative example samples according to the next-round positive example samples; and
continuing to perform the above model training and sample processing processes according to the next-round positive example samples and the next-round negative example samples until a classification model that satisfies the specified condition is obtained.
2. The method according to claim 1, characterized in that before the model training is carried out according to the current-round positive example samples and the current-round negative example samples, the method further comprises:
selecting, based on the number of the current-round positive example samples, the current-round negative example samples from the remaining samples in all samples other than the positive example samples; and
selecting a first sample from the current-round positive example samples, and selecting a second sample from the current-round negative example samples, the first sample and the second sample containing a consistent number of samples;
wherein the carrying out model training according to the current-round positive example samples and the current-round negative example samples comprises:
carrying out model training according to the remaining samples in the positive example samples other than the first sample and the remaining samples in the negative example samples other than the second sample.
3. The method according to claim 2, characterized in that before all samples are classified using the current-round classification model, the method further comprises:
evaluating the current-round classification model according to the first sample and the second sample to obtain an evaluation result;
judging, according to the evaluation result, whether the current-round classification model satisfies the specified condition; and
when the evaluation result is better than the set classification performance index, determining that the current-round classification model satisfies the specified condition.
4. The method according to claim 1, characterized in that the selecting, from all samples according to the classification result, the specific samples classified as positive example samples comprises:
determining predicted positive example samples in all samples according to the classification result;
for each sample in the predicted positive example samples, determining, according to the classification result, the probability that the sample is classified as a positive example sample;
selecting, from the predicted positive example samples, the preset number of samples with the highest probability of being classified as positive example samples; and
determining the preset number of samples as the specific samples.
5. The method according to claim 1, characterized in that the carrying out model training according to the current-round positive example samples and the current-round negative example samples comprises:
calculating, based on a model to be trained, the feature vectors of the current-round positive example samples and the current-round negative example samples, the model to be trained being the classification model obtained by the previous round of training, and the classification categories of the model to be trained being determined according to configured sample feature data;
classifying the current-round positive example samples and the current-round negative example samples according to the feature vector of each sample in the current-round positive example samples and the current-round negative example samples; and
optimizing the parameters of the model to be trained according to the sample classification result and the labeling result of the current-round positive example samples, to obtain the current-round classification model.
6. A classification model training apparatus, characterized in that the apparatus comprises:
a model training module, configured to carry out model training according to current-round positive example samples and current-round negative example samples to obtain a current-round classification model; and
a sample processing module, configured to: if the current-round classification model does not satisfy a specified condition, classify all samples using the current-round classification model, and select specific samples from all samples according to the classification result, the specific samples being predicted as positive example samples by the current-round classification model; and take the current-round positive example samples and the specific samples as next-round positive example samples, and determine the next-round negative example samples according to the next-round positive example samples;
wherein the model training module is configured to continue to perform the above model training process according to the next-round positive example samples and the next-round negative example samples; and
the sample processing module is configured to continue to perform the above sample processing process according to the next-round classification model until a classification model that satisfies the specified condition is obtained.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a first sample selection module, configured to select, based on the number of the current-round positive example samples, the current-round negative example samples from the remaining samples in all samples other than the positive example samples; and
a second sample selection module, configured to select a first sample from the current-round positive example samples, and select a second sample from the current-round negative example samples, the first sample and the second sample containing a consistent number of samples;
wherein the model training module is configured to carry out model training according to the remaining samples in the positive example samples other than the first sample and the remaining samples in the negative example samples other than the second sample.
8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a model evaluation module, configured to: evaluate the current-round classification model according to the first sample and the second sample to obtain an evaluation result; judge, according to the evaluation result, whether the current-round classification model satisfies the specified condition; and when the evaluation result is better than the set classification performance index, determine that the current-round classification model satisfies the specified condition.
9. The apparatus according to claim 6, characterized in that the sample processing module is configured to: determine predicted positive example samples in all samples according to the classification result; for each sample in the predicted positive example samples, determine, according to the classification result, the probability that the sample is classified as a positive example sample; select, from the predicted positive example samples, the preset number of samples with the highest probability of being classified as positive example samples; and determine the preset number of samples as the specific samples.
10. The apparatus according to claim 6, characterized in that the model training module is configured to: calculate, based on a model to be trained, the feature vectors of the current-round positive example samples and the current-round negative example samples, the model to be trained being the classification model obtained by the previous round of training, and the classification categories of the model to be trained being determined according to configured sample feature data; classify the current-round positive example samples and the current-round negative example samples according to the feature vector of each sample in the current-round positive example samples and the current-round negative example samples; and optimize the parameters of the model to be trained according to the sample classification result and the labeling result of the current-round positive example samples, to obtain the current-round classification model.
CN201510456761.2A 2015-07-29 2015-07-29 Classification model training method and device Pending CN105069470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510456761.2A CN105069470A (en) 2015-07-29 2015-07-29 Classification model training method and device


Publications (1)

Publication Number Publication Date
CN105069470A true CN105069470A (en) 2015-11-18

Family

ID=54498831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510456761.2A Pending CN105069470A (en) 2015-07-29 2015-07-29 Classification model training method and device

Country Status (1)

Country Link
CN (1) CN105069470A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071439A (en) * 2007-05-24 2007-11-14 北京交通大学 Interactive video searching method based on multi-view angle
CN102663264A (en) * 2012-04-28 2012-09-12 北京工商大学 Semi-supervised synergistic evaluation method for static parameter of health monitoring of bridge structure
CN103105924A (en) * 2011-11-15 2013-05-15 中国科学院深圳先进技术研究院 Man-machine interaction method and device
CN103150578A (en) * 2013-04-09 2013-06-12 山东师范大学 Training method of SVM (Support Vector Machine) classifier based on semi-supervised learning
CN104408475A (en) * 2014-12-08 2015-03-11 深圳市捷顺科技实业股份有限公司 Vehicle license plate identification method and vehicle license plate identification equipment
CN104537383A (en) * 2015-01-20 2015-04-22 全国组织机构代码管理中心 Massive organizational structure data classification method and system based on particle swarm

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934413A (en) * 2015-12-31 2017-07-07 阿里巴巴集团控股有限公司 Model training method, apparatus and system and sample set optimization method, device
CN106934413B (en) * 2015-12-31 2020-10-13 阿里巴巴集团控股有限公司 Model training method, device and system and sample set optimization method and device
WO2017133569A1 (en) * 2016-02-05 2017-08-10 阿里巴巴集团控股有限公司 Evaluation index obtaining method and device
CN105812174A (en) * 2016-03-06 2016-07-27 刘健文 Network data determining model training method and apparatus
CN108011740A (en) * 2016-10-28 2018-05-08 腾讯科技(深圳)有限公司 Media flow data processing method and device
CN108011740B (en) * 2016-10-28 2021-04-30 腾讯科技(深圳)有限公司 Media flow data processing method and device
CN108229517B (en) * 2017-01-24 2020-08-04 北京市商汤科技开发有限公司 Neural network training and hyperspectral image interpretation method and device and electronic equipment
CN108229517A (en) * 2017-01-24 2018-06-29 北京市商汤科技开发有限公司 Neural network training and hyperspectral image interpretation method, device and electronic equipment
CN108427690A (en) * 2017-02-15 2018-08-21 腾讯科技(深圳)有限公司 Information distribution method and device
CN106897746A (en) * 2017-02-28 2017-06-27 北京京东尚科信息技术有限公司 Data classification model training method and device
CN106897746B (en) * 2017-02-28 2020-03-03 北京京东尚科信息技术有限公司 Data classification model training method and device
CN109101507B (en) * 2017-06-20 2023-09-26 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN109101507A (en) * 2017-06-20 2018-12-28 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
US11151182B2 (en) 2017-07-24 2021-10-19 Huawei Technologies Co., Ltd. Classification model training method and apparatus
CN107517251B (en) * 2017-08-16 2020-12-15 北京星选科技有限公司 Information pushing method and device
CN107517251A (en) * 2017-08-16 2017-12-26 北京小度信息科技有限公司 Information-pushing method and device
CN108269012A (en) * 2018-01-12 2018-07-10 中国平安人寿保险股份有限公司 Construction method, device, storage medium and the terminal of risk score model
CN109447125A (en) * 2018-09-28 2019-03-08 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of disaggregated model
CN109344904A (en) * 2018-10-16 2019-02-15 杭州睿琪软件有限公司 Generate method, system and the storage medium of training sample
CN109522304B (en) * 2018-11-23 2021-05-18 中国联合网络通信集团有限公司 Abnormal object identification method and device and storage medium
CN109522304A (en) * 2018-11-23 2019-03-26 中国联合网络通信集团有限公司 Exception object recognition methods and device, storage medium
CN109685555A (en) * 2018-12-13 2019-04-26 拉扎斯网络科技(上海)有限公司 Merchant screening method and device, electronic equipment and storage medium
CN109739982A (en) * 2018-12-20 2019-05-10 中国科学院软件研究所 Event detecting method
CN109903166B (en) * 2018-12-25 2024-01-30 创新先进技术有限公司 Data risk prediction method, device and equipment
CN109903166A (en) * 2018-12-25 2019-06-18 阿里巴巴集团控股有限公司 Data risk prediction method, device and equipment
CN109785850A (en) * 2019-01-18 2019-05-21 腾讯音乐娱乐科技(深圳)有限公司 Noise detection method, device and storage medium
CN112308099A (en) * 2019-07-29 2021-02-02 腾讯科技(深圳)有限公司 Sample feature importance determination method, and classification model training method and device
CN111477219A (en) * 2020-05-08 2020-07-31 合肥讯飞数码科技有限公司 Keyword distinguishing method and device, electronic equipment and readable storage medium
CN113240438A (en) * 2021-05-11 2021-08-10 京东数字科技控股股份有限公司 Intention recognition method, device, storage medium and program product
CN117314909A (en) * 2023-11-29 2023-12-29 无棣源通电子科技有限公司 Circuit board defect detection method, device, equipment and medium based on artificial intelligence
CN117314909B (en) * 2023-11-29 2024-02-09 无棣源通电子科技有限公司 Circuit board defect detection method, device, equipment and medium based on artificial intelligence
CN117909494A (en) * 2024-03-20 2024-04-19 北京建筑大学 Abstract consistency assessment model training method and device
CN117909494B (en) * 2024-03-20 2024-06-07 北京建筑大学 Abstract consistency assessment model training method and device

Similar Documents

Publication Publication Date Title
CN105069470A (en) Classification model training method and device
CN106201871B Cost-sensitive semi-supervised software defect prediction method
CN107067025B (en) Text data automatic labeling method based on active learning
AU2018101946A4 (en) Geographical multivariate flow data spatio-temporal autocorrelation analysis method based on cellular automaton
CN107657267B (en) Product potential user mining method and device
CN103617435B (en) Image sorting method and system for active learning
CN105389480B (en) Multiclass imbalance genomics data iteration Ensemble feature selection method and system
CN110610193A (en) Method and device for processing labeled data
CN106651574A (en) Personal credit assessment method and apparatus
WO2024067387A1 (en) User portrait generation method based on characteristic variable scoring, device, vehicle, and storage medium
CN107016416B (en) Data classification prediction method based on neighborhood rough set and PCA fusion
CN104778186A Method and system for attaching commodity objects to a standard product unit (SPU)
CN107545038A Text classification method and device
CN111984873A (en) Service recommendation system and method
CN116485020B (en) Supply chain risk identification early warning method, system and medium based on big data
CN114139634A (en) Multi-label feature selection method based on paired label weights
CN109615421B (en) Personalized commodity recommendation method based on multi-objective evolutionary algorithm
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN104615910A Method for predicting helix interaction relationships of alpha transmembrane proteins based on random forest
CN104572623B (en) A kind of efficient data analysis and summary method of online LDA models
CN113360392A (en) Cross-project software defect prediction method and device
CN108764296A Multi-label classification method based on K-means and multi-task association learning
Shaji et al. Weather Prediction Using Machine Learning Algorithms
CN111221915B (en) Online learning resource quality analysis method based on CWK-means
CN107291722B (en) Descriptor classification method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151118