Summary of the invention
The present application provides a method for updating a classification model, to solve the problems of low training efficiency and untimely updating in existing classification model update approaches. The present application further provides a device for updating a classification model.
The present application provides a method for updating a classification model, where the classification model is composed of a predetermined number of decision trees and is used for class prediction according to user behavior data in a network application. The method includes:
obtaining, from a server that provides the user behavior data, the incremental data within a predetermined time period as a training sample set;
determining the number of decision trees to be added;
generating the determined number of new decision trees from the training sample set using a random forest algorithm;
ranking the decision trees contained in the classification model together with the newly generated decision trees by classification performance, and selecting the top-ranked predetermined number of decision trees; and
assembling the selected decision trees to obtain the updated classification model.
Optionally, determining the number of decision trees to be added refers to validating the classification model with the training sample set and determining the number of new decision trees according to the validation result.
Optionally, validating the classification model with the training sample set and determining the number of new decision trees according to the validation result includes:
verifying the correctness of the classification model with each sample in the training sample set;
calculating, according to the validation result, the accuracy with which the classification model classifies the training sample set;
determining a parameter value of a Poisson distribution according to the accuracy, such that the accuracy and the parameter value of the Poisson distribution satisfy an inverse relation, where the Poisson distribution is the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement; and
determining a random number conforming to the discrete probability distribution according to the parameter value of the Poisson distribution, and taking this random number as the number of decision trees to be added.
Optionally, verifying the correctness of the classification model with each sample in the training sample set includes:
performing class prediction with the classification model according to the attribute information contained in a training sample;
judging whether the predicted class is consistent with the actual class of the training sample; and
if consistent, determining that the classification model has classified the training sample correctly.
Optionally, generating the determined number of new decision trees from the training sample set using a random forest algorithm includes:
building a bootstrap sample set from the training sample set by sampling with replacement;
generating a new decision tree from the bootstrap sample set by, at each node, choosing an attribute according to a predetermined policy and splitting on the chosen attribute, where choosing an attribute according to a predetermined policy means choosing an attribute, according to the predetermined policy, from a randomly selected subset of the sample attributes; and
returning to the step of building a bootstrap sample set from the training sample set by sampling with replacement, until the determined number of new decision trees has been generated.
Optionally, choosing an attribute according to a predetermined policy includes: choosing an attribute by information gain, by information gain ratio, or by Gini index.
Optionally, after generating a new decision tree by choosing an attribute at each node according to a predetermined policy and splitting on the chosen attribute, the following operation is performed:
calculating an index characterizing the classification performance of the new decision tree.
Accordingly, ranking the decision trees contained in the classification model together with the newly generated decision trees by classification performance and selecting the top-ranked predetermined number of decision trees includes:
for each decision tree contained in the classification model, calculating the index characterizing its classification performance;
ranking the decision trees contained in the classification model together with the newly generated decision trees by the index; and
selecting the top-ranked predetermined number of decision trees from the ranked decision trees.
Optionally, the index characterizing the classification performance of a new decision tree is the out-of-bag error;
Accordingly, for each decision tree contained in the classification model, calculating the index characterizing its classification performance includes:
aggregating the out-of-bag data of each new decision tree to obtain an out-of-bag data set; and
using the out-of-bag data set to calculate, for each decision tree contained in the classification model, the out-of-bag error characterizing its classification performance.
Optionally, before the step of determining the number of decision trees to be added, the following operations are performed:
judging whether the classification model has been created;
if not, determining the number of decision trees to be added means taking a preset number, namely the number of decision trees the classification model is to contain, as the number of decision trees to be added; accordingly, after the determined number of new decision trees has been generated by the random forest algorithm, the step of assembling the selected decision trees to obtain the updated classification model is performed directly, with the generated new decision trees serving as the selected decision trees.
Accordingly, the present application further provides a device for updating a classification model, including:
a training sample set acquiring unit, configured to obtain, from a server providing the user behavior data, the incremental data within a predetermined time period as a training sample set;
an added-number determining unit, configured to determine the number of decision trees to be added;
a decision tree creating unit, configured to generate the determined number of new decision trees from the training sample set using a random forest algorithm;
a decision tree screening unit, configured to rank the decision trees contained in the classification model together with the newly generated decision trees by classification performance, and select the top-ranked predetermined number of decision trees; and
a classification model output unit, configured to assemble the selected decision trees to obtain the updated classification model.
Optionally, the added-number determining unit is specifically configured to validate the classification model with the training sample set and determine the number of new decision trees according to the validation result.
Optionally, the added-number determining unit includes:
a correctness verification subunit, configured to verify the correctness of the classification model with each sample in the training sample set;
an accuracy computation subunit, configured to calculate, according to the validation result, the accuracy with which the classification model classifies the training sample set;
a Poisson parameter determining subunit, configured to determine a parameter value of a Poisson distribution according to the accuracy, such that the accuracy and the parameter value of the Poisson distribution satisfy an inverse relation, where the Poisson distribution is the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement; and
a random number determining subunit, configured to determine a random number conforming to the discrete probability distribution according to the parameter value of the Poisson distribution, and take this random number as the number of decision trees to be added.
Optionally, the correctness verification subunit includes:
a first loop control subunit, configured to trigger, for each sample in the training sample set, the following subunits in turn;
a class prediction subunit, configured to perform class prediction with the classification model according to the attribute information contained in a training sample; and
a judgment subunit, configured to judge whether the predicted class is consistent with the actual class of the training sample, and if consistent, determine that the classification model has classified the training sample correctly.
Optionally, the decision tree creating unit includes:
a second loop control subunit, configured to judge whether the number of decision trees created has reached the determined number, and if not, trigger the following subunits in turn to create a new decision tree;
a bootstrap sampling subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement; and
a creation execution subunit, configured to generate a new decision tree from the bootstrap sample set by choosing an attribute at each node according to a predetermined policy and splitting on the chosen attribute, and to trigger the second loop control subunit, where choosing an attribute according to a predetermined policy means choosing an attribute, according to the predetermined policy, from a randomly selected subset of the sample attributes.
Optionally, the predetermined policy adopted by the creation execution subunit when choosing an attribute includes: choosing an attribute by information gain, by information gain ratio, or by Gini index.
Optionally, the decision tree creating unit further includes:
a new-tree index computation subunit, configured to calculate, after the creation execution subunit creates a new decision tree, the index characterizing the classification performance of the new decision tree.
Accordingly, the decision tree screening unit includes:
an original index computation subunit, configured to calculate, for each decision tree contained in the classification model, the index characterizing its classification performance;
a ranking subunit, configured to rank the decision trees contained in the classification model together with the newly generated decision trees by the index; and
a selection subunit, configured to select the top-ranked predetermined number of decision trees from the ranked decision trees.
Optionally, the index calculated by the new-tree index computation subunit is the out-of-bag error;
Accordingly, the original index computation subunit includes:
an out-of-bag data acquiring subunit, configured to aggregate the out-of-bag data of each new decision tree to obtain an out-of-bag data set; and
an error computation subunit, configured to use the out-of-bag data set to calculate, for each decision tree contained in the classification model, the out-of-bag error characterizing its classification performance.
Optionally, the device includes:
a classification model judgment subunit, configured to judge whether the classification model has been created;
Accordingly, when the classification model judgment subunit outputs "No", the added-number determining unit is configured to take a preset number, namely the number of decision trees the classification model is to contain, as the number of decision trees to be added;
Accordingly, after completing its operation, the decision tree creating unit directly triggers the classification model output unit, and the classification model output unit is specifically configured to assemble the generated new decision trees to obtain the updated classification model.
Compared with the prior art, the present application has the following advantages:
In the method for updating a classification model provided by the present application, the incremental data within a recent period is taken as a training sample set; a certain number of new decision trees are generated from the training sample set using a random forest algorithm; and, from the decision trees contained in the classification model together with the newly generated decision trees, the predetermined number of decision trees with the best classification performance are selected as the updated classification model. With this method, since training need not be performed on the full data set but proceeds incrementally on the basis of the original classification model, the classification model can be updated at various time granularities as required, for example daily or in near real time. This not only improves the efficiency of model training and enables quick response to business changes, but also requires no extra manual intervention during the service life of the classification model, reducing labor cost.
Detailed description of the invention
Many specific details are set forth in the following description for a thorough understanding of the present application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the application; therefore, the application is not limited by the specific implementations disclosed below.
The present application provides a method for updating a classification model and a device for updating a classification model, which are described in detail one by one in the following embodiments.
Please refer to Fig. 1, which is a flow chart of an embodiment of the method for updating a classification model of the present application. The method comprises the following steps:
Step 101: obtain, from the server providing the user behavior data, the incremental data within a predetermined time period as a training sample set.
The core of the method for updating a classification model provided by the present application is that, instead of the conventional offline approach of rebuilding the classification model from the full data set, incremental data is used to update the classification model, so the model can be adjusted in a timely or near-real-time manner as the sample data changes, keeping the classification model synchronized with the latest sample data. Compared with traditional offline learning, the method provided by the present application can incrementally improve the classification model based on the latest sample data generated online, and can therefore also be regarded as an online learning method.
To implement the technical scheme of the present application, this step obtains the incremental data within a predetermined time period from the server providing the user behavior data as a training sample set. The training sample set is a set of samples, each of the form (x1, x2, ..., xn : c), where xi denotes a specific attribute value of the sample and c denotes its actual class. For example, in a concrete example of this embodiment, in the risk control business of an Internet trading platform, a classification model is used to predict whether a user transaction is risky; the attributes of each sample include personal attributes such as user account and age, commodity attributes such as category, title, and price, and the transaction amount, and the classes are black and white samples (corresponding to risky and risk-free, respectively).
In the concrete example, this step obtains the user transaction data within the predetermined time period immediately before the current time as the training sample set. The predetermined time period can be configured according to specific needs, for example in units of days, hours, or even minutes, as long as the data within the period can be obtained and used as a training sample set (that is, it contains complete attribute information and actual class information).
Step 102: determine the number of decision trees to be added.
In the technical scheme of the present application, the classification model (that is, the classification model to be updated) is composed of a predetermined number of decision trees. On the basis of this classification model, some new decision trees are generated from the training sample set obtained in step 101, and from the original decision trees of the classification model together with the new decision trees, the predetermined number of decision trees with the best classification performance are selected as the updated classification model, thereby achieving the purpose of updating the classification model from incremental data.
Assuming the training sample set contains N samples, subsequent step 103 obtains k sample sets from the training sample set by sampling with replacement and uses these k sample sets to build k new decision trees, so the main purpose of this step is to determine the number of new decision trees needed, that is, the value of k.
As a simple implementation, a fixed value can be set empirically with reference to the specific application scenario of the classification model, the complexity of the sample data, and the number of decision trees the classification model contains. For example, in the above Internet risk control application, the classification model generally contains 200 to 400 decision trees, so the number of new decision trees can be set to 10. This is merely an example; various factors can be considered in a specific implementation, such as the size of the obtained training sample set: the more samples the training sample set contains, the more the number of new decision trees can appropriately be increased.
The above manner is simple and easy to carry out, but it does not take into account the classification performance of the classification model on the obtained training sample set. For this problem the technical scheme of the present application provides a preferred implementation: validate the classification model with the training sample set, and determine the number of new decision trees according to the validation result. The concrete process includes steps 102-1 to 102-4, described below with reference to Fig. 2.
Step 102-1: verify the correctness of the classification model with each sample in the training sample set.
Specifically, for a given sample, each decision tree in the classification model classifies it according to its attribute information, and the class voted for by the largest number of decision trees is chosen as the final predicted class (this process is commonly referred to as majority voting). Then, whether the predicted class is consistent with the actual class of the sample being classified is judged; if consistent, the classification model is considered to have classified the current sample correctly.
Each sample in the training sample set obtained in step 101 is verified in the above manner, and the number of correct classification results is recorded.
Step 102-2: calculate, according to the validation result, the accuracy with which the classification model classifies the training sample set.
In this step, the ratio of the number of samples in the training sample set that the classification model classified correctly to the total number of samples can be taken as the accuracy with which the classification model classifies the training sample set. This value reflects the classification performance of the classification model on the training sample set, and the number of new decision trees can be determined from it in subsequent steps.
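The voting of step 102-1 and the accuracy calculation of step 102-2 can be sketched as follows. This is a minimal illustration, assuming each decision tree is represented as a callable that maps a sample's attributes to a predicted class; that representation is an assumption for brevity, not something prescribed by the application.

```python
from collections import Counter

def model_predict(trees, attrs):
    """Majority vote of step 102-1: each tree in the model votes a class
    for the sample's attributes; the most-voted class is the prediction."""
    votes = Counter(tree(attrs) for tree in trees)
    return votes.most_common(1)[0][0]

def model_accuracy(trees, samples):
    """Accuracy of step 102-2: fraction of samples whose predicted class
    matches the actual class. `samples` is a list of (attrs, actual) pairs."""
    correct = sum(1 for attrs, actual in samples
                  if model_predict(trees, attrs) == actual)
    return correct / len(samples)
```

For example, with three trees voting "black", "white", "black" on every sample, the model always predicts "black", and its accuracy on a sample set is simply the fraction of actually black samples.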
Step 102-3: determine the parameter value of a Poisson distribution according to the accuracy, such that the accuracy and the parameter value of the Poisson distribution satisfy an inverse relation.
The process of obtaining k sample sets by sampling with replacement from the training sample set containing N samples follows a binomial distribution:
P(X = k) = C(N, k) · p^k · (1 − p)^(N − k)
Further, when N is very large or tends to infinity, with λ = N·p, the above binomial distribution of k tends to a Poisson distribution:
P(X = k) = λ^k · e^(−λ) / k!
Based on the principles of these discrete probability distributions, the parameter λ of the Poisson distribution can be adjusted according to the accuracy with which the classification model classifies the training sample set, and a discrete value k conforming to the Poisson distribution can then be determined from the value of λ.
Specifically, the more correctly the classification model classifies the training sample set (the higher the accuracy), the smaller the value of the parameter λ is set, and vice versa. In a specific implementation, the value of λ can be adjusted within a preset range, for example: the range of λ is preset to 1-20; if the classification accuracy is 80%, λ = 10 can be taken; if the accuracy is above 80%, λ can be adjusted correspondingly between 1 and 9; if below 80%, between 11 and 20. This example is merely illustrative; concrete settings can be made along these lines in a specific implementation, as long as the accuracy and the value of λ roughly satisfy an inverse relation.
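One way to realize the inverse relation between accuracy and λ is a simple linear interpolation over a preset range, sketched below. Note this smooth mapping is an illustrative assumption and does not reproduce the piecewise example values above (under it, 80% accuracy does not map to exactly λ = 10); any monotone decreasing mapping satisfies the scheme.

```python
def accuracy_to_lambda(accuracy, lam_min=1.0, lam_max=20.0):
    """Map a classification accuracy in [0, 1] to a Poisson parameter
    lambda in [lam_min, lam_max], inversely: higher accuracy -> smaller
    lambda (smaller adjustment to the model)."""
    accuracy = min(max(accuracy, 0.0), 1.0)   # clamp to a valid accuracy
    # Linear interpolation: accuracy 1.0 -> lam_min, accuracy 0.0 -> lam_max.
    return lam_min + (1.0 - accuracy) * (lam_max - lam_min)
```

A perfectly accurate model thus yields the smallest λ, so few or no new trees are drawn in the following steps.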
Step 102-4: determine a random number conforming to the distribution according to the parameter value of the Poisson distribution, and take this random number as the number of decision trees to be added.
The value of the Poisson parameter λ was determined in step 102-3; this step determines, according to λ, a random number k conforming to the Poisson distribution, as the number of new decision trees in subsequent step 103. In a specific implementation, the following calculation is generally adopted:
According to the expression of the Poisson distribution:
P(X = k) = λ^k · e^(−λ) / k!
the relation between consecutive terms can be derived:
P(X = k + 1) / P(X = k) = λ / (k + 1)
Therefore, initially set p = exp(−λ) and a cumulative sum equal to p, and generate one decimal u with rand() or a similar function; then, for integer k from 0 upward, output k as soon as u is less than the current cumulative sum, k at this point being a value conforming to the Poisson distribution; otherwise set p = p · λ/(k + 1), add it to the cumulative sum, and repeat the above steps until a value of k is output. This method of computing a random number conforming to the Poisson distribution from the value of the Poisson parameter λ belongs to the prior art and is not repeated here.
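The sequential search just described can be sketched as follows; the function name and the use of Python's standard random module are assumptions for illustration.

```python
import math
import random

def poisson_sample(lam, rng=random):
    """Draw one integer k following a Poisson(lam) distribution by
    inverse-transform sampling, as sketched in step 102-4."""
    u = rng.random()               # one uniform decimal in [0, 1)
    p = math.exp(-lam)             # P(X = 0)
    cumulative = p
    k = 0
    while u >= cumulative:         # walk the CDF until it exceeds u
        p *= lam / (k + 1)         # P(X = k+1) = P(X = k) * lam / (k+1)
        cumulative += p
        k += 1
    return k
```

Averaged over many draws, the returned k values cluster around λ, so a less accurate model (larger λ) yields more new decision trees.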
At this point, a concrete value of k has been obtained from the Poisson parameter λ, and subsequent step 103 can generate the corresponding number of new decision trees according to this k. Since this step takes the classification performance of the classification model on the training sample set into account when determining k, the higher the accuracy of the classification model, the smaller the obtained k, which amounts to reducing the number of sample sets drawn and decision trees added in subsequent steps, that is, only a slight adjustment is made to the classification model; the lower the accuracy, the larger the obtained k, which amounts to increasing the number of sample sets drawn and decision trees added, that is, a relatively large adjustment is made to the classification model. In this way, the updated classification model can relatively accurately reflect the change in the incremental training data on the original basis.
Step 103: generate the determined number of new decision trees from the training sample set using a random forest algorithm.
According to the training sample set, k (that is, the determined number of) decision trees are generated in sequence using a random forest algorithm. The process of generating each decision tree includes steps 103-1 to 103-3 below, described with reference to Fig. 3.
Step 103-1: build a bootstrap sample set from the training sample set by sampling with replacement.
Bootstrap sampling (also known as bootstrapping) is a uniform sampling method with replacement, widely used in fields such as mathematical statistics and model computation. When the initial sample is well representative, the sample size can be enlarged by bootstrap sampling; and when the initial sample is sufficiently large, bootstrap sampling can approach the population distribution of the sample data without bias.
This step draws N samples, by sampling with replacement, from the training sample set containing N samples. In the drawing process, some samples in the training sample set may not be drawn at all, while others may be drawn multiple times; the N samples finally drawn form one bootstrap sample set.
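The construction of one bootstrap sample set in step 103-1 can be sketched as:

```python
import random

def bootstrap_sample(training_set, rng=random):
    """Build a bootstrap sample set: draw len(training_set) samples
    with replacement from the training sample set."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]
```

Because draws are with replacement, the resulting set has the same size as the training sample set but typically omits some samples and repeats others, which is exactly the property the out-of-bag evaluation of step 103-3 relies on.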
Building the sample set by bootstrap sampling means that, when a decision tree is subsequently generated from the sample set, the input samples of each tree are not all of the samples in the training sample set and contain relatively little noise data, so over-fitting of the newly built decision tree can be avoided.
Step 103-2: use the bootstrap sample set to generate a new decision tree by, at each node, choosing an attribute according to a predetermined policy and splitting on the chosen attribute.
Using the bootstrap sample set, a new decision tree is generated by splitting node by node; the key lies in the choice of the split attribute at each node. Specifically, for samples containing M attributes, whenever a node of the decision tree needs to split, m attributes are first selected at random from the M attributes (generally satisfying m << M), then 1 optimal attribute is chosen from the m selected attributes according to the predetermined policy as the split attribute of the node, and the node is split on this attribute. This process is repeated at each node until a node cannot split further or all the samples it contains belong to the same class, at which point splitting ends and a new decision tree has been created.
In a specific implementation, the number of randomly selected attributes can be obtained by taking the rounded square root, for example: if each sample contains M = 100 attributes, m = sqrt(M) = 10 attributes can be randomly selected each time. Other ways of determining the number of randomly selected attributes can of course be adopted, as long as the condition m << M is satisfied.
As for choosing the optimal attribute from the randomly selected attributes, splitting based on the Gini index can be adopted: first the impurity is calculated with the following formula, then the Gini index of the split produced by each attribute is calculated from the impurity, and the split whose Gini index is smallest is selected as the branch of the tree:
Gini = 1 − Σ_i (p_i)^2
where p_i is the probability that a sample belongs to the i-th class when a given attribute is chosen for splitting. Besides the above Gini index, the optimal attribute can also be chosen by information gain, or by information gain ratio; all of these can equally implement the technical scheme of the present application. The processes of choosing the optimal attribute in these three ways and splitting to generate a decision tree belong to relatively mature prior art and are not further described here.
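The Gini computation described above can be sketched as follows. The two-way numeric split at a threshold is an illustrative simplification: the application does not fix how a chosen attribute partitions the samples.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(samples, labels, attr_index, threshold):
    """Gini index of splitting on one attribute at a threshold: the
    size-weighted impurity of the two branches (smaller is better)."""
    left = [c for s, c in zip(samples, labels) if s[attr_index] <= threshold]
    right = [c for s, c in zip(samples, labels) if s[attr_index] > threshold]
    n = len(labels)
    return (len(left) / n * gini_impurity(left)
            + len(right) / n * gini_impurity(right))
```

A split that separates the classes perfectly has Gini index 0, so among the m candidate attributes the one whose best split minimizes this value is chosen.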
Step 103-3: the index of the classifying quality of new decision tree described in computational representation.
Decision tree is screened, after an often newly-built decision tree, it is possible to the index of the classifying quality of this decision tree of computational representation for the ease of subsequent step 104.It is for instance possible to use the classifying quality of the test newly-built decision tree of sample set pair is estimated, and calculate corresponding index.
Due in the technical scheme of the application, the input sample that newly-built decision tree adopts carries out sampling by bootstrap method on training sample set and obtains, when the sample size that training sample is concentrated is sufficiently large, described training sample is concentrated would generally have the sample of about 1/3 not appear in bootstrap sample set, this part sample is referred to as the outer data (outofbag is called for short oob) of bag, these part data generally can be used to replace the classifying quality of new decision tree described in test sample set pair be estimated, and use the outer error in data (ooberror is called for short oobe) of corresponding bag as the index of the classifying quality characterizing described new decision tree.
Specifically, first will be contained in described training sample and concentrate and be not included in the screening sample in bootstrap sample set out, the outer data of composition bag;Then for each sample in the outer data of bag, with newly-built decision tree, it is carried out classification prediction, and judge that whether the actual classification predicted the outcome with this sample is consistent, if unanimously, illustrate that the classification results of this sample is correct by newly-built decision tree;Finally according to classification results each time, calculate the outer error in data of bag of described newly-built decision tree.
For example, if the out-of-bag data contains 100 samples in total and the newly created decision tree classifies 90 of them correctly, the out-of-bag error of this decision tree is: (100-90)/100 = 10%.
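The per-tree out-of-bag error described above can be sketched as follows (the stand-in predictor and data are illustrative only):

```python
def oob_error(tree_predict, oob_samples, oob_labels):
    """Out-of-bag error of one decision tree: the fraction of oob samples
    whose predicted class disagrees with the actual class."""
    wrong = sum(1 for x, y in zip(oob_samples, oob_labels)
                if tree_predict(x) != y)
    return wrong / len(oob_samples)

# Mirroring the example above: 100 oob samples, 90 classified correctly.
samples = list(range(100))
labels = [1] * 100
predict = lambda x: 1 if x < 90 else 0   # a stand-in for the new decision tree
error = oob_error(predict, samples, labels)   # (100 - 90) / 100 = 0.10
```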
So far, through the above steps 103-1 to 103-3, a new decision tree has been created and the index characterizing its classifying effect obtained. By cyclically executing steps 103-1 to 103-3 a total of k times, k decision trees (that is, the newly-increased quantity of decision trees) can be generated.
As can be seen from the above description, in the process of creating k new decision trees with the random forest algorithm, random sampling is performed with the bootstrap method, and the optimum attribute for splitting is selected from randomly chosen attributes. These two aspects together fully embody the randomness of the random forest algorithm, thereby ensuring that the newly created decision trees do not exhibit over-fitting.
Step 104: sort the decision trees comprised in the disaggregated model and the newly generated decision trees according to classifying effect, and select therefrom the predetermined quantity of decision trees ranked highest.
The disaggregated model already contains the predetermined quantity of decision trees; for ease of the following description, this predetermined quantity is denoted T. In step 103, k new decision trees are generated according to the incremental training sample set obtained; these k new decision trees can usually classify the samples in the training sample set relatively correctly (with a relatively high accuracy rate).
Considering that the training sample set is only the incremental data of a recent period of time, it is representative of the change in the data to a certain extent but generally not universal; it is therefore inappropriate to substitute the k newly created decision trees for the original T decision trees. On the other hand, if the k decision trees were simply added to the disaggregated model (the updated model would then contain T+k trees in total), the model would grow excessively large as the number of updates increases. The technical scheme of the application therefore adopts the following processing mode: from the original T decision trees of the disaggregated model and the k newly created decision trees, the T decision trees with the best classifying effect are selected. The T decision trees so selected can reflect the change in the incremental sample data, achieving the purpose of adjusting the disaggregated model, while also ensuring that the scale of the model remains unchanged.
Specifically, a test sample set may be used to evaluate the classifying effect of the original decision trees of the disaggregated model and of the newly created decision trees, and T decision trees screened out according to classifying effect. In the present embodiment, since the index characterizing the classifying effect of each new decision tree, namely the out-of-bag error, has already been calculated in the process of creating the new decision trees in step 103, this step may likewise calculate, for each original decision tree of the disaggregated model, the out-of-bag error characterizing its classifying effect, and sort and screen the above decision trees according to this index. The processing of this step is further described below in conjunction with Fig. 4.
Step 104-1: for each decision tree comprised in the disaggregated model, calculate the out-of-bag error characterizing its classifying effect.
First, the out-of-bag data of every newly created decision tree are gathered together to obtain an out-of-bag data collection; then the samples in this collection are used as input to calculate the out-of-bag error of each decision tree comprised in the disaggregated model. The concrete calculation method is essentially the same as in step 103-3; reference may be made to the associated description of step 103-3, which is not repeated herein.
Step 104-2: sort the decision trees comprised in the disaggregated model and the newly generated decision trees according to the out-of-bag error.
The T decision trees comprised in the disaggregated model and the k newly generated decision trees are sorted according to the out-of-bag error, that is: decision trees with good classifying effect (small out-of-bag error) are ranked before decision trees with relatively poor classifying effect (larger out-of-bag error), yielding a sequence in which the decision tree with the best classifying effect occupies the highest position and the decision tree with the worst classifying effect occupies the lowest position.
Step 104-3: select the predetermined quantity of decision trees ranked highest from the sorted decision trees.
The processing of this step is relatively simple: according to the sorting result obtained in step 104-2, the T decision trees ranked highest are selected from the T+k decision trees, and the remaining k decision trees are discarded.
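The screening of steps 104-2 and 104-3 amounts to a sort-and-truncate over (tree, oob error) pairs; a minimal sketch, in which the tree objects and error values are hypothetical stand-ins:

```python
def select_trees(original, new, T):
    """Keep the T trees with the smallest out-of-bag error out of the
    T original trees and the k newly created ones. Each entry is a
    (tree, oob_error) pair, so ascending sort puts the best first."""
    ranked = sorted(original + new, key=lambda pair: pair[1])
    return ranked[:T]

# Three original trees and two newly created ones; T stays fixed at 3.
original = [("old0", 0.12), ("old1", 0.08), ("old2", 0.30)]
new = [("new0", 0.05), ("new1", 0.25)]
kept = select_trees(original, new, T=3)
# kept holds the trees with oob errors 0.05, 0.08 and 0.12: one original
# tree ("old2") has been replaced by a newly created one ("new0").
```

This is the "partial update" property described above: the model scale stays at T while the worst-performing trees, old or new, are discarded.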
In a concrete example of the present embodiment, for a disaggregated model in the field of Internet risk control, the value of T is generally within the range of 200 to 400, and the number of decision trees newly created each time the model is updated is generally within the range of 0 to 20. Each time decision trees are screened through the above steps, it is in fact a process of partially updating the decision trees in the model, that is: taking classifying effect as the basis, a corresponding number of decision trees in the model are replaced with all or part of the newly created decision trees.
Step 105: collect the selected decision trees to obtain the updated disaggregated model.
The T decision trees screened out in step 104 are collected, yielding the updated disaggregated model, which can continue to perform classification prediction on large-scale data online. Since part of the decision trees in the updated model are generated according to the incremental sample data obtained in step 101, the model is adjusted in time, on its original basis, according to the latest sample data, thereby ensuring that its classifying effect can always satisfy the preset requirement and, as a rule, will not degenerate.
It should be noted that the above steps 101-105 mainly describe how a disaggregated model is updated with the method provided by the application. In concrete implementation, if the disaggregated model has not yet been created, the method provided by the application can still be adopted; the update process of the disaggregated model is then in fact a process of creating the model from nothing.
Specifically, before step 102 determines the quantity of newly-increased decision trees, it is first judged whether a disaggregated model has already been created. If it has been created, the model is updated according to the process described above; otherwise, the preset quantity of decision trees that the disaggregated model is to comprise is taken as the newly-increased quantity, that is, k=T is set directly, k decision trees are created according to step 103, and step 105 is then executed directly to collect and output the k decision trees (that is, T decision trees), thereby obtaining the newly created disaggregated model. Thereafter, the method of the application can be adopted to update this model according to incremental data.
In concrete implementation, since the model update process usually involves the testing and training of a relatively large amount of sample data, MapReduce technology may generally be adopted in order to further improve processing efficiency and achieve a real-time or approximately real-time update effect.
For example, when step 102 verifies the disaggregated model to determine the quantity of newly-increased decision trees, the MapReduce programming model may be adopted: in the Map stage, each Map is responsible for performing prediction on the training sample set with a single decision tree of the disaggregated model; in the Reduce stage, the results of the Map stage are aggregated to obtain the parameter value of the Poisson distribution, from which the quantity of newly-increased decision trees is further determined. In the process of generating decision trees in step 103, the MapReduce programming model may likewise be adopted: in the Map stage, each Map is responsible for generating one decision tree according to a bootstrap sample set; in the Reduce stage, all decision trees are collected and screened.
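One simplified reading of the Map/Reduce split for step 102 can be sketched with plain Python functions standing in for the cluster framework; the per-tree scoring and the aggregation shown here are illustrative assumptions (a real implementation would run on Hadoop or a similar framework, and the aggregation into the Poisson parameter may differ, e.g. by majority voting):

```python
from functools import reduce

def map_tree(tree_predict, samples, labels):
    """Map stage: one mapper scores a single decision tree on the training
    sample set and emits a (correct_count, total_count) pair."""
    correct = sum(1 for x, y in zip(samples, labels) if tree_predict(x) == y)
    return (correct, len(samples))

def reduce_counts(counts):
    """Reduce stage: aggregate the per-tree counts into an overall
    accuracy, from which the Poisson parameter would be derived."""
    correct, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), counts)
    return correct / total

samples, labels = [0, 1, 2, 3], [0, 1, 0, 1]
trees = [lambda x: x % 2, lambda x: 0]   # two stand-in decision trees
accuracy = reduce_counts([map_tree(t, samples, labels) for t in trees])
```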
In summary, the method for updating a disaggregated model provided by the application does not need to train the model on the full amount of data; instead, the incremental data of a recent period of time is chosen as the training sample set, a certain number of decision trees is generated from it, and part of the original decision trees of the model is replaced according to classifying effect, thereby realizing an incremental update of the model. In a particular application, the model can be updated at various time granularities as required, for example daily, hourly or approximately in real time, which not only improves the efficiency of model training and enables a quick response to the business, but also requires no extra manual intervention during the service period of the model, reducing labor cost.
The above embodiment provides a method for updating a disaggregated model; correspondingly, the application also provides a device for updating a disaggregated model. Refer to Fig. 5, which is a schematic diagram of an embodiment of a device for updating a disaggregated model of the application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts, refer to the corresponding description of the method embodiment. The device embodiment described below is merely schematic.
A device for updating a disaggregated model of the present embodiment includes: a training sample set acquiring unit 501, for obtaining the incremental data in a predetermined period of time from the historical data of the application of the disaggregated model, as the training sample set; a newly-increased quantity determining unit 502, for determining the quantity of newly-increased decision trees; a decision tree creating unit 503, for generating the newly-increased quantity of decision trees according to the training sample set by adopting the random forest algorithm; a decision tree screening unit 504, for sorting the decision trees comprised in the disaggregated model and the newly generated decision trees according to classifying effect, and selecting therefrom the predetermined quantity of decision trees ranked highest; and a disaggregated model output unit 505, for collecting the selected decision trees to obtain the updated disaggregated model.
Optionally, the newly-increased quantity determining unit is specifically used for verifying the disaggregated model with the training sample set, and determining the quantity of newly-increased decision trees according to the verification result.
Optionally, the newly-increased quantity determining unit includes:
a correctness verification subunit, for verifying the correctness of the disaggregated model with each sample in the training sample set;
an accuracy calculation subunit, for calculating, according to the verification result, the accuracy with which the disaggregated model classifies the training sample set;
a Poisson distribution parameter determining subunit, for determining the parameter value of a Poisson distribution according to the accuracy, so that the accuracy and the parameter value of the Poisson distribution satisfy an inverse relation; the Poisson distribution is the discrete probability distribution followed by a new sample set obtained by performing sampling with replacement on the training sample set;
a random number determining subunit, for determining a random number conforming to the discrete probability distribution according to the parameter value of the Poisson distribution, and taking this random number as the quantity of newly-increased decision trees.
Optionally, the correctness verification subunit includes:
a first loop control subunit, for triggering, for each sample in the training sample set, the following subunits to work in turn;
a class prediction subunit, for performing class prediction with the disaggregated model according to the attribute information comprised in a training sample;
a judgment subunit, for judging whether the predicted class is consistent with the actual class of the training sample, and if consistent, determining that the disaggregated model has classified the training sample correctly.
Optionally, the decision tree creating unit includes:
a second loop control subunit, for judging whether the number of decision trees created has reached the newly-increased quantity, and if not, triggering the following subunits in turn to create a new decision tree;
a bootstrap sampling subunit, for building a bootstrap sample set from the training sample set by sampling with replacement;
a creation execution subunit, for generating a new decision tree using the bootstrap sample set, by choosing an attribute according to a predetermined policy at each node and splitting according to the chosen attribute, and triggering the second loop control subunit to work; choosing an attribute according to a predetermined policy refers to choosing an attribute, according to the predetermined policy, from randomly selected sample attributes.
Optionally, the predetermined policy adopted by the creation execution subunit when choosing an attribute includes: choosing the attribute according to information gain, choosing the attribute according to information gain ratio, or choosing the attribute according to the Gini index.
Optionally, the decision tree creating unit further includes:
a new-tree index calculation subunit, for calculating, after the creation execution subunit creates a new decision tree, the index characterizing the classifying effect of the new decision tree.
Correspondingly, the decision tree screening unit includes:
an original-tree index calculation subunit, for calculating, for each decision tree comprised in the disaggregated model, the index characterizing its classifying effect;
a sorting subunit, for sorting the decision trees comprised in the disaggregated model and the newly generated decision trees according to the index;
a selecting subunit, for selecting the predetermined quantity of decision trees ranked highest from the sorted decision trees.
Optionally, the index calculated by the new-tree index calculation subunit refers to the out-of-bag error.
Correspondingly, the original-tree index calculation subunit includes:
an out-of-bag data acquisition subunit, for gathering the out-of-bag data of every new decision tree to obtain an out-of-bag data collection;
an error calculation subunit, for calculating, using the out-of-bag data collection, the out-of-bag error characterizing the classifying effect of each decision tree comprised in the disaggregated model.
Optionally, the device includes:
a disaggregated model judgment subunit, for judging whether the disaggregated model has already been created.
Correspondingly, when the output of the disaggregated model judgment subunit is "No", the newly-increased quantity determining unit is used for taking the preset quantity of decision trees that the disaggregated model is to comprise as the quantity of newly-increased decision trees.
Correspondingly, after completing its operation, the decision tree creating unit directly triggers the disaggregated model output unit to work, and the disaggregated model output unit is specifically used for collecting the generated newly-increased quantity of decision trees to obtain the updated disaggregated model.
Although the application is disclosed above with preferred embodiments, they are not intended to limit the application. Any person skilled in the art can make possible variations and amendments without departing from the spirit and scope of the application; therefore, the protection scope of the application should be subject to the scope defined by the claims of the application.
In a typical configuration, a computing device includes one or more processors (CPU), an input/output interface, a network interface and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may realize information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It will be understood by those skilled in the art that embodiments of the application may be provided as a method, a system or a computer program product. Therefore, the application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory and the like) containing computer-usable program code.