CN105718490A - Method and device for updating a classification model - Google Patents

Method and device for updating a classification model

Info

Publication number
CN105718490A
Authority
CN
China
Prior art keywords
decision tree
classification model
training sample
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410737856.7A
Other languages
Chinese (zh)
Inventor
沈雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201410737856.7A
Publication of CN105718490A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for updating a classification model. The method comprises the following steps: obtaining incremental data within a preset time period from a server that provides user behavior data, as a training sample set; determining the number of decision trees to be newly added; generating that number of decision trees from the training sample set using a random forest algorithm; ranking the decision trees contained in the classification model together with the newly generated decision trees by classification performance; selecting the preset number of top-ranked decision trees; and assembling the selected decision trees to obtain the updated classification model. The invention also provides a device for updating the classification model. With this method, training on the full data set is unnecessary; an incremental update is applied on the basis of the original classification model, so model training efficiency is improved and rapid response to the business is achieved.

Description

Method and device for updating a classification model
Technical field
The present application relates to decision-tree-based classification models, and in particular to a method for updating a classification model. The present application also relates to a device for updating a classification model.
Background art
With the development of Internet technology, a large number of network applications have appeared, such as social networking and online reading. In order to recommend more targeted information to users or to carry out necessary monitoring and management, network application providers usually need to perform class prediction for preset targets according to users' operation behavior in the application, for example: whether a user belongs to an active group, or whether a user's operation behavior carries potential risk. To improve prediction efficiency and accuracy, most network applications use a classification model for such prediction.
A classification model (also often called a classifier) maps a sample of unknown class to one of a set of given classes according to the features of the data. Constructing a classification model is generally divided into a training phase and a testing phase. In the training phase, the model is constructed by analyzing a training sample set described by attributes; in the testing phase, a test sample set is used to evaluate the classification accuracy of the model. If the accuracy meets the required level, the classification model can be put into practical use to predict the class of samples of unknown class.
The construction of a classification model is generally realized through machine learning, where a decision tree predicts the class by finding relations between attributes and classes. To improve classification correctness, the random forest classifier was developed on the basis of decision trees: a classifier composed of multiple decision trees. When a sample to be classified enters the random forest, every decision tree classifies it, and the class voted for most often by all the decision trees is chosen as the final classification result.
The above decision-tree-based classification models are widely used in the Internet field. The machine learning process for this class of models basically adopts an offline learning mode: the full amount of historical user behavior data is studied and analyzed to derive knowledge about the classes, and the classification model is then built and deployed online. As time goes on, users' behavior patterns usually change, and the goods, information and so on that each application or website presents to users are also constantly adjusted, so the degree to which the model's predictions disagree with the actual classes exceeds a preset range; that is, the classification model degrades and its classification accuracy no longer meets requirements. For this situation, the prior art generally relies on manual intervention and offline computation: the classification model is retrained with the full amount of historical data, and the retrained model is deployed online again for class prediction.
The above way of updating a classification model has the following defects:
1) The classification model is rebuilt from the full data set every time; as the data volume grows, the processing time grows accordingly, so model training efficiency decreases.
2) Retraining is generally performed only after the classification model has already degraded; that is, the model cannot adjust in real time or in a timely manner to changes in the data, so the response to the business is slow. In some more sensitive business fields, such as risk control, lawbreakers may exploit this defect: by trying repeatedly, they can find ways around the classification model and avoid being identified, making the system's prevention and control lag behind.
Summary of the invention
The present application provides a method for updating a classification model, to solve the problems of low training efficiency and untimely updating in existing update approaches. The present application also provides a device for updating a classification model.
The present application provides a method for updating a classification model, the classification model being composed of a predetermined number of decision trees and used to perform class prediction according to user behavior data in a network application, the method comprising:
obtaining incremental data within a predetermined time period from a server that provides the user behavior data, as a training sample set;
determining the number of decision trees to be newly added;
generating the newly added number of decision trees from the training sample set using a random forest algorithm;
ranking, by classification performance, the decision trees contained in the classification model together with the newly generated decision trees, and selecting from them the predetermined number of top-ranked decision trees;
assembling the selected decision trees to obtain the updated classification model.
Optionally, determining the number of decision trees to be newly added means: verifying the classification model with the training sample set, and determining the number of decision trees to be newly added according to the verification result.
Optionally, verifying the classification model with the training sample set and determining the number of decision trees to be newly added according to the verification result comprises:
using each sample in the training sample set to verify the correctness of the classification model;
calculating, according to the verification results, the accuracy with which the classification model classifies the training sample set;
determining the parameter value of a Poisson distribution according to the accuracy, such that the accuracy and the parameter value satisfy an inverse relation, the Poisson distribution being the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement;
determining, according to the parameter value of the Poisson distribution, a random number that follows the discrete probability distribution, and taking this random number as the number of decision trees to be newly added.
Optionally, using each sample in the training sample set to verify the correctness of the classification model comprises:
performing class prediction with the classification model according to the attribute information contained in a training sample;
judging whether the predicted class is consistent with the actual class of the training sample;
if consistent, determining that the classification model classifies the training sample correctly.
Optionally, generating the newly added number of decision trees from the training sample set using a random forest algorithm comprises:
building a bootstrap sample set from the training sample set by sampling with replacement;
using the bootstrap sample set to generate a new decision tree by, at each node, choosing an attribute according to a predetermined policy and splitting according to the chosen attribute, where choosing an attribute according to a predetermined policy means choosing it, according to the predetermined policy, from randomly selected sample attributes;
returning to the step of building a bootstrap sample set from the training sample set by sampling with replacement, until the newly added number of decision trees have been generated.
Optionally, choosing an attribute according to a predetermined policy comprises: choosing the attribute according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, after generating a new decision tree by choosing an attribute at each node according to the predetermined policy and splitting according to the chosen attribute, the following operation is performed:
calculating an index that characterizes the classification performance of the new decision tree;
correspondingly, ranking by classification performance the decision trees contained in the classification model together with the newly generated decision trees, and selecting the predetermined number of top-ranked decision trees, comprises:
for every decision tree contained in the classification model, calculating the index that characterizes its classification performance;
ranking, according to the index, the decision trees contained in the classification model together with the newly generated decision trees;
selecting the predetermined number of top-ranked decision trees from the ranked decision trees.
Optionally, the index characterizing the classification performance of a new decision tree is the out-of-bag (OOB) error;
correspondingly, calculating, for every decision tree contained in the classification model, the index characterizing its classification performance comprises:
aggregating the out-of-bag data of every new decision tree to obtain an out-of-bag data set;
using the out-of-bag data set to calculate the OOB error that characterizes the classification performance of every decision tree contained in the classification model.
Optionally, before the step of determining the number of decision trees to be newly added, the following operations are performed:
judging whether the classification model has already been created;
if not, determining the number of decision trees to be newly added means taking the preset number of decision trees that the classification model is to contain as the newly added number; correspondingly, after the newly added number of decision trees have been generated with the random forest algorithm, the step of assembling the selected decision trees to obtain the updated classification model is performed directly, the generated decision trees of the newly added number being the selected decision trees.
Correspondingly, the present application also provides a device for updating a classification model, comprising:
a training sample set acquiring unit, configured to obtain incremental data within a predetermined time period from a server that provides the user behavior data, as a training sample set;
a newly added number determining unit, configured to determine the number of decision trees to be newly added;
a decision tree creating unit, configured to generate the newly added number of decision trees from the training sample set using a random forest algorithm;
a decision tree screening unit, configured to rank, by classification performance, the decision trees contained in the classification model together with the newly generated decision trees, and to select from them the predetermined number of top-ranked decision trees;
a classification model output unit, configured to assemble the selected decision trees to obtain the updated classification model.
Optionally, the newly added number determining unit is specifically configured to verify the classification model with the training sample set and to determine the number of decision trees to be newly added according to the verification result.
Optionally, the newly added number determining unit comprises:
a correctness verification subunit, configured to use each sample in the training sample set to verify the correctness of the classification model;
an accuracy calculation subunit, configured to calculate, according to the verification results, the accuracy with which the classification model classifies the training sample set;
a Poisson parameter determining subunit, configured to determine the parameter value of a Poisson distribution according to the accuracy such that the accuracy and the parameter value satisfy an inverse relation, the Poisson distribution being the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement;
a random number determining subunit, configured to determine, according to the parameter value of the Poisson distribution, a random number that follows the discrete probability distribution, and to take this random number as the number of decision trees to be newly added.
Optionally, the correctness verification subunit comprises:
a first loop control subunit, configured to trigger, for each sample in the training sample set, the following subunits in turn;
a class prediction subunit, configured to perform class prediction with the classification model according to the attribute information contained in a training sample;
a judgment subunit, configured to judge whether the predicted class is consistent with the actual class of the training sample, and if consistent, to determine that the classification model classifies the training sample correctly.
Optionally, the decision tree creating unit comprises:
a second loop control subunit, configured to judge whether the number of decision trees created has reached the newly added number, and if not, to trigger the following subunits in turn to create a new decision tree;
a bootstrap sampling subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement;
a creation execution subunit, configured to use the bootstrap sample set to generate a new decision tree by, at each node, choosing an attribute according to a predetermined policy and splitting according to the chosen attribute, and to trigger the second loop control subunit; choosing an attribute according to a predetermined policy means choosing it, according to the predetermined policy, from randomly selected sample attributes.
Optionally, the predetermined policy adopted by the creation execution subunit when choosing an attribute comprises: choosing the attribute according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, the decision tree creating unit further comprises:
a new-tree index calculation subunit, configured to calculate, after the creation execution subunit creates a new decision tree, the index characterizing the classification performance of the new decision tree;
correspondingly, the decision tree screening unit comprises:
an original index calculation subunit, configured to calculate, for every decision tree contained in the classification model, the index characterizing its classification performance;
a ranking subunit, configured to rank, according to the index, the decision trees contained in the classification model together with the newly generated decision trees;
a selection subunit, configured to select the predetermined number of top-ranked decision trees from the ranked decision trees.
Optionally, the index calculated by the new-tree index calculation subunit is the out-of-bag (OOB) error;
correspondingly, the original index calculation subunit comprises:
an out-of-bag data acquisition subunit, configured to aggregate the out-of-bag data of every new decision tree to obtain an out-of-bag data set;
an error calculation subunit, configured to use the out-of-bag data set to calculate the OOB error characterizing the classification performance of every decision tree contained in the classification model.
Optionally, the device comprises:
a classification model judgment subunit, configured to judge whether the classification model has already been created;
correspondingly, when the output of the classification model judgment subunit is "no", the newly added number determining unit is configured to take the preset number of decision trees that the classification model is to contain as the newly added number;
correspondingly, after completing its operation, the decision tree creating unit directly triggers the classification model output unit, and the classification model output unit is specifically configured to assemble the generated decision trees of the newly added number to obtain the updated classification model.
Compared with the prior art, the present application has the following advantages:
In the method for updating a classification model provided by the present application, incremental data from a recent period is chosen as the training sample set, a certain number of new decision trees are generated from this training sample set with a random forest algorithm, and the predetermined number of decision trees with the best classification performance are selected from the decision trees contained in the classification model together with the newly generated decision trees, as the updated classification model. With this method, training on the full data set is unnecessary; instead, an incremental update is applied on the basis of the original classification model. The model can therefore be updated at whatever time granularity is required, for example daily or in near real time, which not only improves the efficiency of model training and enables rapid response to the business, but also avoids extra manual intervention during the service period of the model, reducing labor cost.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the method for updating a classification model of the present application;
Fig. 2 is a flowchart of the process, provided in the embodiment of the present application, of determining the number of decision trees to be newly added;
Fig. 3 is a flowchart of the process, provided in the embodiment of the present application, of generating a decision tree;
Fig. 4 is a flowchart of the process, provided in the embodiment of the present application, of screening decision trees;
Fig. 5 is a schematic diagram of an embodiment of the device for updating a classification model of the present application.
Detailed description of the invention
Many specific details are set forth in the following description so that the present application can be fully understood. However, the present application can be implemented in many ways different from those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the application, so the application is not limited by the specific implementations disclosed below.
The present application provides a method for updating a classification model and a device for updating a classification model, which are described in detail one by one in the following embodiments.
Refer to Fig. 1, which is a flowchart of an embodiment of the method for updating a classification model of the present application. The method comprises the following steps:
Step 101: obtain incremental data within a predetermined time period from the server that provides the user behavior data, as a training sample set.
The core of the method provided by the present application is that, instead of the conventional offline approach of rebuilding the classification model from the full data set, incremental data is used to update the model, so the classification model can adjust in a timely or near-real-time manner to changes in the sample data and stay synchronized with the latest samples. Compared with traditional offline learning, the method provided by the present application can incrementally improve the classification model based on the latest sample data generated online, and can therefore also be regarded as a form of online learning.
To realize the technical solution of the present application, this step obtains incremental data within a predetermined time period from the server that provides the user behavior data, as the training sample set. The training sample set is a set of samples, each having a form like (x1, x2, ..., xn : c), where xi denotes a specific attribute value of the sample and c denotes the actual class of the sample. For example, in a specific example of this embodiment, in the risk control business of an Internet trading platform, a classification model is used to predict whether a user's transaction behavior is risky. The attributes of each sample include personal attributes such as user account and age, commodity attributes such as category, title and price, and the transaction amount; the classes are black and white samples (corresponding to risky and risk-free respectively).
In this example, this step obtains the user transaction data within the predetermined time period before the current time as the training sample set. The predetermined time period can be configured according to the specific requirements, for example in units of days, hours, or even minutes, as long as the data within the period can be obtained and used as a training sample set (that is, it contains complete attribute information and actual class information).
Step 102: determine the number of decision trees to be newly added.
In the technical solution of the present application, the classification model (that is, the classification model to be updated) is composed of a predetermined number of decision trees. On the basis of this classification model, some new decision trees are generated from the training sample set obtained in step 101, and the predetermined number of decision trees with the best classification performance are selected from the original decision trees of the classification model and the newly added decision trees, as the updated classification model, thereby updating the classification model according to the incremental data.
Suppose the training sample set contains N samples. The subsequent step 103 obtains k sample sets from the training sample set by sampling with replacement, and uses these k sample sets to build k new decision trees respectively. The main purpose of this step is therefore to determine the number of decision trees to be newly added, that is, the value of k above.
As a simple embodiment, a fixed value can be set empirically, with reference to the specific application scenario of the classification model, the complexity of the sample data, and the number of decision trees the classification model contains. For example, in the above risk-control Internet application, the classification model usually contains 200 to 400 decision trees, so the number of decision trees to be newly added can be set to 10. This is only an example; various factors can be considered in a specific implementation, for example the size of the obtained training sample set can also be taken as a reference factor: the more samples the training sample set contains, the more the number of newly added decision trees can be increased.
The above way is simple and easy to implement, but it does not consider how well the classification model classifies the training sample set that has been obtained. For this, the technical solution of the present application provides a preferred implementation: verify the classification model with the training sample set, and determine the number of decision trees to be newly added according to the verification result. The specific process comprises steps 102-1 to 102-4, which are further described below with reference to Fig. 2.
Step 102-1: use each sample in the training sample set to verify the correctness of the classification model.
Specifically, for a given sample, each decision tree in the classification model classifies it according to its attribute information, and the class voted for most often by all the decision trees is chosen as the final predicted class (this process is also commonly called majority voting). The predicted class is then compared with the actual class of the sample being classified; if they are consistent, the classification model is considered to have classified the current sample correctly.
Each sample in the training sample set obtained in step 101 is verified in this way, and the number of correct classification results is recorded.
Step 102-2: calculate, according to the verification results, the accuracy with which the classification model classifies the training sample set.
In this step, the ratio of the number of samples in the training sample set that the classification model classifies correctly to the total number of samples can be taken as the accuracy of the classification model on the training sample set. This value reflects the classification performance of the model on the training sample set, and the number of decision trees to be newly added can be determined from it in the subsequent steps.
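As an illustration of steps 102-1 and 102-2, the following is a minimal sketch, not the patent's implementation: the tree layout, the helper names, and the sample layout (attrs, actual_class) are assumptions made for the example.

```python
from collections import Counter

def predict_tree(tree, attrs):
    # Minimal tree walk: a tree is either a class label (leaf) or a dict of the form
    # {"attr": index, "value": threshold, "left": subtree, "right": subtree}.
    while isinstance(tree, dict):
        tree = tree["left"] if attrs[tree["attr"]] <= tree["value"] else tree["right"]
    return tree

def forest_predict(forest, attrs):
    # Every decision tree votes; the class chosen most often wins (majority voting).
    votes = Counter(predict_tree(t, attrs) for t in forest)
    return votes.most_common(1)[0][0]

def forest_accuracy(forest, samples):
    # samples: iterable of (attrs, actual_class) pairs taken from the incremental data.
    samples = list(samples)
    correct = sum(1 for attrs, c in samples if forest_predict(forest, attrs) == c)
    return correct / len(samples)
```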
Step 102-3: determine the parameter value of the Poisson distribution according to the accuracy, such that the accuracy and the parameter value satisfy an inverse relation.
The process of obtaining k sample sets by sampling with replacement from a training sample set containing N samples follows the binomial distribution:
$P(K=k) = \binom{N}{k}\left(\frac{1}{N}\right)^{k}\left(1-\frac{1}{N}\right)^{N-k}$
Further, when N is very large or tends to infinity, the above binomial distribution of k tends to the Poisson distribution:
$P(X=k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$
Based on the principle of this discrete probability distribution, the parameter λ of the Poisson distribution can be adjusted according to the accuracy with which the classification model classifies the training sample set, and a discrete value k that follows the Poisson distribution can then be determined from the value of λ.
Specifically, if the classification model classifies the training sample set well (the accuracy is high), the value of λ is decreased accordingly; otherwise it is increased accordingly. In a specific implementation, λ can be adjusted within a preset range. For example, suppose the range of λ is preset to 1-20: if the classification accuracy is 80%, λ = 10 can be taken; if the accuracy is above 80%, the value of λ can be adjusted correspondingly between 1 and 9; if the accuracy is below 80%, the value of λ can be adjusted correspondingly between 11 and 20. This example is merely schematic; a specific setting can follow the above idea, as long as the accuracy and the value of λ roughly satisfy an inverse relation.
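One possible mapping from accuracy to λ, sketched as an assumption rather than the patent's formula; the factor of 50 is chosen only so that the 80% → λ = 10 example above is reproduced and the inverse relation holds:

```python
def lambda_from_accuracy(accuracy, lam_min=1, lam_max=20):
    # Higher accuracy -> smaller lambda (inverse relation), clamped to the preset range.
    lam = round((1.0 - accuracy) * 50)
    return max(lam_min, min(lam_max, lam))
```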
Step 102-4: determine, according to the parameter value of the Poisson distribution, a random number following the distribution, and take this random number as the number of decision trees to be newly added.
The value of the Poisson parameter λ was determined in step 102-3; this step determines from λ a random number k that follows the Poisson distribution, k being the number of decision trees to be newly added in the subsequent step 103. In a specific implementation, the following calculation is generally adopted:
According to the expression of the Poisson distribution, $P(X=k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$, the relation between consecutive terms can be derived as $P(X=k+1) = P(X=k)\cdot\frac{\lambda}{k+1}$. Therefore, initially set p = exp(-λ); then, for integer k increasing from 1, generate a decimal with rand() or a similar function each time, and output k if the decimal is less than the current value of p, k then being a value that follows the Poisson distribution; otherwise set p = p·λ/(k+1) and repeat the above steps until a k value is output. This way of determining a Poisson-distributed random number from the value of the parameter λ belongs to the prior art and is not described further here.
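For reference, a small sketch of the textbook inverse-transform variant of the recurrence above, which walks the cumulative distribution against a single uniform draw (an illustration of the prior-art technique, not text from the patent):

```python
import math
import random

def poisson_random(lam):
    # Inverse-transform sampling: advance the Poisson pmf via
    # P(X=k) = P(X=k-1) * lam / k until the CDF exceeds a uniform draw.
    u = random.random()
    k = 0
    p = math.exp(-lam)   # P(X=0)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k
```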
At this point a concrete value of k has been obtained from the Poisson parameter λ, and the subsequent step 103 can generate the corresponding number of new decision trees according to this k. Because the classification performance of the classification model on the training sample set is taken into account when determining k, the higher the accuracy of the model, the smaller the resulting k, which amounts to drawing fewer sample sets and adding fewer decision trees in the subsequent steps, that is, only a slight adjustment of the classification model; the lower the accuracy, the larger the resulting k, which amounts to drawing more sample sets and adding more decision trees, that is, a relatively large adjustment. In this way, the updated classification model can reflect the change in the incremental training data comparatively accurately on the original basis.
Step 103: generate the newly added number of decision trees from the training sample set using a random forest algorithm.
According to the training sample set, k (that is, the newly added number of) decision trees are generated in turn with the random forest algorithm. The process of generating each decision tree comprises steps 103-1 to 103-3 below, which are further described with reference to Fig. 3.
Step 103-1: build a bootstrap sample set from the training sample set by sampling with replacement.
Bootstrap sampling (also called bootstrapping) is a uniform sampling method with replacement, widely used in fields such as mathematical statistics and model computation. When the initial sample is representative, bootstrap sampling can enlarge the sample size, and when the initial sample is large enough, bootstrap sampling approximates the population distribution of the sample data without bias.
This step draws N samples, with replacement, from the training sample set containing N samples. In the drawing process, some samples in the training sample set may not be drawn at all while others may be drawn multiple times; the N samples finally drawn form one bootstrap sample set.
Because the input samples of each tree generated from a bootstrap sample set are then not the entire training sample set and contain relatively little noise data, building the sample set by bootstrap sampling helps prevent the newly built decision tree from overfitting.
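A minimal sketch of step 103-1; the helper name and the index-based representation are choices made for the example, not mandated by the description:

```python
import random

def bootstrap_sample(samples):
    # Draw N indices with replacement; the samples never drawn are the
    # out-of-bag (OOB) data used later to score the new tree.
    n = len(samples)
    drawn = [random.randrange(n) for _ in range(n)]
    in_bag = [samples[i] for i in drawn]
    oob = [samples[i] for i in set(range(n)) - set(drawn)]
    return in_bag, oob
```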
Step 103-2: use the bootstrap sample set to generate a new decision tree by, at each node, choosing an attribute according to the predetermined policy and splitting according to the chosen attribute.
The bootstrap sample set is used to grow a new decision tree by splitting node after node; the key is the choice of the splitting attribute at each node. Specifically, for samples containing M attributes, when a node of the decision tree needs to split, m attributes are first selected at random from the M attributes (usually satisfying m << M), then one best attribute is chosen from the m selected attributes according to the predetermined policy as the splitting attribute of this node, and the node is split according to this attribute. This process is repeated at each node until a node can no longer be split or all the samples it contains belong to the same class, at which point splitting ends and a new decision tree has been created.
In a specific implementation, the number of randomly selected attributes can be obtained by taking the square root and rounding. For example, if each sample contains M = 100 attributes, m = sqrt(M) = 10 attributes can be randomly selected each time. Other ways of determining the number of randomly selected attributes can of course also be adopted, as long as m << M is satisfied.
To choose the best attribute from the randomly selected attributes, splitting based on the Gini index can be adopted: the impurity is first calculated with the following formula, the Gini index of the split according to each attribute is then computed from the impurity, and the split with the smallest Gini index is selected as the branch of the tree:
$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^{2}$
where p_i is the probability that a sample belongs to class i when a given attribute is chosen for splitting. Besides the Gini index, the best attribute can also be chosen according to information gain or according to information gain ratio; all of these can equally realize the technical solution of the present application. The processes of choosing the best attribute in these three ways and splitting to generate a decision tree belong to relatively mature prior art and are not described in further detail here.
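A minimal sketch of Gini-based attribute selection at a single node, under the assumptions that attribute values are categorical and that samples are laid out as (attrs, label) pairs; the helper names are illustrative, not from the patent:

```python
import math
import random
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_i p_i^2 over the class proportions in this subset.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_attribute(samples, n_attrs):
    # Random forest randomness: consider only m ~ sqrt(M) randomly chosen attributes.
    m = max(1, int(math.sqrt(n_attrs)))
    candidates = random.sample(range(n_attrs), m)
    best_attr, best_score = None, float("inf")
    for a in candidates:
        # Weighted Gini index of the partition induced by attribute a.
        groups = {}
        for attrs, label in samples:
            groups.setdefault(attrs[a], []).append(label)
        score = sum(len(g) / len(samples) * gini(g) for g in groups.values())
        if score < best_score:
            best_attr, best_score = a, score
    return best_attr
```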
Step 103-3: calculate the index characterizing the classification performance of the new decision tree.
To facilitate the screening of decision trees in the subsequent step 104, the index characterizing the classification performance of each decision tree can be calculated right after the tree is newly built. For example, a test sample set could be used to evaluate the classification performance of the newly built decision tree, and the corresponding index could be calculated.
In the technical solution of the present application, the input samples of a newly built decision tree are obtained by bootstrap sampling on the training sample set. When the training sample set is large enough, roughly one third of its samples usually do not appear in the bootstrap sample set; these samples are called out-of-bag data (oob). This part of the data can generally be used, in place of a test sample set, to evaluate the classification performance of the new decision tree, and the corresponding out-of-bag error (oob error) is used as the index characterizing the classification performance of the new decision tree.
Specifically, the samples that are contained in the training sample set but not in the bootstrap sample set are first filtered out to form the out-of-bag data; then, for each sample in the out-of-bag data, the newly built decision tree predicts its class, and the prediction is compared with the actual class of the sample; if they are consistent, the new tree classifies the sample correctly; finally, the out-of-bag error of the newly built decision tree is calculated from these classification results.
For example, if the out-of-bag data contains 100 samples in total and the newly built decision tree classifies 90 of them correctly, then the out-of-bag error of this decision tree is (100 - 90) / 100 = 10%.
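A minimal sketch of the OOB error computation, reusing the hypothetical predict_tree helper sketched after step 102-2:

```python
def oob_error(tree, oob_samples):
    # Fraction of out-of-bag samples the tree misclassifies.
    wrong = sum(1 for attrs, c in oob_samples if predict_tree(tree, attrs) != c)
    return wrong / len(oob_samples)
```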
At this point, through steps 103-1 to 103-3, a new decision tree has been created and the index characterizing its classification performance has been obtained. Steps 103-1 to 103-3 are executed in a loop k times in total to generate k (that is, the newly added number of) decision trees.
It can be seen from the above description that, in the process of building the k new decision trees with the random forest algorithm, random sampling via bootstrap and the selection of the best attribute from randomly chosen attributes together fully embody the randomness of the random forest algorithm, which helps ensure that the newly created decision trees do not overfit.
Step 104: rank, by classification performance, the decision trees contained in the classification model together with the newly generated decision trees, and select from them the predetermined number of top-ranked decision trees.
The classification model already contains the predetermined number of decision trees; for ease of description this predetermined number is denoted T. In step 103, another k new decision trees were generated from the incremental training sample set obtained, and these k new trees usually classify the samples in that training sample set comparatively correctly (with relatively high accuracy).
Considering that the training sample set is only the incremental data from a recent period, it is somewhat representative of the change in the data but generally not universal, so it would be inappropriate to simply replace the original T decision trees with the k newly created ones. On the other hand, if the k decision trees were merely added to the classification model (so the updated model contained T + k trees in total), the model would grow excessively large as the number of updates increases. The technical solution of the present application therefore adopts the following approach: from the original T decision trees of the classification model and the k newly created decision trees, select the T decision trees with the best classification performance. The T trees selected in this way can reflect the change in the incremental sample data, achieving the purpose of adjusting the classification model, while keeping the scale of the model unchanged.
Specifically, a test sample set could be used to evaluate the classification performance of the original decision trees of the classification model and of the newly built decision trees, and T decision trees could be screened out according to that performance. In this embodiment, since the index characterizing the classification performance of each new decision tree, namely the out-of-bag error, has already been calculated while building the new trees in step 103, this step can likewise calculate the out-of-bag error characterizing the classification performance of each original decision tree of the classification model, and rank and screen the above decision trees according to this index. The process of this step is further described below with reference to Fig. 4.
Step 104-1: for every decision tree contained in the classification model, calculate the out-of-bag error characterizing its classification performance.
First, the out-of-bag data of every newly created decision tree is aggregated to obtain an out-of-bag data set; then the samples in this out-of-bag data set are used as input to calculate the out-of-bag error of every decision tree contained in the classification model. The specific calculation is essentially the same as in step 103-3; refer to the description of step 103-3, which is not repeated here.
Step 104-2: rank, according to the out-of-bag error, the decision trees contained in the classification model together with the newly generated decision trees.
The T decision trees contained in the classification model and the k newly generated decision trees are ranked according to the out-of-bag error: decision trees with better classification performance (smaller out-of-bag error) are placed before those with relatively worse performance (larger out-of-bag error), producing an ordering in which the best-performing decision tree ranks highest and the worst-performing decision tree ranks lowest.
Step 104-3: select the predetermined number of top-ranked decision trees from the ranked decision trees.
This step is relatively simple: according to the ranking obtained in step 104-2, the top-ranked T decision trees are selected from the T + k decision trees, and the remaining k decision trees are discarded.
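A minimal sketch of steps 104-1 to 104-3, assuming the new trees arrive already paired with the OOB error computed in step 103-3 and the aggregated OOB pool is used to score the original trees (the layout and helper names are illustrative):

```python
def screen_trees(old_trees, new_trees, oob_pool, T):
    # old_trees: the T trees currently in the model; new_trees: list of (tree, oob_error) pairs.
    scored = [(oob_error(t, oob_pool), t) for t in old_trees]
    scored += [(err, t) for t, err in new_trees]
    scored.sort(key=lambda pair: pair[0])      # smaller OOB error ranks higher
    return [t for _, t in scored[:T]]          # keep the best T, drop the rest
```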
In a specific example of this embodiment, for a classification model in the Internet risk control field, the value of T is generally in the range of 200 to 400, and the number of decision trees newly created in each update of the model is generally in the range of 0 to 20. Each round of screening with the above steps is in fact a partial update of the decision trees in the model: a corresponding number of decision trees in the model are replaced, on the basis of classification performance, by all or some of the newly created decision trees.
Step 105: assemble the selected decision trees to obtain the updated classification model.
The T decision trees obtained by the screening in step 104 are assembled to obtain the updated classification model, which can continue to perform class prediction on large-scale data online. Because some of the decision trees in the updated model are generated from the incremental sample data obtained in step 101, the model has been adjusted in a timely manner, on its original basis, according to the latest sample data, which ensures that its classification performance always meets the preset requirements and, in general, does not degrade.
It should be noted that steps 101-105 above focus on how a classification model is updated with the method provided by the present application. In a specific implementation, if the classification model has not yet been created, the method provided by the present application can still be adopted; in that case the update process of the classification model is in fact the process of creating it from scratch.
Specifically, before step 102 determines the number of decision trees to be newly added, it is first judged whether a classification model has already been created. If it has, the update proceeds as described above; otherwise, the preset number of decision trees that the classification model is to contain is taken as the newly added number, that is, k = T is set directly, k decision trees are created according to step 103, and step 105 is then performed directly to assemble and output the k decision trees (that is, T decision trees), yielding the newly created classification model. Afterwards, the method of the present application can be adopted to update this classification model according to incremental data.
In a specific implementation, because the model update process usually involves testing and training on a relatively large amount of sample data, MapReduce can generally be adopted to further improve processing efficiency and achieve real-time or near-real-time updates.
For example, when step 102 verifies the classification model to determine the number of decision trees to be newly added, the MapReduce programming model can be adopted: in the Map phase, each Map task is responsible for one decision tree of the classification model and predicts the training sample set with it; in the Reduce phase, the results of the Map phase are aggregated to obtain the parameter value of the Poisson distribution, and the number of decision trees to be newly added is then determined. In the decision-tree generation process of step 103, the MapReduce programming model can also be adopted: in the Map phase, each Map task generates one decision tree from a bootstrap sample set, and in the Reduce phase all the decision trees are collected and screened.
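A purely conceptual sketch of the step-103 parallelization, using a local process pool and Python's map/reduce rather than an actual Hadoop MapReduce job; the tree builder is passed in as a hypothetical top-level function:

```python
from functools import reduce
from multiprocessing import Pool

def parallel_new_trees(bootstrap_sets, build_tree):
    # Map phase: each worker grows one decision tree from one bootstrap sample set.
    with Pool() as pool:
        mapped = pool.map(build_tree, bootstrap_sets)
    # Reduce phase: collect all generated trees; screening follows in step 104.
    return reduce(lambda trees, t: trees + [t], mapped, [])
```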
In summary, with the method for updating a classification model provided by the present application, the classification model does not need to be trained on the full data set; instead, the incremental data from a recent period is chosen as the training sample set, a certain number of decision trees are generated from it, and some of the original decision trees of the classification model are replaced according to classification performance, thereby achieving an incremental update of the classification model. In a specific application, the classification model can be updated at whatever time granularity is required, for example daily, hourly, or in near real time, which not only improves the efficiency of model training and enables rapid response to the business, but also avoids extra manual intervention during the service period of the model, reducing labor cost.
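Putting the pieces above together, a compact end-to-end sketch of one update round under the same illustrative assumptions; the helpers (forest_accuracy, lambda_from_accuracy, poisson_random, bootstrap_sample, oob_error, screen_trees) are those introduced in the earlier sketches, and build_tree is a hypothetical trainer, none of them identifiers from the patent:

```python
def update_model(forest, incremental_samples, build_tree, T):
    # Step 102: verify the current model and draw k from the Poisson distribution.
    accuracy = forest_accuracy(forest, incremental_samples)
    k = poisson_random(lambda_from_accuracy(accuracy))
    if k == 0:
        return forest
    # Step 103: grow k new trees from bootstrap sample sets, scoring each on its OOB data.
    new_trees, oob_pool = [], []
    for _ in range(k):
        in_bag, oob = bootstrap_sample(incremental_samples)
        tree = build_tree(in_bag)
        new_trees.append((tree, oob_error(tree, oob)))
        oob_pool.extend(oob)
    # Steps 104-105: keep the T best-performing trees among old and new.
    return screen_trees(forest, new_trees, oob_pool, T)
```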
In the above embodiment a method for updating a classification model was provided; correspondingly, the present application also provides a device for updating a classification model. Refer to Fig. 5, which is a schematic diagram of an embodiment of the device for updating a classification model of the present application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts refer to the description of the method embodiment. The device embodiment described below is merely schematic.
The device for updating a classification model of this embodiment comprises: a training sample set acquiring unit 501, configured to obtain incremental data within a predetermined time period from the historical data of the application using the classification model, as a training sample set; a newly added number determining unit 502, configured to determine the number of decision trees to be newly added; a decision tree creating unit 503, configured to generate the newly added number of decision trees from the training sample set using a random forest algorithm; a decision tree screening unit 504, configured to rank, by classification performance, the decision trees contained in the classification model together with the newly generated decision trees, and to select from them the predetermined number of top-ranked decision trees; and a classification model output unit 505, configured to assemble the selected decision trees to obtain the updated classification model.
Optionally, the newly added number determining unit is specifically configured to verify the classification model with the training sample set and to determine the number of decision trees to be newly added according to the verification result.
Optionally, the newly added number determining unit comprises:
a correctness verification subunit, configured to use each sample in the training sample set to verify the correctness of the classification model;
an accuracy calculation subunit, configured to calculate, according to the verification results, the accuracy with which the classification model classifies the training sample set;
a Poisson parameter determining subunit, configured to determine the parameter value of a Poisson distribution according to the accuracy such that the accuracy and the parameter value satisfy an inverse relation, the Poisson distribution being the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement;
a random number determining subunit, configured to determine, according to the parameter value of the Poisson distribution, a random number that follows the discrete probability distribution, and to take this random number as the number of decision trees to be newly added.
Optionally, the correctness verification subunit comprises:
a first loop control subunit, configured to trigger, for each sample in the training sample set, the following subunits in turn;
a class prediction subunit, configured to perform class prediction with the classification model according to the attribute information contained in a training sample;
a judgment subunit, configured to judge whether the predicted class is consistent with the actual class of the training sample, and if consistent, to determine that the classification model classifies the training sample correctly.
Optionally, the decision tree creating unit comprises:
a second loop control subunit, configured to judge whether the number of decision trees created has reached the newly added number, and if not, to trigger the following subunits in turn to create a new decision tree;
a bootstrap sampling subunit, configured to build a bootstrap sample set from the training sample set by sampling with replacement;
a creation execution subunit, configured to use the bootstrap sample set to generate a new decision tree by, at each node, choosing an attribute according to a predetermined policy and splitting according to the chosen attribute, and to trigger the second loop control subunit; choosing an attribute according to a predetermined policy means choosing it, according to the predetermined policy, from randomly selected sample attributes.
Optionally, the predetermined policy adopted by the creation execution subunit when choosing an attribute comprises: choosing the attribute according to information gain, according to information gain ratio, or according to the Gini index.
Optionally, the decision tree creating unit further includes:
a new-tree index computing subelement, configured to compute, after the creation executing subelement creates a new decision tree, an index characterizing the classification performance of the new decision tree;
correspondingly, the decision tree screening unit includes:
an original index computing subelement, configured to compute, for every decision tree contained in the classification model, an index characterizing its classification performance;
a sorting subelement, configured to sort the decision trees contained in the classification model and the newly generated decision trees according to the index;
a selecting subelement, configured to select, from the sorted decision trees, the predetermined number of decision trees ranked highest.
Optionally, the index computed by the new-tree index computing subelement is the out-of-bag error;
correspondingly, the original index computing subelement includes:
an out-of-bag data acquiring subelement, configured to aggregate the out-of-bag data of every new decision tree into an out-of-bag data set;
an error computing subelement, configured to use the out-of-bag data set to compute, for every decision tree contained in the classification model, the out-of-bag error characterizing its classification performance (a sketch of this screening step follows below).
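A possible realization of this screening step is sketched below; it assumes the per-tree out-of-bag masks returned by the grow_new_tree sketch above and, for simplicity, scores every candidate tree (old and new) on the pooled out-of-bag set before keeping the lowest-error trees.

import numpy as np

def oob_error(tree, X_oob, y_oob):
    """Fraction of the pooled out-of-bag samples the tree misclassifies."""
    if len(X_oob) == 0:
        return 0.0
    return float(np.mean(tree.predict(X_oob) != y_oob))

def screen_trees(old_trees, new_trees, oob_masks, X, y, keep):
    """Keep the `keep` best-performing trees among the old and new ones."""
    # Aggregate the out-of-bag data of every new tree into one data set.
    pooled = np.zeros(len(X), dtype=bool)
    for mask in oob_masks:
        pooled |= mask
    X_oob, y_oob = X[pooled], y[pooled]
    # Score every candidate tree and sort by ascending out-of-bag error.
    candidates = list(old_trees) + list(new_trees)
    errors = [oob_error(t, X_oob, y_oob) for t in candidates]
    order = sorted(range(len(candidates)), key=lambda i: errors[i])
    return [candidates[i] for i in order[:keep]]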
Optionally, the device includes:
a classification model judging subelement, configured to judge whether the classification model has already been created;
correspondingly, when the output of the classification model judging subelement is "No", the added-quantity determining unit is configured to use the preset number of decision trees that the classification model is to contain as the number of decision trees to be added;
correspondingly, after completing its operation, the decision tree creating unit directly triggers the classification model output unit, and the classification model output unit is specifically configured to collect the generated decision trees of the added quantity to obtain the updated classification model (a sketch tying the above steps together, including this first-creation path, follows below).
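As a final, non-authoritative sketch under the same assumptions, the driver below ties the earlier fragments (determine_added_tree_count, grow_new_tree, screen_trees) together and treats an empty tree list as the "model not yet created" case, in which the preset number of trees is grown and collected directly without screening. The ForestModel wrapper is a hypothetical helper that gives the tree list a predict method by majority vote.

import numpy as np
from collections import Counter

class ForestModel:
    """Minimal majority-vote wrapper so a list of trees exposes predict()."""
    def __init__(self, trees):
        self.trees = trees

    def predict(self, X):
        votes = np.array([t.predict(X) for t in self.trees])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

def update_model(trees, X_inc, y_inc, preset_size, rng=None):
    """Incrementally update (or initially build) the classification model.
    trees        -- decision trees currently in the model ([] if none yet)
    X_inc, y_inc -- incremental data of the recent period (training sample set)
    preset_size  -- predetermined number of trees the model should contain"""
    first_build = len(trees) == 0
    if first_build:
        # No model yet: grow the preset number of trees directly.
        n_new = preset_size
    else:
        # Verify the existing model and derive the number of trees to add.
        n_new, _ = determine_added_tree_count(ForestModel(trees), X_inc, y_inc, rng=rng)
    new_trees, oob_masks = [], []
    for _ in range(n_new):
        tree, oob_mask = grow_new_tree(X_inc, y_inc, rng=rng)
        new_trees.append(tree)
        oob_masks.append(oob_mask)
    if first_build:
        return new_trees                  # collect directly, no screening
    return screen_trees(trees, new_trees, oob_masks, X_inc, y_inc, keep=preset_size)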
Although the application is disclosed above with preferred embodiments, they are not intended to limit the application. Any person skilled in the art can make possible variations and modifications without departing from the spirit and scope of the application; therefore, the protection scope of the application shall be subject to the scope defined by the claims of the application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include volatile memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
Those skilled in the art should understand that embodiments of the application may be provided as a method, a system or a computer program product. Therefore, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Claims (18)

1. A method for updating a classification model, wherein the classification model consists of a predetermined number of decision trees and is used for class prediction according to user behavior data in a network application, the method being characterized by comprising:
obtaining, from a server providing the user behavior data, incremental data within a predetermined time period as a training sample set;
determining the number of decision trees to be added;
generating, according to the training sample set, the added number of decision trees with a random forest algorithm;
sorting the decision trees contained in the classification model and the newly generated decision trees according to classification performance, and selecting therefrom the predetermined number of decision trees ranked highest;
collecting the selected decision trees to obtain the updated classification model.
2. The method for updating a classification model according to claim 1, characterized in that determining the number of decision trees to be added means verifying the classification model with the training sample set and determining the number of decision trees to be added according to the verification result.
3. The method for updating a classification model according to claim 2, characterized in that verifying the classification model with the training sample set and determining the number of decision trees to be added according to the verification result comprises:
verifying the correctness of the classification model with each sample in the training sample set;
calculating, according to the verification result, the accuracy with which the classification model classifies the training sample set;
determining the parameter value of a Poisson distribution according to the accuracy, such that the accuracy and the parameter value of the Poisson distribution are inversely related, the Poisson distribution being the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement;
determining, according to the parameter value of the Poisson distribution, a random number that follows the discrete probability distribution, and using this random number as the number of decision trees to be added.
4. The method for updating a classification model according to claim 3, characterized in that verifying the correctness of the classification model with each sample in the training sample set comprises:
performing class prediction with the classification model according to the attribute information contained in a training sample;
judging whether the predicted class is consistent with the actual class of the training sample;
if they are consistent, determining that the classification model classifies the training sample correctly.
5. The method for updating a classification model according to any one of claims 1-4, characterized in that generating, according to the training sample set, the added number of decision trees with a random forest algorithm comprises:
building a bootstrap sample set from the training sample set by sampling with replacement;
using the bootstrap sample set to generate a new decision tree by choosing, at each node, an attribute according to a predetermined policy and splitting on the chosen attribute, wherein choosing an attribute according to a predetermined policy means choosing the attribute, according to the predetermined policy, from a randomly selected subset of sample attributes;
returning to the step of building a bootstrap sample set from the training sample set by sampling with replacement, and continuing until the added number of decision trees have been generated.
6. The method for updating a classification model according to claim 5, characterized in that choosing an attribute according to a predetermined policy comprises: choosing the attribute by information gain, by information gain ratio, or by Gini index.
7. The method for updating a classification model according to claim 5, characterized in that, after generating a new decision tree by choosing, at each node, an attribute according to a predetermined policy and splitting on the chosen attribute, the following operation is performed:
computing an index characterizing the classification performance of the new decision tree;
correspondingly, sorting the decision trees contained in the classification model and the newly generated decision trees according to classification performance, and selecting therefrom the predetermined number of decision trees ranked highest, comprises:
computing, for every decision tree contained in the classification model, an index characterizing its classification performance;
sorting the decision trees contained in the classification model and the newly generated decision trees according to the index;
selecting, from the sorted decision trees, the predetermined number of decision trees ranked highest.
8. The method for updating a classification model according to claim 7, characterized in that the index characterizing the classification performance of the new decision tree is the out-of-bag error;
correspondingly, computing, for every decision tree contained in the classification model, an index characterizing its classification performance comprises:
aggregating the out-of-bag data of every new decision tree into an out-of-bag data set;
using the out-of-bag data set to compute, for every decision tree contained in the classification model, the out-of-bag error characterizing its classification performance.
9. The method for updating a classification model according to claim 1, characterized in that, before the step of determining the number of decision trees to be added, the following operation is performed:
judging whether the classification model has already been created;
if not, determining the number of decision trees to be added means using the preset number of decision trees that the classification model is to contain as the number of decision trees to be added; correspondingly, after the added number of decision trees have been generated with the random forest algorithm, the step of collecting the selected decision trees to obtain the updated classification model is performed directly, with the generated decision trees of the added number serving as the selected decision trees.
10. A device for updating a classification model, characterized by comprising:
a training sample set acquiring unit, configured to obtain, from a server providing user behavior data, incremental data within a predetermined time period as a training sample set;
an added-quantity determining unit, configured to determine the number of decision trees to be added;
a decision tree creating unit, configured to generate, according to the training sample set, the added number of decision trees with a random forest algorithm;
a decision tree screening unit, configured to sort the decision trees contained in the classification model and the newly generated decision trees according to classification performance, and to select therefrom the predetermined number of decision trees ranked highest;
a classification model output unit, configured to collect the selected decision trees to obtain the updated classification model.
11. The device for updating a classification model according to claim 10, characterized in that the added-quantity determining unit is specifically configured to verify the classification model with the training sample set and to determine the number of decision trees to be added according to the verification result.
12. The device for updating a classification model according to claim 11, characterized in that the added-quantity determining unit includes:
a correctness verification subelement, configured to verify the correctness of the classification model with each sample in the training sample set;
an accuracy computing subelement, configured to calculate, according to the verification result, the accuracy with which the classification model classifies the training sample set;
a Poisson distribution parameter determining subelement, configured to determine the parameter value of a Poisson distribution according to the accuracy, such that the accuracy and the parameter value of the Poisson distribution are inversely related, the Poisson distribution being the discrete probability distribution followed by a new sample set obtained by sampling the training sample set with replacement;
a random number determining subelement, configured to determine, according to the parameter value of the Poisson distribution, a random number that follows the discrete probability distribution, and to use this random number as the number of decision trees to be added.
13. The device for updating a classification model according to claim 12, characterized in that the correctness verification subelement includes:
a first loop control subelement, configured to trigger, for each sample in the training sample set, the following subelements in turn;
a class prediction subelement, configured to perform class prediction with the classification model according to the attribute information contained in a training sample;
a judging subelement, configured to judge whether the predicted class is consistent with the actual class of the training sample, and, if they are consistent, to determine that the classification model classifies the training sample correctly.
14. The device for updating a classification model according to any one of claims 10-13, characterized in that the decision tree creating unit includes:
a second loop control subelement, configured to judge whether the number of decision trees already created has reached the number to be added, and if not, to trigger the following subelements in turn to create a new decision tree;
a bootstrap sampling subelement, configured to build a bootstrap sample set from the training sample set by sampling with replacement;
a creation executing subelement, configured to use the bootstrap sample set to generate a new decision tree by choosing, at each node, an attribute according to a predetermined policy and splitting on the chosen attribute, and to trigger the second loop control subelement, wherein choosing an attribute according to a predetermined policy means choosing the attribute, according to the predetermined policy, from a randomly selected subset of sample attributes.
15. The device for updating a classification model according to claim 14, characterized in that the predetermined policy adopted by the creation executing subelement when choosing an attribute includes: choosing the attribute by information gain, by information gain ratio, or by Gini index.
16. The device for updating a classification model according to claim 14, characterized in that the decision tree creating unit further includes:
a new-tree index computing subelement, configured to compute, after the creation executing subelement creates a new decision tree, an index characterizing the classification performance of the new decision tree;
correspondingly, the decision tree screening unit includes:
an original index computing subelement, configured to compute, for every decision tree contained in the classification model, an index characterizing its classification performance;
a sorting subelement, configured to sort the decision trees contained in the classification model and the newly generated decision trees according to the index;
a selecting subelement, configured to select, from the sorted decision trees, the predetermined number of decision trees ranked highest.
17. The device for updating a classification model according to claim 16, characterized in that the index computed by the new-tree index computing subelement is the out-of-bag error;
correspondingly, the original index computing subelement includes:
an out-of-bag data acquiring subelement, configured to aggregate the out-of-bag data of every new decision tree into an out-of-bag data set;
an error computing subelement, configured to use the out-of-bag data set to compute, for every decision tree contained in the classification model, the out-of-bag error characterizing its classification performance.
18. The device for updating a classification model according to claim 10, characterized in that the device includes:
a classification model judging subelement, configured to judge whether the classification model has already been created;
correspondingly, when the output of the classification model judging subelement is "No", the added-quantity determining unit is configured to use the preset number of decision trees that the classification model is to contain as the number of decision trees to be added;
correspondingly, after completing its operation, the decision tree creating unit directly triggers the classification model output unit, and the classification model output unit is specifically configured to collect the generated decision trees of the added quantity to obtain the updated classification model.
CN201410737856.7A 2014-12-04 2014-12-04 Method and device for updating classifying model Pending CN105718490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410737856.7A CN105718490A (en) 2014-12-04 2014-12-04 Method and device for updating classifying model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410737856.7A CN105718490A (en) 2014-12-04 2014-12-04 Method and device for updating classifying model

Publications (1)

Publication Number Publication Date
CN105718490A true CN105718490A (en) 2016-06-29

Family

ID=56143916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410737856.7A Pending CN105718490A (en) 2014-12-04 2014-12-04 Method and device for updating classifying model

Country Status (1)

Country Link
CN (1) CN105718490A (en)

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296282A (en) * 2016-08-08 2017-01-04 南京大学 A kind of net purchase Product evaluation method marked based on user comment and history
CN106255116A (en) * 2016-08-24 2016-12-21 王瀚辰 A kind of recognition methods harassing number
CN106339593A (en) * 2016-08-31 2017-01-18 青岛睿帮信息技术有限公司 Kawasaki disease classification and prediction method based on medical data modeling
CN106339593B (en) * 2016-08-31 2023-04-18 北京万灵盘古科技有限公司 Kawasaki disease classification prediction method based on medical data modeling
CN106845537A (en) * 2017-01-09 2017-06-13 北京邮电大学 A kind of grader radius based on adaptive threshold determines method and device
CN106845537B (en) * 2017-01-09 2020-12-04 北京邮电大学 Classifier radius determination method and device based on self-adaptive threshold
CN106874574B (en) * 2017-01-22 2019-10-29 清华大学 Mobile application performance bottleneck analysis method and device based on decision tree
CN106874574A (en) * 2017-01-22 2017-06-20 清华大学 Mobile solution performance bottleneck analysis method and device based on decision tree
CN107632995B (en) * 2017-03-13 2018-09-11 平安科技(深圳)有限公司 The method and model training control system of Random Forest model training
CN107632995A (en) * 2017-03-13 2018-01-26 平安科技(深圳)有限公司 The method and model training control system of Random Forest model training
CN107368892A (en) * 2017-06-07 2017-11-21 无锡小天鹅股份有限公司 Model training method and device based on machine learning
CN107368892B (en) * 2017-06-07 2020-06-16 无锡小天鹅电器有限公司 Model training method and device based on machine learning
CN107132268A (en) * 2017-06-21 2017-09-05 佛山科学技术学院 A kind of data processing equipment and system for being used to recognize cancerous lung tissue
CN107132266A (en) * 2017-06-21 2017-09-05 佛山科学技术学院 A kind of Classification of water Qualities method and system based on random forest
CN107203866A (en) * 2017-06-26 2017-09-26 北京京东尚科信息技术有限公司 The processing method and device of order
CN107203866B (en) * 2017-06-26 2021-02-26 北京京东尚科信息技术有限公司 Order processing method and device
CN107330464A (en) * 2017-06-30 2017-11-07 众安信息技术服务有限公司 Data processing method and device
WO2019001359A1 (en) * 2017-06-30 2019-01-03 众安信息技术服务有限公司 Data processing method and data processing apparatus
CN109218211B (en) * 2017-07-06 2022-04-19 创新先进技术有限公司 Method, device and equipment for adjusting threshold value in control strategy of data stream
CN109218211A (en) * 2017-07-06 2019-01-15 阿里巴巴集团控股有限公司 The method of adjustment of threshold value, device and equipment in the control strategy of data flow
WO2019041773A1 (en) * 2017-08-29 2019-03-07 平安科技(深圳)有限公司 Apparatus and method for updating prediction model, and computer-readable storage medium
CN107894827A (en) * 2017-10-31 2018-04-10 广东欧珀移动通信有限公司 Using method for cleaning, device, storage medium and electronic equipment
CN107818344A (en) * 2017-10-31 2018-03-20 上海壹账通金融科技有限公司 The method and system that user behavior is classified and predicted
CN107818344B (en) * 2017-10-31 2020-01-07 深圳壹账通智能科技有限公司 Method and system for classifying and predicting user behaviors
CN108206046B (en) * 2017-12-28 2021-07-02 新华三大数据技术有限公司 Data processing method and device
CN108206046A (en) * 2017-12-28 2018-06-26 新华三大数据技术有限公司 A kind of data processing method and device
CN108418851A (en) * 2018-01-12 2018-08-17 阿里巴巴集团控股有限公司 Policy issue system, method, apparatus and equipment
CN108418851B (en) * 2018-01-12 2020-12-04 创新先进技术有限公司 Policy issuing system, method, device and equipment
WO2019165673A1 (en) * 2018-02-27 2019-09-06 平安科技(深圳)有限公司 Reimbursement form risk prediction method, apparatus, terminal device, and storage medium
CN108717548A (en) * 2018-04-10 2018-10-30 中国科学院计算技术研究所 A kind of increased Activity recognition model update method of facing sensing device dynamic and system
CN108805416A (en) * 2018-05-22 2018-11-13 阿里巴巴集团控股有限公司 A kind of risk prevention system processing method, device and equipment
CN109063722A (en) * 2018-06-08 2018-12-21 中国科学院计算技术研究所 A kind of Activity recognition method and system based on chance perception
CN109033154A (en) * 2018-06-12 2018-12-18 佛山欧神诺陶瓷有限公司 A kind of category management method
CN110688273A (en) * 2018-07-05 2020-01-14 马上消费金融股份有限公司 Classification model monitoring method and device, terminal and computer storage medium
CN110688273B (en) * 2018-07-05 2021-02-19 马上消费金融股份有限公司 Classification model monitoring method and device, terminal and computer storage medium
CN109101562A (en) * 2018-07-13 2018-12-28 中国平安人寿保险股份有限公司 Find method, apparatus, computer equipment and the storage medium of target group
CN109101562B (en) * 2018-07-13 2023-07-21 中国平安人寿保险股份有限公司 Method, device, computer equipment and storage medium for searching target group
CN110888668B (en) * 2018-09-07 2024-04-16 腾讯科技(北京)有限公司 Model updating system, method, device, terminal equipment and medium
CN110888668A (en) * 2018-09-07 2020-03-17 腾讯科技(北京)有限公司 System, method and device for updating model, terminal equipment and medium
CN109325625A (en) * 2018-09-28 2019-02-12 成都信息工程大学 A kind of bicycle quantitative forecasting technique based on binary Gauss nonhomogeneous Poisson process
CN109325625B (en) * 2018-09-28 2019-12-17 成都信息工程大学 Bicycle quantity prediction method based on binary Gaussian heterogeneous poisson process
CN111259273A (en) * 2018-11-30 2020-06-09 顺丰科技有限公司 Webpage classification model construction method, classification method and device
WO2020125477A1 (en) * 2018-12-18 2020-06-25 北京数安鑫云信息技术有限公司 Method and apparatus for improving crawler identification recall rate, and medium and device
CN111343127A (en) * 2018-12-18 2020-06-26 北京数安鑫云信息技术有限公司 Method, device, medium and equipment for improving crawler recognition recall rate
CN111343127B (en) * 2018-12-18 2021-03-16 北京数安鑫云信息技术有限公司 Method, device, medium and equipment for improving crawler recognition recall rate
CN110033276A (en) * 2019-03-08 2019-07-19 阿里巴巴集团控股有限公司 It is a kind of for security strategy generation method, device and the equipment transferred accounts
CN112000872A (en) * 2019-05-27 2020-11-27 北京地平线机器人技术研发有限公司 Recommendation method based on user vector, training method of model and training device
CN110321945A (en) * 2019-06-21 2019-10-11 深圳前海微众银行股份有限公司 Exptended sample method, terminal, device and readable storage medium storing program for executing
CN110377828A (en) * 2019-07-22 2019-10-25 腾讯科技(深圳)有限公司 Information recommendation method, device, server and storage medium
CN110377828B (en) * 2019-07-22 2023-05-26 腾讯科技(深圳)有限公司 Information recommendation method, device, server and storage medium
CN110766071A (en) * 2019-10-21 2020-02-07 北京工业大学 Brain network data enhancement method based on forest self-encoder
CN110766071B (en) * 2019-10-21 2023-04-28 北京工业大学 Brain network data enhancement method based on forest self-encoder
WO2021114676A1 (en) * 2019-12-13 2021-06-17 浪潮电子信息产业股份有限公司 Method, apparatus, and device for updating hard disk prediction model, and medium
CN111309706A (en) * 2020-01-20 2020-06-19 北京明略软件系统有限公司 Model training method and device, readable storage medium and electronic equipment
CN111353600A (en) * 2020-02-20 2020-06-30 第四范式(北京)技术有限公司 Abnormal behavior detection method and device
CN111353600B (en) * 2020-02-20 2023-12-12 第四范式(北京)技术有限公司 Abnormal behavior detection method and device
CN111428804A (en) * 2020-04-01 2020-07-17 广东电网有限责任公司 Random forest electricity stealing user detection method with optimized weighting
CN112598234A (en) * 2020-12-14 2021-04-02 广东电网有限责任公司广州供电局 Low-voltage transformer area line loss abnormity analysis method, device and equipment
CN115168577A (en) * 2022-06-30 2022-10-11 北京百度网讯科技有限公司 Model updating method and device, electronic equipment and storage medium
CN115168577B (en) * 2022-06-30 2023-03-21 北京百度网讯科技有限公司 Model updating method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105718490A (en) Method and device for updating classifying model
CN106156809A (en) For updating the method and device of disaggregated model
US9984336B2 (en) Classification rule sets creation and application to decision making
CN104679743A (en) Method and device for determining preference model of user
CN111815432B (en) Financial service risk prediction method and device
CN111143685A (en) Recommendation system construction method and device
CN112288455A (en) Label generation method and device, computer readable storage medium and electronic equipment
CN110310114A (en) Object classification method, device, server and storage medium
CN111199469A (en) User payment model generation method and device and electronic equipment
CN113342976A (en) Method, device, storage medium and equipment for automatically acquiring and processing data
CN110706015A (en) Advertisement click rate prediction oriented feature selection method
CN114969528A (en) User portrait and learning path recommendation method, device and equipment based on capability evaluation
CN116911994B (en) External trade risk early warning system
CN109977977A (en) A kind of method and corresponding intrument identifying potential user
CN106874286B (en) Method and device for screening user characteristics
CN107092599B (en) Method and equipment for providing knowledge information for user
CN113869973A (en) Product recommendation method, product recommendation system, and computer-readable storage medium
CN113469819A (en) Recommendation method of fund product, related device and computer storage medium
CN108241643A (en) The achievement data analysis method and device of keyword
KAMLEY et al. Multiple regression: A data mining approach for predicting the stock market trends based on open, close and high price of the month
Wirawan et al. Application of data mining to prediction of timeliness graduation of students (a case study)
CN112150276A (en) Training method, using method, device and equipment of machine learning model
CN111612626A (en) Method and device for preprocessing bond evaluation data
CN112232944B (en) Method and device for creating scoring card and electronic equipment
Price et al. Making monitoring manageable: a framework to guide learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20200922

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200922

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: Fourth Floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160629