CN110348580A - Method and apparatus for constructing a GBDT model, and prediction method and apparatus - Google Patents

Method and apparatus for constructing a GBDT model, and prediction method and apparatus

Info

Publication number
CN110348580A
CN110348580A
Authority
CN
China
Prior art keywords: training, data, sample, sample data, positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910526406.6A
Other languages
Chinese (zh)
Other versions
CN110348580B (en)
Inventor
王海
涂威威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202210493503.1A (CN114819186A)
Priority to CN201910526406.6A (CN110348580B)
Publication of CN110348580A
Application granted
Publication of CN110348580B
Status: Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning

Abstract

The invention discloses a method and apparatus for constructing a gradient boosting decision tree (GBDT) model, relating to the field of machine learning technology. Its main purpose is to address the low accuracy of decision tree models trained by existing approaches. The main technical solution of the present invention is as follows: obtain a sample data set that includes positive sample data with positive labels and unlabeled sample data without labels; when training each regression tree of the GBDT model, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on that training set, and then construct the gradient boosting decision tree (GBDT) model from the trained regression trees. The present invention is applied in the process of constructing gradient boosting decision trees.

Description

Method and apparatus for constructing a GBDT model, and prediction method and apparatus
Technical field
The present invention relates to the field of machine learning technology, and more particularly to a method and apparatus for constructing a gradient boosting decision tree (GBDT) model, and to a method and apparatus for making predictions with such a model.
Background art
With the continuous progress of technology, artificial intelligence has developed steadily. Machine learning, an inevitable outcome of artificial intelligence research reaching a certain stage, is devoted to improving the performance of a system from experience by computational means. In a computer system, "experience" usually exists in the form of "data", and a machine learning algorithm can generate a "model" from data; that is, when empirical data are supplied to a machine learning algorithm, a model can be generated from them, and when a new situation is encountered, the model provides a corresponding judgment, i.e., a prediction result. Whether a machine learning model is being trained or a trained machine learning model is being used for prediction, the data must be converted into machine learning samples that include various features.
Currently, in practical applications, data are relatively easy to acquire, whereas labeling data requires considerable human and material resources. A data set therefore often contains only a small amount of labeled data, recorded as positive samples, together with a large amount of unlabeled data. In this case, positive and unlabeled learning (PU learning) is usually combined with a gradient boosting decision tree algorithm to train a decision tree model, for example by selecting the GBDT algorithm to train a gradient boosting decision tree (GBDT) model from the corresponding sample data.
However, in practice, when a decision tree model is learned by PU-based training, the labeled "positive samples" in the sample data are few and most of the data are unlabeled. Training a gradient boosting decision tree model is therefore highly prone to "overfitting", where overfitting refers to the phenomenon of making the hypothesis overly strict in order to obtain a consistent hypothesis, so that the accuracy of decision tree models trained in the existing way is low.
Summary of the invention
In view of the above problems, the present invention proposes a method and apparatus for constructing a gradient boosting decision tree (GBDT) model, whose main purpose is to address the low accuracy of decision tree models trained by existing approaches and to improve the accuracy of the trained model.
In order to achieve the above objectives, the present invention mainly provides the following technical solutions:
In one aspect, the present invention provides a method for constructing a gradient boosting decision tree (GBDT) model, which specifically includes:
obtaining a sample data set, the sample data set including positive sample data with positive labels and unlabeled sample data without labels;
when training each regression tree of the GBDT model, constructing a positive sample training subset based on the positive sample data in the sample data set, sampling the unlabeled sample data in the sample data set to construct a negative sample training subset, combining the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, training the current regression tree based on the training set of the current regression tree, and then constructing the gradient boosting decision tree (GBDT) model from the trained regression trees.
Optionally, constructing a positive sample training subset based on the positive sample data in the sample data set includes:
constructing a positive sample training subset from all positive sample data in the sample data set;
alternatively,
constructing a positive sample training subset from part of the positive sample data in the sample data set.
Optionally, in a business scenario where the positive-negative sample ratio is known and the estimated ratio of negative sample data volume to positive sample data volume is x, the data volume of the negative sample training subset is set to x times the data volume of the positive sample training subset; in a business scenario where the positive-negative sample ratio is unknown, the data volume of the negative sample training subset is set to 1 to 2 times the data volume of the positive sample training subset.
Optionally, training the current regression tree based on the training set includes:
performing iterative training with the training set of the current regression tree and a preset GBDT algorithm to obtain the regression tree corresponding to each iteration of training.
Optionally, performing iterative training with the training set of the current regression tree and the preset GBDT algorithm to obtain the regression tree corresponding to each iteration of training includes:
obtaining a first training set from the sample data set, and training a first regression tree according to the first training set, the preset GBDT algorithm, and a first parameter, where the first parameter is the mean of the actual results of all sample data in the sample data set;
after the first regression tree has been trained, selecting a second training set from the sample data set, and training a second regression tree according to the second training set, the preset GBDT algorithm, and a second parameter, where the second parameter is determined from the prediction results obtained for the sample data in the second training set by the first regression tree and from the actual results of the sample data in the second training set; the first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the negative sample training subset contained in the first training set differs from that contained in the second training set.
Optionally, before the second training set is selected from the sample data set and the second regression tree is trained according to the second training set, the preset GBDT algorithm, and the second parameter, the method further includes:
predicting the second training set with the first regression tree to obtain the prediction results corresponding to the second training set;
determining, from the actual results of the second training set and the corresponding prediction results, the residual between the actual results and the prediction results, and taking the residual as the second parameter.
Optionally, the method further includes:
constructing a plurality of training sets based on the sample data set;
selecting from a set of machine learning algorithms, a set of hyperparameter combinations, and the plurality of training sets respectively, and training to obtain a plurality of candidate models, wherein one machine learning algorithm, one group of hyperparameters, and one training set determine one candidate model;
evaluating the at least one GBDT model and the plurality of candidate models respectively, and selecting a plurality of models that satisfy a preset condition;
integrating the plurality of models that satisfy the preset condition to obtain an integrated composite model.
Optionally, before the at least one GBDT model and the plurality of candidate models are evaluated respectively, the method further includes:
sampling the positive sample data in the sample data set to construct a positive sample evaluation subset, sampling the unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combining the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set;
evaluating the at least one GBDT model and the plurality of candidate models respectively and selecting a plurality of models that satisfy the preset condition includes:
evaluating the at least one GBDT model and the plurality of candidate models respectively according to the evaluation set, obtaining an evaluation result corresponding to each model, and selecting from the evaluation results a plurality of models that satisfy the preset condition.
Optionally, the sample data set includes: target object data in a target object recommendation scenario, transaction data of fraudulently used bank cards in a bank card leak point detection scenario, picture/text data in a picture/text classification scenario, or traffic data in a malicious traffic detection scenario;
wherein, when the sample data set is target object data, the target object data that have been recommended are positive sample data, and the target object data that have not been recommended are negative sample data; when the sample data set is transaction data of fraudulently used bank cards, the transaction data of known leak points are positive sample data, and the transaction data of unknown leak points are negative sample data; when the sample data set is picture/text data, the classified picture/text data are positive sample data, and the unclassified picture/text data are negative sample data; when the sample data set is traffic data, the known malicious traffic data are positive sample data, and the unknown traffic data are negative sample data.
In another aspect, the present invention further provides an apparatus for constructing a gradient boosting decision tree (GBDT) model, the apparatus including:
an acquiring unit configured to obtain a sample data set, the sample data set including positive sample data with positive labels and unlabeled sample data without labels;
a construction unit configured to, when training each regression tree of the GBDT model, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on the training set of the current regression tree, and then construct the gradient boosting decision tree (GBDT) model from the trained regression trees.
Optionally, the construction unit includes:
a first building module configured to construct a positive sample training subset from all positive sample data in the sample data set, or to construct a positive sample training subset from part of the positive sample data in the sample data set.
Optionally, the construction unit includes:
a second building module configured to, in a business scenario where the positive-negative sample ratio is known and the estimated ratio of negative sample data volume to positive sample data volume is x, set the data volume of the negative sample training subset to x times the data volume of the positive sample training subset, and, in a business scenario where the positive-negative sample ratio is unknown, set the data volume of the negative sample training subset to 1 to 2 times the data volume of the positive sample training subset.
Optionally, the construction unit includes:
a training module configured to perform iterative training with the training set of the current regression tree and a preset GBDT algorithm to obtain the regression tree corresponding to each iteration of training.
Optionally, the training module includes:
a first training submodule configured to obtain a first training set from the sample data set and to train a first regression tree according to the first training set, the preset GBDT algorithm, and a first parameter, where the first parameter is the mean of the actual results of all sample data in the sample data set;
a second training submodule configured to, after the first regression tree has been trained, select a second training set from the sample data set and train a second regression tree according to the second training set, the preset GBDT algorithm, and a second parameter, where the second parameter is determined from the prediction results obtained for the sample data in the second training set by the first regression tree and from the actual results of the sample data in the second training set; the first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the negative sample training subsets contained in the first training set and the second training set are different.
Optionally, the training module further includes:
a prediction submodule configured to predict the second training set with the first regression tree to obtain the prediction results corresponding to the second training set;
a determination submodule configured to determine, from the actual results of the second training set and the corresponding prediction results, the residual between the actual results and the prediction results, and to take the residual as the second parameter.
Optionally, the apparatus further includes:
a training set construction unit configured to construct a plurality of training sets based on the sample data set;
a training unit configured to select from a set of machine learning algorithms, a set of hyperparameter combinations, and the plurality of training sets respectively, and to train a plurality of candidate models, wherein one machine learning algorithm, one group of hyperparameters, and one training set determine one candidate model;
an evaluation unit configured to evaluate the at least one GBDT model and the plurality of candidate models respectively and to select a plurality of models that satisfy a preset condition;
an integration unit configured to integrate the plurality of models that satisfy the preset condition to obtain an integrated composite model.
Optionally, the apparatus further includes:
an evaluation set construction unit configured to sample the positive sample data in the sample data set to construct a positive sample evaluation subset, to sample the unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and to combine the positive sample evaluation subset and the negative sample evaluation subset into an evaluation set;
the evaluation unit is specifically configured to evaluate the at least one GBDT model and the plurality of candidate models respectively according to the evaluation set, to obtain an evaluation result corresponding to each model, and to select from the evaluation results a plurality of models that satisfy the preset condition.
Optionally, the sample data set includes: target object data in a target object recommendation scenario, transaction data of fraudulently used bank cards in a bank card leak point detection scenario, picture/text data in a picture/text classification scenario, or traffic data in a malicious traffic detection scenario;
wherein, when the sample data set is target object data, the target object data that have been recommended are positive sample data, and the target object data that have not been recommended are negative sample data; when the sample data set is transaction data of fraudulently used bank cards, the transaction data of known leak points are positive sample data, and the transaction data of unknown leak points are negative sample data; when the sample data set is picture/text data, the classified picture/text data are positive sample data, and the unclassified picture/text data are negative sample data; when the sample data set is traffic data, the known malicious traffic data are positive sample data, and the unknown traffic data are negative sample data.
In another aspect, the present invention provides a method for implementing target object recommendation, including:
obtaining target object data to be predicted;
obtaining the gradient boosting decision tree (GBDT) model according to the method of any one of the first aspect;
performing a target object recommendation task with the obtained gradient boosting decision tree (GBDT) model;
wherein the target object is a commodity or service provided via the Internet.
In another aspect, the present invention provides a method for implementing bank card leak point detection, including:
obtaining transaction data of fraudulently used bank cards whose leak points are to be detected;
obtaining a gradient boosting decision tree (GBDT) model according to the method of any one of the first aspect;
performing a bank card leak point detection task with the obtained gradient boosting decision tree (GBDT) model;
wherein the transaction data set includes, for each transaction of each bank card, the transaction time and the identifier of the terminal device used in the transaction.
In another aspect, an embodiment of the present invention provides a method for implementing picture/text classification, including:
obtaining picture/text data to be predicted;
obtaining a gradient boosting decision tree (GBDT) model according to the method of any one of the first aspect;
performing a picture/text classification task with the obtained gradient boosting decision tree (GBDT) model.
In another aspect, an embodiment of the present invention further provides a method for malicious traffic detection, including:
obtaining traffic data to be detected;
obtaining a gradient boosting decision tree (GBDT) model according to the method of any one of the first aspect;
performing a detection task on the traffic data to be detected with the obtained gradient boosting decision tree (GBDT) model.
In another aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more computing devices, implements any one of the methods described above.
In another aspect, the present invention provides a system including one or more computing devices and one or more storage devices, the one or more storage devices recording a computer program that, when executed by the one or more computing devices, causes the one or more computing devices to implement any one of the methods described above.
Through the above technical solutions, the method and apparatus for constructing a gradient boosting decision tree (GBDT) model provided by the present invention can obtain a sample data set and, when training each regression tree of the GBDT model, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on that training set, and then construct the gradient boosting decision tree (GBDT) model from the trained regression trees. Compared with the prior art, the present invention trains each regression tree of the GBDT model with a positive sample training subset and a negative sample training subset drawn from the sample data set. Because the training set of each regression tree is extracted separately from the sample data set, greater diversity among the trees can be guaranteed, the overfitting caused by existing training methods is avoided, and the accuracy of the trained gradient boosting decision tree (GBDT) model is improved.
The above description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the contents of the specification, and in order that the above and other objects, features, and advantages of the present invention may be more readily apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art by reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for constructing a gradient boosting decision tree (GBDT) model proposed by an embodiment of the present invention;
Fig. 2 shows a flowchart of another method for constructing a gradient boosting decision tree (GBDT) model proposed by an embodiment of the present invention;
Fig. 3 shows a block diagram of an apparatus for constructing a gradient boosting decision tree (GBDT) model proposed by an embodiment of the present invention;
Fig. 4 shows a block diagram of another apparatus for constructing a gradient boosting decision tree (GBDT) model proposed by an embodiment of the present invention;
Fig. 5 shows a block diagram of a system for implementing target object recommendation proposed by an embodiment of the present invention;
Fig. 6 shows a block diagram of a system for implementing bank card leak point detection proposed by an embodiment of the present invention;
Fig. 7 shows a block diagram of a system for implementing picture/text classification proposed by an embodiment of the present invention;
Fig. 8 shows a block diagram of a system for implementing malicious traffic detection proposed by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present invention will be thoroughly understood and the scope of the present invention will be fully conveyed to those skilled in the art.
An embodiment of the present invention provides a method for constructing a gradient boosting decision tree (GBDT) model. The method can be applied in scenarios such as bank card leak point detection, recommendation of commodities and services, picture or text classification, and malicious traffic detection, where the labeled positive sample data are only a small fraction and most of the sample data are unlabeled. The method described in the embodiments of the present invention is intended to construct a gradient boosting decision tree (GBDT) model of higher accuracy, thereby solving the problem that models trained in the existing way have low accuracy because of overfitting. The specific steps of the method are shown in Fig. 1 and include:
101. Obtain a sample data set.
The sample data set includes positive sample data with positive labels and unlabeled sample data without labels. In many practical situations, data are readily available, but labeling them requires considerable human and material resources. For example, during malicious traffic detection, bank leak point detection, music recommendation, and similar tasks, often only a small amount of positive sample data (known malicious traffic, leak points, music the user likes) and a large amount of unlabeled sample data can be obtained. In this situation, PU learning is usually chosen for model training, so that data from the above scenarios can be analyzed and automatically classified with the trained model.
PU learning (Positive and Unlabeled learning) refers to training a classification model when only positive sample data and unlabeled sample data are available. Previous research usually selects negative-class samples from the unlabeled samples to train a classifier; however, when only positive sample data are available, both the model and its parameters are difficult to select reliably.
On this basis, and in view of the problems in the prior art, in the embodiments of the present invention a sample data set is first obtained according to this step, where the sample data set includes positive sample data with positive labels and unlabeled sample data without labels. For example, in practical applications, when the task to be performed is malicious traffic detection, the positive sample data in the sample data set can be understood as known malicious traffic data, and the unlabeled sample data can be understood as traffic data that have not yet been examined.
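Purely for illustration (not part of the patent text), the following minimal Python sketch shows one way such a PU-style sample data set might be represented, assuming a simple feature-matrix form; all names, shapes, and values are hypothetical.

import numpy as np

# Hypothetical PU data set: a feature matrix plus a flag that is True for the few
# labeled-positive rows (e.g., known malicious traffic) and False for unlabeled rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))          # 10,000 samples, 8 features
is_positive = np.zeros(10_000, dtype=bool)
is_positive[:300] = True                  # only a small fraction carries a positive label

positive_data = X[is_positive]            # positive sample data (with positive labels)
unlabeled_data = X[~is_positive]          # unlabeled sample data (no labels)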
102. When training each regression tree of the GBDT model, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on the training set of the current regression tree, and then construct the gradient boosting decision tree (GBDT) model from the trained regression trees.
GBDT (Gradient Boosting Decision Tree), also called MART (Multiple Additive Regression Tree), is an iterative decision tree algorithm. The algorithm consists of multiple decision trees, and the conclusions of all the trees are accumulated to give the final prediction result. When it was first proposed, the algorithm, together with SVM, was regarded as an algorithm with strong generalization ability, and it has attracted further attention in recent years because additive machine learning models of this kind are used in search ranking.
Therefore, based on the characteristics of the GBDT algorithm, when training the gradient boosting decision tree (GBDT) model in the embodiments of the present invention, each regression tree is trained with its own training set. Specifically, training based on an existing decision tree model often suffers from overfitting because the number of samples used in training is small. Here, for each regression tree to be trained, a training set corresponding to the current regression tree can first be constructed separately from the positive sample data and the negative sample data in the sample data set: a positive sample training subset is constructed from the positive sample data, a sampling operation is performed on the unlabeled sample data to obtain a corresponding negative sample training subset, and the positive sample training subset and the negative sample training subset are combined into the training set needed by the regression tree to be trained, i.e., the training set of the current regression tree.
After the training set of the current regression tree has been determined, the corresponding tree can be trained with the GBDT algorithm. Iterative training is performed in this way in turn to obtain all the regression trees corresponding to the sample data set, and the above regression trees are combined to obtain the gradient boosting decision tree (GBDT) model corresponding to the sample data set.
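The per-tree training-set construction and tree-by-tree training described above can be sketched as follows. This is an illustrative sketch only, not the patent's reference implementation: it assumes scikit-learn regression trees as base learners and squared-error residuals, and every function and parameter name here is hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def build_tree_training_set(positive_data, unlabeled_data, neg_ratio=1.5, rng=None):
    # Positive subset: all positives; negative subset: a fresh sample of unlabeled
    # data, 1x-2x the positive volume, treated as negatives for this tree only.
    rng = rng or np.random.default_rng()
    n_neg = int(len(positive_data) * neg_ratio)
    idx = rng.choice(len(unlabeled_data), size=n_neg, replace=False)
    X = np.vstack([positive_data, unlabeled_data[idx]])
    y = np.concatenate([np.ones(len(positive_data)), np.zeros(n_neg)])
    return X, y

def fit_gbdt(positive_data, unlabeled_data, n_trees=50, learning_rate=0.1):
    rng = np.random.default_rng(42)
    trees, base = [], None
    for _ in range(n_trees):
        # Each regression tree gets its own training set, which keeps the trees diverse.
        X, y = build_tree_training_set(positive_data, unlabeled_data, rng=rng)
        if base is None:
            base = float(y.mean())             # first parameter: mean of the actual results
        current = base + sum(learning_rate * t.predict(X) for t in trees)
        residual = y - current                 # second parameter: residual of earlier trees
        trees.append(DecisionTreeRegressor(max_depth=3).fit(X, residual))
    return base, trees

def predict_scores(base, trees, X, learning_rate=0.1):
    return base + sum(learning_rate * t.predict(X) for t in trees)

The key design point reflected here is that resampling the negatives anew for every tree is what introduces diversity among the trees and counteracts overfitting.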
In order to better explain the method for constructing a gradient boosting decision tree (GBDT) model provided by the embodiments of the present invention, another embodiment is provided here to refine and extend the steps of the above embodiment. Specifically, as shown in Fig. 2, it includes:
201. Obtain a sample data set.
The sample data set includes positive sample data with positive labels and unlabeled sample data without labels.
Specifically, in practical applications the sample data set may include: target object data in a target object recommendation scenario; transaction data of fraudulently used bank cards in a bank card leak point detection scenario; picture/text data in a picture/text classification scenario; and traffic data in a malicious traffic detection scenario.
The positive sample data and unlabeled sample data in each kind of sample data set vary with the practical application scenario:
For example, when the sample data set is target object data, the target object data that have been recommended are positive sample data, and the target object data that have not been recommended are negative sample data;
when the sample data set is transaction data of fraudulently used bank cards, the transaction data of known leak points are positive sample data, and the transaction data of unknown leak points are negative sample data;
when the sample data set is picture/text data, the classified picture/text data are positive sample data, and the unclassified picture/text data are negative sample data;
when the sample data set is traffic data, the known malicious traffic data are positive sample data, and the unknown traffic data are negative sample data.
202. When training each regression tree of the GBDT model, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on the training set of the current regression tree, and then construct the gradient boosting decision tree (GBDT) model from the trained regression trees.
Specifically, when constructing the positive sample training subset based on the positive sample data in the sample data set, the construction process may include: constructing a positive sample training subset from all positive sample data in the sample data set, or constructing a positive sample training subset from part of the positive sample data in the sample data set. Meanwhile, when a negative sample training subset is constructed by sampling the unlabeled sample data, the data volume of the negative sample training subset may be 1 to 2 times the data volume of the positive sample training subset. For example, when the positive sample training subset contains 1000 records, the negative sample training subset may contain 1000 to 2000 records. This ensures that each regression tree of the GBDT model is trained with a training set containing different sample data, and thus ensures the diversity of the trees. Moreover, setting the data volume of the negative sample training subset to 1 to 2 times that of the positive sample training subset has two advantages: compared with using less than 1 times, it ensures that the training set contains enough sample data, so that the accuracy of the trained model is higher; compared with using more than 2 times, the sample size is smaller, so the training time of the model is shorter. Therefore, choosing the data volume of the negative sample training subset as 1 to 2 times that of the positive sample training subset reduces the time consumed in model training while taking the model accuracy into account, and improves the efficiency of model training. It should be noted that, in the above process of determining the training set, selecting the positive and negative sample training subsets in this manner applies to business scenarios where the positive-negative sample ratio is unknown. In the other situation, when the positive-negative sample ratio in the sample data is known in a specific implementation scenario, or when the ratio between positive and negative samples in the entire sample data can be determined from historical data, the positive sample training subset and the negative sample training subset can be determined from the known positive-negative sample ratio when selecting negative samples; that is, in a business scenario where the positive-negative sample ratio is known and the estimated ratio of negative sample data volume to positive sample data volume is x, the data volume of the negative sample training subset is set to x times the data volume of the positive sample training subset. A short sketch of this sizing rule follows.
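The sizing rule can be summarized in a few lines of Python (illustrative only; the function name and the fallback value of 1.5 are assumptions within the stated 1-2x range):

def negative_subset_size(n_positive, neg_pos_ratio=None):
    # If the business scenario gives an estimated negative:positive ratio x, use x times
    # the positive volume; otherwise fall back to 1x-2x the positive volume.
    if neg_pos_ratio is not None:        # ratio known, e.g. x = 5
        return int(n_positive * neg_pos_ratio)
    return int(n_positive * 1.5)         # ratio unknown: pick a value in [1, 2]

# e.g. 1000 positives with an unknown ratio -> 1500 negatives sampled from unlabeled data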
After the training set of the current regression tree has been determined, iterative training is performed with the training set of the current regression tree and the preset GBDT algorithm to obtain the regression tree corresponding to each iteration of training. Specifically, each regression tree can be trained iteratively as follows:
First, the first regression tree is trained. Specifically, a first training set is obtained from the sample data set, and a first regression tree is trained according to the first training set, the preset GBDT algorithm, and a first parameter, where the first parameter is the mean of the actual results of all sample data in the sample data set.
Then, the subsequent regression trees are trained based on the first regression tree. Specifically, after the first regression tree has been trained, a second training set is selected from the sample data set, and a second regression tree is trained according to the second training set, the preset GBDT algorithm, and a second parameter.
Here, the second parameter is determined from the prediction results obtained for the sample data in the second training set by the first regression tree and from the actual results of the sample data in the second training set. The first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the negative sample training subsets contained in the first training set and the second training set are different. That is, when training the first regression tree, the average of the actual results can first be computed over the sample data set, this average is taken as the first parameter, and the corresponding first tree, i.e., the first regression tree, is trained according to the first parameter.
After the first regression tree has been trained, the second parameter can be determined with the first regression tree. Specifically: first, the second training set is predicted with the first regression tree to obtain the prediction results corresponding to the second training set; then, the residual between the actual results of the second training set and the corresponding prediction results is determined, and the residual is taken as the second parameter. In the embodiments of the present invention, the first regression tree can be understood as the root regression tree; after the first regression tree is trained, the subsequent second regression trees can be trained iteratively in turn, where the second parameter in each iteration is the residual obtained by taking the difference between the prediction results of the regression tree from the previous iteration on the training set of the current iteration and the actual results of that training set.
For example, when the actual results of sample data A, sample data B, and sample data C in the sample data set are 6, 11, and 4 respectively, according to the method of this step the first parameter for training the first regression tree is the average of the three actual results, i.e., 7, and the first regression tree is trained according to this first parameter 7. After the first regression tree has been determined, sample data A can be predicted with the first regression tree; if the prediction result is 5, then since the actual result of sample data A is 6, the difference between the actual result and the prediction result gives a residual of 1, and this residual 1 is used as the second parameter in the training process of the second regression tree.
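The worked example above (actual results 6, 11, 4; mean 7; prediction 5 for sample A; residual 1) can be reproduced with a few lines of Python, given here purely as an illustration:

import numpy as np

actual = np.array([6.0, 11.0, 4.0])   # actual results of samples A, B, C
first_param = actual.mean()           # first parameter = 7.0
pred_a = 5.0                          # prediction of the first regression tree for sample A
residual_a = actual[0] - pred_a       # residual = 1.0, used as the second parameter
print(first_param, residual_a)        # 7.0 1.0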
203. Construct a plurality of training sets based on the sample data set.
In actual PU learning, given that the positive sample data in the sample data are few, the obtained GBDT model can also be supplemented by constructing models based on different algorithms in order to obtain a more accurate model. Moreover, different models can be trained with different algorithms and training sets; therefore, in this step, training sets for subsequent model training can be constructed from the sample data set.
Specifically, the training sets can be constructed as follows: first, a positive sample training subset is constructed based on at least part of the positive sample data in the sample data set, and multiple sampling operations are performed on the unlabeled sample data in the sample data set to construct a plurality of negative sample training subsets; then, the positive sample training subset is combined with each of the plurality of negative sample training subsets to obtain a plurality of training sets. Of course, when constructing the positive sample training subsets of the training sets, a single positive sample training subset can be constructed as described above, or part of the positive samples can be extracted from the sample data set to construct the subsets; specifically: first, a plurality of positive sample training subsets are constructed based on at least part of the positive sample data in the sample data set, and multiple sampling operations are performed on the unlabeled sample data in the sample data set to construct a plurality of negative sample training subsets; then, each positive sample training subset is combined with the plurality of negative sample training subsets respectively to obtain a plurality of training sets.
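A minimal sketch of step 203, under the same assumptions as the earlier sketches (one shared positive subset, independently resampled negatives); the number of training sets, the 1.5x ratio, and all names are illustrative choices, not values from the patent.

import numpy as np

def build_training_sets(positive_data, unlabeled_data, n_sets=8, neg_ratio=1.5, seed=0):
    # One positive subset combined with several independently sampled negative
    # subsets yields several different training sets.
    rng = np.random.default_rng(seed)
    training_sets = []
    for _ in range(n_sets):
        n_neg = int(len(positive_data) * neg_ratio)
        idx = rng.choice(len(unlabeled_data), size=n_neg, replace=False)
        X = np.vstack([positive_data, unlabeled_data[idx]])
        y = np.concatenate([np.ones(len(positive_data)), np.zeros(n_neg)])
        training_sets.append((X, y))
    return training_sets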
204. Select from a set of machine learning algorithms, a set of hyperparameter combinations, and the plurality of training sets respectively, and train to obtain a plurality of candidate models.
Specifically, the machine learning algorithm can be chosen from a preset set of machine learning algorithms, and the hyperparameters can be obtained from a set of hyperparameter combinations. Here, a candidate model is determined by one machine learning algorithm together with one selected group of hyperparameters and one of the corresponding plurality of training sets; that is, one machine learning algorithm, one group of hyperparameters, and one training set determine one candidate model.
For example, the set of machine learning algorithms is [algorithm 1, algorithm 2, algorithm 3], the set of hyperparameter combinations is [hyperparameter combination 1, hyperparameter combination 2, ..., hyperparameter combination 10], and the training sets include training set 1, training set 2, ..., training set 8. Then selecting "algorithm 1 + hyperparameter combination 1 + training set 1" determines one candidate decision tree model, selecting "algorithm 2 + hyperparameter combination 1 + training set 1" determines another candidate decision tree model, selecting "algorithm 1 + hyperparameter combination 2 + training set 1" determines yet another candidate decision tree model, selecting "algorithm 1 + hyperparameter combination 1 + training set 2" determines yet another candidate decision tree model, and so on.
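The "algorithm x hyperparameter combination x training set" enumeration in the example can be sketched as follows. The concrete algorithms and hyperparameters below are placeholders chosen for illustration, not the ones the patent prescribes.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

algorithms = {
    "algorithm 1": DecisionTreeClassifier,
    "algorithm 2": LogisticRegression,
}
hyperparams = {
    "algorithm 1": [{"max_depth": 3}, {"max_depth": 5}],
    "algorithm 2": [{"C": 0.1}, {"C": 1.0}],
}

def train_candidates(training_sets):
    # One machine learning algorithm + one hyperparameter group + one training set
    # determines one candidate model.
    candidates = []
    for name, cls in algorithms.items():
        for params in hyperparams[name]:
            for X, y in training_sets:
                model = cls(**params).fit(X, y)
                candidates.append((name, params, model))
    return candidates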
205. Sample the positive sample data in the sample data set to construct a positive sample evaluation subset, sample the unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combine the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set.
A plurality of candidate models have been obtained in the preceding steps, and these models differ in accuracy. Therefore, in the embodiments of the present invention these candidate models also need to be evaluated in order to obtain relatively accurate models. Accordingly, when constructing the evaluation set based on the sample data set, the procedure can specifically be: sampling the positive sample data in the sample data set to construct a positive sample evaluation subset, sampling the unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combining the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set. In addition, in order to further improve the accuracy of the evaluation result, a plurality of evaluation sets can also be constructed in this step, so that each candidate model is later evaluated multiple times with the plurality of evaluation sets and a comprehensive evaluation effect is determined from the multiple evaluation results. In that case, constructing the evaluation set based on the sample data set can specifically be: constructing a plurality of evaluation sets based on the sample data set, where each evaluation set includes positive sample data and unlabeled sample data serving as negative sample data.
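An illustrative sketch of how one evaluation set (or several, with different seeds) could be assembled under the same assumptions as the earlier sketches; subset sizes and names are hypothetical.

import numpy as np

def build_evaluation_set(positive_data, unlabeled_data, n_pos=200, n_neg=200, seed=1):
    # Sample positives for the positive evaluation subset and unlabeled rows as the
    # negative evaluation subset, then combine them into one evaluation set.
    rng = np.random.default_rng(seed)
    pos = positive_data[rng.choice(len(positive_data), size=n_pos, replace=False)]
    neg = unlabeled_data[rng.choice(len(unlabeled_data), size=n_neg, replace=False)]
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return X, y

# e.g. several evaluation sets whose results are later merged:
# eval_sets = [build_evaluation_set(positive_data, unlabeled_data, seed=s) for s in range(3)]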
206. Evaluate the at least one GBDT model and the plurality of candidate models respectively according to the evaluation set, obtain an evaluation result corresponding to each model, and select from the evaluation results a plurality of models that satisfy a preset condition.
When a plurality of evaluation sets have been constructed, the process of selecting models that satisfy the preset condition according to the evaluation results can be carried out as follows: first, for each candidate model, the candidate model is evaluated respectively according to the plurality of evaluation sets and a preset evaluation condition to obtain a plurality of evaluation results; then, the plurality of evaluation results of each candidate model are merged, and the final evaluation result obtained for the candidate model by merging the plurality of evaluation results is taken as its actual evaluation result.
It should be noted that different preset evaluation conditions directly affect the evaluation manner and the evaluation result, so for the same model, different preset evaluation conditions yield different evaluation results. For example, when the preset evaluation condition is the maximal margin method, the evaluation result corresponding to each candidate model is the class margin of that candidate model's prediction results on the evaluation set. When the preset evaluation condition is the method of computing the AUC value, the evaluation result corresponding to each candidate model is that candidate model's AUC value on the evaluation set. The AUC value can be understood as a probability: when a positive sample and a negative sample are selected at random, the probability that the current classification algorithm ranks the positive sample ahead of the negative sample according to the computed score is the AUC value. The larger the AUC value, the more likely the current classification model is to rank positive samples ahead of negative samples and thus to classify correctly, so the classification effect of the model is more accurate.
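A sketch of AUC-based evaluation and selection, assuming scikit-learn's roc_auc_score, candidate models with a predict_proba method, and "highest mean AUC over the evaluation sets" as the preset condition; all of these are illustrative assumptions rather than requirements of the patent.

import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_and_select(candidates, eval_sets, top_k=3):
    # Score every candidate on every evaluation set, merge the per-set AUC values by
    # averaging, and keep the top_k models.
    scored = []
    for name, params, model in candidates:
        aucs = [roc_auc_score(y, model.predict_proba(X)[:, 1]) for X, y in eval_sets]
        scored.append((float(np.mean(aucs)), name, params, model))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]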
207. Integrate the plurality of models that satisfy the preset condition to obtain an integrated composite model.
The candidate models whose evaluation results satisfy the preset condition are often multiple, and the accuracies of these candidate models are not identical. In order to further guarantee the accuracy of the model, the above models need to be integrated in this case, where the integration process can be: assigning a corresponding weight to each selected candidate model according to its evaluation result, and integrating the selected candidate models according to the weights.
In this way, by integrating the models that satisfy the preset condition, the final composite model is obtained; the overfitting problem of the model can be further alleviated on the basis of the obtained GBDT model, and the resulting model can be ensured to have good prediction accuracy.
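A minimal sketch of the weighted integration step, where each selected model's weight is derived from its evaluation result; the normalization scheme (scores normalized to sum to 1) and the probability-averaging composite are assumptions made for illustration.

import numpy as np

def integrate(selected):
    # 'selected' holds (score, name, params, model) tuples from the selection step;
    # weights are the evaluation scores normalized to sum to 1, and the composite
    # prediction is the weighted average of the members' positive-class probabilities.
    scores = np.array([s for s, *_ in selected], dtype=float)
    weights = scores / scores.sum()

    def composite_predict(X):
        probs = [m.predict_proba(X)[:, 1] for *_, m in selected]
        return np.average(np.vstack(probs), axis=0, weights=weights)

    return composite_predict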
Further, since the above method for constructing a gradient boosting decision tree (GBDT) model is not only intended to obtain a more accurate model, but its practical significance also lies in applying it to actual scenarios to solve practical problems, for example in target object recommendation, bank card leak point detection, picture/text classification, and malicious traffic detection, the process of using the above method to solve these problems can be as shown in the following embodiment.
First, data to be predicted are obtained, where the data to be predicted may include: picture/text data to be classified, transaction data of fraudulently used bank cards whose leak points are to be detected, target object data to be predicted, and traffic data to be detected. Specifically, the data to be predicted differ according to the application scenario.
Then, a gradient boosting decision tree (GBDT) model is trained according to any of the methods of the preceding embodiments. Specifically, the execution process may be: obtaining a sample data set; then, when training each regression tree of the GBDT model, constructing a positive sample training subset based on the positive sample data in the sample data set, sampling the unlabeled sample data in the sample data set to construct a negative sample training subset, combining the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, training the current regression tree based on the training set of the current regression tree, and then constructing the gradient boosting decision tree (GBDT) model from the trained regression trees. For example, when the method is executed in a malicious traffic scenario, the sample data set obtained in this step is the traffic data used for malicious traffic detection, where the known malicious traffic data in the traffic data are positive sample data and the unknown traffic data are negative sample data.
Finally, the prediction task is performed with the obtained gradient boosting decision tree (GBDT) model, where the prediction task corresponds to the data to be predicted obtained in the preceding steps. For example, when the data to be predicted are traffic data to be detected in a malicious traffic detection scenario, the prediction task finally performed with the gradient boosting decision tree (GBDT) model is the malicious traffic detection task.
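For the malicious traffic scenario, a hypothetical end-to-end usage could look like the sketch below. It is illustration only: the data are stubbed with random numbers, and for brevity it uses scikit-learn's off-the-shelf GradientBoostingClassifier on a single positive-plus-sampled-negative training set rather than the per-tree resampling scheme sketched after step 102.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stub inputs standing in for real traffic features (shapes are arbitrary).
rng = np.random.default_rng(7)
positive_data = rng.normal(size=(300, 8))      # known malicious traffic
unlabeled_data = rng.normal(size=(9_700, 8))   # traffic not yet examined
to_detect = rng.normal(size=(100, 8))          # traffic data to be detected

# Treat a 1.5x sample of unlabeled traffic as negatives, as in the earlier sketches.
neg = unlabeled_data[rng.choice(len(unlabeled_data), size=450, replace=False)]
X = np.vstack([positive_data, neg])
y = np.concatenate([np.ones(len(positive_data)), np.zeros(len(neg))])

model = GradientBoostingClassifier().fit(X, y)          # off-the-shelf GBDT, for illustration
malicious_prob = model.predict_proba(to_detect)[:, 1]   # detection task on unseen traffic
flagged = np.where(malicious_prob > 0.5)[0]             # indices flagged as malicious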
In addition, the above example can likewise be applied in the target object recommendation scenario and in the other scenarios described above.
In addition, as an implementation of the above method for constructing a gradient boosting decision tree (GBDT) model, an embodiment of the present invention provides an apparatus for constructing a gradient boosting decision tree (GBDT) model. The apparatus is mainly used to solve the problem of low model accuracy caused by overfitting and to improve the accuracy of the trained GBDT model. For ease of reading, this apparatus embodiment does not repeat the details of the foregoing method embodiments one by one, but it should be understood that the apparatus in this embodiment can correspondingly implement everything in the foregoing method embodiments. The apparatus is shown in Fig. 3 and specifically includes:
an acquiring unit 31, which can be used to obtain a sample data set, the sample data set including positive sample data with positive labels and unlabeled sample data without labels;
a construction unit 32, which can be used to, when training each regression tree of the GBDT model, construct a positive sample training subset based on the positive sample data in the sample data set obtained by the acquiring unit 31, sample the unlabeled sample data in the sample data set obtained by the acquiring unit 31 to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on the training set of the current regression tree, and then construct the gradient boosting decision tree (GBDT) model from the trained regression trees.
Further, as shown in Fig. 4, the construction unit 32 includes:
a first building module 321, which can be used to construct a positive sample training subset from all positive sample data in the sample data set, or to construct a positive sample training subset from part of the positive sample data in the sample data set.
Further, as shown in Fig. 4, the construction unit 32 includes:
a second building module 322, which can be used to, in a business scenario where the positive-negative sample ratio is known and the estimated ratio of negative sample data volume to positive sample data volume is x, set the data volume of the negative sample training subset to x times the data volume of the positive sample training subset, and, in a business scenario where the positive-negative sample ratio is unknown, set the data volume of the negative sample training subset to 1 to 2 times the data volume of the positive sample training subset.
Further, as shown in Fig. 4, the construction unit 32 includes:
a training module 323, which can be used to perform iterative training with the training set of the current regression tree obtained by the first building module and the second building module and a preset GBDT algorithm, to obtain the regression tree corresponding to each iteration of training.
Further, as shown in figure 4, the training module 323 includes:
First training submodule 3231 can be used for concentrating the first training set of acquisition from the sample data and according to institute The first training set, default GBDT algorithm and the first parameter are stated, the first regression tree of training, first parameter is for sample number According to the mean value of the actual result of whole sample datas of concentration;
Second training submodule 3232 can be used for obtaining first regression tree when the first training training of submodule 3231 Afterwards, it is concentrated from the sample training and chooses the second training set, and according to second training set, default GBDT algorithm, Yi Ji Two parameters, the second regression tree of training, second parameter is by the sample data in second training set according to described the What the actual result of the prediction result that one regression tree determines and sample data in the second training set determined, first training set, Second training set is made of a positive sample training subset and a negative sample training subset, first training set with The negative sample training subset that second training set is included is different.
Further, as shown in Fig. 4, the training module 323 further includes:
A prediction submodule 3233, which can be used to predict the second training set with the first regression tree obtained by the first training submodule 3231, obtaining the prediction results corresponding to the second training set;
A determination submodule 3234, which can be used to determine the residual between the actual results of the second training set and the prediction results for the second training set obtained by the prediction submodule 3233, and to take the residual as the second parameter, so that the second training submodule 3232 trains the second regression tree according to the second parameter.
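Putting the submodules 3231-3234 together, the loop below is a minimal sketch of the iterative training: the first parameter is the mean of the actual results, each later tree fits the residual between the actual results of its own freshly sampled training set and the running prediction, and a new pseudo-negative subset is drawn per iteration. It reuses build_tree_training_set from the earlier sketch; the tree learner, depth, learning rate and all names are assumptions, not the patent's preset GBDT algorithm.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def train_pu_gbdt(positive_data, unlabeled_data, n_trees=10, learning_rate=0.1,
                      neg_pos_ratio=1.5, rng=None):
        rng = rng or np.random.default_rng(0)
        trees, base = [], None
        for _ in range(n_trees):
            # Fresh training set per tree: same positives, newly sampled negatives.
            X, y = build_tree_training_set(positive_data, unlabeled_data, neg_pos_ratio, rng)
            if base is None:
                base = y.mean()                      # first parameter: mean of actual results
                pred = np.full(len(y), base)
            else:
                pred = base + learning_rate * sum(t.predict(X) for t in trees)
            residual = y - pred                      # second parameter: actual minus predicted
            trees.append(DecisionTreeRegressor(max_depth=3).fit(X, residual))
        return base, trees

A prediction for new data is then base plus learning_rate times the sum of the individual tree predictions, which is the standard way the regression trees of a GBDT model are combined.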
Further, as shown in Fig. 4, the device further includes:
A training set construction unit 33, which can be used to construct multiple training sets based on the sample data set obtained by the acquiring unit 31;
A training unit 34, which can be used to select, respectively, from a set of machine learning algorithms, a set of hyperparameter combinations and the multiple training sets obtained by the training set construction unit 33, and to train multiple candidate models, where one machine learning algorithm, one group of hyperparameters and one training set determine one candidate model;
An assessment unit 35, which can be used to assess, respectively, the at least one GBDT model constructed by the construction unit 32 and the multiple candidate models trained by the training unit 34, and to select multiple models that meet a preset condition;
An integration unit 36, which can be used to integrate the multiple models that meet the preset condition after assessment by the assessment unit 35, obtaining an integrated composite model.
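The candidate-model search and the final integration can be pictured with the short sketch below; the algorithm list, the hyperparameter groups and the averaging fusion are illustrative assumptions, since the embodiment does not fix a particular search space or integration strategy.

    from itertools import product
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

    def train_candidates(training_sets):
        # One algorithm + one hyperparameter group + one training set -> one candidate model.
        algorithm_space = [
            (GradientBoostingRegressor, {"n_estimators": 50, "max_depth": 3}),
            (GradientBoostingRegressor, {"n_estimators": 200, "max_depth": 5}),
            (RandomForestRegressor, {"n_estimators": 100}),
        ]
        return [cls(**params).fit(X, y)
                for (cls, params), (X, y) in product(algorithm_space, training_sets)]

    def fuse_predictions(models, X):
        # Simple averaging over the selected models (one possible integration strategy).
        return sum(m.predict(X) for m in models) / len(models)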
Further, as shown in Fig. 4, the device further includes:
An assessment set construction unit 37, which can be used to sample the positive sample data in the sample data set obtained by the acquiring unit 31 to construct a positive sample assessment subset, to sample the unlabeled sample data in the sample data set to construct a negative sample assessment subset, and to combine the positive sample assessment subset and the negative sample assessment subset to obtain an assessment set;
The assessment unit 35 can specifically be used to assess, respectively, the at least one GBDT model and the multiple candidate models according to the assessment set obtained by the assessment set construction unit 37, to obtain an assessment result corresponding to each model, and to select the multiple models that meet the preset condition from the assessment results.
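A possible shape of the assessment step is sketched below; treating sampled unlabeled records as negatives for evaluation follows the construction described above, while the AUC metric and the threshold are illustrative assumptions for the "preset condition".

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def build_assessment_set(positive_data, unlabeled_data, n_pos, n_neg, rng=None):
        # Sampled positives plus sampled unlabeled records treated as negatives.
        rng = rng or np.random.default_rng(1)
        pos = positive_data[rng.choice(len(positive_data), size=n_pos, replace=False)]
        neg = unlabeled_data[rng.choice(len(unlabeled_data), size=n_neg, replace=False)]
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
        return X, y

    def select_models(models, X_eval, y_eval, min_auc=0.75):
        # Keep every model whose assessment result meets the preset condition.
        return [m for m in models if roc_auc_score(y_eval, m.predict(X_eval)) >= min_auc]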
Further, as shown in Fig. 4, the sample data set includes: target object data in target object recommendation, transaction data of fraudulently swiped bank cards in bank card leak point detection, picture/text data in picture/text classification, and traffic data in malicious traffic detection;
Wherein, when the sample data set is target object data, the target object data that has been recommended is positive sample data and the target object data that has not been recommended is negative sample data; when the sample data set is transaction data of fraudulently swiped bank cards, the transaction data with known leak points is positive sample data and the transaction data with unknown leak points is negative sample data; when the sample data set is picture/text data, the classified picture/text data is positive sample data and the unclassified picture/text data is negative sample data; when the sample data set is traffic data, the known malicious traffic data is positive sample data and the undetected traffic data is negative sample data.
Based on the methods and devices described in the foregoing embodiments, and in combination with specific application scenarios such as the recommendation of target objects like music and commodities, an embodiment of the present invention further provides a system for realizing target object recommendation, so as to implement the target object recommendation function. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiments one by one, but it should be clear that the system in this embodiment can correspondingly realize all the content of the foregoing method embodiments. Specifically, as shown in Fig. 5, the system includes:
A target object data acquiring unit 51, which can be used to obtain target object data to be predicted;
A device 52 for constructing a gradient boosting decision tree (GBDT) model, which can be used to obtain a GBDT model based on a sample data set, where the sample data set for training the GBDT model is a data set about target objects, in which the data of target objects selected by users is positive sample data and the data of target objects not selected by users is unlabeled sample data; in an embodiment of the present invention, the device 52 for constructing the GBDT model can specifically be as shown in Fig. 3 or Fig. 4;
An execution unit 53, which can be used to execute a target object recommendation task using the GBDT model obtained by the device 52 for constructing the GBDT model; wherein the target object is a commodity or service provided via the Internet.
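For illustration, the recommendation step can look like the sketch below, which scores candidate target objects with the trained trees and returns the highest-scoring item identifiers; the feature matrix, the base/learning-rate convention and the top-k cut-off follow the earlier training sketch and are assumptions, not the patent's interface.

    import numpy as np

    def recommend(trees, base, learning_rate, item_features, item_ids, top_k=10):
        # Score every candidate target object and return the top-k identifiers.
        scores = base + learning_rate * sum(t.predict(item_features) for t in trees)
        order = np.argsort(scores)[::-1][:top_k]
        return [item_ids[i] for i in order]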
In addition, in combination with a specific application scenario, namely bank card leak point detection, an embodiment of the present invention further provides a system for realizing bank card leak point detection, so as to implement the function of detecting leak points in bank card transaction data. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiments one by one, but it should be clear that the system in this embodiment can correspondingly realize all the content of the foregoing method embodiments. Specifically, as shown in Fig. 6, the system includes:
A transaction data acquiring unit 61, which can be used to obtain transaction data of fraudulently swiped bank cards whose leak point is to be detected;
A device 62 for constructing a gradient boosting decision tree (GBDT) model, which is used to obtain a GBDT model based on a sample data set, where the sample data set for training the GBDT model is a transaction data set of fraudulently swiped bank cards, in which the transaction data of fraudulently swiped bank cards with marked leak points is positive sample data and the transaction data of fraudulently swiped bank cards with unmarked leak points is unlabeled sample data; in an embodiment of the present invention, the device 62 for constructing the GBDT model can specifically be as shown in Fig. 3 or Fig. 4;
An execution unit 63, which can be used to execute a bank card leak point detection task using the GBDT model obtained by the device 62 for constructing the GBDT model;
Wherein the transaction data set includes, for each transaction of each bank card, the transaction time and the terminal device identifier at the time of the transaction.
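Since only the transaction time and the terminal device identifier are named as fields, a plausible and purely illustrative per-card feature construction could look like the sketch below; the column names and the chosen aggregates are assumptions, not part of the patent.

    import pandas as pd

    def transaction_features(transactions: pd.DataFrame) -> pd.DataFrame:
        # Per-card aggregates built from transaction time and terminal identifier.
        df = transactions.copy()
        df["hour"] = pd.to_datetime(df["transaction_time"]).dt.hour
        return df.groupby("card_id").agg(
            n_transactions=("terminal_id", "size"),
            n_distinct_terminals=("terminal_id", "nunique"),
            night_share=("hour", lambda h: (h < 6).mean()),
        )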
In addition, in combination with a specific application scenario, namely text and image classification, an embodiment of the present invention further provides a system for realizing picture/text classification, so as to implement the function of classifying text and images. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiments one by one, but it should be clear that the system in this embodiment can correspondingly realize all the content of the foregoing method embodiments. Specifically, as shown in Fig. 7, the system includes:
A picture/text data acquiring unit 71, which can be used to obtain picture/text data to be predicted;
A device 72 for constructing a gradient boosting decision tree (GBDT) model, which is used to obtain a GBDT model based on a sample data set, where the sample data set for training the GBDT model is a picture/text data set, in which picture/text data with classification labels is positive sample data and picture/text data without classification labels is unlabeled sample data; in an embodiment of the present invention, the device 72 for constructing the GBDT model can specifically be as shown in Fig. 3 or Fig. 4;
An execution unit 73, which can be used to execute a picture/text classification task using the GBDT model obtained by the device 72 for constructing the GBDT model.
In addition, in combination with a specific application scenario, namely malicious traffic detection, an embodiment of the present invention further provides a system for realizing malicious traffic detection, so as to detect malicious traffic in traffic data that has not yet been detected. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiments one by one, but it should be clear that the system in this embodiment can correspondingly realize all the content of the foregoing method embodiments. Specifically, as shown in Fig. 8, the system includes:
A traffic data acquiring unit 81, which can be used to obtain traffic data to be detected;
A device 82 for constructing a gradient boosting decision tree (GBDT) model, which is used to obtain a GBDT model based on a sample data set, where the sample data set for training the GBDT model is a traffic data set, in which known malicious traffic data is positive sample data and undetected traffic data is unlabeled sample data; in an embodiment of the present invention, the device 82 for constructing the GBDT model can specifically be as shown in Fig. 3 or Fig. 4;
An execution unit 83, which can be used to execute a malicious traffic detection task using the GBDT model obtained by the device 82 for constructing the GBDT model.
Further, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more computing devices, implements the above method for constructing a gradient boosting decision tree (GBDT) model.
In addition, an embodiment of the present invention also provides a system including one or more computing devices and one or more storage devices, the one or more storage devices recording a computer program that, when executed by the one or more computing devices, causes the one or more computing devices to implement the above method for constructing a gradient boosting decision tree (GBDT) model.
In conclusion a kind of building gradient that the embodiment of the present invention proposes promotes the method and dress of decision tree GBDT model It sets, sample data set can be obtained, then in each regression tree of training GBDT model, concentrated based on the sample data Positive sample data construct a positive sample training subset, to the sample data concentrate unmarked sample data sample A negative sample training subset is constructed, the positive sample training subset and the multiple negative sample training subset are combined To the training set of current regression tree, and based on the current regression tree of the training set of current regression tree training, further according to described every One regression tree building gradient promotes decision tree GBDT model, and compared with the prior art, the present invention can be concentrated by sample data The positive sample training subset and negative sample training subset of acquisition train the regression tree of each GBDT model, due to for training The current regression tree training set of each regression tree is extracted from sample training collection, more so as to guarantee to obtain more The problem of otherness between a tree, over-fitting caused by avoiding because of existing training method, then improves trained The gradient arrived promotes the accuracy of decision tree GBDT model.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
It can be understood that the related features in the above methods and devices may refer to each other. In addition, "first", "second" and the like in the above embodiments are used to distinguish the embodiments and do not represent the superiority or inferiority of any embodiment.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used together with the teachings herein. The structure required to construct such systems is apparent from the above description. In addition, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the content of the invention described herein, and the above description of a specific language is intended to disclose the best mode of carrying out the invention.
In addition, the memory may include a non-volatile memory in a computer-readable medium, a random access memory (RAM) and/or a non-volatile memory such as a read-only memory (ROM) or a flash memory (flash RAM), and the memory includes at least one memory chip.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, the computing device includes one or more processors (CPUs), an input/output interface, a network interface and a memory.
The memory may include a non-volatile memory in a computer-readable medium, a random access memory (RAM) and/or a non-volatile memory such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can realize information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, commodity or device. Without further restrictions, an element defined by the sentence "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.
It will be understood by those skilled in the art that the embodiments herein may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
The above are only embodiments of the present application and are not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A method for constructing a gradient boosting decision tree (GBDT) model, comprising:
obtaining a sample data set, the sample data set including positive sample data with positive labels and unlabeled sample data without labels;
when training each regression tree of the GBDT model, constructing a positive sample training subset based on the positive sample data in the sample data set, sampling the unlabeled sample data in the sample data set to construct a negative sample training subset, combining the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, and training the current regression tree based on the training set of the current regression tree, and then constructing the gradient boosting decision tree (GBDT) model according to each trained regression tree.
2. The method of claim 1, wherein constructing a positive sample training subset based on the positive sample data in the sample data set comprises:
taking all of the positive sample data in the sample data set to construct a positive sample training subset;
or,
taking part of the positive sample data in the sample data set to construct a positive sample training subset.
3. The method of claim 1, wherein
in a business scenario where the ratio of positive to negative samples is known, when the estimated ratio of the negative sample data volume to the positive sample data volume is x, the data volume of the negative sample training subset is set to x times the data volume of the positive sample training subset;
in a business scenario where the ratio of positive to negative samples is unknown, the data volume of the negative sample training subset is set to 1 to 2 times the data volume of the positive sample training subset.
4. The method of claim 1, wherein training the current regression tree based on the training set comprises:
performing iterative training with the training set of the current regression tree and a preset GBDT algorithm, obtaining one regression tree for each training iteration.
5. The method of claim 4, wherein performing iterative training with the training set of the current regression tree and the preset GBDT algorithm, obtaining one regression tree for each training iteration, comprises:
obtaining a first training set from the sample data set and training a first regression tree according to the first training set, the preset GBDT algorithm and a first parameter, where the first parameter is the mean of the actual results of all sample data in the sample data set;
after the first regression tree has been trained, selecting a second training set from the sample data set and training a second regression tree according to the second training set, the preset GBDT algorithm and a second parameter, where the second parameter is determined from the prediction results of the sample data in the second training set obtained with the first regression tree and the actual results of the sample data in the second training set; the first training set and the second training set are each composed of one positive sample training subset and one negative sample training subset, and the negative sample training subset contained in the first training set differs from the one contained in the second training set.
6. A method for realizing target object recommendation, comprising:
obtaining target object data to be predicted;
obtaining the gradient boosting decision tree (GBDT) model according to the method of any one of claims 1 to 5;
executing a target object recommendation task using the obtained GBDT model;
wherein the target object is a commodity or service provided via the Internet.
7. A method for realizing bank card leak point detection, comprising:
obtaining transaction data of fraudulently swiped bank cards whose leak point is to be detected;
obtaining a gradient boosting decision tree (GBDT) model according to the method of any one of claims 1 to 5;
executing a bank card leak point detection task using the obtained GBDT model;
wherein the transaction data set includes, for each transaction of each bank card, the transaction time and the terminal device identifier at the time of the transaction.
8. A method for realizing picture/text classification, comprising:
obtaining picture/text data to be predicted;
obtaining a gradient boosting decision tree (GBDT) model according to the method of any one of claims 1 to 5;
executing a picture/text classification task using the obtained GBDT model.
9. A method for malicious traffic detection, comprising:
obtaining traffic data to be detected;
obtaining a gradient boosting decision tree (GBDT) model according to the method of any one of claims 1 to 5;
executing a detection task on the traffic data to be detected using the obtained GBDT model.
10. A device for constructing a gradient boosting decision tree (GBDT) model, comprising:
an acquiring unit for obtaining a sample data set, the sample data set including positive sample data with positive labels and unlabeled sample data without labels;
a construction unit for, when training each regression tree of the GBDT model, constructing a positive sample training subset based on the positive sample data in the sample data set, sampling the unlabeled sample data in the sample data set to construct a negative sample training subset, combining the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, and training the current regression tree based on the training set of the current regression tree, and then constructing the gradient boosting decision tree (GBDT) model according to each trained regression tree.
CN201910526406.6A 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device Active CN110348580B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210493503.1A CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device
CN201910526406.6A CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910526406.6A CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210493503.1A Division CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Publications (2)

Publication Number Publication Date
CN110348580A true CN110348580A (en) 2019-10-18
CN110348580B CN110348580B (en) 2022-05-10

Family

ID=68182254

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910526406.6A Active CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device
CN202210493503.1A Pending CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210493503.1A Pending CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Country Status (1)

Country Link
CN (2) CN110348580B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8499008B2 (en) * 2009-07-24 2013-07-30 Yahoo! Inc. Mixing knowledge sources with auto learning for improved entity extraction
US9262532B2 (en) * 2010-07-30 2016-02-16 Yahoo! Inc. Ranking entity facets using user-click feedback
US20190050368A1 (en) * 2016-04-21 2019-02-14 Sas Institute Inc. Machine learning predictive labeling system
US10063582B1 (en) * 2017-05-31 2018-08-28 Symantec Corporation Securing compromised network devices in a network
CN107463802A (en) * 2017-08-02 2017-12-12 南昌大学 A kind of Forecasting Methodology of protokaryon protein acetylation sites
CN107909516A (en) * 2017-12-06 2018-04-13 链家网(北京)科技有限公司 A kind of problem source of houses recognition methods and system
CN108269012A (en) * 2018-01-12 2018-07-10 中国平安人寿保险股份有限公司 Construction method, device, storage medium and the terminal of risk score model
CN108717867A (en) * 2018-05-02 2018-10-30 中国科学技术大学苏州研究院 Disease forecasting method for establishing model and device based on Gradient Iteration tree
CN108539738A (en) * 2018-05-10 2018-09-14 国网山东省电力公司电力科学研究院 A kind of short-term load forecasting method promoting decision tree based on gradient
CN109242105A (en) * 2018-08-17 2019-01-18 第四范式(北京)技术有限公司 Tuning method, apparatus, equipment and the medium of hyper parameter in machine learning model
CN109472296A (en) * 2018-10-17 2019-03-15 阿里巴巴集团控股有限公司 A kind of model training method and device promoting decision tree based on gradient
CN109460795A (en) * 2018-12-17 2019-03-12 北京三快在线科技有限公司 Classifier training method, apparatus, electronic equipment and computer-readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUAN CHENG ET AL: "Research on Travel Time Prediction Model of Freeway Based on Gradient Boosting Decision Tree", 《IEEE ACCESS》 *
曾思源: "Design and Implementation of a Network User Identification System Based on Behavioral Similarity", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866528A (en) * 2019-10-28 2020-03-06 腾讯科技(深圳)有限公司 Model training method, energy consumption use efficiency prediction method, device and medium
CN110866528B (en) * 2019-10-28 2023-11-28 腾讯科技(深圳)有限公司 Model training method, energy consumption use efficiency prediction method, device and medium
CN111045716B (en) * 2019-11-04 2022-02-22 中山大学 Related patch recommendation method based on heterogeneous data
CN111045716A (en) * 2019-11-04 2020-04-21 中山大学 Related patch recommendation method based on heterogeneous data
CN110824099A (en) * 2019-11-07 2020-02-21 东南大学 Method for predicting reaction performance in solid fuel chemical chain process based on GBRT
CN110824099B (en) * 2019-11-07 2022-03-04 东南大学 Method for predicting reaction performance in solid fuel chemical chain process based on GBRT
CN111177375A (en) * 2019-12-16 2020-05-19 医渡云(北京)技术有限公司 Electronic document classification method and device
CN111428930A (en) * 2020-03-24 2020-07-17 中电药明数据科技(成都)有限公司 GBDT-based medicine patient using number prediction method and system
CN111310860A (en) * 2020-03-26 2020-06-19 清华大学深圳国际研究生院 Method and computer-readable storage medium for improving performance of gradient boosting decision trees
CN111310860B (en) * 2020-03-26 2023-04-18 清华大学深圳国际研究生院 Method and computer-readable storage medium for improving performance of gradient boosting decision trees
CN111860935A (en) * 2020-05-21 2020-10-30 北京骑胜科技有限公司 Fault prediction method, device, equipment and storage medium of vehicle
CN112101678A (en) * 2020-09-23 2020-12-18 东莞理工学院 GBDT-based student personality tendency prediction method
CN112434862A (en) * 2020-11-27 2021-03-02 中国人民大学 Financial predicament method and device for enterprise on market
CN112434862B (en) * 2020-11-27 2024-03-12 中国人民大学 Method and device for predicting financial dilemma of marketing enterprises
CN112581342A (en) * 2020-12-25 2021-03-30 中国建设银行股份有限公司 Method, device and equipment for evaluating aged care institution grade and storage medium
CN113326433A (en) * 2021-03-26 2021-08-31 沈阳工业大学 Personalized recommendation method based on ensemble learning
CN113326433B (en) * 2021-03-26 2023-10-10 沈阳工业大学 Personalized recommendation method based on ensemble learning
CN113343051A (en) * 2021-06-04 2021-09-03 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method
CN113343051B (en) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method

Also Published As

Publication number Publication date
CN114819186A (en) 2022-07-29
CN110348580B (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN110348580A (en) Construct the method, apparatus and prediction technique, device of GBDT model
CN110084374A (en) Construct method, apparatus and prediction technique, device based on the PU model learnt
CN109936582A (en) Construct the method and device based on the PU malicious traffic stream detection model learnt
Liang et al. Interpretable structure-evolving LSTM
CN109934293A (en) Image-recognizing method, device, medium and obscure perception convolutional neural networks
Karayev et al. Anytime recognition of objects and scenes
CN104933428B (en) A kind of face identification method and device based on tensor description
Adhikari et al. Iterative bounding box annotation for object detection
CN109034219A (en) Multi-tag class prediction method and device, electronic equipment and the storage medium of image
CN108009593A (en) A kind of transfer learning optimal algorithm choosing method and system
CN109086811A (en) Multi-tag image classification method, device and electronic equipment
CN112800097A (en) Special topic recommendation method and device based on deep interest network
CN113688665B (en) Remote sensing image target detection method and system based on semi-supervised iterative learning
CN105844283A (en) Method for identifying category of image, image search method and image search device
CN107506793A (en) Clothes recognition methods and system based on weak mark image
CN109189767A (en) Data processing method, device, electronic equipment and storage medium
CN110263979A (en) Method and device based on intensified learning model prediction sample label
CN110084245B (en) Weak supervision image detection method and system based on visual attention mechanism reinforcement learning
Song et al. Temporal action localization in untrimmed videos using action pattern trees
Li et al. Localizing and quantifying infrastructure damage using class activation mapping approaches
CN110119860A (en) A kind of rubbish account detection method, device and equipment
Long et al. Learning to localize actions from moments
CN110175657A (en) A kind of image multi-tag labeling method, device, equipment and readable storage medium storing program for executing
Aquil et al. Predicting software defects using machine learning techniques
CN113051404A (en) Knowledge reasoning method, device and equipment based on tensor decomposition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant