CN114819186A - Method and device for constructing GBDT model, and prediction method and device

Info

Publication number
CN114819186A
Authority
CN
China
Prior art keywords
training, sample data, sample, data, positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210493503.1A
Other languages
Chinese (zh)
Inventor
王海
涂威威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202210493503.1A
Publication of CN114819186A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for constructing a gradient boosting decision tree (GBDT) model, relates to the technical field of machine learning, and mainly aims to solve the problem that decision tree models trained by existing methods have low accuracy. The main technical scheme of the invention is as follows: a sample data set is acquired, wherein the sample data set comprises positive sample data with positive labels and unlabeled sample data without labels; when each regression tree of the GBDT model is trained, a positive sample training subset is constructed based on the positive sample data in the sample data set, the unlabeled sample data in the sample data set are sampled to construct a negative sample training subset, the positive sample training subset and the negative sample training subset are combined to obtain the training set of the current regression tree, the current regression tree is trained based on that training set, and the gradient boosting decision tree GBDT model is then constructed from the trained regression trees. The method is used in the process of constructing gradient boosting decision trees.

Description

Method and device for constructing GBDT model, and prediction method and device
The present application is a divisional application of the patent application entitled "Method and apparatus for building GBDT model, prediction method and apparatus", filed on June 18, 2019 under application number 201910526406.6.
Technical Field
The invention relates to the technical field of machine learning, in particular to a method and a device for constructing a Gradient Boosting Decision Tree (GBDT) model and a method and a device for predicting by using the model.
Background
With continuous technological progress, artificial intelligence technology has gradually developed. Machine learning is a natural product of artificial intelligence research reaching a certain stage, and aims to improve the performance of a system itself by means of computation and experience. In a computer system, "experience" usually exists in the form of "data", from which a "model" can be generated by a machine learning algorithm: by providing empirical data to a machine learning algorithm, a model can be generated based on that data, and the model provides a corresponding judgment, i.e. a prediction, when faced with a new situation. Whether a machine learning model is being trained or a trained machine learning model is being used for prediction, the data needs to be converted into machine learning samples comprising various features.
Currently, in real-world applications, data is relatively easy to acquire, but labeling data requires considerable manpower and material resources. A data set therefore often contains a small amount of labeled data, marked as positive samples, together with a large amount of unlabeled data. For this situation, positive-unlabeled learning (PU learning) is generally combined with the gradient boosting decision tree algorithm to train a decision tree model; for example, the GBDT algorithm is selected to train the corresponding gradient boosting decision tree GBDT model from the sample data.
However, in practical applications, when a decision tree model based on PU learning is trained, the labeled "positive samples" in the sample data are few and most of the sample data are unlabeled, so "overfitting" is very likely to occur during training and gradient boosting, overfitting being the phenomenon in which the learned hypothesis fits the training data too closely to generalize. As a result, the accuracy of decision tree models trained by the conventional method is low.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for constructing a gradient boosting decision tree GBDT model, mainly aiming to solve the problem of the low accuracy of decision tree models trained by existing methods and to improve the accuracy of the trained model.
In order to achieve this purpose, the invention mainly provides the following technical schemes:
in one aspect, the present invention provides a method for constructing a gradient boosting decision tree GBDT model, which specifically includes:
acquiring a sample data set, wherein the sample data set comprises positive sample data with a positive label and unlabeled sample data without a label;
when each regression tree of the GBDT model is trained, constructing a positive sample training subset based on positive sample data in the sample data set, sampling unlabeled sample data in the sample data set to construct a negative sample training subset, combining the positive sample training subset and the negative sample training subset to obtain a training set of the current regression tree, training the current regression tree based on the training set of the current regression tree, and then constructing the gradient boosting decision tree GBDT model from the trained regression trees.
Optionally, the constructing a positive sample training subset based on positive sample data in the sample data set includes:
taking all positive sample data in the sample data set to construct a positive sample training subset;
or,
taking part of the positive sample data in the sample data set to construct the positive sample training subset.
Optionally, in a service scenario where the positive-to-negative sample ratio is known, when the estimated ratio of negative sample data volume to positive sample data volume is x, the data volume of the negative sample training subset is x times the data volume of the positive sample training subset; and when the positive-to-negative sample ratio is unknown, the data volume of the negative sample training subset is set to 1 to 2 times the data volume of the positive sample training subset.
Optionally, the training the current regression tree based on the training set includes:
and performing iterative training through the training set of the current regression tree and a preset GBDT algorithm to obtain each regression tree corresponding to each iterative training.
Optionally, the performing iterative training through the training set of the current regression tree and a preset GBDT algorithm to obtain each regression tree corresponding to each iterative training includes:
acquiring a first training set from the sample data set, and training a first regression tree according to the first training set, a preset GBDT algorithm and a first parameter, wherein the first parameter is a mean value of actual results of all sample data in the sample data set;
after the first regression tree is obtained through training, selecting a second training set from the sample training set, and training a second regression tree according to the second training set, a preset GBDT algorithm and a second parameter, where the second parameter is determined from the prediction result that the first regression tree gives for the sample data in the second training set and the actual result of that sample data; the first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the negative sample training subsets contained in the first training set and the second training set are different.
Optionally, before selecting a second training set from the sample training set and training a second regression tree according to the second training set, a preset GBDT algorithm, and a second parameter, the method further includes:
predicting the second training set through the first regression tree to obtain a prediction result corresponding to the second training set;
and determining a residual error between the actual result and the predicted result according to the actual result of the second training set and the predicted result corresponding to the second training set, and determining the residual error as the second parameter.
Optionally, the method further comprises:
constructing a plurality of training sets based on the sample data set;
respectively selecting from a set of machine learning algorithms, a set of hyper-parameter combinations and the plurality of training sets, and training to obtain a plurality of candidate models, wherein one candidate model is determined by one machine learning algorithm, one group of hyper-parameters and one training set;
respectively evaluating at least one GBDT model and the candidate models, and selecting a plurality of models meeting preset conditions;
and integrating the plurality of models meeting the preset conditions to obtain an integrated composite model.
Optionally, before evaluating at least one of the GBDT models and the candidate models, respectively, the method further comprises:
sampling positive sample data in the sample data set to construct a positive sample evaluation subset, sampling unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combining the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set;
the evaluating at least one of the GBDT models and the candidate models, respectively, and the selecting a plurality of models that meet a predetermined condition includes:
and respectively evaluating the GBDT model and the candidate models according to the evaluation set to obtain an evaluation result corresponding to each model, and selecting a plurality of models meeting preset conditions from the evaluation results.
Optionally, the sample data set includes: target object data during target object recommendation, transaction data of a stolen bank card during bank card leakage point detection, image/text data during image/text classification, and traffic data during malicious traffic detection;
when the sample data set is target object data, recommended target object data in the target object data is positive sample data, and non-recommended target object data is negative sample data; when the sample data set is transaction data of a stolen bank card, the transaction data of a known leakage point in the transaction data is positive sample data, and the transaction data of an unknown leakage point is negative sample data; when the sample data set is image/text data, the classified image/text data is positive sample data, and the unclassified image/text data is negative sample data; when the sample data set is traffic data, known malicious traffic data in the traffic data is positive sample data, and unknown traffic data is negative sample data.
In another aspect, the present invention further provides an apparatus for constructing a gradient boosting decision tree GBDT model, wherein the apparatus includes:
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a sample data set, and the sample data set comprises positive sample data with positive labels and unlabeled sample data without labels;
the construction unit is used for constructing a positive sample training subset based on positive sample data in the sample data set when each regression tree of the GBDT model is trained, sampling unlabeled sample data in the sample data set to construct a negative sample training subset, combining the positive sample training subset and the negative sample training subset to obtain a training set of the current regression tree, training the current regression tree based on the training set of the current regression tree, and constructing the gradient boosting decision tree GBDT model from the trained regression trees.
Optionally, the building unit includes:
a first constructing module, configured to construct a positive sample training subset from all positive sample data in the sample data set, or to construct a positive sample training subset from part of the positive sample data in the sample data set.
Optionally, the building unit includes:
the second construction module is used for, in a service scenario where the positive-to-negative sample ratio is known, making the data volume of the negative sample training subset x times the data volume of the positive sample training subset when the estimated ratio of negative sample data volume to positive sample data volume is x; and, when the positive-to-negative sample ratio is unknown, making the data volume of the negative sample training subset 1 to 2 times the data volume of the positive sample training subset.
Optionally, the building unit includes:
and the training module is used for carrying out iterative training through the training set of the current regression tree and a preset GBDT algorithm to obtain each regression tree corresponding to each iterative training.
Optionally, the training module includes:
the first training submodule is used for acquiring a first training set from the sample data set, and training a first regression tree according to the first training set, a preset GBDT algorithm and a first parameter, wherein the first parameter is the mean value of actual results of all sample data in the sample data set;
and the second training submodule is used for selecting a second training set from the sample training set after the first regression tree is obtained through training, and training a second regression tree according to the second training set, a preset GBDT algorithm and a second parameter, where the second parameter is determined from the prediction result that the first regression tree gives for the sample data in the second training set and the actual result of that sample data; the first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the negative sample training subsets contained in the first training set and the second training set are different.
Optionally, the training module further includes:
the prediction submodule is used for predicting the second training set through the first regression tree to obtain a prediction result corresponding to the second training set;
and the determining submodule is used for determining a residual error between the actual result and the predicted result according to the actual result of the second training set and the predicted result corresponding to the second training set, and determining the residual error as the second parameter.
Optionally, the apparatus further comprises:
a training set construction unit for constructing a plurality of training sets based on the sample data set;
the training unit is used for respectively selecting from the set of machine learning algorithms, the set of hyper-parameter combinations and the plurality of training sets, and training to obtain a plurality of candidate models, wherein one candidate model is determined by one machine learning algorithm, one group of hyper-parameters and one training set;
the evaluation unit is used for respectively evaluating at least one GBDT model and the candidate models and selecting a plurality of models which accord with preset conditions;
and the integration unit is used for integrating the plurality of models meeting the preset conditions to obtain an integrated composite model.
Optionally, the apparatus further comprises:
the evaluation set building unit is used for sampling positive sample data in the sample data set to build a positive sample evaluation subset, sampling unmarked sample data in the sample data set to build a negative sample evaluation subset, and combining the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set;
the evaluation unit is specifically configured to evaluate the GBDT model and the candidate models according to the evaluation set, respectively, to obtain an evaluation result corresponding to each model, and select a plurality of models that meet a preset condition from the evaluation results.
Optionally, the sample data set includes: target object data during target object recommendation, transaction data of a stolen bank card during bank card leakage point detection, image/text data during image/text classification, and traffic data during malicious traffic detection;
when the sample data set is target object data, recommended target object data in the target object data is positive sample data, and non-recommended target object data is negative sample data; when the sample data set is transaction data of a stolen bank card, the transaction data of a known leakage point in the transaction data is positive sample data, and the transaction data of an unknown leakage point is negative sample data; when the sample data set is image/text data, the classified image/text data is positive sample data, and the unclassified image/text data is negative sample data; when the sample data set is traffic data, known malicious traffic data in the traffic data is positive sample data, and unknown traffic data is negative sample data.
In another aspect, the present invention provides a method for implementing target object recommendation, including:
acquiring target object data to be predicted;
obtaining a gradient boosting decision tree GBDT model according to the method of any one of the first aspect;
executing a target object recommendation task by using the obtained gradient boosting decision tree GBDT model;
wherein the target object is a commodity or a service provided through the internet.
On the other hand, the invention provides a method for realizing the detection of the leakage point of a bank card, which comprises the following steps:
acquiring transaction data of a stolen bank card for which a leakage point is to be detected;
according to the method of any one of the first aspect, a gradient boosting decision tree, GBDT, model is derived;
executing a detection task of a leakage point of the bank card by using the obtained GBDT model;
the transaction data set comprises transaction time of each transaction of each bank card and terminal equipment identification during the transaction.
In another aspect, an embodiment of the present invention provides a method for implementing image/text classification, including:
acquiring image/text data to be predicted;
according to the method of any one of the first aspect, a gradient boosting decision tree, GBDT, model is derived;
and performing an image/text classification task by using the obtained gradient boosting decision tree GBDT model.
On the other hand, an embodiment of the present invention further provides a method for detecting malicious traffic, including:
acquiring traffic data to be detected;
according to the method of any one of the first aspect, a gradient boosting decision tree, GBDT, model is derived;
and executing a detection task on the traffic data to be detected by using the obtained gradient boosting decision tree GBDT model.
In another aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more computing devices, implements any of the methods described above.
In another aspect, the present invention provides a system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which when executed by the one or more computing devices, causes the one or more computing devices to carry out any of the methods described above.
By means of the above technical scheme, the method and the device for constructing a gradient boosting decision tree GBDT model provided by the invention can acquire a sample data set and then, when each regression tree of the GBDT model is trained, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the two subsets to obtain the training set of the current regression tree, train the current regression tree on that training set, and construct the gradient boosting decision tree GBDT model from the trained regression trees. Compared with the prior art, the invention trains each regression tree of the GBDT model on a positive sample training subset and a negative sample training subset drawn from the sample data set. Because the training set used for each regression tree is drawn anew from the sample data, differences among the trees are guaranteed, the overfitting problem caused by the existing training mode is avoided, and the accuracy of the trained gradient boosting decision tree GBDT model is improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for constructing a GBDT model according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for constructing a GBDT model according to an embodiment of the present invention;
FIG. 3 is a block diagram of a GBDT modeling apparatus for constructing gradient boosting decision trees according to an embodiment of the present invention;
FIG. 4 is a block diagram of another GBDT modeling apparatus for constructing gradient boosting decision trees according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating components of a system for implementing target object recommendation in accordance with an embodiment of the present invention;
FIG. 6 is a block diagram illustrating a system for detecting leakage points of a bank card according to an embodiment of the present invention;
FIG. 7 is a block diagram illustrating components of a system for implementing image/text classification in accordance with an embodiment of the present invention;
FIG. 8 is a block diagram illustrating components of a system for implementing malicious traffic detection according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides a method for constructing a gradient boosting decision tree GBDT model, which can be applied to bank card leakage point detection, commodity and service recommendation, image or text classification, and malicious traffic detection. In these scenarios, the labeled positive sample data form only a small part of the sample data, and the majority are unlabeled. By constructing a gradient boosting decision tree GBDT model with high accuracy, the method provided by the embodiment of the invention aims to solve the low accuracy caused by the overfitting of models trained in the existing way. The specific steps, as shown in FIG. 1, are as follows:
101. and acquiring a sample data set.
The sample data set comprises positive sample data with positive labels and unlabeled sample data without labels. In many practical cases, data is easily obtained, but labeling the data requires a high expenditure of human and material resources. For example, in malicious traffic detection, bank leakage point detection, music recommendation and so on, often only a small amount of positive sample data (known malicious traffic, known leakage points, music the user likes) and a large amount of unlabeled sample data are available. In this case, PU learning may be selected for model training, so that the data of the scenario can be analyzed and automatically classified by the trained model.
PU learning (Positive and Unlabeled learning) means learning from positive and unlabeled samples, i.e. training a classification model when only positive sample data and unlabeled sample data are available. Conventionally, negative samples are selected from the unlabeled samples to train a classifier; however, with only positive samples available, it is difficult to select models and parameters that yield reliable results.
Based on this, in view of the problems in the prior art, in the embodiment of the present invention, a sample data set may first be obtained according to the method in this step, where the sample data set includes positive sample data with positive labels and unlabeled sample data without labels. For example, in practical applications, when the task to be executed is malicious traffic detection, the positive sample data in the sample data set may be understood as known malicious traffic data, and the unlabeled sample data as traffic data that has not yet been examined.
102. When each regression tree of the GBDT model is trained, construct a positive sample training subset based on positive sample data in the sample data set, sample unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on that training set, and then construct the gradient boosting decision tree GBDT model from the trained regression trees.
GBDT (Gradient Boosting Decision Tree), also called MART (Multiple Additive Regression Trees), is an iterative decision tree algorithm composed of multiple decision trees; the conclusions of all the trees are accumulated to produce the final prediction result.
Therefore, based on the characteristics of the GBDT algorithm, in the embodiment of the present invention, each regression tree is trained on its own training set when the gradient boosting decision tree GBDT model is trained. Specifically, because the number of positive samples available during training is often small, existing decision tree models tend to overfit in training. To address this, a training set for the current regression tree is constructed separately for each regression tree to be trained, from the positive sample data and the unlabeled sample data in the sample data set: a positive sample training subset is constructed from the positive sample data, a sampling operation is performed on the unlabeled sample data to obtain a corresponding negative sample training subset, and the two subsets are combined to give the training set required by the regression tree to be trained, i.e. the training set of the current regression tree.
After the training set of the current regression tree is determined, the corresponding tree can be trained through the GBDT algorithm, and the iterations proceed in turn until all regression trees corresponding to the sample training set are obtained; these regression trees are combined to obtain the gradient boosting decision tree GBDT model corresponding to the sample training set.
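The per-tree sampling-and-boosting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the use of NumPy and scikit-learn's DecisionTreeRegressor, the names (pu_gbdt_fit, neg_ratio and so on), the 1.5-times negative-subset ratio and the shrinkage factor learning_rate are all choices introduced here for clarity.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def pu_gbdt_fit(X_pos, X_unlabeled, n_trees=50, neg_ratio=1.5,
                learning_rate=0.1, max_depth=3, seed=0):
    """Train a GBDT under PU learning: each tree sees the positive subset plus
    a freshly sampled negative subset drawn from the unlabeled pool."""
    rng = np.random.default_rng(seed)
    n_neg = int(neg_ratio * len(X_pos))   # 1-2x positives when the true ratio is unknown
    trees = []
    # First parameter: mean of the actual results (labels) over a training set.
    base = len(X_pos) / (len(X_pos) + n_neg)
    for _ in range(n_trees):
        # Resampling the negative subset per tree keeps the trees different.
        idx = rng.choice(len(X_unlabeled), size=n_neg, replace=False)
        X = np.vstack([X_pos, X_unlabeled[idx]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(n_neg)])
        # Current additive prediction of the ensemble built so far.
        pred = base + sum(learning_rate * t.predict(X) for t in trees)
        residual = y - pred               # second parameter: actual minus predicted
        trees.append(DecisionTreeRegressor(max_depth=max_depth).fit(X, residual))
    return base, trees

def pu_gbdt_predict(base, trees, X, learning_rate=0.1):
    return base + sum(learning_rate * t.predict(X) for t in trees)
```

Recomputing the ensemble prediction inside the loop is quadratic in the number of trees; it is kept that way here because each tree's training set changes, so predictions cached from the previous round would not apply.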
In order to better explain the method for constructing a gradient boosting decision tree GBDT model provided by the embodiment of the present invention, another embodiment is provided herein to refine and expand the steps in the above embodiment, specifically, as shown in fig. 2, including:
201. and acquiring a sample data set.
The sample data set comprises positive sample data with positive labels and unlabeled sample data without labels.
Specifically, in practical applications, the sample data set may include: target object data in a target object recommendation scenario; transaction data of a stolen bank card in a bank card leakage point detection scenario; image/text data in an image/text classification scenario; and traffic data in a malicious traffic detection scenario.
Depending on the practical application scenario, the positive sample data and the unlabeled sample data in each sample data set also change accordingly:
for example, when the sample data set is target object data, the recommended target object data in the target object data is positive sample data, and the non-recommended target object data is negative sample data;
when the sample data set is transaction data of a stolen bank card, the transaction data of a known leakage point in the transaction data is positive sample data, and the transaction data of an unknown leakage point is negative sample data;
when the sample data set is image/text data, the classified image/text data is positive sample data, and the unclassified image/text data is negative sample data;
when the sample data set is traffic data, known malicious traffic data in the traffic data is positive sample data, and unknown traffic data is negative sample data.
202. When each regression tree of the GBDT model is trained, construct a positive sample training subset based on positive sample data in the sample data set, sample unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on that training set, and then construct the gradient boosting decision tree GBDT model from the trained regression trees.
Specifically, when constructing the positive sample training subset, i.e. in the process of constructing a positive sample training subset based on the positive sample data in the sample data set, the construction may take all of the positive sample data in the sample data set, or only part of it.

Meanwhile, when a negative sample training subset is constructed by sampling the unlabeled sample data, the data volume of the negative sample training subset is set to 1 to 2 times the data volume of the positive sample training subset. For example, when the positive sample training subset contains 1000 pieces of data, the negative sample training subset may contain 1000 to 2000 pieces of data. In this way, a training set containing different sample data is used when each regression tree of the GBDT model is trained, which guarantees the difference between the trees. Compared with using less than 1 times the positive data volume, 1 to 2 times ensures enough sample data in the training set, so the accuracy of the trained model is higher; compared with using more than 2 times, it avoids the longer training time that a larger number of samples would cause. Setting the data volume of the negative sample training subset to 1 to 2 times the data volume of the positive sample training subset therefore reduces the time consumed in model training while maintaining model accuracy, improving training efficiency.

It should be noted that selecting the positive and negative sample training subsets in the above manner applies to service scenarios in which the positive-to-negative sample ratio is unknown. In the other case, when the ratio of positive to negative samples in the sample data is known for a specific implementation scenario, or the overall ratio can be determined from historical data, the positive and negative sample training subsets can be determined by the known ratio: when the estimated ratio of negative sample data volume to positive sample data volume is x in a service scenario with a known positive-to-negative sample ratio, the data volume of the negative sample training subset is set to x times the data volume of the positive sample training subset.
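The subset-sizing rule can be written compactly as below; the helper name and the 1.5-times default chosen for the unknown-ratio case are illustrative assumptions within the 1-to-2-times range stated above.

```python
def negative_subset_size(n_positive, x=None):
    """Size of the negative training subset; x is the known ratio of negative
    to positive sample data, or None when the ratio is unknown."""
    if x is not None:
        return int(x * n_positive)   # known ratio: x times the positive subset
    return int(1.5 * n_positive)     # unknown ratio: pick a value within 1x to 2x
```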
After the training set of the current regression tree is determined, iterative training is performed through the training set of the current regression tree and a preset GBDT algorithm to obtain the regression tree corresponding to each round of iterative training. Specifically, each regression tree may be iteratively trained as follows:
First, the first regression tree is trained. Specifically, a first training set is obtained from the sample data set, and the first regression tree is trained according to the first training set, a preset GBDT algorithm and a first parameter, where the first parameter is the mean value of the actual results of all sample data in the sample data set.
Subsequent regression trees are then trained on the basis of the first regression tree. Specifically, after the first regression tree is obtained through training, a second training set is selected from the sample training set, and a second regression tree is trained according to the second training set, the preset GBDT algorithm and a second parameter.
The second parameter is determined from the prediction result that the first regression tree gives for the sample data in the second training set and the actual result of that sample data. The first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the two training sets contain different negative sample training subsets. That is, when training the first regression tree, the mean of the actual results may be calculated from the sample training set and determined as the first parameter, and the corresponding first tree, i.e. the first regression tree, is then trained according to this first parameter.
After the first regression tree is trained, the second parameter may be determined by means of the first regression tree. Specifically: first, the second training set is predicted through the first regression tree to obtain the prediction result corresponding to the second training set; then, according to the actual result of the second training set and the corresponding prediction result, the residual between the actual result and the prediction result is determined, and this residual is taken as the second parameter. In the embodiment of the present invention, the first regression tree may be understood as the root regression tree; after it is trained, the subsequent regression trees are iteratively trained in turn, and the second parameter in each round of iterative training is the residual obtained by computing the difference between the actual results of the current training set and the predictions made on it by the regression tree obtained in the previous round.
For example, if the actual results of sample data A, sample data B and sample data C in the sample training set are 6, 11 and 4 respectively, then according to this step the first parameter used when training the first regression tree is the mean of the three actual results, namely 7, so the first regression tree is trained with the first parameter 7. After the first regression tree is determined, sample data A can be predicted through the first regression tree; if the prediction result is 5, then, because the actual result of sample data A is 6, the difference between the actual result and the prediction result gives a residual of 1, and this residual of 1 is used as the second parameter in the training of the second regression tree.
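Restated as a few lines of Python, with the values 6, 11, 4 and the prediction 5 taken directly from the example above:

```python
actual = [6, 11, 4]                              # actual results of samples A, B, C
first_parameter = sum(actual) / len(actual)      # mean of the actual results = 7.0
prediction_for_a = 5                             # first regression tree's prediction for A
second_parameter = actual[0] - prediction_for_a  # residual 6 - 5 = 1
```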
203. And constructing a plurality of training sets based on the sample data set.
In the actual PU-learning process, the number of positive sample data among the sample data is small, and in order to obtain a more accurate model, the obtained GBDT model can be supplemented with models built from different algorithms. In addition, different models can be trained with different algorithms and training sets, so in this step the training sets used to train the other, subsequent models can be constructed from the sample data set.
Specifically, the training sets may be constructed as follows: first, a positive sample training subset is constructed based on at least part of the positive sample data in the sample data set, and multiple sampling operations are performed on the unlabeled sample data in the sample data set to construct multiple negative sample training subsets; the positive sample training subset is then combined with each negative sample training subset to obtain multiple training sets. Of course, when constructing the positive sample training subsets for the training sets, a single positive sample training subset may be constructed as described above, or several positive sample subsets may be extracted from the sample data set, specifically: first, multiple positive sample training subsets are constructed based on at least part of the positive sample data in the sample data set, and multiple sampling operations are performed on the unlabeled sample data in the sample data set to construct multiple negative sample training subsets; then each positive sample training subset is combined with a negative sample training subset to obtain the multiple training sets, as in the sketch below.
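A minimal sketch of the single-positive-subset variant follows, under the same assumptions as the earlier sketch (NumPy feature arrays; the name build_training_sets is illustrative):

```python
import numpy as np

def build_training_sets(X_pos, X_unlabeled, k, n_neg, seed=0):
    """Combine the positive subset with k independently sampled negative subsets,
    yielding k training sets that differ only in their negative part."""
    rng = np.random.default_rng(seed)
    sets = []
    for _ in range(k):
        idx = rng.choice(len(X_unlabeled), size=n_neg, replace=False)
        X = np.vstack([X_pos, X_unlabeled[idx]])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(n_neg)])
        sets.append((X, y))
    return sets
```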
204. And respectively selecting from the set of machine learning algorithms, the set of hyper-parameter combinations and the plurality of training sets, and training to obtain a plurality of candidate models.
Specifically, a machine learning algorithm may be selected from a preset set of machine learning algorithms, and a group of hyper-parameters may be obtained from a set of hyper-parameter combinations; a candidate model is then determined by a machine learning algorithm together with a selected group of hyper-parameters and a corresponding one of the multiple training sets. That is, one machine learning algorithm, one group of hyper-parameters and one training set determine one candidate model.
For example, suppose the set of machine learning algorithms is [algorithm 1, algorithm 2, algorithm 3], the set of hyper-parameter combinations is [hyper-parameter combination 1, hyper-parameter combination 2, ..., hyper-parameter combination 10], and the training sets are training set 1, training set 2, ..., training set 8. Then selecting "algorithm 1 + hyper-parameter combination 1 + training set 1" determines one candidate decision tree model, selecting "algorithm 2 + hyper-parameter combination 1 + training set 1" determines another candidate decision tree model, selecting "algorithm 1 + hyper-parameter combination 2 + training set 1" determines another, selecting "algorithm 1 + hyper-parameter combination 1 + training set 2" determines yet another, and so on.
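Such a grid is a plain Cartesian product, as sketched below; the two scikit-learn estimators and the hyper-parameter values are placeholders assumed for illustration, not an algorithm set prescribed by the patent.

```python
from itertools import product
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

algorithms = [DecisionTreeClassifier, RandomForestClassifier]  # "algorithm 1", "algorithm 2"
hyper_combos = [{"max_depth": 3}, {"max_depth": 6}]            # hyper-parameter combinations

def train_candidates(training_sets):
    """One fitted candidate model per (algorithm, hyper-parameter combination, training set)."""
    return [algo(**params).fit(X, y)
            for algo, params, (X, y) in product(algorithms, hyper_combos, training_sets)]
```

With 2 algorithms, 2 hyper-parameter combinations and 8 training sets, this yields 32 candidate models.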
205. Sampling positive sample data in the sample data set to construct a positive sample evaluation subset, sampling unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combining the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set.
Multiple candidate models are obtained in the preceding steps, and their accuracy differs, so in the embodiment of the present invention the candidate models need to be evaluated in order to obtain comparatively accurate models. Therefore, when an evaluation set is constructed based on the sample data set, the following may specifically be performed: sample the positive sample data in the sample data set to construct a positive sample evaluation subset, sample the unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combine the positive sample evaluation subset and the negative sample evaluation subset to obtain the evaluation set. In addition, in order to further improve the accuracy of the evaluation result, multiple evaluation sets may be constructed in this step, so that each candidate model is subsequently evaluated multiple times with the multiple evaluation sets and the comprehensive evaluation effect is determined from the multiple evaluation results. In that case, constructing evaluation sets based on the sample data set may specifically be: construct multiple evaluation sets based on the sample data set, where each evaluation set includes positive sample data and unlabeled sample data serving as negative sample data.
206. And respectively evaluating the GBDT model and the candidate models according to the evaluation set to obtain an evaluation result corresponding to each model, and selecting a plurality of models meeting preset conditions from the evaluation results.
When multiple evaluation sets are constructed, the process of selecting models that meet the preset conditions according to the evaluation results may proceed as follows: first, each candidate model is evaluated against each of the multiple evaluation sets under the preset evaluation conditions, yielding multiple evaluation results; then the multiple evaluation results of each candidate model are fused, and the fused final evaluation result corresponding to the candidate model is taken as its actual evaluation result.
It should be noted that, since different preset evaluation conditions directly influence the evaluation manner and the evaluation result, the evaluation results also differ under different preset evaluation conditions. For example, when the preset evaluation condition is the maximum margin method, the evaluation result corresponding to each candidate model is the classification margin of that model's predictions on the evaluation set; when the preset evaluation condition is the computation of an AUC value, the evaluation result corresponding to each candidate model is that model's AUC value on the evaluation set. The AUC value can be understood as a probability: when a positive sample and a negative sample are chosen at random, the probability that the current classification algorithm ranks the positive sample ahead of the negative sample according to the computed scores is the AUC value. The larger the AUC value, the more likely the current classification model is to rank positive samples ahead of negative samples and hence to classify well, so the classification effect of the model can be determined more accurately.
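This pairwise reading of the AUC can be computed directly, as in the sketch below; counting tied scores as one half is a common convention assumed here.

```python
import numpy as np

def auc_pairwise(scores_pos, scores_neg):
    """Probability that a randomly chosen positive sample is scored above a
    randomly chosen negative sample (ties counted as half)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()
```

For example, auc_pairwise([0.9, 0.8], [0.4, 0.85]) returns 0.75, since three of the four positive-negative pairs are ranked correctly.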
207. And integrating the plurality of models meeting the preset conditions to obtain an integrated composite model.
In order to further ensure the accuracy of the obtained model, the selected models need to be integrated. The candidate models whose evaluation results meet the preset condition are often several in number, and their accuracies differ. In this case, the process may be as follows: assign each selected candidate model a corresponding weight according to its evaluation result, and integrate the selected candidate models according to these weights.
Therefore, by integrating the models that meet the preset conditions, the final composite model can be obtained; on the basis of the obtained GBDT model, this further alleviates the overfitting problem and gives the resulting model better prediction accuracy.
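A sketch of this evaluation-weighted integration is given below. The patent does not fix a weighting formula, so normalizing the evaluation scores (for example AUC values) into weights, and assuming scikit-learn-style classifiers with predict_proba, are illustrative choices.

```python
import numpy as np

def integrate_models(models, eval_scores, X):
    """Weight each selected model by its evaluation result and average the predictions."""
    w = np.asarray(eval_scores, dtype=float)
    w = w / w.sum()                                    # better-evaluated models weigh more
    preds = np.stack([m.predict_proba(X)[:, 1] for m in models])
    return w @ preds                                   # composite score per sample
```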
Further, the practical significance of the method for constructing the GBDT model lies not only in obtaining a more accurate model, but also in applying it to real scenarios to solve real problems, for example in target object recommendation, bank card leakage point detection, image/text classification and malicious traffic detection. The process of solving such problems with the above methods can therefore be illustrated by the following example.
First, the data to be predicted is obtained, where the data to be predicted may include: image/text data to be classified, transaction data of a stolen bank card for which a leakage point is to be detected, target object data to be predicted, and traffic data to be detected. The data to be predicted differ according to the application scenario.
Then, according to the method described in any of the previous embodiments, the GBDT model is trained to obtain the GBDT model. Specifically, the process may be: obtain a sample data set; when each regression tree of the GBDT model is trained, construct a positive sample training subset based on the positive sample data in the sample data set, sample the unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, and train the current regression tree based on that training set; then construct the gradient boosting decision tree GBDT model from the trained regression trees. For example, when the method is executed in a malicious traffic scenario, the sample data set acquired in this step is the traffic data for malicious traffic detection, where known malicious traffic data in the traffic data is positive sample data and unknown traffic data is negative sample data.
Finally, the prediction task is executed using the obtained gradient boosting decision tree GBDT model, where the prediction task corresponds to the data to be predicted acquired in the first step. For example, when the data to be predicted is the traffic data to be detected in a malicious traffic detection scenario, the prediction task executed with the gradient boosting decision tree GBDT model is the malicious traffic detection task.
In addition, the above method can likewise be applied to target object recommendation scenarios. As an implementation of the method for constructing a gradient boosting decision tree GBDT model, an embodiment of the present invention provides a device for constructing a gradient boosting decision tree GBDT model, mainly intended to alleviate the low model accuracy caused by overfitting and to improve the accuracy of the trained GBDT model. For ease of reading, details from the foregoing method embodiments are not repeated in this device embodiment, but it should be clear that the device in this embodiment can correspondingly implement all the contents of the foregoing method embodiments. As shown in fig. 3, the device specifically includes:
the acquiring unit 31 may be configured to acquire a sample data set, where the sample data set includes positive sample data with a positive tag and unlabeled sample data without a tag;
the constructing unit 32 may be configured to, when each regression tree of the GBDT model is trained, construct a positive sample training subset based on the positive sample data in the sample data set acquired by the acquiring unit 31, sample the unlabeled sample data in that sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain the training set of the current regression tree, train the current regression tree based on that training set, and construct the gradient boosting decision tree GBDT model from the trained regression trees.
Further, as shown in fig. 4, the building unit 32 includes:
the first constructing module 321 may be configured to take all positive sample data in the sample data set to construct a positive sample training subset, or may be configured to take part of the positive sample data in the sample data set to construct a positive sample training subset.
Further, as shown in fig. 4, the building unit 32 includes:
the second constructing module 322 may be configured to, when the ratio of the predicted negative sample data size to the predicted positive sample data size is x in a service scenario where the ratio of the positive sample data size to the negative sample data size is known, make the data size of the negative sample training subset x times the data size of the positive sample training subset; and when the proportion of the positive sample and the negative sample is unknown, enabling the data volume of the negative sample training subset to be 1 to 2 times of the data volume of the positive sample training subset.
Further, as shown in fig. 4, the building unit 32 includes:
the training module 323 may be configured to perform iterative training on the training set of the current regression tree obtained by the first building module and the second building module and a preset GBDT algorithm to obtain each regression tree corresponding to each iterative training.
Further, as shown in fig. 4, the training module 323 includes:
the first training submodule 3231 may be configured to obtain a first training set from the sample data set, and train a first regression tree according to the first training set, a preset GBDT algorithm, and a first parameter, where the first parameter is a mean value of actual results of all sample data in the sample data set;
the second training submodule 3232 may be configured to, after the first training submodule 3231 obtains the first regression tree through training, select a second training set from the sample training set, and train a second regression tree according to the second training set, a preset GBDT algorithm, and a second parameter, where the second parameter is determined according to a prediction result determined by sample data in the second training set according to the first regression tree and an actual result of the sample data in the second training set, the first training set and the second training set are both composed of a positive sample training subset and a negative sample training subset, and the negative sample training subsets included in the first training set and the second training set are different.
Further, as shown in fig. 4, the training module 323 further includes:
the predicting submodule 3233 may be configured to predict the second training set through the first regression tree obtained by the first training submodule 3231, so as to obtain a prediction result corresponding to the second training set;
the determining submodule 3234 may be configured to determine, according to an actual result of the second training set and a predicted result of the predicting submodule 3233 corresponding to the second training set, a residual between the actual result and the predicted result, and determine the residual as the second parameter, so that the second training submodule 3232 trains the second regression tree according to the second parameter.
Further, as shown in fig. 4, the apparatus further includes:
a training set constructing unit 33, configured to construct a plurality of training sets based on the sample data set acquired by the acquiring unit 31;
the training unit 34 may be configured to select from a set of machine learning algorithms, a set of hyper-parameter combinations, and a plurality of training sets obtained by the training set constructing unit 33, respectively, and train to obtain a plurality of candidate models, where one machine learning algorithm, one set of hyper-parameters, and one training set determine one candidate model;
the evaluation unit 35 may be configured to evaluate at least one GBDT model constructed by the construction unit 32 and a plurality of candidate models trained by the training unit 34, respectively, and select a plurality of models meeting a preset condition;
the integrating unit 36 may be configured to integrate the plurality of models that meet the preset condition after being evaluated by the evaluating unit 35 to obtain an integrated composite model.
Further, as shown in fig. 4, the apparatus further includes:
the evaluation set constructing unit 37 may be configured to sample positive sample data in the sample data set acquired by the acquiring unit 31 to construct a positive sample evaluation subset, sample unlabeled sample data in the sample data set to construct a negative sample evaluation subset, and combine the positive sample evaluation subset and the negative sample evaluation subset to obtain an evaluation set;
the evaluation unit 35 may be specifically configured to evaluate the at least one GBDT model and the plurality of candidate models respectively according to the evaluation set obtained by the evaluation set construction unit 37, obtain an evaluation result corresponding to each model, and select a plurality of models meeting a preset condition from the evaluation results.
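For illustration only, a minimal Python sketch of the evaluation set construction performed by the evaluation set constructing unit 37 follows; the sampling sizes and the treatment of sampled unlabeled records as negatives for scoring purposes are assumptions of this sketch.

import numpy as np

def build_evaluation_set(positives, unlabeled, n_pos=None,
                         neg_pos_ratio=1.0, seed=1):
    """Sample a positive evaluation subset and a negative evaluation
    subset (drawn from the unlabeled pool) and merge them."""
    rng = np.random.default_rng(seed)
    n_pos = n_pos or len(positives)
    pos_idx = rng.choice(len(positives), size=n_pos, replace=False)
    n_neg = min(int(neg_pos_ratio * n_pos), len(unlabeled))
    neg_idx = rng.choice(len(unlabeled), size=n_neg, replace=False)
    X = np.vstack([positives[pos_idx], unlabeled[neg_idx]])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return X, y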
Further, as shown in fig. 4, the sample data set may be: target object data in target object recommendation, transaction data of stolen bank cards in bank card leakage point detection, image/text data in image/text classification, or traffic data in malicious traffic detection;
when the sample data set is target object data, recommended target object data in the target object data is positive sample data, and non-recommended target object data is unlabeled sample data; when the sample data set is transaction data of stolen bank cards, the transaction data with a known leakage point is positive sample data, and the transaction data with an unknown leakage point is unlabeled sample data; when the sample data set is image/text data, the classified image/text data is positive sample data, and the unclassified image/text data is unlabeled sample data; when the sample data set is traffic data, the known malicious traffic data in the traffic data is positive sample data, and the undetected traffic data is unlabeled sample data.
Based on the method and apparatus described in the foregoing embodiments, and in combination with a specific application scenario, in the process of recommending a target object (such as music recommendation or commodity recommendation), an embodiment of the present invention further provides a system for implementing target object recommendation, so as to implement a recommendation function for the target object. Specifically, as shown in fig. 5, the system includes:
a target object data acquisition unit 51 operable to acquire target object data to be predicted;
the device 52 for constructing a gradient boosting decision tree GBDT model may be configured to obtain a gradient boosting decision tree GBDT model based on a sample data set, where the sample data set for training the gradient boosting decision tree GBDT model is a data set related to the target object, in which data of target objects selected by users is positive sample data and data of target objects not selected by users is unlabeled sample data; in an embodiment of the present invention, the device 52 for constructing a gradient boosting decision tree GBDT model may specifically be as shown in fig. 3 or fig. 4;
an executing unit 53, configured to execute a target object recommendation task by using the gradient boosting decision tree GBDT model obtained by the device 52 for constructing a gradient boosting decision tree GBDT model; wherein the target object is a commodity or a service provided through the internet.
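For illustration only, the following Python sketch strings the three units of fig. 5 together, reusing the train_pu_gbdt and gbdt_predict sketches given earlier in this description; the score threshold of 0.5 is an assumption. The systems of figs. 6-8 below follow the same acquire-construct-execute pattern.

import numpy as np

def recommend(candidate_items, positives, unlabeled, threshold=0.5):
    # device 52: obtain the GBDT model from the target object sample data
    f0, trees = train_pu_gbdt(positives, unlabeled)
    # executing unit 53: score the target object data to be predicted
    scores = gbdt_predict(f0, trees, candidate_items)
    # recommend the items whose score exceeds the (assumed) threshold
    return np.flatnonzero(scores >= threshold)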
In addition, in combination with a specific application scenario, in the process of detecting leakage points of bank cards, an embodiment of the present invention further provides a system for detecting bank card leakage points, so as to implement a function of detecting leakage points in bank card transaction data. Specifically, as shown in fig. 6, the system includes:
the transaction data acquisition unit 61 may be configured to acquire transaction data of a stolen bank card whose leakage point is to be detected;
a device 62 for constructing a gradient boosting decision tree GBDT model, configured to obtain the gradient boosting decision tree GBDT model based on a sample data set, where the sample data set for training the gradient boosting decision tree GBDT model is a transaction data set of stolen bank cards, in which the transaction data marked with a leakage point is positive sample data and the transaction data not marked with a leakage point is unlabeled sample data; in an embodiment of the present invention, the device 62 for constructing a gradient boosting decision tree GBDT model may specifically be as shown in fig. 3 or fig. 4;
an executing unit 63, configured to execute a bank card leakage point detection task by using the gradient boosting decision tree GBDT model obtained by the device 62 for constructing a gradient boosting decision tree GBDT model;
the transaction data set includes the transaction time of each transaction of each bank card and the identifier of the terminal device used in the transaction.
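For illustration only, a short pandas sketch of turning the two fields named above (per-transaction time and terminal device identifier) into per-card features is given below; the column names card_id, txn_time and terminal_id are hypothetical placeholders, and txn_time is assumed to be a datetime column.

import pandas as pd

def transaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-card features from transaction time and terminal id.

    Expects columns: card_id, txn_time (datetime64), terminal_id."""
    feats = df.groupby("card_id").agg(
        n_txn=("txn_time", "size"),              # number of transactions
        n_terminals=("terminal_id", "nunique"),  # distinct terminals used
        first_txn=("txn_time", "min"),
        last_txn=("txn_time", "max"),
    )
    # activity span in hours between first and last transaction
    feats["span_hours"] = (
        feats["last_txn"] - feats["first_txn"]
    ).dt.total_seconds() / 3600.0
    return feats.drop(columns=["first_txn", "last_txn"])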
In addition, in combination with a specific application scenario, in the process of classifying texts and images, an embodiment of the present invention further provides a system for implementing image/text classification, so as to implement a function of classifying texts and images. Specifically, as shown in fig. 7, the system includes:
an image/text data acquisition unit 71 operable to acquire image/text data to be predicted;
a device 72 for constructing a gradient boosting decision tree GBDT model, configured to obtain a gradient boosting decision tree GBDT model based on a sample data set, where the sample data set for training the gradient boosting decision tree GBDT model is an image/text data set, in which image/text data with classification labels is positive sample data and image/text data without classification labels is unlabeled sample data; in an embodiment of the present invention, the device 72 for constructing a gradient boosting decision tree GBDT model may specifically be as shown in fig. 3 or fig. 4;
the execution unit 73 may be configured to perform an image/text classification task using the gradient boosting decision tree GBDT model obtained by the apparatus 72 for constructing a gradient boosting decision tree GBDT model.
In addition, in combination with a specific application scenario, in a malicious traffic detection scenario, an embodiment of the present invention further provides a system for implementing malicious traffic detection, so as to detect malicious traffic from undetected traffic data. Specifically, as shown in fig. 8, the system includes:
a traffic data acquiring unit 81, which may be configured to acquire traffic data to be detected;
a device 82 for constructing a gradient boosting decision tree GBDT model, configured to obtain the gradient boosting decision tree GBDT model based on a sample data set, where the sample data set for training the gradient boosting decision tree GBDT model is a traffic data set, in which known malicious traffic data is positive sample data and undetected traffic data is unlabeled sample data; in an embodiment of the present invention, the device 82 for constructing a gradient boosting decision tree GBDT model may specifically be as shown in fig. 3 or fig. 4;
the execution unit 83 may be configured to perform the malicious traffic detection task using the gradient boosting decision tree GBDT model obtained by the device 82 for constructing a gradient boosting decision tree GBDT model.
Further, an embodiment of the present invention also provides a computer-readable storage medium, where the computer-readable storage medium has a computer program stored thereon, where the computer program, when executed by one or more computing devices, implements the above-mentioned method for constructing a gradient boosting decision tree GBDT model.
In addition, embodiments of the present invention also provide a system including one or more computing devices and one or more storage devices, on which is recorded a computer program, which, when executed by the one or more computing devices, causes the one or more computing devices to implement the above-mentioned method for constructing a gradient boosting decision tree GBDT model.
In summary, with the method and apparatus for constructing a GBDT model according to the embodiments of the present invention, a sample data set is first acquired. When each regression tree of the GBDT model is trained, a positive sample training subset is constructed based on the positive sample data in the sample data set, unlabeled sample data in the sample data set is sampled to construct a negative sample training subset, and the positive sample training subset and the negative sample training subset are combined to obtain the training set of the current regression tree. The current regression tree is trained based on this training set, and the GBDT model is then constructed from the resulting regression trees. Compared with the prior art, because the training set used for each regression tree is drawn anew from the sample data set, the differences among the trees are guaranteed, the overfitting problem caused by the existing training mode is avoided, and the accuracy of the trained gradient boosting decision tree GBDT model is improved.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the method and apparatus described above may refer to one another. In addition, "first", "second", and the like in the above embodiments are used to distinguish between the embodiments and do not represent the merits or demerits of any embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In addition, the memory may include a volatile memory in a computer-readable medium, such as a random access memory (RAM), and/or a nonvolatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM); the memory includes at least one memory chip.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of constructing a gradient boosting decision tree, GBDT, model, comprising:
acquiring a sample data set, wherein the sample data set comprises positive sample data with a positive label and unlabeled sample data without a label;
when each regression tree of the GBDT model is trained, a positive sample training subset is constructed based on positive sample data in the sample data set, unlabeled sample data in the sample data set is sampled to construct a negative sample training subset, the positive sample training subset and the negative sample training subset are combined to obtain a training set of the current regression tree, and the current regression tree is trained based on the training set of the current regression tree; the gradient boosting decision tree GBDT model is then constructed according to each regression tree.
2. The method of claim 1, wherein said constructing a positive sample training subset based on positive sample data in said sample data set comprises:
taking all positive sample data in the sample data set to construct a positive sample training subset;
alternatively,
taking part of the positive sample data in the sample data set to construct the positive sample training subset.
3. The method of claim 1, wherein,
when, in a service scenario where the positive-negative sample ratio is known, the ratio of the negative sample data volume to the positive sample data volume is estimated to be x, the data volume of the negative sample training subset is made x times the data volume of the positive sample training subset;
and when the ratio of positive samples to negative samples is unknown, the data volume of the negative sample training subset is made 1 to 2 times the data volume of the positive sample training subset.
4. The method of claim 1, wherein the training the current regression tree based on the training set comprises:
and performing iterative training through the training set of the current regression tree and a preset GBDT algorithm to obtain each regression tree corresponding to each iterative training.
5. The method of claim 4, wherein the iteratively training through the training set of the current regression tree and the predetermined GBDT algorithm to obtain each regression tree corresponding to each iterative training comprises:
acquiring a first training set from the sample data set, and training a first regression tree according to the first training set, a preset GBDT algorithm and a first parameter, wherein the first parameter is a mean value of actual results of all sample data in the sample data set;
after the first regression tree is obtained through training, selecting a second training set from the sample data set, and training a second regression tree according to the second training set, a preset GBDT algorithm and a second parameter, wherein the second parameter is determined according to the prediction result produced by the first regression tree for the sample data in the second training set and the actual result of that sample data, the first training set and the second training set are each composed of a positive sample training subset and a negative sample training subset, and the negative sample training subsets contained in the first training set and the second training set are different.
6. A method of implementing target object recommendation, comprising:
acquiring target object data to be predicted;
obtaining a gradient boosting decision tree GBDT model according to the method of any one of claims 1-5;
executing a target object recommendation task by using the obtained gradient boosting decision tree GBDT model;
wherein the target object is a commodity or a service provided through the internet.
7. A method for implementing bank card leakage point detection, comprising:
acquiring transaction data of a stolen bank card of a to-be-detected leakage point;
obtaining a gradient boosting decision tree GBDT model according to the method of any one of claims 1-5;
executing a bank card leakage point detection task by using the obtained gradient boosting decision tree GBDT model;
wherein the transaction data set includes the transaction time of each transaction of each bank card and the identifier of the terminal device used in the transaction.
8. A method of implementing image/text classification, comprising:
acquiring image/text data to be predicted;
obtaining a gradient boosting decision tree GBDT model according to the method of any one of claims 1-5;
and performing an image/text classification task by using the obtained gradient boosting decision tree GBDT model.
9. A method of malicious traffic detection, comprising:
acquiring traffic data to be detected;
obtaining a gradient boosting decision tree GBDT model according to the method of any one of claims 1-5;
and executing a detection task on the traffic data to be detected by using the obtained gradient boosting decision tree GBDT model.
10. An apparatus for constructing a Gradient Boosting Decision Tree (GBDT) model, comprising:
an acquisition unit, configured to acquire a sample data set, wherein the sample data set comprises positive sample data with a positive label and unlabeled sample data without a label;
a construction unit, configured to, when each regression tree of the GBDT model is trained, construct a positive sample training subset based on positive sample data in the sample data set, sample unlabeled sample data in the sample data set to construct a negative sample training subset, combine the positive sample training subset and the negative sample training subset to obtain a training set of the current regression tree, train the current regression tree based on the training set of the current regression tree, and construct the gradient boosting decision tree GBDT model according to each regression tree.
CN202210493503.1A 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device Pending CN114819186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210493503.1A CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210493503.1A CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device
CN201910526406.6A CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910526406.6A Division CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Publications (1)

Publication Number Publication Date
CN114819186A true CN114819186A (en) 2022-07-29

Family

ID=68182254

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910526406.6A Active CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device
CN202210493503.1A Pending CN114819186A (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910526406.6A Active CN110348580B (en) 2019-06-18 2019-06-18 Method and device for constructing GBDT model, and prediction method and device

Country Status (1)

Country Link
CN (2) CN110348580B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866528B (en) * 2019-10-28 2023-11-28 腾讯科技(深圳)有限公司 Model training method, energy consumption use efficiency prediction method, device and medium
CN111045716B (en) * 2019-11-04 2022-02-22 中山大学 Related patch recommendation method based on heterogeneous data
CN110824099B (en) * 2019-11-07 2022-03-04 东南大学 Method for predicting reaction performance in solid fuel chemical chain process based on GBRT
CN111177375B (en) * 2019-12-16 2023-06-02 医渡云(北京)技术有限公司 Electronic document classification method and device
CN111428930A (en) * 2020-03-24 2020-07-17 中电药明数据科技(成都)有限公司 GBDT-based medicine patient using number prediction method and system
CN111310860B (en) * 2020-03-26 2023-04-18 清华大学深圳国际研究生院 Method and computer-readable storage medium for improving performance of gradient boosting decision trees
CN111860935A (en) * 2020-05-21 2020-10-30 北京骑胜科技有限公司 Fault prediction method, device, equipment and storage medium of vehicle
CN112101678A (en) * 2020-09-23 2020-12-18 东莞理工学院 GBDT-based student personality tendency prediction method
CN112434862B (en) * 2020-11-27 2024-03-12 中国人民大学 Method and device for predicting financial dilemma of marketing enterprises
CN112581342A (en) * 2020-12-25 2021-03-30 中国建设银行股份有限公司 Method, device and equipment for evaluating aged care institution grade and storage medium
CN113326433B (en) * 2021-03-26 2023-10-10 沈阳工业大学 Personalized recommendation method based on ensemble learning
CN113343051B (en) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8499008B2 (en) * 2009-07-24 2013-07-30 Yahoo! Inc. Mixing knowledge sources with auto learning for improved entity extraction
US9262532B2 (en) * 2010-07-30 2016-02-16 Yahoo! Inc. Ranking entity facets using user-click feedback
US10354204B2 (en) * 2016-04-21 2019-07-16 Sas Institute Inc. Machine learning predictive labeling system
US10063582B1 (en) * 2017-05-31 2018-08-28 Symantec Corporation Securing compromised network devices in a network
CN107463802A (en) * 2017-08-02 2017-12-12 南昌大学 A kind of Forecasting Methodology of protokaryon protein acetylation sites
CN107909516A (en) * 2017-12-06 2018-04-13 链家网(北京)科技有限公司 A kind of problem source of houses recognition methods and system
CN108269012A (en) * 2018-01-12 2018-07-10 中国平安人寿保险股份有限公司 Construction method, device, storage medium and the terminal of risk score model
CN108717867A (en) * 2018-05-02 2018-10-30 中国科学技术大学苏州研究院 Disease forecasting method for establishing model and device based on Gradient Iteration tree
CN108539738B (en) * 2018-05-10 2020-04-21 国网山东省电力公司电力科学研究院 Short-term load prediction method based on gradient lifting decision tree
CN109242105B (en) * 2018-08-17 2024-03-15 第四范式(北京)技术有限公司 Code optimization method, device, equipment and medium
CN109472296A (en) * 2018-10-17 2019-03-15 阿里巴巴集团控股有限公司 A kind of model training method and device promoting decision tree based on gradient
CN109460795A (en) * 2018-12-17 2019-03-12 北京三快在线科技有限公司 Classifier training method, apparatus, electronic equipment and computer-readable medium

Also Published As

Publication number Publication date
CN110348580A (en) 2019-10-18
CN110348580B (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN110348580B (en) Method and device for constructing GBDT model, and prediction method and device
CN109936582B (en) Method and device for constructing malicious traffic detection model based on PU learning
CN108418825B (en) Risk model training and junk account detection methods, devices and equipment
CN110276066B (en) Entity association relation analysis method and related device
CN113688665B (en) Remote sensing image target detection method and system based on semi-supervised iterative learning
CN114037876A (en) Model optimization method and device
CN110175657B (en) Image multi-label marking method, device, equipment and readable storage medium
CN108241662B (en) Data annotation optimization method and device
CN111143578A (en) Method, device and processor for extracting event relation based on neural network
Li et al. Localizing and quantifying infrastructure damage using class activation mapping approaches
CN115830399B (en) Classification model training method, device, equipment, storage medium and program product
CN112632269A (en) Method and related device for training document classification model
CN110968689A (en) Training method of criminal name and law bar prediction model and criminal name and law bar prediction method
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN116070106B (en) Behavior sequence-based data interaction anomaly detection feature extraction method and device
CN115617998A (en) Text classification method and device based on intelligent marketing scene
CN110765352A (en) User interest identification method and device
KR102413588B1 (en) Object recognition model recommendation method, system and computer program according to training data
CN115809697A (en) Data correction method and device and electronic equipment
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN113919936A (en) Sample data processing method and device
US11514311B2 (en) Automated data slicing based on an artificial neural network
CN113641889A (en) Click rate prediction method and device based on commodity representation
CN113468936A (en) Food material identification method, device and equipment
CN110598125A (en) Method and device for evaluating key opinion leader investment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination