WO2019223384A1 - Feature interpretation method and apparatus for a GBDT model - Google Patents

Feature interpretation method and apparatus for a GBDT model

Info

Publication number
WO2019223384A1
WO2019223384A1 (PCT/CN2019/076264)
Authority
WO
WIPO (PCT)
Prior art keywords
node
feature
score
user
child
Prior art date
Application number
PCT/CN2019/076264
Other languages
English (en)
French (fr)
Inventor
方文静
周俊
高利翠
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Priority to EP19806892.6A priority Critical patent/EP3719704A4/en
Priority to SG11202006205SA priority patent/SG11202006205SA/en
Publication of WO2019223384A1 publication Critical patent/WO2019223384A1/zh
Priority to US16/889,695 priority patent/US11205129B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00Payment architectures, schemes or protocols
    • G06Q20/38Payment protocols; Details thereof
    • G06Q20/40Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401Transaction verification
    • G06Q20/4016Transaction verification involving fraud or risk level assessment in transaction processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • Embodiments of the present specification relate to the technical field of data processing, and more particularly, to a method and device for performing feature interpretation on a predicted label value of a user.
  • GBDT: Gradient Boosting Decision Tree.
  • The GBDT algorithm is a machine learning technique used for tasks such as regression, classification, and ranking. It obtains a strong prediction model by combining multiple weak learners (usually decision trees). The GBDT model obtains multiple decision trees through multiple iterations, reducing the loss function along the gradient direction in each iteration.
  • Besides the feature importance parameters commonly used today as a global interpretation, the interpretation of local feature contributions for a single user mainly includes the following two methods: extracting a preferred scheme from the GBDT model for interpretation through re-modeling; and adjusting the magnitude of a feature value to test the feature's effect on the prediction performance loss. Therefore, a more effective GBDT model interpretation scheme is needed to meet the needs of the prior art.
  • the embodiments of the present specification aim to provide a more effective GBDT model interpretation scheme to solve the deficiencies in the prior art.
  • One aspect of the present specification provides a method for obtaining a feature interpretation of a predicted label value of a user. The method is performed after a user's label value has been predicted through a GBDT model, and the feature interpretation includes multiple features of the user related to the user's predicted label value, as well as the correlation of each of these features with the predicted label value; the GBDT model includes a plurality of sequentially arranged decision trees.
  • In each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of that leaf node are obtained, wherein the scores of the leaf nodes are predetermined by the GBDT model.
  • Each prediction path corresponding to each of the leaf nodes is determined, where the prediction path is the node connection path from the leaf node to the root node of the decision tree where the leaf node is located.
  • The correlation between the feature corresponding to at least one child node and the predicted label value is obtained by adding up the feature local increments of the at least one child node corresponding to the same feature.
  • In one embodiment, determining the score of each parent node based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located includes: the score of the parent node is the average of the scores of its two child nodes.
  • In another embodiment, the score of the parent node is a weighted average of the scores of its two child nodes, where the weight of each child node's score is determined based on the number of samples allocated to it during training of the GBDT model.
  • Determining the feature corresponding to each child node and the feature local increment at each child node includes obtaining, for each child node, the difference between the node's own score and its parent's score as the feature local increment.
  • the GBDT model is a classification model or a regression model.
  • The predetermined number of top-ranked decision trees may be all of the plurality of sequentially arranged decision trees included in the GBDT model.
  • Another aspect of the present specification provides a device for obtaining a feature interpretation of a predicted label value of a user. The device is implemented after a user's label value has been predicted through a GBDT model; the feature interpretation includes multiple features of the user related to the user's predicted label value and the correlation of each of the features with the predicted label value; the GBDT model includes a plurality of sequentially arranged decision trees; and the device includes:
  • a first obtaining unit, configured to obtain, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of that leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
  • a first determining unit configured to determine each prediction path corresponding to each of the leaf nodes, where the prediction path is a node connection path from the leaf node to a root node of a decision tree where the leaf node is located;
  • a second obtaining unit configured to obtain a split feature and a score of each parent node on each of the prediction paths, and the score of each parent node is determined based on predetermined scores of each leaf node of a decision tree in which the parent node is located;
  • a second determining unit, configured to determine, for each child node on each of the prediction paths, the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node;
  • a feature obtaining unit, configured to obtain the set of features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and
  • a correlation obtaining unit, configured to obtain the correlation between the feature corresponding to at least one child node and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
  • a user-level accurate model interpretation of the GBDT model can be obtained by simply obtaining the existing parameters and prediction results in the GBDT model, and the calculation cost is low.
  • the solutions in the embodiments of the present specification can be applied to various GBDT models, and have strong applicability and operability.
  • FIG. 1 illustrates a method for obtaining a feature interpretation of a predicted label value of a user according to an embodiment of the present specification
  • FIG. 2 illustrates a decision tree included in a GBDT model according to an embodiment of the present specification
  • FIG. 3 illustrates a schematic diagram of implementing a method according to an embodiment of the present specification based on the decision tree shown in FIG. 2;
  • FIG. 4 illustrates a device 400 for acquiring a feature interpretation of a predicted label value of a user according to an embodiment of the present specification.
  • The model interpretation method according to the embodiments of the present specification is performed after a user's label value has been predicted through a GBDT model.
  • The GBDT model is obtained through the training process described below. First, a training set $D_1 = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ is obtained, where N is the number of training samples, that is, the number of users; $x^{(i)}$ is the feature vector of the i-th user, for example an S-dimensional vector $x = (x_1, x_2, \ldots, x_S)$; and $y^{(i)}$ is the calibration label value of the i-th user.
  • For example, if the GBDT model is a model for predicting credit card fraud, then $x^{(i)}$ may be the user's card-swiping record data, transaction record data, and so on, and $y^{(i)}$ may be the user's fraud risk value.
  • The N users are split by the first decision tree: a split feature and a feature threshold are set at each parent node of the decision tree, and users are divided into the corresponding child nodes by comparing the user's corresponding feature with the feature threshold at the parent node. Through this process, the N users are finally divided among the leaf nodes, where the score of each leaf node is the mean of the calibration label values (that is, $y^{(i)}$) of the users in that leaf node.
  • The residual $r^{(i)}$ of each user is obtained by subtracting the score of the user's leaf node in the first decision tree from the user's calibration label value, and $D_2 = \{(x^{(i)}, r^{(i)})\}_{i=1}^{N}$ is taken as the new training set, which corresponds to the same user set as $D_1$.
  • a second decision tree can be obtained.
  • In the second decision tree, the N users are divided among the leaf nodes, and the score of each leaf node is the mean of the residual values of the users in it.
  • Similarly, multiple decision trees can be obtained sequentially, each decision tree being obtained based on the residuals of the previous decision tree.
  • a GBDT model including multiple decision trees can be obtained.
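  • As a rough illustration of this training process, the following minimal Python sketch (an illustration under assumptions, not the specification's implementation: scikit-learn regression trees stand in for the GBDT trees, and all names are invented for the example) fits each tree to the residuals of the previous trees, so that each leaf score is the mean of the calibration values, or residuals, of the users falling into that leaf:

      # Toy GBDT training loop, assuming scikit-learn-style regression trees.
      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      def train_gbdt(X, y, n_trees=5, max_depth=3):
          """Fit each tree to the residuals of the previous trees; each leaf
          score is the mean target (label or residual) of its samples."""
          trees = []
          residual = y.astype(float)      # r^(i) starts as the label y^(i)
          for _ in range(n_trees):
              tree = DecisionTreeRegressor(max_depth=max_depth)
              tree.fit(X, residual)       # leaf value = mean residual in leaf
              trees.append(tree)
              residual = residual - tree.predict(X)  # residuals for next tree
          return trees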
  • the user's feature vector is input to the above GBDT model.
  • Each decision tree in the GBDT model assigns the user to the corresponding leaf node according to the parent node's split feature and split threshold.
  • the scores of the leaf nodes of the user are added up to obtain the predicted label value of the user.
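  • A hedged sketch of this prediction step, continuing the toy training code above: the predicted label value is simply the sum of each tree's leaf score for the user.

      def predict(trees, x):
          """Predicted label value = sum of the scores of the leaf nodes the
          user falls into, one leaf per tree."""
          return sum(tree.predict(x.reshape(1, -1))[0] for tree in trees)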
  • The model interpretation method according to the embodiments of the present specification obtains a feature interpretation of the user's predicted label value based on the existing parameters and prediction results in the GBDT model. That is, in each of the decision trees, the leaf node where the user is located is obtained, the prediction path containing that leaf node is obtained, the features related to the predicted label value at the child nodes on the prediction path and the local increments of those features are computed, and the local increments of the same feature across all decision trees are accumulated as the correlation between that feature and the predicted label value, that is, the feature's contribution to the predicted label value.
  • Through these features and their feature contributions, the user's predicted label value is given a feature interpretation.
  • the above GBDT model is a regression model, that is, its predicted labels are continuous data, such as fraud risk value, age, and so on.
  • However, the GBDT model is not limited to a regression model; it may also be a classification model, a recommendation model, and the like, and all of these models can use the GBDT model interpretation method according to the embodiments of the present specification.
  • FIG. 1 illustrates a method for obtaining a feature interpretation of a predicted label value of a user according to an embodiment of the present specification.
  • The method is performed after a user's label value has been predicted through a GBDT model.
  • The feature interpretation includes multiple features of the user related to the predicted label value, as well as the correlation of each of these features with the predicted label value; the GBDT model includes a plurality of sequentially arranged decision trees.
  • The method includes: in step S11, in each of a predetermined number of top-ranked decision trees, respectively obtaining the leaf node containing the user and the score of that leaf node, where the score of the leaf node is a score predetermined through the GBDT model; in step S12, determining the prediction path corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree where the leaf node is located; in step S13, obtaining the split feature and score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located; in step S14, for each child node on each of the prediction paths, determining the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, where the feature corresponding to each child node is a feature related to the predicted label value of the user; in step S15, obtaining the set of features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and, in step S16, obtaining the correlation between the feature corresponding to at least one child node and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
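  • To make steps S11 through S16 concrete, the following Python sketch (an illustration under stated assumptions, not the patented implementation: it assumes the scikit-learn trees from the toy training code above, and all helper names are invented) walks the user's prediction path in each tree, derives parent-node scores from the leaf scores, and sums the local increments per split feature:

      # Steps S11-S16 over scikit-learn regression trees (illustrative only).
      from collections import defaultdict

      def node_scores(tree, use_weighted=True):
          """Score every node from the leaf scores, per formula (1) or (2)."""
          t = tree.tree_
          scores = {}

          def score(node):
              if node in scores:
                  return scores[node]
              left, right = t.children_left[node], t.children_right[node]
              if left == -1:          # leaf: score predetermined by the model
                  s = t.value[node][0][0]
              elif use_weighted:      # formula (2): sample-weighted average
                  n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
                  s = (n_l * score(left) + n_r * score(right)) / (n_l + n_r)
              else:                   # formula (1): plain average
                  s = (score(left) + score(right)) / 2.0
              scores[node] = s
              return s

          score(0)                    # fill the cache from the root down
          return scores

      def feature_contributions(trees, x):
          """S11/S12: find the user's leaf and prediction path in each tree;
          S13/S14: local increment S_c - S_p per child node (formula (3));
          S15/S16: sum increments of the same split feature across trees."""
          contrib = defaultdict(float)
          for tree in trees:
              t = tree.tree_
              scores = node_scores(tree)
              # decision_path node ids run root-to-leaf for a single sample
              path = tree.decision_path(x.reshape(1, -1)).indices
              for parent, child in zip(path[:-1], path[1:]):
                  f = t.feature[parent]        # split feature of the parent
                  contrib[f] += scores[child] - scores[parent]
          return dict(contrib)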
  • First, in step S11, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of that leaf node are obtained, where the score of the leaf node is a score predetermined through the GBDT model.
  • Each decision tree is obtained based on the label-value residuals of the previous decision tree; that is, the leaf-node scores of trees later in the sequence become smaller and smaller.
  • Accordingly, the local increments of the user features related to the user's predicted label value that are determined through the sequentially arranged decision trees also become smaller and smaller in order of magnitude. It can be expected that the local increment of a feature obtained from a decision tree ranked later has a smaller and smaller impact on the correlation of that feature with the predicted label value (that is, the sum of all local increments of the feature), and may even be approximately zero.
  • a predetermined number of decision trees that are ranked higher can be selected to implement the method according to the embodiment of the present specification.
  • The predetermined number may be determined by a predetermined condition: for example, it may be determined according to the order of magnitude of the leaf-node scores, or according to a predetermined percentage of the decision trees.
  • a method according to an embodiment of the present specification may be implemented on all decision trees included in the GBDT model, so as to obtain accurate model interpretation.
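  • One simple way to pick such a predetermined number under the magnitude criterion is sketched below (a hedged example; the threshold value and helper name are assumptions, not part of the specification):

      import numpy as np

      def pick_top_trees(trees, threshold=1e-3):
          """Keep the leading trees whose leaf scores still matter; stop once
          the largest absolute leaf score of a tree drops below the threshold."""
          kept = []
          for tree in trees:
              t = tree.tree_
              leaf_scores = t.value[t.children_left == -1, 0, 0]
              if np.max(np.abs(leaf_scores)) < threshold:
                  break
              kept.append(tree)
          return kept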
  • FIG. 2 illustrates a decision tree included in a GBDT model according to an embodiment of the present specification.
  • The node labeled 0 in the figure is the root node of the decision tree.
  • The nodes labeled 3, 7, 8, 13, 14, 10, 11, and 12 in the figure are the leaf nodes of the decision tree.
  • The value marked under each leaf node (for example, 0.136 under node 3) is the score of that leaf node, which is determined by the GBDT model during training based on the calibration label values of the multiple samples classified into that leaf node.
  • As shown in the dashed rectangular box in the figure, two nodes, 11 and 12, branch off from node 6; therefore, node 6 is the parent node of node 11 and node 12, and nodes 11 and 12 are both child nodes of node 6.
  • The arrows leading from some parent nodes to their child nodes in the figure are marked with a feature and a value range. For example, the arrow from node 0 to node 1 is marked "f5≤-0.5" and the arrow from node 0 to node 2 is marked "f5>-0.5", where f5 denotes feature 5, which is the split feature of node 0, and -0.5 is the split threshold of node 0.
  • FIG. 3 illustrates a schematic diagram of implementing a method according to an embodiment of the present specification based on the decision tree shown in FIG. 2.
  • Suppose that when a user's label value is predicted by a GBDT model including the decision tree shown in FIG. 3, the user is assigned to node 14 in that decision tree. Therefore, node 14, which contains the user, and the score of node 14 can be determined from the decision tree.
  • In the other decision trees included in the GBDT model, the leaf nodes where the user is located and their scores can be determined similarly. Therefore, a predetermined number of leaf nodes and their corresponding scores can be obtained, that is, one leaf node is obtained from each of the predetermined number of decision trees.
  • In step S12, each prediction path corresponding to each of the leaf nodes is determined; the prediction path is the node connection path from the leaf node to the root node of the decision tree where the leaf node is located.
  • Referring to FIG. 3, after determining leaf node 14 where the user is located, the prediction path can be determined as the path from node 0 to node 14, shown in the figure as the node connection path marked with bold arrows. Meanwhile, in the other decision trees of the predetermined number of decision trees, prediction paths may be acquired similarly, thereby obtaining a predetermined number of prediction paths.
  • In step S13, the split features and scores of each parent node on each of the prediction paths are obtained, and the score of each parent node is determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located.
  • On the prediction path from node 0 to node 14, every node except node 14 has child nodes; that is, the parent nodes included in the path are node 0, node 2, node 5, and node 9.
  • The split feature of each parent node can be obtained directly from the decision tree. For example, referring to FIG. 2, the split feature of node 0 is feature 5, the split feature of node 2 is feature 2, the split feature of node 5 is feature 4, and the split feature of node 9 is feature 4.
  • In one embodiment, the score of the parent node is determined based on the following formula (1):
  • $S_p = (S_{c1} + S_{c2}) / 2$   (1)
  • where $S_p$ is the score of the parent node, and $S_{c1}$ and $S_{c2}$ are the scores of the two child nodes of that parent node. That is, the score of the parent node is the average of the scores of its two child nodes.
  • the score of node 9 can be determined from the scores of nodes 13 and 14
  • the score of node 5 can be determined to be 0.0625.
  • the score of node 2 is 0.0698.
  • the score of node 0 can be determined to be 0.0899.
  • the score of each parent node on the prediction path shown in FIG. 3 can be determined based on the score of each leaf node in the graph.
  • the score of node 5 can be determined from nodes 13, 14 and 10
  • the score of node 2 can be determined from nodes 13, 14, 10, 11 and 12.
  • In another embodiment, the score of the parent node is determined based on the following formula (2):
  • $S_p = (N_{c1} S_{c1} + N_{c2} S_{c2}) / (N_{c1} + N_{c2})$   (2)
  • where $N_{c1}$ and $N_{c2}$ are the numbers of samples that fall into child nodes c1 and c2, respectively, during model training. That is, the score of the parent node is a weighted average of the scores of its two child nodes, weighted by the number of training samples that fell into each. Practical applications and experimental tests of the embodiments of the present specification show that determining the parent-node score with formula (2) can yield a more accurate model interpretation than formula (1). In addition, in the embodiments of the present specification, the computation of the parent-node score is not limited to formulas (1) and (2): for example, the parameters in formulas (1) and (2) can be adjusted to make the model interpretation more accurate, and the score of each parent node can also be obtained from the leaf-node scores through a geometric mean, a root-mean-square mean, and the like.
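  • As a toy numeric illustration of the difference between the two formulas (the child scores and sample counts below are invented for the example, not taken from FIG. 3), formula (2) pulls the parent score toward the child that received more training samples:

      s_c1, s_c2 = 0.060, 0.062   # hypothetical child-node scores
      n_c1, n_c2 = 90, 10         # hypothetical training-sample counts

      s_p_avg = (s_c1 + s_c2) / 2                                  # formula (1): 0.0610
      s_p_weighted = (n_c1 * s_c1 + n_c2 * s_c2) / (n_c1 + n_c2)   # formula (2): 0.0602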
  • In step S14, for each child node on each of the prediction paths, the feature corresponding to that child node and the feature local increment at that child node are determined from the child node's own score, the score of its parent node, and the split feature of its parent node, where the feature corresponding to each child node is a feature related to the predicted label value of the user.
  • Referring to FIG. 3, on the prediction path from node 0 to node 14, every node except root node 0 is a child of the node above it; that is, the child nodes in the prediction path are node 2, node 5, node 9, and node 14.
  • Since a child node on the prediction path is reached through the feature split of its parent node, the split feature of the parent node is the feature of the child node that is related to the predicted label value; for convenience of description, it is expressed as the feature corresponding to the child node, or the contribution feature at the child node. For example, as shown in FIG. 3, the feature corresponding to node 2 is feature 5, the feature corresponding to node 5 is feature 2, the feature corresponding to node 9 is feature 4, and the feature corresponding to node 14 is feature 4.
  • In one embodiment, the feature local increment at each child node is obtained by the following formula (3):
  • $LI_c^f = S_c - S_p$   (3)
  • where $LI_c^f$ denotes the local increment of feature f at child node c, $S_c$ is the score of the child node, and $S_p$ is the score of the parent node of that child node.
  • Based on the parent-node scores obtained in step S13, formula (3) gives: the local increment of feature 5 (f5) at node 2 is -0.0201 (that is, 0.0698 - 0.0899); the local increment of feature 2 (f2) at node 5 is -0.0073; the local increment of feature 4 (f4) at node 9 is -0.0015; and the local increment of feature 4 (f4) at node 14 is 0.001.
  • the calculation of the local increment is not limited to the above formula (3), and the local increment may also be calculated by other calculation methods.
  • the score of the parent node or the score of the child node in formula (3) can be multiplied by the correction parameter to make the model interpretation more accurate.
  • In step S15, the set of features respectively corresponding to all the child nodes is obtained as the multiple features related to the predicted label value of the user. For example, referring to FIG. 3, in the decision tree shown in FIG. 3, the features related to the user's predicted label value, that is, feature 5, feature 2, and feature 4, can be obtained from the prediction path. Similarly, features related to the user's predicted label value can be obtained from the other decision trees among the predetermined number of decision trees. Gathering these features together yields the set of multiple features related to the user's predicted label value.
  • In step S16, the correlation between the feature corresponding to the at least one child node and the predicted label value is obtained by adding up the feature local increments of the at least one child node corresponding to the same feature.
  • For example, referring to FIG. 3, nodes 9 and 14 on the prediction path both correspond to feature 4, so the local increments at node 9 and node 14 can be added; for example, if no prediction-path child node corresponding to feature 4 is obtained in the other decision trees, the correlation (or feature contribution value) of feature 4 with the predicted label value is -0.0015 + 0.0010 = -0.0005.
  • If other decision trees also contain prediction-path child nodes corresponding to feature 4, the local increments of all child nodes corresponding to feature 4 are added to obtain the correlation, or contribution value, of feature 4.
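  • Continuing the sketches above (toy data and invented names throughout), accumulating the increments per feature across all trees yields exactly this kind of per-feature contribution table; the contributions plus the trees' root-node scores telescope back to the user's predicted label value:

      import numpy as np

      X = np.random.rand(100, 6)            # toy data: 100 users, features f0..f5
      y = np.random.rand(100)
      trees = train_gbdt(X, y, n_trees=3)   # from the training sketch above

      user = X[0]
      contrib = feature_contributions(trees, user)
      for f, c in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
          print(f"f{f}: {c:+.4f}")          # increments of the same feature, summed

      # Sanity check: root scores + summed increments equal the prediction.
      roots = sum(node_scores(t)[0] for t in trees)
      assert abs(roots + sum(contrib.values()) - predict(trees, user)) < 1e-9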
  • The larger the correlation value, the greater the correlation between the feature and the predicted label value; when the correlation value is negative, the correlation between the feature and the predicted label value is very small.
  • For example, in the example of predicting a user's credit card fraud value through the GBDT model, the larger the correlation value, the greater the correlation between the feature and the credit card fraud value, that is, the greater the risk indicated by the feature.
  • By obtaining multiple features related to the user's predicted label value and the correlations of these features with the predicted label value, the user's predicted label value can be given a feature interpretation, thereby clarifying the determining factors of the prediction, and more information related to the user can be obtained through the feature interpretation. For example, in the example of predicting a user's credit card fraud degree through the GBDT model, by obtaining the user's multiple features related to the predicted label value and the correlation magnitudes of these features, the influence of each feature and the magnitude of its correlation can serve as reference information for the predicted value of the user's credit card fraud degree, making the judgment of the user more accurate.
  • FIG. 4 illustrates a device 400 for acquiring a feature interpretation of a predicted label value of a user according to an embodiment of the present specification.
  • The device 400 is implemented after a user's label value has been predicted through a GBDT model; the feature interpretation includes multiple features of the user related to the user's predicted label value, and the correlation of each of the features with the predicted label value.
  • the GBDT model includes multiple ordered decision trees.
  • the apparatus 400 includes:
  • a first obtaining unit 41, configured to obtain, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of that leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
  • the first determining unit 42 is configured to determine each prediction path corresponding to each of the leaf nodes, where the prediction path is a node connection path from the leaf node to a root node of a decision tree where the leaf node is located;
  • a second obtaining unit 43, configured to obtain the split feature and score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located;
  • a second determining unit 44, configured to determine, for each child node on each of the prediction paths, the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user;
  • a feature obtaining unit 45, configured to obtain the set of features respectively corresponding to all the child nodes as the multiple features related to the predicted label value of the user; and
  • a correlation obtaining unit 46, configured to obtain the correlation between the feature corresponding to the at least one child node and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
  • a user-level accurate model interpretation of the GBDT model can be obtained by simply obtaining the existing parameters and prediction results in the GBDT model, and the calculation cost is low.
  • the solutions in the embodiments of the present specification can be applied to various GBDT models, and have strong applicability and operability.
  • RAM: random access memory
  • ROM: read-only memory
  • EPROM: electrically programmable ROM
  • EEPROM: electrically erasable programmable ROM
  • A software module may be placed in RAM, memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus for obtaining a feature interpretation of a predicted label value of a user, the method being performed after a user's label value has been predicted through a GBDT model, comprising: in each of a predetermined number of top-ranked decision trees, respectively obtaining the leaf node containing the user and the score of the leaf node (S11); determining prediction paths respectively corresponding to each of the leaf nodes (S12); obtaining the split feature and score of each parent node on each prediction path (S13); for each child node on each prediction path, determining the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node (S14); obtaining the set of features respectively corresponding to all of the child nodes as multiple features related to the user's predicted label value (S15); and obtaining the correlation between the feature corresponding to at least one child node and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature (S16).

Description

Feature interpretation method and apparatus for a GBDT model

Technical Field
Embodiments of the present specification relate to the technical field of data processing and, more particularly, to a method and apparatus for performing feature interpretation on a predicted label value of a user.
Background
Against the background of the rapid development of Internet technology, data mining on the Internet has become more and more important. Usually, in data mining, a model is built through machine learning based on calibration data, so that the trained model can be used to process data to be predicted. Among the many machine learning algorithms, the GBDT (Gradient Boosting Decision Tree) algorithm has been applied more and more widely because of its excellent learning performance. The GBDT algorithm is a machine learning technique for tasks such as regression, classification, and ranking; it obtains a strong prediction model by combining multiple weak learners (usually decision trees). The GBDT model obtains multiple decision trees through multiple iterations, reducing the loss function along the gradient direction in each iteration. With the wide application of the GBDT algorithm, there is a growing demand for interpretation of GBDT models. Besides the feature importance parameters commonly used today as a global interpretation, the interpretation of local feature contributions for a single user mainly includes the following two methods: extracting a preferred scheme from the GBDT model for interpretation through re-modeling; and adjusting the magnitude of a feature value to test the feature's effect on the prediction performance loss. Therefore, a more effective GBDT model interpretation scheme is needed to meet the needs of the prior art.
Summary of the Invention
The embodiments of the present specification aim to provide a more effective GBDT model interpretation scheme to remedy deficiencies in the prior art.
To achieve the above object, one aspect of the present specification provides a method for obtaining a feature interpretation of a predicted label value of a user, the method being performed after a user's label value has been predicted through a GBDT model, the feature interpretation including multiple features of the user related to the user's predicted label value and the correlation of each of the features with the predicted label value, the GBDT model including a plurality of sequentially arranged decision trees, the method comprising:
in each of a predetermined number of top-ranked decision trees, respectively obtaining the leaf node containing the user and the score of the leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
determining prediction paths respectively corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located;
obtaining the split feature and the score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located;
for each child node on each of the prediction paths, determining the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user;
obtaining the set of the features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and
obtaining the correlation between the feature corresponding to at least one of the child nodes and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
In one embodiment, in the method for obtaining a feature interpretation of a predicted label value of a user, determining the score of each parent node based on the predetermined scores of the leaf nodes of the decision tree in which it is located includes: the score of the parent node is the average of the scores of its two child nodes.
In one embodiment, in the method for obtaining a feature interpretation of a predicted label value of a user, determining the score of each parent node based on the predetermined scores of the leaf nodes of the decision tree in which it is located includes: the score of the parent node is a weighted average of the scores of its two child nodes, the weight of each child node's score being determined based on the number of samples allocated to the child node during training of the GBDT model.
In one embodiment, in the method for obtaining a feature interpretation of a predicted label value of a user, determining the feature corresponding to each child node and the feature local increment at each child node includes: obtaining the difference between the child node's own score and the score of its parent node as the feature local increment.
In one embodiment, in the method for obtaining a feature interpretation of a predicted label value of a user, the GBDT model is a classification model or a regression model.
In one embodiment, in the method for obtaining a feature interpretation of a predicted label value of a user, the predetermined number of top-ranked decision trees is the plurality of sequentially arranged decision trees included in the GBDT model.
Another aspect of the present specification provides a device for obtaining a feature interpretation of a predicted label value of a user, the device being implemented after a user's label value has been predicted through a GBDT model, the feature interpretation including multiple features of the user related to the user's predicted label value and the correlation of each of the features with the predicted label value, the GBDT model including a plurality of sequentially arranged decision trees, the device comprising:
a first obtaining unit, configured to obtain, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of the leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
a first determining unit, configured to determine prediction paths respectively corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located;
a second obtaining unit, configured to obtain the split feature and the score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located;
a second determining unit, configured to determine, for each child node on each of the prediction paths, the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user;
a feature obtaining unit, configured to obtain the set of the features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and
a correlation obtaining unit, configured to obtain the correlation between the feature corresponding to at least one of the child nodes and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
Through the GBDT model interpretation scheme according to the embodiments of the present specification, an accurate user-level model interpretation of the GBDT model can be obtained simply by obtaining the existing parameters and prediction results in the GBDT model, and the computational cost is low. In addition, the scheme of the embodiments of the present specification is applicable to various GBDT models, with strong applicability and operability.
Brief Description of the Drawings
The embodiments of the present specification can be made clearer by describing them with reference to the accompanying drawings:
FIG. 1 shows a method for obtaining a feature interpretation of a predicted label value of a user according to an embodiment of the present specification;
FIG. 2 illustrates a decision tree included in a GBDT model according to an embodiment of the present specification;
FIG. 3 is a schematic diagram of implementing the method according to an embodiment of the present specification based on the decision tree shown in FIG. 2; and
FIG. 4 shows a device 400 for obtaining a feature interpretation of a predicted label value of a user according to an embodiment of the present specification.
Detailed Description
The embodiments of the present specification will be described below with reference to the accompanying drawings.
The application scenario of the embodiments of the present specification is described first. The model interpretation method according to the embodiments of the present specification is performed after a user's label value has been predicted through a GBDT model. The GBDT model is obtained through the following training process. First, a training set

$D_1 = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$

is obtained, where N is the number of training samples, that is, the number of users; $x^{(i)}$ is the feature vector of the i-th user, for example an S-dimensional vector $x = (x_1, x_2, \ldots, x_S)$; and $y^{(i)}$ is the calibration label value of the i-th user. For example, if the GBDT model is a model for predicting credit card fraud, $x^{(i)}$ may be the user's card-swiping record data, transaction record data, and so on, and $y^{(i)}$ may be the user's fraud risk value. Then, the N users are split by the first decision tree: a split feature and a feature threshold are set at each parent node of the decision tree, and users are divided into the corresponding child nodes by comparing the user's corresponding feature with the feature threshold at the parent node. Through this process, the N users are finally divided among the leaf nodes, where the score of each leaf node is the mean of the calibration label values (that is, $y^{(i)}$) of the users in that leaf node.
After the first decision tree is obtained, the residual $r^{(i)}$ of each user is obtained by subtracting the score of the user's leaf node in the first decision tree from the user's calibration label value, and

$D_2 = \{(x^{(i)}, r^{(i)})\}_{i=1}^{N}$

is taken as the new training set, which corresponds to the same user set as D1. In the same way as above, a second decision tree can be obtained; in the second decision tree, the N users are divided among the leaf nodes, and the score of each leaf node is the mean of the residual values of the users in it. Similarly, multiple decision trees can be obtained sequentially, each obtained based on the residuals of the previous decision tree, so that a GBDT model including multiple decision trees is obtained.
When predicting a user's label value, the user's feature vector is input into the above GBDT model. Each decision tree in the GBDT model assigns the user to the corresponding leaf node according to the split features and split thresholds of its parent nodes; the user's predicted label value is then obtained by adding up the scores of the leaf nodes where the user is located.
After the above prediction process, the model interpretation method according to the embodiments of the present specification obtains a feature interpretation of the user's predicted label value based on the existing parameters and prediction results in the GBDT model. That is, in each of the decision trees, the leaf node where the user is located is obtained, the prediction path containing that leaf node is obtained, the features related to the predicted label value at the child nodes on the prediction path and the local increments of those features are computed, and the local increments of the same feature across all decision trees are accumulated as the correlation between that feature and the predicted label value, that is, the feature's contribution to the predicted label value. The user's predicted label value is thereby given a feature interpretation through these features and their feature contributions. The above GBDT model is a regression model, that is, its predicted labels are continuous data, such as a fraud risk value, an age, and so on. However, the GBDT model is not limited to a regression model; it may also be a classification model, a recommendation model, and the like, and all of these models can use the GBDT model interpretation method according to the embodiments of the present specification.
FIG. 1 shows a method for obtaining a feature interpretation of a predicted label value of a user according to an embodiment of the present specification. The method is performed after a user's label value has been predicted through a GBDT model; the feature interpretation includes multiple features of the user related to the user's predicted label value and the correlation of each of these features with the predicted label value; and the GBDT model includes a plurality of sequentially arranged decision trees. The method includes: in step S11, in each of a predetermined number of top-ranked decision trees, respectively obtaining the leaf node containing the user and the score of the leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model; in step S12, determining prediction paths respectively corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located; in step S13, obtaining the split feature and the score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located; in step S14, for each child node on each of the prediction paths, determining the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user; in step S15, obtaining the set of the features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and, in step S16, obtaining the correlation between the feature corresponding to at least one child node and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
First, in step S11, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of that leaf node are respectively obtained, wherein the score of the leaf node is a score predetermined through the GBDT model.
As described above, among the plurality of sequentially arranged decision trees included in the GBDT model, each decision tree is obtained based on the label-value residuals of the previous decision tree; that is, the leaf-node scores of the sequentially arranged decision trees become smaller and smaller. Accordingly, the local increments of the user features related to the user's predicted label value that are determined through the sequentially arranged decision trees also become smaller and smaller in order of magnitude. It can be expected that the local increment of a feature obtained from a decision tree ranked later has a smaller and smaller impact on the correlation of that feature with the predicted label value (that is, the sum of all local increments of the feature), and may even be approximately zero. Therefore, a predetermined number of top-ranked decision trees can be selected to implement the method according to the embodiments of the present specification. The predetermined number may be determined by a predetermined condition: for example, it may be determined according to the order of magnitude of the leaf nodes, or according to a predetermined percentage of the decision trees. In one embodiment, the method according to the embodiments of the present specification may be implemented on all decision trees included in the GBDT model, so as to obtain an accurate model interpretation.
FIG. 2 illustrates a decision tree included in a GBDT model according to an embodiment of the present specification. As shown in FIG. 2, the node labeled 0 is the root node of the decision tree, and the nodes labeled 3, 7, 8, 13, 14, 10, 11, and 12 are its leaf nodes. The value marked under each leaf node (for example, 0.136 under node 3) is the score of that leaf node, which is determined by the GBDT model during training based on the calibration label values of the multiple samples classified into that leaf node. As shown in the dashed rectangular box in FIG. 2, two nodes, 11 and 12, branch off from node 6; therefore, node 6 is the parent node of node 11 and node 12, and nodes 11 and 12 are both child nodes of node 6. As shown in FIG. 2, the arrows from some parent nodes to their child nodes are marked with a feature and a value range: for example, the arrow from node 0 to node 1 is marked "f5≤-0.5" and the arrow from node 0 to node 2 is marked "f5>-0.5", where f5 denotes feature 5, which is the split feature of node 0, and -0.5 is the split threshold of node 0.
FIG. 3 is a schematic diagram of implementing the method according to an embodiment of the present specification based on the decision tree shown in FIG. 2. As shown in FIG. 3, in the case where a user's label value is predicted by a GBDT model including the decision tree shown in FIG. 3, suppose the user is assigned to node 14 in that decision tree. Thus, node 14 containing the user and the score of node 14 can be determined from the decision tree. Meanwhile, in the other decision trees included in the GBDT model, the leaf node where the user is located and its score can be determined similarly. Thus, a predetermined number of leaf nodes and their corresponding scores can be obtained, that is, one leaf node is obtained from each of the predetermined number of decision trees.
In step S12, the prediction paths respectively corresponding to each of the leaf nodes are determined, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located. Continuing to refer to FIG. 3, in the decision tree shown in FIG. 3, after determining leaf node 14 where the user is located, the prediction path can be determined as the path from node 0 to node 14 in the figure, shown as the node connection path connected by bold arrows. Meanwhile, in the other decision trees of the predetermined number of decision trees, prediction paths may be acquired similarly, thereby obtaining a predetermined number of prediction paths.
In step S13, the split feature and score of each parent node on each of the prediction paths are obtained, and the score of each parent node is determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located. Referring to FIG. 3, on the prediction path from node 0 to node 14, every node except node 14 has child nodes; that is, the parent nodes included in the path are node 0, node 2, node 5, and node 9. As described above with reference to FIG. 2, the split feature of a parent node can be obtained directly from the decision tree: for example, referring to FIG. 2, the split feature of node 0 is feature 5, the split feature of node 2 is feature 2, the split feature of node 5 is feature 4, and the split feature of node 9 is feature 4. In one embodiment, the score of the parent node is determined based on the following formula (1):

$S_p = (S_{c1} + S_{c2}) / 2$   (1)

where $S_p$ is the score of the parent node, and $S_{c1}$ and $S_{c2}$ are the scores of the two child nodes of that parent node; that is, the score of the parent node is the average of the scores of its two child nodes. For example, as shown in FIG. 3, the score of node 9 can be determined from the scores of node 13 and node 14 as their average. Similarly, based on the scores of node 9 and node 10, the score of node 5 can be determined to be 0.0625; based on the scores of node 5 and node 6, the score of node 2 can be determined to be 0.0698; and based on the scores of node 1 and node 2, the score of node 0 can be determined to be 0.0899. It can be understood that the score of every parent node on the prediction path shown in FIG. 3 can be determined based on the scores of the leaf nodes in the figure; for example, the score of node 5 can be determined from nodes 13, 14, and 10, and the score of node 2 can be determined from nodes 13, 14, 10, 11, and 12.
In another embodiment, the score of the parent node is determined based on the following formula (2):

$S_p = (N_{c1} S_{c1} + N_{c2} S_{c2}) / (N_{c1} + N_{c2})$   (2)

where $N_{c1}$ and $N_{c2}$ are the numbers of samples that fell into child nodes c1 and c2, respectively, during model training; that is, the score of the parent node is a weighted average of the scores of its two child nodes, weighted by the number of training samples that fell into each. Practical applications or experimental tests of the embodiments of the present specification show that determining the parent-node score with formula (2) can yield a more accurate model interpretation than formula (1). In addition, in the embodiments of the present specification, the computation of the parent-node score is not limited to the above formulas (1) and (2): for example, the parameters in formulas (1) and (2) may be adjusted to make the model interpretation more accurate, and the score of each parent node may also be obtained from the leaf-node scores through a geometric mean, a root-mean-square mean, and the like.
In step S14, for each child node on each of the prediction paths, the feature corresponding to the child node and the feature local increment at the child node are determined from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user.

Referring to FIG. 3, on the prediction path from node 0 to node 14, every node except root node 0 is a child node of the node above it; that is, the child nodes in the path are node 2, node 5, node 9, and node 14. Since a child node on the prediction path is reached through the feature split of its parent node, the split feature of the parent node is the feature of the child node that is related to the predicted label value; for convenience of description, it is expressed as the feature corresponding to the child node, or the contribution feature at the child node. For example, as shown in FIG. 3, the feature corresponding to node 2 is feature 5, the feature corresponding to node 5 is feature 2, the feature corresponding to node 9 is feature 4, and the feature corresponding to node 14 is feature 4.
In one embodiment, the feature local increment at each child node is obtained through the following formula (3):

$LI_c^f = S_c - S_p$   (3)

where $LI_c^f$ denotes the local increment of feature f at child node c, $S_c$ denotes the score of the child node, and $S_p$ denotes the score of the parent node of that child node. This formula can be verified in practical applications or experimental tests.

Through formula (3), based on the scores of the parent nodes obtained in step S13, it can easily be computed that: the local increment of feature 5 (f5) at node 2 is -0.0201 (that is, 0.0698 - 0.0899), the local increment of feature 2 (f2) at node 5 is -0.0073, the local increment of feature 4 (f4) at node 9 is -0.0015, and the local increment of feature 4 (f4) at node 14 is 0.001.
In the embodiments of the present specification, the computation of the local increment is not limited to the above formula (3); the local increment may also be computed by other methods. For example, the score of the parent node or of the child node in formula (3) may be multiplied by a correction parameter to make the model interpretation more accurate.
In step S15, the set of the features respectively corresponding to all of the child nodes is obtained as the multiple features related to the predicted label value of the user. For example, referring to FIG. 3, in the decision tree shown in FIG. 3, the features related to the user's predicted label value, namely feature 5, feature 2, and feature 4, can be obtained from the prediction path. Likewise, features related to the user's predicted label value can be obtained similarly from the other decision trees among the predetermined number of decision trees. Gathering these features together yields the set of multiple features related to the user's predicted label value.
In step S16, the correlation between the feature corresponding to at least one child node and the predicted label value is obtained by adding up the feature local increments of the at least one child node corresponding to the same feature. For example, referring to FIG. 3, in the decision tree shown in the figure, nodes 9 and 14 on the prediction path both correspond to feature 4, so the local increments at node 9 and node 14 can be added: for example, in the case where no prediction-path child node corresponding to feature 4 is obtained in the other decision trees, the correlation (or feature contribution value) of feature 4 with the predicted label value is -0.0015 + 0.0010 = -0.0005. In the case where other decision trees also include prediction-path child nodes corresponding to feature 4, the local increments of all child nodes corresponding to feature 4 may be added to obtain the correlation, or contribution value, of feature 4. The larger the correlation value, the greater the correlation between the feature and the predicted label value; when the correlation value is negative, the correlation between the feature and the predicted label value is very small. For example, in the example of predicting a user's credit card fraud value through the GBDT model, the larger the correlation value, the greater the correlation between the feature and the credit card fraud value, that is, the greater the risk indicated by the feature.
By obtaining multiple features related to the user's predicted label value and the correlations of these multiple features with the predicted label value, a feature interpretation of the user's predicted label value can be given, thereby clarifying the determining factors of the prediction, and more information related to the user can be obtained through the feature interpretation. For example, in the example of predicting a user's credit card fraud degree through the GBDT model, by obtaining the user's multiple features related to the predicted label value and the correlation magnitudes of these features, the influence of each feature and the magnitude of its correlation can serve as reference information for the predicted value of the user's credit card fraud degree, making the judgment of the user more accurate.
FIG. 4 shows a device 400 for obtaining a feature interpretation of a predicted label value of a user according to an embodiment of the present specification. The device 400 is implemented after a user's label value has been predicted through a GBDT model; the feature interpretation includes multiple features of the user related to the user's predicted label value and the correlation of each of the features with the predicted label value; and the GBDT model includes a plurality of sequentially arranged decision trees. The device 400 includes:
a first obtaining unit 41, configured to obtain, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of the leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
a first determining unit 42, configured to determine prediction paths respectively corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located;
a second obtaining unit 43, configured to obtain the split feature and the score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located;
a second determining unit 44, configured to determine, for each child node on each of the prediction paths, the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user;
a feature obtaining unit 45, configured to obtain the set of the features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and
a correlation obtaining unit 46, configured to obtain the correlation between the feature corresponding to at least one of the child nodes and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
Through the GBDT model interpretation scheme according to the embodiments of the present specification, an accurate user-level model interpretation of the GBDT model can be obtained simply by obtaining the existing parameters and prediction results in the GBDT model, and the computational cost is low. In addition, the scheme of the embodiments of the present specification is applicable to various GBDT models, with strong applicability and operability.
Those of ordinary skill in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the particular application and design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be placed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (12)

  1. A method for obtaining a feature interpretation of a predicted label value of a user, the method being performed after a user's label value has been predicted through a GBDT model, the feature interpretation comprising multiple features of the user related to the user's predicted label value and the correlation of each of the features with the predicted label value, the GBDT model comprising a plurality of sequentially arranged decision trees, the method comprising:
    in each of a predetermined number of top-ranked decision trees, respectively obtaining the leaf node containing the user and the score of the leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
    determining prediction paths respectively corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located;
    obtaining the split feature and the score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located;
    for each child node on each of the prediction paths, determining the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user;
    obtaining the set of the features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and
    obtaining the correlation between the feature corresponding to at least one of the child nodes and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
  2. The method for obtaining a feature interpretation of a predicted label value of a user according to claim 1, wherein the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located comprises: the score of the parent node is the average of the scores of its two child nodes.
  3. The method for obtaining a feature interpretation of a predicted label value of a user according to claim 1, wherein the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located comprises: the score of the parent node is a weighted average of the scores of its two child nodes, the weight of each child node's score being determined based on the number of samples allocated to the child node during training of the GBDT model.
  4. The method for obtaining a feature interpretation of a predicted label value of a user according to claim 1, wherein determining the feature corresponding to each child node and the feature local increment at each child node comprises: obtaining the difference between the child node's own score and the score of its parent node as the feature local increment.
  5. The method for obtaining a feature interpretation of a predicted label value of a user according to claim 1, wherein the GBDT model is a classification model or a regression model.
  6. The method for obtaining a feature interpretation of a predicted label value of a user according to claim 1, wherein the predetermined number of top-ranked decision trees is the plurality of sequentially arranged decision trees included in the GBDT model.
  7. A device for obtaining a feature interpretation of a predicted label value of a user, the device being implemented after a user's label value has been predicted through a GBDT model, the feature interpretation comprising multiple features of the user related to the user's predicted label value and the correlation of each of the features with the predicted label value, the GBDT model comprising a plurality of sequentially arranged decision trees, the device comprising:
    a first obtaining unit, configured to obtain, in each of a predetermined number of top-ranked decision trees, the leaf node containing the user and the score of the leaf node, wherein the score of the leaf node is a score predetermined through the GBDT model;
    a first determining unit, configured to determine prediction paths respectively corresponding to each of the leaf nodes, the prediction path being the node connection path from the leaf node to the root node of the decision tree in which the leaf node is located;
    a second obtaining unit, configured to obtain the split feature and the score of each parent node on each of the prediction paths, the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which the parent node is located;
    a second determining unit, configured to determine, for each child node on each of the prediction paths, the feature corresponding to the child node and the feature local increment at the child node from the child node's own score, the score of its parent node, and the split feature of its parent node, wherein the feature corresponding to each child node is a feature related to the predicted label value of the user;
    a feature obtaining unit, configured to obtain the set of the features respectively corresponding to all of the child nodes, as the multiple features related to the predicted label value of the user; and
    a correlation obtaining unit, configured to obtain the correlation between the feature corresponding to at least one of the child nodes and the predicted label value by adding up the feature local increments of the at least one child node corresponding to the same feature.
  8. The device for obtaining a feature interpretation of a predicted label value of a user according to claim 7, wherein the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located comprises: the score of the parent node is the average of the scores of its two child nodes.
  9. The device for obtaining a feature interpretation of a predicted label value of a user according to claim 7, wherein the score of each parent node being determined based on the predetermined scores of the leaf nodes of the decision tree in which it is located comprises: the score of the parent node is a weighted average of the scores of its two child nodes, the weight of each child node's score being determined based on the number of samples allocated to the child node during training of the GBDT model.
  10. The device for obtaining a feature interpretation of a predicted label value of a user according to claim 7, wherein determining the feature corresponding to each child node and the feature local increment at each child node comprises: obtaining the difference between the child node's own score and the score of its parent node as the feature local increment.
  11. The device for obtaining a feature interpretation of a predicted label value of a user according to claim 7, wherein the GBDT model is a classification model or a regression model.
  12. The device for obtaining a feature interpretation of a predicted label value of a user according to claim 7, wherein the predetermined number of top-ranked decision trees is the plurality of sequentially arranged decision trees included in the GBDT model.
PCT/CN2019/076264 2018-05-21 2019-02-27 Feature interpretation method and apparatus for a GBDT model WO2019223384A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19806892.6A EP3719704A4 (en) 2018-05-21 2019-02-27 CHARACTERISTICS INTERPRETATION METHOD AND DEVICE FOR GRADIENT AMPLIFICATION DECISION TREE (GBDT) MODEL
SG11202006205SA SG11202006205SA (en) 2018-05-21 2019-02-27 Gbdt model feature interpretation method and apparatus
US16/889,695 US11205129B2 (en) 2018-05-21 2020-06-01 GBDT model feature interpretation method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810488062.XA 2018-05-21 Feature interpretation method and apparatus for a GBDT model
CN201810488062.X 2018-05-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/889,695 Continuation US11205129B2 (en) 2018-05-21 2020-06-01 GBDT model feature interpretation method and apparatus

Publications (1)

Publication Number Publication Date
WO2019223384A1 true WO2019223384A1 (zh) 2019-11-28

Family

ID=63806940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076264 WO2019223384A1 (zh) 2019-02-27 Feature interpretation method and apparatus for a GBDT model

Country Status (6)

Country Link
US (1) US11205129B2 (zh)
EP (1) EP3719704A4 (zh)
CN (1) CN108681750A (zh)
SG (1) SG11202006205SA (zh)
TW (1) TWI689871B (zh)
WO (1) WO2019223384A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330054A (zh) * 2020-11-23 2021-02-05 大连海事大学 Decision-tree-based dynamic traveling salesman problem solving method, system and storage medium
CN114841233A (zh) * 2022-03-22 2022-08-02 阿里巴巴(中国)有限公司 Path interpretation method, apparatus, and computer program product

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681750A (zh) 2018-05-21 2018-10-19 阿里巴巴集团控股有限公司 Feature interpretation method and apparatus for a GBDT model
CN109492783B (zh) * 2018-11-14 2023-09-15 中国电力科学研究院有限公司 GBDT-based fault risk prediction method for electric power metering equipment
CN109489212B (zh) * 2018-11-21 2020-05-05 珠海格力电器股份有限公司 Intelligent sleep control method, adjustment system and device for an air conditioner
CN110008349B (zh) * 2019-02-01 2020-11-10 创新先进技术有限公司 Computer-implemented method and apparatus for event risk assessment
CN110084318B (zh) * 2019-05-07 2020-10-02 哈尔滨理工大学 Image recognition method combining a convolutional neural network and gradient boosting trees
CN110457912B (zh) * 2019-07-01 2020-08-14 阿里巴巴集团控股有限公司 Data processing method, apparatus, and electronic device
CN110990829B (zh) * 2019-11-21 2021-09-28 支付宝(杭州)信息技术有限公司 Method, apparatus, and device for training a GBDT model in a trusted execution environment
CN111340121B (zh) * 2020-02-28 2022-04-12 支付宝(杭州)信息技术有限公司 Method and apparatus for determining target features
CN111383028B (zh) * 2020-03-16 2022-11-22 支付宝(杭州)信息技术有限公司 Prediction model training method and apparatus, and prediction method and apparatus
CN111401570B (zh) * 2020-04-10 2022-04-12 支付宝(杭州)信息技术有限公司 Interpretation method and apparatus for a privacy tree model
CN114218994A (zh) * 2020-09-04 2022-03-22 京东科技控股股份有限公司 Method and apparatus for processing information
CN112818228B (zh) * 2021-01-29 2023-08-04 北京百度网讯科技有限公司 Method, apparatus, device, and medium for recommending objects to a user
CN114417822A (zh) * 2022-03-29 2022-04-29 北京百度网讯科技有限公司 Method, apparatus, device, medium, and product for generating model interpretation information
CN115048386B (zh) * 2022-06-28 2024-09-13 支付宝(杭州)信息技术有限公司 Service execution method, apparatus, storage medium, and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611191A (zh) * 2016-07-11 2017-05-03 四川用联信息技术有限公司 Decision tree classifier construction method based on uncertain continuous attributes
US20170213280A1 (en) * 2016-01-27 2017-07-27 Huawei Technologies Co., Ltd. System and method for prediction using synthetic features and gradient boosted decision tree
CN107025154A (zh) * 2016-01-29 2017-08-08 阿里巴巴集团控股有限公司 Disk failure prediction method and apparatus
CN107153977A (zh) * 2016-03-02 2017-09-12 阿里巴巴集团控股有限公司 Credit evaluation method, apparatus, and system for transaction entities in an online trading platform
CN108681750A (zh) * 2018-05-21 2018-10-19 阿里巴巴集团控股有限公司 Feature interpretation method and apparatus for a GBDT model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9449282B2 (en) * 2010-07-01 2016-09-20 Match.Com, L.L.C. System for determining and optimizing for relevance in match-making systems
US20140257924A1 (en) * 2013-03-08 2014-09-11 Corelogic Solutions, Llc Automated rental amount modeling and prediction
US9501716B2 (en) * 2014-12-11 2016-11-22 Intel Corporation Labeling component parts of objects and detecting component properties in imaging data
CN107301577A (zh) * 2016-04-15 2017-10-27 阿里巴巴集团控股有限公司 Training method for a credit evaluation model, credit evaluation method, and apparatus
CN106204063A (zh) * 2016-06-30 2016-12-07 北京奇艺世纪科技有限公司 Paying-user mining method and apparatus
CN106250403A (zh) * 2016-07-19 2016-12-21 北京奇艺世纪科技有限公司 User churn prediction method and apparatus
CN108038539A (zh) * 2017-10-26 2018-05-15 中山大学 Method integrating long short-term memory recurrent neural networks and gradient boosting decision trees

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213280A1 (en) * 2016-01-27 2017-07-27 Huawei Technologies Co., Ltd. System and method for prediction using synthetic features and gradient boosted decision tree
CN107025154A (zh) * 2016-01-29 2017-08-08 阿里巴巴集团控股有限公司 Disk failure prediction method and apparatus
CN107153977A (zh) * 2016-03-02 2017-09-12 阿里巴巴集团控股有限公司 Credit evaluation method, apparatus, and system for transaction entities in an online trading platform
CN106611191A (zh) * 2016-07-11 2017-05-03 四川用联信息技术有限公司 Decision tree classifier construction method based on uncertain continuous attributes
CN108681750A (zh) * 2018-05-21 2018-10-19 阿里巴巴集团控股有限公司 Feature interpretation method and apparatus for a GBDT model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3719704A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330054A (zh) * 2020-11-23 2021-02-05 大连海事大学 Decision-tree-based dynamic traveling salesman problem solving method, system and storage medium
CN112330054B (zh) * 2020-11-23 2024-03-19 大连海事大学 Decision-tree-based dynamic traveling salesman problem solving method, system and storage medium
CN114841233A (zh) * 2022-03-22 2022-08-02 阿里巴巴(中国)有限公司 Path interpretation method, apparatus, and computer program product
CN114841233B (zh) * 2022-03-22 2024-05-31 阿里巴巴(中国)有限公司 Path interpretation method, apparatus, and computer program product

Also Published As

Publication number Publication date
EP3719704A4 (en) 2021-03-17
US11205129B2 (en) 2021-12-21
TWI689871B (zh) 2020-04-01
CN108681750A (zh) 2018-10-19
TW202004559A (zh) 2020-01-16
EP3719704A1 (en) 2020-10-07
SG11202006205SA (en) 2020-07-29
US20200293924A1 (en) 2020-09-17

Similar Documents

Publication Publication Date Title
WO2019223384A1 (zh) Gbdt模型的特征解释方法和装置
US11615341B2 (en) Customizable machine learning models
US11461537B2 (en) Systems and methods of data augmentation for pre-trained embeddings
CN110880019B (zh) 通过无监督域适应训练目标域分类模型的方法
WO2019196210A1 (zh) 数据分析方法、计算机可读存储介质、终端设备及装置
US7809705B2 (en) System and method for determining web page quality using collective inference based on local and global information
CN104869126B (zh) 一种网络入侵异常检测方法
US20080162385A1 (en) System and method for learning a weighted index to categorize objects
CN109960808B (zh) 一种文本识别方法、装置、设备及计算机可读存储介质
US10678888B2 (en) Methods and systems to predict parameters in a database of information technology equipment
WO2021073390A1 (zh) 数据筛选方法、装置、设备及计算机可读存储介质
WO2022179384A1 (zh) 一种社交群体的划分方法、划分系统及相关装置
WO2016095068A1 (en) Pedestrian detection apparatus and method
CN108509492B (zh) 基于房地产行业的大数据处理及系统
CN110909868A (zh) 基于图神经网络模型的节点表示方法和装置
WO2016122575A1 (en) Product, operating system and topic based recommendations
CN114139636B (zh) 异常作业处理方法及装置
US20220327394A1 (en) Learning support apparatus, learning support methods, and computer-readable recording medium
CN112925913B (zh) 用于匹配数据的方法、装置、设备和计算机可读存储介质
EP4293956A1 (en) Method for predicting malicious domains
US20140324524A1 (en) Evolving a capped customer linkage model using genetic models
US20140324523A1 (en) Missing String Compensation In Capped Customer Linkage Model
CN112464101A (zh) 电子书的排序推荐方法、电子设备及存储介质
CN110457543A (zh) 一种基于端到端多视角匹配的实体消解方法和系统
CN111078820A (zh) 基于权重符号社交网络嵌入的边权预测方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19806892

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019806892

Country of ref document: EP

Effective date: 20200629

NENP Non-entry into the national phase

Ref country code: DE