CN112766350A

CN112766350A - Method, device and equipment for constructing two-classification model and computer readable storage medium

Info

Publication number: CN112766350A
Application number: CN202110038163.9A
Authority: CN
Inventors: 吴轶凡; 陈婷; 吴三平; 庄伟亮
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2021-05-07
Anticipated expiration: 2041-01-12
Also published as: CN112766350B

Abstract

The invention discloses a method, an apparatus, and a device for constructing a binary classification model, and a computer-readable storage medium. The method comprises: acquiring the monotonicity relationship between each feature attribute of the training samples and a preset binary classification target; training a gradient boosting tree with the training samples, and pruning, in each decision tree of the gradient boosting tree, the split nodes that do not conform to their corresponding monotonicity relationship; and obtaining a target binary classification model from the pruned gradient boosting tree. The invention thereby incorporates business knowledge into the training of the gradient boosting tree, so that the classification results of the model are more credible.

Description

Method, device and equipment for constructing two-classification model and computer readable storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method, a device and equipment for constructing a binary model and a computer readable storage medium.

Background

The Gradient Boosting Decision Tree (GBDT) is an ensemble learning method that has been widely applied in many fields. A model trained by a gradient boosting tree training method is itself called a gradient boosting tree; it combines a plurality of complementary decision trees to achieve a better effect. When a gradient boosting tree is applied in a specific application scenario to solve a specific business problem, its classification results are credible only if the trained tree is consistent with the business knowledge of that scenario. For example, in a customer credit risk assessment scenario, a common piece of business knowledge is that the more historical overdue payments a customer has, the more likely the customer is a high-risk customer. However, current gradient boosting tree training methods train on the training samples alone, so the resulting tree may contain split nodes that contradict the business knowledge of the application scenario, which lowers the credibility of the model's classification results.

Disclosure of Invention

The main object of the invention is to provide a method, an apparatus, and a device for constructing a binary classification model, and a computer-readable storage medium, so as to solve the technical problem that a gradient boosting tree obtained by existing training methods may contain split nodes that contradict business knowledge, which makes the classification results of the model less credible.

In order to achieve the above object, the present invention provides a method for constructing a two-class model, comprising the following steps:

acquiring monotonicity relations between each characteristic attribute of the training samples and preset two classification targets;

training a gradient boosting tree by using the training samples, and pruning, in each decision tree of the gradient boosting tree, the split nodes which do not conform to the monotonicity relationship corresponding to those split nodes;

and obtaining a target binary classification model according to the pruned gradient boosting tree.

Optionally, the step of training a gradient boosting tree by using the training samples, and pruning, in each decision tree of the gradient boosting tree, the split nodes which do not conform to the monotonicity relationship corresponding to those split nodes, includes:

training a gradient boosting tree by using the training samples, and calculating preset index values corresponding to the nodes in each decision tree of the gradient boosting tree, wherein the preset index is an index having monotonicity with the preset binary classification target;

for a split node in each node, determining whether the split node conforms to the monotonicity relation corresponding to the split node according to the preset index values corresponding to two child nodes of the split node;

and if the split node is determined not to accord with the monotonicity relation corresponding to the split node, pruning the split node.

Optionally, the step of determining, for a split node in the nodes, whether the split node conforms to the monotonicity relationship corresponding to the split node according to the preset index values corresponding to two child nodes of the split node includes:

for a split node in each node, determining a real size relationship between the preset index values corresponding to two child nodes of the split node;

determining a target size relationship which should be possessed between the preset index values corresponding to the two child nodes according to the splitting rule of the split node and the monotonicity relationship corresponding to the characteristic attribute in the splitting rule;

detecting whether the real size relationship is the same as the target size relationship;

if not, determining that the split node does not conform to the monotonicity relation corresponding to the split node;

and if they are the same, determining that the split node conforms to the monotonicity relationship corresponding to the split node.

Optionally, before the step of determining, for a split node in the nodes, a true size relationship between the preset metric values corresponding to two child nodes of the split node, the method further includes:

detecting whether the monotonicity relationship corresponding to the feature attribute in the splitting rule of the split node is one that has monotonicity (monotone increase or monotone decrease);

if so, executing the step of determining the real size relationship between the preset index values corresponding to two child nodes of the split node for the split node in each node;

if not, outputting preset prompt information, and receiving feedback information triggered based on the preset prompt information, wherein the feedback information is used for indicating whether the split node conforms to the monotonicity relation corresponding to the split node.

Optionally, if it is determined that the split node does not conform to the monotonicity relationship corresponding to the split node, the pruning of the split node includes:

and if it is determined that the split node does not conform to the monotonicity relationship corresponding to the split node, setting the value of each leaf under the split node to the preset index value corresponding to the split node, so as to post-prune the split node.
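The post-pruning step above can be illustrated with a minimal sketch. The `TreeNode` class, its field names, and the `post_prune` helper are hypothetical and are introduced only for illustration; the sketch assumes that each node already carries its preset index value and that a leaf is a node with no children.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeNode:
    """Hypothetical decision tree node used only for this illustration."""
    index_value: float                  # preset index value of the node
    value: Optional[float] = None       # leaf value; None for split nodes
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def post_prune(split: TreeNode) -> None:
    """Set every leaf value under `split` to the split node's index value."""
    stack = [split]
    while stack:
        node = stack.pop()
        if node.left is None and node.right is None:
            # Leaf: overwrite its value with the pruned node's index value.
            node.value = split.index_value
        else:
            stack.extend(c for c in (node.left, node.right) if c is not None)


# Example: a split node with one leaf child and one nested split child.
split = TreeNode(
    0.3,
    left=TreeNode(0.5, value=1.2),
    right=TreeNode(0.1, left=TreeNode(0.0, value=-0.4),
                   right=TreeNode(0.2, value=0.3)),
)
post_prune(split)
```

Under this reading the tree structure is retained, but every sample routed through the pruned split node receives the same value, so the split no longer influences the output.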

Optionally, the step of obtaining the target binary classification model according to the pruned gradient boosting tree includes:

converting the predicted values of the leaf nodes of each decision tree in the pruned gradient boosting tree into the preset index values corresponding to the leaf nodes to obtain a decision tree set, wherein the preset index is an index having monotonicity with the preset binary classification target;

and taking each decision tree in the decision tree set as a variable generator of the binary model to be fitted, and fitting the binary model to be fitted by adopting the training sample to obtain a target binary model.

Optionally, the training samples include credit data of each customer, and after the step of obtaining the target binary classification model according to the pruned gradient boosting tree, the method further includes:

inputting credit data of a customer to be assessed into the target binary classification model for processing to obtain a classification result, and determining whether to grant a loan to the customer to be assessed according to the classification result.

In order to achieve the above object, the present invention further provides a binary model building apparatus, including:

the acquisition module is used for acquiring monotonicity relations between each characteristic attribute of the training sample and preset two classification targets;

the pruning module is used for training a gradient boosting tree by using the training samples and pruning, in each decision tree of the gradient boosting tree, the split nodes which do not conform to the monotonicity relationship corresponding to those split nodes;

and the determining module is used for obtaining a target binary classification model according to the pruned gradient boosting tree.

In order to achieve the above object, the present invention further provides a two-class model building device, including: a memory, a processor and a two-class model building program stored on the memory and executable on the processor, the two-class model building program when executed by the processor implementing the steps of the two-class model building method as described above.

Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a two-classification model construction program, which when executed by a processor, implements the steps of the two-classification model construction method as described above.

Existing gradient boosting tree training methods train on the training samples alone and do not incorporate the business knowledge of the specific application scenario, so some split nodes in the trained tree may contradict that business knowledge, and the classification results of the tree are less credible. In contrast, the invention acquires the monotonicity relationships, consistent with business knowledge, between the feature attributes and the binary classification target, prunes the split nodes in the gradient boosting tree that do not conform to their corresponding monotonicity relationship, and obtains the target binary classification model from the pruned tree. The resulting model is therefore consistent with the business knowledge of the application scenario; in other words, business knowledge is incorporated into the training of the gradient boosting tree, and the classification results of the model are more credible.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of the classification model construction method according to the present invention;

FIG. 3 is a functional block diagram of an apparatus for constructing a binary model according to a preferred embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.

It should be noted that, in the embodiment of the present invention, the two-class model building device may be a smart phone, a personal computer, a server, and the like, and is not limited herein.

As shown in fig. 1, the two-classification model building device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the device architecture shown in FIG. 1 does not constitute a limitation of the two-class model building device, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a two-class model building program. The operating system is a program that manages and controls the hardware and software resources of the device, supporting the operation of the binary model builder as well as other software or programs. In the device shown in fig. 1, the user interface 1003 is mainly used for data communication with a client; the network interface 1004 is mainly used for establishing communication connection with a server; and the processor 1001 may be configured to call the binary model builder stored in the memory 1005 and perform the following operations:

Further, the step of training a gradient boosting tree by using the training samples, and pruning, in each decision tree of the gradient boosting tree, the split nodes which do not conform to the monotonicity relationship corresponding to those split nodes, includes:

Further, the step of determining, for a split node in the nodes, whether the split node conforms to the monotonicity relationship corresponding to the split node according to the preset index values corresponding to two child nodes of the split node includes:

Further, before the step of determining, for a split node of the nodes, a real size relationship between the preset metric values corresponding to two child nodes of the split node, the processor 1001 may be further configured to invoke a binary model building program stored in the memory 1005, and perform the following operations:

Further, if it is determined that the split node does not conform to the monotonicity relationship corresponding to the split node, the pruning of the split node includes:

Further, the step of obtaining the target binary classification model according to the pruned gradient boosting tree includes:

Further, the training samples comprise credit data of each client, and after the step of obtaining the target two-class model according to the pruned gradient-boosted tree, the processor 1001 may be further configured to call a two-class model building program stored in the memory 1005, and perform the following operations:

Based on the structure, various embodiments of the two-classification model construction method are provided.

Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the method for constructing the two-class model according to the present invention.

While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown. The execution subject of each embodiment of the method for constructing the two-class model can be a smart phone, a personal computer, a server and other devices, and for convenience of description, the following embodiments use the modeling device as the execution subject for explanation. In this embodiment, the method for constructing the binary model includes:

step S10, acquiring monotonicity relations between each characteristic attribute of the training sample and a preset two-classification target;

In this embodiment, the training samples used to train the binary classification model may be uploaded to the modeling device in advance by a modeler. The training samples comprise a plurality of sample data items; each item contains an attribute value for each feature attribute, where a feature attribute is an input variable of the binary classification model and the attribute value is the value of that variable. Each item also carries a binary classification label, which is set according to the binary classification target. The binary classification target is the preset output of the binary classification model. Different application scenarios have different binary classification targets, and accordingly different feature attributes and labels. For example, in a customer credit risk assessment scenario where the task is to predict whether a customer is high risk, the binary classification target may be set as the probability that the customer is a high-risk customer; the feature attributes may include credit-risk-related attributes such as the customer's age, education level, number of historical loans, and number of historical overdue payments; each sample data item holds one customer's attribute values for these feature attributes, and its label indicates whether that customer is a high-risk customer.

For each feature attribute of the training samples, the modeling device may acquire the monotonicity relationship between that feature attribute and the binary classification target. In one embodiment, there may be two monotonicity relationships between a feature attribute and the binary classification target: monotone increase and monotone decrease. Specifically, taking the feature attribute as the independent variable of a function and the binary classification target as the dependent variable, the relationship is monotone increase if the function is monotonically increasing, and monotone decrease if the function is monotonically decreasing. In another embodiment, there may be three monotonicity relationships: in addition to monotone increase and monotone decrease, there is also no monotonicity; that is, if the function is not monotonic, the relationship between the feature attribute and the binary classification target is one of no monotonicity.

The modeler can set, in the modeling device, the monotonicity relationship between each feature attribute and the binary classification target according to the business knowledge of the application scenario and the data forms of the sample data and of the binary classification target; the modeling device then acquires the relationships set by the modeler at modeling time. For example, in the customer credit risk assessment scenario, business knowledge of the credit business indicates that the more historical overdue payments a customer has, the greater the probability that the customer is a high-risk customer. Because the binary classification target takes the form of the probability that the customer is a high-risk customer, the relationship between the feature attribute "number of historical overdue payments" and that probability is monotone increase. If the binary classification target were instead set as the probability that the customer is not a high-risk customer, the relationship between this feature attribute and the target would be monotone decrease.
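As a minimal sketch of how the monotonicity relationships of step S10 might be represented in the modeling device, the following uses an enumeration with the three relationships described above. The feature names, the `MONOTONICITY` mapping, and both helper functions are hypothetical and serve only as an illustration.

```python
from enum import Enum


class Monotonicity(Enum):
    """Relationship between a feature attribute and the binary target."""
    INCREASING = "increasing"  # target rises as the attribute rises
    DECREASING = "decreasing"  # target falls as the attribute rises
    NONE = "none"              # no monotonicity


# Hypothetical settings a modeler might enter for a credit-risk scenario in
# which the target is the probability that a customer is high risk.
MONOTONICITY = {
    "historical_overdue_count": Monotonicity.INCREASING,
    "age": Monotonicity.NONE,
}


def get_monotonicity(feature):
    """Return the configured relationship, or None if none was set."""
    return MONOTONICITY.get(feature)


def flip_for_complement_target(relation):
    """Relationship with the complementary target, e.g. P(not high risk)."""
    if relation is Monotonicity.INCREASING:
        return Monotonicity.DECREASING
    if relation is Monotonicity.DECREASING:
        return Monotonicity.INCREASING
    return relation
```

Returning `None` for an unconfigured feature keeps "no relationship was set" distinct from "the relationship is no monotonicity", which is a design choice of this sketch.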

Step S20, training a gradient boosting tree by using the training samples, and pruning, in each decision tree of the gradient boosting tree, the split nodes which do not conform to the monotonicity relationship corresponding to those split nodes;

After detecting a modeling instruction, the modeling device may start modeling and train a gradient boosting tree with the training samples. The tree may be trained according to an existing gradient boosting tree training method, which is not described in detail in this embodiment. During training, or after the gradient boosting tree has been obtained, the split nodes in each decision tree that do not conform to their corresponding monotonicity relationship may be pruned; that is, if a split node is detected not to conform to its corresponding monotonicity relationship, that split node is pruned. A split node is any node in a decision tree other than a leaf node. Pruning during training is pre-pruning and may follow an existing decision tree pre-pruning method; pruning after training is post-pruning and may follow an existing decision tree post-pruning method. Neither is described in detail in this embodiment.

The monotonicity relationship corresponding to a split node is the monotonicity relationship between the feature attribute in the split node's splitting rule and the binary classification target. There are various ways to determine whether a split node conforms to its corresponding monotonicity relationship. In one embodiment, this may be determined by checking whether the distribution of the sample data obtained by splitting according to the node's splitting rule conforms to the monotonicity relationship of the feature attribute in that rule. For example, suppose the splitting rule of a split node divides the sample data falling into the node using N historical overdue payments as the threshold. After splitting, the number of samples labeled as high-risk customers among the samples with more than N historical overdue payments is counted, as is the number of samples labeled as high-risk customers among the samples with no more than N. The monotonicity relationship between the number of historical overdue payments and the binary classification target is monotone increase, so the former count should be larger than the latter. If the counts show otherwise, the sample data distribution after splitting does not conform to the monotonicity relationship, and the split node is determined not to conform to its corresponding monotonicity relationship.
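The count-based check in the example above can be sketched as follows. The function name, the sample layout (a list of attribute dictionary and label pairs, with label 1 meaning high risk), and the strict comparison on ties are assumptions made for this illustration.

```python
def split_conforms_by_counts(samples, feature, threshold, relation):
    """Check a threshold split against a monotonicity relationship.

    samples:  list of (attributes dict, label) pairs; label 1 = high risk.
    relation: "increasing" or "decreasing".
    """
    # High-risk samples whose attribute value is above the threshold.
    above = sum(label for attrs, label in samples if attrs[feature] > threshold)
    # High-risk samples whose attribute value is at or below the threshold.
    not_above = sum(label for attrs, label in samples if attrs[feature] <= threshold)
    if relation == "increasing":
        return above > not_above
    if relation == "decreasing":
        return above < not_above
    raise ValueError("relation must be 'increasing' or 'decreasing'")


# Hypothetical samples: (attribute values, label), threshold N = 2.
samples = [
    ({"overdue": 5}, 1),
    ({"overdue": 4}, 0),
    ({"overdue": 1}, 1),
    ({"overdue": 0}, 1),
    ({"overdue": 2}, 0),
]
conforms = split_conforms_by_counts(samples, "overdue", 2, "increasing")
```

Here one high-risk sample lies above the threshold and two lie at or below it, so the split contradicts a monotone increase relationship and would be a candidate for pruning.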

In another embodiment, a split node whose splitting-rule feature attribute has no monotonicity with the binary classification target may be treated as an uncertain node, that is, a node for which it is uncertain whether it conforms to its corresponding monotonicity relationship. For uncertain nodes and/or split nodes determined not to conform to their monotonicity relationship, the modeling device may output data related to the node, so that the modeler can decide from an analysis of the data whether to prune the node (that is, whether the node conforms to its corresponding monotonicity relationship) and enter feedback information in the modeling device indicating whether to prune; the modeling device then prunes or keeps the node according to the feedback. The output data may include the node's splitting rule, the number of samples falling into the node, the number of positive samples, the number of negative samples, and the like; data for the node's parent and child nodes, or for all nodes in the decision tree containing the node, may also be output.

It should be noted that, in some embodiments, only the monotonicity relationship between part of the feature attributes and the classification targets may be obtained, and for a split node in which the feature attributes in the splitting rule do not have a corresponding monotonicity relationship, the detection on whether the feature attributes conform to the monotonicity relationship may not be performed, that is, the pruning may not be performed.

Step S30, obtaining a target binary classification model according to the pruned gradient boosting tree.

The gradient boosting tree obtained after training and pruning may be used directly as the target binary classification model, or the target binary classification model may be obtained by refitting on the basis of that tree. For example, in one embodiment, if the split nodes are pruned by a pre-pruning method, pruning takes place while the tree is being built and does not affect the classification accuracy of the model, so the gradient boosting tree can be used directly as the target binary classification model. In another embodiment, if the split nodes are pruned by a post-pruning method, pruning takes place after the tree has been built, and removing nodes affects the classification accuracy of the model; to preserve accuracy, the target binary classification model may be obtained by refitting, on the basis of the trained and pruned gradient boosting tree, with a method such as logistic regression or a neural network.

Existing gradient boosting tree training methods train on the training samples alone and do not incorporate the business knowledge of the specific application scenario, so some split nodes in the trained tree may contradict that business knowledge, and the classification results of the tree are less credible. In contrast, this embodiment acquires the monotonicity relationships, consistent with business knowledge, between the feature attributes and the binary classification target, prunes the split nodes in the gradient boosting tree that do not conform to their corresponding monotonicity relationship, and obtains the target binary classification model from the pruned tree. The resulting model is therefore consistent with the business knowledge of the application scenario; in other words, business knowledge is incorporated into the training of the gradient boosting tree, and the classification results of the model are more credible.

Further, based on the first embodiment, a second embodiment of the method for constructing a binary classification model according to the present invention is provided, and in this embodiment, the step S20 includes:

Step S201, training a gradient boosting tree by using the training samples, and calculating preset index values corresponding to the nodes in each decision tree of the gradient boosting tree, wherein the preset index is an index having monotonicity with the preset binary classification target;

In this embodiment, the gradient boosting tree is trained by using the training samples, and the preset index value corresponding to each node in each decision tree of the gradient boosting tree can be calculated during the training process or after the gradient boosting tree is obtained by training.

The preset index value is the value of the preset index, and the preset index is an index having monotonicity with the binary classification target. Whether an index has monotonicity with the binary classification target may be determined in the same way as the monotonicity relationship between a feature attribute and the target in the first embodiment: if the relationship between the index and the target is monotone increase or monotone decrease, the index has monotonicity with the target and may be used as the preset index. For example, the preset index may be the positive sample rate or the negative sample rate among the samples falling into a node, but is not limited to these two. In one embodiment, in a customer credit risk assessment scenario where the binary classification target is the probability that a customer is a high-risk customer, the preset index may be set as the node's sample bad-customer rate, that is, the proportion of samples labeled as high-risk customers among the samples falling into the node. According to business knowledge, the higher a node's sample bad-customer rate, the higher the probability that a sample falling into the node is a high-risk customer, so the sample bad-customer rate has monotonicity with the binary classification target, and the relationship is monotone increase.
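As a small illustration of the sample bad-customer rate described above, the following sketch computes the proportion of samples labeled as high-risk customers among the samples falling into a node. The function name and the label encoding (1 for high risk, 0 otherwise) are assumptions.

```python
def bad_customer_rate(labels):
    """Proportion of samples labeled high risk (1) among a node's samples."""
    if not labels:
        raise ValueError("the node contains no samples")
    return sum(labels) / len(labels)


# Hypothetical node holding four samples, two of them labeled high risk.
rate = bad_customer_rate([1, 0, 0, 1])
```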

In one embodiment, if post-pruning is used, that is, the preset index value of each node is calculated after the gradient boosting tree has been trained, the nodes may be traversed bottom-up through the tree structure and their preset index values calculated in turn; in other embodiments the traversal may be top-down or in another order. In another embodiment, if pre-pruning is used, that is, the preset index value of a node is calculated while the gradient boosting tree is being trained, the value may be calculated at the point where it is needed to determine whether the node's parent conforms to its corresponding monotonicity relationship. Alternatively, once the splitting rule of a node to be split has been computed, the preset index values of its two prospective child nodes may be calculated from that rule and used for the subsequent judgment without actually creating the two child nodes.
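The bottom-up calculation for the post-pruning case might look like the following sketch, which uses the positive sample rate as the preset index. The `Node` class, its fields, and the `annotate` function are hypothetical; the sketch assumes each leaf stores the labels of the training samples that fall into it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node; leaves hold the labels of their samples."""
    labels: List[int] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    index_value: Optional[float] = None


def annotate(node: Node) -> Tuple[int, int]:
    """Post-order traversal that sets index_value on every node.

    Returns (positive count, total count) for the samples under `node`.
    """
    if node.left is None and node.right is None:
        positives, total = sum(node.labels), len(node.labels)
    else:
        # Children are processed first, so values propagate bottom-up.
        left_pos, left_total = annotate(node.left)
        right_pos, right_total = annotate(node.right)
        positives, total = left_pos + right_pos, left_total + right_total
    node.index_value = positives / total if total else None
    return positives, total


root = Node(left=Node(labels=[1, 1, 0, 0]), right=Node(labels=[0, 0, 0, 1]))
counts = annotate(root)
```

Aggregating counts rather than averaging the children's rates keeps a parent's value correct when its children hold different numbers of samples.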

Step S202, for the split node in each node, determining whether the split node accords with the monotonicity relationship corresponding to the split node according to the preset index values corresponding to two child nodes of the split node;

the nodes of the decision tree comprise split nodes and leaf nodes, and only the split nodes are detected whether to accord with the corresponding monotonicity relation. For a split node, whether the split node conforms to the monotonicity relationship corresponding to the split node can be determined according to preset index values corresponding to two child nodes of the split node.

In one embodiment, the threshold range assigned to each of the left and right child nodes when a splitting rule involves a threshold comparison may be fixed in advance for training the gradient boosting tree, that is, which child receives the larger threshold range and which receives the smaller. With this convention fixed, when the modeler sets the monotonicity relationship between a feature attribute and the binary classification target, the modeler can directly set the size relationship that the preset index values of the left and right child nodes of a split node on that feature attribute should satisfy, and this size relationship represents the monotonicity relationship. For example, suppose the binary classification target is the probability that a customer is a high-risk customer, the preset index is the sample bad-customer rate, and the left child node is specified in advance to receive the larger threshold range. If the modeler determines that the relationship between a feature attribute and the target is monotone increase, the modeler can directly specify that, for a split node on that feature attribute, the sample bad-customer rate of the left child node should be greater than that of the right child node. Then, when judging whether a split node conforms to its corresponding monotonicity relationship, the preset index values of its two child nodes are compared, and the result is checked against the size relationship set for the split node's feature attribute: if they are the same, the split node conforms to its corresponding monotonicity relationship; if not, it does not. It should be noted that how to treat the case where the preset index values of the left and right child nodes are equal may be set according to the specific situation.

Further, in another embodiment, the step S202 includes:

step S2021, determining a real size relationship between the preset index values corresponding to two child nodes of the split node for the split node in each node;

In the present embodiment, the monotonicity relationship is expressed as monotone increase or monotone decrease. For a split node, the calculated preset index values corresponding to its two child nodes are acquired and compared to obtain the real size relationship between them.
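As a minimal sketch of step S2021, the following assumes that the preset index is the sample bad client rate and that each child node is represented simply by the 0/1 labels of the training samples that fall into it (1 marking a bad client). The names `bad_rate` and `real_size_relation` and the three string results are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def bad_rate(labels: Sequence[int]) -> float:
    """Share of samples labelled 1 (bad client) among the samples in a node."""
    if not labels:
        raise ValueError("node contains no samples")
    return sum(labels) / len(labels)


def real_size_relation(left_labels: Sequence[int], right_labels: Sequence[int]) -> str:
    """Compare the preset index values of the left and right child nodes."""
    left = bad_rate(left_labels)
    right = bad_rate(right_labels)
    if left > right:
        return "left_greater"
    if left < right:
        return "left_smaller"
    return "equal"
```

The "equal" result is kept separate so that a caller can treat ties according to the specific situation.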

Step S2022, determining a target size relationship which should be possessed between the preset index values corresponding to the two child nodes according to the splitting rule of the split node and the monotonicity relationship corresponding to the characteristic attribute in the splitting rule;

The target size relationship that should hold between the preset index values corresponding to the two child nodes is determined according to the splitting rule of the split node and the monotonicity relationship corresponding to the characteristic attribute in the splitting rule. Specifically, the splitting rule specifies how the threshold ranges are divided between the two child nodes, and the monotonicity relationship corresponding to the characteristic attribute in the splitting rule is either monotone increase or monotone decrease. The possible threshold range divisions are combined with the two monotonicity relationships, and a target size relationship is preset for each combination; when a split node is judged, its target size relationship is looked up according to the combination of its threshold range division and its monotonicity relationship. There are two ways of dividing the threshold ranges: in one, the left child node receives the larger threshold range (called left-large for short); in the other, the right child node receives the larger threshold range (called right-large for short). Combined with the two monotonicity relationships, this gives four cases, whose target size relationships are as follows: 1. for the combination of left-large and monotone increase, the preset index value of the left child node should be larger than that of the right child node; 2. for the combination of left-large and monotone decrease, the preset index value of the left child node should be smaller than that of the right child node; 3. for the combination of right-large and monotone increase, the preset index value of the left child node should be smaller than that of the right child node; 4. for the combination of right-large and monotone decrease, the preset index value of the left child node should be larger than that of the right child node.
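The four preset combinations can be held in a small lookup table. The sketch below is illustrative only; the key and value strings and the names `TARGET_RELATION` and `target_size_relation` are assumptions, not terms used by the embodiment.

```python
# Keys: (threshold range division, monotonicity relationship).
# Values: relation the left child's preset index value should have to the right child's.
TARGET_RELATION = {
    ("left_large", "increasing"): "left_greater",
    ("left_large", "decreasing"): "left_smaller",
    ("right_large", "increasing"): "left_smaller",
    ("right_large", "decreasing"): "left_greater",
}


def target_size_relation(threshold_division: str, monotonicity: str) -> str:
    """Look up the target size relationship for one split node."""
    try:
        return TARGET_RELATION[(threshold_division, monotonicity)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {threshold_division!r}, {monotonicity!r}"
        ) from None
```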

Step S2023, detecting whether the real size relationship is the same as the target size relationship;

step S2024, if they are not the same, determining that the split node does not conform to the monotonicity relation corresponding to the split node;

step S2025, if they are the same, determining that the split node conforms to the monotonicity relation corresponding to the split node.

After the real size relation and the target size relation of the preset index values of the two child nodes are determined, whether the two relations are the same is detected. If they are not the same, the split node is determined not to conform to its corresponding monotonicity relation; if they are the same, the split node is determined to conform to its corresponding monotonicity relation.
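The comparison in steps S2023 to S2025 can be sketched compactly by noting that the four combinations reduce to a single rule: the left child should have the larger preset index value exactly when "the left child receives the larger threshold range" and "the relation is monotone increase" are both true or both false. The function name `conforms` and the `equal_conforms` flag are hypothetical; how ties are treated is an assumption, since the text leaves equal values to the specific situation.

```python
def conforms(left_index: float, right_index: float,
             left_is_large: bool, increasing: bool,
             equal_conforms: bool = True) -> bool:
    """Return True if the real size relation matches the target size relation."""
    if left_index == right_index:
        # Assumption: tie handling is configurable.
        return equal_conforms
    # Target relation: the left value should be greater when the two flags agree.
    target_left_greater = (left_is_large == increasing)
    real_left_greater = left_index > right_index
    return real_left_greater == target_left_greater
```

For example, `conforms(0.30, 0.10, left_is_large=True, increasing=True)` reports conformity, whereas the same values with `increasing=False` mark the split node as a candidate for pruning.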

Step S203, if the split node is determined not to be in accordance with the monotonicity relation corresponding to the split node, pruning is carried out on the split node.

If it is determined that the split node does not conform to its corresponding monotonicity relation, the split node can be pruned. In one embodiment, if a pre-pruning method is used, the split node may be treated as a leaf node and not split further, after which training of the gradient lifting tree continues.
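A minimal sketch of the pre-pruning variant follows, assuming a simple in-memory tree representation. The `Node` class and the `pre_prune` function are hypothetical and stand in for whatever tree structure the training procedure actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float                       # value used if the node ends up as a leaf
    feature: Optional[str] = None      # characteristic attribute of the splitting rule
    threshold: Optional[float] = None  # threshold of the splitting rule
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def pre_prune(node: Node, split_conforms: bool) -> Node:
    """Keep a candidate split only if it conforms; otherwise make the node a leaf."""
    if not split_conforms:
        node.feature = None
        node.threshold = None
        node.left = None
        node.right = None
    return node
```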

Further, the step S203 includes:

step S2031, if it is determined that the split node does not conform to the monotonicity relationship corresponding to the split node, setting each leaf value under the split node as the preset index value corresponding to the split node, so as to perform post pruning on the split node.

In an embodiment, if a post-pruning manner is adopted to prune a splitting node, when it is determined that the splitting node does not conform to the monotonicity relationship corresponding to the splitting node, the leaf values of all the leaves under the splitting node may be set as the preset index value of the splitting node, so as to implement post-pruning of the splitting node. That is, in fact, pruning is not really performed on the splitting node, but the effect equivalent to pruning is realized by setting the leaf values below the splitting node as the preset index value of the node, and the processing efficiency is improved because the leaf values only need to be re-assigned without changing the structure of the tree.
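The post-pruning by re-assignment described above can be sketched as a walk over the subtree that overwrites leaf values while leaving the structure untouched. The `Node` class and its field names are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    index_value: float                  # preset index value of this node
    leaf_value: Optional[float] = None  # set only on leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def post_prune(split_node: Node) -> None:
    """Set every leaf value under split_node to split_node's preset index value."""
    stack = [split_node]
    while stack:
        current = stack.pop()
        if current.left is None and current.right is None:
            current.leaf_value = split_node.index_value
            continue
        for child in (current.left, current.right):
            if child is not None:
                stack.append(child)
```

After the call, every sample routed into the subtree receives the same value, which is the same outcome as if the split node had been replaced by a single leaf.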

Further, before the step S2021, the method further includes:

step S2026, detecting whether the monotonicity relation corresponding to the characteristic attribute in the splitting rule of the split node is a monotonic relation;

In one embodiment, the monotonicity relationship between a characteristic attribute and the classification target may be one of three types: monotone increase, monotone decrease and no monotonicity, of which monotone increase and monotone decrease are monotonic relations. When judging whether to prune a split node, it may first be detected whether the monotonicity relationship corresponding to the characteristic attribute in the splitting rule of the split node is monotonic, that is, whether it is something other than no monotonicity. If it is monotone increase or monotone decrease, the relationship is determined to be monotonic; otherwise, it is determined to be non-monotonic.

Step S2027, if yes, executing the step of determining, for the split node in each node, a real size relationship between the preset index values corresponding to two child nodes of the split node;

if the relationship is monotonic, step S2021 to step S2025 can be performed. That is, only when the monotonicity relationship corresponding to the split node is monotonously increased or monotonously decreased, the preset index value comparison of the child nodes of the split node is adopted to determine whether the split node conforms to the monotonicity relationship corresponding to the split node.

Step S2028, if not, outputting preset prompt information, and receiving feedback information triggered based on the preset prompt information, wherein the feedback information is used for indicating whether the split node conforms to the monotonicity relation corresponding to the split node.

If the relationship is not monotonic, that is, if the characteristic attribute in the splitting rule of the split node has no monotonicity with respect to the two-classification target, preset prompt information can be output; specifically, it can be output to a display screen of the modeling device for the modeler to review. The modeler can then trigger feedback information based on the preset prompt information; the modeling device receives the feedback information and determines from it whether the split node conforms to its corresponding monotonicity relation. The data form of the feedback information is not limited, as long as it can indicate whether the split node conforms to the monotonicity relation corresponding to the split node. The preset prompt information may include the splitting rule of the split node, the number of samples falling into the split node, the number of positive samples, the number of negative samples and other related data; it may also include related data of the parent node and child nodes of the split node, or related data of all nodes in the decision tree in which the split node is located.
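The branching of steps S2026 to S2028 can be sketched as a small dispatcher: split nodes with a monotonic relation are checked automatically, and the others are referred to the modeler. The callback `ask_modeler`, the dictionary `prompt_info` and the string codes for the three relation types are hypothetical stand-ins for the prompt and feedback mechanism, whose concrete form the text leaves open.

```python
from typing import Callable, Dict


def node_conforms(monotonicity: str,
                  left_index: float, right_index: float,
                  left_is_large: bool,
                  prompt_info: Dict[str, object],
                  ask_modeler: Callable[[Dict[str, object]], bool]) -> bool:
    """Decide conformity automatically, or from modeler feedback if non-monotonic."""
    if monotonicity == "none":
        # Step S2028: output the prompt information and use the returned feedback.
        return ask_modeler(prompt_info)
    if monotonicity not in ("increasing", "decreasing"):
        raise ValueError(f"unknown monotonicity relation: {monotonicity!r}")
    if left_index == right_index:
        return True  # assumption: ties are treated as conforming
    # Step S2027: compare the real size relation with the target size relation.
    target_left_greater = (left_is_large == (monotonicity == "increasing"))
    return (left_index > right_index) == target_left_greater
```

In an interactive setting `ask_modeler` might display the splitting rule and sample counts and wait for a yes/no answer; in a test it can be replaced by a plain function.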

In this embodiment, relevant data of split nodes for which conformity with the monotonicity relation is uncertain are output in the form of preset prompt information, so that the modeler can take part directly in the modeling process and, based on the modeler's business knowledge of the specific application scenario, eliminate the nodes in the gradient lifting tree that do not accord with business cognition. The classification results of the trained two-classification model therefore accord better with business cognition and have higher credibility.

Further, based on the first and/or second embodiment, a third embodiment of the method for constructing a two-classification model according to the present invention is provided. In this embodiment, the step S30 includes:

step S301, converting the predicted value of the leaf node of each decision tree in the pruned gradient lifting tree into a preset index value corresponding to the leaf node to obtain a decision tree set, wherein the preset index is an index having monotonicity with the preset binary classification target;

in the current gradient lifting tree, the prediction residual of the previous tree is used as a learning target, and the prediction residual is not directly connected with a classification target, so that a modeling worker cannot analyze the influence of the characteristic attribute of each split node in the decision tree on a classification result, namely, the trained gradient lifting tree does not have interpretability.

To solve this problem, in this embodiment, the predicted value of each leaf node of each decision tree in the pruned gradient lifting tree may be converted into the preset index value corresponding to that leaf node, and the set formed by the converted decision trees is used as the decision tree set. The preset index is an index having monotonicity with the preset two-classification target; for its definition, reference may be made to the second embodiment.

Because the preset index value is an index which has monotonicity with the classification target, a modeling worker can analyze the influence of the characteristic attribute of each split node on the classification result, and the gradient lifting tree has interpretability.
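As an illustration of the conversion in step S301, the sketch below assumes that the preset index is the sample bad client rate and that, for one decision tree, the leaf reached by each training sample is already known. The function name `leaf_index_values` and the use of string leaf identifiers are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def leaf_index_values(leaf_ids: Sequence[Hashable],
                      labels: Sequence[int]) -> Dict[Hashable, float]:
    """Map each leaf to the bad rate of the training samples that fall into it."""
    if len(leaf_ids) != len(labels):
        raise ValueError("leaf_ids and labels must have the same length")
    totals = defaultdict(int)
    bads = defaultdict(int)
    for leaf, label in zip(leaf_ids, labels):
        totals[leaf] += 1
        bads[leaf] += label
    return {leaf: bads[leaf] / totals[leaf] for leaf in totals}
```

Replacing each leaf's predicted residual with the value from this mapping gives leaf outputs that can be read directly as bad rates.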

Step S302, each decision tree in the decision tree set is used as a variable generator of the binary model to be fitted, and the training sample is adopted to fit the binary model to be fitted to obtain a target binary model.

In this embodiment, after the transformation to obtain the decision tree set, each decision tree in the decision tree set is used as a variable generator of the to-be-fitted binary model, and the to-be-fitted binary model is fitted (which may also be referred to as training or learning) by using the training sample to obtain the target binary model. That is, the binary models to be fitted may include each decision tree and one fitting model, and the fitting model takes the output of each decision tree as an input variable. The fitting model can be a model such as a logistic regression or a neural network, and the output result of the fitting model is set as the output related to the binary targets. Model parameters in the fitting model are continuously updated in the process of fitting the to-be-fitted two-class model by adopting the training samples, and the accuracy of the two-class result output by the model meets the set requirement through multiple rounds of iterative updating, so that the target two-class model consisting of each decision tree and the fitting model is obtained. It should be noted that the method for fitting the model may be according to an existing model fitting method, and details are not described in this embodiment.
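A minimal sketch of step S302 follows, assuming logistic regression as the fitting model and plain batch gradient descent as the fitting method. Each converted decision tree is represented as a function from a sample to the preset index value of the leaf the sample reaches. All names, the learning rate and the number of epochs are illustrative assumptions, not details given by the embodiment.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, float]
Tree = Callable[[Sample], float]  # returns the leaf's preset index value


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tree_outputs(trees: Sequence[Tree], sample: Sample) -> List[float]:
    """Use every decision tree as a variable generator for one sample."""
    return [tree(sample) for tree in trees]


def fit_logistic(rows: Sequence[Sequence[float]], labels: Sequence[int],
                 learning_rate: float = 0.5,
                 epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias of a logistic regression by batch gradient descent."""
    n, width = len(rows), len(rows[0])
    weights, bias = [0.0] * width, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * width, 0.0
        for row, label in zip(rows, labels):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def predict_probability(trees: Sequence[Tree], weights: Sequence[float],
                        bias: float, sample: Sample) -> float:
    """Probability output by the combined model (trees plus fitting model)."""
    row = tree_outputs(trees, sample)
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
```

A usage pattern would be to build `rows` with `tree_outputs` for every training sample, call `fit_logistic`, and then score new samples with `predict_probability`.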

Further, in an embodiment, if pruning is performed by using a post-pruning method, the predicted values of the leaves of each decision tree in the gradient lifting tree may be converted first, and then pruning is performed, and after pruning, the fitting of the to-be-fitted binary model may be performed directly without converting the predicted values of the leaves.

Further, based on the first, second and/or third embodiments, a fourth embodiment of the method for constructing a two-classification model according to the present invention is provided. In this embodiment, the training samples include credit data of each customer, and after step S40, the method further includes:

Step S50, inputting credit data of a customer to be assessed into the target two-classification model for processing to obtain a classification result, and determining, according to the classification result, whether to grant a loan to the customer to be assessed.

In this embodiment, the specific application scenario may be a customer credit risk assessment scenario, the training sample includes credit data of a plurality of customers, the binary target may be a probability that the customer belongs to a high-risk customer, and after the target binary model is obtained according to the binary model construction method in the above embodiment, the credit data of the customer to be assessed may be input to the target binary model for processing, so as to obtain a classification result, where the classification result indicates whether the customer is a high-risk customer. After the classification result is obtained, the modeling equipment can output the classification result so that a service staff handling the loan service can determine whether the loan is needed to be made for the customer to be evaluated according to the classification result. Or after the classification result is obtained, the modeling device can directly determine whether to loan the customer to be evaluated according to the classification result, and output the determination result, specifically to a terminal corresponding to a service staff or a terminal corresponding to the customer. 
In the embodiment, in a client credit risk assessment scene, a monotonicity relation which accords with business cognition between the characteristic attribute of a client and a binary target is obtained, a division node which does not accord with the monotonicity relation corresponding to the division node in a gradient lifting tree is pruned, and a target binary model is obtained according to the pruned gradient lifting tree, so that the finally obtained binary model accords with the business cognition of the client credit risk assessment scene, the result of the target binary model has higher credibility, the result output by the model can be directly used as a basis for loan of the client, a manual judgment program is saved, and the approval efficiency of the loan business is improved.

In addition, an embodiment of the present invention further provides a binary model building apparatus, and with reference to fig. 3, the apparatus includes:

the obtaining module 10 is configured to obtain a monotonicity relation between each feature attribute of the training samples and a preset two-classification target;

a pruning module 20, configured to train a gradient lifting tree using the training samples, and prune a split node that does not conform to the monotonicity relationship corresponding to the split node in each decision tree of the gradient lifting tree;

and the determining module 30 is used for obtaining a target binary classification model according to the pruned gradient lifting tree.

Further, the pruning module 20 comprises:

the calculation unit is used for training a gradient lifting tree by adopting the training sample and calculating a preset index value corresponding to each node in each decision tree of the gradient lifting tree, wherein the preset index is an index having monotonicity with the preset binary classification target;

a determining unit, configured to determine, for a split node in each node, whether the split node conforms to the monotonicity relationship corresponding to the split node according to the preset index values corresponding to two child nodes of the split node;

and the pruning unit is used for pruning the splitting node if the splitting node is determined not to be in accordance with the monotonicity relation corresponding to the splitting node.

Further, the determining unit includes:

a first determining subunit, configured to determine, for a split node in the nodes, a true size relationship between the preset index values corresponding to two child nodes of the split node;

a second determining subunit, configured to determine, according to the splitting rule of the split node and the monotonicity relationship corresponding to the feature attribute in the splitting rule, a target size relationship that should be possessed between the preset index values corresponding to the two child nodes;

the first detection subunit is used for detecting whether the real size relationship is the same as the target size relationship;

a third determining subunit, configured to determine that the split node does not conform to the monotonicity relationship corresponding to the split node if the real size relationship is not the same as the target size relationship;

and the fourth determining subunit is configured to determine that the split node conforms to the monotonicity relationship corresponding to the split node if the real size relationship is the same as the target size relationship.

Further, the determining unit further includes:

the second detection subunit is configured to detect whether the monotonicity relationship corresponding to the feature attribute in the splitting rule of the split node is a monotonicity relationship;

the first determining subunit is further configured to, if yes, execute the step of determining, for a split node in the nodes, a true size relationship between the preset index values corresponding to two child nodes of the split node;

and the output subunit is used for outputting preset prompt information if not, and receiving feedback information triggered based on the preset prompt information, wherein the feedback information is used for indicating whether the split node conforms to the monotonicity relation corresponding to the split node.

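Further, the pruning unit is further configured to:

if it is determined that the split node does not conform to the monotonicity relationship corresponding to the split node, set each leaf value under the split node as the preset index value corresponding to the split node, so as to perform post-pruning on the split node.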

Further, the determining module 30 includes:

the conversion unit is used for converting the predicted values of the leaf nodes of each decision tree in the pruned gradient lifting tree into preset index values corresponding to the leaf nodes to obtain a decision tree set, wherein the preset index is an index which has monotonicity with the preset classification target;

and the fitting unit is used for taking each decision tree in the decision tree set as a variable generator of the to-be-fitted binary model, and fitting the to-be-fitted binary model by adopting the training sample to obtain a target binary model.

Further, the apparatus further comprises:

and the classification module is used for inputting credit data of a customer to be evaluated into the target two-classification model for processing to obtain a classification result, and determining, according to the classification result, whether to grant a loan to the customer to be evaluated.

The specific implementation of the two-classification model building device of the present invention is basically the same as the above embodiments of the two-classification model building method, and is not described herein again.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a two-class model building program is stored on the storage medium, and when being executed by a processor, the two-class model building program implements the steps of the two-class model building method as described above.

The embodiments of the binary model construction device and the computer-readable storage medium of the present invention can refer to the embodiments of the binary model construction method of the present invention, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for constructing a binary model, the method comprising the steps of:

2. The method for constructing a two-class model according to claim 1, wherein the step of training a gradient lifting tree by using the training samples and pruning split nodes which do not conform to the monotonicity relationship corresponding to the split nodes in each decision tree of the gradient lifting tree comprises:

3. The method for constructing a two-class model according to claim 2, wherein the step of determining, for a split node among the nodes, whether the split node conforms to the monotonicity relationship corresponding to itself according to the preset index values corresponding to two child nodes of the split node comprises:

4. The method for constructing a two-class model according to claim 3, wherein the step of determining, for a split node among the nodes, a true size relationship between the preset index values corresponding to two child nodes of the split node is preceded by the step of:

5. The method for constructing a two-class model according to claim 2, wherein if it is determined that the split node does not conform to the monotonicity relationship corresponding to itself, the step of pruning the split node comprises:

6. The method of constructing a two-class model according to claim 1, wherein the step of obtaining the target two-class model from the pruned gradient lifting tree comprises:

7. The method of constructing a two-class model according to any one of claims 1 to 6, wherein the training samples include credit data of each customer, and further comprising, after the step of deriving the target two-class model from the pruned gradient-boosted tree:

8. A binary model building apparatus, comprising:

9. A two-class model building apparatus, characterized by comprising: memory, a processor and a two-class model building program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the two-class model building method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a two-classification model building program is stored thereon, which when executed by a processor implements the steps of the two-classification model building method according to any one of claims 1 to 7.