CN112232944A - Scoring card creating method and device and electronic equipment

Info

Publication number: CN112232944A
Application number: CN202011049938.4A
Authority: CN
Inventors: 张晓强
Original assignee: Ccx Credit Technology Co ltd
Current assignee: Ccx Credit Technology Co ltd
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2021-01-15
Anticipated expiration: 2040-09-29

Abstract

The embodiment of the invention provides a scoring card creating method, a scoring card creating device and electronic equipment, wherein the method comprises the following steps: acquiring data of a plurality of sample characteristics of a plurality of sample users; aiming at each sample characteristic, training and obtaining one or more regression trees corresponding to the sample characteristic based on each characteristic value of the sample characteristic; sorting the regression trees corresponding to the same sample characteristic according to the sequence of the characteristic values corresponding to the regression trees from small to large; determining the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees as target numerical value intervals; and taking each target value interval as a regression tree sub-box, and creating a scoring card comprising each regression tree sub-box. The method provided by the embodiment of the invention simplifies the creation process of the score card.

Description

Scoring card creating method and device and electronic equipment

Technical Field

The invention relates to the technical field of data analysis, in particular to a scoring card creating method and device and electronic equipment.

Background

Currently, big data analysis techniques are applied to various fields. For example: in the financial field, risk control can be achieved by analyzing data of a user.

Specifically, the financial institution may perform credit risk assessment on the user by performing big data analysis on attribute data, behavior data, and the like of the user. Currently, credit scoring of the user is mainly carried out by applying a created scoring card to various attribute data and behavior data of the user, for example attribute data such as age, gender, or income and expenditure, and behavior data such as deposits, withdrawals or payments. In this way, the financial institution can decide whether to grant the user credit, and the amount and interest rate of the credit, according to the credit score of the user, thereby reducing the risk in the financial transaction. The credit score of the user reflects the probability that the user may have overdue repayment or fraud, and the higher the credit score is, the lower the credit risk of the user is.

It can be seen that creating a scoring card is an important step in performing credit scoring. Referring to fig. 1, fig. 1 is a diagram illustrating the structure of a scoring card in the prior art. The scoring card 100 includes an income variable, an age variable, a gender variable and a marital status variable. Each variable may correspond to a plurality of feature bins, each feature bin being a data interval of that variable; for example, the income variable in fig. 1 corresponds to 3 feature bins: [0, 10000), [10000, 50000) and 50000 and above, i.e. each feature bin is a data interval of the income variable. Each feature bin corresponds to a WOE (weight of evidence) value and a corresponding score. The WOE value of a feature bin indicates the difference between the ratio of high-risk users to non-high-risk users within that feature bin and the ratio of high-risk users to non-high-risk users among all users; the smaller the WOE value, the smaller the default risk of the users in that feature bin. The corresponding score of a feature bin represents the score assigned to a user when the value of the variable for that user falls within the feature bin.

For a user, the corresponding scores of the respective features of the user may be looked up in the scoring card 100 shown in fig. 1, and then the sum of the corresponding scores of the respective features and the base score may be used as the credit score of the user. Therefore, the process of creating the scoring card is to perform big data analysis on the data of users and calculate the credit scores corresponding to various attributes and various behaviors. For example, if user A is male, 20 years old, has an income of 5000 and is unmarried, the following may be determined from the scoring card 100 shown in fig. 1: the gender is male, with a corresponding score of 1.6; the marital status is unmarried, with a corresponding score of 0.3; the age is 20, which falls in the feature bin [20, 40), with a corresponding score of 22.7; the income is 5000, which falls in the feature bin [0, 10000), with a corresponding score of -7.3. The sum of the corresponding scores of the features and the base score, 1.6 + 0.3 + 22.7 + (-7.3) + 33.7 = 51, is taken as the credit score of user A.
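The lookup-and-sum arithmetic of the user A example can be sketched as follows. This is only an illustration: the table contains just the bins quoted in the text (the other bins of fig. 1 are not reproduced here), and the names `BASE_SCORE`, `CATEGORICAL`, `NUMERIC` and `credit_score` are hypothetical.

```python
# Partial scoring card: only the bins quoted in the user A example are filled in.
BASE_SCORE = 33.7
CATEGORICAL = {
    "gender": {"male": 1.6},
    "marital_status": {"unmarried": 0.3},
}
# Numeric bins are half-open intervals [low, high) mapped to a score.
NUMERIC = {
    "age": [((20, 40), 22.7)],
    "income": [((0, 10000), -7.3)],
}


def credit_score(user):
    """Sum the per-feature bin scores and the base score."""
    total = BASE_SCORE
    for name, table in CATEGORICAL.items():
        total += table[user[name]]
    for name, bins in NUMERIC.items():
        value = user[name]
        for (low, high), score in bins:
            if low <= value < high:
                total += score
                break
        else:
            # No bin matched: the partial table does not cover this value.
            raise ValueError(f"no bin for {name}={value}")
    return total


user_a = {"gender": "male", "marital_status": "unmarried", "age": 20, "income": 5000}
score_a = credit_score(user_a)  # 1.6 + 0.3 + 22.7 - 7.3 + 33.7
```

Up to floating-point rounding, `score_a` equals the credit score of 51 computed in the example.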

Currently, the most common method for creating a scoring card is the standard scoring card based on logistic regression, which determines variable bins by using user data and then constructs a logistic regression model to determine the scoring card. In variable binning, a plurality of numerical intervals is determined for each variable; as shown in the scoring card 100 of fig. 1, a plurality of age intervals such as [0, 20), [20, 40), [40, 50) and [50, 100] are determined for the age variable. However, in the existing variable binning process, an engineer needs to repeat the operations many times for each feature to determine a better variable binning, so the operation is cumbersome.

Disclosure of Invention

The embodiment of the invention aims to provide a scoring card creating method, a scoring card creating device and electronic equipment, so that the scoring card creating process is simplified.

In order to achieve the above object, an embodiment of the present invention provides a score card creating method, including:

obtaining data of a plurality of sample features of a plurality of sample users, the data of the plurality of sample features of each sample user comprising: behavior data and attribute data of the sample user; each sample user correspondingly has a label, and the label is used for representing whether the sample user is a high-risk user;

aiming at each sample characteristic, training and obtaining one or more regression trees corresponding to the sample characteristic based on each characteristic value of the sample characteristic; each regression tree includes two leaf nodes, which respectively represent the two numerical value intervals into which the sample characteristic is divided by the characteristic value corresponding to the regression tree;

sorting the regression trees corresponding to the same sample characteristic according to the sequence of the characteristic values corresponding to the regression trees from small to large; determining the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees as target numerical value intervals;

and taking each target value interval as a regression tree sub-box, and creating a scoring card comprising each regression tree sub-box.

Further, the step of training and obtaining, for each sample characteristic, one or more regression trees corresponding to the sample characteristic based on each characteristic value of the sample characteristic includes:

based on a gradient boosting algorithm, for each sample characteristic of a sample user, taking the data of the sample characteristic as a characteristic value, selecting a characteristic value of the sample characteristic, and determining a regression tree taking the characteristic value as a demarcation point; each leaf node of the regression tree corresponds to a prediction score, which represents the corresponding score when the data of the sample characteristic is located in the numerical value interval represented by the leaf node;

determining a gain function of the regression tree;

obtaining the sum of the prediction scores of the data of each sample characteristic of the sample user in each regression tree as an output score;

determining a loss function of the current gradient boosting tree model to be trained based on the label of the sample user and the output score; the current gradient boosting tree model to be trained comprises: a plurality of currently determined regression trees;

judging whether the loss function has converged;

if so, fixing the parameters of the current gradient boosting tree model to be trained to obtain a target gradient boosting tree model;

if not, selecting the characteristic value that maximizes the gain function of the regression tree as a new characteristic value, and returning to the step of determining the regression tree which takes the characteristic value as a demarcation point;

and extracting parameters of each regression tree of the target gradient boosting tree model, and merging a plurality of regression trees which represent the same characteristic value of the same sample characteristic data to obtain one or more regression trees corresponding to each sample characteristic.

Further, the step of creating a score card including each regression tree bin by using each target value interval as a regression tree bin includes:

obtaining a score corresponding to each target value interval, wherein the score corresponding to each target value interval is as follows: the sum of the prediction scores corresponding to each leaf node with intersection between the numerical value interval and the target numerical value interval;

taking each target value interval as a regression tree sub-box, taking the score corresponding to the target value interval as the score of the regression tree sub-box, and creating a score card comprising each regression tree sub-box and the score corresponding to each regression tree sub-box; the scores of the scoring card comprise scores corresponding to each regression tree sub-box and preset basic scores.

Further, for each target value interval, the score corresponding to the target value interval is determined by adopting the following formula:

$$\mathrm{Score} = -B\,(f_1 + f_2 + \dots + f_K)$$

wherein Score represents the score corresponding to the target value interval, B is a preset constant parameter, and f_1, f_2, …, f_K respectively represent the prediction scores corresponding to the K leaf nodes whose numerical value intervals intersect the target numerical value interval.
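As an illustration of this formula, the sketch below sums the prediction scores of the leaf nodes whose intervals intersect a target interval and multiplies the sum by -B. The leaf scores, the value B = 20 and the helper names `overlaps` and `interval_score` are assumptions made for the example; all intervals are treated as half-open for simplicity.

```python
def overlaps(a, b):
    """True when half-open intervals a = [a0, a1) and b = [b0, b1) intersect."""
    return max(a[0], b[0]) < min(a[1], b[1])


def interval_score(target, leaves, b):
    """Score = -B * (f_1 + ... + f_K) over the K leaves that intersect target."""
    return -b * sum(score for interval, score in leaves if overlaps(target, interval))


# Leaves of two hypothetical age trees split at 30 and 60; the scores are made up.
leaves = [
    ((0, 30), 0.20), ((30, 100), -0.10),  # tree with demarcation point 30
    ((0, 60), 0.05), ((60, 100), -0.30),  # tree with demarcation point 60
]
B = 20.0  # illustrative value for the preset constant parameter

score_0_30 = interval_score((0, 30), leaves, B)      # -20 * (0.20 + 0.05)
score_30_60 = interval_score((30, 60), leaves, B)    # -20 * (-0.10 + 0.05)
score_60_100 = interval_score((60, 100), leaves, B)  # -20 * (-0.10 - 0.30)
```

Because of the leading minus sign, an interval whose leaves carry negative prediction scores receives a positive scoring card score.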

and taking each target numerical value interval as a characteristic sub-box, determining the score corresponding to each characteristic sub-box by adopting a logistic regression model, and creating a score card according to each characteristic sub-box and the score corresponding to each characteristic sub-box.

Further, the acquiring data of a plurality of sample characteristics of a plurality of sample users includes:

acquiring data of a plurality of characteristics of a sample user;

for each feature, detecting a type of the feature; if the characteristic is a numerical characteristic, taking the characteristic as a characteristic to be screened; if the feature is a classification type feature, assigning the feature according to a preset assignment rule, and taking the assigned data of the feature as the feature to be screened;

inputting a plurality of features to be screened into a gradient boosting model to be trained, and extracting the importance degree corresponding to each feature to be screened; there is a label corresponding to each feature, and the label is used for characterizing whether the feature is important or not.

For each feature to be screened, when the importance of the feature to be screened is less than or equal to a preset importance threshold, taking the feature to be screened as a feature to be deleted;

judging whether the number of the features to be deleted is zero or not; if so, determining each feature to be screened as a sample feature;

if not, judging whether the number of the features to be deleted is smaller than a preset number; if so, deleting all the features to be deleted, and returning to the step of inputting the plurality of features to be screened into the gradient boosting model to be trained and extracting the importance degree corresponding to each feature to be screened;

if not, deleting the preset number of features to be deleted that have the lowest importance among the features to be deleted, taking the remaining features to be deleted as features to be screened, and returning to the step of inputting the plurality of features to be screened into the gradient boosting model to be trained and extracting the importance degree corresponding to each feature to be screened.
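The screening loop described above can be sketched as follows. The callable `importance_fn` is a hypothetical stand-in for training the gradient boosting model and reading back one importance value per feature; `threshold` and `max_delete` correspond to the preset importance threshold and the preset number.

```python
def screen_features(features, importance_fn, threshold, max_delete):
    """Iteratively drop low-importance features until none is at or below threshold.

    importance_fn(features) returns a dict mapping each feature name to its
    importance; it is a placeholder for retraining the model on those features.
    """
    features = list(features)
    while True:
        importance = importance_fn(features)
        to_delete = [f for f in features if importance[f] <= threshold]
        if not to_delete:
            # Nothing left to delete: the remaining features are the sample features.
            return features
        if len(to_delete) >= max_delete:
            # Too many candidates: delete only the max_delete least important ones;
            # the other candidates stay in the pool for the next round.
            to_delete = sorted(to_delete, key=lambda f: importance[f])[:max_delete]
        features = [f for f in features if f not in to_delete]


# Stub importance values (made up); a real run would retrain the model each round.
static = {"age": 0.4, "income": 0.3, "gender": 0.02, "region": 0.01, "industry": 0.0}
kept = screen_features(
    static,
    lambda feats: {f: static[f] for f in feats},
    threshold=0.05,
    max_delete=2,
)
```

With these stub values, the first round removes `industry` and `region`, the second removes `gender`, and the third finds nothing left to delete.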

Further, each of the categorical features is used to represent an attribute of the sample user, and the categorical features include: the gender of the sample user, the academic history of the sample user, the region to which the sample user belongs and the industry to which the sample user belongs;

the step of, if the feature is a categorical feature, assigning the feature according to a preset assignment rule includes:

for each categorical feature, taking the ratio of the number of sample users pre-labelled as high-risk among the sample users having the attribute represented by the categorical feature to the number of all sample users having that attribute as the numerical value of the categorical feature.
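A minimal sketch of this assignment rule, assuming labels of 1 for high-risk users and 0 otherwise; the function name `high_risk_ratio_encoding` and the sample data are illustrative only.

```python
from collections import defaultdict


def high_risk_ratio_encoding(values, labels):
    """Map each category to (# high-risk users in category) / (# users in category)."""
    total = defaultdict(int)
    high_risk = defaultdict(int)
    for value, label in zip(values, labels):
        total[value] += 1
        high_risk[value] += label  # label is 1 for high-risk, 0 otherwise
    return {value: high_risk[value] / total[value] for value in total}


genders = ["male", "female", "male", "male", "female"]
labels = [1, 0, 0, 1, 0]
encoding = high_risk_ratio_encoding(genders, labels)
encoded = [encoding[g] for g in genders]  # numeric column replacing the category
```

Here two of the three male sample users are high-risk, so `male` is assigned 2/3, and `female` is assigned 0.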

In order to achieve the above object, an embodiment of the present invention further provides a score card creating device, including:

a data obtaining module, configured to obtain data of a plurality of sample features of a plurality of sample users, where the data of the plurality of sample features of each sample user includes: behavior data and attribute data of the sample user; each sample user correspondingly has a label, and the label is used for representing whether the sample user is a high-risk user;

the regression tree training module is used for training and obtaining, for each sample characteristic, one or more regression trees corresponding to the sample characteristic based on the characteristic values of the sample characteristic; each regression tree includes two leaf nodes, which respectively represent the two numerical value intervals into which the sample characteristic is divided by the characteristic value corresponding to the regression tree;

the interval determining module is used for sequencing the regression trees corresponding to the same sample characteristic according to the sequence of the characteristic values corresponding to the regression trees from small to large; determining the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees as target numerical value intervals;

and the scoring card creating module is used for taking each target value interval as a regression tree sub-box and creating a scoring card comprising each regression tree sub-box.

Correspondingly, the embodiment of the invention provides a scoring method, which comprises the following steps:

acquiring a plurality of feature data of a user to be evaluated, wherein the feature data comprises: behavior data and attribute data of a user to be scored;

aiming at each feature data, acquiring a corresponding score of the feature data in a pre-created scoring card; wherein the scoring card is created by the method of any one of claims 1-7;

determining the sum of the score of each feature data and the basic score of the scoring card as the score of the user to be scored; wherein a higher score indicates a lower risk for the user to be scored.

Based on the above scoring method, an embodiment of the present invention further provides a scoring device, including:

In order to achieve the above object, an embodiment of the present invention provides an electronic device, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the steps of the scoring card creation method or the scoring method when executing the program stored in the memory.

In order to achieve the above object, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the above scoring card creating method steps or scoring method steps.

In order to achieve the above object, an embodiment of the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any of the above described scoring card creating method steps or scoring method steps.

The embodiment of the invention has the following beneficial effects:

by adopting the method provided by the embodiment of the invention, one or more regression trees corresponding to the sample characteristics are obtained by training based on the characteristic values of the sample characteristics through the data of the sample characteristics of a plurality of sample users aiming at each sample characteristic, and the regression trees corresponding to the same sample characteristics are sequenced according to the sequence of the characteristic values corresponding to the regression trees from small to large; and determining the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees as target numerical value intervals, taking each target numerical value interval as a regression tree sub-box, and creating a scoring card comprising each regression tree sub-box. Compared with the traditional scoring card creating mode, the scoring card creating method provided by the embodiment of the invention combines the box dividing process and the model training process to generate the regression tree box, so that the scoring card is automatically created, the creating process of the scoring card is simplified, the operation is simpler and more convenient, and the scoring effect of the scoring card is improved.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a diagram illustrating the structure of a scoring card in the prior art;

fig. 2 is a flowchart of a scoring card creation method according to an embodiment of the present invention;

fig. 3a is another flowchart of a score card creating method according to an embodiment of the present invention;

fig. 3b is a scoring card created by the scoring card creation method according to the embodiment of the present invention;

FIG. 4 is a diagram illustrating a regression tree according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating merging multiple regression trees according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating the determination of a target value interval from multiple regression trees of the same sample feature according to an embodiment of the present invention;

FIG. 7 is a schematic flow chart illustrating a process for obtaining data of sample features according to an embodiment of the present invention;

fig. 8 is a flowchart of scoring based on the scoring card creation method provided by the embodiment of the present invention;

fig. 9 is a structural diagram of a score card creation apparatus according to an embodiment of the present invention;

fig. 10 is a block diagram of another scoring card creating apparatus according to an embodiment of the present invention;

fig. 11 is a structural diagram of a scoring device according to an embodiment of the present invention;

fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to solve the technical problem, the embodiment of the invention provides a scoring card creating method, a scoring card creating device and electronic equipment.

Referring to fig. 2, a process of creating a scoring card includes:

Step 201, obtaining data of a plurality of sample characteristics of a plurality of sample users, where the data of the plurality of sample characteristics of each sample user includes: behavior data and attribute data of the sample user; each sample user has a corresponding label, and the label is used for representing whether the sample user is a high-risk user.

Step 202, aiming at each sample characteristic, training and obtaining one or more regression trees corresponding to the sample characteristic based on each characteristic value of the sample characteristic; each regression tree includes two leaf nodes, which respectively represent the two numerical value intervals into which the sample characteristic is divided by the characteristic value corresponding to the regression tree.

Step 203, sorting the regression trees corresponding to the same sample characteristic according to the sequence of the characteristic values corresponding to the regression trees from small to large; and determining the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees as target numerical value intervals.

Step 204, taking each target value interval as a regression tree sub-box, and creating a scoring card comprising each regression tree sub-box.
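The interval construction of steps 203 and 204 can be illustrated with a short sketch. It assumes each two-leaf regression tree of a feature is summarized by its demarcation point and that the feature has a known value range; the function name `target_intervals` and the range [0, 100] are assumptions for the example.

```python
def target_intervals(thresholds, low, high):
    """Turn the demarcation points of one feature's two-leaf trees into bins.

    Sorting the points in ascending order and pairing neighbours yields:
    the left leaf of the first tree, the intersections of adjacent leaves
    of different trees, and the right leaf of the last tree.
    Assumes every threshold lies strictly between low and high.
    """
    points = sorted(set(thresholds))
    edges = [low] + points + [high]
    return list(zip(edges[:-1], edges[1:]))


# Two age trees with demarcation points 60 and 30 on the assumed range [0, 100].
age_bins = target_intervals([60, 30], 0, 100)
```

For the two age trees this gives the bins (0, 30), (30, 60) and (60, 100): the middle bin is the intersection of the right leaf of the tree split at 30 and the left leaf of the tree split at 60.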

The scoring card creating method and device provided by the embodiment of the invention are described in detail through specific embodiments.

In one embodiment of the present application, referring to fig. 3a, another flow of the score card creation method includes the following steps:

step 301, data of a plurality of sample characteristics of a plurality of sample users are obtained.

In the embodiment of the invention, each sample user has a corresponding label, and the label is used for representing whether the sample user is a high-risk user. A high-risk user refers to a user with a specific behavior record. Taking the financial field as an example, high-risk users may specifically refer to users who have overdue repayment or fraud records, and non-high-risk users refer to users who have no overdue repayment or fraud records. Specifically, if the sample user is a high-risk user, the label corresponding to the sample user is 1, and if the sample user is a non-high-risk user, the label corresponding to the sample user is 0.

In the embodiment of the present invention, the data of the plurality of sample characteristics of each sample user includes: the sample user's behavioral data and attribute data. Taking the financial field as an example, the attribute data of the sample user may specifically include: gender, age, academic history, industry of the sample user, territory of the sample user, marital status, and the like; the behavior data of the sample user may specifically include: the number of times the sample user generates the purchasing behavior, the amount paid by the sample user for the purchasing behavior, the number of days between the purchasing behaviors of the sample user, the income of the sample user, the proportion of the expenditure of the sample user to the income, the deposit data of the sample user, the withdrawal data of the sample user, the historical default amount of the sample user and the like.

Step 302, for each sample feature of a sample user, taking the data of the sample feature as a feature value, and for each feature value of the sample feature, determining a regression tree taking the feature value as a demarcation point based on a gradient boosting algorithm.

Each regression tree has two leaf nodes, and each leaf node corresponds to a prediction score, which represents the corresponding score when the data of the sample feature lies in the numerical value interval represented by that leaf node.

For example, suppose the plurality of sample features of the sample user includes gender, age, income and academic history. If the age of sample user A is 30, then, referring to fig. 4, a regression tree treeX taking the age of 30 as its feature value is determined. The regression tree treeX includes two leaf nodes: leaf_X1 and leaf_X2. leaf_X1 indicates that the age lies in the interval [0, 30), and leaf_X2 indicates that the age lies in the interval [30, 100]. leaf_X1 corresponds to the prediction score f_XL, which represents the corresponding score when the age of the sample user lies in the interval [0, 30); leaf_X2 corresponds to the prediction score f_XR, which represents the corresponding score when the age of the sample user lies in the interval [30, 100].

Step 303, determining the gain function of each regression tree with each feature value as a demarcation point.

In this step, the gain function value of each regression tree may be determined specifically by using the following formula:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

wherein Gain represents the contribution of the regression tree taking one feature value of the sample feature as a demarcation point to the current gradient boosting tree model to be trained, G_L represents the sum of the first-order gradients of all sample features assigned to the left leaf node, H_L represents the sum of the second-order gradients of all sample features assigned to the left leaf node, G_R represents the sum of the first-order gradients of all sample features assigned to the right leaf node, H_R represents the sum of the second-order gradients of all sample features assigned to the right leaf node, λ represents the L2 regularization coefficient, γ is the minimum splitting loss, and both λ and γ are preset parameters.
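A small numeric sketch of the gain computation, assuming the gain takes the second-order form familiar from XGBoost-style gradient boosting, which uses exactly the quantities G_L, H_L, G_R, H_R, λ and γ defined above. The function name `split_gain` and the sample numbers are illustrative.

```python
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of a two-leaf split from summed first- and second-order gradients."""

    def leaf_term(g, h):
        # Contribution G^2 / (H + lambda) of a single node.
        return g * g / (h + lam)

    return 0.5 * (
        leaf_term(g_left, h_left)
        + leaf_term(g_right, h_right)
        - leaf_term(g_left + g_right, h_left + h_right)  # unsplit parent node
    ) - gamma


# Gradients that pull in opposite directions on the two sides give a high gain.
gain = split_gain(g_left=-2.0, h_left=1.0, g_right=2.0, h_right=1.0, lam=1.0, gamma=0.0)
```

Under this assumed form, a candidate demarcation point is only worthwhile when the bracketed improvement exceeds γ, which is why γ acts as a minimum splitting loss.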

Step 304, selecting the regression tree with the maximum gain function value from all the regression trees as the currently determined regression tree.

In this step, the regression tree with the largest gain function may be selected as the currently determined regression tree according to the magnitude of the gain function of each regression tree obtained by using all the characteristic values as the demarcation points.

Step 305, obtaining the sum of the prediction scores of the data of each sample characteristic of the sample user in the currently determined regression trees as an output score.

In this step, the output score can be obtained by the following formula:

$$F_K(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$

wherein $\mathcal{F}$ is the function space containing all regression trees, a regression tree being a function that maps attributes to scores; f_k is the prediction score corresponding to the k-th regression tree, namely the score of the leaf node into which the sample feature falls in that regression tree; K is the number of regression trees; and F_K(x_i) is the output score of the i-th sample user x_i.
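The additive output score can be sketched with two-leaf trees as follows. The `Stump` class, its field names and the numeric scores are hypothetical; values below the demarcation point go to the left leaf, matching the [0, 30) and [30, 100] intervals of the treeX example.

```python
from dataclasses import dataclass


@dataclass
class Stump:
    feature: str      # name of the sample feature the tree splits on
    threshold: float  # feature value used as the demarcation point
    left: float       # prediction score when value < threshold
    right: float      # prediction score when value >= threshold

    def predict(self, user):
        return self.left if user[self.feature] < self.threshold else self.right


def output_score(trees, user):
    """Sum of the prediction scores f_k(x_i) over all K regression trees."""
    return sum(tree.predict(user) for tree in trees)


trees = [
    Stump("age", 30, left=0.2, right=-0.1),
    Stump("income", 10000, left=0.3, right=-0.2),
]
user = {"age": 30, "income": 5000}
score = output_score(trees, user)  # right leaf of the age tree + left leaf of the income tree
```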

Step 306, determining a loss function of the current gradient boosting tree model to be trained.

Wherein, the current gradient boosting tree model to be trained comprises: one or more regression trees currently determined.

In this step, the loss function of the gradient boosting tree model may be determined specifically by using the following formula:

$$L = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right)$$

wherein L represents the value of the loss function of the current gradient boosting tree model to be trained, l is the per-sample loss, y_i represents the label of the i-th sample user, ŷ_i represents the prediction score given by the current gradient boosting tree model to be trained for the sample features of that sample user, and n represents the number of sample users.
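For illustration only, the sketch below assumes that the per-sample loss is the logistic (cross-entropy) loss applied to the raw output score, which is a common choice for 0/1 labels but is not stated in the text. It also shows the first- and second-order gradients that such a loss would contribute to the G and H sums of the gain function. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(labels, scores):
    """Sum over sample users of the cross-entropy between label and sigmoid(score)."""
    total = 0.0
    for y, s in zip(labels, scores):
        p = sigmoid(s)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


def gradients(labels, scores):
    """First- and second-order gradients of the logistic loss w.r.t. the score."""
    probs = [sigmoid(s) for s in scores]
    first = [p - y for p, y in zip(probs, labels)]
    second = [p * (1 - p) for p in probs]
    return first, second


loss = logistic_loss([1, 0], [0.0, 0.0])  # two users scored 0.0, i.e. p = 0.5 each
g, h = gradients([1], [0.0])
```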

Step 307, judging whether the loss function has converged; if yes, executing step 308; if no, executing step 309.

In this step, it may be determined whether the number of regression trees in the current gradient boosting tree model to be trained has reached a preset value; if it has, the loss function may be determined to have converged, otherwise the loss function may be determined not to have converged. The preset value may be, for example, 100 or 200, and is not particularly limited.

In this step, it may alternatively be determined whether the value of the loss function of the current gradient boosting tree model to be trained has stopped decreasing; if it has, the loss function may be determined to have converged, otherwise the loss function may be determined not to have converged.
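The two convergence criteria can be sketched in one helper. Combining them with a logical "or" and the tolerance parameter `tol` are assumptions of this sketch; the text presents the criteria as alternatives and does not specify a tolerance.

```python
def has_converged(num_trees, loss_history, max_trees=100, tol=1e-6):
    """Stop when the tree budget is reached or the loss has stopped decreasing."""
    if num_trees >= max_trees:
        return True
    if len(loss_history) >= 2:
        # Converged when the latest round improved the loss by at most tol.
        return loss_history[-2] - loss_history[-1] <= tol
    return False


budget_hit = has_converged(100, [0.5])
plateau = has_converged(3, [0.7, 0.6, 0.6])
still_improving = has_converged(3, [0.7, 0.6, 0.5])
```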

Step 308, fixing the parameters of the current gradient boosting tree model to be trained to obtain a target gradient boosting tree model, and executing step 310.

Step 309, re-determining the regression tree with the characteristic value as the demarcation point based on the gradient boosting algorithm for each characteristic value of each sample characteristic of the sample user, and returning to execute step 303.

Step 310, extracting parameters of each regression tree of the target gradient boosting tree model, and merging multiple regression trees representing the same characteristic value of the same sample characteristic data to obtain one or multiple regression trees corresponding to each sample characteristic.

In the embodiment of the present invention, the same feature value of the same sample feature may correspond to multiple regression trees, and in this step these regression trees may be merged. Specifically, one of the regression trees is retained; then, across all of these regression trees, the prediction scores of the leaf nodes that represent the same numerical value interval are summed, and the sum is used as the prediction score of the corresponding leaf node in the retained regression tree.

For example, referring to fig. 5, for the sample feature of age, suppose one of the feature values is 60 years old and this feature value corresponds to four regression trees: treeA, treeB, treeC and treeD. The leaf nodes leaf_AL and leaf_AR of regression tree treeA correspond to the age interval [0, 60) and the age interval [60, 100], respectively, and their prediction scores are f_AL and f_AR; the leaf nodes leaf_BL and leaf_BR of regression tree treeB correspond to the age interval [0, 60) and the age interval [60, 100], respectively, and their prediction scores are f_BL and f_BR; the leaf nodes leaf_CL and leaf_CR of regression tree treeC correspond to the age interval [0, 60) and the age interval [60, 100], respectively, and their prediction scores are f_CL and f_CR; the leaf nodes leaf_DL and leaf_DR of regression tree treeD correspond to the age interval [0, 60) and the age interval [60, 100], respectively, and their prediction scores are f_DL and f_DR.

In this step, multiple regression trees representing the same feature value of the same sample feature data may be merged to obtain one or more regression trees corresponding to each sample feature:

any one of the regression trees treeA, treeB, treeC and treeD may be selected as the retained regression tree; treeA is retained in this embodiment;

the prediction scores of the leaf nodes representing the same age interval [0, 60) are summed to obtain f_{sum,L}＝f_AL+f_BL+f_CL+f_DL, and f_{sum,L} is used as the prediction score of the leaf node of the retained regression tree that represents the age interval [0, 60);

the prediction scores of the leaf nodes representing the same age interval [60, 100] are summed to obtain f_{sum,R}＝f_AR+f_BR+f_CR+f_DR, and f_{sum,R} is used as the prediction score of the leaf node of the retained regression tree that represents the age interval [60, 100];

the retained regression tree treeA, carrying these summed prediction scores, is the regression tree corresponding to the feature value 60 of the sample feature age.
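The merging in step 310 can be sketched as below, assuming each regression tree is a two-leaf tree described by its feature, its split value and its two leaf scores. The `Stump` type, the function `merge_stumps` and the numeric leaf scores are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Stump(NamedTuple):
    feature: str        # sample feature the tree splits on, e.g. "age"
    split: float        # feature value used as the demarcation point
    left_score: float   # prediction score of the left leaf (value < split)
    right_score: float  # prediction score of the right leaf (value >= split)


def merge_stumps(stumps):
    """Merge trees that share the same feature and split value by summing leaf scores."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for s in stumps:
        acc = totals[(s.feature, s.split)]
        acc[0] += s.left_score
        acc[1] += s.right_score
    # One retained tree per (feature, split), ordered by feature and split value.
    return [Stump(f, v, l, r) for (f, v), (l, r) in sorted(totals.items())]


# Hypothetical leaf scores for treeA..treeD, all splitting age at 60.
age_trees = [
    Stump("age", 60, 0.5, -0.5),
    Stump("age", 60, 0.25, 0.125),
    Stump("age", 60, -0.125, 0.25),
    Stump("age", 60, 1.0, -1.0),
]
merged = merge_stumps(age_trees)
```

Here `merged` holds a single tree for the split at 60 whose left and right scores play the roles of f_{sum,L} and f_{sum,R}.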

Step 311, sorting the regression trees corresponding to the same sample feature in ascending order of the feature values corresponding to the regression trees; and determining, as target numerical value intervals, the numerical value interval represented by the left leaf node of the first regression tree after sorting, the numerical value interval represented by the right leaf node of the last regression tree after sorting, and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees.

For each target value interval, the corresponding prediction score is the sum of the prediction scores of all leaf nodes whose numerical value intervals intersect the target value interval.

For each regression tree, the left leaf node represents the numerical value interval of values smaller than the feature value corresponding to the regression tree; the right leaf node represents the numerical value interval of values greater than or equal to that feature value, optionally bounded above by a specific value. For example, referring to fig. 4, the regression tree treeA includes a left leaf node leaf_AL and a right leaf node leaf_AR. If the feature value corresponding to regression tree treeA is the age of 60, the left leaf node leaf_AL corresponds to the age interval [0, 60) and the right leaf node leaf_AR corresponds to the age interval [60, 100].

For example, referring to fig. 6, for the sample feature of income, suppose the corresponding feature values are 1000, 10000, 30000 and 50000, where the feature value 1000 corresponds to regression tree1, the feature value 10000 corresponds to regression tree2, the feature value 30000 corresponds to regression tree3, and the feature value 50000 corresponds to regression tree4. The leaf nodes leaf_1L and leaf_1R of regression tree1 correspond to the income intervals [0, 1000) and [1000, +∞), respectively, and their prediction scores are f_1L and f_1R; the leaf nodes leaf_2L and leaf_2R of regression tree2 correspond to the income intervals [0, 10000) and [10000, +∞), respectively, and their prediction scores are f_2L and f_2R; the leaf nodes leaf_3L and leaf_3R of regression tree3 correspond to the income intervals [0, 30000) and [30000, +∞), respectively, and their prediction scores are f_3L and f_3R; the leaf nodes leaf_4L and leaf_4R of regression tree4 correspond to the income intervals [0, 50000) and [50000, +∞), respectively, and their prediction scores are f_4L and f_4R.

In this step, the regression trees are sorted in ascending order of feature value, giving regression tree1, regression tree2, regression tree3 and regression tree4 in that order. The intersection of the numerical value intervals represented by the two adjacent leaf nodes of each pair of neighbouring regression trees is determined as a target numerical value interval: for regression tree1 and regression tree2, the adjacent leaf nodes leaf_1R and leaf_2L represent [1000, +∞) and [0, 10000), and their intersection [1000, 10000) is determined as a target numerical value interval; for regression tree2 and regression tree3, the adjacent leaf nodes leaf_2R and leaf_3L represent [10000, +∞) and [0, 30000), and their intersection [10000, 30000) is determined as a target numerical value interval; for regression tree3 and regression tree4, the adjacent leaf nodes leaf_3R and leaf_4L represent [30000, +∞) and [0, 50000), and their intersection [30000, 50000) is determined as a target numerical value interval. In addition, the interval [0, 1000) represented by the left leaf node leaf_1L of the first regression tree1 after sorting and the interval [50000, +∞) represented by the right leaf node leaf_4R of the last regression tree4 after sorting are determined as target numerical value intervals. The target numerical value intervals obtained are therefore: [0, 1000), [1000, 10000), [10000, 30000), [30000, 50000) and [50000, +∞).

The prediction score corresponding to each target value interval is the sum of the prediction scores of the leaf nodes whose numerical value intervals intersect it:

for the target value interval [0, 1000), the intersecting leaf intervals are [0, 1000), [0, 10000), [0, 30000) and [0, 50000), and the prediction score is f_1L+f_2L+f_3L+f_4L;

for the target value interval [1000, 10000), the intersecting leaf intervals are [1000, +∞), [0, 10000), [0, 30000) and [0, 50000), and the prediction score is f_1R+f_2L+f_3L+f_4L;

for the target value interval [10000, 30000), the intersecting leaf intervals are [1000, +∞), [10000, +∞), [0, 30000) and [0, 50000), and the prediction score is f_1R+f_2R+f_3L+f_4L;

for the target value interval [30000, 50000), the intersecting leaf intervals are [1000, +∞), [10000, +∞), [30000, +∞) and [0, 50000), and the prediction score is f_1R+f_2R+f_3R+f_4L;

for the target value interval [50000, +∞), the intersecting leaf intervals are [1000, +∞), [10000, +∞), [30000, +∞) and [50000, +∞), and the prediction score is f_1R+f_2R+f_3R+f_4R.
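The construction of target value intervals and their prediction scores in step 311 can be sketched as follows. This is an illustration under stated assumptions: each merged regression tree is given as a hypothetical tuple `(split, left_score, right_score)`, the lower bound of the feature is taken to be 0 as in the income example, and the leaf scores are made-up powers of two so that each sum can be traced by eye.

```python
import math


def target_bins(stumps):
    """Return (lower, upper, score) for every target interval [lower, upper).

    stumps -- list of (split, left_score, right_score) for ONE sample feature,
              with at most one tree per split value (already merged).
    """
    ordered = sorted(stumps)                      # ascending split value
    edges = [0.0] + [s for s, _, _ in ordered] + [math.inf]
    bins = []
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        # A tree contributes its right leaf when the bin lies at or above
        # its split value, and its left leaf otherwise.
        score = sum(r if lo >= s else l for s, l, r in ordered)
        bins.append((lo, hi, score))
    return bins


# Hypothetical scores: f_1L=1, f_1R=2, f_2L=4, f_2R=8, f_3L=16, f_3R=32, f_4L=64, f_4R=128.
income_stumps = [(1000, 1, 2), (10000, 4, 8), (30000, 16, 32), (50000, 64, 128)]
income_bins = target_bins(income_stumps)
for lo, hi, score in income_bins:
    print(f"[{lo}, {hi}) -> {score}")
```

With these placeholder scores, the first bin sums the four left leaves (1+4+16+64) and each later bin swaps one more left leaf for the corresponding right leaf, mirroring the five sums listed above.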

Step 312, taking each target value interval as a regression tree sub-box, and creating a scoring card including each regression tree sub-box.

In this step, a scoring card including each regression tree sub-box may be created through the following steps A1-A2:

Step A1: obtaining the score corresponding to each target value interval.

The score corresponding to each target value interval is obtained from the sum of the prediction scores corresponding to the leaf nodes whose numerical value intervals intersect the target value interval.

In this step, for each target value interval, the score corresponding to the target value interval may be determined by using the following formula:

Score＝-B{f₁+f₂+…+f_K}

Step A2: taking each target value interval as a regression tree sub-box, taking the score corresponding to the target value interval as the score of the regression tree sub-box, and creating a scoring card that includes each regression tree sub-box and its corresponding score.

The scores of the scoring card comprise scores corresponding to each regression tree sub-box and preset basic scores.

For example, fig. 3b shows a created scoring card that includes each regression tree sub-box and the score corresponding to each regression tree sub-box. Referring to fig. 3b, the feature column of the scoring card 300 includes income, age, gender and marital status, and the basic score of the scoring card 300 is a preset constant 35. The income feature has 5 regression tree sub-boxes: [0, 1000), [1000, 10000), [10000, 30000), [30000, 50000) and [50000, +∞); the age feature has 4 regression tree sub-boxes: [0, 20), [20, 30), [30, 50) and [50, 100]; the gender feature has 2 regression tree sub-boxes: male and female; the marital status feature has 2 regression tree sub-boxes: married and unmarried.

Referring to fig. 3b, each regression tree bin of each feature in the scoring card 300 corresponds to a score. Taking the income bin [0, 1000) as an example, the score corresponding to the bin is -16, and the score corresponding to a regression tree bin can be calculated by the following formula:

Score＝-B·Σ_{i∈Tree(x₁)}(δ_iL·f_iL+δ_iR·f_iR)

where B is a preset constant parameter, x₁ denotes the income feature, and Tree(x₁) denotes the set of regression trees corresponding to the bin, which may contain one or more regression trees. i∈Tree(x₁) denotes the i-th regression tree having a leaf node whose numerical value interval intersects the regression tree bin [0, 1000).

f_iL and f_iR are the prediction scores of the left leaf node and the right leaf node of the i-th regression tree, respectively. δ_iL and δ_iR are 0-1 indicator variables, exactly one of which takes the value 1. δ_iL＝1 and δ_iR＝0 indicates that the numerical value interval represented by the left leaf node of the i-th regression tree intersects the regression tree bin [0, 1000); δ_iL＝0 and δ_iR＝1 indicates that the numerical value interval represented by the right leaf node of the i-th regression tree intersects the regression tree bin [0, 1000).
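The indicator-weighted sum described here can be sketched as a short function. This is an illustration under stated assumptions: each regression tree is a hypothetical tuple `(split, f_left, f_right)`, a bin is identified by its lower bound, and the leaf scores and the value B = 20 are placeholders rather than values from the embodiment.

```python
def bin_score(bin_lower, stumps, B):
    """Score of the bin starting at bin_lower: -B * sum_i (d_iL * f_iL + d_iR * f_iR)."""
    total = 0.0
    for split, f_left, f_right in stumps:
        # The bin lies inside the right leaf's interval when it starts at or
        # above the split value; otherwise it lies inside the left leaf's interval.
        delta_right = 1 if bin_lower >= split else 0
        delta_left = 1 - delta_right          # exactly one indicator equals 1
        total += delta_left * f_left + delta_right * f_right
    return -B * total


# Two hypothetical income trees with splits at 1000 and 10000.
example_stumps = [(1000, 0.5, -0.25), (10000, 0.25, -0.5)]
lowest_bin_score = bin_score(0, example_stumps, B=20)        # uses both left leaves
highest_bin_score = bin_score(10000, example_stumps, B=20)   # uses both right leaves
```

Because of the factor -B, a bin whose leaves carry large positive prediction scores (higher predicted risk) receives a lower score.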

For example, in the embodiment shown in fig. 6, the regression tree bin [0, 1000) corresponds to 4 regression trees: regression tree1, regression tree2, regression tree3 and regression tree4. The score corresponding to the regression tree bin [0, 1000) is:

Score＝-B·(f_1L+f_2L+f_3L+f_4L)

In the embodiment of the invention, the basic score A and the constant parameter B can be obtained as follows:

A ratio odds can be defined to represent the relative probability of user default:

odds＝p/(1-p)

where p represents the probability that user x is a high-risk user and F_K(x) is the sum of the scores obtained after the features of user x are input into the target regression trees, with

p＝1/(1+e^(-F_K(x)))

Combining the two expressions gives:

F_K(x)＝log(odds)

The score of the scoring card may be defined as a linear expression of the logarithm of this ratio, i.e.:

Score＝A-B·log(odds)＝A-B·F_K(x)

where A and B are constants. The negative sign in front of B ensures that the lower the probability of default, the higher the score. Typically, a high score represents low risk and a low score represents high risk.

In general, two assumptions can be set. Assumption 1: the expected score at a specific relative probability is known, that is, according to an existing standard scoring card, the score is P₀ when the relative probability odds is θ₀. Assumption 2: the score reduction when the relative probability doubles is known, that is, according to the existing standard scoring card, this reduction is PDO.

From these assumptions it follows that the score is P₀ when the relative probability odds is θ₀, and the score is P₀-PDO when the relative probability odds is 2θ₀. Substituting (θ₀, P₀) and (2θ₀, P₀-PDO) into the formula Score＝A-B·log(odds) yields:

P₀＝A-B·log(θ₀)

P₀-PDO＝A-B·log(2θ₀)

Solving this system of equations gives the basic score A and the constant parameter B:

B＝PDO/log(2)

A＝P₀+B×log(θ₀)
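The calibration of A and B can be checked numerically with a short sketch. The function names `calibrate` and `score_from_odds`, as well as the example values P₀ = 600, θ₀ = 1/50 and PDO = 20, are hypothetical and chosen only for illustration.

```python
import math


def calibrate(p0, theta0, pdo):
    """Solve P0 = A - B*log(theta0) and P0 - PDO = A - B*log(2*theta0) for A and B."""
    B = pdo / math.log(2)
    A = p0 + B * math.log(theta0)
    return A, B


def score_from_odds(A, B, odds):
    """Score = A - B*log(odds)."""
    return A - B * math.log(odds)


# Hypothetical calibration targets.
A, B = calibrate(p0=600, theta0=1 / 50, pdo=20)
score_at_theta0 = score_from_odds(A, B, 1 / 50)   # recovers P0
score_at_double = score_from_odds(A, B, 2 / 50)   # recovers P0 - PDO
```

Doubling the odds from θ₀ to 2θ₀ lowers the score by exactly PDO, which is the property the two assumptions encode.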

By adopting the method provided by the embodiment of the invention, the target value intervals are used directly as the regression tree sub-boxes, and the score corresponding to each regression tree sub-box is determined from the prediction scores corresponding to its target value interval. Because the binning process is combined with the model training process, the regression tree sub-boxes and their corresponding scores are obtained at the same time, so that the scoring card is created automatically. The method provided by the embodiment of the invention simplifies the creation process of the scoring card, makes the operation simpler and more convenient, and can improve the scoring effect of the scoring card.

Referring to fig. 7, the method for acquiring data of a plurality of sample features of a plurality of sample users in step 301 may specifically include:

Step 701, acquiring data of a plurality of features of a sample user.

In this step, various attribute data and behavior data of the sample user can be acquired, for example: the gender, age, educational background, industry and region of the sample user, the number of purchases made by the sample user, the amount paid for the purchases, the number of days between purchases, the income of the sample user, the ratio of expenditure to income, deposit data, withdrawal data, and the like.

Step 702, detecting the type of each feature; if the feature is a numerical feature, taking the feature as a feature to be screened; and if the feature is a categorical feature, assigning a value to the feature according to a preset assignment rule, and taking the assigned data of the feature as a feature to be screened.

In this embodiment of the present invention, each categorical feature represents an attribute of a sample user, and the categorical features include: the gender of the sample user, the educational background of the sample user, the region to which the sample user belongs, the industry to which the sample user belongs, and the like.

In this step, if the feature is a categorical feature, assigning a value to the feature according to the preset assignment rule may include: assigning to each category the proportion of the sample users in that category who were labeled in advance as high risk.

For example, suppose the number of all sample users is 1000. For the categorical feature gender = male, if the number of male sample users is 600 and the number of male sample users labeled in advance as high risk is 50, the assigned value is 50 ÷ 600 ≈ 0.0833. For the categorical feature gender = female, if the number of female sample users is 400 and the number of female sample users labeled in advance as high risk is 15, the assigned value is 15 ÷ 400 = 0.0375. For the categorical feature educational background = university degree, if the number of sample users with a university degree is 680 and the number of those labeled in advance as high risk is 20, the assigned value is 20 ÷ 680 ≈ 0.0294. For the categorical feature region = area A, if the number of sample users in area A is 100 and the number of those labeled in advance as high risk is 4, the assigned value is 4 ÷ 100 = 0.04. For the categorical feature industry = teacher, if the number of sample users whose industry is teacher is 80 and the number of those labeled in advance as high risk is 1, the assigned value is 1 ÷ 80 = 0.0125.
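The assignment illustrated by these figures, in which each category is replaced by the share of its sample users labeled as high risk, can be sketched as follows. The function name `high_risk_rate_by_category` is hypothetical, and the sample lists are constructed only to reproduce the gender figures of the example.

```python
from collections import Counter


def high_risk_rate_by_category(categories, labels):
    """Map each category to (high-risk users in category) / (users in category).

    categories -- category value of each sample user, e.g. "male"
    labels     -- 1 if the sample user is labeled high risk, else 0
    """
    totals = Counter(categories)
    high_risk = Counter(c for c, y in zip(categories, labels) if y == 1)
    return {c: high_risk[c] / totals[c] for c in totals}


# 600 male users (50 high risk) and 400 female users (15 high risk).
genders = ["male"] * 600 + ["female"] * 400
labels = [1] * 50 + [0] * 550 + [1] * 15 + [0] * 385
gender_values = high_risk_rate_by_category(genders, labels)
```

Here `gender_values["male"]` is 50/600 and `gender_values["female"]` is 15/400, matching the assigned values in the example.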

Step 703, inputting the plurality of features to be screened into the gradient lifting model to be trained, and extracting the importance corresponding to each feature to be screened.

Each feature to be screened is provided with a label correspondingly, and the label is used for representing whether the feature to be screened is important or not. Specifically, if the feature to be screened is an important feature to be screened, the corresponding tag is 1, and if the feature to be screened is an unimportant feature to be screened, the corresponding tag is 0.

In this step, after a plurality of features to be screened are input into the gradient lifting model to be trained, the importance corresponding to each feature to be screened can be extracted for each feature to be screened.

The gradient lifting model to be trained is a model obtained based on a gradient lifting algorithm. Moreover, an importance threshold and a preset number can be preset for the gradient lifting model to be trained, wherein the preset number is the number of the features which are deleted at most in a single iteration.

Step 704, regarding each feature to be screened, when the importance of the feature to be screened is less than or equal to a preset importance threshold, taking the feature to be screened as a feature to be deleted.

Step 705, judging whether the number of the features to be deleted is zero; if yes, go to step 706, otherwise go to step 707.

Step 706, determining each feature to be screened as a sample feature.

Step 707, determining whether the number of the features to be deleted is smaller than a preset number, if so, executing step 708, and if not, executing step 709.

Step 708, deleting all the features to be deleted, and returning to step 703.

Step 709, deleting the preset number of features to be deleted that have the lowest importance, taking the remaining features to be deleted as features to be screened, and returning to step 703.

In this step, the features to be deleted may be sorted in the order of the importance degree from large to small or the importance degree from small to large, then the preset number of features to be deleted with low importance degree are deleted, and the remaining features to be deleted are used as the features to be screened.

For example, assume that the data of the plurality of features of the sample user include 7 features: the age, gender, educational background, marital status, monthly income, monthly average expenditure and historical default amount of the sample user. The categorical features among them, namely age, gender, educational background and marital status, may each be assigned values according to the preset assignment rule. The data of the 7 features to be screened can then all be input into the gradient lifting model to be trained to obtain the importance of each of the 7 features to be screened, assumed here to be: age 0.18, gender 0.005, educational background 0.15, marital status 0.09, monthly income 0.25, monthly average expenditure 0.12, and historical default amount 0.3. The importance threshold thred may be set to 0.1, and the maximum number nums of features deleted in a single iteration may be set to 1.

It can be seen that the importances of gender and marital status are both less than 0.1, so gender and marital status are determined as the features to be deleted. Because the maximum number nums of features deleted in a single iteration is 1 and the number of features to be deleted is greater than 1, only gender, the feature to be deleted with the lowest importance, is deleted, and marital status is returned to the features to be screened. Each feature to be screened is then input into the gradient lifting model to be trained again and its importance is determined anew. The iteration continues until the importances of all features to be screened are greater than 0.1 and the model effect no longer improves, at which point each remaining feature to be screened is determined as a sample feature.
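The screening loop of steps 703 to 709 can be sketched as below. This is a simplified illustration: `screen_features` and `importance_fn` are hypothetical names, and `importance_fn` stands in for retraining the gradient lifting model and extracting importances. In the usage example the importances are held fixed at the values of the example, whereas a real retraining would produce new importances in every iteration.

```python
def screen_features(features, importance_fn, threshold, max_delete):
    """Iteratively drop low-importance features and return the kept ones."""
    features = list(features)
    while True:
        importance = importance_fn(features)                              # step 703
        to_delete = [f for f in features if importance[f] <= threshold]   # step 704
        if not to_delete:                                                 # steps 705-706
            return features
        if len(to_delete) >= max_delete:                                  # steps 707, 709
            # Delete only the max_delete least important candidates.
            to_delete = sorted(to_delete, key=lambda f: importance[f])[:max_delete]
        # Otherwise (step 708) all candidates are deleted.
        features = [f for f in features if f not in to_delete]


# Importances from the example, kept fixed here for illustration only.
example_importance = {
    "age": 0.18, "gender": 0.005, "education": 0.15, "marital_status": 0.09,
    "monthly_income": 0.25, "monthly_expenditure": 0.12, "default_amount": 0.3,
}


def fixed_importance(features):
    return {f: example_importance[f] for f in features}


kept = screen_features(example_importance, fixed_importance, threshold=0.1, max_delete=1)
```

With fixed importances, gender is removed in the first iteration and marital status in the second; with a retrained model, marital status could instead rise above the threshold and be kept.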

In the embodiment of the present invention, after the scoring card is created through steps 301 to 314, the created scoring card may be used to score a user to be scored. Specifically, referring to fig. 8, which shows the scoring process based on the scoring card creating method provided by the embodiment of the present invention, the process includes the following steps:

Step 801, acquiring a plurality of feature data of a user to be scored, wherein the feature data include: behavior data and attribute data of the user to be scored.

The user to be scored is a user for whom risk assessment is required, such as a user of a financial institution. The attribute data of the user to be scored may specifically include: the gender, age, educational background, industry, region and marital status of the user to be scored, and the like. The behavior data of the user to be scored may specifically include: the number of purchases made by the user to be scored, the amount paid for the purchases, the number of days between purchases, the income of the user to be scored, the ratio of expenditure to income, deposit data, withdrawal data, the historical default amount of the user to be scored, and the like.

Step 802, for each feature data, acquiring a score of the feature data corresponding to a pre-created scoring card.

The scoring card is created based on the scoring card creation method provided by the embodiment of the invention.

Step 803, determining the sum of the scores of the feature data and the basic score of the scoring card as the score of the user to be scored.

Wherein a higher score indicates a lower risk for the user to be scored.

For example, suppose the user to be scored is user α, whose feature data include: gender male, marital status unmarried, age 25, and income 15000 yuan. As shown in fig. 3b, the score of each feature data in the pre-created scoring card 300 may be obtained: the income of 15000 yuan falls in the regression tree sub-box [10000, 30000), and the corresponding score is -7.8; the age of 25 falls in the regression tree sub-box [20, 30), and the corresponding score is 23.3; the score corresponding to the regression tree sub-box gender = male is 1.5; the score corresponding to the regression tree sub-box marital status = unmarried is 0.2; and the basic score of the scoring card is 35. The sum of the scores of the feature data and the basic score of the scoring card is then determined as the score of the user to be scored:

Score_α＝-7.8+23.3+1.5+0.2+35＝52.2
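The lookup and summation of steps 801 to 803 can be sketched as follows. The table layout and the names `NUMERIC_BINS`, `CATEGORY_SCORES` and `score_user` are hypothetical. Only the scores quoted in the text (-16, -7.8, 23.3, 1.5, 0.2 and the basic score 35) come from the example; every 0.0 entry is a placeholder for a value not stated in the text, and the sketch assumes that -7.8 belongs to the income bin containing 15000.

```python
from bisect import bisect_right

BASE_SCORE = 35  # preset basic score of the scoring card

# For each numerical feature: ascending bin edges and one score per bin.
# Bins are left-closed, so a value equal to an edge falls in the bin to its right.
NUMERIC_BINS = {
    "income": ([1000, 10000, 30000, 50000], [-16, 0.0, -7.8, 0.0, 0.0]),
    "age": ([20, 30, 50], [0.0, 23.3, 0.0, 0.0]),
}
# For each categorical feature: one score per category (0.0 = placeholder).
CATEGORY_SCORES = {
    "gender": {"male": 1.5, "female": 0.0},
    "marital_status": {"unmarried": 0.2, "married": 0.0},
}


def score_user(user):
    """Sum the basic score and the score of the bin each feature value falls in."""
    total = BASE_SCORE
    for feature, value in user.items():
        if feature in NUMERIC_BINS:
            edges, scores = NUMERIC_BINS[feature]
            total += scores[bisect_right(edges, value)]
        else:
            total += CATEGORY_SCORES[feature][value]
    return total


user_alpha = {"income": 15000, "age": 25, "gender": "male", "marital_status": "unmarried"}
print(round(score_user(user_alpha), 1))  # 52.2
```

Using `bisect_right` on the sorted bin edges keeps the lookup consistent with left-closed, right-open bins such as [20, 30).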

In the embodiment of the present invention, after the target value intervals are obtained in steps 301 to 313, each target value interval may alternatively be used as a feature bin, a logistic regression model may be used to determine the score corresponding to each feature bin, and a scoring card may be created from the feature bins and their corresponding scores. Given the feature bins, determining the score of each feature bin with a logistic regression model and creating the corresponding scoring card are described in detail in the prior art and are not repeated here. In this way, the feature bins obtained by the embodiment of the invention can be applied directly to the traditional scoring card creating process, which can improve the effect of the created scoring card and widens the range of application of the invention.

Based on the same inventive concept, and corresponding to the scoring card creating method provided in the above embodiment of the present invention, another embodiment of the present invention further provides a scoring card creating device, a schematic structural diagram of which is shown in fig. 9, and the device specifically includes:

a data obtaining module 901, configured to obtain data of a plurality of sample features of a plurality of sample users, where the data of the plurality of sample features of each sample user includes: behavior data and attribute data of the sample user; each sample user correspondingly has a label, and the label is used for representing whether the sample user is a high-risk user;

a regression tree training module 902, configured to train, for each sample feature, to obtain one or more regression trees corresponding to the sample feature based on each feature value of the sample feature; each regression tree includes two leaf nodes, respectively representing: dividing two numerical value intervals of the sample characteristics by the characteristic values corresponding to the regression tree;

an interval determining module 903, configured to sort the regression trees corresponding to the same sample feature in ascending order of the feature values corresponding to the regression trees; and determine, as target numerical value intervals, the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree, and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees;

and a scoring card creating module 904, configured to create a scoring card including each regression tree bin by taking each target value interval as a regression tree bin.

It can be seen that, with the device provided in the embodiment of the present invention, data of a plurality of sample features of a plurality of sample users are acquired; for each sample feature, one or more regression trees corresponding to the sample feature are obtained by training based on each feature value of the sample feature; the regression trees corresponding to the same sample feature are sorted in ascending order of their corresponding feature values; the numerical value interval represented by the left leaf node of the first sorted regression tree, the numerical value interval represented by the right leaf node of the last sorted regression tree, and the intersection of the numerical value intervals represented by two adjacent leaf nodes of different regression trees are determined as target numerical value intervals; and each target numerical value interval is taken as a regression tree sub-box to create a scoring card comprising each regression tree sub-box. Compared with the traditional scoring card creating mode, the scoring card creating device provided by the embodiment of the invention combines the binning process with the model training process to generate the regression tree sub-boxes, so that the scoring card is created automatically, its creation process is simplified, the operation is simpler and more convenient, and the scoring effect of the scoring card is improved.

Further, referring to fig. 10, the regression tree training module 902 includes:

a regression tree determination submodule 1001, configured to, for each sample feature of a sample user, take the data of the sample feature as feature values and, for each feature value of the sample feature, determine a regression tree with the feature value as the demarcation point based on a gradient lifting algorithm; each leaf node of the regression tree corresponds to a prediction score, which represents the score assigned when the data of the sample feature falls within the numerical value interval represented by the leaf node;

a gain function determining submodule 1002, configured to determine a gain function of each regression tree with each feature value as a boundary point;

a regression tree selection submodule 1003, configured to select the regression tree with the largest gain function from the regression trees as the currently determined regression tree;

An output score obtaining sub-module 1004, configured to obtain a sum of prediction scores of data of each sample feature of the sample user in the currently determined regression tree as an output score;

a loss function value determining submodule 1005, configured to determine a loss function of the current gradient boosting tree model to be trained, based on the label of the sample user and the output score; the current gradient boosting tree model to be trained comprises: one or more currently determined regression trees;

a determining submodule 1006, configured to determine whether the loss function converges; if so, fixing the parameters of the current gradient boosting tree model to be trained to obtain a target gradient boosting tree model; if not, re-determining, for each feature value of each sample feature of the sample user, the regression tree using the feature value as the demarcation point based on the gradient boosting algorithm, and returning to the step of respectively determining the gain function of each regression tree with each feature value as the demarcation point;

a merging submodule 1007, configured to extract parameters of each regression tree of the target gradient boosting tree model, and merge multiple regression trees representing the same feature value of the same sample feature to obtain one or more regression trees corresponding to each sample feature.
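The cooperation of submodules 1001 to 1007 can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the logistic loss, the second-order gain formula, the regularization constant `lam`, the learning rate `lr`, the tolerance `tol`, the convention that the left leaf covers values below the demarcation point, and all function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def best_stump(X, grad, hess, lam=1.0):
    """Try every feature value as a demarcation point and keep the split
    with the largest gain (assumed second-order gain formula)."""
    G, H = sum(grad), sum(hess)
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] < t]
            if not left:  # the smallest value leaves the left leaf empty
                continue
            gl = sum(grad[i] for i in left)
            hl = sum(hess[i] for i in left)
            gr, hr = G - gl, H - hl
            gain = gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam)
            if best is None or gain > best[0]:
                best = (gain, j, t, -gl / (hl + lam), -gr / (hr + lam))
    return best

def train(X, y, max_trees=50, tol=1e-4, lr=0.5):
    """Add one depth-1 regression tree per round until the loss converges."""
    scores = [0.0] * len(X)
    trees, prev_loss = [], float("inf")
    for _ in range(max_trees):
        p = [sigmoid(s) for s in scores]
        grad = [pi - yi for pi, yi in zip(p, y)]
        hess = [pi * (1.0 - pi) for pi in p]
        stump = best_stump(X, grad, hess)
        if stump is None:
            break
        _, j, t, wl, wr = stump
        trees.append((j, t, lr * wl, lr * wr))
        scores = [s + (lr * wl if row[j] < t else lr * wr)
                  for s, row in zip(scores, X)]
        # Numerically stable logistic loss on the summed output scores.
        loss = sum(max(s, 0.0) + math.log1p(math.exp(-abs(s))) - yi * s
                   for s, yi in zip(scores, y)) / len(y)
        if prev_loss - loss < tol:  # treated here as convergence
            break
        prev_loss = loss
    return trees

def merge_trees(trees):
    """Merge trees sharing the same feature and demarcation point by
    summing their leaf prediction scores."""
    merged = {}
    for j, t, wl, wr in trees:
        l, r = merged.get((j, t), (0.0, 0.0))
        merged[(j, t)] = (l + wl, r + wr)
    return merged
```

Calling `merge_trees(train(X, y))` on a list of feature rows `X` and 0/1 labels `y` returns a dictionary keyed by `(feature index, demarcation point)`, which is one possible representation of the one or more regression trees per sample feature.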

Further, the score card creating module 904 is specifically configured to obtain a score corresponding to each target value interval, where the score corresponding to each target value interval is: the sum of the prediction scores corresponding to each leaf node with intersection between the numerical value interval and the target numerical value interval; taking each target value interval as a regression tree sub-box, taking the score corresponding to the target value interval as the score of the regression tree sub-box, and creating a score card comprising each regression tree sub-box and the score corresponding to each regression tree sub-box; the scores of the scoring card comprise scores corresponding to each regression tree sub-box and preset basic scores.

The score corresponding to each target value interval may be determined using the following formula:

Score = -B{f₁ + f₂ + … + f_K}

wherein Score represents the score corresponding to the target value interval, B is a predetermined constant parameter, and f₁, f₂, …, f_K respectively represent the prediction scores corresponding to the K leaf nodes whose numerical value intervals intersect the target value interval.
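As a hedged illustration of the formula, the sketch below derives the target value intervals of a single sample feature from its merged regression trees and scores each interval. The tuple layout `(demarcation point, left score, right score)`, the half-open interval convention, and the function name `build_bins` are assumptions made for this example only.

```python
import math

def build_bins(stumps, B):
    """stumps: (demarcation_point, left_score, right_score) tuples for ONE
    feature, with distinct demarcation points (already merged).
    Returns a list of (low, high, score) for the intervals [low, high)."""
    stumps = sorted(stumps)
    edges = [-math.inf] + [t for t, _, _ in stumps] + [math.inf]
    bins = []
    for lo, hi in zip(edges, edges[1:]):
        # A tree whose demarcation point is at or below `lo` contributes its
        # right leaf; every other tree contributes its left leaf.
        total = sum(r if t <= lo else l for t, l, r in stumps)
        bins.append((lo, hi, -B * total))
    return bins
```

With two trees split at 30 and 50, this yields three sub-boxes: below 30, from 30 up to 50, and 50 or above. The negative sign means that a leaf with a larger prediction score lowers the score of the sub-box.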

Further, the score card creating module 904 is specifically configured to use each target value interval as a feature bin, determine a score corresponding to each feature bin by using a logistic regression model, and create a score card according to each feature bin and the score corresponding to each feature bin.

Further, the data obtaining module 901 includes:

the characteristic data acquisition submodule is used for acquiring data of a plurality of characteristics of a sample user;

the data type detection submodule is used for detecting the type of each feature; if the characteristic is a numerical characteristic, taking the characteristic as a characteristic to be screened; if the feature is a classification type feature, assigning the feature according to a preset assignment rule, and taking the assigned data of the feature as the feature to be screened;

the importance extraction submodule is used for inputting a plurality of features to be screened into the gradient boosting model to be trained and extracting the importance corresponding to each feature to be screened; there is a label corresponding to each feature, and the label is used for characterizing whether the feature is important or not;

the to-be-deleted feature determining submodule is used for, for each feature to be screened, taking the feature to be screened as a feature to be deleted when the importance of the feature to be screened is smaller than or equal to a preset importance threshold;

the third judgment submodule is used for judging whether the number of the features to be deleted is zero or not; if so, determining each feature to be screened as a sample feature; if not, judging whether the number of the features to be deleted is smaller than the preset number, if so, deleting all the features to be deleted, returning to the step of inputting the plurality of features to be screened into the gradient boosting model to be trained, and extracting the importance corresponding to each feature to be screened; if not, deleting the preset number of the features to be deleted with low importance in the features to be deleted, taking the remaining features to be deleted as the features to be screened, returning to the step of inputting the plurality of features to be screened into the gradient boosting model to be trained, and extracting the importance corresponding to each feature to be screened.
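The screening loop carried out by these submodules can be sketched as follows. The callback `importance_fn` stands in for training the gradient boosting model and reading back the importance of each feature; it, the function name `screen_features`, and the parameter names are hypothetical and chosen only for illustration.

```python
def screen_features(features, importance_fn, threshold, max_drop):
    """Iteratively remove low-importance features.

    features      : list of feature names to be screened
    importance_fn : hypothetical callback, list of names -> {name: importance}
    threshold     : preset importance threshold
    max_drop      : preset number of features removed per round (>= 1)
    """
    features = list(features)
    while features:
        importance = importance_fn(features)
        to_delete = [f for f in features if importance[f] <= threshold]
        if not to_delete:
            break  # nothing left to delete: the rest are the sample features
        if len(to_delete) >= max_drop:
            # Too many candidates: delete only the least important ones.
            to_delete = sorted(to_delete, key=lambda f: importance[f])[:max_drop]
        drop = set(to_delete)
        features = [f for f in features if f not in drop]
    return features
```

Because the importance values are recomputed in every round, a feature that falls below the threshold in one round may be kept in a later round once weaker features have been removed.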

Therefore, with the device provided by the embodiment of the invention, the target value intervals are used directly as the regression tree sub-boxes, and the score corresponding to each regression tree sub-box is determined from the prediction scores corresponding to the target value interval. Because the binning process is combined with the model training process, the regression tree sub-boxes and their corresponding scores are obtained at the same time, so that the score card is created automatically. The device provided by the embodiment of the invention simplifies the creation process of the scoring card, makes the operation simpler and more convenient, and can improve the scoring effect of the scoring card.

Corresponding to the scoring method provided in the above embodiment of the present invention, another embodiment of the present invention further provides a scoring device, a schematic structural diagram of which is shown in fig. 11; the scoring device specifically includes:

the characteristic data obtaining module 1101 is configured to obtain a plurality of characteristic data of a user to be scored, where the characteristic data includes: behavior data and attribute data of a user to be scored;

a first score determining module 1102, configured to obtain, for each piece of feature data, a score corresponding to the feature data in a pre-created score card; wherein the scoring card is created by the method of any one of claims 1-7;

a second score determining module 1103, configured to determine the sum of the scores of the feature data and the basic score of the scoring card as the score of the user to be scored; wherein a higher score indicates a lower risk for the user to be scored.
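A minimal sketch of modules 1101 to 1103 is given below. The scoring card layout (a dictionary mapping each feature name to a list of `(low, high, score)` sub-boxes covering `[low, high)`), the example values, and the function name `score_user` are assumptions for illustration.

```python
import math

def score_user(user, scorecard, base_score):
    """Sum the basic score and the score of the sub-box hit by each feature."""
    total = base_score
    for feature, value in user.items():
        for low, high, points in scorecard.get(feature, ()):
            if low <= value < high:
                total += points
                break  # sub-boxes do not overlap, so stop at the first hit
    return total

# Hypothetical scoring card with two features.
scorecard = {
    "age": [(-math.inf, 30, -10), (30, 50, 2), (50, math.inf, 10)],
    "income": [(-math.inf, 5000, -5), (5000, math.inf, 8)],
}
user_score = score_user({"age": 42, "income": 6000}, scorecard, 600)
```

Features of the user that do not appear in the scoring card are ignored in this sketch and contribute nothing to the total.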

An embodiment of the present invention further provides an electronic device, as shown in fig. 12, including a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 complete mutual communication through the communication bus 1204,

a memory 1203 for storing a computer program;

the processor 1201 is configured to implement the following steps when executing the program stored in the memory 1203:

Or, the following steps are implemented:

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any one of the scoring card creation methods or scoring methods described above.

In yet another embodiment, the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any one of the scoring card creation methods or the scoring methods in the above embodiments.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the device, the electronic apparatus and the storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiments.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A score card creation method, comprising:

2. The method according to claim 1, wherein the step of training one or more regression trees corresponding to each sample feature based on the feature values of the sample feature comprises:

for each sample characteristic of a sample user, taking the data of the sample characteristic as a characteristic value, and for each characteristic value of the sample characteristic, determining a regression tree taking the characteristic value as a demarcation point based on a gradient boosting algorithm; each leaf node of the regression tree corresponds to a prediction score and represents: the corresponding score when the data of the sample characteristic is positioned in the numerical value interval represented by the leaf node;

respectively determining the gain function of each regression tree taking each characteristic value as a demarcation point;

selecting the regression tree with the maximum gain function from all the regression trees as the currently determined regression tree;

obtaining the sum of the prediction scores of the data of each sample characteristic of the sample user in the currently determined regression tree as an output score;

determining a loss function of the current gradient boosting tree model to be trained based on the label of the sample user and the output score; the current gradient boosting tree model to be trained comprises: one or more currently determined regression trees;

judging whether the loss function is converged;

if not, re-determining, for each characteristic value of each sample characteristic of the sample user, the regression tree taking the characteristic value as the demarcation point based on the gradient boosting algorithm, and returning to the step of respectively determining the gain function of each regression tree taking each characteristic value as the demarcation point;

3. The method of claim 2, wherein the step of taking each target value interval as a regression tree bin and creating a scorecard comprising each regression tree bin comprises:

4. The method of claim 3, wherein for each target value interval, the score corresponding to the target value interval is determined using the following formula:

Score = -B{f₁ + f₂ + … + f_K}

5. The method of claim 1, wherein the step of taking each target value interval as a regression tree bin and creating a scorecard comprising each regression tree bin comprises:

6. The method of claim 1, wherein obtaining data for a plurality of sample characteristics for a plurality of sample users comprises:

acquiring data of a plurality of characteristics of a sample user;

inputting a plurality of features to be screened into a gradient boosting model to be trained, and extracting the importance corresponding to each feature to be screened; each feature has a corresponding label, and the label is used for representing whether the feature is important or not;

7. The method of claim 6, wherein each of the categorical features is indicative of an attribute of a sample user, the categorical features comprising: the gender of the sample user, the academic history of the sample user, the region to which the sample user belongs and the industry to which the sample user belongs;

8. A scoring method, comprising:

9. A score card creation apparatus, comprising:

10. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1 to 7 or claim 8 when executing a program stored in the memory.