CN112232944B - Method and device for creating scoring card and electronic equipment
- Publication number: CN112232944B
- Application number: CN202011049938.4A
- Authority: CN (China)
- Prior art keywords: feature, sample, regression, regression tree, score
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Abstract
The embodiment of the invention provides a method and a device for creating a scoring card, and an electronic device. The method comprises the following steps: acquiring data of a plurality of sample features of a plurality of sample users; for each sample feature, training one or more regression trees corresponding to the sample feature based on the feature values of the sample feature; sorting the regression trees corresponding to the same sample feature in ascending order of the feature value corresponding to each regression tree; determining, as target value intervals, the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees; and taking each target value interval as a regression tree bin and creating a scoring card comprising the regression tree bins. The method provided by the embodiment of the invention simplifies the process of creating a scoring card.
Description
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a method and an apparatus for creating a scoring card, and an electronic device.
Background
Currently, big data analysis techniques are applied in various fields. For example: in the financial field, risk control may be achieved by analyzing data of users.
Specifically, a financial institution may perform credit risk assessment on a user by performing big data analysis on the user's attribute data, behavior data, and the like. At present, a created scoring card is mainly used to carry out credit scoring on the user based on various attribute and behavior data, for example attribute data such as age, gender, or income and expenditure, and behavior data such as deposits, withdrawals, or payments. In this way, the financial institution may decide, based on the user's credit score, whether to grant credit to the user and, if so, the credit limit and interest rate, thereby reducing the risk in financial transactions. The credit score of a user reflects the probability that the user will repay late or commit fraud; the higher the credit score, the lower the user's credit risk.
It can be seen that creating a scoring card is an important step in credit scoring. Referring to fig. 1, fig. 1 is a diagram showing an exemplary structure of a scoring card in the prior art. The scoring card 100 includes an income variable, an age variable, a gender variable and a marital status variable. Each variable may correspond to a plurality of feature bins, each feature bin being a data interval of that variable; for example, the income variable in fig. 1 corresponds to 3 feature bins: [0, 10000), [10000, 50000) and [50000, +∞), that is, each feature bin is one data interval of the income variable. Each feature bin corresponds to a WOE (weight of evidence) value and a corresponding score. The WOE value of a feature bin represents the difference between the ratio of high-risk users to non-high-risk users within that bin and the ratio of high-risk users to non-high-risk users among all users, and the smaller the WOE value, the lower the default risk of users falling into that feature bin. The score corresponding to each feature bin represents the score assigned to a user when the value of the corresponding variable falls in that bin.
For a user, the score corresponding to each of the user's features may be looked up in the scoring card 100 shown in fig. 1, and the sum of these scores and the base score is then used as the user's credit score. The process of creating the scoring card is therefore a process of analyzing user big data and calculating the credit scores corresponding to various attributes and behaviors. For example, if user A is male, 20 years old, has an income of 5000 and is unmarried, then according to the scoring card 100 shown in fig. 1: the gender male corresponds to a score of 1.6, the marital status unmarried corresponds to a score of 0.3, the age 20 falls in the feature bin [20, 40) with a score of 22.7, and the income 5000 falls in the feature bin [0, 10000) with a score of -7.3. The sum of these scores and the base score, 1.6+0.3+22.7+(-7.3)+33.7=51, is taken as the credit score of user A.
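For illustration only, the following Python sketch shows how such a lookup-and-sum could be implemented. The dictionary layout, the function names and every bin or score not quoted in the description above are assumptions made for the example, not values of any real scoring card.

```python
# A minimal sketch of applying a scoring card like the one in fig. 1.
# Bins and scores not quoted in the description above are illustrative assumptions.
from typing import Union

SCORECARD = {
    "income":  [((0, 10000), -7.3), ((10000, 50000), 6.2), ((50000, float("inf")), 15.1)],
    "age":     [((0, 20), 5.0), ((20, 40), 22.7), ((40, 50), 10.4), ((50, 101), 3.1)],
    "gender":  {"male": 1.6, "female": 2.1},
    "marital": {"married": 4.5, "unmarried": 0.3},
}
BASE_SCORE = 33.7

def bin_score(feature: str, value: Union[float, str]) -> float:
    """Look up the score of the bin that `value` falls into for `feature`."""
    bins = SCORECARD[feature]
    if isinstance(bins, dict):           # categorical feature
        return bins[value]
    for (low, high), score in bins:      # numerical feature: left-closed, right-open bins
        if low <= value < high:
            return score
    raise ValueError(f"value {value} outside all bins of {feature}")

def credit_score(user: dict) -> float:
    """Credit score = base score + sum of the per-feature bin scores."""
    return BASE_SCORE + sum(bin_score(f, v) for f, v in user.items())

# User A from the description: male, 20 years old, income 5000, unmarried.
print(round(credit_score({"income": 5000, "age": 20, "gender": "male",
                          "marital": "unmarried"}), 1))   # -> 51.0
```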
At present, the most commonly used approach is to build a standard scoring card based on logistic regression: user data is first used to determine variable bins, and a logistic regression model is then built to determine the scoring card. Variable binning means determining a plurality of value intervals for each variable; for example, in the scoring card 100 in fig. 1, a plurality of age intervals such as [0, 20), [20, 40), [40, 50) and [50, 100] are determined for the age variable. However, the existing variable binning process requires an engineer to operate repeatedly on each feature in order to find a good binning, which is cumbersome.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for creating a scoring card, and an electronic device, so as to simplify the scoring card creation process.
In order to achieve the above object, an embodiment of the present invention provides a method for creating a scoring card, including:
acquiring data of a plurality of sample features of a plurality of sample users, wherein the data of the plurality of sample features of each sample user comprises behavior data and attribute data of the sample user, and each sample user corresponds to a label indicating whether the sample user is a high-risk user;
for each sample feature, training one or more regression trees corresponding to the sample feature based on the feature values of the sample feature, wherein each regression tree includes two leaf nodes representing the two value intervals into which the sample feature is divided by the feature value corresponding to the regression tree;
sorting the regression trees corresponding to the same sample feature in ascending order of the feature value corresponding to each regression tree, and determining, as target value intervals, the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees;
and taking each target value interval as a regression tree bin, and creating a scoring card comprising the regression tree bins.
Further, the step of training, for each sample feature, one or more regression trees corresponding to the sample feature based on the feature values of the sample feature includes:
for each sample feature of a sample user, taking the data of the sample feature as feature values, selecting one feature value of the sample feature based on a gradient boosting algorithm, and determining a regression tree that uses the feature value as its split point, wherein each leaf node of the regression tree corresponds to a prediction score representing the score assigned when the data of the sample feature falls in the value interval represented by that leaf node;
determining a gain function of the regression tree;
obtaining, as an output score, the sum of the prediction scores of the data of each sample feature of the sample user in the regression trees;
determining a loss function of the gradient boosting tree model currently being trained based on the labels of the sample users and the output scores, wherein the gradient boosting tree model currently being trained comprises the plurality of regression trees determined so far;
judging whether the loss function converges;
if so, fixing the parameters of the gradient boosting tree model currently being trained to obtain a target gradient boosting tree model;
if not, selecting the feature value that maximizes the gain function of the regression tree as a new feature value, and returning to the step of determining a regression tree that uses the feature value as its split point;
and extracting the parameters of each regression tree of the target gradient boosting tree model, and merging the regression trees that represent the same feature value of the same sample feature to obtain the one or more regression trees corresponding to each sample feature.
Further, the step of taking each target value interval as a regression tree bin and creating a scoring card comprising the regression tree bins includes:
obtaining the score corresponding to each target value interval, wherein the score corresponding to each target value interval is the sum of the prediction scores corresponding to the leaf nodes whose value intervals intersect the target value interval;
and taking each target value interval as a regression tree bin, taking the score corresponding to the target value interval as the score of the regression tree bin, and creating a scoring card comprising the regression tree bins and the score corresponding to each regression tree bin, wherein the scoring card comprises the score corresponding to each regression tree bin and a preset base score.
Further, for each target value interval, the score corresponding to the target value interval is determined using the following formula:
Score = -B(f_1 + f_2 + … + f_K)
where Score denotes the score corresponding to the target value interval, B is a preset constant parameter, and f_1, f_2, …, f_K denote the prediction scores corresponding to the K leaf nodes whose value intervals intersect the target value interval.
Further, the step of taking each target value interval as a regression tree bin and creating a scoring card comprising the regression tree bins includes:
taking each target value interval as a feature bin, determining the score corresponding to each feature bin by using a logistic regression model, and creating a scoring card according to the feature bins and the score corresponding to each feature bin.
Further, the acquiring of data of a plurality of sample features of a plurality of sample users includes:
acquiring data of a plurality of features of a sample user;
for each feature, detecting the type of the feature; if the feature is a numerical feature, taking the feature as a feature to be screened; if the feature is a categorical feature, assigning a value to the feature according to a preset assignment rule, and taking the assigned data of the feature as a feature to be screened;
inputting the plurality of features to be screened into a gradient boosting model to be trained, and extracting the importance corresponding to each feature to be screened, wherein each feature corresponds to a label indicating whether the feature is important;
for each feature to be screened, when the importance of the feature to be screened is less than or equal to a preset importance threshold, taking the feature to be screened as a feature to be deleted;
judging whether the number of features to be deleted is zero; if so, determining each feature to be screened as a sample feature;
if not, judging whether the number of features to be deleted is less than a preset number; if so, deleting all the features to be deleted, and returning to the step of inputting the plurality of features to be screened into the gradient boosting model to be trained and extracting the importance corresponding to each feature to be screened;
if not, deleting, from the features to be deleted, the preset number of features to be deleted with the lowest importance, keeping the remaining features to be deleted as features to be screened, and returning to the step of inputting the plurality of features to be screened into the gradient boosting model to be trained and extracting the importance corresponding to each feature to be screened.
Further, each categorical feature is used to represent an attribute of the sample user, and the categorical features include: the gender of the sample user, the education level of the sample user, the region to which the sample user belongs, and the industry to which the sample user belongs;
if the feature is a categorical feature, assigning a value to the feature according to a preset assignment rule includes:
for each categorical feature, taking, as the value of the categorical feature, the number of sample users who have the attribute represented by the categorical feature and are marked in advance as high risk, divided by the number of all sample users who have the attribute represented by the categorical feature.
In order to achieve the above object, an embodiment of the present invention further provides a scoring card creation apparatus, including:
a data acquisition module, configured to acquire data of a plurality of sample features of a plurality of sample users, wherein the data of the plurality of sample features of each sample user comprises behavior data and attribute data of the sample user, and each sample user corresponds to a label indicating whether the sample user is a high-risk user;
a regression tree training module, configured to train, for each sample feature, one or more regression trees corresponding to the sample feature based on the feature values of the sample feature, wherein each regression tree includes two leaf nodes representing the two value intervals into which the sample feature is divided by the feature value corresponding to the regression tree;
an interval determining module, configured to sort the regression trees corresponding to the same sample feature in ascending order of the feature value corresponding to each regression tree, and determine, as target value intervals, the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees;
and a scoring card creation module, configured to take each target value interval as a regression tree bin and create a scoring card comprising the regression tree bins.
Correspondingly, the embodiment of the invention provides a scoring method, which comprises the following steps:
acquiring a plurality of pieces of feature data of a user to be scored, wherein the feature data comprises behavior data and attribute data of the user to be scored;
for each piece of feature data, obtaining the score corresponding to the feature data in a pre-created scoring card, wherein the scoring card is created by the method of any one of claims 1-7;
and determining the sum of the score of each piece of feature data and the base score of the scoring card as the score of the user to be scored, wherein a higher score indicates a lower risk of the user to be scored.
Based on the scoring method, an embodiment of the invention further provides a scoring apparatus.
In order to achieve the above object, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
and a processor, configured to implement any of the above scoring card creation method steps or scoring method steps when executing the program stored in the memory.
To achieve the above object, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored therein, where the computer program, when executed by a processor, implements any of the above scoring card creation method steps or scoring method steps.
To achieve the above object, an embodiment of the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above scoring card creation method steps or scoring method steps.
The embodiment of the invention has the beneficial effects that:
According to the method provided by the embodiment of the invention, based on the data of the sample features of the sample users, one or more regression trees corresponding to each sample feature are trained on the feature values of the sample feature; the regression trees corresponding to the same sample feature are sorted in ascending order of the feature value corresponding to each regression tree; the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees are determined as target value intervals; each target value interval is taken as a regression tree bin, and a scoring card comprising the regression tree bins is created. Compared with the traditional way of creating a scoring card, the method provided by the embodiment of the invention merges the binning process and the model training process to generate the regression tree bins, so that the scoring card is created automatically. This simplifies the scoring card creation process, makes the operation simpler and more convenient, and can improve the scoring effect of the scoring card.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram showing an exemplary structure of a scoring card according to the prior art;
FIG. 2 is a flowchart of a method for creating a scoring card according to an embodiment of the present invention;
FIG. 3a is a flowchart of another method for creating a scoring card according to an embodiment of the present invention;
fig. 3b shows a scoring card created by the scoring card creation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a regression tree provided in an embodiment of the present invention;
FIG. 5 is a schematic diagram of merging multiple regression trees according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of determining a target value interval from multiple regression trees of the same sample feature according to an embodiment of the present invention;
FIG. 7 is a flow chart of acquiring data of a sample feature according to an embodiment of the present invention;
FIG. 8 is a flowchart of scoring based on the method for creating a scoring card according to an embodiment of the present invention;
fig. 9 is a block diagram of a scoring card creating device according to an embodiment of the present invention;
FIG. 10 is a block diagram of another scoring card creation apparatus according to an embodiment of the present invention;
FIG. 11 is a block diagram of a scoring device according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Since the prior art has the problem that the scoring card creation process is cumbersome, in order to solve this technical problem, embodiments of the invention provide a scoring card creation method, a scoring card creation apparatus and an electronic device.
Referring to fig. 2, a process of creating a scoring card includes:
Step 201, acquiring data of a plurality of sample features of a plurality of sample users, wherein the data of the plurality of sample features of each sample user comprises behavior data and attribute data of the sample user, and each sample user corresponds to a label indicating whether the sample user is a high-risk user.
Step 202, for each sample feature, training one or more regression trees corresponding to the sample feature based on the feature values of the sample feature, wherein each regression tree includes two leaf nodes representing the two value intervals into which the sample feature is divided by the feature value corresponding to the regression tree.
Step 203, sorting the regression trees corresponding to the same sample feature in ascending order of the feature value corresponding to each regression tree, and determining, as target value intervals, the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees.
Step 204, taking each target value interval as a regression tree bin, and creating a scoring card comprising the regression tree bins.
According to the method provided by the embodiment of the invention, based on the data of the sample features of the sample users, one or more regression trees corresponding to each sample feature are trained on the feature values of the sample feature; the regression trees corresponding to the same sample feature are sorted in ascending order of the feature value corresponding to each regression tree; the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees are determined as target value intervals; each target value interval is taken as a regression tree bin, and a scoring card comprising the regression tree bins is created. Compared with the traditional way of creating a scoring card, the method provided by the embodiment of the invention merges the binning process and the model training process to generate the regression tree bins, so that the scoring card is created automatically, the scoring card creation process is simplified, the operation is simpler and more convenient, and the scoring effect of the scoring card can be improved.
The method and the apparatus for creating a scoring card provided by the embodiments of the invention are described in detail below through specific embodiments.
In one embodiment of the present application, referring to fig. 3a, another flow of the method for creating a scoring card includes the steps of:
Step 301, acquiring data of a plurality of sample features of a plurality of sample users.
In the embodiment of the invention, each sample user corresponds to a label, and the label is used for indicating whether the sample user is a high-risk user. A high-risk user is a user with a specific behavior record. Taking the financial field as an example, a high-risk user may specifically refer to a user with overdue repayment or fraud records, and a non-high-risk user refers to a user without overdue repayment or fraud records. Specifically, if the sample user is a high-risk user, the label corresponding to the sample user is 1, and if the sample user is a non-high-risk user, the label corresponding to the sample user is 0.
In the embodiment of the invention, the data of the plurality of sample features of each sample user includes behavior data and attribute data of the sample user. Taking the financial field as an example, the attribute data of a sample user may specifically include: gender, age, education level, industry, region, marital status, and the like; the behavior data of a sample user may specifically include: the number of purchases made by the sample user, the amount paid by the sample user for purchases, the number of days between the sample user's purchases, the sample user's income, the proportion of the sample user's expenditure to income, the sample user's deposit data, the sample user's withdrawal data, the sample user's historical default amount, and the like.
Step 302, for each sample feature of a sample user, taking the data of the sample feature as feature values, and determining, for each feature value of the sample feature, a regression tree that uses the feature value as its split point based on a gradient boosting algorithm.
Each regression tree has two leaf nodes, and each leaf node corresponds to a prediction score representing the score assigned when the data of the sample feature falls in the value interval represented by that leaf node.
For example, suppose the plurality of sample features of the sample user include gender, age, income and education level. If the age of sample user A is 30 years, then, referring to fig. 4, for the age 30, a regression tree treeX that takes 30 as its feature value and uses this feature value as its split point may be determined. treeX includes two leaf nodes, leafX1 and leafX2: leafX1 represents ages in the interval [0, 30), and leafX2 represents ages in the interval [30, 100]. leafX1 corresponds to the prediction score f_XL, which represents the score for a sample user whose age is in the interval [0, 30); leafX2 corresponds to the prediction score f_XR, which represents the score for a sample user whose age is in the interval [30, 100].
Step 303, determining the gain function of each regression tree that uses a feature value as its split point.
In this step, the gain function value of each regression tree may be determined by the following formula:
Gain = (1/2) × [G_L²/(H_L+λ) + G_R²/(H_R+λ) - (G_L+G_R)²/(H_L+H_R+λ)] - γ
where Gain represents the contribution of the regression tree that uses one feature value of the sample feature as its split point to the gradient boosting tree model currently being trained, G_L denotes the sum of the first-order gradients of all samples assigned to the left leaf node, H_L denotes the sum of the second-order gradients of all samples assigned to the left leaf node, G_R denotes the sum of the first-order gradients of all samples assigned to the right leaf node, H_R denotes the sum of the second-order gradients of all samples assigned to the right leaf node, λ denotes the L2 regularization coefficient, γ is the minimum split loss, and λ and γ are preset parameters.
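For illustration only, the following Python sketch computes this gain for one candidate split in the style of gradient-boosting implementations such as XGBoost. The logistic-loss gradients, the parameter values and the toy data are assumptions made for the example, not part of the patented method.

```python
# A minimal sketch of the split-gain computation described above.
# The gradient/hessian formulas assume a logistic loss; lambda_, gamma and the
# candidate split "age < 30" are illustrative.
import numpy as np

def split_gain(g: np.ndarray, h: np.ndarray, left_mask: np.ndarray,
               lambda_: float = 1.0, gamma: float = 0.0) -> float:
    """Gain of splitting the samples into left_mask / ~left_mask."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (GL**2 / (HL + lambda_)
                  + GR**2 / (HR + lambda_)
                  - (GL + GR)**2 / (HL + HR + lambda_)) - gamma

age = np.array([22, 25, 31, 45, 60], dtype=float)
y = np.array([1, 0, 0, 1, 0], dtype=float)     # 1 = high-risk label
pred = np.full_like(y, 0.5)                    # current model probability
g = pred - y                                   # first-order gradients (logistic loss)
h = pred * (1 - pred)                          # second-order gradients (logistic loss)
print(split_gain(g, h, left_mask=age < 30))
```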
Step 304, selecting, from the regression trees, the regression tree with the largest gain function value as the currently determined regression tree.
In this step, the gain functions of the regression trees obtained by taking each feature value as a split point are compared, and the regression tree with the largest gain function is selected as the currently determined regression tree.
Step 305, obtaining, as an output score, the sum of the prediction scores of the data of each sample feature of the sample user in the currently determined regression trees.
In this step, the output score may be obtained using the following formula:
F_K(x_i) = f_1(x_i) + f_2(x_i) + … + f_K(x_i), with f_k ∈ F
where F is the function space containing all regression trees (a regression tree being a function that maps attributes to scores), f_k is the prediction score corresponding to the k-th regression tree, i.e. the score of the sample feature in the leaf node of that regression tree, K is the number of regression trees, and F_K(x_i) is the output score of the i-th sample user x_i.
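For illustration only, the following Python sketch computes such an output score when every regression tree is a two-leaf stump. The tuple layout and the toy trees and user are assumptions made for the example.

```python
# A minimal sketch of the output score F_K(x): each trained regression tree here is a
# two-leaf stump (feature, split value, left-leaf score, right-leaf score).
def output_score(user: dict, trees: list[tuple[str, float, float, float]]) -> float:
    """F_K(x): sum of the leaf prediction scores of x over all currently determined trees."""
    total = 0.0
    for feature, split, f_left, f_right in trees:
        total += f_left if user[feature] < split else f_right
    return total

trees = [("age", 30, -0.4, 0.2), ("income", 10000, 0.3, -0.1)]
print(output_score({"age": 25, "income": 20000}, trees))   # -0.4 + (-0.1) = -0.5
```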
Step 306, determining the loss function of the gradient boosting tree model currently being trained.
The gradient boosting tree model currently being trained comprises the one or more regression trees determined so far.
In this step, the loss function of the gradient boosting tree model may be determined using the following formula:
L = Σ_{i=1..n} l(y_i, ŷ_i)
where L denotes the loss function value of the gradient boosting tree model currently being trained, y_i denotes the label of the i-th sample user, ŷ_i denotes the prediction score output for the corresponding sample features by the gradient boosting tree model currently being trained, l(·,·) is the per-sample loss, and n denotes the number of sample users.
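For illustration only, the following Python sketch shows one common choice of the per-sample loss l(y_i, y_hat_i), the logistic (cross-entropy) loss on the raw output score. The patent text does not fix a particular loss, so this concrete choice and the toy data are assumptions.

```python
# A minimal sketch of a logistic loss summed over the sample users.
import numpy as np

def logistic_loss(y: np.ndarray, raw_score: np.ndarray) -> float:
    """L = sum_i [ log(1 + exp(F_K(x_i))) - y_i * F_K(x_i) ], computed stably."""
    return float(np.sum(np.logaddexp(0.0, raw_score) - y * raw_score))

y = np.array([1, 0, 1, 0], dtype=float)        # labels: 1 = high-risk sample user
raw = np.array([0.8, -1.2, 0.1, 0.4])          # output scores F_K(x_i)
print(logistic_loss(y, raw))
```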
Step 307, judging whether the loss function converges; if so, executing step 308, and if not, executing step 309.
In this step, it may be judged whether the number of regression trees in the gradient boosting tree model currently being trained reaches a preset value; if it does, the loss function may be determined to have converged, and otherwise the loss function may be determined not to have converged. The preset value may be 100 or 200, and is not specifically limited.
In this step, it may also be judged whether the value of the loss function of the gradient boosting tree model currently being trained no longer decreases; if the value of the loss function no longer decreases, the loss function may be determined to have converged, and otherwise the loss function may be determined not to have converged.
Step 308, fixing the parameters of the gradient boosting tree model currently being trained to obtain a target gradient boosting tree model, and executing step 310.
Step 309, for each feature value of each sample feature of the sample user, re-determining a regression tree that uses the feature value as its split point based on the gradient boosting algorithm, and returning to step 303.
Step 310, extracting the parameters of each regression tree of the target gradient boosting tree model, and merging the regression trees that represent the same feature value of the same sample feature to obtain the one or more regression trees corresponding to each sample feature.
In the embodiment of the invention, the same feature value of the same sample feature may correspond to multiple regression trees, and in this step these regression trees may be merged. Specifically, one regression tree is retained; then, for the regression trees corresponding to the same feature value of the same sample feature, the prediction scores of the leaf nodes that represent the same value interval are summed, and the sum of the prediction scores is used as the prediction score of the corresponding leaf node in the retained regression tree.
For example, referring to fig. 5, for the sample feature age, suppose one of its feature values is 60; the feature value 60 then corresponds to regression trees treeA, treeB, treeC and treeD. Leaf nodes leafAL and leafAR of regression tree treeA correspond to the age intervals [0, 60) and [60, 100] respectively, with prediction scores f_AL and f_AR; leaf nodes leafBL and leafBR of regression tree treeB correspond to the age intervals [0, 60) and [60, 100] respectively, with prediction scores f_BL and f_BR; leaf nodes leafCL and leafCR of regression tree treeC correspond to the age intervals [0, 60) and [60, 100] respectively, with prediction scores f_CL and f_CR; and leaf nodes leafDL and leafDR of regression tree treeD correspond to the age intervals [0, 60) and [60, 100] respectively, with prediction scores f_DL and f_DR.
In this step, the regression trees representing the same feature value of the same sample feature may be merged as follows to obtain the one or more regression trees corresponding to each sample feature:
any one of the regression trees treeA, treeB, treeC and treeD is retained; in this example, regression tree treeA is retained;
the prediction scores of the leaf nodes representing the same age interval [0, 60) are summed to obtain the sum f_sumL = f_AL + f_BL + f_CL + f_DL, and f_sumL is taken as the prediction score of the leaf node for the age interval [0, 60) in the retained regression tree;
the prediction scores of the leaf nodes representing the same age interval [60, 100] are summed to obtain the sum f_sumR = f_AR + f_BR + f_CR + f_DR, and f_sumR is taken as the prediction score of the leaf node for the age interval [60, 100] in the retained regression tree;
the retained regression tree treeA is thus obtained as the regression tree corresponding to the feature value 60 of the sample feature age.
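For illustration only, the following Python sketch merges several two-leaf trees that share the same feature and split point by summing their leaf prediction scores, as in the treeA..treeD example. The tuple layout (feature, split, left score, right score) and the numeric scores are assumptions.

```python
# A minimal sketch of merging regression trees that represent the same feature value.
def merge_same_split(trees: list[tuple[str, float, float, float]]) -> tuple[str, float, float, float]:
    """Keep one tree and sum the left/right leaf scores of all trees with that split."""
    feature, split = trees[0][0], trees[0][1]
    assert all(t[0] == feature and t[1] == split for t in trees)
    f_left = sum(t[2] for t in trees)     # f_AL + f_BL + f_CL + f_DL
    f_right = sum(t[3] for t in trees)    # f_AR + f_BR + f_CR + f_DR
    return (feature, split, f_left, f_right)

trees = [("age", 60, 0.5, -0.5), ("age", 60, 0.25, -0.25),
         ("age", 60, 0.125, -0.125), ("age", 60, 0.125, -0.125)]
print(merge_same_split(trees))   # ('age', 60, 1.0, -1.0)
```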
Step 311, sorting the regression trees corresponding to the same sample feature in ascending order of their corresponding feature values, and determining, as target value intervals, the value interval represented by the left leaf node of the first sorted regression tree, the value interval represented by the right leaf node of the last sorted regression tree, and the intersections of the value intervals represented by adjacent leaf nodes of different regression trees.
For each target value interval, the prediction score corresponding to the target value interval is the sum of the prediction scores corresponding to the leaf nodes whose value intervals intersect the target value interval.
For each regression tree, the value interval represented by the left leaf node is the interval of values smaller than the feature value corresponding to the regression tree; the value interval represented by the right leaf node is the interval of values greater than or equal to the feature value corresponding to the regression tree, or the interval of values greater than or equal to that feature value and smaller than a specific value. For example, referring to fig. 4, regression tree treeA includes a left leaf node leafAL and a right leaf node leafAR. If the feature value corresponding to regression tree treeA is the age 60, then the left leaf node leafAL of regression tree treeA corresponds to the age interval [0, 60), and the right leaf node leafAR of regression tree treeA corresponds to the age interval [60, 100].
For example, referring to fig. 6, for the income sample feature, suppose the corresponding feature values include 1000, 10000, 30000 and 50000, where the income feature value 1000 corresponds to regression tree tree1, the feature value 10000 corresponds to regression tree tree2, the feature value 30000 corresponds to regression tree tree3, and the feature value 50000 corresponds to regression tree tree4. Leaf nodes leaf1L and leaf1R of regression tree tree1 correspond to the income intervals [0, 1000) and [1000, +∞) respectively, with prediction scores f_1L and f_1R; leaf nodes leaf2L and leaf2R of regression tree tree2 correspond to the income intervals [0, 10000) and [10000, +∞) respectively, with prediction scores f_2L and f_2R; leaf nodes leaf3L and leaf3R of regression tree tree3 correspond to the income intervals [0, 30000) and [30000, +∞) respectively, with prediction scores f_3L and f_3R; and leaf nodes leaf4L and leaf4R of regression tree tree4 correspond to the income intervals [0, 50000) and [50000, +∞) respectively, with prediction scores f_4L and f_4R.
In this step, the regression trees are sorted in ascending order of their feature values to obtain, in order, regression trees tree1, tree2, tree3 and tree4. The intersections of the value intervals represented by adjacent leaf nodes of different regression trees are determined as target value intervals: the intersection [1000, 10000) of the value intervals [1000, +∞) and [0, 10000) represented by the adjacent leaf nodes leaf1R and leaf2L of regression trees tree1 and tree2 is determined as a target value interval; the intersection [10000, 30000) of the value intervals [10000, +∞) and [0, 30000) represented by the adjacent leaf nodes leaf2R and leaf3L of regression trees tree2 and tree3 is determined as a target value interval; and the intersection [30000, 50000) of the value intervals [30000, +∞) and [0, 50000) represented by the adjacent leaf nodes leaf3R and leaf4L of regression trees tree3 and tree4 is determined as a target value interval. In addition, the value interval [0, 1000) represented by the left leaf node leaf1L of the first sorted regression tree tree1 and the value interval [50000, +∞) represented by the right leaf node leaf4R of the last sorted regression tree tree4 are each determined as target value intervals. The target value intervals obtained are: [0, 1000), [1000, 10000), [10000, 30000), [30000, 50000) and [50000, +∞).
The prediction score corresponding to the target value interval [0, 1000) is the sum of the prediction scores of the leaf nodes whose value intervals [0, 1000), [0, 10000), [0, 30000) and [0, 50000) intersect it: f_1L + f_2L + f_3L + f_4L. The prediction score corresponding to the target value interval [1000, 10000) is the sum of the prediction scores of the leaf nodes whose value intervals [1000, +∞), [0, 10000), [0, 30000) and [0, 50000) intersect it: f_1R + f_2L + f_3L + f_4L. The prediction score corresponding to the target value interval [10000, 30000) is the sum of the prediction scores of the leaf nodes whose value intervals [1000, +∞), [10000, +∞), [0, 30000) and [0, 50000) intersect it: f_1R + f_2R + f_3L + f_4L. The prediction score corresponding to the target value interval [30000, 50000) is the sum of the prediction scores of the leaf nodes whose value intervals [1000, +∞), [10000, +∞), [30000, +∞) and [0, 50000) intersect it: f_1R + f_2R + f_3R + f_4L. The prediction score corresponding to the target value interval [50000, +∞) is the sum of the prediction scores of the leaf nodes whose value intervals [1000, +∞), [10000, +∞), [30000, +∞) and [50000, +∞) intersect it: f_1R + f_2R + f_3R + f_4R.
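For illustration only, the following Python sketch turns the sorted split points of one feature into target value intervals (regression tree bins) and sums, for each bin, the prediction scores of the leaf nodes it intersects. The tuple layout, the lower edge of 0 and the numeric scores are assumptions made for the income example.

```python
# A minimal sketch of deriving regression tree bins and their summed prediction scores.
import math

def bins_with_scores(trees: list[tuple[float, float, float]]) -> list[tuple[tuple[float, float], float]]:
    """trees: (split, left-leaf score, right-leaf score); returns [((low, high), score), ...]."""
    trees = sorted(trees, key=lambda t: t[0])
    edges = [0.0] + [t[0] for t in trees] + [math.inf]
    result = []
    for low, high in zip(edges[:-1], edges[1:]):
        # a bin [low, high) lies in the left leaf of every tree whose split is > low
        # and in the right leaf of every tree whose split is <= low
        score = sum(f_left if split > low else f_right for split, f_left, f_right in trees)
        result.append(((low, high), score))
    return result

# The income example: splits at 1000, 10000, 30000 and 50000.
income_trees = [(1000, 0.4, -0.1), (10000, 0.3, -0.2), (30000, 0.2, -0.3), (50000, 0.1, -0.4)]
for (low, high), s in bins_with_scores(income_trees):
    print(f"[{low}, {high}): {s:+.1f}")
```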
Step 312, taking each target value interval as a regression tree bin, and creating a scoring card comprising the regression tree bins.
In this step, the scoring card comprising the regression tree bins may be created through the following steps A1-A2:
Step A1: obtaining the score corresponding to each target value interval.
The score corresponding to each target value interval is obtained from the sum of the prediction scores corresponding to the leaf nodes whose value intervals intersect the target value interval.
In this step, for each target value interval, the score corresponding to the target value interval may be determined using the following formula:
Score = -B(f_1 + f_2 + … + f_K)
where Score denotes the score corresponding to the target value interval, B is a preset constant parameter, and f_1, f_2, …, f_K denote the prediction scores corresponding to the K leaf nodes whose value intervals intersect the target value interval.
Step A2: taking each target value interval as a regression tree bin, taking the score corresponding to the target value interval as the score of the regression tree bin, and creating a scoring card comprising the regression tree bins and the score corresponding to each regression tree bin.
The scoring card comprises the score corresponding to each regression tree bin and a preset base score.
By way of example, fig. 3b shows a created scoring card that includes each regression tree bin and the score corresponding to each regression tree bin. Referring to fig. 3b, the feature column of the scoring card 300 includes income, age, gender and marital status; the base score in the scoring card 300 is a preset constant 35. The income feature corresponds to 5 regression tree bins: [0, 1000), [1000, 10000), [10000, 30000), [30000, 50000) and [50000, +∞); the age feature corresponds to 4 regression tree bins: [0, 20), [20, 30), [30, 50) and [50, 100]; the gender feature corresponds to 2 regression tree bins: male and female; and the marital status feature corresponds to 2 regression tree bins: married and unmarried.
Referring to fig. 3b, each regression tree bin of each feature in the scoring card 300 corresponds to a score. Taking the income regression tree bin [0, 1000) as an example, the score corresponding to this bin is -16, and the score corresponding to a regression tree bin may be calculated by the following formula:
Score = -B × Σ_{i ∈ Tree(x_1)} (δ_iL × f_iL + δ_iR × f_iR)
where B is a preset constant parameter, x_1 denotes the income feature, and Tree(x_1) denotes the regression trees corresponding to the bin; a bin may correspond to one or more regression trees, and i ∈ Tree(x_1) denotes the i-th regression tree among those whose leaf-node value intervals intersect the regression tree bin [0, 1000).
f_iL and f_iR are the prediction scores of the left and right leaf nodes of the i-th regression tree. δ_iL and δ_iR are 0-1 indicator variables, and exactly one of δ_iL and δ_iR equals 1. For example, δ_iL = 1 and δ_iR = 0 means that the value interval represented by the left leaf node of the i-th regression tree intersects the regression tree bin [0, 1000); δ_iL = 0 and δ_iR = 1 means that the value interval represented by the right leaf node of the i-th regression tree intersects the regression tree bin [0, 1000).
For example, in the embodiment shown in fig. 6, the regression tree bin [0, 1000) corresponds to 4 regression trees: tree1, tree2, tree3 and tree4. The score corresponding to the regression tree bin [0, 1000) is therefore: Score = -B × (f_1L + f_2L + f_3L + f_4L).
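For illustration only, the following Python sketch turns the summed leaf prediction scores of one bin into its scorecard score via Score = -B × (f_1 + … + f_K). The leaf scores and the value of B are illustrative assumptions (B is derived below from P_0, θ_0 and PDO).

```python
# A minimal sketch of the per-bin scorecard score.
def bin_card_score(leaf_scores: list[float], B: float) -> float:
    """Scorecard score of one regression tree bin."""
    return -B * sum(leaf_scores)

# Income bin [0, 1000): the four left-leaf scores f_1L..f_4L of tree1..tree4.
print(round(bin_card_score([0.4, 0.3, 0.2, 0.1], B=16.0)))   # -> -16
```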
In the embodiment of the invention, the base score A and the constant parameter B may be obtained as follows:
a ratio may be defined to represent the relative probability odds that a user defaults:
odds = P / (1 - P)
where P denotes the probability that user x is a high-risk user, and F_K(x) is the sum of the scores obtained after the features of user x are input into the target regression trees, with P = 1 / (1 + e^(-F_K(x))). Substituting P into the expression for odds gives:
F_K(x) = log(odds)
The score of the scoring card may be defined as a linear expression of the logarithm of the odds, i.e.:
Score = A - B·log(odds) = A - B·F_K(x)
where A and B are constants, and the negative sign before B ensures that the lower the probability of default, the higher the score; that is, a high score represents a low risk and a low score represents a high risk.
In general, two assumptions may be set. Assumption 1: the expected score at a given specific relative probability is obtained from an existing standard scoring card, i.e. the score is P_0 when the relative probability odds is θ_0. Assumption 2: according to the existing standard scoring card, the score decreases by PDO when the relative probability doubles.
From the above assumptions: the score is P_0 when the relative probability odds is θ_0, and the score is P_0 - PDO when the relative probability odds is 2θ_0; that is, the score decreases by PDO when the relative probability doubles. Substituting P_0 and θ_0, and (P_0 - PDO) and 2θ_0, respectively, into the formula Score = A - B·log(odds) gives:
P_0 = A - B·log(θ_0)
P_0 - PDO = A - B·log(2θ_0)
Solving this system of equations gives the base score A and the constant parameter B:
B = PDO / log(2)
A = P_0 + B·log(θ_0)
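For illustration only, the following Python sketch solves for A and B from the two anchor conditions described above (Score = P_0 at odds = θ_0, and Score = P_0 - PDO at odds = 2θ_0). The concrete values of P0, theta0 and PDO are assumptions made for the example.

```python
# A minimal sketch of the scorecard scaling: Score = A - B*log(odds).
import math

def scorecard_scale(P0: float, theta0: float, PDO: float) -> tuple[float, float]:
    """Return (A, B) satisfying both anchor conditions."""
    B = PDO / math.log(2)
    A = P0 + B * math.log(theta0)
    return A, B

A, B = scorecard_scale(P0=600, theta0=1 / 50, PDO=20)    # e.g. 600 points at odds 1:50
score = lambda odds: A - B * math.log(odds)
print(round(score(1 / 50)), round(score(2 / 50)))        # -> 600, 580
```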
By adopting the method provided by the embodiment of the invention, the target value intervals are directly used as regression tree bins, and the score corresponding to each regression tree bin is determined according to the prediction score corresponding to the target value interval. That is, the regression tree binning process and the model training process are merged to generate the regression tree bins, and the score corresponding to each regression tree bin is obtained at the same time, so that the scoring card is created automatically. The method provided by the embodiment of the invention simplifies the scoring card creation process, makes the operation simpler and more convenient, and can improve the scoring effect of the scoring card.
Referring to fig. 7, the method for acquiring data of a plurality of sample features of a plurality of sample users in step 301 above may specifically include:
Step 701, acquiring data of a plurality of features of a sample user.
In this step, various attribute data and behavior data of the sample user may be acquired, specifically including the sample user's gender, age, education level, industry, region, number of purchases, amount paid for purchases, number of days between purchases, income, proportion of expenditure to income, deposit data, withdrawal data, and the like.
Step 702, for each feature, detecting the type of the feature; if the feature is a numerical feature, taking the feature as a feature to be screened; if the feature is a categorical feature, assigning a value to the feature according to a preset assignment rule, and taking the assigned data of the feature as a feature to be screened.
In the embodiment of the invention, each categorical feature is used to represent an attribute of the sample user, and the categorical features include the gender of the sample user, the education level of the sample user, the region to which the sample user belongs, the industry to which the sample user belongs, and the like.
In this step, if the feature is a categorical feature, assigning a value to the feature according to a preset assignment rule may include:
for each categorical feature, taking, as the value of the categorical feature, the number of sample users who have the attribute represented by the categorical feature and are marked in advance as high risk, divided by the number of all sample users who have the attribute represented by the categorical feature.
For example, if the total number of sample users is 1000: for the categorical feature gender = male, if the number of male sample users is 600 and the number of male sample users marked in advance as high risk is 50, the feature may be assigned the value 50 ÷ 600 = 0.0833; for the categorical feature gender = female, if the number of female sample users is 400 and the number of female sample users marked in advance as high risk is 15, the feature may be assigned the value 15 ÷ 400 = 0.0375; for the categorical feature education level = bachelor's degree, if the number of sample users with a bachelor's degree is 680 and the number of such sample users marked in advance as high risk is 20, the feature may be assigned the value 20 ÷ 680 = 0.0294; for the categorical feature region = region A, if the number of sample users in region A is 100 and the number of sample users in region A marked in advance as high risk is 4, the feature may be assigned the value 4 ÷ 100 = 0.04; and for the categorical feature industry = teacher, if the number of sample users whose industry is teacher is 80 and the number of such sample users marked in advance as high risk is 1, the feature may be assigned the value 1 ÷ 80 = 0.0125.
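For illustration only, the following Python sketch implements this assignment rule (a form of bad-rate encoding): each category value is replaced by the high-risk rate observed among the sample users that carry it. The column names and the tiny dataset are assumptions.

```python
# A minimal sketch of the categorical-feature assignment rule described above.
from collections import defaultdict

def bad_rate_encode(categories: list[str], labels: list[int]) -> dict[str, float]:
    """Map each category value to its high-risk (label == 1) rate."""
    total = defaultdict(int)
    bad = defaultdict(int)
    for cat, y in zip(categories, labels):
        total[cat] += 1
        bad[cat] += y
    return {cat: bad[cat] / total[cat] for cat in total}

gender = ["male", "male", "female", "male", "female"]
label = [1, 0, 0, 0, 1]                  # 1 = marked in advance as high risk
print(bad_rate_encode(gender, label))    # {'male': 0.333..., 'female': 0.5}
```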
Step 703, inputting the plurality of features to be screened into the gradient boosting model to be trained, and extracting the importance corresponding to each feature to be screened.
Each feature to be screened corresponds to a label, and the label is used for indicating whether the feature to be screened is important. Specifically, if the feature to be screened is an important feature, the corresponding label is 1, and if the feature to be screened is a non-important feature, the corresponding label is 0.
In this step, after the plurality of features to be screened are input into the gradient boosting model to be trained, the importance corresponding to each feature to be screened may be extracted.
The gradient boosting model to be trained is a model obtained based on a gradient boosting algorithm. An importance threshold and a preset number may be preset for the gradient boosting model to be trained, where the preset number is the maximum number of features deleted in a single iteration.
Step 704, regarding each feature to be filtered, when the importance of the feature to be filtered is less than or equal to a preset importance threshold, the feature to be filtered is used as the feature to be deleted.
Step 705, judging whether the number of the features to be deleted is zero; if yes, go to step 706, if no, go to step 707.
In step 706, each feature to be screened is determined as a sample feature.
Step 707, determining whether the number of features to be deleted is smaller than a preset number, if yes, executing step 708, otherwise, executing step 709.
Step 708, all the features to be deleted are deleted, and step 703 is executed back.
Step 709, among the features to be deleted, delete the preset number of features to be deleted with the lowest importance, take the remaining features to be deleted as features to be screened, and return to step 703.
In this step, the features to be deleted may be sorted by importance, in either descending or ascending order; the preset number of features to be deleted with the lowest importance are then deleted, and the remaining features to be deleted are taken as the features to be screened.
By way of example, assume that the data of the plurality of features of the sample user includes seven features: age, gender, academic, marital status, monthly income, monthly average expenditure, and historical default amount. The category type features (age, gender, academic, and marital status) can be identified and assigned values according to the preset rule. The data containing these seven features to be screened can then be input into the gradient lifting model to be trained, and the importance of each of the seven features to be screened can be obtained; assume the importances are: age 0.18, gender 0.005, academic 0.15, marital status 0.09, monthly income 0.25, monthly average expenditure 0.12, and historical default amount 0.3. The importance threshold may be preset as thred = 0.1, and the maximum number of features deleted per iteration as nums = 1.
It can then be seen that the importance of both gender and marital status is less than 0.1, so gender and marital status are determined as features to be deleted. Because the maximum number of features deleted per iteration is nums = 1 and the number of features to be deleted (two) is greater than 1, only the feature to be deleted with the lowest importance, namely gender, is deleted. Marital status is retained as a feature to be screened, each feature to be screened is input into the gradient lifting model to be trained again, the importance of each feature to be screened is determined, and the iteration continues until the importance of every feature to be screened is greater than 0.1 and the model effect no longer improves, at which point each remaining feature to be screened is determined as a sample feature.
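The loop in steps 703 to 709 can be captured in a short sketch. This is a minimal illustration assuming scikit-learn's GradientBoostingClassifier stands in for the gradient lifting model (the patent does not name a library), X is a pandas DataFrame of candidate features with 0/1 high-risk labels y, the stop-on-model-effect check is omitted, and screen_features is a hypothetical helper name:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def screen_features(X: pd.DataFrame, y, thresh: float = 0.1, nums: int = 1):
    """Iteratively drop low-importance features (steps 703-709): features whose
    importance is at or below the threshold become features to be deleted; at
    most `nums` of them (the least important) are removed per iteration."""
    features = list(X.columns)
    while True:
        model = GradientBoostingClassifier().fit(X[features], y)
        importance = dict(zip(features, model.feature_importances_))
        to_delete = [f for f in features if importance[f] <= thresh]
        if not to_delete:                          # steps 705-706: nothing to delete,
            return features                        # remaining features are the sample features
        if len(to_delete) < nums:                  # step 708: delete all candidates
            drop = to_delete
        else:                                      # step 709: delete the `nums` least important
            drop = sorted(to_delete, key=importance.get)[:nums]
        features = [f for f in features if f not in drop]
```

With thresh = 0.1 and nums = 1, the example above drops gender in the first pass and keeps iterating with the remaining six features.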
In the embodiment of the present invention, after the scoring card is created in steps 301 to 314, the user to be scored may be scored using the created scoring card. Referring specifically to fig. 8, fig. 8 is a flowchart of scoring for a scoring card creation method according to an embodiment of the present invention, including the following steps:
step 801, obtaining a plurality of feature data of a user to be scored, wherein the feature data comprises: behavior data and attribute data of users to be scored.
The users to be scored are users who need to be risk rated, such as users of financial institutions. The plurality of feature data of the user to be scored may include attribute data and behavior data. The attribute data of the user to be scored may specifically include: the gender, age, academic, industry, region, marital status, and the like of the user to be scored; the behavior data of the user to be scored may specifically include: the number of purchases made by the user to be scored, the amounts paid for those purchases, the number of days between purchases, the income of the user to be scored, the proportion of expenditure to income, the deposit data of the user to be scored, the withdrawal data of the user to be scored, the historical default amount of the user to be scored, and the like.
Step 802, for each feature data, obtaining a score corresponding to the feature data in a pre-created scoring card.
The scoring card is created based on the scoring card creation method provided by the embodiment of the invention.
And 803, determining the sum of the score of each feature data and the basic score of the scoring card as the score of the user to be scored.
Wherein a higher score indicates a lower risk for the user to be scored.
For example, if the user to be scored is user α, the feature data of user α includes: gender is male, marital status is unmarried, age is 25 years, and income is 15000 yuan. Then, as shown in fig. 3b, for each feature data, the score corresponding to that feature data in the pre-created score card 300 may be obtained: the income of 15000 yuan corresponds to the regression tree bin [1000, 10000), with a score of -7.8 points; the age of 25 years corresponds to the regression tree bin [20, 30), with a score of 23.3 points; the regression tree bin of gender being male corresponds to a score of 1.5 points; the regression tree bin of marital status being unmarried corresponds to a score of 0.2 points; and the base score in the scoring card is 35 points. Then, the sum of the score of each feature data and the base score of the scoring card can be determined as the score of the user to be scored:
Scoreα = -7.8 + 23.3 + 1.5 + 0.2 + 35 = 52.2
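A minimal sketch of steps 801 to 803 as code. The score card here is a plain dictionary whose bin edges and scores are hypothetical, chosen only so that the lookup reproduces the 52.2-point total; they are not values taken from fig. 3b:

```python
# Hypothetical in-memory representation of a pre-created score card.
score_card = {
    "base_score": 35,
    "income":  [((1000, 10000), -12.0), ((10000, 50000), -7.8)],   # (interval [a, b), score)
    "age":     [((18, 20), 5.0), ((20, 30), 23.3)],
    "gender":  {"male": 1.5, "female": 2.0},
    "marital": {"unmarried": 0.2, "married": 1.0},
}

def score_user(card, user):
    """Steps 801-803: look up the score of every feature datum and add the base score."""
    total = card["base_score"]
    for feature, value in user.items():
        bins = card[feature]
        if isinstance(bins, dict):                       # category feature: direct lookup
            total += bins[value]
        else:                                            # numerical feature: find the bin [a, b)
            total += next(s for (a, b), s in bins if a <= value < b)
    return total

user_alpha = {"income": 15000, "age": 25, "gender": "male", "marital": "unmarried"}
print(round(score_user(score_card, user_alpha), 1))      # -7.8 + 23.3 + 1.5 + 0.2 + 35 = 52.2
```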
In the embodiment of the present invention, after each target value interval is obtained through steps 301 to 313, each target value interval may instead be used as a feature bin, a logistic regression model may be adopted to determine the score corresponding to each feature bin, and a score card may be created from each feature bin and its corresponding score. Here the feature bins are already known; determining the score of each feature bin with a logistic regression model and creating the corresponding score card from the bins are described in detail in the prior art and are not repeated here. In the embodiment of the invention, the known feature binning can thus be applied directly to the traditional scoring card creation process, so that the effect of the created scoring card is significantly improved and the technology of the invention has a wider range of application.
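A minimal sketch of that alternative, under the assumption that each sample has already been assigned the bin (target value interval) it falls into for every feature. scikit-learn's LogisticRegression stands in for the logistic regression model, and the scaling constant B is an illustrative points-style constant, not a value from the patent:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def scorecard_from_bins(binned: pd.DataFrame, y, B: float = 20 / np.log(2)):
    """binned: one column per sample feature, each cell holding the label of the
    feature bin (target value interval) the sample falls into; y: 0/1 high-risk
    labels. Returns a score for every feature bin."""
    dummies = pd.get_dummies(binned, prefix_sep="=")        # one 0/1 indicator column per bin
    model = LogisticRegression(max_iter=1000).fit(dummies, y)
    # One score per bin: the bin's logistic-regression weight scaled by -B,
    # in the spirit of the Score = -B{...} convention used below.
    return dict(zip(dummies.columns, -B * model.coef_[0]))
```

Each dummy column corresponds to one feature bin, so the returned dictionary maps every bin directly to its score in the card.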
Based on the same inventive concept, according to the method for creating a scoring card provided in the above embodiment of the present invention, correspondingly, another embodiment of the present invention further provides a device for creating a scoring card, a schematic structural diagram of which is shown in fig. 9, which specifically includes:
A data acquisition module 901, configured to acquire data of a plurality of sample features of a plurality of sample users, where the data of the plurality of sample features of each sample user includes: behavior data and attribute data of the sample user; each sample user is correspondingly provided with a label, and the label is used for representing whether the sample user is a high risk user or not;
A regression tree training module 902, configured to train to obtain, for each sample feature, one or more regression trees corresponding to the sample feature based on each feature value of the sample feature; each regression tree includes two leaf nodes, representing: two numerical intervals of the sample features divided by the feature values corresponding to the regression tree;
The interval determining module 903 is configured to sort regression trees corresponding to the same sample feature according to the order from small to large of feature values corresponding to each regression tree; determining the intersection of the numerical value interval represented by the left leaf node of the first regression tree after sequencing, the numerical value interval represented by the right leaf node of the last regression tree after sequencing and the numerical value interval represented by the adjacent two leaf nodes of different regression trees as a target numerical value interval;
the scoring card creating module 904 is configured to use each target value interval as a regression tree bin, and create a scoring card including each regression tree bin.
It can be seen that, by adopting the device provided by the embodiment of the invention, according to the data of the plurality of sample features of the plurality of sample users, for each sample feature, one or more regression trees corresponding to the sample feature are obtained through training based on each feature value of the sample feature, and the regression trees corresponding to the same sample feature are sorted in ascending order of the feature value corresponding to each regression tree; the intersection of the numerical value interval represented by the left leaf node of the first regression tree after sequencing, the numerical value interval represented by the right leaf node of the last regression tree after sequencing and the numerical value interval represented by the adjacent two leaf nodes of different regression trees is determined as a target numerical value interval, each target numerical value interval is taken as one regression tree bin, and a scoring card comprising each regression tree bin is created. Compared with the traditional scoring card creation approach, the scoring card creation device provided by the embodiment of the invention combines the binning process with the model training process to generate the regression tree bins, so that the scoring card is created automatically, the creation process is simplified, the operation is simpler and more convenient, and the scoring effect of the scoring card can be improved.
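A minimal sketch of the interval determination performed by the interval determining module 903, reading it as: the interval of the left leaf of the first sorted tree, the pairwise intersections of adjacent leaves of neighbouring trees, and the interval of the right leaf of the last sorted tree. Each regression tree is reduced here to its split feature value, and the function name target_intervals is hypothetical:

```python
import math

def target_intervals(split_values):
    """Given the feature values (demarcation points) of the regression trees
    trained for one sample feature, return the target value intervals."""
    ts = sorted(split_values)                              # sort trees by their feature value
    intervals = [(-math.inf, ts[0])]                       # left leaf of the first tree
    intervals += [(a, b) for a, b in zip(ts, ts[1:])]      # intersections of adjacent leaves
    intervals.append((ts[-1], math.inf))                   # right leaf of the last tree
    return intervals

print(target_intervals([30, 10, 20]))
# [(-inf, 10), (10, 20), (20, 30), (30, inf)]
```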
Further, referring to fig. 10, the regression tree training module 902 includes:
The regression tree determination submodule 1001 is configured to, for each sample feature of a sample user, take the data of the sample feature as feature values and, for each feature value of the sample feature, determine a regression tree using that feature value as a demarcation point based on a gradient lifting algorithm; each leaf node of the regression tree corresponds to a prediction score, representing: the score obtained when the data of the sample feature falls within the numerical interval represented by the leaf node;
a gain function determining submodule 1002, configured to determine a gain function of each regression tree with each feature value as a demarcation point;
A regression tree selection submodule 1003, configured to select, from the regression trees, the regression tree with the largest gain function as the currently determined regression tree;
An output score obtaining submodule 1004, configured to obtain, as the output score, the sum of the prediction scores obtained by the data of each sample feature of the sample user in the currently determined regression trees;
A loss function value determining submodule 1005, configured to determine a loss function of the current gradient lifting tree model to be trained based on the label of the sample user and the output score; the current gradient lifting tree model to be trained comprises the following steps: one or more regression trees currently determined;
A judging submodule 1006, configured to judge whether the loss function converges; if yes, fixing parameters of a current gradient lifting tree model to be trained to obtain a target gradient lifting tree model; if not, re-determining a regression tree taking the characteristic value as a demarcation point based on a gradient lifting algorithm aiming at each characteristic value of each sample characteristic of the sample user, and returning to the step of respectively determining gain functions of each regression tree taking each characteristic value as a demarcation point;
and the merging submodule 1007 is configured to extract parameters of each regression tree of the target gradient lifting tree model, and merge multiple regression trees representing the same feature value of the same sample feature data to obtain one or more regression trees corresponding to each sample feature.
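The submodules above can be illustrated with a simplified, self-contained sketch. It assumes a logistic loss and a squared-error gain on the residuals (the embodiment's exact gain and loss functions are not reproduced here), fits one single-feature, two-leaf regression tree per boosting round, and records the demarcation points per feature so that trees sharing the same feature value are merged:

```python
import numpy as np

def train_feature_stumps(X: np.ndarray, y: np.ndarray, n_rounds: int = 50,
                         lr: float = 0.1, tol: float = 1e-4):
    """Every round fits one depth-1 regression tree (two leaf nodes) on a single
    feature, choosing the feature value (demarcation point) with the largest
    gain on the current residuals; training stops when the loss converges.
    Returns the demarcation points collected per feature."""
    n, d = X.shape
    output = np.zeros(n)                        # current output score of the boosted model
    prev_loss = np.inf
    splits = {j: set() for j in range(d)}       # feature index -> demarcation points
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-output))
        residual = y - p                        # negative gradient of the logistic loss
        best = None                             # (gain, feature, threshold, left_pred, right_pred)
        for j in range(d):
            for t in np.unique(X[:, j]):
                left, right = residual[X[:, j] < t], residual[X[:, j] >= t]
                if len(left) == 0 or len(right) == 0:
                    continue
                # Simplified squared-error gain; stands in for the embodiment's gain function.
                gain = left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)
                if best is None or gain > best[0]:
                    best = (gain, j, t, left.mean(), right.mean())
        if best is None:
            break
        _, j, t, lp, rp = best
        output += lr * np.where(X[:, j] < t, lp, rp)   # add the selected tree's prediction scores
        splits[j].add(float(t))                        # trees sharing a feature value merge into one
        loss = np.mean(np.log1p(np.exp(-(2 * y - 1) * output)))   # logistic loss, labels y in {0, 1}
        if prev_loss - loss < tol:                     # loss converged: fix the model parameters
            break
        prev_loss = loss
    return splits
```

The returned split sets per feature are exactly what the interval determining module sorts and turns into target value intervals.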
Further, the score card creating module 904 is specifically configured to obtain a score corresponding to each target value interval, where the score corresponding to each target value interval is: the sum of the prediction scores corresponding to the leaf nodes with intersections of the numerical intervals and the target numerical interval; taking each target value interval as a regression tree sub-box, taking the score corresponding to the target value interval as the score of the regression tree sub-box, and creating a scoring card comprising each regression tree sub-box and the score corresponding to each regression tree sub-box; the scoring card comprises scoring corresponding to each regression tree score box and preset basic scores.
Further, for each target value interval, the score corresponding to the target value interval is determined by adopting the following formula:
Score = -B{f1 + f2 + … + fK}
Where Score represents the score corresponding to the target value interval, B is a preset constant parameter, and f1, f2, …, fK are the prediction scores corresponding to the K leaf nodes whose numerical intervals intersect the target value interval.
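Worked out numerically, with hypothetical values for B and for the leaf prediction scores f1…fK:

```python
B = 20.0                                     # preset constant parameter (illustrative value)
leaf_scores = [0.12, -0.05, 0.31]            # prediction scores f1..fK of the K leaf nodes
                                             # whose value intervals intersect the target interval
score = -B * sum(leaf_scores)                # Score = -B{f1 + f2 + ... + fK}
print(round(score, 2))                       # -7.6
```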
Further, the score card creating module 904 is specifically configured to use each target value interval as a feature score box, determine a score corresponding to each feature score box by using a logistic regression model, and create a score card according to each feature score box and the score corresponding to each feature score box.
Further, the data acquisition module 901 includes:
The characteristic data acquisition sub-module is used for acquiring data of a plurality of characteristics of a sample user;
A data type detection sub-module for detecting, for each feature, a type of the feature; if the characteristic is a numerical characteristic, taking the characteristic as a characteristic to be screened; if the feature is a category type feature, assigning the feature according to a preset assignment rule, and taking the assigned data of the feature as a feature to be screened;
the importance extraction submodule is used for inputting a plurality of features to be screened into the gradient lifting model to be trained and extracting importance corresponding to each feature to be screened; there is one label for each feature, which is used to characterize whether the feature is important.
The feature to be deleted determining submodule is used for regarding each feature to be screened, and taking the feature to be screened as the feature to be deleted when the importance of the feature to be screened is smaller than or equal to a preset importance threshold;
The third judging submodule is used for judging whether the number of the features to be deleted is zero or not; if yes, determining each feature to be screened as a sample feature; if not, judging whether the number of the features to be deleted is smaller than the preset number, if so, deleting all the features to be deleted, and returning to the step of inputting a plurality of features to be screened into a gradient lifting model to be trained, and extracting the importance degree corresponding to each feature to be screened; if not, deleting the preset number of the features to be deleted with low importance among the features to be deleted, taking the remaining features to be deleted as the features to be screened, and returning to the step of inputting the multiple features to be screened into the gradient lifting model to be trained and extracting the importance corresponding to each feature to be screened.
Further, each of the category type features is used for representing an attribute of the sample user, and the category type features include: the gender of the sample user, the academy of the sample user, the region to which the sample user belongs and the industry to which the sample user belongs;
if the feature is a category type feature, assigning the feature according to a preset assignment rule, including:
For each category feature, the number of sample users pre-marked as high risk among the sample users having the attribute represented by the category feature, divided by the total number of sample users having that attribute, is taken as the value of the category feature.
Therefore, by adopting the device provided by the embodiment of the invention, the target value intervals are directly used as regression tree boxes, and the scores corresponding to the regression tree boxes are determined according to the prediction scores corresponding to the target value intervals. Namely, the regression tree box division process and the model training process are combined to generate the regression tree box division, and meanwhile, the corresponding score of the regression tree box division can be obtained, so that the automatic creation of a score card is realized. The device provided by the embodiment of the invention simplifies the creation process of the grading card, so that the operation is simpler and more convenient, and the grading effect of the grading card can be improved.
According to the scoring method provided by the above embodiment of the present invention, correspondingly, another embodiment of the present invention further provides a scoring device, a schematic structural diagram of which is shown in fig. 11, which specifically includes:
the feature data obtaining module 1101 is configured to obtain a plurality of feature data of a user to be scored, where the feature data includes: behavior data and attribute data of users to be scored;
a first score determining module 1102, configured to obtain, for each piece of feature data, a score corresponding to the piece of feature data in a score card created in advance; wherein the scoring card is created by the method of any one of claims 1-7;
A second score determining module 1103, configured to determine a sum of the score of each feature data and the base score of the score card as the score of the user to be scored; wherein a higher score indicates a lower risk for the user to be scored.
The embodiment of the invention also provides an electronic device, as shown in fig. 12, which comprises a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, wherein the processor 1201, the communication interface 1202 and the memory 1203 complete the communication with each other through the communication bus 1204,
A memory 1203 for storing a computer program;
The processor 1201, when executing the program stored in the memory 1203, performs the following steps:
acquiring data of a plurality of sample features of a plurality of sample users, the data of the plurality of sample features of each sample user comprising: behavior data and attribute data of the sample user; each sample user is correspondingly provided with a label, and the label is used for representing whether the sample user is a high risk user or not;
for each sample feature, training to obtain one or more regression trees corresponding to the sample feature based on each feature value of the sample feature; each regression tree includes two leaf nodes, representing: two numerical intervals of the sample features divided by the feature values corresponding to the regression tree;
sequencing the regression trees corresponding to the same sample features according to the sequence from small to large of the feature values corresponding to each regression tree; determining the intersection of the numerical value interval represented by the left leaf node of the first regression tree after sequencing, the numerical value interval represented by the right leaf node of the last regression tree after sequencing and the numerical value interval represented by the adjacent two leaf nodes of different regression trees as a target numerical value interval;
and taking each target numerical value interval as a regression tree sub-box, and creating a grading card comprising each regression tree sub-box.
Or the following steps are realized:
acquiring a plurality of feature data of a user to be scored, wherein the feature data comprises: behavior data and attribute data of users to be scored;
for each piece of characteristic data, obtaining the corresponding score of the characteristic data in a pre-established scoring card; wherein the scoring card is created by the method of any one of claims 1-7;
Determining the sum of the score of each feature data and the basic score of the score card as the score of the user to be scored; wherein a higher score indicates a lower risk for the user to be scored.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include Random Access Memory (RAM), or may include Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any one of the above-described score card creation methods or scoring methods.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the scoring card creation methods or scoring methods of the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), or the like.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the apparatus, electronic device and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and references to the parts of the description of the method embodiments are only needed.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (9)
1. A method of creating a scoring card, comprising:
acquiring data of a plurality of sample features of a plurality of sample users, the data of the plurality of sample features of each sample user comprising: behavior data and attribute data of the sample user; each sample user is correspondingly provided with a label, and the label is used for representing whether the sample user is a high risk user or not;
for each sample feature, training to obtain one or more regression trees corresponding to the sample feature based on each feature value of the sample feature; each regression tree includes two leaf nodes, representing: two numerical intervals of the sample features divided by the feature values corresponding to the regression tree;
sequencing the regression trees corresponding to the same sample features according to the sequence from small to large of the feature values corresponding to each regression tree; determining the intersection of the numerical value interval represented by the left leaf node of the first regression tree after sequencing, the numerical value interval represented by the right leaf node of the last regression tree after sequencing and the numerical value interval represented by the adjacent two leaf nodes of different regression trees as a target numerical value interval;
Taking each target value interval as a regression tree sub-box, and creating a grading card comprising each regression tree sub-box;
The step of training and obtaining one or more regression trees corresponding to each sample feature based on each feature value of the sample feature comprises the following steps:
For each sample feature of a sample user, taking the data of the sample feature as feature values, and for each feature value of the sample feature, determining a regression tree taking the feature value as a demarcation point based on a gradient lifting algorithm; each leaf node of the regression tree corresponds to a prediction score, representing: the score obtained when the data of the sample feature falls within the numerical interval represented by the leaf node;
Respectively determining gain functions of the regression trees taking each characteristic value as a demarcation point;
selecting a regression tree with the maximum gain function from the regression trees as a currently determined regression tree;
obtaining the sum of the predictive scores of the data of each sample characteristic of the sample user in the currently determined regression tree as an output score;
determining a loss function of a current gradient lifting tree model to be trained based on the labels of the sample users and the output scores; the current gradient lifting tree model to be trained comprises the following steps: one or more regression trees currently determined;
judging whether the loss function converges or not;
if yes, fixing parameters of a current gradient lifting tree model to be trained to obtain a target gradient lifting tree model;
if not, re-determining a regression tree taking the characteristic value as a demarcation point based on a gradient lifting algorithm aiming at each characteristic value of each sample characteristic of the sample user, and returning to the step of respectively determining gain functions of each regression tree taking each characteristic value as a demarcation point;
Extracting parameters of each regression tree of the target gradient lifting tree model, and combining a plurality of regression trees representing the same characteristic value of the same sample characteristic data to obtain one or more regression trees corresponding to each sample characteristic.
2. The method of claim 1, wherein the step of creating a scoring card comprising each regression tree bin with each target value interval as one regression tree bin comprises:
obtaining the score corresponding to each target numerical value interval, wherein the score corresponding to each target numerical value interval is as follows: the sum of the prediction scores corresponding to the leaf nodes with intersections of the numerical intervals and the target numerical interval;
taking each target value interval as a regression tree sub-box, taking the score corresponding to the target value interval as the score of the regression tree sub-box, and creating a scoring card comprising each regression tree sub-box and the score corresponding to each regression tree sub-box; the scoring card comprises scoring corresponding to each regression tree score box and preset basic scores.
3. The method of claim 2, wherein for each target value interval, the score corresponding to the target value interval is determined using the following formula:
Score = -B{f1 + f2 + … + fK}
Wherein Score represents the score corresponding to the target value interval, B is a preset constant parameter, and f1, f2, …, fK are the prediction scores corresponding to the K leaf nodes whose numerical intervals intersect the target value interval.
4. The method of claim 1, wherein the step of creating a scoring card comprising each regression tree bin with each target value interval as one regression tree bin comprises:
And taking each target numerical value interval as a characteristic sub-box, determining the score corresponding to each characteristic sub-box by adopting a logistic regression model, and creating a score card according to each characteristic sub-box and the score corresponding to each characteristic sub-box.
5. The method of claim 1, wherein the acquiring data for a plurality of sample characteristics for a plurality of sample users comprises:
Acquiring data of a plurality of characteristics of a sample user;
For each feature, detecting a type of the feature; if the characteristic is a numerical characteristic, taking the characteristic as a characteristic to be screened; if the feature is a category type feature, assigning the feature according to a preset assignment rule, and taking the assigned data of the feature as a feature to be screened;
Inputting a plurality of features to be screened into a gradient lifting model to be trained, and extracting importance degrees corresponding to the features to be screened; each feature has a label corresponding to it, which is used to characterize whether the feature is important;
Aiming at each feature to be screened, when the importance of the feature to be screened is smaller than or equal to a preset importance threshold, the feature to be screened is taken as a feature to be deleted;
Judging whether the number of the features to be deleted is zero or not; if yes, determining each feature to be screened as a sample feature;
if not, judging whether the number of the features to be deleted is smaller than the preset number, if so, deleting all the features to be deleted, and returning to the step of inputting a plurality of features to be screened into a gradient lifting model to be trained, and extracting the importance degree corresponding to each feature to be screened;
If not, deleting the preset number of the features to be deleted with low importance among the features to be deleted, taking the remaining features to be deleted as the features to be screened, and returning to the step of inputting the multiple features to be screened into the gradient lifting model to be trained and extracting the importance corresponding to each feature to be screened.
6. The method of claim 5, wherein each of the category-type features is used to represent an attribute of the sample user, the category-type features comprising: the gender of the sample user, the academy of the sample user, the region to which the sample user belongs and the industry to which the sample user belongs;
if the feature is a category type feature, assigning the feature according to a preset assignment rule, including:
For each category feature, the number of sample users pre-marked as high risk among the sample users having the attribute represented by the category feature, divided by the total number of sample users having that attribute, is taken as the value of the category feature.
7. A scoring method, comprising:
acquiring a plurality of feature data of a user to be scored, wherein the feature data comprises: behavior data and attribute data of users to be scored;
for each piece of characteristic data, obtaining the corresponding score of the characteristic data in a pre-established scoring card; wherein the scoring card is created by the method of any one of claims 1-6;
Determining the sum of the score of each feature data and the basic score of the score card as the score of the user to be scored; wherein a higher score indicates a lower risk for the user to be scored.
8. A score card creation device, characterized by comprising:
A data acquisition module, configured to acquire data of a plurality of sample features of a plurality of sample users, where the data of the plurality of sample features of each sample user includes: behavior data and attribute data of the sample user; each sample user is correspondingly provided with a label, and the label is used for representing whether the sample user is a high risk user or not;
The regression tree training module is used for training and obtaining one or more regression trees corresponding to each sample feature based on each feature value of the sample feature; each regression tree includes two leaf nodes, representing: two numerical intervals of the sample features divided by the feature values corresponding to the regression tree;
The interval determining module is used for sequencing the regression trees corresponding to the same sample features according to the sequence from small to large of the feature values corresponding to the regression trees; determining the intersection of the numerical value interval represented by the left leaf node of the first regression tree after sequencing, the numerical value interval represented by the right leaf node of the last regression tree after sequencing and the numerical value interval represented by the adjacent two leaf nodes of different regression trees as a target numerical value interval;
the scoring card creating module is used for taking each target numerical value interval as a regression tree box to create scoring cards comprising the regression tree boxes;
the regression tree training module is specifically configured to:
For each sample feature of a sample user, taking the data of the sample feature as feature values, and for each feature value of the sample feature, determining a regression tree taking the feature value as a demarcation point based on a gradient lifting algorithm; each leaf node of the regression tree corresponds to a prediction score, representing: the score obtained when the data of the sample feature falls within the numerical interval represented by the leaf node;
Respectively determining gain functions of the regression trees taking each characteristic value as a demarcation point;
selecting a regression tree with the maximum gain function from the regression trees as a currently determined regression tree;
obtaining the sum of the predictive scores of the data of each sample characteristic of the sample user in the currently determined regression tree as an output score;
determining a loss function of a current gradient lifting tree model to be trained based on the labels of the sample users and the output scores; the current gradient lifting tree model to be trained comprises the following steps: one or more regression trees currently determined;
judging whether the loss function converges or not;
if yes, fixing parameters of a current gradient lifting tree model to be trained to obtain a target gradient lifting tree model;
if not, re-determining a regression tree taking the characteristic value as a demarcation point based on a gradient lifting algorithm aiming at each characteristic value of each sample characteristic of the sample user, and returning to the step of respectively determining gain functions of each regression tree taking each characteristic value as a demarcation point;
Extracting parameters of each regression tree of the target gradient lifting tree model, and combining a plurality of regression trees representing the same characteristic value of the same sample characteristic data to obtain one or more regression trees corresponding to each sample characteristic.
9. The electronic equipment is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
A processor for implementing the method steps of any one of claims 1-6 or claim 7 when executing a program stored on a memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011049938.4A CN112232944B (en) | 2020-09-29 | 2020-09-29 | Method and device for creating scoring card and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011049938.4A CN112232944B (en) | 2020-09-29 | 2020-09-29 | Method and device for creating scoring card and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112232944A CN112232944A (en) | 2021-01-15 |
CN112232944B true CN112232944B (en) | 2024-05-31 |
Family
ID=74120552
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011049938.4A Active CN112232944B (en) | 2020-09-29 | 2020-09-29 | Method and device for creating scoring card and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112232944B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113570259A (en) * | 2021-07-30 | 2021-10-29 | 北京房江湖科技有限公司 | Data evaluation method and computer program product based on dimension model |
CN114418155A (en) * | 2022-01-20 | 2022-04-29 | 深圳壹账通科技服务有限公司 | Processing method, device, equipment and medium for rating card training |
CN115329909A (en) * | 2022-10-17 | 2022-11-11 | 上海冰鉴信息科技有限公司 | User portrait generation method and device and computer equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108366045A (en) * | 2018-01-02 | 2018-08-03 | 北京奇艺世纪科技有限公司 | A kind of setting method and device of air control scorecard |
CN109598095A (en) * | 2019-01-07 | 2019-04-09 | 平安科技(深圳)有限公司 | Method for building up, device, computer equipment and the storage medium of scorecard model |
CN109636591A (en) * | 2018-12-28 | 2019-04-16 | 浙江工业大学 | A kind of credit scoring card development approach based on machine learning |
CN110533519A (en) * | 2019-05-16 | 2019-12-03 | 杭州排列科技有限公司 | Feature branch mailbox algorithm based on decision tree |
CN110648215A (en) * | 2019-08-15 | 2020-01-03 | 上海新颜人工智能科技有限公司 | Distributed scoring card model building method |
CN110879821A (en) * | 2019-11-11 | 2020-03-13 | 彩讯科技股份有限公司 | Method, device, equipment and storage medium for generating rating card model derivative label |
CN111311128A (en) * | 2020-03-30 | 2020-06-19 | 百维金科(上海)信息科技有限公司 | Consumption financial credit scoring card development method based on third-party data |
CN111311400A (en) * | 2020-03-30 | 2020-06-19 | 百维金科(上海)信息科技有限公司 | Modeling method and system of grading card model based on GBDT algorithm |
CN111563810A (en) * | 2020-04-28 | 2020-08-21 | 北京云从科技有限公司 | Credit wind control model generation method, credit evaluation system, machine-readable medium and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7036043B2 (en) * | 2001-12-28 | 2006-04-25 | Storage Technology Corporation | Data management with virtual recovery mapping and backward moves |
-
2020
- 2020-09-29 CN CN202011049938.4A patent/CN112232944B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108366045A (en) * | 2018-01-02 | 2018-08-03 | 北京奇艺世纪科技有限公司 | A kind of setting method and device of air control scorecard |
CN109636591A (en) * | 2018-12-28 | 2019-04-16 | 浙江工业大学 | A kind of credit scoring card development approach based on machine learning |
CN109598095A (en) * | 2019-01-07 | 2019-04-09 | 平安科技(深圳)有限公司 | Method for building up, device, computer equipment and the storage medium of scorecard model |
CN110533519A (en) * | 2019-05-16 | 2019-12-03 | 杭州排列科技有限公司 | Feature branch mailbox algorithm based on decision tree |
CN110648215A (en) * | 2019-08-15 | 2020-01-03 | 上海新颜人工智能科技有限公司 | Distributed scoring card model building method |
CN110879821A (en) * | 2019-11-11 | 2020-03-13 | 彩讯科技股份有限公司 | Method, device, equipment and storage medium for generating rating card model derivative label |
CN111311128A (en) * | 2020-03-30 | 2020-06-19 | 百维金科(上海)信息科技有限公司 | Consumption financial credit scoring card development method based on third-party data |
CN111311400A (en) * | 2020-03-30 | 2020-06-19 | 百维金科(上海)信息科技有限公司 | Modeling method and system of grading card model based on GBDT algorithm |
CN111563810A (en) * | 2020-04-28 | 2020-08-21 | 北京云从科技有限公司 | Credit wind control model generation method, credit evaluation system, machine-readable medium and device |
Non-Patent Citations (2)
Title |
---|
Financial risk identification models based on classical scorecards and machine learning and their applications; 白婧怡; China Master's Theses Full-text Database, Economics and Management Sciences series (No. 9); see pages 13-15, 24-31, and 42 of the text *
Research and application of decision-support technologies in multidimensional data environments; 于海鸿; China Doctoral Dissertations Full-text Database, Information Science and Technology series (No. 4); see pages 44-61 of the text *
Also Published As
Publication number | Publication date |
---|---|
CN112232944A (en) | 2021-01-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108564286B (en) | Artificial intelligent financial wind-control credit assessment method and system based on big data credit investigation | |
CN112232944B (en) | Method and device for creating scoring card and electronic equipment | |
CN110400215B (en) | Method and system for constructing enterprise family-oriented small micro enterprise credit assessment model | |
CN111523678B (en) | Service processing method, device, equipment and storage medium | |
CN108475393A (en) | The system and method that decision tree is predicted are promoted by composite character and gradient | |
CN105718490A (en) | Method and device for updating classifying model | |
CN107203774A (en) | The method and device that the belonging kinds of data are predicted | |
CN107689008A (en) | A kind of user insures the method and device of behavior prediction | |
CN107230108A (en) | The processing method and processing device of business datum | |
US20200090058A1 (en) | Model variable candidate generation device and method | |
CN110930218A (en) | Method and device for identifying fraudulent customer and electronic equipment | |
CN110782349A (en) | Model training method and system | |
CN116401379A (en) | Financial product data pushing method, device, equipment and storage medium | |
CN116821759A (en) | Identification prediction method and device for category labels, processor and electronic equipment | |
CN113988878B (en) | Graph database technology-based anti-fraud method and system | |
US20240152818A1 (en) | Methods for mitigation of algorithmic bias discrimination, proxy discrimination and disparate impact | |
CN117252677A (en) | Credit line determination method and device, electronic equipment and storage medium | |
CN114170000A (en) | Credit card user risk category identification method, device, computer equipment and medium | |
CN114049202A (en) | Operation risk identification method and device, storage medium and electronic equipment | |
Thakur et al. | An allotment of H1B work VISA in USA using machine learning | |
KR102576143B1 (en) | Method for performing continual learning on credit scoring without reject inference and recording medium recording computer readable program for executing the method | |
Artzi | Predictive Analytics Techniques: Theory and Applications in Finance | |
CN118333737A (en) | Method for constructing retail credit risk prediction model and consumer credit business Scorebetai model | |
Mattanelli | Binning numerical variables in credit risk models | |
CN118071483A (en) | Method for constructing retail credit risk prediction model and personal credit business Scorepsi model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |