CN111583031A - Application scoring card model building method based on ensemble learning - Google Patents

Application scoring card model building method based on ensemble learning

Info

Publication number
CN111583031A
Authority
CN
China
Prior art keywords
data
value
data source
features
scoring card
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010414727.XA
Other languages
Chinese (zh)
Inventor
郑志骏
韩德志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202010414727.XA
Publication of CN111583031A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 Enterprise or organisation modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Technology Law (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for building an application scoring card model based on ensemble learning. For data from different sources, the method performs data preprocessing and feature engineering and builds a deep neural network submodel per data source to predict the default probability given by that source; the submodels are then fused through a logistic regression model to obtain an overall default probability, which is converted into a credit score. Compared with traditional application scoring card models based on a single DNN or on logistic regression alone, the method balances stability, accuracy and interpretability, and greatly improves overall performance.

Description

Application scoring card model building method based on ensemble learning
Technical Field
The invention relates to the field of credit risk control, and in particular to a method for building an application scoring card model based on ensemble learning.
Background
The application scoring card model is an important model in the field of credit risk control: using a credit scoring model over the historical credit data of a loan applicant, it maps the probability of overdue payment or default to different credit grades. With the development of big data and artificial intelligence, the advantages of the shift from business-driven expert judgment to data-driven machine learning models over traditional business-driven risk control systems are very obvious: first, with the support of big data, the bad-debt rate is far lower than under manual judgment; second, the approach no longer depends on a few experts, which makes it easy to scale and standardize.
Most current machine-learning application scoring card models are developed from a logistic regression model with L1-norm regularization, so that the learned weights are sparse and features can conveniently be extracted or screened by their information value (IV). However, with the rise of internet finance, internet data are highly sparse and single variables discriminate risk only weakly, so the accuracy of this approach is low.
Deep neural networks are often used for classification in place of traditional machine learning algorithms because of their extremely high classification accuracy, usually above 90%. However, a deep neural network is a black box without interpretability, and its stability is limited; a pure deep neural network risk model is therefore unsuitable for a highly sensitive business such as financial risk control.
Disclosure of Invention
The invention provides a method for building an application scoring card model based on ensemble learning, which combines the strengths of ensemble learning and deep learning and improves the overall performance of the application scoring card model.
To this end, the invention provides a method for building an application scoring card model based on ensemble learning, comprising the following steps:
S1, respectively performing data preprocessing on the data of each data source, and performing feature engineering on the preprocessed data to obtain the data features of each data source;
S2, respectively constructing a gradient boosting decision tree model for each data source, thereby screening important features from the data features of each data source;
S3, respectively constructing a deep neural network submodel from the important features of each data source, thereby predicting the default probability given by each data source;
and S4, fusing the deep neural network submodels by constructing a logistic regression model to obtain a credit score.
Preferably, the step S1 includes the following steps:
S1.1, respectively performing missing-value preprocessing on the data of each data source according to the type of each missing value;
S1.2, performing SMOTE oversampling preprocessing on the positive-case (default) data in each data source;
and S1.3, respectively performing feature engineering on the preprocessed data of each data source.
Preferably, the missing-value preprocessing method comprises:
when the missing value is a continuous, completely random missing value, replacing it with the arithmetic mean of 5-10 of its neighboring values;
when the missing value is a discrete, completely random missing value, replacing it with a randomly chosen state;
and when the missing value is missing at random or missing not at random, replacing it with the new state value -1.
Preferably, the SMOTE oversampling preprocessing comprises:
for each sample x_i in the positive-case data, finding the k nearest neighbors of x_i by Euclidean distance, denoted x_i(near), near ∈ {1, ..., k};
then randomly selecting n of the k neighbors, denoted x_i(nn), nn ∈ {1, ..., n}, n < k, and linearly interpolating between each of the n neighbors and the original sample x_i, thereby synthesizing 2n new samples.
Preferably, the feature engineering comprises the following steps:
when the data feature is discrete, encoding it with the bad-sample rate;
and when the data feature is continuous, binning it with the chi-square binning method, and applying weight-of-evidence (WOE) encoding to the binned feature.
Preferably, the chi-square binning method comprises the following steps:
a. setting a chi-square threshold according to the required number of bins and the required confidence level;
b. sorting the values of the continuous feature to be binned in descending order, each value initially forming its own interval;
c. computing the chi-square value χ² of adjacent intervals;
d. merging the two adjacent intervals with the smallest chi-square value;
e. repeating steps c and d until the number of bins is at most 5 and the chi-square value is greater than the chi-square threshold.
Preferably, the chi-squared value calculation formula is:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{2} (A_ij - E_ij)² / E_ij,   where E_ij = N_i · C_j
wherein A_ij is the number of samples of the j-th class in the i-th interval, E_ij is the expected frequency of A_ij, N_i is the number of samples in the i-th interval, and C_j is the proportion of class-j samples in the whole sample.
Preferably, the calculation formula of the evidence weight code is as follows:
WOE_i = ln( (B_i / B) / (G_i / G) )
wherein B_i is the number of bad samples in bin i, B is the total number of bad samples, G_i is the number of good samples in bin i, and G is the total number of good samples.
Preferably, the step S2 includes the following steps:
S2.1, respectively constructing a gradient boosting decision tree model for each data source, the model containing N decision trees in total;
the decision function of the gradient boosting decision tree model is:
F_N(x) = Σ_{m=1}^{N} T(x; Θ_m)
where T(x; Θ_m) is the weak classifier generated in the m-th iteration;
s2.2, processing the N decision trees by adopting a CART decision tree method, and respectively calculating the importance score of each data feature of each data source;
the importance scores of the features are:
VIM_j^(norm) = VIM_j / Σ_i VIM_i
where VIM_j is the sum of the importances of data feature j over the N trees, and Σ_i VIM_i is the sum of the importances of all data features over the N trees;
and S2.3, respectively screening the features with the maximum importance score values as the important features of each data source.
Preferably, in the deep neural network submodel of step S3, the input layer has as many nodes as the number of important-feature dimensions (20), there are 2 hidden layers of sizes 14 and 10, the output layer has 2 nodes, the output function is the Softmax function, the loss function is the cross-entropy loss function, and the activation function is the ReLU function; the node weights are updated by the Adam iterative optimizer and the backpropagation algorithm so that the loss function reaches its minimum.
Preferably, the step S4 includes the following steps:
S4.1, constructing a logistic regression model from the prediction results of the deep neural network submodels, so that the submodels are fused and the overall default probability is predicted;
the overall default probability is as follows:
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where g(z) = 1 / (1 + e^(-z)) is the sigmoid function, θ is the factor influencing the target value, and x is the independent variable;
and S4.2, converting the overall default probability into a credit score.
The invention has the following advantages:
according to the method, a chi-square binning method is adopted to carry out feature engineering on data of each data source, a gradient boosting decision tree model is adopted to screen important features of each data source, so that the screened features have strong discrimination, meanwhile, a sub-model of each data source is established based on a deep neural network with high classification accuracy, and finally, the sub-models of each deep neural network are fused through a logistic regression model with strong stability, so that the stability and interpretability of an application scoring card model are ensured. The application scoring card model has expandability, and a method of fusing a plurality of deep nerve submodels is adopted, so that if third-party data cannot be accessed or damaged, only one submodel is influenced, and the integral application scoring card model cannot be greatly influenced.
Drawings
FIG. 1 is a flowchart of an application scoring card model building method based on ensemble learning according to the present invention.
Fig. 2 is a process for preprocessing data of a data source according to an embodiment of the present invention.
Fig. 3 is a result of screening data features by the gradient boosting decision tree model according to the embodiment of the present invention.
Fig. 4 is a training result of the deep neural network submodel provided in the embodiment of the present invention.
Detailed Description
The method for building an application scoring card model based on ensemble learning according to the present invention will be described in detail with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will become apparent from the following description and the claims. It should be noted that the drawings are in greatly simplified form and not to precise scale, and serve only to conveniently and clearly aid the description of the embodiments of the invention.
As shown in fig. 1, the invention provides a method for establishing an application scoring card model based on ensemble learning, which comprises the following steps:
S1, respectively performing data preprocessing on the data of each data source, and performing feature engineering on the preprocessed data to obtain the data features of each data source;
Specifically, the data sources include, for example, the institution's own credit data, central-bank credit-reference data, and third-party credit data; data preprocessing and feature engineering are performed on each source separately to obtain the data features of each source.
The step S1 includes the following steps:
S1.1, respectively performing missing-value preprocessing on the data of each data source according to the type of each missing value;
Specifically, the type of a missing value is determined from the business meaning of the field. For example, a missing "salary income" field may be due to a subjective personal reason and is therefore missing not at random, whereas a field such as a personal ID may contain garbled characters caused by transient system fluctuations, in which case the value is missing completely at random. A continuous, completely random missing value is replaced with the arithmetic mean of 5-10 of its neighboring values; a discrete, completely random missing value is replaced with a randomly chosen state; missing-at-random and missing-not-at-random values are both replaced with the new state value -1.
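As a minimal illustration (not part of the patent text), the three replacement rules above could be sketched in Python with pandas; the column handling and the neighbor window of 5 values on each side are assumptions:

```python
import numpy as np
import pandas as pd

def fill_continuous_mcar(s: pd.Series, window: int = 5) -> pd.Series:
    """Replace continuous completely-random missing values with the
    arithmetic mean of 5-10 neighboring values (here: `window` rows
    on each side of the gap)."""
    s = s.copy()
    for idx in s[s.isna()].index:
        pos = s.index.get_loc(idx)
        neighbors = pd.concat([s.iloc[max(0, pos - window):pos],
                               s.iloc[pos + 1:pos + 1 + window]]).dropna()
        s.loc[idx] = neighbors.mean()
    return s

def fill_discrete_mcar(s: pd.Series, seed: int = 0) -> pd.Series:
    """Replace discrete completely-random missing values with a random
    state drawn from the observed categories."""
    rng = np.random.default_rng(seed)
    observed = s.dropna().unique()
    return s.map(lambda v: v if pd.notna(v) else rng.choice(observed))

def fill_mar_or_mnar(s: pd.Series) -> pd.Series:
    """Replace missing-at-random / missing-not-at-random values with
    the new state value -1."""
    return s.fillna(-1)
```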
S1.2, performing SMOTE oversampling preprocessing on the positive-case data in each data source;
Specifically, SMOTE is an oversampling method that synthesizes additional samples of the minority class according to a rule, so that the data tend toward balance. By oversampling the scarcer positive-case data (i.e., records with default behavior) in each data source, the class imbalance within the data source can be mitigated.
The SMOTE oversampling preprocessing method comprises the following steps:
for each sample x_i in the minority-class data, find the k nearest neighbors of x_i by Euclidean distance, denoted x_i(near), near ∈ {1, ..., k};
then randomly select n of the k neighbors, denoted x_i(nn), nn ∈ {1, ..., n}, n < k, and linearly interpolate between each of the n neighbors and the original sample x_i, thereby synthesizing 2n new samples.
S1.3, respectively performing feature engineering on the preprocessed data of each data source.
The feature engineering comprises the following steps:
when a data feature is discrete, encode it with the bad-sample rate, i.e., convert each category of the feature into its corresponding bad-sample rate;
when a data feature is continuous, bin it with the chi-square binning method, and apply weight-of-evidence (WOE) encoding to the binned feature.
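For the discrete case, a sketch of bad-sample-rate encoding; the label column `is_bad` (1 = default) and the sample data are hypothetical:

```python
import pandas as pd

def bad_rate_encode(df: pd.DataFrame, feature: str,
                    label: str = "is_bad") -> pd.Series:
    """Map each category of `feature` to the fraction of bad samples
    (label == 1) observed within that category."""
    bad_rate = df.groupby(feature)[label].mean()
    return df[feature].map(bad_rate)

# Hypothetical usage:
df = pd.DataFrame({"occupation": ["a", "a", "b", "b", "b"],
                   "is_bad":     [1,   0,   0,   0,   1]})
df["occupation_enc"] = bad_rate_encode(df, "occupation")  # a -> 0.5, b -> 1/3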
Specifically, the card box separating method comprises the following steps:
a. setting a chi-square threshold according to the degree of freedom (the required number of bins) and the required confidence level;
specifically, the required confidence level needs to be set by itself, and banks typically require a 90% or 95% confidence.
b. The continuous features needing to be subjected to binning are arranged in a descending order according to the value size, and each value belongs to one interval;
specifically, the interval set in step b is only a tentative binning, and the total binning result is formed by continuously merging the intervals.
c. Calculating chi-square value X of adjacent interval2
The chi-square value calculation formula is as follows:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{2} (A_ij - E_ij)² / E_ij,   where E_ij = N_i · C_j
wherein A_ij is the number of samples of the j-th class in the i-th interval, E_ij is the expected frequency of A_ij, N_i is the number of samples in the i-th interval, and C_j is the proportion of class-j samples in the whole sample.
d. merge the two adjacent intervals with the smallest chi-square value;
e. repeat steps c and d until the number of bins is at most 5 and the chi-square value is greater than the chi-square threshold.
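A compact sketch of steps a-e under stated assumptions: binary good/bad labels, intervals kept as [good, bad] count pairs, and 3.841 as the 95%-confidence chi-square threshold at one degree of freedom. It follows the χ² formula given above; intervals are ordered ascending here, which does not change the merge logic:

```python
import numpy as np

def chi2_adjacent(a: np.ndarray, b: np.ndarray) -> float:
    """Chi-square statistic of two adjacent intervals, each a length-2
    array of class counts [good, bad] (the A_ij of the formula above)."""
    obs = np.vstack([a, b]).astype(float)     # A_ij
    n_i = obs.sum(axis=1, keepdims=True)      # N_i: samples per interval
    c_j = obs.sum(axis=0) / obs.sum()         # C_j: overall class proportions
    exp = n_i * c_j                           # E_ij = N_i * C_j
    exp[exp == 0] = 1e-9                      # avoid division by zero
    return float(((obs - exp) ** 2 / exp).sum())

def chimerge(values: np.ndarray, labels: np.ndarray,
             max_bins: int = 5, chi2_threshold: float = 3.841) -> list:
    """Merge adjacent intervals until at most `max_bins` remain and every
    adjacent pair exceeds the chi-square threshold (steps c-e)."""
    uniq = np.unique(values)                  # one tentative interval per value
    counts = [np.array([np.sum((values == v) & (labels == 0)),
                        np.sum((values == v) & (labels == 1))]) for v in uniq]
    cuts = list(uniq)                         # lower bound of each interval
    while len(counts) > 1:
        chis = [chi2_adjacent(counts[i], counts[i + 1])
                for i in range(len(counts) - 1)]
        i = int(np.argmin(chis))
        if len(counts) <= max_bins and chis[i] > chi2_threshold:
            break                             # stopping rule of step e
        counts[i] = counts[i] + counts.pop(i + 1)
        cuts.pop(i + 1)
    return cuts
```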
Specifically, the formula for calculating the evidence weight is as follows:
Figure BDA0002494535450000071
Bithe number of bad samples corresponding to the feature i, B is the total number of bad samples, GiAnd G is the total number of good samples corresponding to the characteristic i.
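A sketch of WOE encoding over already-binned data; the `bin` column and label name are assumptions, and a small constant guards against empty bins:

```python
import numpy as np
import pandas as pd

def woe_encode(df: pd.DataFrame, bin_col: str = "bin",
               label: str = "is_bad") -> pd.Series:
    """WOE_i = ln((B_i / B) / (G_i / G)) per bin i, per the formula above."""
    eps = 1e-9                                       # guard against empty classes
    bad = df.groupby(bin_col)[label].sum()           # B_i per bin
    good = df.groupby(bin_col)[label].count() - bad  # G_i per bin
    woe = np.log(((bad + eps) / bad.sum()) / ((good + eps) / good.sum()))
    return df[bin_col].map(woe)
```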
S2, respectively constructing a gradient lifting decision tree model for each data source, thereby screening important features from the data features of each data source;
specifically, the step S2 includes the following steps:
S2.1, respectively constructing a gradient boosting decision tree model for each data source, the model containing N decision trees in total;
Specifically, the gradient boosting decision tree (GBDT) model is a Boosting ensemble model: it sequentially learns a series of homogeneous weak learners in a highly adaptive way, i.e., each base model depends on the results of the previous models, and it combines them according to a deterministic strategy. Its decision function F_N(x) can be expressed as:
F_N(x) = Σ_{m=1}^{N} T(x; Θ_m)
where T(x; Θ_m) is the weak classifier generated in the m-th iteration;
s2.2, processing the N decision trees by adopting a CART decision tree method, and respectively calculating the importance score of each data feature of each data source;
Specifically, the N decision trees are processed by the CART decision tree method: first, compute the Gini index of the data of the data source; then choose a split, i.e., select the data feature (and feature value) that yields the minimum Gini index after splitting, and divide the data accordingly to construct branches; remove the used feature, and repeat the above steps within each branch until all data in every branch belong to the same class or all features have been used. The Gini index measures the purity of the data: the smaller the Gini index, the higher the purity and the lower the uncertainty. Suppose the K samples of a discrete feature can be divided into n classes; then the Gini index is:
Gini(p) = Σ_{m=1}^{n} p_m (1 - p_m) = 1 - Σ_{m=1}^{n} p_m²
where p_m is the probability that a sample point among the K samples belongs to the m-th class.
The importance of a data feature x_j at a node m, i.e., the change of the Gini index before and after the split at node m, is:
VIM_j = GI_m - GI_l - GI_r
where GI_l and GI_r are the Gini indexes of the two new nodes after the split, and GI_m is the Gini index before the split.
The gradient boosting decision tree model contains N decision trees in total; normalizing the importance of a feature x_j over the N trees gives the importance score of the data feature:
VIM_j^(norm) = VIM_j / Σ_i VIM_i
where VIM_j is the sum of the importances of data feature j over the N trees, and Σ_i VIM_i is the sum of the importances of all data features over the N trees;
and S2.3, respectively screening the features with the maximum importance score values as the important features of each data source.
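This screening can be reproduced, for instance, with scikit-learn's gradient boosting classifier, whose `feature_importances_` attribute returns exactly such normalized importance scores; `top_k = 20` matches the feature dimension used by the submodels below, and the feature matrix and labels are hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def select_important_features(X: np.ndarray, y: np.ndarray,
                              n_trees: int = 100, top_k: int = 20) -> np.ndarray:
    """Fit a GBDT on one data source and return the indices of the top_k
    features ranked by normalized importance score."""
    gbdt = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    gbdt.fit(X, y)
    return np.argsort(gbdt.feature_importances_)[::-1][:top_k]

# Hypothetical usage for one data source:
# important_idx = select_important_features(X_source, y_source)
# X_important = X_source[:, important_idx]
```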
S3, respectively constructing a deep neural network submodel according to the important data characteristics of each data source, thereby predicting default probability given by each data source;
Specifically, in the deep neural network submodel, the input layer has as many nodes as the number of important-feature dimensions (20), there are 2 hidden layers of sizes 14 and 10, the output layer has 2 nodes, the output function is the normalized exponential (Softmax) function, the loss function is the cross-entropy loss function, and the activation function is the rectified linear unit (ReLU); the node weights are updated by the adaptive moment estimation (Adam) iterative optimizer and the backpropagation algorithm so that the loss function reaches its minimum. During training of a submodel, training stops as soon as the loss function falls below a set threshold or the decrease of the loss over several consecutive rounds stays below a set value.
The Softmax function is as follows: in multi-class models such as multinomial logistic regression and linear discriminant analysis, the input to the Softmax function is the result of M different linear functions, and the probability that a sample vector x belongs to the j-th class is:
P(y = j | x) = e^(x^T w_j) / Σ_{k=1}^{M} e^(x^T w_k)
where w_j is the weight vector of the j-th linear function, each element of which weighs the corresponding element of the sample vector x.
The cross entropy loss function is defined as follows:
H(p, q) = -Σ_x p(x) log q(x)
where p (x) is the probability of the true distribution and q (x) is the probability estimate calculated by the model from the data.
The ReLU function is:
f(x) = max(0, x)
The ReLU function effectively avoids the vanishing-gradient problem.
The Adam optimizer is an iterative optimizer that computes the update step size by jointly considering first-moment and second-moment estimates of the gradient.
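A sketch of one submodel under the stated architecture (20-14-10-2, ReLU hidden layers, softmax output, cross-entropy loss, Adam, stop on loss plateau), written here with Keras as one possible realization; the framework choice and the stopping parameters are assumptions:

```python
import tensorflow as tf

def build_submodel(n_features: int = 20) -> tf.keras.Model:
    """Input of 20 important features, hidden layers of 14 and 10 nodes
    with ReLU, 2-node softmax output, cross-entropy loss, Adam optimizer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(14, activation="relu"),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Stop training once the per-epoch loss decrease stays below a set value,
# mirroring the stopping rule described above (threshold values assumed):
early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss",
                                              min_delta=1e-4, patience=5)
# model = build_submodel()
# model.fit(X_important, y, epochs=100, callbacks=[early_stop])
```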
And S4, fusing the deep neural network submodels by constructing a logistic regression model to obtain a credit score.
Specifically, a logistic regression model is constructed according to the prediction result of each deep neural network submodel, so that each deep neural network submodel is fused to obtain the overall default probability, and the overall default probability is converted into a credit score. The overall default probability is as follows:
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where g(z) = 1 / (1 + e^(-z)) is the sigmoid function, θ is the vector of factors influencing the target value, and x is the vector of independent variables.
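A sketch of the fusion step, in which each submodel's predicted default probability becomes one input variable of the logistic regression. The probability-to-score conversion of step S4.2 is not specified in the text, so the sketch uses the common points-to-double-odds scaling as an assumption (base score, base odds and PDO values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_submodels(sub_probs: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """sub_probs has shape (n_samples, n_submodels); column j holds the
    default probability predicted by data source j's DNN submodel."""
    lr = LogisticRegression()
    lr.fit(sub_probs, y)          # learns theta over the submodel outputs
    return lr

def prob_to_score(p: np.ndarray, base_score: float = 600.0,
                  base_odds: float = 1 / 50, pdo: float = 20.0) -> np.ndarray:
    """Assumed scaling: `base_score` points at bad:good odds `base_odds`,
    losing `pdo` points every time the odds double."""
    factor = pdo / np.log(2)
    offset = base_score + factor * np.log(base_odds)
    odds = p / (1 - p)            # bad:good odds from the default probability
    return offset - factor * np.log(odds)

# overall_p = fuse_submodels(sub_probs, y).predict_proba(sub_probs)[:, 1]
# credit_score = prob_to_score(overall_p)
```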
The following is an embodiment provided by the invention:
Scripts are executed in the server-side Spark cluster (or SQL statement queries are supported through a visualization window) to perform data preprocessing on the data of the different data sources; the preprocessing procedure is shown in fig. 2.
A script is executed on the server to perform feature engineering on the preprocessed data: continuous data features are binned with the chi-square binning method, and WOE encoding is applied to the binned features.
A script is executed on the server to select data features with the GBDT model; the selection result is shown in fig. 3. In fig. 3, Best Score is the final fitting score of the GBDT model (the closer to 1, the better the fit), and Importances shows the importance score of each feature (the higher the score, the more important the feature).
A script is executed on the server to train the deep neural network submodel of each data source. The training results are shown in fig. 4: with the Adam optimizer, the loss of each submodel decreases as the training batches progress and finally stabilizes.
Finally, a script is executed on the server to fuse the submodels and give the final prediction result.
In this method, a chi-square binning method is used for feature engineering on the data of each data source, and a gradient boosting decision tree model screens the important features of each data source, so that the screened features are highly discriminative; a submodel is built for each data source on a deep neural network with high classification accuracy; finally, the deep neural network submodels are fused through a highly stable logistic regression model, which ensures the stability and interpretability of the application scoring card model. The model is also extensible: because several deep neural network submodels are fused, if a third-party data source becomes inaccessible or corrupted, only one submodel is affected and the overall application scoring card model is not greatly disturbed.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A method for establishing an application scoring card model based on ensemble learning is characterized by comprising the following steps:
S1, respectively performing data preprocessing on the data of each data source, and performing feature engineering on the preprocessed data to obtain the data features of each data source;
S2, respectively constructing a gradient boosting decision tree model for each data source, thereby screening important features from the data features of each data source;
S3, respectively constructing a deep neural network submodel from the important features of each data source, thereby predicting the default probability given by each data source;
and S4, fusing the deep neural network submodels by constructing a logistic regression model to obtain a credit score.
2. The method for building a scoring card model based on ensemble learning of claim 1, wherein said step S1 comprises the steps of:
S1.1, respectively performing missing-value preprocessing on the data of each data source according to the type of each missing value;
S1.2, performing SMOTE oversampling preprocessing on the positive-case data in each data source;
and S1.3, respectively performing feature engineering on the preprocessed data of each data source.
3. The method for building a scoring card model based on ensemble learning of claim 1, wherein said step S2 comprises the steps of:
S2.1, respectively constructing a gradient boosting decision tree model for each data source, the model containing N decision trees in total;
the decision function of the gradient boosting decision tree model is:
F_N(x) = Σ_{m=1}^{N} T(x; Θ_m)
where T(x; Θ_m) is the weak classifier generated in the m-th iteration;
s2.2, processing the N decision trees by adopting a CART decision tree method, and respectively calculating the importance score of each data feature of each data source;
the importance scores of the features are:
VIM_j^(norm) = VIM_j / Σ_i VIM_i
where VIM_j is the sum of the importances of data feature j over the N trees, and Σ_i VIM_i is the sum of the importances of all data features over the N trees;
and S2.3, respectively screening the features with the maximum importance score values as the important features of each data source.
4. The method as claimed in claim 1, wherein in the deep neural network submodel of step S3 the input layer has as many nodes as the number of important-feature dimensions (20), there are 2 hidden layers of sizes 14 and 10, the output layer has 2 nodes, the output function is the Softmax function, the loss function is the cross-entropy loss function, and the activation function is the ReLU function; the node weights are updated by the Adam iterative optimizer and the backpropagation algorithm so that the loss function reaches its minimum.
5. The method for building a scoring card model based on ensemble learning of claim 1, wherein said step S4 comprises the steps of:
s4.1, constructing a logistic regression model according to the prediction result of each deep neural network submodel, so that each deep neural network submodel is fused, and the overall default probability is predicted;
the overall default probability is as follows:
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where g(z) = 1 / (1 + e^(-z)) is the sigmoid function, θ is the factor influencing the target value, and x is the independent variable;
and S4.2, converting the overall default probability into a credit score.
6. The method for building an application scoring card model based on ensemble learning according to claim 2, wherein the missing-value preprocessing method comprises:
when the missing value is a continuous, completely random missing value, replacing it with the arithmetic mean of 5-10 of its neighboring values;
when the missing value is a discrete, completely random missing value, replacing it with a randomly chosen state;
and when the missing value is missing at random or missing not at random, replacing it with the new state value -1.
7. The ensemble learning-based application scoring card model building method as claimed in claim 2, wherein the SMOTE oversampling preprocessing method comprises:
for each sample x_i in the positive-case data, finding the k nearest neighbors of x_i by Euclidean distance, denoted x_i(near), near ∈ {1, ..., k};
then randomly selecting n of the k neighbors, denoted x_i(nn), nn ∈ {1, ..., n}, n < k, and linearly interpolating between each of the n neighbors and the original sample x_i, thereby synthesizing 2n new samples.
8. The method for building a scoring card model based on ensemble learning as claimed in claim 2, wherein the feature engineering comprises the following steps:
when the data feature is discrete, encoding it with the bad-sample rate;
and when the data feature is continuous, binning it with the chi-square binning method, and applying weight-of-evidence (WOE) encoding to the binned feature.
9. The method for building an application scoring card model based on ensemble learning according to claim 8, wherein the chi-square binning method comprises the following steps:
a. setting a chi-square threshold according to the required number of bins and the required confidence level;
b. sorting the values of the continuous feature to be binned in descending order, each value initially forming its own interval;
c. computing the chi-square value χ² of adjacent intervals;
The chi-square value calculation formula is as follows:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{2} (A_ij - E_ij)² / E_ij,   where E_ij = N_i · C_j
wherein A_ij is the number of samples of the j-th class in the i-th interval, E_ij is the expected frequency of A_ij, N_i is the number of samples in the i-th interval, and C_j is the proportion of class-j samples in the whole sample;
d. merging the two adjacent intervals with the smallest chi-square value;
e. repeating steps c and d until the number of bins is at most 5 and the chi-square value is greater than the chi-square threshold.
10. The method for building an application scoring card model based on ensemble learning according to claim 8, wherein the weight-of-evidence encoding is calculated as:
WOE_i = ln( (B_i / B) / (G_i / G) )
wherein B_i is the number of bad samples in bin i, B is the total number of bad samples, G_i is the number of good samples in bin i, and G is the total number of good samples.
CN202010414727.XA 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning Pending CN111583031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010414727.XA CN111583031A (en) 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010414727.XA CN111583031A (en) 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN111583031A (en) 2020-08-25

Family

ID=72112779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010414727.XA Pending CN111583031A (en) 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN111583031A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110650046A (en) * 2019-09-24 2020-01-03 北京明略软件系统有限公司 Network node importance scoring model training and importance detecting method and device
CN111950937A (en) * 2020-09-01 2020-11-17 上海海事大学 Key personnel risk assessment method based on fusion space-time trajectory
CN112102074A (en) * 2020-10-14 2020-12-18 深圳前海弘犀智能科技有限公司 Grading card modeling method
CN112017040B (en) * 2020-10-16 2021-01-29 银联商务股份有限公司 Credit scoring model training method, scoring system, equipment and medium
CN113269351A (en) * 2021-04-28 2021-08-17 贵州电网有限责任公司 Feature selection method for power grid equipment fault probability prediction
CN113313587A (en) * 2021-06-29 2021-08-27 平安资产管理有限责任公司 Credit risk analysis method, device, equipment and medium based on artificial intelligence
CN113538131A (en) * 2021-07-23 2021-10-22 中信银行股份有限公司 Method and device for modeling modular scoring card, storage medium and electronic equipment
CN113554504A (en) * 2021-06-10 2021-10-26 浙江惠瀜网络科技有限公司 Vehicle loan wind control model generation method and device and scoring card generation method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187712A1 (en) * 2002-03-27 2003-10-02 First Data Corporation Decision tree systems and methods
CN107992982A (en) * 2017-12-28 2018-05-04 上海氪信信息技术有限公司 A kind of Default Probability Forecasting Methodology of the unstructured data based on deep learning
CN108109066A (en) * 2017-12-11 2018-06-01 上海前隆信息科技有限公司 A kind of credit scoring model update method and system
CN108475393A (en) * 2016-01-27 2018-08-31 华为技术有限公司 The system and method that decision tree is predicted are promoted by composite character and gradient
CN108765127A (en) * 2018-04-26 2018-11-06 浙江邦盛科技有限公司 A kind of credit scoring card feature selection approach based on monte-carlo search
WO2019061187A1 (en) * 2017-09-28 2019-04-04 深圳乐信软件技术有限公司 Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN110232400A (en) * 2019-04-30 2019-09-13 冶金自动化研究设计院 A kind of gradient promotion decision neural network classification prediction technique
CN110580268A (en) * 2019-08-05 2019-12-17 西北大学 Credit scoring integrated classification system and method based on deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187712A1 (en) * 2002-03-27 2003-10-02 First Data Corporation Decision tree systems and methods
CN108475393A (en) * 2016-01-27 2018-08-31 华为技术有限公司 The system and method that decision tree is predicted are promoted by composite character and gradient
WO2019061187A1 (en) * 2017-09-28 2019-04-04 深圳乐信软件技术有限公司 Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus
CN108109066A (en) * 2017-12-11 2018-06-01 上海前隆信息科技有限公司 A kind of credit scoring model update method and system
CN107992982A (en) * 2017-12-28 2018-05-04 上海氪信信息技术有限公司 A kind of Default Probability Forecasting Methodology of the unstructured data based on deep learning
CN108765127A (en) * 2018-04-26 2018-11-06 浙江邦盛科技有限公司 A kind of credit scoring card feature selection approach based on monte-carlo search
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN110232400A (en) * 2019-04-30 2019-09-13 冶金自动化研究设计院 A kind of gradient promotion decision neural network classification prediction technique
CN110580268A (en) * 2019-08-05 2019-12-17 西北大学 Credit scoring integrated classification system and method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王程龙; 陈程: "基于诀策树的P2P网贷平台信用评级体系研究" ("Research on a credit rating system for P2P online lending platforms based on decision trees") *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110650046A (en) * 2019-09-24 2020-01-03 北京明略软件系统有限公司 Network node importance scoring model training and importance detecting method and device
CN111950937A (en) * 2020-09-01 2020-11-17 上海海事大学 Key personnel risk assessment method based on fusion space-time trajectory
CN111950937B (en) * 2020-09-01 2023-12-01 上海海事大学 Important personnel risk assessment method based on fusion of space-time trajectories
CN112102074A (en) * 2020-10-14 2020-12-18 深圳前海弘犀智能科技有限公司 Grading card modeling method
CN112102074B (en) * 2020-10-14 2024-01-30 深圳前海弘犀智能科技有限公司 Score card modeling method
CN112017040B (en) * 2020-10-16 2021-01-29 银联商务股份有限公司 Credit scoring model training method, scoring system, equipment and medium
CN113269351A (en) * 2021-04-28 2021-08-17 贵州电网有限责任公司 Feature selection method for power grid equipment fault probability prediction
CN113554504A (en) * 2021-06-10 2021-10-26 浙江惠瀜网络科技有限公司 Vehicle loan wind control model generation method and device and scoring card generation method
CN113313587A (en) * 2021-06-29 2021-08-27 平安资产管理有限责任公司 Credit risk analysis method, device, equipment and medium based on artificial intelligence
CN113538131A (en) * 2021-07-23 2021-10-22 中信银行股份有限公司 Method and device for modeling modular scoring card, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111583031A (en) Application scoring card model building method based on ensemble learning
CN110263227B (en) Group partner discovery method and system based on graph neural network
Krishnaiah et al. Survey of classification techniques in data mining
Hassan et al. A hybrid of multiobjective Evolutionary Algorithm and HMM-Fuzzy model for time series prediction
CN107292097B (en) Chinese medicine principal symptom selection method based on feature group
CN112069310A (en) Text classification method and system based on active learning strategy
CN111008224B (en) Time sequence classification and retrieval method based on deep multitasking representation learning
CN108681742B (en) Analysis method for analyzing sensitivity of driver driving behavior to vehicle energy consumption
Satyanarayana et al. Survey of classification techniques in data mining
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN113139570A (en) Dam safety monitoring data completion method based on optimal hybrid valuation
CN114463540A (en) Segmenting images using neural networks
CN110991247B (en) Electronic component identification method based on deep learning and NCA fusion
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN111340107A (en) Fault diagnosis method and system based on convolutional neural network cost sensitive learning
CN113179276B (en) Intelligent intrusion detection method and system based on explicit and implicit feature learning
Ali et al. Fake accounts detection on social media using stack ensemble system
CN113095480A (en) Interpretable graph neural network representation method based on knowledge distillation
CN116051924B (en) Divide-and-conquer defense method for image countermeasure sample
CN112528554A (en) Data fusion method and system suitable for multi-launch multi-source rocket test data
CN111708865A (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm
Arshad et al. A Hybrid System for Customer Churn Prediction and Retention Analysis via Supervised Learning
Louati et al. Embedding channel pruning within the CNN architecture design using a bi-level evolutionary approach
Jabbari et al. Obtaining accurate probabilistic causal inference by post-processing calibration
Chen Brain Tumor Prediction with LSTM Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination