CN111583031A - Application scoring card model building method based on ensemble learning - Google Patents

Application scoring card model building method based on ensemble learning

Info

Publication number
CN111583031A
Authority
CN
China
Prior art keywords
data
value
data source
features
scoring card
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010414727.XA
Other languages
Chinese (zh)
Inventor
郑志骏
韩德志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN202010414727.XA
Publication of CN111583031A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 Enterprise or organisation modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Technology Law (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for building an application scoring card model based on ensemble learning. For data from different sources, the method performs data preprocessing and feature engineering and builds a deep neural network submodel per data source to predict the default probability given by that source; the submodels are then fused through a logistic regression model to obtain an overall default probability, which is converted into a credit score. Compared with traditional application scoring card models based on a single DNN or on logistic regression alone, the method balances stability, accuracy and interpretability, and greatly improves overall performance.

Description

Application scoring card model building method based on ensemble learning
Technical Field
The invention relates to the field of credit risk control, and in particular to a method for building an application scoring card model based on ensemble learning.
Background
The application scoring card model is an important model in the field of credit risk control: using a credit scoring model over the historical credit data of a loan applicant, it maps the probability of overdue payment or default to different credit grades. With the development of big data and artificial intelligence, the advantages of the shift from business-driven expert judgment to data-driven machine learning models over traditional business-driven risk control systems are very obvious: first, with the support of big data, the bad-debt rate is far lower than under manual judgment; second, the approach no longer depends on a few experts, which makes it easy to scale and standardize.
Most current machine-learning application scoring card models are developed from a logistic regression model with L1-norm regularization, so that the learned weights are sparse and features can conveniently be extracted or screened by their information value (IV). However, with the rise of internet finance, internet data are highly sparse and single variables discriminate risk only weakly, so the accuracy of this approach is low.
Deep neural networks are often used for classification in place of traditional machine learning algorithms because of their extremely high classification accuracy, usually above 90%. However, a deep neural network is a black box without interpretability, and its stability is limited; a pure deep neural network risk model is therefore unsuitable for a highly sensitive business such as financial risk control.
Disclosure of Invention
The invention provides a method for building an application scoring card model based on ensemble learning, which combines the strengths of ensemble learning and deep learning and improves the overall performance of the application scoring card model.
To this end, the invention provides a method for building an application scoring card model based on ensemble learning, comprising the following steps:
S1, respectively performing data preprocessing on the data of each data source, and performing feature engineering on the preprocessed data to obtain the data features of each data source;
S2, respectively constructing a gradient boosting decision tree model for each data source, thereby screening important features from the data features of each data source;
S3, respectively constructing a deep neural network submodel from the important features of each data source, thereby predicting the default probability given by each data source;
and S4, fusing the deep neural network submodels by constructing a logistic regression model to obtain a credit score.
Preferably, the step S1 includes the following steps:
S1.1, respectively performing missing-value preprocessing on the data of each data source according to the type of each missing value;
S1.2, performing SMOTE oversampling preprocessing on the positive-case (default) data in each data source;
and S1.3, respectively performing feature engineering on the preprocessed data of each data source.
Preferably, the missing-value preprocessing method comprises:
when the missing value is a continuous, completely random missing value, replacing it with the arithmetic mean of 5-10 of its neighboring values;
when the missing value is a discrete, completely random missing value, replacing it with a randomly chosen state;
and when the missing value is missing at random or missing not at random, replacing it with the new state value -1.
Preferably, the SMOTE oversampling preprocessing comprises:
for each sample x_i in the positive-case data, finding the k nearest neighbors of x_i by Euclidean distance, denoted x_i(near), near ∈ {1, ..., k};
then randomly selecting n of the k neighbors, denoted x_i(nn), nn ∈ {1, ..., n}, n < k, and linearly interpolating between each of the n neighbors and the original sample x_i, thereby synthesizing 2n new samples.
Preferably, the feature engineering comprises the following steps:
when the data feature is discrete, encoding it with the bad-sample rate;
and when the data feature is continuous, binning it with the chi-square binning method, and applying weight-of-evidence (WOE) encoding to the binned feature.
Preferably, the chi-square binning method comprises the following steps:
a. setting a chi-square threshold according to the required number of bins and the required confidence level;
b. sorting the values of the continuous feature to be binned in descending order, each value initially forming its own interval;
c. computing the chi-square value χ² of adjacent intervals;
d. merging the two adjacent intervals with the smallest chi-square value;
e. repeating steps c and d until the number of bins is at most 5 and the chi-square value is greater than the chi-square threshold.
Preferably, the chi-squared value calculation formula is:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{2} (A_ij - E_ij)² / E_ij,   where E_ij = N_i · C_j
wherein A_ij is the number of samples of the j-th class in the i-th interval, E_ij is the expected frequency of A_ij, N_i is the number of samples in the i-th interval, and C_j is the proportion of class-j samples in the whole sample.
Preferably, the calculation formula of the evidence weight code is as follows:
WOE_i = ln( (B_i / B) / (G_i / G) )
wherein B_i is the number of bad samples in bin i, B is the total number of bad samples, G_i is the number of good samples in bin i, and G is the total number of good samples.
Preferably, the step S2 includes the following steps:
S2.1, respectively constructing a gradient boosting decision tree model for each data source, the model containing N decision trees in total;
the decision function of the gradient boosting decision tree model is:
F_N(x) = Σ_{m=1}^{N} T(x; Θ_m)
where T(x; Θ_m) is the weak classifier generated in the m-th iteration;
s2.2, processing the N decision trees by adopting a CART decision tree method, and respectively calculating the importance score of each data feature of each data source;
the importance scores of the features are:
VIM_j^(norm) = VIM_j / Σ_i VIM_i
where VIM_j is the sum of the importances of data feature j over the N trees, and Σ_i VIM_i is the sum of the importances of all data features over the N trees;
and S2.3, respectively screening the features with the maximum importance score values as the important features of each data source.
Preferably, in the deep neural network submodel of step S3, the input layer has as many nodes as the number of important-feature dimensions (20), there are 2 hidden layers of sizes 14 and 10, the output layer has 2 nodes, the output function is the Softmax function, the loss function is the cross-entropy loss function, and the activation function is the ReLU function; the node weights are updated by the Adam iterative optimizer and the backpropagation algorithm so that the loss function reaches its minimum.
Preferably, the step S4 includes the following steps:
S4.1, constructing a logistic regression model from the prediction results of the deep neural network submodels, so that the submodels are fused and the overall default probability is predicted;
the overall default probability is as follows:
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where g(z) = 1 / (1 + e^(-z)) is the sigmoid function, θ is the factor influencing the target value, and x is the independent variable;
and S4.2, converting the overall default probability into a credit score.
The invention has the following advantages:
according to the method, a chi-square binning method is adopted to carry out feature engineering on data of each data source, a gradient boosting decision tree model is adopted to screen important features of each data source, so that the screened features have strong discrimination, meanwhile, a sub-model of each data source is established based on a deep neural network with high classification accuracy, and finally, the sub-models of each deep neural network are fused through a logistic regression model with strong stability, so that the stability and interpretability of an application scoring card model are ensured. The application scoring card model has expandability, and a method of fusing a plurality of deep nerve submodels is adopted, so that if third-party data cannot be accessed or damaged, only one submodel is influenced, and the integral application scoring card model cannot be greatly influenced.
Drawings
FIG. 1 is a flowchart of an application scoring card model building method based on ensemble learning according to the present invention.
Fig. 2 is a process for preprocessing data of a data source according to an embodiment of the present invention.
Fig. 3 is a result of screening data features by the gradient boosting decision tree model according to the embodiment of the present invention.
Fig. 4 is a training result of the deep neural network submodel provided in the embodiment of the present invention.
Detailed Description
The method for building an application scoring card model based on ensemble learning according to the present invention will be described in detail with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will become apparent from the following description and the claims. It should be noted that the drawings are in greatly simplified form and not to precise scale, and serve only to conveniently and clearly aid the description of the embodiments of the invention.
As shown in fig. 1, the invention provides a method for establishing an application scoring card model based on ensemble learning, which comprises the following steps:
S1, respectively performing data preprocessing on the data of each data source, and performing feature engineering on the preprocessed data to obtain the data features of each data source;
Specifically, the data sources include, for example, the institution's own credit data, central-bank credit-reference data, and third-party credit data; data preprocessing and feature engineering are performed on each source separately to obtain the data features of each source.
The step S1 includes the following steps:
S1.1, respectively performing missing-value preprocessing on the data of each data source according to the type of each missing value;
Specifically, the type of a missing value is determined from the business meaning of the field. For example, a missing "salary income" field may be due to a subjective personal reason and is therefore missing not at random, whereas a field such as a personal ID may contain garbled characters caused by transient system fluctuations, in which case the value is missing completely at random. A continuous, completely random missing value is replaced with the arithmetic mean of 5-10 of its neighboring values; a discrete, completely random missing value is replaced with a randomly chosen state; missing-at-random and missing-not-at-random values are both replaced with the new state value -1.
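As a minimal illustration (not part of the patent text), the three replacement rules above could be sketched in Python with pandas; the column handling and the neighbor window of 5 values on each side are assumptions:

```python
import numpy as np
import pandas as pd

def fill_continuous_mcar(s: pd.Series, window: int = 5) -> pd.Series:
    """Replace continuous completely-random missing values with the
    arithmetic mean of 5-10 neighboring values (here: `window` rows
    on each side of the gap)."""
    s = s.copy()
    for idx in s[s.isna()].index:
        pos = s.index.get_loc(idx)
        neighbors = pd.concat([s.iloc[max(0, pos - window):pos],
                               s.iloc[pos + 1:pos + 1 + window]]).dropna()
        s.loc[idx] = neighbors.mean()
    return s

def fill_discrete_mcar(s: pd.Series, seed: int = 0) -> pd.Series:
    """Replace discrete completely-random missing values with a random
    state drawn from the observed categories."""
    rng = np.random.default_rng(seed)
    observed = s.dropna().unique()
    return s.map(lambda v: v if pd.notna(v) else rng.choice(observed))

def fill_mar_or_mnar(s: pd.Series) -> pd.Series:
    """Replace missing-at-random / missing-not-at-random values with
    the new state value -1."""
    return s.fillna(-1)
```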
S1.2, performing SMOTE oversampling preprocessing on the positive-case data in each data source;
Specifically, SMOTE is an oversampling method that synthesizes additional samples of the minority class according to a rule, so that the data tend toward balance. By oversampling the scarcer positive-case data (i.e., records with default behavior) in each data source, the class imbalance within the data source can be mitigated.
The SMOTE oversampling preprocessing method comprises the following steps:
for each sample x_i in the minority-class data, find the k nearest neighbors of x_i by Euclidean distance, denoted x_i(near), near ∈ {1, ..., k};
then randomly select n of the k neighbors, denoted x_i(nn), nn ∈ {1, ..., n}, n < k, and linearly interpolate between each of the n neighbors and the original sample x_i, thereby synthesizing 2n new samples.
S1.3, respectively performing feature engineering on the preprocessed data of each data source.
The feature engineering comprises the following steps:
when a data feature is discrete, encode it with the bad-sample rate, i.e., convert each category of the feature into its corresponding bad-sample rate;
when a data feature is continuous, bin it with the chi-square binning method, and apply weight-of-evidence (WOE) encoding to the binned feature.
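For the discrete case, a sketch of bad-sample-rate encoding; the label column `is_bad` (1 = default) and the sample data are hypothetical:

```python
import pandas as pd

def bad_rate_encode(df: pd.DataFrame, feature: str,
                    label: str = "is_bad") -> pd.Series:
    """Map each category of `feature` to the fraction of bad samples
    (label == 1) observed within that category."""
    bad_rate = df.groupby(feature)[label].mean()
    return df[feature].map(bad_rate)

# Hypothetical usage:
df = pd.DataFrame({"occupation": ["a", "a", "b", "b", "b"],
                   "is_bad":     [1,   0,   0,   0,   1]})
df["occupation_enc"] = bad_rate_encode(df, "occupation")  # a -> 0.5, b -> 1/3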
Specifically, the card box separating method comprises the following steps:
a. setting a chi-square threshold according to the degree of freedom (the required number of bins) and the required confidence level;
specifically, the required confidence level needs to be set by itself, and banks typically require a 90% or 95% confidence.
b. The continuous features needing to be subjected to binning are arranged in a descending order according to the value size, and each value belongs to one interval;
specifically, the interval set in step b is only a tentative binning, and the total binning result is formed by continuously merging the intervals.
c. Calculating chi-square value X of adjacent interval2
The chi-square value calculation formula is as follows:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{2} (A_ij - E_ij)² / E_ij,   where E_ij = N_i · C_j
wherein A_ij is the number of samples of the j-th class in the i-th interval, E_ij is the expected frequency of A_ij, N_i is the number of samples in the i-th interval, and C_j is the proportion of class-j samples in the whole sample.
d. merge the two adjacent intervals with the smallest chi-square value;
e. repeat steps c and d until the number of bins is at most 5 and the chi-square value is greater than the chi-square threshold.
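A compact sketch of steps a-e under stated assumptions: binary good/bad labels, intervals kept as [good, bad] count pairs, and 3.841 as the 95%-confidence chi-square threshold at one degree of freedom. It follows the χ² formula given above; intervals are ordered ascending here, which does not change the merge logic:

```python
import numpy as np

def chi2_adjacent(a: np.ndarray, b: np.ndarray) -> float:
    """Chi-square statistic of two adjacent intervals, each a length-2
    array of class counts [good, bad] (the A_ij of the formula above)."""
    obs = np.vstack([a, b]).astype(float)     # A_ij
    n_i = obs.sum(axis=1, keepdims=True)      # N_i: samples per interval
    c_j = obs.sum(axis=0) / obs.sum()         # C_j: overall class proportions
    exp = n_i * c_j                           # E_ij = N_i * C_j
    exp[exp == 0] = 1e-9                      # avoid division by zero
    return float(((obs - exp) ** 2 / exp).sum())

def chimerge(values: np.ndarray, labels: np.ndarray,
             max_bins: int = 5, chi2_threshold: float = 3.841) -> list:
    """Merge adjacent intervals until at most `max_bins` remain and every
    adjacent pair exceeds the chi-square threshold (steps c-e)."""
    uniq = np.unique(values)                  # one tentative interval per value
    counts = [np.array([np.sum((values == v) & (labels == 0)),
                        np.sum((values == v) & (labels == 1))]) for v in uniq]
    cuts = list(uniq)                         # lower bound of each interval
    while len(counts) > 1:
        chis = [chi2_adjacent(counts[i], counts[i + 1])
                for i in range(len(counts) - 1)]
        i = int(np.argmin(chis))
        if len(counts) <= max_bins and chis[i] > chi2_threshold:
            break                             # stopping rule of step e
        counts[i] = counts[i] + counts.pop(i + 1)
        cuts.pop(i + 1)
    return cuts
```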
Specifically, the formula for calculating the evidence weight is as follows:
Figure BDA0002494535450000071
Bithe number of bad samples corresponding to the feature i, B is the total number of bad samples, GiAnd G is the total number of good samples corresponding to the characteristic i.
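A sketch of WOE encoding over already-binned data; the `bin` column and label name are assumptions, and a small constant guards against empty bins:

```python
import numpy as np
import pandas as pd

def woe_encode(df: pd.DataFrame, bin_col: str = "bin",
               label: str = "is_bad") -> pd.Series:
    """WOE_i = ln((B_i / B) / (G_i / G)) per bin i, per the formula above."""
    eps = 1e-9                                       # guard against empty classes
    bad = df.groupby(bin_col)[label].sum()           # B_i per bin
    good = df.groupby(bin_col)[label].count() - bad  # G_i per bin
    woe = np.log(((bad + eps) / bad.sum()) / ((good + eps) / good.sum()))
    return df[bin_col].map(woe)
```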
S2, respectively constructing a gradient lifting decision tree model for each data source, thereby screening important features from the data features of each data source;
specifically, the step S2 includes the following steps:
S2.1, respectively constructing a gradient boosting decision tree model for each data source, the model containing N decision trees in total;
Specifically, the gradient boosting decision tree (GBDT) model is a Boosting ensemble model: it sequentially learns a series of homogeneous weak learners in a highly adaptive way, i.e., each base model depends on the results of the previous models, and it combines them according to a deterministic strategy. Its decision function F_N(x) can be expressed as:
F_N(x) = Σ_{m=1}^{N} T(x; Θ_m)
where T(x; Θ_m) is the weak classifier generated in the m-th iteration;
s2.2, processing the N decision trees by adopting a CART decision tree method, and respectively calculating the importance score of each data feature of each data source;
Specifically, the N decision trees are processed by the CART decision tree method: first, compute the Gini index of the data of the data source; then choose a split, i.e., select the data feature (and feature value) that yields the minimum Gini index after splitting, and divide the data accordingly to construct branches; remove the used feature, and repeat the above steps within each branch until all data in every branch belong to the same class or all features have been used. The Gini index measures the purity of the data: the smaller the Gini index, the higher the purity and the lower the uncertainty. Suppose the K samples of a discrete feature can be divided into n classes; then the Gini index is:
Gini(p) = Σ_{m=1}^{n} p_m (1 - p_m) = 1 - Σ_{m=1}^{n} p_m²
where p_m is the probability that a sample point among the K samples belongs to the m-th class.
The importance of a data feature x_j at a node m, i.e., the change of the Gini index before and after the split at node m, is:
VIM_j = GI_m - GI_l - GI_r
where GI_l and GI_r are the Gini indexes of the two new nodes after the split, and GI_m is the Gini index before the split.
The gradient boosting decision tree model contains N decision trees in total; normalizing the importance of a feature x_j over the N trees gives the importance score of the data feature:
VIM_j^(norm) = VIM_j / Σ_i VIM_i
where VIM_j is the sum of the importances of data feature j over the N trees, and Σ_i VIM_i is the sum of the importances of all data features over the N trees;
and S2.3, respectively screening the features with the maximum importance score values as the important features of each data source.
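This screening can be reproduced, for instance, with scikit-learn's gradient boosting classifier, whose `feature_importances_` attribute returns exactly such normalized importance scores; `top_k = 20` matches the feature dimension used by the submodels below, and the feature matrix and labels are hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def select_important_features(X: np.ndarray, y: np.ndarray,
                              n_trees: int = 100, top_k: int = 20) -> np.ndarray:
    """Fit a GBDT on one data source and return the indices of the top_k
    features ranked by normalized importance score."""
    gbdt = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    gbdt.fit(X, y)
    return np.argsort(gbdt.feature_importances_)[::-1][:top_k]

# Hypothetical usage for one data source:
# important_idx = select_important_features(X_source, y_source)
# X_important = X_source[:, important_idx]
```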
S3, respectively constructing a deep neural network submodel according to the important data characteristics of each data source, thereby predicting default probability given by each data source;
Specifically, in the deep neural network submodel, the input layer has as many nodes as the number of important-feature dimensions (20), there are 2 hidden layers of sizes 14 and 10, the output layer has 2 nodes, the output function is the normalized exponential (Softmax) function, the loss function is the cross-entropy loss function, and the activation function is the rectified linear unit (ReLU); the node weights are updated by the adaptive moment estimation (Adam) iterative optimizer and the backpropagation algorithm so that the loss function reaches its minimum. During training of a submodel, training stops as soon as the loss function falls below a set threshold or the decrease of the loss over several consecutive rounds stays below a set value.
The Softmax function is as follows: in multi-class models such as multinomial logistic regression and linear discriminant analysis, the input to the Softmax function is the result of M different linear functions, and the probability that a sample vector x belongs to the j-th class is:
P(y = j | x) = e^(x^T w_j) / Σ_{k=1}^{M} e^(x^T w_k)
where w_j is the weight vector of the j-th linear function, each element of which weighs the corresponding element of the sample vector x.
The cross entropy loss function is defined as follows:
H(p, q) = -Σ_x p(x) log q(x)
where p (x) is the probability of the true distribution and q (x) is the probability estimate calculated by the model from the data.
The ReLU function is:
f(x) = max(0, x)
The ReLU function effectively avoids the vanishing-gradient problem.
The Adam optimizer is an iterative optimizer that computes the update step size by jointly considering first-moment and second-moment estimates of the gradient.
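A sketch of one submodel under the stated architecture (20-14-10-2, ReLU hidden layers, softmax output, cross-entropy loss, Adam, stop on loss plateau), written here with Keras as one possible realization; the framework choice and the stopping parameters are assumptions:

```python
import tensorflow as tf

def build_submodel(n_features: int = 20) -> tf.keras.Model:
    """Input of 20 important features, hidden layers of 14 and 10 nodes
    with ReLU, 2-node softmax output, cross-entropy loss, Adam optimizer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(14, activation="relu"),
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Stop training once the per-epoch loss decrease stays below a set value,
# mirroring the stopping rule described above (threshold values assumed):
early_stop = tf.keras.callbacks.EarlyStopping(monitor="loss",
                                              min_delta=1e-4, patience=5)
# model = build_submodel()
# model.fit(X_important, y, epochs=100, callbacks=[early_stop])
```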
And S4, fusing the deep neural network submodels by constructing a logistic regression model to obtain a credit score.
Specifically, a logistic regression model is constructed according to the prediction result of each deep neural network submodel, so that each deep neural network submodel is fused to obtain the overall default probability, and the overall default probability is converted into a credit score. The overall default probability is as follows:
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where g(z) = 1 / (1 + e^(-z)) is the sigmoid function, θ is the vector of factors influencing the target value, and x is the vector of independent variables.
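A sketch of the fusion step, in which each submodel's predicted default probability becomes one input variable of the logistic regression. The probability-to-score conversion of step S4.2 is not specified in the text, so the sketch uses the common points-to-double-odds scaling as an assumption (base score, base odds and PDO values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_submodels(sub_probs: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """sub_probs has shape (n_samples, n_submodels); column j holds the
    default probability predicted by data source j's DNN submodel."""
    lr = LogisticRegression()
    lr.fit(sub_probs, y)          # learns theta over the submodel outputs
    return lr

def prob_to_score(p: np.ndarray, base_score: float = 600.0,
                  base_odds: float = 1 / 50, pdo: float = 20.0) -> np.ndarray:
    """Assumed scaling: `base_score` points at bad:good odds `base_odds`,
    losing `pdo` points every time the odds double."""
    factor = pdo / np.log(2)
    offset = base_score + factor * np.log(base_odds)
    odds = p / (1 - p)            # bad:good odds from the default probability
    return offset - factor * np.log(odds)

# overall_p = fuse_submodels(sub_probs, y).predict_proba(sub_probs)[:, 1]
# credit_score = prob_to_score(overall_p)
```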
The following is an embodiment provided by the invention:
Scripts are executed in the server-side Spark cluster (or SQL statement queries are supported through a visualization window) to perform data preprocessing on the data of the different data sources; the preprocessing procedure is shown in fig. 2.
A script is executed on the server to perform feature engineering on the preprocessed data: continuous data features are binned with the chi-square binning method, and WOE encoding is applied to the binned features.
A script is executed on the server to select data features with the GBDT model; the selection result is shown in fig. 3. In fig. 3, Best Score is the final fitting score of the GBDT model (the closer to 1, the better the fit), and Importances shows the importance score of each feature (the higher the score, the more important the feature).
A script is executed on the server to train the deep neural network submodel of each data source. The training results are shown in fig. 4: with the Adam optimizer, the loss of each submodel decreases as the training batches progress and finally stabilizes.
Finally, a script is executed on the server to fuse the submodels and give the final prediction result.
In this method, a chi-square binning method is used for feature engineering on the data of each data source, and a gradient boosting decision tree model screens the important features of each data source, so that the screened features are highly discriminative; a submodel is built for each data source on a deep neural network with high classification accuracy; finally, the deep neural network submodels are fused through a highly stable logistic regression model, which ensures the stability and interpretability of the application scoring card model. The model is also extensible: because several deep neural network submodels are fused, if a third-party data source becomes inaccessible or corrupted, only one submodel is affected and the overall application scoring card model is not greatly disturbed.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A method for establishing an application scoring card model based on ensemble learning is characterized by comprising the following steps:
S1, respectively performing data preprocessing on the data of each data source, and performing feature engineering on the preprocessed data to obtain the data features of each data source;
S2, respectively constructing a gradient boosting decision tree model for each data source, thereby screening important features from the data features of each data source;
S3, respectively constructing a deep neural network submodel from the important features of each data source, thereby predicting the default probability given by each data source;
and S4, fusing the deep neural network submodels by constructing a logistic regression model to obtain a credit score.
2. The method for building a scoring card model based on ensemble learning of claim 1, wherein said step S1 comprises the steps of:
S1.1, respectively performing missing-value preprocessing on the data of each data source according to the type of each missing value;
S1.2, performing SMOTE oversampling preprocessing on the positive-case data in each data source;
and S1.3, respectively performing feature engineering on the preprocessed data of each data source.
3. The method for building a scoring card model based on ensemble learning of claim 1, wherein said step S2 comprises the steps of:
S2.1, respectively constructing a gradient boosting decision tree model for each data source, the model containing N decision trees in total;
the decision function of the gradient boosting decision tree model is:
F_N(x) = Σ_{m=1}^{N} T(x; Θ_m)
where T(x; Θ_m) is the weak classifier generated in the m-th iteration;
s2.2, processing the N decision trees by adopting a CART decision tree method, and respectively calculating the importance score of each data feature of each data source;
the importance scores of the features are:
VIM_j^(norm) = VIM_j / Σ_i VIM_i
where VIM_j is the sum of the importances of data feature j over the N trees, and Σ_i VIM_i is the sum of the importances of all data features over the N trees;
and S2.3, respectively screening the features with the maximum importance score values as the important features of each data source.
4. The method as claimed in claim 1, wherein in the deep neural network submodel of step S3 the input layer has as many nodes as the number of important-feature dimensions (20), there are 2 hidden layers of sizes 14 and 10, the output layer has 2 nodes, the output function is the Softmax function, the loss function is the cross-entropy loss function, and the activation function is the ReLU function; the node weights are updated by the Adam iterative optimizer and the backpropagation algorithm so that the loss function reaches its minimum.
5. The method for building a scoring card model based on ensemble learning of claim 1, wherein said step S4 comprises the steps of:
s4.1, constructing a logistic regression model according to the prediction result of each deep neural network submodel, so that each deep neural network submodel is fused, and the overall default probability is predicted;
the overall default probability is as follows:
h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
where g(z) = 1 / (1 + e^(-z)) is the sigmoid function, θ is the factor influencing the target value, and x is the independent variable;
and S4.2, converting the overall default probability into a credit score.
6. The method for building an application scoring card model based on ensemble learning according to claim 2, wherein the missing-value preprocessing method comprises:
when the missing value is a continuous, completely random missing value, replacing it with the arithmetic mean of 5-10 of its neighboring values;
when the missing value is a discrete, completely random missing value, replacing it with a randomly chosen state;
and when the missing value is missing at random or missing not at random, replacing it with the new state value -1.
7. The ensemble learning-based application scoring card model building method as claimed in claim 2, wherein the SMOTE oversampling preprocessing method comprises:
for each sample x_i in the positive-case data, finding the k nearest neighbors of x_i by Euclidean distance, denoted x_i(near), near ∈ {1, ..., k};
then randomly selecting n of the k neighbors, denoted x_i(nn), nn ∈ {1, ..., n}, n < k, and linearly interpolating between each of the n neighbors and the original sample x_i, thereby synthesizing 2n new samples.
8. The method for building a scoring card model based on ensemble learning as claimed in claim 2, wherein the feature engineering comprises the following steps:
when the data feature is discrete, encoding it with the bad-sample rate;
and when the data feature is continuous, binning it with the chi-square binning method, and applying weight-of-evidence (WOE) encoding to the binned feature.
9. The method for building an application scoring card model based on ensemble learning according to claim 8, wherein the chi-square binning method comprises the following steps:
a. setting a chi-square threshold according to the required number of bins and the required confidence level;
b. sorting the values of the continuous feature to be binned in descending order, each value initially forming its own interval;
c. computing the chi-square value χ² of adjacent intervals;
The chi-square value calculation formula is as follows:
χ² = Σ_{i=1}^{2} Σ_{j=1}^{2} (A_ij - E_ij)² / E_ij,   where E_ij = N_i · C_j
wherein A_ij is the number of samples of the j-th class in the i-th interval, E_ij is the expected frequency of A_ij, N_i is the number of samples in the i-th interval, and C_j is the proportion of class-j samples in the whole sample;
d. merging the two adjacent intervals with the smallest chi-square value;
e. repeating steps c and d until the number of bins is at most 5 and the chi-square value is greater than the chi-square threshold.
10. The method for building an application scoring card model based on ensemble learning according to claim 8, wherein the weight-of-evidence encoding is calculated as:
WOE_i = ln( (B_i / B) / (G_i / G) )
wherein B_i is the number of bad samples in bin i, B is the total number of bad samples, G_i is the number of good samples in bin i, and G is the total number of good samples.
CN202010414727.XA 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning Pending CN111583031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010414727.XA CN111583031A (en) 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010414727.XA CN111583031A (en) 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning

Publications (1)

Publication Number Publication Date
CN111583031A (en) 2020-08-25

Family

ID=72112779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010414727.XA Pending CN111583031A (en) 2020-05-15 2020-05-15 Application scoring card model building method based on ensemble learning

Country Status (1)

Country Link
CN (1) CN111583031A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110650046A (en) * 2019-09-24 2020-01-03 北京明略软件系统有限公司 Network node importance scoring model training and importance detecting method and device
CN111950937A (en) * 2020-09-01 2020-11-17 上海海事大学 Key personnel risk assessment method based on fusion space-time trajectory
CN112102074A (en) * 2020-10-14 2020-12-18 深圳前海弘犀智能科技有限公司 Grading card modeling method
CN112017040B (en) * 2020-10-16 2021-01-29 银联商务股份有限公司 Credit scoring model training method, scoring system, equipment and medium
CN113269351A (en) * 2021-04-28 2021-08-17 贵州电网有限责任公司 Feature selection method for power grid equipment fault probability prediction
CN113313587A (en) * 2021-06-29 2021-08-27 平安资产管理有限责任公司 Credit risk analysis method, device, equipment and medium based on artificial intelligence
CN113538131A (en) * 2021-07-23 2021-10-22 中信银行股份有限公司 Method and device for modeling modular scoring card, storage medium and electronic equipment
CN113554504A (en) * 2021-06-10 2021-10-26 浙江惠瀜网络科技有限公司 Vehicle loan wind control model generation method and device and scoring card generation method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187712A1 (en) * 2002-03-27 2003-10-02 First Data Corporation Decision tree systems and methods
CN107992982A (en) * 2017-12-28 2018-05-04 上海氪信信息技术有限公司 A kind of Default Probability Forecasting Methodology of the unstructured data based on deep learning
CN108109066A (en) * 2017-12-11 2018-06-01 上海前隆信息科技有限公司 A kind of credit scoring model update method and system
CN108475393A (en) * 2016-01-27 2018-08-31 华为技术有限公司 The system and method that decision tree is predicted are promoted by composite character and gradient
CN108765127A (en) * 2018-04-26 2018-11-06 浙江邦盛科技有限公司 A kind of credit scoring card feature selection approach based on monte-carlo search
WO2019061187A1 (en) * 2017-09-28 2019-04-04 深圳乐信软件技术有限公司 Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN110232400A (en) * 2019-04-30 2019-09-13 冶金自动化研究设计院 A kind of gradient promotion decision neural network classification prediction technique
CN110580268A (en) * 2019-08-05 2019-12-17 西北大学 Credit scoring integrated classification system and method based on deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187712A1 (en) * 2002-03-27 2003-10-02 First Data Corporation Decision tree systems and methods
CN108475393A (en) * 2016-01-27 2018-08-31 华为技术有限公司 The system and method that decision tree is predicted are promoted by composite character and gradient
WO2019061187A1 (en) * 2017-09-28 2019-04-04 深圳乐信软件技术有限公司 Credit evaluation method and apparatus and gradient boosting decision tree parameter adjustment method and apparatus
CN108109066A (en) * 2017-12-11 2018-06-01 上海前隆信息科技有限公司 A kind of credit scoring model update method and system
CN107992982A (en) * 2017-12-28 2018-05-04 上海氪信信息技术有限公司 A kind of Default Probability Forecasting Methodology of the unstructured data based on deep learning
CN108765127A (en) * 2018-04-26 2018-11-06 浙江邦盛科技有限公司 A kind of credit scoring card feature selection approach based on monte-carlo search
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN110232400A (en) * 2019-04-30 2019-09-13 冶金自动化研究设计院 A kind of gradient promotion decision neural network classification prediction technique
CN110580268A (en) * 2019-08-05 2019-12-17 西北大学 Credit scoring integrated classification system and method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王程龙; 陈程: "基于诀策树的P2P网贷平台信用评级体系研究" ("Research on a credit rating system for P2P online lending platforms based on decision trees") *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110650046A (en) * 2019-09-24 2020-01-03 北京明略软件系统有限公司 Network node importance scoring model training and importance detecting method and device
CN111950937A (en) * 2020-09-01 2020-11-17 上海海事大学 Key personnel risk assessment method based on fusion space-time trajectory
CN111950937B (en) * 2020-09-01 2023-12-01 上海海事大学 Important personnel risk assessment method based on fusion of space-time trajectories
CN112102074A (en) * 2020-10-14 2020-12-18 深圳前海弘犀智能科技有限公司 Grading card modeling method
CN112102074B (en) * 2020-10-14 2024-01-30 深圳前海弘犀智能科技有限公司 Score card modeling method
CN112017040B (en) * 2020-10-16 2021-01-29 银联商务股份有限公司 Credit scoring model training method, scoring system, equipment and medium
CN113269351A (en) * 2021-04-28 2021-08-17 贵州电网有限责任公司 Feature selection method for power grid equipment fault probability prediction
CN113554504A (en) * 2021-06-10 2021-10-26 浙江惠瀜网络科技有限公司 Vehicle loan wind control model generation method and device and scoring card generation method
CN113313587A (en) * 2021-06-29 2021-08-27 平安资产管理有限责任公司 Credit risk analysis method, device, equipment and medium based on artificial intelligence
CN113538131A (en) * 2021-07-23 2021-10-22 中信银行股份有限公司 Method and device for modeling modular scoring card, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111583031A (en) Application scoring card model building method based on ensemble learning
CN110263227B (en) Group partner discovery method and system based on graph neural network
Krishnaiah et al. Survey of classification techniques in data mining
Hassan et al. A hybrid of multiobjective Evolutionary Algorithm and HMM-Fuzzy model for time series prediction
CN107292097B (en) Chinese medicine principal symptom selection method based on feature group
CN112069310A (en) Text classification method and system based on active learning strategy
CN111008224B (en) Time sequence classification and retrieval method based on deep multitasking representation learning
CN108681742B (en) Analysis method for analyzing sensitivity of driver driving behavior to vehicle energy consumption
Satyanarayana et al. Survey of classification techniques in data mining
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN113139570A (en) Dam safety monitoring data completion method based on optimal hybrid valuation
CN114463540A (en) Segmenting images using neural networks
CN110991247B (en) Electronic component identification method based on deep learning and NCA fusion
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN111340107A (en) Fault diagnosis method and system based on convolutional neural network cost sensitive learning
CN113179276B (en) Intelligent intrusion detection method and system based on explicit and implicit feature learning
Ali et al. Fake accounts detection on social media using stack ensemble system
CN113095480A (en) Interpretable graph neural network representation method based on knowledge distillation
CN116051924B (en) Divide-and-conquer defense method for image countermeasure sample
CN112528554A (en) Data fusion method and system suitable for multi-launch multi-source rocket test data
CN111708865A (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm
Arshad et al. A Hybrid System for Customer Churn Prediction and Retention Analysis via Supervised Learning
Louati et al. Embedding channel pruning within the CNN architecture design using a bi-level evolutionary approach
Jabbari et al. Obtaining accurate probabilistic causal inference by post-processing calibration
Chen Brain Tumor Prediction with LSTM Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination