AU2019101189A4

AU2019101189A4 - A financial mining method for credit prediction

Info

Publication number: AU2019101189A4
Application number: AU2019101189A
Authority: AU
Inventors: Ming Han; Shan Jiang; Ziyan LI; Junyi Ren; Chuyi Xiao; Xinxin Zhang
Original assignee: Han Ming Miss; Jiang Shan Miss; Li Ziyan Miss; Zhang Xinxin Miss
Current assignee: Han Ming Miss; Jiang Shan Miss; Li Ziyan Miss; Zhang Xinxin Miss
Priority date: 2019-10-02
Filing date: 2019-10-02
Publication date: 2020-01-23
Anticipated expiration: 2027-10-02

Abstract

How to evaluate and identify the potential default risk of a borrower, or calculate the borrower's default probability, before issuing a loan is the basis of, and a significant link in, the credit risk management of modern financial institutions. This paper studies the statistical analysis of historical loan data from banks and other financial institutions using the idea of imbalanced data classification, and establishes a loan default prediction model that employs the Random Forest algorithm. The results show that the Random Forest algorithm outperforms the decision tree and logistic regression algorithms in predictive performance. In addition, by using the Random Forest algorithm to rank the importance of the features, the features that have the greatest impact on eventual default can be identified, allowing a more effective judgment of lending risk in the financial field.

Index Terms: Random Forest, loan default prediction, data mining

Figure 1 (flow chart): Introduces unbalanced data classification and Random Forest algorithm; Data preprocessing and data analysis; Compares models of three different algorithms; Conclusion: Random Forest algorithm has better performance; Summarizes the paper

Figure 2 (diagram; recoverable labels: classification results)

Description

A financial mining method for credit prediction

FIELD OF THE INVENTION

This invention is in the field of financial big data.

BACKGROUND

With the vigorous development of the world economy and the gradual deepening of China's reform and opening up, and with changes both in how enterprises develop and in how people think about consumption, loans have become an important way for enterprises and individuals to solve financial problems. As banks introduce a variety of loan products and demand keeps growing, non-performing loans, that is, defaults, have also proliferated. To avoid defaults, banks and other financial institutions evaluate or score a borrower's credit risk when making loans, predict the probability of default, and decide whether to lend according to the result. Effectively evaluating and identifying a potential borrower's default risk before granting a loan is the basis of, and an important link in, the credit risk management of financial institutions. Using a scientific model and system to determine the risk of loan default can minimise risk and maximise profit.

2019101189 02 Oct 2019

SUMMARY

This paper mainly studies how to apply the idea of imbalanced data classification to the analysis of historical loan data from banks and other financial institutions, and how to predict the likelihood of default with a classification model based on Random Forest. The first section introduces imbalanced data classification and the Random Forest algorithm. The second section covers data preprocessing and data analysis. The third section constructs a Random Forest classification model to predict loan defaults and reports its results and AUC value; comparing the Random Forest algorithm with the decision tree and logistic regression algorithms leads to the conclusion that Random Forest performs better. Finally, the importance of each feature is evaluated to determine which features most influence eventual default. The fourth section summarizes the full text.

Table 1 Default classification based on Random Forests

T = the training set

N_tree = the number of decision trees

M = the number of predictor variables in each sample

M_try = the number of variables considered for the split at each tree node

S_sampsize = the size of each Bootstrap sample

The computation process:

For (i_tree = 1; i_tree <= N_tree; i_tree++) {

1. Generate a Bootstrap sample of size S_sampsize from the training set T.

2. Build an unpruned tree i_tree from the Bootstrap sample. While growing i_tree, randomly choose M_try variables at each node and split on the best of them according to the Gini value.

}

Output:

Regression problems: the predicted result is the average of the values returned by all trees.

Classification problems: the predicted result is the class chosen by the majority of the decision trees.
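The steps of Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used for the experiments: the helper names fit_forest and predict_forest are hypothetical, scikit-learn's DecisionTreeClassifier stands in for the single-tree learner, the labels are assumed to be 0/1, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_tree=25, m_try=3, sampsize=None, seed=0):
    """Grow n_tree unpruned trees, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    sampsize = len(X) if sampsize is None else sampsize
    trees = []
    for i_tree in range(n_tree):
        # Step 1: bootstrap sample of size S_sampsize, drawn with replacement.
        idx = rng.integers(0, len(X), size=sampsize)
        # Step 2: unpruned tree; max_features=m_try makes every node choose
        # its split among m_try randomly selected variables (Gini criterion).
        tree = DecisionTreeClassifier(criterion="gini", max_features=m_try,
                                      random_state=i_tree)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Classification output: majority vote over the trees (labels are 0/1)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_tree, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)


# Synthetic stand-in data; the loan data set is not reproduced here.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
forest = fit_forest(X[:400], y[:400])
pred = predict_forest(forest, X[400:])
accuracy = float((pred == y[400:]).mean())
print(f"hold-out accuracy: {accuracy:.3f}")
```

An odd number of trees is used so that the majority vote cannot tie.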

Table 2 Data set variable case

Variable name	Variable description	Type
SeriousDlqin2yrs	Whether the borrower defaulted	Y/N
RevolvingUtilizationOfUnsecuredLines	The total balance on credit cards and personal credit loans (excluding mortgages and installment payments such as car loans) divided by the sum of credit lines	Percentage
age	Borrower age	Integer
NumberOfTime30-59DaysPastDueNotWorse	The number of times the borrower has been 30-59 days overdue in the past two years	Integer
DebtRatio	Monthly debt repayments, alimony, living costs, etc. divided by total monthly income	Percentage
MonthlyIncome	Monthly income	Real
NumberOfOpenCreditLinesAndLoans	Number of open loans (instalment loans such as car loans and mortgages) and credit lines (such as credit cards)	Integer
NumberOfTimes90DaysLate	The number of times the borrower has been 90 or more days overdue in the past two years	Integer
NumberRealEstateLoansOrLines	Number of mortgage and real estate loans, including mortgage-backed credit lines	Integer
NumberOfTime60-89DaysPastDueNotWorse	The number of times the borrower has been 60-89 days overdue in the past two years	Integer
NumberOfDependents	Number of people (spouse, children, etc.) in the family whom the borrower supports, excluding the borrower	Integer

Table 3 Frequency distribution of the variable age

Age	Number of People	Percentage of age interval	Number of people who defaulted	Percentage of defaulters within age interval
Lower than 25	3028	2.02%	338	11.16%
26-35	18458	12.30%	2053	11.12%
36-45	29819	19.90%	2628	8.80%
46-55	36690	24.50%	2786	7.60%
56-65	33406	22.30%	1531	4.60%
Higher than 65	28599	19.10%	690	2.40%

Table 4 Frequency distribution of the variable NumberRealEstateLoansOrLines

NumberRealEstateLoansOrLines	Number of people	Ratio	Number of defaults	Percentage of defaults in this range
Below 5	149207	99.47%	9884	6.6%
6-10	699	0.47%	121	17.3%
11-15	70	0.05%	16	22.8%
16-20	14	0.009%	3	21.4%
Above 20	10	0.007%	2	20%

Table 5 Frequency distribution of the variable NumberOfTime30-59DaysPastDueNotWorse

NumberOfTime30-59DaysPastDueNotWorse	Number of people	Ratio	Number of defaulters	Percentage of defaults in this interval
0	126018	84%	5041	4%
1	16032	10.70%	2409	15%
2	4598	3.10%	1219	26.50%
3	1754	1.20%	618	35.20%
4	747	0.50%	318	42.60%
5	342	0.23%	154	45%
6	140	0.09%	74	52.90%
7 or more	104	0.07%	50	48.07%

Table 6 Comparison of Random Forest with the other algorithms

Algorithm	AUC value
Random Forest	0.86
Decision Tree	0.8
Logistic Regression	0.8

Table 7 Feature importance of each variable

Variable	feature_importance
RevolvingUtilizationOfUnsecuredLines	0.3411
NumberOfTime30-59DaysPastDueNotWorse	0.1694
NumberOfTimes90DaysLate	0.1594
NumberOfTime60-89DaysPastDueNotWorse	0.0727
age	0.0677
DebtRatio	0.0625
MonthlyIncome	0.0488
NumberOfOpenCreditLinesAndLoans	0.0442
NumberRealEstateLoansOrLines	0.0223
NumberOfDependents	0.0117

DESCRIPTION OF DRAWING

Figure 1 Analysis flow chart of credit forecast

Figure 2 Random Forests

Figure 3 Modeling flowcharts

DESCRIPTION OF PREFERRED EMBODIMENT

Random Forest Algorithm

Imbalanced data classification

Imbalanced data, in which the number of samples of one class (the majority) far exceeds that of the other (the minority), is common in network intrusion detection, financial transaction fraud detection, text classification, and similar tasks. Most of the time we are more interested in correctly classifying the minority class. Imbalanced data classification can be addressed by applying penalty weights to the positive and negative samples. In detail, the approach assigns different weights to classes of different sizes during training: in general, the class with few samples receives a high weight and the class with many samples receives a low weight, and the model is then fitted with these weights.

Introduction of Random Forest

Random Forest is an ensemble algorithm built from randomized decision trees: it constructs a forest using random techniques. The main idea is to randomly select a subset of variables (features) as split candidates when growing each tree, repeat this many times, and keep the trees as independent of one another as possible. Once the forest has been built, a new sample entering the forest is judged by every decision tree, and it is assigned to the class that receives the most votes (the process is visualized in Figure 2).

Random Forest algorithm principle and characteristics

The Random Forest algorithm applies to both classification and regression problems. Its steps are listed in Table 1.

As the algorithm shows, the randomness of a Random Forest is manifested mainly in two aspects: randomness in the data space, implemented by Bagging (Bootstrap Aggregating), and randomness in the feature space, implemented by random feature sampling (Random Subspace). For classification problems, each decision tree in the forest classifies the new sample, and the decisions of the trees are then combined by majority vote to give the final class of the sample. Random Forests have the following features:

1. Randomness is introduced in both the rows (data records) and the columns (variables), so a Random Forest does not overfit easily.

2. Random Forest has good resistance to noise.

3. When the data set contains a large number of missing values, Random Forests can effectively estimate and handle them.

4. It adapts well to different data sets: it can process both discrete and continuous data, and the data set does not need to be normalized.

5. It can rank variables by importance, which makes the variables easy to interpret. There are two methods for calculating variable importance in Random Forests. The first is based on the mean decrease in accuracy on OOB (Out of Bag) samples. While a decision tree is grown, it is tested on its OOB samples and the number of misclassified samples is recorded; the values of one variable (column) in the OOB samples are then randomly permuted, the tree predicts again, and the number of misclassified samples is recorded once more. The difference between the two error counts divided by the total number of OOB samples is the change in error rate for that tree, and averaging this change over all trees in the forest gives the mean decrease in accuracy. The second is based on the decrease in Gini impurity at each split: trees in the forest choose node splits according to the decrease in Gini impurity, and summing the decreases over all nodes in the forest where a given variable is the split variable gives that variable's Gini decrease.
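The Gini decrease used by the second importance measure can be worked through by hand. The sketch below is an illustration only: it takes the counts from Table 5 and evaluates one hypothetical candidate split ("never 30-59 days past due" versus "past due at least once"); the split itself and the helper name gini are assumptions made for the example, not something reported in the experiment.

```python
def gini(n_default, n_total):
    """Gini impurity of a node with two classes (default / non-default)."""
    p = n_default / n_total
    return 1.0 - p**2 - (1.0 - p)**2


# Counts copied from Table 5: (number of people, number of defaulters) for
# 0, 1, ..., 6 and 7-or-more past-due events of 30-59 days.
rows = [(126018, 5041), (16032, 2409), (4598, 1219), (1754, 618),
        (747, 318), (342, 154), (140, 74), (104, 50)]

n_parent = sum(n for n, _ in rows)
d_parent = sum(d for _, d in rows)

# Candidate split: first row (never past due) versus all remaining rows.
n_left, d_left = rows[0]
n_right, d_right = n_parent - n_left, d_parent - d_left

# Impurity after the split is the size-weighted average of the two children.
weighted_child = ((n_left / n_parent) * gini(d_left, n_left)
                  + (n_right / n_parent) * gini(d_right, n_right))
gini_decrease = gini(d_parent, n_parent) - weighted_child
print(f"parent Gini = {gini(d_parent, n_parent):.4f}, decrease = {gini_decrease:.4f}")
```

A variable's Gini importance accumulates decreases of this kind over every node in the forest that splits on it.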

Random Forest in imbalanced data classification

In Random Forest, the default weight of every class is 1, which assumes that all misclassifications cost the same. In scikit-learn, the Random Forest classifier provides the class_weight parameter (a dict or list of dicts), through which weights for the different classes can be specified manually. If the parameter is 'balanced', the weights are adjusted automatically from the values of y so that each class weight is inversely proportional to its frequency in the input data.

The calculation formula is

n_samples / (n_classes * np.bincount(y))     (1)

'balanced_subsample' is similar to 'balanced', except that the weights are computed from the bootstrap sample (drawn with replacement) used to grow each tree rather than from the whole sample. This approach can therefore be used to address the imbalanced data classification problem.
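As a worked instance of formula (1), the snippet below applies it to the class counts reported for the training set in the Data Set section (139974 non-default and 10026 default samples). It is a sketch for illustration; the label vector is rebuilt from those counts rather than read from the data.

```python
import numpy as np

# Label vector with the class counts reported for the training set:
# 139974 non-default (0) and 10026 default (1) samples.
y = np.concatenate([np.zeros(139974, dtype=int), np.ones(10026, dtype=int)])

n_samples = y.size
n_classes = np.unique(y).size
weights = n_samples / (n_classes * np.bincount(y))  # formula (1)

print(dict(zip(range(n_classes), weights.round(4))))
```

The minority (default) class receives a weight of roughly 7.48 and the majority class roughly 0.54, so a misclassified defaulter costs about fourteen times as much as a misclassified non-defaulter.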

Data preprocessing and data analysis

Data Set

The loan default data set used in this paper contains 250000 samples: 150000 in the training set and 100000 in the test set.

The training set contains historical data on 150000 borrowers. Of these, 10026 are default samples, accounting for 6.684% of the total (a loan default rate of 6.684%), and 139974 are non-default samples, accounting for 93.316% of the total. This is therefore a typical highly imbalanced data set. The data set covers the borrower's age, income, family situation, and loan conditions, with 11 variables in total: SeriousDlqin2yrs is the label, and the other 10 variables are predictive features. Table 2 lists the variable names and data types.

Data Analysis

The experimental environment used in this paper is Anaconda3 with Python 3. First, the data were analyzed in a preliminary way: the distribution of the default rate over each independent variable was examined, and frequency distribution tables such as Table 3 were generated (decimals are rounded).

Table 3 shows that the default rate exceeds 10% both for people younger than 25 and for people aged 26-35, and that default rates fall as age increases.
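A frequency distribution table of this kind can be produced with pandas. The following is a minimal sketch on a made-up twelve-row sample, not the actual data; the bin edges are an assumption (age 25 is placed in the first band, since Table 3 does not say where it belongs), and the output column names are chosen for the example.

```python
import pandas as pd

# Tiny made-up sample; the column names follow Table 2.
df = pd.DataFrame({
    "age": [22, 24, 30, 33, 41, 44, 50, 58, 63, 70, 72, 80],
    "SeriousDlqin2yrs": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Right-closed bins (0, 25], (25, 35], ..., labelled after Table 3.
bins = [0, 25, 35, 45, 55, 65, float("inf")]
labels = ["Lower than 25", "26-35", "36-45", "46-55", "56-65", "Higher than 65"]
df["age_band"] = pd.cut(df["age"], bins=bins, labels=labels)

# Per band: head count, number of defaulters, share of sample, default rate.
table = df.groupby("age_band", observed=False)["SeriousDlqin2yrs"].agg(
    people="count", defaulters="sum")
table["share_of_people"] = table["people"] / len(df)
table["default_rate"] = table["defaulters"] / table["people"]
print(table)
```

The same pattern applies to the other predictors by swapping the column and the bin edges.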

Table 4 shows that the number of real estate and mortgage loans is below 5 for 99.47% of borrowers, and that the default rate rises markedly for borrowers with more loans: it is 17.3% for borrowers with 6-10 loans and above 20% for borrowers with more than 10 loans.

Table 5 shows that the default rate of borrowers who have never been 30-59 days past due is only about 4%, but that it increases significantly as the number of such delinquencies grows. The frequency distribution tables for the other two variables, the number of times 60-89 days past due and the number of times 90 or more days past due, show the same trend as Table 5. It can therefore be concluded that the more delinquencies a borrower has, the higher the default rate.

The data set used in this study has 10 predictor variables. Each was analyzed statistically and its frequency distribution table was produced as shown above. Except for NumberOfOpenCreditLinesAndLoans (the number of open loans and credit lines), which has no obvious correlation with the default rate, all variables are related to whether the borrower eventually defaults.

Data Pre-processing

A preliminary exploration of the data reveals missing values in the MonthlyIncome and NumberOfDependents variables: 29731 and 3924, respectively.

Outliers: the minimum value of the age variable is 0, which is an outlier. The three past-due variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate contain a small number of values equal to 96 or 98, which may be abnormal values or special codes.

Data preprocessing: when the data are read with the pandas library in Python, the na_values parameter of pd.read_csv() is set to a user-defined list so that 0 in the age variable and 96 and 98 in the three past-due variables are treated as NaN. The sklearn.preprocessing.Imputer class is then used to replace every NaN in the data set with the mean of the corresponding column.
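The preprocessing step might look like the sketch below. Two things in it are assumptions rather than part of the original description: na_values is given as a per-column dict, so that a legitimate 0 in a past-due counter is not turned into NaN along with an age of 0, and sklearn.impute.SimpleImputer is used because sklearn.preprocessing.Imputer is no longer available in recent scikit-learn releases. The four CSV rows are made up.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows standing in for the training file; column names follow Table 2.
csv_text = """age,NumberOfTime30-59DaysPastDueNotWorse,MonthlyIncome
45,0,5000
0,1,3000
60,98,
30,96,7000
"""

# A dict restricts each sentinel to its own column, so a legitimate 0 in a
# past-due counter is kept while an age of 0 becomes NaN.
na_values = {"age": [0], "NumberOfTime30-59DaysPastDueNotWorse": [96, 98]}
df = pd.read_csv(io.StringIO(csv_text), na_values=na_values)

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(clean)
```

With the real file, the other two past-due columns would get the same [96, 98] entry in the dict.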

Building models and experimental results

Random Forest Model

In this experiment, we use the Python package sklearn (more specifically, sklearn.ensemble.RandomForestClassifier) to build a Random Forest model.

Here are the parameters and their settings:

n_estimators: The number of decision trees, set to 100.

oob_score: Whether to use out-of-bag samples to estimate the score, set to True.

min_samples_split: The minimum number of samples required to split an internal node, set to 2.

min_samples_leaf: The minimum number of samples required at a leaf node, set to 50.

n_jobs: The number of jobs to run in parallel, set to -1.

class_weight: Controls the weight of each class, set to balanced_subsample.

bootstrap: Whether bootstrap samples are used when building trees, set to True.
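Put together, the settings listed above correspond to a constructor call like the one below. This is a sketch on synthetic data with roughly 7% positives, standing in for the loan data; random_state is not among the listed settings and is added here only so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with roughly 7% positives, mimicking the imbalance only.
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,
    min_samples_split=2,
    min_samples_leaf=50,
    n_jobs=-1,
    class_weight="balanced_subsample",
    bootstrap=True,
    random_state=0,  # not listed in the text; added for repeatability
)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
```

Note that oob_score=True requires bootstrap=True, since out-of-bag samples only exist when each tree is trained on a bootstrap sample.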

Model Assessment

We use AUC as the indicator for assessing the models in this experiment. AUC is defined as the area under the ROC (Receiver Operating Characteristic) curve, so its value cannot exceed 1. The x-axis of the ROC curve is the FPR (False Positive Rate) and the y-axis is the TPR (True Positive Rate). Because the ROC curve normally lies above the line y=x, the value of AUC is between 0.5 and 1. We use AUC as the evaluation standard because the ROC curves alone often do not show clearly which classifier is better, whereas AUC, being a single number, allows the classifiers to be ranked directly.
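AUC also has a simple pairwise reading: it equals the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. The small sketch below computes it that way on a four-sample toy example; the function name auc_by_pairs and the data are made up for illustration.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, scores)  # 3 of the 4 pairs are ordered correctly
print(auc)  # 0.75
```

Only the pair (0.35, 0.4) is ordered wrongly, which is why the value is 3/4. The pairwise loop is quadratic and meant for illustration; a library routine is preferable on 150000 samples.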

The results of comparing the three models (Random Forest, Decision Tree and logistic regression) are shown in Table 6.

Table 6 shows that the Random Forest algorithm has the highest AUC value of the three algorithms. Hence, its predictive performance is better than that of the other two algorithms.
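A comparison of this kind can be set up as sketched below. The sketch runs on synthetic imbalanced data, so its AUC values are not those of Table 6, and the settings of the decision tree and logistic regression models are assumptions, since the description does not state them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (about 7% positives).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, min_samples_leaf=50,
        class_weight="balanced_subsample", random_state=0),
    # Settings for the two baselines are assumed, not taken from the text.
    "Decision Tree": DecisionTreeClassifier(min_samples_leaf=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

auc_values = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC needs a score per sample, so use the predicted default probability.
    auc_values[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, value in auc_values.items():
    print(f"{name}: AUC = {value:.3f}")
```

Evaluating all three models on the same held-out split keeps the AUC values directly comparable.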

Feature Importances of Variables

We use the feature_importances_ attribute of the sklearn.ensemble.RandomForestClassifier class in this experiment; the importance of each feature is listed in Table 7.

Table 7 shows that three variables, RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate, have the highest feature importances and therefore the greatest impact on determining who may default and cause economic loss to lenders. Hence, when granting a loan, companies can consider these features of an applicant to lower the risk.
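Reading and ranking the importances from a fitted model takes only a few lines. The sketch below uses synthetic data and placeholder feature names (feature_0 to feature_9), so its numbers carry no meaning for the loan variables; it only shows the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data; nothing here reproduces Table 7.
feature_names = [f"feature_{i}" for i in range(10)]
X, y = make_classification(n_samples=2000, n_features=len(feature_names),
                           n_informative=4, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based importances, which sum to 1.
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Because the importances are normalized, each value can be read as a variable's share of the total impurity reduction in the forest.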

Conclusion

This paper studied loan default, a common problem in the financial sector, and established a default prediction model using Random Forest with an imbalanced data classification method. The basic idea of Random Forest is that, while each tree is built, a random subset of variables (features) takes part in the split at each tree node; this is repeated many times, keeping the trees independent of one another. For the imbalanced data, parameter adjustment lets Random Forest set the class weights automatically according to the values of y, which effectively addresses the imbalanced data classification problem.

Experiments show that the Random Forest algorithm classifies better than the decision tree and logistic regression models, which makes it a useful reference for loan default prediction in the financial field. In addition, according to the feature importance measure (Table 7), the borrower's revolving utilization of unsecured lines, the number of times 30-59 days past due, and the number of times 90 or more days late are the three features with the greatest influence on eventual default in this experiment. The feature importance measure is also a useful reference for other feature selection problems in data mining.

Claims

1. A financial mining method for credit prediction, wherein the experimental environment used is Anaconda3 with Python 3; first, the data are initially analyzed, mainly by analyzing the distribution of the default rate over each independent variable and generating a frequency distribution table; data preprocessing: when reading the data using the pandas library in Python, the na_values parameter of the function pd.read_csv() is set to a user-defined list so that 0 in the age variable and 96 and 98 in the three overdue variables are treated as NaN values, and the sklearn.preprocessing.Imputer class is then used to replace all NaNs in the data set with the mean of the corresponding columns.