AU2020100709A4 - A method of prediction model based on random forest algorithm - Google Patents
- Publication number
- AU2020100709A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- random forest
- default
- classification
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
When managing credit risk, it is a fundamental and vital task for modern financial institutions to effectively evaluate and identify the potential default risk of borrowers before offering loans and to calculate their default probability. The main objective of this investigation is to statistically analyze the historical loan data of banks and other financial institutions and to establish a loan default prediction model by applying the random forest algorithm together with the idea of unbalanced data classification. According to the experimental results, the random forest algorithm outperforms decision tree and logistic regression classifiers in predictive performance. Additionally, the features most strongly associated with default can be obtained by ranking feature importance with the random forest algorithm, so that the procedure of judging loan risk is optimized. (The remainder of the original abstract page reproduces embedded text from Figure 12, the modeling frame, and Figure 13, the random forest results, including a best cross-validation score of approximately 0.863.)
Description
FIELD OF THE INVENTION
This invention mainly studies how to analyze the historical loan data of banks and other financial institutions with the idea of unbalanced data classification, and how to predict the possibility of loan default based on a random forest classification model.
BACKGROUND OF THE INVENTION
With the rapid development of the world economy and the deepening of China's reform and opening up, loans have become an important way for enterprises and individuals to solve economic problems, whether for the development of enterprises or the change of people's consumption concepts. With the launch of banks' various loan businesses and people's growing demand, the probability of non-performing loans, or loan defaults, has also increased sharply. To avoid loan defaults, banks and other financial institutions evaluate or score the credit risk of borrowers when issuing loans, predict the probability of default, and judge whether to issue the loan based on the results. Effectively evaluating and identifying the potential default risk of borrowers before issuing loans is the basis and an important link of credit risk management for financial institutions. Using a set of scientific models and systems to judge the risk of loan defaults can minimize risks and maximize profits.
2020100709 05 May 2020
Financial technology, also known as fintech, is a fast-evolving field that has reshaped the financial industry, including insurance, investment and banking. Recent years have witnessed the huge success of machine learning and deep learning in machine perception areas such as speech recognition and image analysis, but financial services need more, including prediction and decision-making.
Computing technologies play an important role in the transformation of modern financial services. There are many traditional classification Machine learning algorithms, such as Decision Tree, Support Vector Machine and so on. These algorithms are single classifiers, and they face performance bottlenecks and overfitting problems. Therefore, the method of integrating multiple classifiers to improve the prediction performance came into being, which is Ensemble Learning. Bagging(parallel) and Boosting(serial) are two common Ensemble Learning methods. The difference between them is whether the integration method is parallel or serial. Random Forests are the most representative algorithm in the Bagging integration method.
Random Forest is an Ensemble Learning algorithm based on Decision Tree. Decision Tree is a widely used tree classifier: each node of the tree is split by selecting the optimal splitting feature until a stopping condition of tree construction is reached, for example when the data in a leaf node all belong to the same category. When a sample to be classified is entered, the Decision Tree determines a unique path from the root node to a leaf node, whose category becomes the category of the sample. Decision Tree is a simple and fast non-parametric classification method with generally good accuracy, but when the data is complex, it reaches performance bottlenecks. Random Forest is a machine learning algorithm proposed by Leo Breiman in 2001 by combining Bagging ensemble learning theory with the random subspace method. Random Forest is an Ensemble Learning model based on the
Decision Tree classifier. It contains multiple decision trees trained by the Bagging Ensemble Learning technique. And when the samples to be classified are input, the final classification result is voted on by the output result of a single decision tree. Random Forest solves the performance bottleneck of Decision Tree, has good tolerance to noise and outliers, and has good scalability and parallelism for high-dimensional data classification. In addition, Random Forest is a non-parametric classification method driven by data, which only requires learning and training classification rules for a given sample, without prior knowledge.
Since the training of each decision tree is independent of each other, the training of random forest can be realized through parallel processing, which effectively guarantees the efficiency and expansibility of random forest algorithm.
SUMMARY
This invention mainly studies the loan default problems common in the financial field and uses the random forest method with unbalanced data classification to establish a model for predicting loan defaults. The basic idea of random forest is to randomly select some variables or features to participate in tree node division, to repeat this multiple times, and to ensure the independence of the resulting trees. For unbalanced data, parameter adjustment enables the random forest method to automatically adjust weights based on the y values, thereby effectively solving the classification problem on unbalanced data.
Experiments show that the random forest algorithm has better classification performance than decision tree and logistic regression models, which has important reference significance for loan default prediction in the financial field. In addition, by measuring the importance of each feature, this experiment identifies the borrower's age, debt ratio, and number of real estate and mortgage loans as three characteristics with a large impact on eventual default. This also has reference significance for other feature selection problems in data mining.
DESCRIPTION OF DRAWING
Figure 1 shows the procedure of the project.
Figure 2 shows the detailed steps of the project.
Figure 3 shows the quantity variance between non-default samples and default samples.
Figure 4 shows the dataset overview by using df.describe() code.
Figure 5 visualizes the missing values.
Figure 6 visualizes the outliers in histograms.
Figure 7 shows the boxplots of outliers.
Figure 8 shows the distribution and box line diagram of variable-age.
Figure 9 shows the correlation matrix of variables.
Figure 10 shows the frequency distribution tables obtained by performing statistical analysis on each variable.
Figure 11 shows the Random Forest Algorithm process.
Figure 12 shows the logical frame of modeling.
Figure 13 shows the result of Random Forest Model.
Figure 14 shows the result of XGBoost Model.
Figure 15 shows the importance of each feature.
DESCRIPTION OF PREFERRED EMBODIMENT
Figure 1 and Figure 2 show the procedure of our project and provide its overview.
1. Business understanding
Understanding commercial goals and commercial demands helps us turn them into a data mining problem.
In this invention, analyzing the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and predicting the possibility of loan default are our primary business requirements.
2. Data Preprocessing (The data processing environment used in this invention is Anaconda 3 with Python 3.) Before further work, filtering the required data and defining the meaning and characteristics of the data are necessary.
Dataset observation:
The loan default dataset we use includes a total of 250,000 samples, of which 150,000 samples are selected as the training set and 10,000 as the test set. In the training set, there are 150,000 borrower information samples. Using df[df.SeriousDlqin2yrs == 1], we find that the number of default samples is 10026 (6.684%) and the number of non-default samples is 139974 (93.316%) (Figure 3), which means this is a highly imbalanced dataset.
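The class-balance check described above can be sketched as follows. The miniature DataFrame here is a synthetic stand-in for the real 150,000-row training set, with the same SeriousDlqin2yrs label column:

```python
import pandas as pd

# Hypothetical miniature stand-in for the training set; the real data has
# 150,000 rows and a binary SeriousDlqin2yrs label.
df = pd.DataFrame({"SeriousDlqin2yrs": [1] * 2 + [0] * 28})

# Filter the default rows exactly as in the text: df[df.SeriousDlqin2yrs == 1]
default = df[df.SeriousDlqin2yrs == 1]
n_default = len(default)
pct_default = 100 * n_default / len(df)
```

On the real data this yields 10026 defaults (6.684%); on the toy frame above it yields 2 of 30.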
The dataset includes age, monthly income, family members, loans and several other conditions, with totally 11 variables, in which SeriousDlqin2yrs is label and the other 10 are prediction features. The following table shows the data features:
Variable Name | Description | Type |
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
age | Age of borrower in years | integer |
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
Monthlyincome | Monthly income | real |
NumberOfOpenCreditLinesAndLoans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | integer |
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer |
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer |
Using df.describe() to process the training dataset (Figure 4).
Figure 4 demonstrates that Monthlyincome and NumberOfDependents have missing values, since their counts are below 150,000. The mean of SeriousDlqin2yrs is 0.06684, which indicates that the default rate of the dataset is 6.684%. The minimum value of age is 0, which means there are outliers in age, since banks do not lend to people under 18.
Data cleaning:
Delete the column Unnamed:0 using data = data.drop(data.columns[0], axis=1).
Use data[data['age'] < 18] to find that only one row has an age below 18.
Save samples with age >= 18: data=data[data.age>=18].
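The cleaning steps above can be sketched on a toy frame. The three rows below are synthetic; the column names mimic the raw CSV, whose first column is a leftover index:

```python
import pandas as pd

# Toy frame mimicking the raw CSV: a leftover index column plus age.
data = pd.DataFrame({"Unnamed: 0": [1, 2, 3], "age": [0, 25, 40]})

# Drop the leftover index column (data.columns[0]).
data = data.drop(data.columns[0], axis=1)

# Inspect rows with age below 18, then keep only borrowers aged 18+.
underage = data[data["age"] < 18]
data = data[data.age >= 18]
```

After these steps the age-0 outlier row is gone and only the age column remains.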
Datasets with missing values:
We use df.isnull().sum() and find that Monthlyincome and NumberOfDependents have missing values, respectively 29731 and 3924.
Unnamed:0 | 0 |
SeriousDlqin2yrs | 0 |
RevolvingUtilizationOfUnsecuredLines | 0 |
age | 0 |
NumberOfTime30-59DaysPastDueNotWorse | 0 |
DebtRatio | 0 |
Monthlyincome | 29731 |
NumberOfOpenCreditLinesAndLoans | 0 |
NumberOfTimes90DaysLate | 0 |
NumberRealEstateLoansOrLines | 0 |
NumberOfTime60-89DaysPastDueNotWorse | 0 |
NumberOfDependents | 3924 |
dtype: int64 |
As can be seen from the table, for Monthlyincome, which has many missing values, a random forest model will be established to fill the gaps; for NumberOfDependents, which has fewer missing values, the missing samples will be directly deleted.
Moreover, matrix plots provide a visualization of missing values (Figure 5). We first replace the variable names with x1 to x10. Light color indicates a small value while dark color indicates a large value. Missing values are shown in red, and we can see that x5 (Monthlyincome) and x10 (NumberOfDependents) have missing values.
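The two missing-value treatments above can be sketched as follows. The data is synthetic, and mean imputation is used here as a simple stand-in for the random-forest imputation model the text describes for Monthlyincome:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 7000.0, np.nan, 6000.0],
    "NumberOfDependents": [0.0, 1.0, np.nan, 2.0, 0.0],
    "age": [30, 40, 50, 60, 70],
})

# The sparsely-missing column: drop the affected rows directly.
df = df.dropna(subset=["NumberOfDependents"])

# The heavily-missing income column: fill it in. Mean imputation is a
# simple stand-in for the random-forest imputer described in the text.
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].mean())
```

After this, no missing values remain in either column.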
Outliers processing:
Outliers are values that deviate greatly from the rest of the data. For example, in statistics, values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are often considered outliers.
Generally there are several ways to perform outlier processing. First, we can use boxplots for univariate detection: the boxplot() function generates boxplots and lists the data points outside the whiskers. Second, LOF (Local Outlier Factor) is also useful. LOF is a density-based algorithm for identifying outliers: it compares the local density of a point with the density of the points distributed around it; if the former is evidently smaller than the latter, the point lies in a relatively sparse area and is flagged as an outlier. Third, cluster detection: aggregate the data into different classes, and select data that belongs to no class as outliers. The k-means algorithm is usually exploited in this case.
The minimum age is 0, which is an outlier. In the variables NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate, there are values of approximately 96 and 98; these may be outliers or behavior codes (Figure 6). Through the boxplots, we can see that the three overdue indicators (30-59 days overdue, 60-89 days overdue, 90 days overdue) have severe outliers (Figure 7). It is verified that the 99% quantiles of these three indicators diverge strongly from the maximum values, indicating abnormalities. The dimensions are all (225, 11), so it can be inferred that the abnormal values of these three indicators occur in the same rows.
When reading the data with pandas in Python, set the na_values parameter of pd.read_csv() to a user-defined list: 0 for the age variable, and 96 and 98 for the three overdue variables, so that these values become NaN. Then use the sklearn.preprocessing.Imputer class to replace all NaNs in the dataset with the mean of the corresponding column.
Frequency tables:
Examine the distribution of the default rate over each independent variable to generate a
table of frequency distributions. We start with RevolvingUtilizationOfUnsecuredLines, using the code data_tmp = data[['SeriousDlqin2yrs','RevolvingUtilizationOfUnsecuredLines']], and then add a label column marking the interval to which each row belongs. The pandas cut() function is used to convert continuous variables to categorical ones, for example mapping values [1, 2, ..., 100] to intervals such as [1-10], [11-20].
In order to calculate the ratio of defaulters, we need the total number of borrowers (the number of rows in data_tmp) and the pandas pivot_table() function to generate the summary table. What we want is the number of defaults divided by the total number of people in each interval.
The next step is to write the process of generating the frequency table into a function:
Get the total number of people: total_size = data_tmp.shape[0]. Calculate the number of defaults divided by the total number of people in each interval with pandas' pivot_table(). Then add the derived columns step by step, rename them, and adjust the table format with reindex_axis().
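A minimal version of this frequency-table procedure, on synthetic data (the age bins below mirror the table that follows; the counts are illustrative only):

```python
import pandas as pd

data_tmp = pd.DataFrame({
    "SeriousDlqin2yrs": [1, 0, 0, 1, 0, 0],
    "age": [22, 30, 41, 52, 58, 70],
})

# pd.cut converts the continuous variable into labelled intervals.
data_tmp["label"] = pd.cut(
    data_tmp["age"], bins=[0, 25, 35, 45, 55, 65, 120],
    labels=["<=25", "26-35", "36-45", "46-55", "56-65", ">65"],
)

# pivot_table summarises per-bin totals and default counts.
freq = pd.pivot_table(
    data_tmp, index="label", values="SeriousDlqin2yrs",
    aggfunc=["count", "sum"], observed=False,
)
freq.columns = ["number", "defaults"]
freq["percent"] = 100 * freq["number"] / len(data_tmp)
freq["default_rate"] = 100 * freq["defaults"] / freq["number"]
```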
The frequency distribution table of age:
age | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1 :below 25 | 3027 | 2.018 | 338 | 11.166 |
2:26-35 | 18458 | 12.305 | 2053 | 11.123 |
3:36-45 | 29819 | 19.879 | 2628 | 8.813 |
4:46-55 | 36690 | 24.460 | 2786 | 7.593 |
5:56-65 | 33406 | 22.271 | 1531 | 4.583 |
6: above 65 | 28599 | 19.066 | 690 | 2.413 |
From the table we can see that the default rate of people below 35 years old is over 10%. As age increases, the default rate falls. The distribution and boxplot details are shown in Figure 8.
The frequency distribution table of DebtRatio:
DebtRatio | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1: below 0.25 | 52361.0 | 34.908 | 3126 | 5.970 |
2:0.25-0.5 | 41346.0 | 27.564 | 2529 | 6.117 |
3:0.5-0.75 | 15728.0 | 10.485 | 1484 | 9.435 |
4:0.75-1.0 | 5427.0 | 3.618 | 596 | 10.982 |
5:1.0-2.0 | 4092.0 | 2.728 | 539 | 13.172 |
6:above 2 | 31045.0 | 20.697 | 1752 | 5.643 |
As the debt ratio increases, the interval default rate increases as well. The default rate is highest among those with a debt ratio of 1-2, but when the debt ratio exceeds 2, the default rate drops.
The frequency distribution table of NumberOfOpenCreditLinesAndLoans:
NumberOfOpenCreditLinesAndLoans | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1 :below 5 | 46590 | 31.060 | 3922 | 8.418 |
2:6-10 | 60399 | 40.266 | 3345 | 5.538 |
3:11-15 | 29184 | 19.456 | 1804 | 6.181 |
4:16-20 | 9846 | 6.564 | 676 | 6.866 |
5:21-25 | 2841 | 1.894 | 191 | 6.723 |
6:26-30 | 785 | 0.523 | 62 | 7.898 |
7: above 30 | 354 | 0.236 | 26 | 7.345 |
The default rate is distributed relatively evenly across the intervals.
The frequency distribution table of NumberRealEstateLoansOrLines:
NumberRealEstateLoansOrLines | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1 :below 5 | 149206 | 99.471 | 9884 | 6.624 |
2:6-10 | 699 | 0.466 | 121 | 17.310 |
3:11-15 | 70 | 0.047 | 16 | 22.857 |
4:16-20 | 14 | 0.009 | 3 | 21.429 |
5:above 20 | 10 | 0.007 | 2 | 20.000 |
99.47% of borrowers own less than 5 real estate and mortgage loans; the default rate of people with more than 5 loans has increased significantly.
The frequency distribution table of NumberOfTime30-59DaysPastDueNotWorse:
NumberOfTime30-59DaysPastDueNotWorse | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1:0 | 142050 | 94.701 | 7450 | 5.245 |
2:1 | 4598 | 3.065 | 1219 | 26.512 |
3:2 | 1754 | 1.169 | 618 | 35.234 |
4:3 | 747 | 0.498 | 318 | 42.570 |
5:4 | 342 | 0.228 | 154 | 45.029 |
6:5 | 140 | 0.093 | 74 | 52.857 |
7:6 | 54 | 0.036 | 28 | 51.852 |
8:7and above | 314 | 0.209 | 165 | 52.548 |
For borrowers who have never been 30-59 days past due, the default rate is only about 5%. As the number of past-due occurrences increases, the default ratio continues to rise. The default rates for the two variables 60-89 days overdue and 90 days overdue show the same trend. Therefore, whether a default has occurred before is an important variable for determining whether a default will occur in the future.
The frequency distribution table of Monthlyincome:
Monthlyincome | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1: below 5000 | 55859.0 | 37.240 | 4813 | 8.616 |
2:5000-10000 | 46090.0 | 30.727 | 2752 | 5.971 |
3: 10000-15000 | 13035.0 | 8.690 | 547 | 4.196 |
4:above 15000 | 5284.0 | 3.523 | 245 | 4.637 |
The higher the income, the lower the default rate. The Monthlyincome column has missing data and can only be used as a reference.
The frequency distribution table of NumberOfDependents:
NumberOfDependents | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1:0 | 113218.0 | 75.479 | 7030 | 6.209 |
2:1 | 19521.0 | 13.014 | 1584 | 8.114 |
3:2 | 9483.0 | 6.322 | 837 | 8.826 |
4:3 | 2862.0 | 1.908 | 297 | 10.377 |
5:4 | 746.0 | 0.497 | 68 | 9.115 |
6:5 and more | 245.0 | 0.163 | 31 | 12.653 |
There is not much difference in default rates among people with different numbers of family members.
The dataset used in this experiment has 10 predictor variables. Figure 9 shows that the correlations between the variables are small. We perform statistical analysis on each variable and obtain the frequency distribution tables shown above (Figure 10). Except for NumberOfOpenCreditLinesAndLoans (the number of open credit lines and loans), which has no apparent correlation with the default rate, these variables are all related to whether the borrower eventually defaults.
3. Algorithm selection and modeling
Understanding the characteristics of this dataset: unbalanced data classification
Unbalanced data, that is, data in which one class far exceeds the other (minority) class, is widespread in many fields such as network intrusion detection, financial fraud detection, and text classification. The classification problem on unbalanced data can be addressed through penalty weights on positive and negative samples: in the algorithm implementation, different weights are assigned to classes with different sample sizes. Generally, classes with small sample sizes receive high weights and classes with large sample sizes receive low weights before calculation and modeling.
Random forest algorithm
Random forest
The Random Forest algorithm builds a forest in a random manner. It is a combination learning algorithm based on decision trees. The basic idea of random forest is that during the construction of each single tree, some variables or features are randomly selected to participate in the tree node division; this is repeated multiple times, and the independence between the trees is guaranteed. After the random forest is obtained, when a new input sample enters, each decision tree in the forest judges the sample and outputs the class it belongs to; the class with the highest number of votes in the entire forest is predicted as the class of the sample. The process is shown in Figure 11.
Principles and Features of Random Forest Algorithm
The Random Forest algorithm includes classification and regression problems. The algorithm steps are as follows:
Random Forest
Input:
T = Training Set
Ntree=Number of decision trees in the forest
M = number of predictors in each sample
Mtry = number of variables participating in the division at each tree node
Ssampsize = bootstrap sample size
Algorithm process:
For (itree = 1; itree <= Ntree; itree++): • Use training set T to generate a Bootstrap data sample of size Ssampsize.
• Use the generated Bootstrap data to construct an unpruned tree itree. In the process of generating a tree itree, randomly select Mtry variables from M and select the best one to branch according to some standard (Gini).
Output:
• Regression problem: use the average of the output values of all trees as the prediction result.
• Classification problem: use the classification results of most decision trees as prediction results.
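The loop above can be sketched in a few lines of Python on synthetic data. For brevity this sketch draws the Mtry random features once per tree rather than at every node as the algorithm specifies, and uses scikit-learn's DecisionTreeClassifier for the unpruned base trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic label

ntree, mtry = 25, 3
trees, feats = [], []
for _ in range(ntree):
    rows = rng.integers(0, len(X), len(X))               # bootstrap sample
    cols = rng.choice(X.shape[1], mtry, replace=False)   # random subspace
    t = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    trees.append(t)
    feats.append(cols)

# Classification output: majority vote over the forest.
votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, feats)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
acc = (pred == y).mean()
```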
The random forest has the following characteristics. As can be seen from the above algorithm process, the randomness of the random forest is reflected in two aspects: the randomness of the data space is realized by Bagging (Bootstrap Aggregating), and the randomness of the feature space is realized by random feature sampling (Random Subspace). For classification problems, each decision tree in the random forest classifies new samples, and the decision results of these trees are then combined to give the final classification result.
The random forest algorithm also has the following advantages:
• The introduction of randomness in rows (data records) and columns (variables) in the data makes it difficult for random forests to fall into overfitting.
• Random forest has good anti-noise ability.
• When there are a large number of missing values in the dataset, the random forest can effectively estimate and handle them.
• Strong adaptability to the dataset: can handle both discrete data and continuous data, the data set does not need to be standardized.
• It is possible to rank the importance of variables, which facilitates their interpretation. There are two methods for computing variable importance in random forests. One is based on the mean decrease in accuracy on the OOB (out-of-bag) samples: while growing a decision tree, first use the OOB samples to test it and record the number of misclassified samples; then randomly permute the values of one variable in those samples, predict again with the same tree, and record the number of misclassified samples once more. The difference between the two error counts divided by the total number of OOB samples is the error-rate change for that tree, and averaging this change over all trees in the forest gives the mean decrease in accuracy. The other method is based on the decrease in Gini impurity during splitting: the random forest grows decision trees by splitting nodes according to the decrease in Gini impurity, and the Gini decreases of all nodes in the forest that select a given variable as the splitting variable are summed.
Random forest method for unbalanced data classification
The random forest algorithm defaults the weight of each class to 1, which assumes that the misclassification cost of all classes is equal. In scikit-learn, the random forest algorithm provides a class_weight parameter, whose value can be a list or dictionary, to manually specify the weight of each class. If the parameter is set to "balanced", the random forest algorithm uses the y values to adjust the weights automatically, and each class weight is inversely proportional to the class frequency in the input data.
The calculation formula is: weight(c) = n_samples / (n_classes × n_samples_in_class_c).
The "balanced_subsample" mode is similar to the "balanced" mode, except that the weights are computed from the bootstrap sample of each tree instead of from the full training set. Therefore, we can solve the problem of unbalanced data classification through this method.
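The "balanced" weight heuristic can be computed directly. The 93/7 split below mirrors the approximate class proportions of this dataset:

```python
import numpy as np

# scikit-learn's "balanced" heuristic:
#   weight(c) = n_samples / (n_classes * count(c)),
# i.e. each class weight is inversely proportional to its frequency.
y = np.array([0] * 93 + [1] * 7)   # roughly the 93%/7% split of this dataset
n_samples, n_classes = len(y), 2
weights = n_samples / (n_classes * np.bincount(y))
# The minority class receives the much larger weight.
```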
Modeling and Results
Figure 12 shows the logical frame.
We can use sklearn.ensemble.RandomForestClassifier in Python to build a random forest model.
Part of the parameters are set as:
n_estimators: number of decision trees, 100
oob_score: whether to use out-of-bag data, True
min_samples_split: minimum number of samples required to split a node, 2
min_samples_leaf: minimum number of samples at a leaf node, 50
n_jobs: number of parallel jobs, -1 (start as many jobs as the number of CPU cores)
class_weight: "balanced_subsample"; use the y values to automatically adjust weights, each class weight being inversely proportional to the class frequency in the input data
bootstrap: whether to use bootstrap sampling, True
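The parameter settings listed above, expressed as a constructor call:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,                     # 100 decision trees
    oob_score=True,                       # use out-of-bag data
    min_samples_split=2,
    min_samples_leaf=50,
    n_jobs=-1,                            # one job per CPU core
    class_weight="balanced_subsample",    # reweight classes per bootstrap
    bootstrap=True,
)
```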
Algorithms:
Firstly, load training and test datasets and preprocess data.
Then, split the training data into training_new and test_new for validation.
Next, impute the data: replace missing values with the column mean.
Finally, build the Random Forest model with training_new:
• deal with imbalanced data distribution.
• perform parameter tuning using grid search with CrossValidation.
• output the best model and predict for test data.
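The steps above can be sketched end to end. The data here is synthetic (with an injected ~7% positive rate and missing values), the parameter grid is illustrative rather than the tuned one, and SimpleImputer stands in for the older sklearn.preprocessing.Imputer named earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.07).astype(int)      # ~7% positives, like the dataset
X[rng.random(X.shape) < 0.05] = np.nan        # inject missing values

# Split for validation, then impute NaNs with the column mean.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
imp = SimpleImputer(strategy="mean")
X_tr, X_te = imp.fit_transform(X_tr), imp.transform(X_te)

# Grid search with cross-validation, scored by AUC as in the evaluation section.
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced_subsample", random_state=0),
    {"max_depth": [2, 4], "n_estimators": [50]},
    scoring="roc_auc", cv=3,
)
grid.fit(X_tr, y_tr)
best = grid.best_estimator_                   # predict on X_te with this model
```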
Result shown in Figure 13
Extensive Model: XGBoost model
XGBoost is one of the Boosting algorithms. The idea of Boosting is to integrate many weak classifiers to form a more powerful classifier. XGBoost is a boosted tree model: it integrates many tree models, each a CART regression tree, to form a stronger classifier. XGBoost is an improvement on GBDT, making it more powerful and applicable to a wider range of problems.
Algorithm:
Firstly, load training and test datasets and preprocess data.
Secondly, split the training data into training_new and test_new for validation.
Thirdly, build the XGBoost model with the training_new data:
• deal with missing values and imbalanced data distribution.
• perform parameter tuning using grid search with CrossValidation.
• output the best model and predict for test data.
Result shown in Figure 14.
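The boosting idea can be illustrated without the xgboost package itself; scikit-learn's GradientBoostingClassifier is used below as a conceptual stand-in (both fit shallow regression trees sequentially, each correcting its predecessors' errors), on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced data (~10% positives) as a stand-in for the loan set.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# 50 sequential depth-3 trees, each fit to the residual errors of the ensemble.
gb = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gb.fit(X, y)
train_acc = gb.score(X, y)
```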
4.Model Evaluation
The model evaluation index used in this experiment is the AUC value, defined as the area under the ROC (receiver operating characteristic) curve; this value cannot exceed 1. The horizontal axis of the ROC curve is the false positive rate (FPR), and the vertical axis is the true positive rate (TPR). Since the ROC curve usually lies above the y = x line, the AUC value ranges from 0.5 to 1. We use the AUC value as the evaluation criterion because the ROC curve alone often cannot clearly indicate which classifier performs better, whereas as a single number, a larger AUC indicates a better classifier.
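Computing the AUC from predicted default probabilities is a one-liner; the five labels and scores below are illustrative. The AUC equals the fraction of (positive, negative) pairs the scores rank correctly, here 5 of 6:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]             # 1 = default
y_score = [0.1, 0.45, 0.35, 0.4, 0.8]  # predicted default probabilities
auc = roc_auc_score(y_true, y_score)   # 5 of 6 pairs ranked correctly
```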
We compared the random forest model with the XGBoost model, the logistic regression model, and the decision tree model. The results are as follows:
Algorithm | AUC value |
Random Forest | 0.86 |
XGBoost | 0.86 |
Logistic Regression | 0.80 |
Decision Tree | 0.80 |
According to the table, the AUC of the random forest is higher than the logistic regression and decision tree, which is very close to XGBoost. The higher the AUC, the better the prediction accuracy.
Evaluation of Feature Importance
The feature_importances_ attribute of sklearn.ensemble.RandomForestClassifier is used in this experiment. The importance of each feature is shown in Figure 15.
As can be seen from Figure 15, RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate are the three most important features. They have a great impact on whether a default eventually occurs, so we should pay special attention to these characteristics of borrowers when processing loan applications.
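Ranking features by feature_importances_ looks like this. The feature names are the dataset's, but the data below is synthetic and constructed so that the first feature dominates:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)        # only the first feature matters here
names = ["RevolvingUtilizationOfUnsecuredLines",
         "NumberOfTime30-59DaysPastDueNotWorse",
         "NumberOfTimes90DaysLate"]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair names with importances and sort, most important first.
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda p: p[1], reverse=True)
```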
Claims (1)
- What we claim is: 1. A method of prediction model based on random forest algorithm, characterized in that: business understanding: understanding commercial goals and commercial demands helps us turn them into a data mining problem; analyzing the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and predicting the possibility of loan default are our primary business requirements; data preprocessing: before further work, filtering the required data and defining the meaning and characteristics of the data are necessary; random forest algorithm: the Random Forest algorithm builds a forest in a random manner; it is a combination learning algorithm based on decision trees; the basic idea of random forest is that during the construction of each single tree, some variables or features are randomly selected to participate in the tree node division, this is repeated multiple times, and the independence between the trees is guaranteed; after the random forest is obtained, when a new input sample enters, each decision tree in the forest judges the sample and outputs the class it belongs to, and the class with the highest number of votes in the entire forest is predicted as the class of the sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100709A AU2020100709A4 (en) | 2020-05-05 | 2020-05-05 | A method of prediction model based on random forest algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100709A AU2020100709A4 (en) | 2020-05-05 | 2020-05-05 | A method of prediction model based on random forest algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020100709A4 true AU2020100709A4 (en) | 2020-06-11 |
Family
ID=70976398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020100709A Ceased AU2020100709A4 (en) | 2020-05-05 | 2020-05-05 | A method of prediction model based on random forest algorithm |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2020100709A4 (en) |
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753790A (en) * | 2020-07-01 | 2020-10-09 | 武汉楚精灵医疗科技有限公司 | Video classification method based on random forest algorithm |
CN111767958A (en) * | 2020-07-01 | 2020-10-13 | 武汉楚精灵医疗科技有限公司 | Real-time enteroscopy withdrawal time monitoring method based on random forest algorithm |
CN111798984A (en) * | 2020-07-07 | 2020-10-20 | 章越新 | Disease prediction scheme based on Fourier transform |
CN111951027A (en) * | 2020-08-14 | 2020-11-17 | 上海冰鉴信息科技有限公司 | Enterprise identification method and device with fraud risk |
CN111950795A (en) * | 2020-08-18 | 2020-11-17 | 安徽中烟工业有限责任公司 | Random forest based method for predicting water adding proportion of loosening and conditioning |
CN112100902A (en) * | 2020-08-10 | 2020-12-18 | 西安交通大学 | Lithium ion battery service life prediction method based on stream data |
CN112115991A (en) * | 2020-09-09 | 2020-12-22 | 福建新大陆软件工程有限公司 | Mobile terminal switching prediction method, device, equipment and readable storage medium |
CN112132187A (en) * | 2020-08-27 | 2020-12-25 | 上海大学 | Method for rapidly judging perovskite structure stability based on random forest |
CN112308146A (en) * | 2020-11-02 | 2021-02-02 | 国网福建省电力有限公司 | Distribution transformer fault identification method based on operation characteristics |
CN112465245A (en) * | 2020-12-04 | 2021-03-09 | 复旦大学青岛研究院 | Product quality prediction method for unbalanced data set |
CN112487262A (en) * | 2020-11-25 | 2021-03-12 | 建信金融科技有限责任公司 | Data processing method and device |
CN112530546A (en) * | 2020-12-14 | 2021-03-19 | 重庆邮电大学 | Psychological pre-judging method and system based on K-means clustering and XGboost algorithm |
CN112733903A (en) * | 2020-12-30 | 2021-04-30 | 许昌学院 | Air quality monitoring and alarming method, system, device and medium based on SVM-RF-DT combination |
CN112766550A (en) * | 2021-01-08 | 2021-05-07 | 佰聆数据股份有限公司 | Power failure sensitive user prediction method and system based on random forest, storage medium and computer equipment |
CN112883564A (en) * | 2021-02-01 | 2021-06-01 | 中国海洋大学 | Water body temperature prediction method and prediction system based on random forest |
CN112907359A (en) * | 2021-03-24 | 2021-06-04 | 四川奇力韦创新科技有限公司 | Bank loan business qualification auditing and risk control system and method |
CN112990284A (en) * | 2021-03-04 | 2021-06-18 | 安徽大学 | Individual trip behavior prediction method, system and terminal based on XGboost algorithm |
CN113127342A (en) * | 2021-03-30 | 2021-07-16 | 广东电网有限责任公司 | Defect prediction method and device based on power grid information system feature selection |
CN113139876A (en) * | 2021-04-22 | 2021-07-20 | 平安壹钱包电子商务有限公司 | Risk model training method and device, computer equipment and readable storage medium |
CN113159615A (en) * | 2021-05-10 | 2021-07-23 | 麦荣章 | Intelligent information security risk measuring system and method for industrial control system |
CN113205271A (en) * | 2021-05-12 | 2021-08-03 | 国家税务总局山东省税务局 | Method for evaluating enterprise income tax risk based on machine learning |
CN113221972A (en) * | 2021-04-26 | 2021-08-06 | 西安电子科技大学 | Unbalanced hyperspectral data classification method based on weighted depth random forest |
CN113282886A (en) * | 2021-05-26 | 2021-08-20 | 北京大唐神州科技有限公司 | Bank loan default judgment method based on logistic regression |
CN113326664A (en) * | 2021-06-28 | 2021-08-31 | 南京玻璃纤维研究设计院有限公司 | Method for predicting dielectric constant of glass based on M5P algorithm |
CN113392585A (en) * | 2021-06-10 | 2021-09-14 | 京师天启(北京)科技有限公司 | Method for spatializing sensitive people around polluted land |
CN113470819A (en) * | 2021-07-23 | 2021-10-01 | 湖南工商大学 | Early prediction method for adverse event of pressure sore of small unbalanced sample based on random forest |
CN113570191A (en) * | 2021-06-21 | 2021-10-29 | 天津大学 | Intelligent diagnosis method for river ice plug dangerous situations in ice flood |
CN113592058A (en) * | 2021-07-05 | 2021-11-02 | 西安邮电大学 | Method for quantitatively predicting microblog forwarding breadth and depth |
CN113658680A (en) * | 2021-07-29 | 2021-11-16 | 广西友迪资讯科技有限公司 | Random forest based method for evaluating withdrawal effect of drug addicts |
CN113657452A (en) * | 2021-07-20 | 2021-11-16 | 中国烟草总公司郑州烟草研究院 | Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning |
CN113705904A (en) * | 2021-08-31 | 2021-11-26 | 国网上海市电力公司 | Chemical plant area power utilization fault prediction method based on random forest algorithm |
CN113822536A (en) * | 2021-08-26 | 2021-12-21 | 国网河北省电力有限公司邢台供电分公司 | Power distribution network index evaluation method based on branch definition algorithm |
CN113837863A (en) * | 2021-09-27 | 2021-12-24 | 上海冰鉴信息科技有限公司 | Business prediction model creation method and device and computer readable storage medium |
CN114154561A (en) * | 2021-11-15 | 2022-03-08 | 国家电网有限公司 | Electric power data management method based on natural language processing and random forest |
CN114426894A (en) * | 2020-09-29 | 2022-05-03 | 中国石油化工股份有限公司 | Natural gas hydrate phase equilibrium pressure prediction method based on machine learning |
CN114492929A (en) * | 2021-12-23 | 2022-05-13 | 江南大学 | XGboost-based financial credit enterprise credit prediction method |
CN114679779A (en) * | 2022-03-22 | 2022-06-28 | 安徽理工大学 | WIFI positioning method based on improved KNN fusion random forest algorithm |
CN114710326A (en) * | 2022-03-15 | 2022-07-05 | 国网甘肃省电力公司电力科学研究院 | Intrusion detection method based on GC-Forest |
CN115032720A (en) * | 2022-07-15 | 2022-09-09 | 国网上海市电力公司 | Application of multi-mode integrated forecast based on random forest in ground air temperature forecast |
CN115064263A (en) * | 2022-06-08 | 2022-09-16 | 华侨大学 | Alzheimer's disease prediction method based on random forest pruning brain region selection |
CN115907483A (en) * | 2023-01-06 | 2023-04-04 | 山东蜂鸟物联网技术有限公司 | Personnel risk assessment early warning method |
CN116090834A (en) * | 2023-03-07 | 2023-05-09 | 安徽农业大学 | Forestry management method and device based on Flink platform |
CN116226767A (en) * | 2023-05-08 | 2023-06-06 | 国网浙江省电力有限公司宁波供电公司 | Automatic diagnosis method for experimental data of power system |
CN116364178A (en) * | 2023-04-18 | 2023-06-30 | 哈尔滨星云生物信息技术开发有限公司 | Somatic cell sequence data classification method and related equipment |
CN116823014A (en) * | 2023-04-06 | 2023-09-29 | 南京邮电大学 | Method for realizing enterprise employee performance automatic scoring service |
CN116861800A (en) * | 2023-09-04 | 2023-10-10 | 青岛理工大学 | Oil well yield increasing measure optimization and effect prediction method based on deep learning |
CN117113291A (en) * | 2023-10-23 | 2023-11-24 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Analysis method for importance of production parameters in semiconductor manufacturing |
CN117150389A (en) * | 2023-07-14 | 2023-12-01 | 广州易尊网络科技股份有限公司 | Model training method, carrier card activation prediction method and equipment thereof |
CN117370899A (en) * | 2023-12-08 | 2024-01-09 | 中国地质大学(武汉) | Ore control factor weight determining method based on principal component-decision tree model |
CN117540830A (en) * | 2024-01-05 | 2024-02-09 | 中国地质科学院探矿工艺研究所 | Debris flow susceptibility prediction method, device and medium based on fault distribution index |
CN114679779B (en) * | 2022-03-22 | 2024-04-26 | 安徽理工大学 | WIFI positioning method based on improved KNN fusion random forest algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020100709A4 (en) | A method of prediction model based on random forest algorithm | |
Khandani et al. | Consumer credit-risk models via machine-learning algorithms | |
AU2020101475A4 (en) | A Financial Data Analysis Method Based on Machine Learning Models | |
AU2019101189A4 (en) | A financial mining method for credit prediction | |
Van Thiel et al. | Artificial intelligence credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era | |
Zurada et al. | Comparison of the performance of several data mining methods for bad debt recovery in the healthcare industry | |
Chern et al. | A decision tree classifier for credit assessment problems in big data environments | |
Van Thiel et al. | Artificial intelligent credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era | |
CN113095927A (en) | Method and device for identifying suspicious transactions of anti-money laundering | |
Valavan et al. | Predictive-Analysis-based Machine Learning Model for Fraud Detection with Boosting Classifiers. | |
Hidayattullah et al. | Financial statement fraud detection in Indonesia listed companies using machine learning based on meta-heuristic optimization | |
Mirtalaei et al. | A trust-based bio-inspired approach for credit lending decisions | |
Koç et al. | Consumer loans' first payment default detection: a predictive model | |
Ke et al. | Loan repayment behavior prediction of provident fund users using a stacking-based model | |
Naik | Predicting credit risk for unsecured lending: A machine learning approach | |
Datkhile et al. | Statistical modelling on loan default prediction using different models | |
Becha et al. | Use of Machine Learning Techniques in Financial Forecasting | |
Dasari et al. | Prediction of bank loan status using machine learning algorithms | |
Jin et al. | Financial credit default forecast based on big data analysis | |
Zeng | A comparison study on the era of internet finance China construction of credit scoring system model | |
Panyagometh | Impact of baseline population on credit score’s predictive power | |
Zurada | Rule Induction Methods for Credit Scoring | |
Ruan et al. | Personal credit risk identification based on combined machine learning model | |
Niu et al. | Comparison of different individual credit risk assessment models | |
Salihu et al. | Data Mining Based Classifiers for Credit Risk Analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |