AU2019101189A4 - A financial mining method for credit prediction - Google Patents

A financial mining method for credit prediction

Info

Publication number
AU2019101189A4
Authority
AU
Australia
Prior art keywords
data
random forest
default
classification
loan
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101189A
Inventor
Ming Han
Shan Jiang
Ziyan LI
Junyi Ren
Chuyi Xiao
Xinxin Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Han Ming Miss
Jiang Shan Miss
Li Ziyan Miss
Zhang Xinxin Miss
Original Assignee
Han Ming Miss
Jiang Shan Miss
Li Ziyan Miss
Zhang Xinxin Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Han Ming Miss, Jiang Shan Miss, Li Ziyan Miss, Zhang Xinxin Miss
Priority to AU2019101189A
Application granted
Publication of AU2019101189A4
Ceased
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

Evaluating and identifying a borrower's potential default risk, or calculating the borrower's default probability, before a loan is issued is the basis of, and a significant link in, the credit risk management of modern financial institutions. This paper studies the statistical analysis of the historical loan data of banks and other financial institutions using the idea of imbalanced (non-equilibrium) data classification, and establishes a loan default prediction model that employs the Random Forest algorithm. The results show that the Random Forest algorithm outperforms the decision tree and logistic regression algorithms in predictive performance. In addition, ranking the features by importance with the Random Forest algorithm identifies the features that have the greatest impact on the final default, which supports more effective judgment of lending risk in the financial field. Index Terms: Random Forest, loan default prediction, data mining
[Figure 1: flow chart. Introduces unbalanced data classification and Random Forest algorithm; data preprocessing and data analysis; compares models of three different algorithms; conclusion: Random Forest algorithm has better performance; summarizes the paper.]
[Figure 2: panel labels: classification results.]

Description

A financial mining method for credit prediction
FIELD OF THE INVENTION
This invention is in the field of financial big data.
BACKGROUND
With the vigorous development of the world economy and the gradual deepening of China's reform and opening up, loans have become an important way for enterprises and individuals to solve financial problems, driven both by enterprise development and by changes in consumer attitudes. As banks have introduced a variety of loan products and demand has grown, non-performing loans, that is, loan defaults, have also proliferated. To avoid defaults, banks and other financial institutions evaluate or score a borrower's credit risk when making a loan, predict the probability of default, and decide whether to lend according to the result. Effectively evaluating and identifying a potential borrower's default risk before granting a loan is the basis of, and an important link in, credit risk management at financial institutions. Determining the risk of loan default with a scientific model and system can minimise risk and maximise profit.
2019101189 02 Oct 2019
SUMMARY
This paper studies how to apply the ideas of unbalanced data classification to the analysis of the historical loan data of banks and other financial institutions, and how to predict the likelihood of default with a Random Forest classification model. The first section introduces unbalanced data classification and the Random Forest algorithm. The second section covers data preprocessing and data analysis. The third section constructs a Random Forest classification model to forecast loan defaults, reports the results and AUC value of this model, and compares the Random Forest algorithm with the decision tree and logistic regression algorithms, concluding that the Random Forest algorithm performs better. It then evaluates the importance of each feature to determine which characteristics most influence the final default. The fourth section summarizes the full text.
Table 1 Default classification based on Random Forests
T = training set
Ntree = the number of decision trees
M = the number of expected variables in each sample
Mtry = the number of variables participating in the split at each tree node
Ssampsize = the sample size of the Bootstrap sample
The computation process:
For (itree = 1; itree <= Ntree; itree++) {
1. Generate a Bootstrap sample of size Ssampsize from the training set T.
2. Build an unpruned tree itree from the Bootstrap sample. While growing itree, randomly choose Mtry variables at each node and split on the best one according to the Gini value.
}
Output:
Regression problems: the prediction is the average of the values returned by all trees.
Classification problems: the prediction is the class chosen by the majority of the decision trees.
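The loop in Table 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the helper names `fit_forest` and `predict_forest`, the synthetic data, and the choice of scikit-learn's `DecisionTreeClassifier` as the base tree are all hypothetical, and labels are assumed to be 0/1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_tree=25, m_try=2, samp_size=None, seed=0):
    """Grow n_tree unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    samp_size = samp_size or len(X)
    trees = []
    for i_tree in range(n_tree):
        # Step 1: bootstrap sample (drawn with replacement) from training set T.
        idx = rng.integers(0, len(X), size=samp_size)
        # Step 2: unpruned tree; max_features=m_try draws Mtry candidate
        # variables at every node and splits on the best one by Gini.
        tree = DecisionTreeClassifier(criterion="gini", max_features=m_try,
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Classification output: majority vote over the trees (labels 0/1)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_tree, n_rows)
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Synthetic demonstration data (an assumption, not the loan data set).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
forest = fit_forest(X_demo, y_demo)
train_accuracy = (predict_forest(forest, X_demo) == y_demo).mean()
```

For regression, the vote in `predict_forest` would be replaced by the mean of the trees' predictions, as the Output rows of Table 1 state.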
Table 2 Data set variables
Variable name | Variable description | Type
SeriousDlqin2yrs | Whether default | Y/N
RevolvingUtilizationOfUnsecuredLines | The total amount of credit card and personal credit loan (excluding mortgages, installment payments like car loans, etc.) divided by the sum of credit lines | Percentage
age | Borrower age | Integer
NumberOfTime30-59DaysPastDueNotWorse | The number of times the borrower has been overdue for 30-59 days in the past two years | Integer
DebtRatio | Monthly debt repayments, alimony, living costs, etc. divided by total monthly income | Percentage
MonthlyIncome | Monthly income | Real
NumberOfOpenCreditLinesAndLoans | Number of open loans (instalments such as car loans and mortgages) and credit lines (such as credit cards) | Integer
NumberOfTimes90DaysLate | The number of times the borrower has been overdue for 90 days and over in the past two years | Integer
NumberRealEstateLoansOrLines | Mortgage and real estate loans with mortgage-backed credits | Integer
NumberOfTime60-89DaysPastDueNotWorse | The number of times the borrower has been overdue for 60-89 days in the past two years | Integer
NumberOfDependents | Number of people (spouse, children, etc.) who need to be supported in the family, excluding the borrower | Integer
Table 3 Frequency distribution table of variable age
Age | Number of people | Percentage of age interval | Number of people who defaulted | Percentage of defaulters within age interval
Lower than 25 | 3028 | 2.02% | 338 | 11.16%
26-35 | 18458 | 12.30% | 2053 | 11.12%
36-45 | 29819 | 19.90% | 2628 | 8.80%
46-55 | 36690 | 24.50% | 2786 | 7.60%
56-65 | 33406 | 22.30% | 1531 | 4.60%
Higher than 65 | 28599 | 19.10% | 690 | 2.40%
Table 4 Frequency distribution table of variable NumberRealEstateLoansOrLines
NumberRealEstateLoansOrLines | Number of people | Ratio | Number of defaults | Percentage of defaults in this range
Below 5 | 149207 | 99.47% | 9884 | 6.6%
6-10 | 699 | 0.47% | 121 | 17.3%
11-15 | 70 | 0.05% | 16 | 22.8%
16-20 | 14 | 0.009% | 3 | 21.4%
Above 20 | 10 | 0.007% | 2 | 20%
Table 5 Frequency distribution table of variable NumberOfTime30-59DaysPastDueNotWorse
NumberOfTime30-59DaysPastDueNotWorse | Number of people | Ratio | Number of defaulters | Percentage of defaults in this interval
0 | 126018 | 84% | 5041 | 4%
1 | 16032 | 10.70% | 2409 | 15%
2 | 4598 | 3.10% | 1219 | 26.50%
3 | 1754 | 1.20% | 618 | 35.20%
4 | 747 | 0.50% | 318 | 42.60%
5 | 342 | 0.23% | 154 | 45%
6 | 140 | 0.09% | 74 | 52.90%
7 or more | 104 | 0.07% | 50 | 48.07%
Table 6 Comparison of Random Forest with other algorithms
Algorithm | AUC value
Random Forest | 0.86
Decision Tree | 0.8
Logistic Regression | 0.8
Table 7 Feature importance of each variable
Variable | feature_importance
RevolvingUtilizationOfUnsecuredLines | 0.3411
NumberOfTime30-59DaysPastDueNotWorse | 0.1694
NumberOfTimes90DaysLate | 0.1594
NumberOfTime60-89DaysPastDueNotWorse | 0.0727
age | 0.0677
DebtRatio | 0.0625
MonthlyIncome | 0.0488
NumberOfOpenCreditLinesAndLoans | 0.0442
NumberRealEstateLoansOrLines | 0.0223
NumberOfDependents | 0.0117
DESCRIPTION OF DRAWING
Figure 1 Analysis flow chart of credit forecast
Figure 2 Random Forests
Figure 3 Modeling flowcharts
DESCRIPTION OF PREFERRED EMBODIMENT
Random Forest Algorithm
Imbalanced data classification
Imbalanced data, in which the number of samples of some classes (the majority) far exceeds that of others (the minority), is common in network intrusion detection, financial transaction fraud detection, text classification, and similar tasks. Most of the time we are more interested in classifying the minority. Imbalanced data classification can be addressed by penalty weights on the positive and negative samples: during training, classes are given different weights according to their sample sizes, generally a high weight for the class with few samples and a low weight for the class with many samples, and the model is then fitted with these weights.
Introduction of Random Forest
Random Forest is an ensemble algorithm that builds a forest of randomised decision trees. The main method is to randomly select some variables or features to generate each split, repeat this many times, and keep the trees independent of one another. Once the forest has been built, each decision tree judges a new sample that enters the forest, and the sample is assigned to the class that receives the highest score (the process is visualized in Figure 2).
Random Forest algorithm principle and characteristics
The Random Forest algorithm covers both classification and regression problems; its steps are listed in Table 1.
As the algorithm shows, the randomness of a Random Forest is manifested mainly in two aspects: randomness in the data space, implemented by Bagging (Bootstrap Aggregating), and randomness in the feature space, implemented by random feature sampling (Random Subspace). For classification problems, each decision tree in the forest classifies the new sample, and the decisions of the trees are then aggregated to give the final classification of the sample. Random Forests have the following features:
1. Randomness is introduced in both the rows (data records) and the columns (variables), so the Random Forest is not prone to overfitting.
2. Random Forest has good resistance to noise.
3. When the data set contains a large number of missing values, Random Forests can effectively estimate and handle them.
4. It adapts well to different data sets: it can process both discrete and continuous data, and the data set does not need to be normalized.
5. It can rank variables by importance, which makes the variables easy to interpret. There are two methods for calculating variable importance in Random Forests. The first is based on the mean decrease in accuracy on OOB (Out of Bag) samples: while growing a decision tree, the OOB samples are predicted and the number of misclassified samples is recorded; the values of one variable's column in the OOB samples are then randomly permuted, the decision tree predicts again, and the number of misclassified samples is recorded again. The difference between the two error counts divided by the total number of OOB samples is the change in error rate for this decision tree, and averaging this change over all trees in the Random Forest gives the mean decrease in accuracy. The second is based on the decrease in Gini impurity at each split: the trees in a Random Forest split nodes according to the decrease in Gini impurity, and summing the decreases over all nodes in the forest that use a given variable as the split variable gives that variable's Gini decrease.
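The two importance measures can be illustrated with a short sketch. It is a simplification under stated assumptions: the data are synthetic, and the permutation step uses a single held-out set for the whole forest rather than per-tree OOB samples, so it approximates the procedure described above instead of reproducing it. The Gini-based measure is read from scikit-learn's `feature_importances_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption): only column 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test = X[:700], X[700:]
y_train, y_test = y[:700], y[700:]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
baseline = model.score(X_test, y_test)  # accuracy before any permutation

# Method 1 (approximation): permute one column at a time and record
# how much the accuracy falls relative to the baseline.
accuracy_drop = []
for col in range(X.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    accuracy_drop.append(baseline - model.score(X_shuffled, y_test))

# Method 2: Gini-based importance, normalised by scikit-learn to sum to 1.
gini_importance = model.feature_importances_
```

Both measures should single out column 0 here, because the label depends on that column alone.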
Random Forest in imbalanced data classification
In Random Forest, the default weight of every class is 1, which assumes that all misclassification costs are equal. In scikit-learn, Random Forest provides the class_weight parameter (a list or dict) for manually specifying weights for the different classes. If the parameter is 'balanced', Random Forest adjusts the weights automatically from the values of y, so that each class weight is inversely proportional to the frequency of that class in the input data.
The calculation formula is n_samples / (n_classes * np.bincount(y)) (1)
'balanced_subsample' is similar to 'balanced', except that the weights are computed from the bootstrap sample (drawn with replacement) for each tree instead of from the total number of samples. The unbalanced data classification problem can therefore be addressed by this approach.
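As a worked example of formula (1), the sketch below applies it to the class counts reported for the training set (139974 non-default and 10026 default samples). The label array is rebuilt from those counts purely for illustration.

```python
import numpy as np

# Labels rebuilt from the reported class counts: 0 = non-default, 1 = default.
y = np.array([0] * 139974 + [1] * 10026)

n_samples = len(y)
n_classes = len(np.unique(y))

# Formula (1): one weight per class, inversely proportional to class frequency.
class_weight = n_samples / (n_classes * np.bincount(y))
```

The weights come out at roughly 0.54 for the majority class and 7.48 for the minority class, so each class contributes the same total weight, n_samples / n_classes.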
Data preprocessing and data analysis
Data Set
The data set used in this paper is a loan default data set of 250,000 samples, comprising a training set of 150,000 and a test set of 100,000.
The training set contains 150,000 historical borrower records, of which 10,026 are default samples (6.684% of the total, i.e. a loan default rate of 6.684%) and 139,974 are non-default samples (93.316% of the total). This data set is therefore typical highly unbalanced data. It covers the borrower's age, income, family situation, and loan conditions, with 11 variables in total: SeriousDlqin2yrs is the label, and the other 10 variables are predictive features. Table 2 lists the variable names and data types.
Data Analysis
The experimental environment used in this paper is Anaconda3 + Python3. First, the data were preliminarily analyzed. This experiment mainly analyzed the distribution of the default rate over each independent variable and generated frequency distribution tables such as Table 3 (decimals are rounded).
Table 3 shows that the default rate of people aged 25 or younger and of people aged 26-35 is above 10%, and that default rates fall as age increases.
Table 4 shows that 99.47% of borrowers have fewer than 5 real estate and mortgage loans, but the default rate rises significantly for borrowers with more loans, exceeding 20% for borrowers with more than 10 loans.
Table 5 shows that the default rate of borrowers who have never been 30-59 days past due is only about 4%, but the default rate rises significantly as the number of delinquencies increases. The frequency distribution tables of the other two variables, the number of times 60-89 days overdue and the number of times 90 or more days overdue, show the same trend as Table 5. It can therefore be concluded that the more delinquencies occur, the higher the default rate.
This study statistically analyzed each of the 10 variables in the data set and obtained frequency distribution tables like those above. Except for NumberOfOpenCreditLinesAndLoans (the number of open loans and credit lines), which has no obvious correlation with the default rate, all variables are related to whether the borrower eventually defaults.
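A frequency distribution table of this kind can be produced with pandas along the following lines. This is a sketch under assumptions: the six-row data frame is made up for illustration, and the bin edges and labels mirror the age intervals of Table 3 but are not taken from the inventors' code.

```python
import numpy as np
import pandas as pd

# Toy records (hypothetical); the real training set has 150,000 rows.
df = pd.DataFrame({
    "age": [22, 24, 30, 30, 50, 70],
    "SeriousDlqin2yrs": [1, 0, 0, 1, 0, 0],
})

# Age intervals; pd.cut uses right-closed bins, e.g. (25, 35].
bins = [0, 25, 35, 45, 55, 65, np.inf]
labels = ["<=25", "26-35", "36-45", "46-55", "56-65", ">65"]
age_band = pd.cut(df["age"], bins=bins, labels=labels)

# People and defaulters per interval, then the two percentage columns.
freq = df.groupby(age_band, observed=False)["SeriousDlqin2yrs"].agg(
    people="count", defaults="sum")
freq["share_of_people"] = freq["people"] / len(df)
freq["default_rate"] = freq["defaults"] / freq["people"]
```

Intervals with no borrowers are kept in the table with a count of 0 and an undefined (NaN) default rate.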
Data Pre-processing
Missing values: a preliminary exploration of the data reveals missing values in the MonthlyIncome and NumberOfDependents variables, 29731 and 3924 respectively.
Outliers: the minimum value of the age variable is 0, which is an outlier. The three overdue variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate contain a small number of values of 96 and 98, which may be abnormal values or some kind of code.
Data preprocessing: when reading the data with the pandas library in Python, set the na_values parameter of the function pd.read_csv() to a self-defined list so that 0 in the age variable and 96 and 98 in the three overdue variables are treated as NaN values; then use the sklearn.preprocessing.Imputer class to replace every NaN in the data set with the mean of the corresponding column.
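The preprocessing step can be sketched as below. Two points are assumptions rather than the inventors' code: the inline CSV is a made-up four-row sample, and `sklearn.impute.SimpleImputer` is swapped in for `sklearn.preprocessing.Imputer`, which was removed from recent scikit-learn releases. A per-column dict is passed as `na_values` so that 0 is only treated as missing in the age column.

```python
import io
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample rows using the column names from Table 2.
raw_csv = io.StringIO(
    "SeriousDlqin2yrs,age,NumberOfTime30-59DaysPastDueNotWorse,"
    "NumberOfTime60-89DaysPastDueNotWorse,NumberOfTimes90DaysLate,MonthlyIncome\n"
    "0,40,0,0,0,5000\n"
    "1,0,98,98,98,\n"
    "0,60,1,0,0,7000\n"
    "0,50,96,0,1,3000\n"
)

overdue_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]
# Per-column markers to read as NaN: age 0, and the codes 96 and 98.
na_values = {"age": ["0"], **{col: ["96", "98"] for col in overdue_cols}}
df = pd.read_csv(raw_csv, na_values=na_values)

# Replace every NaN in the feature columns with that column's mean.
features = df.drop(columns="SeriousDlqin2yrs")
imputer = SimpleImputer(strategy="mean")
clean = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
```

In the second row, the age of 0, the three 98 codes, and the empty MonthlyIncome are all replaced by the means of the remaining values in their columns.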
Building models and experiment result
Random Forest Model
In this experiment, we use a Python package, sklearn (more specifically, sklearn.ensemble.RandomForestClassifier), to build a Random Forest model.
Here are the parameters and their settings:
n_estimators: The number of decision trees, set to 100.
oob_score: Whether to use out-of-bag data to estimate accuracy, set to True.
min_samples_split: The minimum number of samples required to split an internal node, set to 2.
min_samples_leaf: The minimum number of samples required at a leaf node, set to 50.
n_jobs: The number of jobs to run in parallel, set to -1.
class_weight: Controls the weight of each class, set to balanced_subsample.
bootstrap: Whether bootstrap samples are used when building trees, set to True.
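With the settings listed above, the model construction might look like the following sketch. The training data here are synthetic and generated with `make_classification` (an assumption, chosen to imitate the roughly 93/7 class split); `random_state` is added only for reproducibility and is not one of the listed settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan data: 10 features, about 7% positives.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)

model = RandomForestClassifier(
    n_estimators=100,                   # number of decision trees
    oob_score=True,                     # score the forest on out-of-bag data
    min_samples_split=2,                # minimum samples to split a node
    min_samples_leaf=50,                # minimum samples at a leaf
    n_jobs=-1,                          # use all available cores
    class_weight="balanced_subsample",  # reweight classes per bootstrap sample
    bootstrap=True,                     # grow each tree on a bootstrap sample
    random_state=0,                     # assumption: fixed seed for the sketch
)
model.fit(X, y)
oob_accuracy = model.oob_score_
```

The `oob_score_` attribute is only available because `oob_score=True` and `bootstrap=True` are both set.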
Model Assessment
We use AUC as the indicator to assess the models in this experiment. AUC is defined as the area under the ROC (Receiver Operating Characteristic) curve, so its value cannot exceed 1. The x-axis of the ROC curve is the FPR (False Positive Rate), and the y-axis is the TPR (True Positive Rate). Because the ROC curve normally lies above the line y=x, the value of AUC is between 0.5 and 1. We use AUC as the standard for evaluating models because the ROC curve alone often cannot show clearly which classifier is better, whereas AUC, as a single numerical value, indicates more directly which classifier is superior.
The results of comparing the three models, Random Forest, Decision Tree and logistic regression, are given in Table 6.
Table 6 shows that the Random Forest algorithm has the highest AUC value of the three algorithms. Hence, Random Forest's predictive performance is better than that of the other two algorithms.
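A comparison of this kind can be sketched with `roc_auc_score`. The data are synthetic and the hyperparameters of the decision tree and logistic regression are assumptions (the source does not state them), so the resulting AUC values will not match Table 6; the sketch only shows how the three models can be scored on the same held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (an assumption, not the loan data set).
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.93, 0.07], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, min_samples_leaf=50,
        class_weight="balanced_subsample", random_state=0),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# AUC is computed from the predicted probability of the positive class.
auc = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, scores)
```

Passing probabilities rather than hard 0/1 predictions matters, because AUC measures how well the scores rank defaulters above non-defaulters across all thresholds.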
Feature Importances of Variables
We use the feature_importances_ attribute of the sklearn.ensemble.RandomForestClassifier class in this experiment; the importance of each feature is given in Table 7.
Table 7 shows that three variables, RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate, have the top three feature importances and therefore the greatest impact on determining who may break a contract and cause economic loss to a company. Hence, when companies grant a loan, they can consider these features of an applicant to lower the risk.
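A ranking like Table 7 can be obtained by pairing `feature_importances_` with the column names and sorting. The sketch below uses synthetic data and placeholder feature names (`feature_0` to `feature_5`, both assumptions), so it shows the mechanics only and does not reproduce the values in Table 7.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (assumptions).
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from high to low.
ranked = pd.Series(model.feature_importances_, index=names).sort_values(
    ascending=False)
top_three = list(ranked.index[:3])
```

scikit-learn normalises these importances so that they sum to 1, which is consistent with the values in Table 7 adding up to about 1.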
Conclusion
This paper studied loan default, a common problem in the financial sector, and established a default prediction model using the Random Forest method for unbalanced data classification. The basic idea of Random Forest is that, while each single tree is built, a random subset of variables or features takes part in each node split; this is repeated many times while keeping the trees independent of one another. For unbalanced data, parameter adjustment lets Random Forest adjust the class weights automatically according to the values of y, which effectively solves the problem of unbalanced data classification.
Experiments show that the Random Forest algorithm performs better than the decision tree and logistic regression classification models, which is an important reference for loan default prediction in the financial field. In addition, based on the feature importance measure, this experiment identifies three characteristics of the borrower, namely age, debt ratio, and the number of real estate and mortgage loans, as strongly influencing the final default outcome. The feature importance measure is also an important reference for other feature selection problems in data mining.

Claims (1)

1. A financial mining method for credit prediction, wherein the experimental environment used in this experiment is Anaconda3+Python3; first, the data is initially analyzed, this experiment mainly analyzes the distribution of default rate on each independent variable, and generates a frequency distribution table; data preprocessing: when reading data using the pandas library in Python, the na_values parameter in the function pd.read_csv() is set to our own defined list, 0 in the age variable and 96 and 98 in the three overdue variables are treated as NaN values, then the sklearn.preprocessing.Imputer class is used to replace all NaNs in the dataset with the average of the corresponding columns.
AU2019101189A 2019-10-02 2019-10-02 A financial mining method for credit prediction Ceased AU2019101189A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101189A AU2019101189A4 (en) 2019-10-02 2019-10-02 A financial mining method for credit prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101189A AU2019101189A4 (en) 2019-10-02 2019-10-02 A financial mining method for credit prediction

Publications (1)

Publication Number Publication Date
AU2019101189A4 (en) 2020-01-23

Family

ID=69160470

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101189A Ceased AU2019101189A4 (en) 2019-10-02 2019-10-02 A financial mining method for credit prediction

Country Status (1)

Country Link
AU (1) AU2019101189A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393169A (en) * 2021-07-13 2021-09-14 大商所飞泰测试技术有限公司 Financial industry transaction system performance index analysis method based on big data technology
CN113792935A (en) * 2021-09-27 2021-12-14 武汉众邦银行股份有限公司 Small micro enterprise credit default probability prediction method, device, equipment and storage medium
CN115408499A (en) * 2022-11-02 2022-11-29 思创数码科技股份有限公司 Automatic analysis and interpretation method and system for government affair data analysis report chart
CN116364178A (en) * 2023-04-18 2023-06-30 哈尔滨星云生物信息技术开发有限公司 Somatic cell sequence data classification method and related equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393169A (en) * 2021-07-13 2021-09-14 大商所飞泰测试技术有限公司 Financial industry transaction system performance index analysis method based on big data technology
CN113393169B (en) * 2021-07-13 2024-03-01 大商所飞泰测试技术有限公司 Financial industry transaction system performance index analysis method based on big data technology
CN113792935A (en) * 2021-09-27 2021-12-14 武汉众邦银行股份有限公司 Small micro enterprise credit default probability prediction method, device, equipment and storage medium
CN113792935B (en) * 2021-09-27 2024-04-05 武汉众邦银行股份有限公司 Method, device, equipment and storage medium for predicting credit default probability of small micro-enterprises
CN115408499A (en) * 2022-11-02 2022-11-29 思创数码科技股份有限公司 Automatic analysis and interpretation method and system for government affair data analysis report chart
CN116364178A (en) * 2023-04-18 2023-06-30 哈尔滨星云生物信息技术开发有限公司 Somatic cell sequence data classification method and related equipment
CN116364178B (en) * 2023-04-18 2024-01-30 哈尔滨星云生物信息技术开发有限公司 Somatic cell sequence data classification method and related equipment

Similar Documents

Publication Publication Date Title
AU2020100709A4 (en) A method of prediction model based on random forest algorithm
Moradi et al. A dynamic credit risk assessment model with data mining techniques: evidence from Iranian banks
AU2019101189A4 (en) A financial mining method for credit prediction
Tang et al. Applying a nonparametric random forest algorithm to assess the credit risk of the energy industry in China
Coşer et al. PREDICTIVE MODELS FOR LOAN DEFAULT RISK ASSESSMENT.
WO2012018968A1 (en) Method and system for quantifying and rating default risk of business enterprises
Van Thiel et al. Artificial intelligence credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era
AU2020101475A4 (en) A Financial Data Analysis Method Based on Machine Learning Models
Chern et al. A decision tree classifier for credit assessment problems in big data environments
Chen et al. Mixed credit scoring model of logistic regression and evidence weight in the background of big data
CN112329862A (en) Decision tree-based anti-money laundering method and system
Naik Predicting credit risk for unsecured lending: A machine learning approach
Datkhile et al. Statistical modelling on loan default prediction using different models
Becha et al. Use of Machine Learning Techniques in Financial Forecasting
Jin et al. Financial credit default forecast based on big data analysis
Li et al. Construction of credit evaluation index system for two-stage Bayesian discrimination: an empirical analysis of small Chinese enterprises
Zurada Rule Induction Methods for Credit Scoring
Pradnyana et al. Loan Default Prediction in Microfinance Group Lending with Machine Learning
Zakowska A New Credit Scoring Model to Reduce Potential Predatory Lending: A Design Science Approach
Srihari et al. A Study on “Loan Predictions Using Fintech Decision Tree Analysis”
Abid A Logistic Regression Model for Credit Risk of Companies in the Service Sector
Zaytsev Selection and evaluation of relevant predictors for credit scoring in peer-to-peer lending with random forest based methods
Odundo et al. Performance Evaluation Criteria of Credit Scoring Models for Commercial Lenders
Bakker et al. Performance Evaluation Criteria of Credit Scoring Models for Commercial Lenders
Kisutsa Loan Default Prediction Using Machine Learning: a Case of Mobile Based Lending

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry