AU2020100709A4 - A method of prediction model based on random forest algorithm - Google Patents

A method of prediction model based on random forest algorithm

Info

Publication number
AU2020100709A4
AU2020100709A4
Authority
AU
Australia
Prior art keywords
data
random forest
default
classification
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020100709A
Inventor
Yuhang Bao
Haoyue Chen
Banghao Gao
Donghao LI
Haixin ZHANG
Zhiyi Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chen Haoyue Miss
Zhang Zhiyi Miss
Original Assignee
Chen Haoyue Miss
Zhang Zhiyi Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chen Haoyue Miss, Zhang Zhiyi Miss
Priority to AU2020100709A4
Application granted
Publication of AU2020100709A4
Ceased (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Abstract

Abstract When managing credit risk, it is fundamental and vital for modern financial institutions to effectively evaluate and identify the potential default risk of borrowers before offering loans and to calculate borrowers' default probability. In this paper, the main objective of our investigation is to statistically analyze the historical loan data of banks and other financial institutions and to establish a loan default prediction model by applying the random forest algorithm and the idea of unbalanced data classification. According to the experimental results, the random forest algorithm outperforms decision tree and logistic regression classification algorithms in predictive performance. Additionally, features highly associated with default can be identified by ranking feature importance with the random forest algorithm, so that the procedure of judging the risk of offering a loan is optimized. (Figure 12 shows the logical frame of modeling; Figure 13 shows the result of the random forest model.)

Description

FIELD OF THE INVENTION
This invention mainly studies how to analyze the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and how to predict the possibility of loan default based on a random forest classification model.
BACKGROUND OF THE INVENTION
With the rapid development of the world economy and the deepening of China's reform and opening up, loans have become an important way for enterprises and individuals to solve economic problems, whether for the development of enterprises or for the change in people's consumption concepts. With the launch of banks' various loan businesses and people's growing demand, the probability of non-performing loans, that is, loan defaults, has also increased sharply. In order to avoid loan defaults, banks and other financial institutions evaluate or score the credit risk of borrowers when issuing loans, predict the probability of loan default and judge whether to issue the loan based on the results. How to effectively evaluate and identify the potential default risk of borrowers before issuing loans is the basis and an important link of credit risk management in financial institutions. Using a set of scientific models and systems to judge the risk of loan default can minimize risk and maximize profit.
Financial technology, also known as fintech, is a fast-evolving field that has reshaped the financial industry, including insurance, investment and banking. Recent years have witnessed the huge success of machine learning and deep learning in machine perception areas such as speech recognition and image analysis, but financial services need more, including prediction and decision-making.
Computing technologies play an important role in the transformation of modern financial services. There are many traditional classification machine learning algorithms, such as Decision Tree and Support Vector Machine. These algorithms are single classifiers, and they face performance bottlenecks and overfitting problems. Therefore, the method of integrating multiple classifiers to improve prediction performance came into being, namely Ensemble Learning. Bagging (parallel) and Boosting (serial) are two common Ensemble Learning methods; the difference between them is whether the integration is parallel or serial. Random Forest is the most representative algorithm of the Bagging integration method.
Random Forest is an Ensemble Learning algorithm based on the Decision Tree. The Decision Tree is a widely used tree classifier: each node of the tree is split by selecting the optimal splitting feature until the stopping condition of tree construction is reached, for example when the data in a leaf node all belong to the same category. When a sample to be classified is entered, the Decision Tree determines a unique path from the root node to a leaf node, whose category is the category of the sample. The Decision Tree is a simple and fast non-parametric classification method. In general it has good accuracy, but when the data is complex it runs into performance bottlenecks. Random Forest is a machine learning algorithm proposed by Leo Breiman in 2001 that combines Bagging ensemble learning theory with the random subspace method. Random Forest is an Ensemble Learning model based on the Decision Tree classifier: it contains multiple decision trees trained with the Bagging Ensemble Learning technique, and when samples to be classified are input, the final classification result is determined by a vote over the outputs of the individual decision trees. Random Forest overcomes the performance bottleneck of the Decision Tree, has good tolerance to noise and outliers, and has good scalability and parallelism for high-dimensional data classification. In addition, Random Forest is a non-parametric, data-driven classification method, which only requires learning classification rules from the given samples, without prior knowledge.
Since the training of each decision tree is independent of the others, the training of a random forest can be parallelized, which effectively guarantees the efficiency and scalability of the random forest algorithm.
SUMMARY
This invention mainly studies the loan default problems common in the financial field and uses the random forest method with unbalanced data classification to establish a model for predicting loan defaults. The basic idea of random forest is to randomly select some variables or features to participate in tree node division, to repeat this multiple times, and to ensure the independence between the resulting trees. For unbalanced data, parameter adjustment enables the random forest method to adjust the class weights automatically based on the y values, thereby effectively solving the unbalanced data classification problem.
Experiments show that the random forest algorithm has better classification performance than decision tree and logistic regression models and has important reference significance for loan default prediction in the financial field. In addition, by measuring the importance of each feature, this experiment finds that the borrower's age, debt ratio, and number of real estate and mortgage loans are the three characteristics with a greater impact on eventual default. This also has important reference significance for other feature selection problems in data mining.
DESCRIPTION OF DRAWING
Figure 1 shows the procedure of the project.
Figure 2 shows the detailed steps of the project.
Figure 3 shows the quantity variance between non-default samples and default samples.
Figure 4 shows the dataset overview obtained with df.describe().
Figure 5 visualizes the missing values.
Figure 6 visualizes the outliers in histograms.
Figure 7 shows the boxplots of outliers.
Figure 8 shows the distribution histogram and boxplot of the variable age.
Figure 9 shows the correlation matrix of variables.
Figure 10 shows the frequency distribution tables obtained from the statistical analysis of each variable.
Figure 11 shows the Random Forest Algorithm process.
Figure 12 shows the logical frame of modeling.
Figure 13 shows the result of Random Forest Model.
Figure 14 shows the result of XGBoost Model.
Figure 15 shows the importance of each feature.
DESCRIPTION OF PREFERRED EMBODIMENT
Figure 1 and Figure 2 show the procedure of our project and provide its overview.
1. Business understanding
Understanding the commercial goals and demands helps us turn them into a data mining problem.
In this invention, analyzing the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and predicting the possibility of loan default are our primary business requirements.
2. Data Preprocessing (The data processing environment used in this invention is Anaconda3 with Python 3.) Before further work, filtering the required data and defining the meaning and characteristics of the data are necessary.
Dataset observation:
The loan default dataset we use includes a total of 250,000 samples, of which 150,000 samples are selected as the training set and 10,000 are selected as the test set. The training set therefore contains 150,000 borrower information samples. Using df[df.SeriousDlqin2yrs == 1], we find that the number of default samples is 10026 (6.684%) and the number of non-default samples is 139974 (93.316%) (Figure 3), which means this is a highly imbalanced dataset.
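As an illustration, a minimal pandas sketch of this class-balance check (the file name is an assumption; the counts in the comments are those reported above):

```python
import pandas as pd

# Assumed file name for the training portion of the loan default dataset.
data = pd.read_csv("cs-training.csv")

n_default = (data.SeriousDlqin2yrs == 1).sum()
n_non_default = (data.SeriousDlqin2yrs == 0).sum()

print(n_default, f"{n_default / len(data):.3%}")          # e.g. 10026, 6.684%
print(n_non_default, f"{n_non_default / len(data):.3%}")  # e.g. 139974, 93.316%
```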
The dataset includes age, monthly income, family members, loans and several other attributes, 11 variables in total, in which SeriousDlqin2yrs is the label and the other 10 are prediction features. The following table shows the data features:
Variable Name Description Type
SeriousDlqin2yrs Person experienced 90 days past due delinquency or worse Y/N
RevolvingUtilizationOfUnsecuredLines Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits percentage
age Age of borrower in years integer
NumberOfTime30-59DaysPastDueNotWorse Number of times borrower has been 30-59 days past due but no worse in the last 2 years. integer
DebtRatio Monthly debt payments, alimony, living costs divided by monthly gross income percentage
Monthlyincome Monthly income real
NumberOfOpenCreditLinesAndLoans Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) integer
NumberOfTimes90DaysLate Number of times borrower has been 90 days or more past due. integer
NumberRealEstateLoansOrLines Number of mortgage and real estate loans including home equity lines of credit integer
NumberOfTime60-89DaysPastDueNotWorse Number of times borrower has been 60-89 days past due but no worse in the last 2 years. integer
NumberOfDependents Number of dependents in family excluding themselves (spouse, children etc.) integer
We use df.describe() to summarize the training dataset (Figure 4).
Figure 4 shows that Monthlyincome and NumberOfDependents have missing values, since their counts are less than 150,000. The mean of SeriousDlqin2yrs is 0.06684, which indicates that the default rate of the dataset is 6.684%. The minimum value of age is 0, which indicates that age contains outliers, because banks do not lend to people under 18.
Data cleaning:
Delete the unnamed index column using data = data.drop(data.columns[0], axis=1).
Use data[data['age'] < 18] to find that only one row has an age below 18.
Keep samples with age >= 18: data = data[data.age >= 18].
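A minimal sketch of these cleaning steps, assuming data is the training DataFrame loaded earlier:

```python
# Drop the unnamed index column, inspect borrowers under 18, then keep age >= 18.
data = data.drop(data.columns[0], axis=1)

print(data[data['age'] < 18])      # only one such row according to the text above
data = data[data.age >= 18]
```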
Datasets with missing values:
We use df.isnull().sum() and find that Monthlyincome and NumberOfDependents have missing values: 29731 and 3924, respectively.
Unnamed:0 0
SeriousDlqin2yrs 0
RevolvingUtilizationOfUnsecuredLines 0
age 0
NumberOfTime30-59DaysPastDueNotWorse 0
DebtRatio 0
Monthlyincome 29731
NumberOfOpenCreditLinesAndLoans 0
NumberOfTimes90DaysLate 0
NumberRealEstateLoansOrLines 0
NumberOfTime60-89DaysPastDueNotWorse 0
NumberOfDependents 3924
dtype: int64
As can be seen from the table, a random forest model will be built to fill in Monthlyincome, which has many missing values, while for NumberOfDependents, which has fewer missing values, the samples with missing values will be deleted directly.
Moreover, matrix plots provide a visualization of the missing values (Figure 5): we first rename the variables x1 to x10. Light colors indicate small values while dark colors indicate large values. Missing values are shown in red, and we can see that x5 (Monthlyincome) and x10 (NumberOfDependents) have missing values.
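A minimal sketch of the imputation strategy described above (random forest regression for Monthlyincome, row deletion for NumberOfDependents); the helper name and the choice of predictor columns are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestRegressor

def fill_monthly_income(df):
    # Predict Monthlyincome from the other features, excluding the label and the
    # other column that still contains missing values (an assumption).
    predictors = df.columns.drop(['SeriousDlqin2yrs', 'Monthlyincome',
                                  'NumberOfDependents'])
    known = df[df.Monthlyincome.notnull()]
    unknown = df[df.Monthlyincome.isnull()]

    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
    rf.fit(known[predictors], known.Monthlyincome)
    df.loc[df.Monthlyincome.isnull(), 'Monthlyincome'] = rf.predict(unknown[predictors])
    return df

data = fill_monthly_income(data)
data = data.dropna(subset=['NumberOfDependents'])   # few missing values: drop rows
```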
Outliers processing:
Outliers are values that deviate greatly from the rest of the data. For example, in statistics, values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are often considered outliers.
Generally there are several ways to perform outlier processing. First of all, we can use boxplots for univariate detection: the boxplot() function generates boxplots and lists the data points outside the whiskers of the box-and-whisker diagram. Second, LOF (Local Outlier Factor) is also useful. LOF is a density-based algorithm for identifying outliers: it compares the local density of a point with the density of the points distributed around it, and if the former is evidently smaller than the latter, the point lies in a relatively sparse area, indicating that it is an outlier. Third, cluster detection: aggregate the data into different clusters and treat data that does not belong to any cluster as outliers. The K-means algorithm is usually exploited in this case.
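As a small illustration of the IQR rule above (the column choice is an assumption):

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for one variable.
q1, q3 = data['DebtRatio'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (data['DebtRatio'] < q1 - 1.5 * iqr) | (data['DebtRatio'] > q3 + 1.5 * iqr)
print(mask.sum(), "potential outliers in DebtRatio")
```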
The minimum age is 0, which is an outlier. The variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate contain values of approximately 96 and 98, which may be outliers or behaviour codes (Figure 6). Through the boxplots, we can see that the three overdue indicators (30-59 days overdue, 60-89 days overdue, 90 days overdue) have severe outliers (Figure 7). The 99% quantiles of these three indicators are far below their maximum values, confirming the abnormality. The rows containing the abnormal values have dimensions (225, 11) for each indicator, so it can be inferred that the abnormal values of the three indicators occur in the same rows.
When reading the data with the pandas library in Python, set the na_values parameter of pd.read_csv() to our own defined list: 0 for the age variable, and 96 and 98 for the three overdue variables, so that these values are read as NaN; then use the sklearn.preprocessing.Imputer class to replace all NaN values in the dataset with the mean of the corresponding column.
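A minimal sketch of this step; the file name is an assumption, and SimpleImputer is used here as the current scikit-learn replacement for the older Imputer class mentioned above:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Treat 0 in age and the 96/98 codes in the three overdue counters as NaN on read.
na_map = {
    'age': [0],
    'NumberOfTime30-59DaysPastDueNotWorse': [96, 98],
    'NumberOfTime60-89DaysPastDueNotWorse': [96, 98],
    'NumberOfTimes90DaysLate': [96, 98],
}
data = pd.read_csv("cs-training.csv", na_values=na_map)

# Replace every remaining NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
data[data.columns] = imputer.fit_transform(data)
```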
Frequency tables:
Examine the distribution of the default rate over each independent variable to generate a table of frequency distributions. We start with RevolvingUtilizationOfUnsecuredLines, using the code data_tmp = data[['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines']], and then add a label column that marks the interval to which each row belongs. The pandas cut() function is used to convert continuous variables into categorical variables, for example mapping the values [1, 2, ..., 100] to intervals such as [1-10], [11-20].
In order to calculate the ratio of defaulters, we need the total number of people (the number of rows in data_tmp) and the pandas pivot_table() function to generate the summary table. What we want is the number of defaults divided by the total number of people in each interval.
The next step is to write the process of generating the frequency table into a function:
Get the total number of people (total_size = data_tmp.shape[0]); calculate the number of defaults divided by the total number of people in each interval; call pandas pivot_table(); add the percentage columns, rename them and adjust the table format (reindex_axis()). A sketch of such a function is given below.
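A minimal sketch of such a frequency-table function, assuming the cleaned DataFrame from the previous steps; the function name, bins and labels are illustrative:

```python
import pandas as pd

def frequency_table(df, feature, bins, labels):
    # Bin one feature, then report counts and default rates per bin.
    tmp = df[['SeriousDlqin2yrs', feature]].copy()
    tmp['label'] = pd.cut(tmp[feature], bins=bins, labels=labels)

    total_size = tmp.shape[0]
    table = pd.pivot_table(tmp, index='label', values='SeriousDlqin2yrs',
                           aggfunc=['count', 'sum'])
    table.columns = ['number', 'default_number']
    table['percent(%)'] = table['number'] / total_size * 100
    table['default_percent(%)'] = table['default_number'] / table['number'] * 100
    return table

# Example: the age frequency table discussed below.
age_table = frequency_table(data, 'age',
                            bins=[0, 25, 35, 45, 55, 65, 200],
                            labels=['below 25', '26-35', '36-45',
                                    '46-55', '56-65', 'above 65'])
print(age_table)
```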
The frequency distribution table of age:
age SeriousDlqin2yrs
number percent(%) number percent(%)
1 :below 25 3027 2.018 338 11.166
2:26-35 18458 12.305 2053 11.123
3:36-45 29819 19.879 2628 8.813
4:46-55 36690 24.460 2786 7.593
5:56-65 33406 22.271 1531 4.583
6: above 65 28599 19.066 690 2.413
From the table we can see that the default rate of people below 35 years old is over 10%. As age increases, the default rate falls. The distribution and boxplot details are shown in Figure 8.
The frequency distribution table of DebtRatio:
DebtRatio SeriousDlqin2yrs
number percent(%) number percent(%)
1: below 0.25 52361.0 34.908 3126 5.970
2:0.25-0.5 41346.0 27.564 2529 6.117
3:0.5-0.75 15728.0 10.485 1484 9.435
4:0.75-1.0 5427.0 3.618 596 10.982
5:1.0-2.0 4092.0 2.728 539 13.172
6:above 2 31045.0 20.697 1752 5.643
As the debt ratio increases, the interval default rate also increases. The default rate is highest among those with a debt ratio of 1-2, but when the debt ratio is greater than 2, the default rate drops.
The frequency distribution table of NumberOfOpenCreditLinesAndLoans:
NumberOfOpenCreditLinesAndLoans SeriousDlqin2yrs
number percent(%) number percent(%)
1 :below 5 46590 31.060 3922 8.418
2:6-10 60399 40.266 3345 5.538
3:11-15 29184 19.456 1804 6.181
4:16-20 9846 6.564044 676 6.866
5:21-25 2841 1.894 191 6.723
6:26-30 785 0.523 62 7.898
7: above 30 354 0.236 26 7.345
The default rate is relatively evenly distributed across the intervals.
The frequency distribution table of NumberRealEstateLoansOrLines:
NumberRealEstateLoansOrLines SeriousDlqin2yrs
number percent(%) number percent(%)
1 :below 5 149206 99.471 9884 6.624
2:6-10 699 0.466 121 17.310
3:11-15 70 0.047 16 22.857
4:16-20 14 0.009 3 21.429
5:above 20 10 0.007 2 20.000
99.47% of borrowers have fewer than 5 real estate and mortgage loans; the default rate of people with more than 5 such loans is significantly higher.
The frequency distribution table of NumberOfTime30-59DaysPastDueNotWorse:
NumberOfTime30-59DaysPastDueNotWorse SeriousDlqin2yrs
number percent(%) number percent(%)
1:0 142050 94.701 7450 5.245
2:1 4598 3.065 1219 26.512
3:2 1754 1.169 618 35.234
4:3 747 0.498 318 42.570
5:4 342 0.228 154 45.029
6:5 140 0.093 74 52.857
7:6 54 0.036 28 51.852
8:7and above 314 0.209 165 52.548
For borrowers who have never been 30-59 days past due, the default rate is only about 5%. As the number of past-due occurrences increases, the default rate continues to rise. The default rates for the two variables 60-89 days overdue and 90 days overdue show the same trend. Therefore, past delinquency is an important variable for predicting whether a default will occur in the future.
The frequency distribution table of Monthlyincome:
Monthlyincome SeriousDlqin2yrs
number percent(%) number percent(%)
1: below 5000 55859.0 37.240 4813 8.616
2:5000-10000 46090.0 30.727 2752 5.971
3: 10000-15000 13035.0 8.690 547 4.196
4:above 15000 5284.0 3.523 245 4.637
The higher the income, the lower the default rate. The Monthlyincome column has missing data, so this table can only be used as a reference.
The frequency distribution table of NumberOfDependents:
NumberOfDependents SeriousDlqin2yrs
number percent(%) number percent(%)
1:0 113218.0 75.479 7030 6.209
2:1 19521.0 13.014 1584 8.114
3:2 9483.0 6.322 837 8.826
4:3 2862.0 1.908 297 10.377
5:4 746.0 0.497 68 9.115
6:5 and more 245.0 0.163 31 12.653
There is not much difference in default rates among people with different numbers of dependents.
The dataset used in this experiment has 10 feature variables. Figure 9 shows that the correlations between the variables are small. We perform a statistical analysis on each variable and obtain the frequency distribution tables shown above and in Figure 10. Except for NumberOfOpenCreditLinesAndLoans (the number of open credit lines and loans), which has no apparent correlation with the default rate, these variables are related to whether the borrower eventually defaults.
3. Algorithm selection and modeling
Understanding the characteristics of this dataset: unbalanced data classification
Unbalanced data, that is, data in which one class far outnumbers the other (minority) class, is widespread in many fields such as network intrusion detection, financial fraud detection, and text classification. The classification of unbalanced data can be handled through penalty weights on positive and negative samples. The idea is that, in the implementation of the algorithm, different weights are assigned to classes with different sample sizes: generally, classes with small sample sizes are given high weights and classes with large sample sizes are given low weights, and the model is then trained on this basis.
Random forest algorithm
Random forest
The Random Forest algorithm builds a forest in a random manner. It is an ensemble learning algorithm based on decision trees. The basic idea of random forest is that, during the construction of a single tree, some variables or features are randomly selected to participate in the tree node division; this is repeated multiple times, and the independence between the trees so established is guaranteed. After the random forest is obtained, each decision tree in the forest classifies a new input sample, and the class that receives the most votes across the whole forest is taken as the predicted class of the sample. The process is shown in Figure 11.
Principles and Features of Random Forest Algorithm
The Random Forest algorithm handles both classification and regression problems. The algorithm steps are as follows:
Random Forest
Input:
T = training set
Ntree = number of decision trees in the forest
M = number of predictor variables in each sample
Mtry = number of variables participating in the division at each tree node
Ssampsize = Bootstrap sample size
Algorithm process:
For (itree = 1; itree <= Ntree; itree++):
• Use training set T to generate a Bootstrap data sample of size Ssampsize.
• Use the generated Bootstrap data to construct an unpruned tree itree. In the process of growing tree itree, randomly select Mtry variables out of the M variables at each node and select the best one to branch on according to some criterion, such as the Gini index (see the formula after this list).
Output:
• Regression problem: use the average of the output values of all trees as the prediction result.
• Classification problem: use the majority vote of the decision trees' classification results as the prediction result.
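For reference, the Gini criterion named in the step above is the standard Gini impurity (a well-known formula, not specific to this patent):

```latex
G(t) = 1 - \sum_{k=1}^{K} p_k^{2}
```

where p_k is the proportion of samples of class k at node t; at each node, the candidate split producing the largest decrease in G(t) is selected.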
The random forest has the following characteristics. As can be seen from the above algorithm process, the randomness of the random forest is mainly reflected in two aspects: the randomness of the data space is realized by Bagging (Bootstrap Aggregating), and the randomness of the feature space is realized by random feature sampling (the Random Subspace method). For classification problems, each decision tree in the random forest classifies and predicts new samples, and the decision results of these trees are then combined by voting to give the final classification result of the sample.
The random forest algorithm also has the following advantages:
• The introduction of randomness in rows (data records) and columns (variables) in the data makes it difficult for random forests to fall into overfitting.
• Random forest has good anti-noise ability.
• When there are a large number of missing values in the dataset, the random forest can effectively estimate and process the missing values.
• Strong adaptability to the dataset: it can handle both discrete and continuous data, and the dataset does not need to be standardized.
• It is possible to rank the importance of variables, which facilitates their interpretation. There are two methods for calculating variable importance in random forests. The first is based on the mean decrease in accuracy on the OOB (out-of-bag) samples: while growing a decision tree, first use its OOB samples for testing and record the number of misclassified samples; then randomly permute the values of one variable in the OOB samples, predict again with the same tree and record the number of misclassified samples once more. The difference between the two error counts divided by the total number of OOB samples is the error-rate change of this decision tree, and averaging the error-rate changes over all trees in the random forest gives the mean decrease in accuracy. The second method is based on the decrease in Gini impurity during splitting: a random forest grows its decision trees by splitting nodes according to the decrease in Gini impurity, and the Gini decreases of all nodes in the forest that select a given variable as the splitting variable are summed.
Random forest method for unbalanced data classification
The random forest algorithm sets the weight of each class to 1 by default, which assumes that the cost of misclassification is equal for all classes. In scikit-learn, the random forest algorithm provides a class_weight parameter, whose value can be a list or dictionary, to manually specify the weight of each class. If the parameter is set to "balanced", the random forest algorithm uses the y values to adjust the weights automatically, and each class weight is inversely proportional to the class frequency in the input data.
The calculation formula is: weight of class j = n_samples / (n_classes * n_j), where n_j is the number of samples belonging to class j (in scikit-learn, n_samples / (n_classes * np.bincount(y))).
The "balanced_subsample" mode is similar to the "balanced" mode, except that the weights are computed from the bootstrap sample drawn for each tree instead of from the whole training set. Therefore, we can solve the problem of unbalanced data classification through this method. A small sketch of the weight computation is given below.
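A minimal sketch of this weight computation in scikit-learn (y is assumed to be the SeriousDlqin2yrs label column):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = data['SeriousDlqin2yrs'].astype(int).values

# "balanced" weights: n_samples / (n_classes * n_j) for each class j.
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
manual = len(y) / (2 * np.bincount(y))   # the same formula written out by hand
print(weights, manual)
```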
Modeling and Results
Figure 12 shows the logical frame.
We can use sklearn.ensemble.RandomForestClassifier in Python to build the random forest model.
Part of the parameters are set as follows:
n_estimators: the number of decision trees, 100
oob_score: whether to use out-of-bag samples, True
min_samples_split: the minimum number of samples required to split a node, 2
min_samples_leaf: the minimum number of samples at a leaf node, 50
n_jobs: the number of parallel jobs, -1 (start as many jobs as the number of CPU cores)
class_weight: "balanced_subsample"; use the y values to adjust the weights automatically, with each class weight inversely proportional to the class frequency in the input data
bootstrap: whether to use bootstrap sampling, True
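A minimal sketch instantiating the classifier with the settings listed above (random_state is an added assumption for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100,
                            oob_score=True,
                            min_samples_split=2,
                            min_samples_leaf=50,
                            n_jobs=-1,
                            class_weight='balanced_subsample',
                            bootstrap=True,
                            random_state=0)
```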
Algorithms:
Firstly, load training and test datasets and preprocess data.
2020100709 05 May 2020
Then, split the training data into training_new and test_new for validation.
Next, impute the data with the imputer: replace missing values with the mean.
Finally, build the random forest model with training_new (see the sketch after this list):
• deal with the imbalanced data distribution.
• perform parameter tuning using grid search with cross-validation.
• output the best model and predict on the test data.
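A minimal sketch of this training-and-tuning flow; the validation split size, the parameter grid and the use of AUC as the grid-search score are assumptions, since the exact settings are not stated above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X = data.drop(columns=['SeriousDlqin2yrs'])
y = data['SeriousDlqin2yrs'].astype(int)

# Hold out part of the training data for validation (training_new / test_new).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(class_weight='balanced_subsample',
                            n_jobs=-1, random_state=0)
param_grid = {'n_estimators': [50, 100], 'min_samples_leaf': [20, 50]}
rf_search = GridSearchCV(rf, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
rf_search.fit(X_train, y_train)

print("best parameters:", rf_search.best_params_)
print("best CV AUC:", rf_search.best_score_)
print("validation AUC:", rf_search.score(X_val, y_val))
```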
The result is shown in Figure 13.
Extensive Model: XGBoost model
XGBoost is one of the Boosting algorithms. The idea of Boosting is to integrate many weak classifiers into a more powerful classifier. XGBoost is a boosted tree model: it integrates many tree models, each of which is a CART regression tree, to form a stronger classifier. XGBoost is an improvement on GBDT, making it more powerful and applicable to a wider range of problems.
Algorithm:
Firstly, load training and test datasets and preprocess data.
Secondly, split the training data into training_new and test_new for validation.
Thirdly, build the XGBoost model with the training_new data (see the sketch after this list):
• deal with missing values and the imbalanced data distribution.
• perform parameter tuning using grid search with cross-validation.
• output the best model and predict for test data.
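A minimal sketch of the XGBoost variant, assuming the same X_train/y_train split as in the random forest sketch; scale_pos_weight handles the class imbalance, and the parameter grid is illustrative:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Weight positive (default) samples by the class ratio to offset the imbalance.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1,
                    scale_pos_weight=ratio, eval_metric='auc')

param_grid = {'max_depth': [3, 5], 'min_child_weight': [1, 5]}
xgb_search = GridSearchCV(xgb, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
xgb_search.fit(X_train, y_train)
print(xgb_search.best_params_, xgb_search.best_score_)
```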
Result shown in Figure 14.
4. Model Evaluation
The model evaluation index used in this experiment is the AUC value, defined as the area under the ROC (receiver operating characteristic) curve. Obviously, the value of this area is not greater than 1. The horizontal axis of the ROC curve is the false positive rate (FPR), and the vertical axis is the true positive rate (TPR). Since the ROC curve usually lies above the y = x line, the AUC value ranges from 0.5 to 1. We use the AUC value as the evaluation criterion because the ROC curve alone usually cannot clearly indicate which classifier performs better, whereas, as a single value, a larger AUC indicates a better-performing classifier.
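As an illustration, the AUC of a fitted model on the held-out validation split can be computed as follows (reusing the names from the earlier sketches):

```python
from sklearn.metrics import roc_auc_score

probs = rf_search.predict_proba(X_val)[:, 1]   # predicted default probabilities
print("random forest validation AUC:", roc_auc_score(y_val, probs))
```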
We compared the random forest model with the XGBoost model, the logistic regression model, and the decision tree model. The results are as follows:
Algorithm AUC value
Random Forest 0.86
XGBoost 0.86
Logistic Regression 0.80
Decision Tree 0.80
According to the table, the AUC of the random forest is higher than those of logistic regression and the decision tree, and very close to that of XGBoost. The higher the AUC, the better the prediction performance.
Evaluation of Feature Importance
The feature_importances_ attribute of sklearn.ensemble.RandomForestClassifier is used in this experiment. The importance of each feature is shown in Figure 15.
As can be seen from Figure 15, RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate are the three most important features. They have a great impact on whether a default eventually occurs, so we should pay special attention to these characteristics of borrowers when processing loan applications.
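A minimal sketch of reading out the importances from the tuned model (names reused from the earlier sketches):

```python
import pandas as pd

best_rf = rf_search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```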

Claims (1)

What we claim is:
    1. A method of prediction model based on random forest algorithm, characterized in that:
Business understanding: understanding the commercial goals and demands helps us turn them into a data mining problem; analyzing the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and predicting the possibility of loan default are our primary business requirements; data preprocessing: before further work, filtering the required data and defining the meaning and characteristics of the data are necessary; random forest algorithm: the Random Forest algorithm builds a forest in a random manner; it is an ensemble learning algorithm based on decision trees; the basic idea of random forest is that, during the construction of a single tree, some variables or features are randomly selected to participate in the tree node division, this is repeated multiple times, and the independence between the trees so established is guaranteed; after the random forest is obtained, when a new input sample enters, each decision tree in the forest judges the sample and outputs the class it belongs to, and the class that receives the most votes across the whole forest is taken as the predicted class of the sample.
AU2020100709A 2020-05-05 2020-05-05 A method of prediction model based on random forest algorithm Ceased AU2020100709A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020100709A AU2020100709A4 (en) 2020-05-05 2020-05-05 A method of prediction model based on random forest algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020100709A AU2020100709A4 (en) 2020-05-05 2020-05-05 A method of prediction model based on random forest algorithm

Publications (1)

Publication Number Publication Date
AU2020100709A4 true AU2020100709A4 (en) 2020-06-11

Family

ID=70976398

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020100709A Ceased AU2020100709A4 (en) 2020-05-05 2020-05-05 A method of prediction model based on random forest algorithm

Country Status (1)

Country Link
AU (1) AU2020100709A4 (en)

Cited By (80)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753790A (en) * 2020-07-01 2020-10-09 武汉楚精灵医疗科技有限公司 Video classification method based on random forest algorithm
CN111767958A (en) * 2020-07-01 2020-10-13 武汉楚精灵医疗科技有限公司 Real-time enteroscopy withdrawal time monitoring method based on random forest algorithm
CN111753790B (en) * 2020-07-01 2023-12-12 武汉楚精灵医疗科技有限公司 Video classification method based on random forest algorithm
CN111798984A (en) * 2020-07-07 2020-10-20 章越新 Disease prediction scheme based on Fourier transform
CN112100902A (en) * 2020-08-10 2020-12-18 西安交通大学 Lithium ion battery service life prediction method based on stream data
CN112100902B (en) * 2020-08-10 2023-12-22 西安交通大学 Lithium ion battery life prediction method based on flow data
CN111951027A (en) * 2020-08-14 2020-11-17 上海冰鉴信息科技有限公司 Enterprise identification method and device with fraud risk
CN111950795A (en) * 2020-08-18 2020-11-17 安徽中烟工业有限责任公司 Random forest based method for predicting water adding proportion of loosening and conditioning
CN111950795B (en) * 2020-08-18 2024-04-26 安徽中烟工业有限责任公司 Random forest-based prediction method for loosening and conditioning water adding proportion
CN112132187A (en) * 2020-08-27 2020-12-25 上海大学 Method for rapidly judging perovskite structure stability based on random forest
CN112115991A (en) * 2020-09-09 2020-12-22 福建新大陆软件工程有限公司 Mobile terminal switching prediction method, device, equipment and readable storage medium
CN112115991B (en) * 2020-09-09 2023-08-04 福建新大陆软件工程有限公司 Mobile terminal change prediction method, device, equipment and readable storage medium
CN114426894A (en) * 2020-09-29 2022-05-03 中国石油化工股份有限公司 Natural gas hydrate phase equilibrium pressure prediction method based on machine learning
CN112308146A (en) * 2020-11-02 2021-02-02 国网福建省电力有限公司 Distribution transformer fault identification method based on operation characteristics
CN112487262A (en) * 2020-11-25 2021-03-12 建信金融科技有限责任公司 Data processing method and device
CN112487262B (en) * 2020-11-25 2023-05-26 中国建设银行股份有限公司 Data processing method and device
CN112465245A (en) * 2020-12-04 2021-03-09 复旦大学青岛研究院 Product quality prediction method for unbalanced data set
CN112530546A (en) * 2020-12-14 2021-03-19 重庆邮电大学 Psychological pre-judging method and system based on K-means clustering and XGboost algorithm
CN112530546B (en) * 2020-12-14 2024-03-22 重庆邮电大学 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm
CN112733903A (en) * 2020-12-30 2021-04-30 许昌学院 Air quality monitoring and alarming method, system, device and medium based on SVM-RF-DT combination
CN112733903B (en) * 2020-12-30 2023-11-17 许昌学院 SVM-RF-DT combination-based air quality monitoring and alarming method, system, device and medium
CN112766550A (en) * 2021-01-08 2021-05-07 佰聆数据股份有限公司 Power failure sensitive user prediction method and system based on random forest, storage medium and computer equipment
CN112766550B (en) * 2021-01-08 2023-10-13 佰聆数据股份有限公司 Random forest-based power failure sensitive user prediction method, system, storage medium and computer equipment
CN112883564A (en) * 2021-02-01 2021-06-01 中国海洋大学 Water body temperature prediction method and prediction system based on random forest
CN112990284B (en) * 2021-03-04 2022-11-22 安徽大学 Individual trip behavior prediction method, system and terminal based on XGboost algorithm
CN112990284A (en) * 2021-03-04 2021-06-18 安徽大学 Individual trip behavior prediction method, system and terminal based on XGboost algorithm
CN112907359A (en) * 2021-03-24 2021-06-04 四川奇力韦创新科技有限公司 Bank loan business qualification auditing and risk control system and method
CN112907359B (en) * 2021-03-24 2024-03-12 四川奇力韦创新科技有限公司 Bank loan business qualification auditing and risk control system and method
CN113127342B (en) * 2021-03-30 2023-06-09 广东电网有限责任公司 Defect prediction method and device based on power grid information system feature selection
CN113127342A (en) * 2021-03-30 2021-07-16 广东电网有限责任公司 Defect prediction method and device based on power grid information system feature selection
CN113139876A (en) * 2021-04-22 2021-07-20 平安壹钱包电子商务有限公司 Risk model training method and device, computer equipment and readable storage medium
CN113221972A (en) * 2021-04-26 2021-08-06 西安电子科技大学 Unbalanced hyperspectral data classification method based on weighted depth random forest
CN113221972B (en) * 2021-04-26 2024-02-13 西安电子科技大学 Unbalanced hyperspectral data classification method based on weighted depth random forest
CN113159615A (en) * 2021-05-10 2021-07-23 麦荣章 Intelligent information security risk measuring system and method for industrial control system
CN113205271A (en) * 2021-05-12 2021-08-03 国家税务总局山东省税务局 Method for evaluating enterprise income tax risk based on machine learning
CN113282886A (en) * 2021-05-26 2021-08-20 北京大唐神州科技有限公司 Bank loan default judgment method based on logistic regression
CN113392585B (en) * 2021-06-10 2023-11-03 京师天启(北京)科技有限公司 Method for spatialization of sensitive crowd around polluted land
CN113392585A (en) * 2021-06-10 2021-09-14 京师天启(北京)科技有限公司 Method for spatializing sensitive people around polluted land
CN113570191B (en) * 2021-06-21 2023-10-27 天津大学 Intelligent diagnosis method for dangerous situations of ice plugs in river flood
CN113570191A (en) * 2021-06-21 2021-10-29 天津大学 Intelligent diagnosis method for river ice plug dangerous situations in ice flood
CN113326664B (en) * 2021-06-28 2022-10-21 南京玻璃纤维研究设计院有限公司 Method for predicting dielectric constant of glass based on M5P algorithm
CN113326664A (en) * 2021-06-28 2021-08-31 南京玻璃纤维研究设计院有限公司 Method for predicting dielectric constant of glass based on M5P algorithm
CN113592058B (en) * 2021-07-05 2024-03-12 西安邮电大学 Method for quantitatively predicting microblog forwarding breadth and depth
CN113592058A (en) * 2021-07-05 2021-11-02 西安邮电大学 Method for quantitatively predicting microblog forwarding breadth and depth
CN113657452A (en) * 2021-07-20 2021-11-16 中国烟草总公司郑州烟草研究院 Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning
CN113470819A (en) * 2021-07-23 2021-10-01 湖南工商大学 Early prediction method for adverse event of pressure sore of small unbalanced sample based on random forest
CN113658680A (en) * 2021-07-29 2021-11-16 广西友迪资讯科技有限公司 Random forest based method for evaluating withdrawal effect of drug addicts
CN113658680B (en) * 2021-07-29 2023-10-27 广西友迪资讯科技有限公司 Evaluation method for drug-dropping effect of drug-dropping personnel based on random forest
CN113822536A (en) * 2021-08-26 2021-12-21 国网河北省电力有限公司邢台供电分公司 Power distribution network index evaluation method based on branch definition algorithm
CN113705904A (en) * 2021-08-31 2021-11-26 国网上海市电力公司 Chemical plant area power utilization fault prediction method based on random forest algorithm
CN113837863B (en) * 2021-09-27 2023-12-29 上海冰鉴信息科技有限公司 Business prediction model creation method and device and computer readable storage medium
CN113837863A (en) * 2021-09-27 2021-12-24 上海冰鉴信息科技有限公司 Business prediction model creation method and device and computer readable storage medium
CN114154561A (en) * 2021-11-15 2022-03-08 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114154561B (en) * 2021-11-15 2024-02-27 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114492929A (en) * 2021-12-23 2022-05-13 江南大学 XGboost-based financial credit enterprise credit prediction method
CN114710326A (en) * 2022-03-15 2022-07-05 国网甘肃省电力公司电力科学研究院 Intrusion detection method based on GC-Forest
CN114679779A (en) * 2022-03-22 2022-06-28 安徽理工大学 WIFI positioning method based on improved KNN fusion random forest algorithm
CN114679779B (en) * 2022-03-22 2024-04-26 安徽理工大学 WIFI positioning method based on improved KNN fusion random forest algorithm
CN115064263A (en) * 2022-06-08 2022-09-16 华侨大学 Alzheimer's disease prediction method based on random forest pruning brain region selection
CN115032720A (en) * 2022-07-15 2022-09-09 国网上海市电力公司 Application of multi-mode integrated forecast based on random forest in ground air temperature forecast
CN115907483A (en) * 2023-01-06 2023-04-04 山东蜂鸟物联网技术有限公司 Personnel risk assessment early warning method
CN115907483B (en) * 2023-01-06 2023-07-04 山东蜂鸟物联网技术有限公司 Personnel risk assessment and early warning method
CN116090834B (en) * 2023-03-07 2023-06-13 安徽农业大学 Forestry management method and device based on Flink platform
CN116090834A (en) * 2023-03-07 2023-05-09 安徽农业大学 Forestry management method and device based on Flink platform
CN116823014A (en) * 2023-04-06 2023-09-29 南京邮电大学 Method for realizing enterprise employee performance automatic scoring service
CN116823014B (en) * 2023-04-06 2024-02-13 南京邮电大学 Method for realizing enterprise employee performance automatic scoring service
CN116364178A (en) * 2023-04-18 2023-06-30 哈尔滨星云生物信息技术开发有限公司 Somatic cell sequence data classification method and related equipment
CN116364178B (en) * 2023-04-18 2024-01-30 哈尔滨星云生物信息技术开发有限公司 Somatic cell sequence data classification method and related equipment
CN116226767B (en) * 2023-05-08 2023-10-17 国网浙江省电力有限公司宁波供电公司 Automatic diagnosis method for experimental data of power system
CN116226767A (en) * 2023-05-08 2023-06-06 国网浙江省电力有限公司宁波供电公司 Automatic diagnosis method for experimental data of power system
CN117150389A (en) * 2023-07-14 2023-12-01 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117150389B (en) * 2023-07-14 2024-04-12 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN116861800B (en) * 2023-09-04 2023-11-21 青岛理工大学 Oil well yield increasing measure optimization and effect prediction method based on deep learning
CN116861800A (en) * 2023-09-04 2023-10-10 青岛理工大学 Oil well yield increasing measure optimization and effect prediction method based on deep learning
CN117113291B (en) * 2023-10-23 2024-02-09 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Analysis method for importance of production parameters in semiconductor manufacturing
CN117113291A (en) * 2023-10-23 2023-11-24 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Analysis method for importance of production parameters in semiconductor manufacturing
CN117370899B (en) * 2023-12-08 2024-02-20 中国地质大学(武汉) Ore control factor weight determining method based on principal component-decision tree model
CN117370899A (en) * 2023-12-08 2024-01-09 中国地质大学(武汉) Ore control factor weight determining method based on principal component-decision tree model
CN117540830A (en) * 2024-01-05 2024-02-09 中国地质科学院探矿工艺研究所 Debris flow susceptibility prediction method, device and medium based on fault distribution index
CN117540830B (en) * 2024-01-05 2024-04-12 中国地质科学院探矿工艺研究所 Debris flow susceptibility prediction method, device and medium based on fault distribution index

Similar Documents

Publication Publication Date Title
AU2020100709A4 (en) A method of prediction model based on random forest algorithm
Khandani et al. Consumer credit-risk models via machine-learning algorithms
AU2020101475A4 (en) A Financial Data Analysis Method Based on Machine Learning Models
AU2019101189A4 (en) A financial mining method for credit prediction
Van Thiel et al. Artificial intelligence credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era
Zurada et al. Comparison of the performance of several data mining methods for bad debt recovery in the healthcare industry
Chern et al. A decision tree classifier for credit assessment problems in big data environments
Van Thiel et al. Artificial intelligent credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era
CN113095927A (en) Method and device for identifying suspicious transactions of anti-money laundering
Valavan et al. Predictive-Analysis-based Machine Learning Model for Fraud Detection with Boosting Classifiers.
Hidayattullah et al. Financial statement fraud detection in Indonesia listed companies using machine learning based on meta-heuristic optimization
Mirtalaei et al. A trust-based bio-inspired approach for credit lending decisions
Koç et al. Consumer loans' first payment default detection: a predictive model
Ke et al. Loan repayment behavior prediction of provident fund users using a stacking-based model
Naik Predicting credit risk for unsecured lending: A machine learning approach
Datkhile et al. Statistical modelling on loan default prediction using different models
Becha et al. Use of Machine Learning Techniques in Financial Forecasting
Dasari et al. Prediction of bank loan status using machine learning algorithms
Jin et al. Financial credit default forecast based on big data analysis
Zeng A comparison study on the era of internet finance China construction of credit scoring system model
Panyagometh Impact of baseline population on credit score’s predictive power
Zurada Rule Induction Methods for Credit Scoring
Ruan et al. Personal credit risk identification based on combined machine learning model
Niu et al. Comparison of different individual credit risk assessment models
Salihu et al. Data Mining Based Classifiers for Credit Risk Analysis

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry