AU2020100709A4 - A method of prediction model based on random forest algorithm - Google Patents
- Publication number
- AU2020100709A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- random forest
- default
- classification
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
When managing credit risk, it is a fundamental and vital task for modern financial institutions to effectively evaluate and identify the potential default risk of borrowers before offering loans and to calculate their default probability. The main objective of this investigation is to statistically analyze the historical loan data of banks and other financial institutions and to establish a loan default prediction model by applying the random forest algorithm together with the idea of unbalanced data classification. According to the experimental results, the random forest algorithm outperforms decision tree and logistic regression classifiers in predictive performance. Additionally, the features most strongly associated with default can be obtained by ranking feature importance with the random forest algorithm, so that the procedure of judging loan risk is optimized. (The remainder of the original abstract page reproduces embedded text from Figure 12, the modeling frame, and Figure 13, the random forest results, including a best cross-validation score of approximately 0.863.)
Description
FIELD OF THE INVENTION
This invention mainly studies how to analyze the historical loan data of banks and other financial institutions with the idea of unbalanced data classification, and how to predict the possibility of loan default based on a random forest classification model.
BACKGROUND OF THE INVENTION
With the rapid development of the world economy and the deepening of China's reform and opening up, loans have become an important way for enterprises and individuals to solve economic problems, whether for the development of enterprises or the change of people's consumption concepts. With the launch of banks' various loan businesses and people's growing demand, the probability of non-performing loans, or loan defaults, has also increased sharply. To avoid loan defaults, banks and other financial institutions evaluate or score the credit risk of borrowers when issuing loans, predict the probability of default, and judge whether to issue the loan based on the results. Effectively evaluating and identifying the potential default risk of borrowers before issuing loans is the basis and an important link of credit risk management for financial institutions. Using a set of scientific models and systems to judge the risk of loan defaults can minimize risks and maximize profits.
2020100709 05 May 2020
Financial technology, also known as fintech, is a fast-evolving field that has reshaped the financial industry, including insurance, investment and banking. Recent years have witnessed the huge success of machine learning and deep learning in machine perception areas such as speech recognition and image analysis, but financial services need more, including prediction and decision-making.
Computing technologies play an important role in the transformation of modern financial services. There are many traditional classification Machine learning algorithms, such as Decision Tree, Support Vector Machine and so on. These algorithms are single classifiers, and they face performance bottlenecks and overfitting problems. Therefore, the method of integrating multiple classifiers to improve the prediction performance came into being, which is Ensemble Learning. Bagging(parallel) and Boosting(serial) are two common Ensemble Learning methods. The difference between them is whether the integration method is parallel or serial. Random Forests are the most representative algorithm in the Bagging integration method.
Random Forest is an Ensemble Learning algorithm based on Decision Tree. Decision Tree is a widely used tree classifier: each node of the tree is split by selecting the optimal splitting feature until a stopping condition of tree construction is reached, for example when the data in a leaf node all belong to the same category. When a sample to be classified is entered, the Decision Tree determines a unique path from the root node to a leaf node, whose category becomes the category of the sample. Decision Tree is a simple and fast non-parametric classification method with generally good accuracy, but when the data is complex, it reaches performance bottlenecks. Random Forest is a machine learning algorithm proposed by Leo Breiman in 2001 by combining Bagging ensemble learning theory with the random subspace method. Random Forest is an Ensemble Learning model based on the
Decision Tree classifier. It contains multiple decision trees trained by the Bagging Ensemble Learning technique. And when the samples to be classified are input, the final classification result is voted on by the output result of a single decision tree. Random Forest solves the performance bottleneck of Decision Tree, has good tolerance to noise and outliers, and has good scalability and parallelism for high-dimensional data classification. In addition, Random Forest is a non-parametric classification method driven by data, which only requires learning and training classification rules for a given sample, without prior knowledge.
Since the training of each decision tree is independent of each other, the training of random forest can be realized through parallel processing, which effectively guarantees the efficiency and expansibility of random forest algorithm.
SUMMARY
This invention mainly studies the loan default problems common in the financial field and uses the random forest method with unbalanced data classification to establish a model for predicting loan defaults. The basic idea of random forest is to randomly select some variables or features to participate in tree node division, to repeat this multiple times, and to ensure the independence of the resulting trees. For unbalanced data, parameter adjustment enables the random forest method to automatically adjust weights based on the y values, thereby effectively solving the classification problem on unbalanced data.
Experiments show that the random forest algorithm has better classification performance than decision tree and logistic regression models, which has important reference significance for loan default prediction in the financial field. In addition, by measuring the importance of each feature, this experiment identifies the borrower's age, debt ratio, and number of real estate and mortgage loans as three characteristics with a large impact on eventual default. This also has reference significance for other feature selection problems in data mining.
DESCRIPTION OF DRAWING
Figure 1 shows the procedure of the project.
Figure 2 shows the detailed steps of the project.
Figure 3 shows the quantity variance between non-default samples and default samples.
Figure 4 shows the dataset overview by using df.describe() code.
Figure 5 visualizes the missing values.
Figure 6 visualizes the outliers in histograms.
Figure 7 shows the boxplots of outliers.
Figure 8 shows the distribution and box line diagram of variable-age.
Figure 9 shows the correlation matrix of variables.
Figure 10 shows the frequency distribution tables obtained by performing statistical analysis on each variable.
Figure 11 shows the Random Forest Algorithm process.
Figure 12 shows the logical frame of modeling.
Figure 13 shows the result of Random Forest Model.
Figure 14 shows the result of XGBoost Model.
Figure 15 shows the importance of each feature.
DESCRIPTION OF PREFERRED EMBODIMENT
Figure 1 and Figure 2 show the procedure of our project and provide its overview.
1. Business understanding
Understanding commercial goals and commercial demands helps us turn them into a data mining problem.
In this invention, analyzing the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and predicting the possibility of loan default are our primary business requirements.
2. Data Preprocessing (The data processing environment used in this invention is Anaconda 3 with Python 3.) Before further work, filtering the required data and defining the meaning and characteristics of the data are necessary.
Dataset observation:
The loan default dataset we use includes a total of 250,000 samples, of which 150,000 samples are selected as the training set and 10,000 as the test set. In the training set, there are 150,000 borrower information samples. Using df[df.SeriousDlqin2yrs == 1], we find that the number of default samples is 10026 (6.684%) and the number of non-default samples is 139974 (93.316%) (Figure 3), which means this is a highly imbalanced dataset.
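The class-balance check described above can be sketched as follows. The miniature DataFrame here is a synthetic stand-in for the real 150,000-row training set, with the same SeriousDlqin2yrs label column:

```python
import pandas as pd

# Hypothetical miniature stand-in for the training set; the real data has
# 150,000 rows and a binary SeriousDlqin2yrs label.
df = pd.DataFrame({"SeriousDlqin2yrs": [1] * 2 + [0] * 28})

# Filter the default rows exactly as in the text: df[df.SeriousDlqin2yrs == 1]
default = df[df.SeriousDlqin2yrs == 1]
n_default = len(default)
pct_default = 100 * n_default / len(df)
```

On the real data this yields 10026 defaults (6.684%); on the toy frame above it yields 2 of 30.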
The dataset includes age, monthly income, family members, loans and several other conditions, with totally 11 variables, in which SeriousDlqin2yrs is label and the other 10 are prediction features. The following table shows the data features:
Variable Name | Description | Type |
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
age | Age of borrower in years | integer |
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
Monthlyincome | Monthly income | real |
NumberOfOpenCreditLinesAndLoans | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards) | integer |
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer |
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer |
Using df.describe() to process the training dataset (Figure 4).
Figure 4 demonstrates that Monthlyincome and NumberOfDependents have missing values, since their counts are below 150,000. The mean of SeriousDlqin2yrs is 0.06684, which indicates that the default rate of the dataset is 6.684%. The minimum value of age is 0, which means there are outliers in age, since banks do not lend to people under 18.
Data cleaning:
Delete the column Unnamed:0 using data = data.drop(data.columns[0], axis=1).
Use data[data['age'] < 18] to find that only one row has an age below 18.
Save samples with age >= 18: data=data[data.age>=18].
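The cleaning steps above can be sketched on a toy frame. The three rows below are synthetic; the column names mimic the raw CSV, whose first column is a leftover index:

```python
import pandas as pd

# Toy frame mimicking the raw CSV: a leftover index column plus age.
data = pd.DataFrame({"Unnamed: 0": [1, 2, 3], "age": [0, 25, 40]})

# Drop the leftover index column (data.columns[0]).
data = data.drop(data.columns[0], axis=1)

# Inspect rows with age below 18, then keep only borrowers aged 18+.
underage = data[data["age"] < 18]
data = data[data.age >= 18]
```

After these steps the age-0 outlier row is gone and only the age column remains.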
Datasets with missing values:
We use df.isnull().sum() and find that Monthlyincome and NumberOfDependents have missing values, respectively 29731 and 3924.
Unnamed:0 | 0 |
SeriousDlqin2yrs | 0 |
RevolvingUtilizationOfUnsecuredLines | 0 |
age | 0 |
NumberOfTime30-59DaysPastDueNotWorse | 0 |
DebtRatio | 0 |
Monthlyincome | 29731 |
NumberOfOpenCreditLinesAndLoans | 0 |
NumberOfTimes90DaysLate | 0 |
NumberRealEstateLoansOrLines | 0 |
NumberOfTime60-89DaysPastDueNotWorse | 0 |
NumberOfDependents | 3924 |
dtype: int64 |
As can be seen from the table, for Monthlyincome, which has many missing values, a random forest model will be established to fill the gaps; for NumberOfDependents, which has fewer missing values, the missing samples will be directly deleted.
Moreover, matrix plots provide a visualization of missing values (Figure 5). We first replace the variable names with x1 to x10. Light color indicates a small value while dark color indicates a large value. Missing values are shown in red, and we can see that x5 (Monthlyincome) and x10 (NumberOfDependents) have missing values.
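The two missing-value treatments above can be sketched as follows. The data is synthetic, and mean imputation is used here as a simple stand-in for the random-forest imputation model the text describes for Monthlyincome:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 7000.0, np.nan, 6000.0],
    "NumberOfDependents": [0.0, 1.0, np.nan, 2.0, 0.0],
    "age": [30, 40, 50, 60, 70],
})

# The sparsely-missing column: drop the affected rows directly.
df = df.dropna(subset=["NumberOfDependents"])

# The heavily-missing income column: fill it in. Mean imputation is a
# simple stand-in for the random-forest imputer described in the text.
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].mean())
```

After this, no missing values remain in either column.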
Outliers processing:
Outliers are values that deviate greatly from the rest of the data. For example, in statistics, values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are often considered outliers.
Generally there are several ways to perform outlier processing. First, we can use boxplots for univariate detection: the boxplot() function generates boxplots and lists the data points outside the whiskers. Second, LOF (Local Outlier Factor) is also useful. LOF is a density-based algorithm for identifying outliers: it compares the local density of a point with the density of the points distributed around it; if the former is evidently smaller than the latter, the point lies in a relatively sparse area and is flagged as an outlier. Third, cluster detection: aggregate the data into different classes, and select data that belongs to no class as outliers. The k-means algorithm is usually exploited in this case.
The minimum age is 0, which is an outlier. In the variables NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate, there are values of approximately 96 and 98; these may be outliers or behavior codes (Figure 6). Through the boxplots, we can see that the three overdue indicators (30-59 days overdue, 60-89 days overdue, 90 days overdue) have severe outliers (Figure 7). It is verified that the 99% quantiles of these three indicators diverge strongly from the maximum values, indicating abnormalities. The dimensions are all (225, 11), so it can be inferred that the abnormal values of these three indicators occur in the same rows.
When reading the data with pandas in Python, set the na_values parameter of pd.read_csv() to a user-defined list: 0 for the age variable, and 96 and 98 for the three overdue variables, so that these values become NaN. Then use the sklearn.preprocessing.Imputer class to replace all NaNs in the dataset with the mean of the corresponding column.
Frequency tables:
Examine the distribution of the default rate over each independent variable to generate a
table of frequency distributions. We start with RevolvingUtilizationOfUnsecuredLines, using the code data_tmp = data[['SeriousDlqin2yrs','RevolvingUtilizationOfUnsecuredLines']], and then add a label column marking the interval to which each row belongs. The pandas cut() function is used to convert continuous variables to categorical ones, for example mapping values [1, 2, ..., 100] to intervals such as [1-10], [11-20].
In order to calculate the ratio of defaulters, we need the total number of borrowers (the number of rows in data_tmp) and the pandas pivot_table() function to generate the summary table. What we want is the number of defaults divided by the total number of people in each interval.
The next step is to write the process of generating the frequency table into a function:
Get the total number of people: total_size = data_tmp.shape[0]. Calculate the number of defaults divided by the total number of people in each interval with pandas' pivot_table(). Then add the derived columns step by step, rename them, and adjust the table format with reindex_axis().
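A minimal version of this frequency-table procedure, on synthetic data (the age bins below mirror the table that follows; the counts are illustrative only):

```python
import pandas as pd

data_tmp = pd.DataFrame({
    "SeriousDlqin2yrs": [1, 0, 0, 1, 0, 0],
    "age": [22, 30, 41, 52, 58, 70],
})

# pd.cut converts the continuous variable into labelled intervals.
data_tmp["label"] = pd.cut(
    data_tmp["age"], bins=[0, 25, 35, 45, 55, 65, 120],
    labels=["<=25", "26-35", "36-45", "46-55", "56-65", ">65"],
)

# pivot_table summarises per-bin totals and default counts.
freq = pd.pivot_table(
    data_tmp, index="label", values="SeriousDlqin2yrs",
    aggfunc=["count", "sum"], observed=False,
)
freq.columns = ["number", "defaults"]
freq["percent"] = 100 * freq["number"] / len(data_tmp)
freq["default_rate"] = 100 * freq["defaults"] / freq["number"]
```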
The frequency distribution table of age:
age | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1 :below 25 | 3027 | 2.018 | 338 | 11.166 |
2:26-35 | 18458 | 12.305 | 2053 | 11.123 |
3:36-45 | 29819 | 19.879 | 2628 | 8.813 |
4:46-55 | 36690 | 24.460 | 2786 | 7.593 |
5:56-65 | 33406 | 22.271 | 1531 | 4.583 |
6: above 65 | 28599 | 19.066 | 690 | 2.413 |
From the table we can see that the default rate of people below 35 years old is over 10%. As age increases, the default rate falls. The distribution and boxplot details are shown in Figure 8.
The frequency distribution table of DebtRatio:
DebtRatio | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1: below 0.25 | 52361.0 | 34.908 | 3126 | 5.970 |
2:0.25-0.5 | 41346.0 | 27.564 | 2529 | 6.117 |
3:0.5-0.75 | 15728.0 | 10.485 | 1484 | 9.435 |
4:0.75-1.0 | 5427.0 | 3.618 | 596 | 10.982 |
5:1.0-2.0 | 4092.0 | 2.728 | 539 | 13.172 |
6:above 2 | 31045.0 | 20.697 | 1752 | 5.643 |
As the debt ratio increases, the interval default rate increases as well. The default rate is highest among those with a debt ratio of 1-2, but when the debt ratio exceeds 2, the default rate drops.
The frequency distribution table of NumberOfOpenCreditLinesAndLoans:
NumberOfOpenCreditLinesAndLoans | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1 :below 5 | 46590 | 31.060 | 3922 | 8.418 |
2:6-10 | 60399 | 40.266 | 3345 | 5.538 |
3:11-15 | 29184 | 19.456 | 1804 | 6.181 |
4:16-20 | 9846 | 6.564 | 676 | 6.866 |
5:21-25 | 2841 | 1.894 | 191 | 6.723 |
6:26-30 | 785 | 0.523 | 62 | 7.898 |
7: above 30 | 354 | 0.236 | 26 | 7.345 |
The default rate is distributed relatively evenly across the intervals.
The frequency distribution table of NumberRealEstateLoansOrLines:
NumberRealEstateLoansOrLines | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1 :below 5 | 149206 | 99.471 | 9884 | 6.624 |
2:6-10 | 699 | 0.466 | 121 | 17.310 |
3:11-15 | 70 | 0.047 | 16 | 22.857 |
4:16-20 | 14 | 0.009 | 3 | 21.429 |
5:above 20 | 10 | 0.007 | 2 | 20.000 |
99.47% of borrowers own less than 5 real estate and mortgage loans; the default rate of people with more than 5 loans has increased significantly.
The frequency distribution table of NumberOfTime30-59DaysPastDueNotWorse:
NumberOfTime30-59DaysPastDueNotWorse | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1:0 | 142050 | 94.701 | 7450 | 5.245 |
2:1 | 4598 | 3.065 | 1219 | 26.512 |
3:2 | 1754 | 1.169 | 618 | 35.234 |
4:3 | 747 | 0.498 | 318 | 42.570 |
5:4 | 342 | 0.228 | 154 | 45.029 |
6:5 | 140 | 0.093 | 74 | 52.857 |
7:6 | 54 | 0.036 | 28 | 51.852 |
8:7and above | 314 | 0.209 | 165 | 52.548 |
For borrowers who have never been 30-59 days past due, the default rate is only about 5%. As the number of past-due occurrences increases, the default ratio continues to rise. The default rates for the two variables 60-89 days overdue and 90 days overdue show the same trend. Therefore, whether a default has occurred before is an important variable for determining whether a default will occur in the future.
The frequency distribution table of Monthlyincome:
Monthlyincome | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1: below 5000 | 55859.0 | 37.240 | 4813 | 8.616 |
2:5000-10000 | 46090.0 | 30.727 | 2752 | 5.971 |
3: 10000-15000 | 13035.0 | 8.690 | 547 | 4.196 |
4:above 15000 | 5284.0 | 3.523 | 245 | 4.637 |
The higher the income, the lower the default rate. The Monthlyincome column has missing data and can only be used as a reference.
The frequency distribution table of NumberOfDependents:
NumberOfDependents | SeriousDlqin2yrs | |||
number | percent(%) | number | percent(%) | |
1:0 | 113218.0 | 75.479 | 7030 | 6.209 |
2:1 | 19521.0 | 13.014 | 1584 | 8.114 |
3:2 | 9483.0 | 6.322 | 837 | 8.826 |
4:3 | 2862.0 | 1.908 | 297 | 10.377 |
5:4 | 746.0 | 0.497 | 68 | 9.115 |
6:5 and more | 245.0 | 0.163 | 31 | 12.653 |
There is not much difference in default rates among people with different numbers of family members.
The dataset used in this experiment has 10 predictor variables. Figure 9 shows that the correlations between the variables are small. We perform statistical analysis on each variable and obtain the frequency distribution tables shown above (Figure 10). Except for NumberOfOpenCreditLinesAndLoans (the number of open credit lines and loans), which has no apparent correlation with the default rate, these variables are all related to whether the borrower eventually defaults.
3. Algorithm selection and modeling
Understanding the characteristics of this dataset: unbalanced data classification
Unbalanced data, that is, data in which one class far exceeds the other (minority) class, is widespread in many fields such as network intrusion detection, financial fraud detection, and text classification. The classification problem on unbalanced data can be addressed through penalty weights on positive and negative samples: in the algorithm implementation, different weights are assigned to classes with different sample sizes. Generally, classes with small sample sizes receive high weights and classes with large sample sizes receive low weights before calculation and modeling.
Random forest algorithm
Random forest
The Random Forest algorithm builds a forest in a random manner. It is a combination learning algorithm based on decision trees. The basic idea of random forest is that during the construction of each single tree, some variables or features are randomly selected to participate in the tree node division; this is repeated multiple times, and the independence between the trees is guaranteed. After the random forest is obtained, when a new input sample enters, each decision tree in the forest judges the sample and outputs the class it belongs to; the class with the highest number of votes in the entire forest is predicted as the class of the sample. The process is shown in Figure 11.
Principles and Features of Random Forest Algorithm
The Random Forest algorithm includes classification and regression problems. The algorithm steps are as follows:
Random Forest
Input:
T = Training Set
Ntree=Number of decision trees in the forest
M = number of predictors in each sample
Mtry = number of variables participating in the division at each tree node
Ssampsize = bootstrap sample size
Algorithm process:
For (itree = 1; itree <= Ntree; itree++): • Use training set T to generate a Bootstrap data sample of size Ssampsize.
• Use the generated Bootstrap data to construct an unpruned tree itree. In the process of generating a tree itree, randomly select Mtry variables from M and select the best one to branch according to some standard (Gini).
Output:
• Regression problem: use the average of the output values of all trees as the prediction result.
• Classification problem: use the classification results of most decision trees as prediction results.
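The loop above can be sketched in a few lines of Python on synthetic data. For brevity this sketch draws the Mtry random features once per tree rather than at every node as the algorithm specifies, and uses scikit-learn's DecisionTreeClassifier for the unpruned base trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic label

ntree, mtry = 25, 3
trees, feats = [], []
for _ in range(ntree):
    rows = rng.integers(0, len(X), len(X))               # bootstrap sample
    cols = rng.choice(X.shape[1], mtry, replace=False)   # random subspace
    t = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    trees.append(t)
    feats.append(cols)

# Classification output: majority vote over the forest.
votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, feats)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
acc = (pred == y).mean()
```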
The random forest has the following characteristics. As can be seen from the above algorithm process, the randomness of the random forest is reflected in two aspects: the randomness of the data space is realized by Bagging (Bootstrap Aggregating), and the randomness of the feature space is realized by random feature sampling (Random Subspace). For classification problems, each decision tree in the random forest classifies new samples, and the decision results of these trees are then combined to give the final classification result.
The random forest algorithm also has the following advantages:
• The introduction of randomness in rows (data records) and columns (variables) in the data makes it difficult for random forests to fall into overfitting.
• Random forest has good anti-noise ability.
• When there are a large number of missing values in the dataset, the random forest can effectively estimate and handle them.
• Strong adaptability to the dataset: can handle both discrete data and continuous data, the data set does not need to be standardized.
• It is possible to rank the importance of variables, which facilitates their interpretation. There are two methods for computing variable importance in random forests. One is based on the mean decrease in accuracy on the OOB (out-of-bag) samples: while growing a decision tree, first use the OOB samples to test it and record the number of misclassified samples; then randomly permute the values of one variable in those samples, predict again with the same tree, and record the number of misclassified samples once more. The difference between the two error counts divided by the total number of OOB samples is the error-rate change for that tree, and averaging this change over all trees in the forest gives the mean decrease in accuracy. The other method is based on the decrease in Gini impurity during splitting: the random forest grows decision trees by splitting nodes according to the decrease in Gini impurity, and the Gini decreases of all nodes in the forest that select a given variable as the splitting variable are summed.
Random forest method for unbalanced data classification
The random forest algorithm defaults the weight of each class to 1, which assumes that the misclassification cost of all classes is equal. In scikit-learn, the random forest algorithm provides a class_weight parameter, whose value can be a list or dictionary, to manually specify the weight of each class. If the parameter is set to "balanced", the random forest algorithm uses the y values to adjust the weights automatically, and each class weight is inversely proportional to the class frequency in the input data.
The calculation formula is: weight(c) = n_samples / (n_classes × n_samples_in_class_c).
The "balanced_subsample" mode is similar to the "balanced" mode, except that the weights are computed from the bootstrap sample of each tree instead of from the full training set. Therefore, we can solve the problem of unbalanced data classification through this method.
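The "balanced" weight heuristic can be computed directly. The 93/7 split below mirrors the approximate class proportions of this dataset:

```python
import numpy as np

# scikit-learn's "balanced" heuristic:
#   weight(c) = n_samples / (n_classes * count(c)),
# i.e. each class weight is inversely proportional to its frequency.
y = np.array([0] * 93 + [1] * 7)   # roughly the 93%/7% split of this dataset
n_samples, n_classes = len(y), 2
weights = n_samples / (n_classes * np.bincount(y))
# The minority class receives the much larger weight.
```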
Modeling and Results
Figure 12 shows the logical frame.
We can use sklearn.ensemble.RandomForestClassifier in Python to build a random forest model.
Part of the parameters are set as:
n_estimators: number of decision trees, 100
oob_score: whether to use out-of-bag data, True
min_samples_split: minimum number of samples required to split a node, 2
min_samples_leaf: minimum number of samples at a leaf node, 50
n_jobs: number of parallel jobs, -1 (start as many jobs as the number of CPU cores)
class_weight: "balanced_subsample"; use the y values to automatically adjust weights, each class weight being inversely proportional to the class frequency in the input data
bootstrap: whether to use bootstrap sampling, True
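The parameter settings listed above, expressed as a constructor call:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,                     # 100 decision trees
    oob_score=True,                       # use out-of-bag data
    min_samples_split=2,
    min_samples_leaf=50,
    n_jobs=-1,                            # one job per CPU core
    class_weight="balanced_subsample",    # reweight classes per bootstrap
    bootstrap=True,
)
```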
Algorithms:
Firstly, load training and test datasets and preprocess data.
Then, split the training data into training_new and test_new for validation.
Next, impute the data: replace missing values with the column mean.
Finally, build the Random Forest model with training_new:
• deal with imbalanced data distribution.
• perform parameter tuning using grid search with CrossValidation.
• output the best model and predict for test data.
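The steps above can be sketched end to end. The data here is synthetic (with an injected ~7% positive rate and missing values), the parameter grid is illustrative rather than the tuned one, and SimpleImputer stands in for the older sklearn.preprocessing.Imputer named earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.07).astype(int)      # ~7% positives, like the dataset
X[rng.random(X.shape) < 0.05] = np.nan        # inject missing values

# Split for validation, then impute NaNs with the column mean.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
imp = SimpleImputer(strategy="mean")
X_tr, X_te = imp.fit_transform(X_tr), imp.transform(X_te)

# Grid search with cross-validation, scored by AUC as in the evaluation section.
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced_subsample", random_state=0),
    {"max_depth": [2, 4], "n_estimators": [50]},
    scoring="roc_auc", cv=3,
)
grid.fit(X_tr, y_tr)
best = grid.best_estimator_                   # predict on X_te with this model
```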
Result shown in Figure 13
Extensive Model: XGBoost model
XGBoost is one of the Boosting algorithms. The idea of Boosting is to integrate many weak classifiers to form a more powerful classifier. XGBoost is a boosted tree model: it integrates many tree models, each a CART regression tree, to form a stronger classifier. XGBoost is an improvement on GBDT, making it more powerful and applicable to a wider range of problems.
Algorithm:
Firstly, load training and test datasets and preprocess data.
Secondly, split the training data into training_new and test_new for validation.
Thirdly, build the XGBoost model with the training_new data:
• deal with missing values and imbalanced data distribution.
• perform parameter tuning using grid search with CrossValidation.
• output the best model and predict for test data.
Result shown in Figure 14.
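The boosting idea can be illustrated without the xgboost package itself; scikit-learn's GradientBoostingClassifier is used below as a conceptual stand-in (both fit shallow regression trees sequentially, each correcting its predecessors' errors), on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced data (~10% positives) as a stand-in for the loan set.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# 50 sequential depth-3 trees, each fit to the residual errors of the ensemble.
gb = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gb.fit(X, y)
train_acc = gb.score(X, y)
```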
4.Model Evaluation
The model evaluation index used in this experiment is the AUC value, defined as the area under the ROC (receiver operating characteristic) curve; this value cannot exceed 1. The horizontal axis of the ROC curve is the false positive rate (FPR), and the vertical axis is the true positive rate (TPR). Since the ROC curve usually lies above the y = x line, the AUC value ranges from 0.5 to 1. We use the AUC value as the evaluation criterion because the ROC curve alone often cannot clearly indicate which classifier performs better, whereas as a single number, a larger AUC indicates a better classifier.
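Computing the AUC from predicted default probabilities is a one-liner; the five labels and scores below are illustrative. The AUC equals the fraction of (positive, negative) pairs the scores rank correctly, here 5 of 6:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]             # 1 = default
y_score = [0.1, 0.45, 0.35, 0.4, 0.8]  # predicted default probabilities
auc = roc_auc_score(y_true, y_score)   # 5 of 6 pairs ranked correctly
```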
We compared the random forest model with the XGBoost model, the logistic regression model, and the decision tree model. The results are as follows:
Algorithm | AUC value |
Random Forest | 0.86 |
XGBoost | 0.86 |
Logistic Regression | 0.80 |
Decision Tree | 0.80 |
According to the table, the AUC of the random forest is higher than the logistic regression and decision tree, which is very close to XGBoost. The higher the AUC, the better the prediction accuracy.
Evaluation of Feature Importance
The feature_importances_ attribute of sklearn.ensemble.RandomForestClassifier is used in this experiment. The importance of each feature is shown in Figure 15.
As can be seen from Figure 15, RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate are the three most important features. They have a great impact on whether a default eventually occurs, so we should pay special attention to these characteristics of borrowers when processing loan applications.
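Ranking features by feature_importances_ looks like this. The feature names are the dataset's, but the data below is synthetic and constructed so that the first feature dominates:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)        # only the first feature matters here
names = ["RevolvingUtilizationOfUnsecuredLines",
         "NumberOfTime30-59DaysPastDueNotWorse",
         "NumberOfTimes90DaysLate"]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair names with importances and sort, most important first.
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda p: p[1], reverse=True)
```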
Claims (1)
- What we claim is: 1. A method of prediction model based on random forest algorithm, characterized in that: business understanding: understanding commercial goals and commercial demands helps us turn them into a data mining problem; analyzing the historical loan data of banks and other financial institutions with the idea of unbalanced data classification and predicting the possibility of loan default are our primary business requirements; data preprocessing: before further work, filtering the required data and defining the meaning and characteristics of the data are necessary; random forest algorithm: the Random Forest algorithm builds a forest in a random manner; it is a combination learning algorithm based on decision trees; the basic idea of random forest is that during the construction of each single tree, some variables or features are randomly selected to participate in the tree node division, this is repeated multiple times, and the independence between the trees is guaranteed; after the random forest is obtained, when a new input sample enters, each decision tree in the forest judges the sample and outputs the class it belongs to, and the class with the highest number of votes in the entire forest is predicted as the class of the sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100709A AU2020100709A4 (en) | 2020-05-05 | 2020-05-05 | A method of prediction model based on random forest algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020100709A AU2020100709A4 (en) | 2020-05-05 | 2020-05-05 | A method of prediction model based on random forest algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2020100709A4 true AU2020100709A4 (en) | 2020-06-11 |
Family
ID=70976398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2020100709A Ceased AU2020100709A4 (en) | 2020-05-05 | 2020-05-05 | A method of prediction model based on random forest algorithm |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2020100709A4 (en) |
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753790A (en) * | 2020-07-01 | 2020-10-09 | 武汉楚精灵医疗科技有限公司 | Video classification method based on random forest algorithm |
CN111767958A (en) * | 2020-07-01 | 2020-10-13 | 武汉楚精灵医疗科技有限公司 | Real-time enteroscopy withdrawal time monitoring method based on random forest algorithm |
CN111798984A (en) * | 2020-07-07 | 2020-10-20 | 章越新 | Disease prediction scheme based on Fourier transform |
CN111951027A (en) * | 2020-08-14 | 2020-11-17 | 上海冰鉴信息科技有限公司 | Enterprise identification method and device with fraud risk |
CN111950795A (en) * | 2020-08-18 | 2020-11-17 | 安徽中烟工业有限责任公司 | Random forest based method for predicting water adding proportion of loosening and conditioning |
CN112100902A (en) * | 2020-08-10 | 2020-12-18 | 西安交通大学 | Lithium ion battery service life prediction method based on stream data |
CN112115991A (en) * | 2020-09-09 | 2020-12-22 | 福建新大陆软件工程有限公司 | Mobile terminal switching prediction method, device, equipment and readable storage medium |
CN112132187A (en) * | 2020-08-27 | 2020-12-25 | 上海大学 | Method for rapidly judging perovskite structure stability based on random forest |
CN112308146A (en) * | 2020-11-02 | 2021-02-02 | 国网福建省电力有限公司 | Distribution transformer fault identification method based on operation characteristics |
CN112465245A (en) * | 2020-12-04 | 2021-03-09 | 复旦大学青岛研究院 | Product quality prediction method for unbalanced data set |
CN112487262A (en) * | 2020-11-25 | 2021-03-12 | 建信金融科技有限责任公司 | Data processing method and device |
CN112530546A (en) * | 2020-12-14 | 2021-03-19 | 重庆邮电大学 | Psychological pre-judging method and system based on K-means clustering and XGboost algorithm |
CN112733903A (en) * | 2020-12-30 | 2021-04-30 | 许昌学院 | Air quality monitoring and alarming method, system, device and medium based on SVM-RF-DT combination |
CN112766550A (en) * | 2021-01-08 | 2021-05-07 | 佰聆数据股份有限公司 | Power failure sensitive user prediction method and system based on random forest, storage medium and computer equipment |
CN112883564A (en) * | 2021-02-01 | 2021-06-01 | 中国海洋大学 | Water body temperature prediction method and prediction system based on random forest |
CN112907359A (en) * | 2021-03-24 | 2021-06-04 | 四川奇力韦创新科技有限公司 | Bank loan business qualification auditing and risk control system and method |
CN112990284A (en) * | 2021-03-04 | 2021-06-18 | 安徽大学 | Individual trip behavior prediction method, system and terminal based on XGboost algorithm |
CN113127342A (en) * | 2021-03-30 | 2021-07-16 | 广东电网有限责任公司 | Defect prediction method and device based on power grid information system feature selection |
CN113139876A (en) * | 2021-04-22 | 2021-07-20 | 平安壹钱包电子商务有限公司 | Risk model training method and device, computer equipment and readable storage medium |
CN113159615A (en) * | 2021-05-10 | 2021-07-23 | 麦荣章 | Intelligent information security risk measuring system and method for industrial control system |
CN113205271A (en) * | 2021-05-12 | 2021-08-03 | 国家税务总局山东省税务局 | Method for evaluating enterprise income tax risk based on machine learning |
CN113221972A (en) * | 2021-04-26 | 2021-08-06 | 西安电子科技大学 | Unbalanced hyperspectral data classification method based on weighted depth random forest |
CN113282886A (en) * | 2021-05-26 | 2021-08-20 | 北京大唐神州科技有限公司 | Bank loan default judgment method based on logistic regression |
CN113326664A (en) * | 2021-06-28 | 2021-08-31 | 南京玻璃纤维研究设计院有限公司 | Method for predicting dielectric constant of glass based on M5P algorithm |
CN113392585A (en) * | 2021-06-10 | 2021-09-14 | 京师天启(北京)科技有限公司 | Method for spatializing sensitive people around polluted land |
CN113470819A (en) * | 2021-07-23 | 2021-10-01 | 湖南工商大学 | Early prediction method for adverse event of pressure sore of small unbalanced sample based on random forest |
CN113570191A (en) * | 2021-06-21 | 2021-10-29 | 天津大学 | Intelligent diagnosis method for river ice plug dangerous situations in ice flood |
CN113592058A (en) * | 2021-07-05 | 2021-11-02 | 西安邮电大学 | Method for quantitatively predicting microblog forwarding breadth and depth |
CN113658680A (en) * | 2021-07-29 | 2021-11-16 | 广西友迪资讯科技有限公司 | Random forest based method for evaluating withdrawal effect of drug addicts |
CN113657452A (en) * | 2021-07-20 | 2021-11-16 | 中国烟草总公司郑州烟草研究院 | Tobacco leaf quality grade classification prediction method based on principal component analysis and super learning |
CN113705904A (en) * | 2021-08-31 | 2021-11-26 | 国网上海市电力公司 | Chemical plant area power utilization fault prediction method based on random forest algorithm |
CN113822536A (en) * | 2021-08-26 | 2021-12-21 | 国网河北省电力有限公司邢台供电分公司 | Power distribution network index evaluation method based on branch definition algorithm |
CN113837863A (en) * | 2021-09-27 | 2021-12-24 | 上海冰鉴信息科技有限公司 | Business prediction model creation method and device and computer readable storage medium |
CN114154561A (en) * | 2021-11-15 | 2022-03-08 | 国家电网有限公司 | Electric power data management method based on natural language processing and random forest |
CN114426894A (en) * | 2020-09-29 | 2022-05-03 | 中国石油化工股份有限公司 | Natural gas hydrate phase equilibrium pressure prediction method based on machine learning |
CN114492929A (en) * | 2021-12-23 | 2022-05-13 | 江南大学 | XGboost-based financial credit enterprise credit prediction method |
CN114679779A (en) * | 2022-03-22 | 2022-06-28 | 安徽理工大学 | WIFI positioning method based on improved KNN fusion random forest algorithm |
CN114710326A (en) * | 2022-03-15 | 2022-07-05 | 国网甘肃省电力公司电力科学研究院 | Intrusion detection method based on GC-Forest |
CN115032720A (en) * | 2022-07-15 | 2022-09-09 | 国网上海市电力公司 | Application of multi-mode integrated forecast based on random forest in ground air temperature forecast |
CN115064263A (en) * | 2022-06-08 | 2022-09-16 | 华侨大学 | Alzheimer's disease prediction method based on random forest pruning brain region selection |
CN115907483A (en) * | 2023-01-06 | 2023-04-04 | 山东蜂鸟物联网技术有限公司 | Personnel risk assessment early warning method |
CN116090834A (en) * | 2023-03-07 | 2023-05-09 | 安徽农业大学 | Forestry management method and device based on Flink platform |
CN116226767A (en) * | 2023-05-08 | 2023-06-06 | 国网浙江省电力有限公司宁波供电公司 | Automatic diagnosis method for experimental data of power system |
CN116364178A (en) * | 2023-04-18 | 2023-06-30 | 哈尔滨星云生物信息技术开发有限公司 | Somatic cell sequence data classification method and related equipment |
CN116823014A (en) * | 2023-04-06 | 2023-09-29 | 南京邮电大学 | Method for realizing enterprise employee performance automatic scoring service |
CN116861800A (en) * | 2023-09-04 | 2023-10-10 | 青岛理工大学 | Oil well yield increasing measure optimization and effect prediction method based on deep learning |
CN117113291A (en) * | 2023-10-23 | 2023-11-24 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Analysis method for importance of production parameters in semiconductor manufacturing |
CN117150389A (en) * | 2023-07-14 | 2023-12-01 | 广州易尊网络科技股份有限公司 | Model training method, carrier card activation prediction method and equipment thereof |
CN117370899A (en) * | 2023-12-08 | 2024-01-09 | 中国地质大学(武汉) | Ore control factor weight determining method based on principal component-decision tree model |
CN117540830A (en) * | 2024-01-05 | 2024-02-09 | 中国地质科学院探矿工艺研究所 | Debris flow susceptibility prediction method, device and medium based on fault distribution index |
CN114679779B (en) * | 2022-03-22 | 2024-04-26 | 安徽理工大学 | WIFI positioning method based on improved KNN fusion random forest algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2020100709A4 (en) | A method of prediction model based on random forest algorithm | |
Khandani et al. | Consumer credit-risk models via machine-learning algorithms | |
AU2020101475A4 (en) | A Financial Data Analysis Method Based on Machine Learning Models | |
AU2019101189A4 (en) | A financial mining method for credit prediction | |
Van Thiel et al. | Artificial intelligence credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era | |
Zurada et al. | Comparison of the performance of several data mining methods for bad debt recovery in the healthcare industry | |
Chern et al. | A decision tree classifier for credit assessment problems in big data environments | |
Van Thiel et al. | Artificial intelligent credit risk prediction: An empirical study of analytical artificial intelligence tools for credit risk prediction in a digital era | |
CN113095927A (en) | Method and device for identifying suspicious transactions of anti-money laundering | |
Valavan et al. | Predictive-Analysis-based Machine Learning Model for Fraud Detection with Boosting Classifiers. | |
Hidayattullah et al. | Financial statement fraud detection in Indonesia listed companies using machine learning based on meta-heuristic optimization | |
Mirtalaei et al. | A trust-based bio-inspired approach for credit lending decisions | |
Koç et al. | Consumer loans' first payment default detection: a predictive model | |
Ke et al. | Loan repayment behavior prediction of provident fund users using a stacking-based model | |
Naik | Predicting credit risk for unsecured lending: A machine learning approach | |
Datkhile et al. | Statistical modelling on loan default prediction using different models | |
Becha et al. | Use of Machine Learning Techniques in Financial Forecasting | |
Dasari et al. | Prediction of bank loan status using machine learning algorithms | |
Jin et al. | Financial credit default forecast based on big data analysis | |
Zeng | A comparison study on the era of internet finance China construction of credit scoring system model | |
Panyagometh | Impact of baseline population on credit score’s predictive power | |
Zurada | Rule Induction Methods for Credit Scoring | |
Ruan et al. | Personal credit risk identification based on combined machine learning model | |
Niu et al. | Comparison of different individual credit risk assessment models | |
Salihu et al. | Data Mining Based Classifiers for Credit Risk Analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |