AU2019101197A4 - Method of analysis of bank customer churn based on random forest - Google Patents
- Publication number
- AU2019101197A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- model
- random forest
- training
- analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Landscapes
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Complex Calculations (AREA)
Abstract
The invention lies in the field of data classification. It is a bank customer churn (account loss) recognition system employing a wide range of machine-learning models. The invention consists of the following procedures. First, we preprocess the data via normalization, missing-value processing and correlation analysis, to make the following steps more convenient and efficient. Second, the selected and preprocessed data set is divided into a training set and a test set. Third, we apply the training data to various models, ranging from Random Forest, Decision Tree, Support Vector Machine (SVM) and Logistic Regression to Extreme Learning Machine (ELM). During this process, we search for the best configuration of each model by using gradient descent and adjusting parameters such as the base learning rate. Finally, the test data is applied to each of the researched models to measure its accuracy.
Description
TITLE
Method of analysis of bank customer churn based on random forest
FIELD OF THE INVENTION
The invention lies in the field of data classification. It is a bank customer churn (account loss) recognition system employing a wide range of machine-learning models.
BACKGROUND
Nowadays, due to a wide spectrum of factors, such as the frequency with which customers experience frustrations and changes in their job status, bank loyalty is consistently declining, which makes banks less stable.
The results of Market Force Information's 2017 Customer Experiences and Competitive Benchmarks Study should serve as a call to action for banks across the country. Despite the best of intentions and millions of dollars invested in customer experience improvements, customer loyalty scores at traditional banks have declined across the board. Not surprisingly, the percentage of consumers who say they intend to switch banks in the next 6 months edged up to 14% in 2017.
2019101197 05 Dec 2019
With the rising probability of banking customer loss, it is crucial for commercial banks to make accurate predictions on how many and which group of banking customers will leave their banks and switch to others.
Consumers remain frustrated with megabanks, and those ready to switch are putting $649 billion in deposits and over $30 billion in revenues in jeopardy [2].
The loss of banking customers of some megabanks is a serious phenomenon since this may probably cause billions of dollars loss for banks. Therefore, our invention which measures the relationship among various variables and whether the banking customer will leave will greatly help banks to avoid risk and prevent losing huge amount of money.
In order to discover one of the most appropriate models for estimating banking customer loss, we applied the data to several different models, including Random Forest, Decision Tree, Support Vector Machine (SVM), Logistic Regression and Extreme Learning Machine (ELM), since each model has its own advantages and drawbacks. For example, random forest can handle very high-dimensional (feature) data and helps select the most essential features, with extremely high generalization ability. Support Vector Machine can achieve much better results than other algorithms given a small training sample, and hence high efficiency. Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size) but also its direction of association (positive or negative), which other algorithms cannot express. Extreme Learning Machine, in turn, has high generalization capability and learning speed. Every coin has two sides, so we implement all the models above in order to obtain the most suitable algorithm for banking customer loss recognition.
In the end, we implement performance measurement based on the AUC (Area Under the Curve) of the ROC curve (Receiver Operating Characteristic), one of the most important evaluation metrics for checking any classification model's performance. The higher the AUC, the better the model is at distinguishing between bank account loss and maintenance. According to our experimental results, we found that random forest is the most appropriate model for bank account loss estimation.
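The AUC metric described above can be illustrated with a short, self-contained sketch (pure Python, not taken from the patent's implementation): AUC equals the probability that a randomly chosen positive (churned) sample is ranked above a randomly chosen negative one, with ties counted as half. The labels and scores below are hypothetical.

```python
# AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
# pairs in which the positive sample receives the higher score.
def auc_score(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical churn labels (1 = account lost) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(round(auc_score(y_true, y_score), 3))  # → 0.889
```

A perfect ranking gives AUC = 1.0 and a random one about 0.5, which is why the patent compares models on this scale.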
SUMMARY
In order to study the situation and influencing factors of bank customer loss, and to make more accurate predictions based on relevant indicators so as to address shortcomings in the existing banking system, the invention proposes multiple model reference schemes based on machine learning and implemented in Python, covering Convolutional Neural Network, K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF). After training the models and adjusting the parameters manually, the most suitable model is selected according to the confusion matrix, AUC, accuracy, and F1 value.
Our analysis framework for bank churn problems includes: browsing all data and analyzing the background of the project; processing the raw data and dividing it into training sets and test sets; building models (SVM, LR, RF, ELM); training the data; using the accuracy on the test set to adjust the parameters and to calculate each model's confusion matrix, AUC value and F1 value; combining these indicators to select the optimal model; and finally analyzing the model.
In order to protect private data, we first desensitize the original data, and then use the average value to fill the missing values in the data processing part. In addition, in order to prevent over-fitting, we regularize the data and then perform feature selection to make the data clearer.
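The mean-value imputation described above, together with the normalization mentioned in the Procedure section, can be sketched as follows. This is a minimal illustration in pure Python; the column name and values are hypothetical, not from the patent's bank dataset.

```python
# Mean imputation: replace missing entries (None) with the column mean
# computed over the known values.
def impute_mean(col):
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in col]

# Min-max normalization: map a column linearly into [0, 1].
def normalize(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

balance = [100.0, None, 300.0, 200.0]   # hypothetical account balances
filled = impute_mean(balance)            # None becomes 200.0 (the mean)
scaled = normalize(filled)               # → [0.0, 0.5, 1.0, 0.5]
print(filled, scaled)
```

In practice a library such as pandas or scikit-learn would do both steps, but the arithmetic is exactly this.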
Before establishing the various models, we divide the processed data into training sets and test sets, and then import the training sets into the established models (SVM, LR, RF, ELM) for training. Finally, according to the models' predictions on the test set, the parameters are adjusted so that each model best suits our needs. In this process, we separately obtain the comprehensive evaluation indicators of each model (confusion matrix, AUC value and F1 value) and compare them across models. Finally, we choose random forest as the optimal model and the one most consistent with the research topic of bank customer loss.
DESCRIPTION OF DRAWING
Figure 1 shows the procedure of our project.
DESCRIPTION OF PREFERRED EMBODIMENT
Introduction to the principle of random forest
Machine learning can be divided into individual learning and ensemble learning. An individual learner is usually generated from the training data by an existing algorithm. Ensemble learning, which combines a variety of learning algorithms, can obtain better predictive performance than any single algorithm alone. In other words, we combine multiple hypotheses to form a (hopefully) better hypothesis.
Bagging is an ensemble method which adopts random sampling with replacement (so that each model in the ensemble has the same weight when voting) to obtain a training set. The individuals in the ensemble are then generated from this data set; the differences among individuals are obtained by the bootstrap re-sampling technique. This method is mainly used for learning algorithms that are unstable (small changes in the data set lead to big changes in the model), such as decision trees.
Random forest (RF) is a variation of bagging and an ensemble of decision trees. Many decision trees are assembled into a forest in a random manner, and each decision tree votes on the final category of test samples during classification. Its composition can be represented as RF = Bagging + CART.
CART algorithm - classification tree
Random forest can be applied to classification and regression problems; here we introduce the Gini index calculation principle of the CART tree (classification tree), since a classification problem is studied.
The Gini index is the Gini impurity, which represents the probability that a randomly selected sample in the sample set is misclassified. The smaller the Gini index, the lower the probability that a selected sample is misclassified, and hence the higher the purity of the sample set, and vice versa.
This method can be represented by function:
Gini index (Gini impurity) = the probability for a sample to be selected * the probability for a sample to be misclassified
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 *
*Description of this function
p_k represents the probability that the selected sample belongs to category k, and the probability that this sample is misclassified is (1 - p_k).
There are K categories in the sample set, and a randomly selected sample can belong to any of those K categories; as a result, we sum over the categories. For binary classification, Gini(p) = 2p(1 - p).
The Gini index of the sample set D: assume that there are K categories in the set; then:
Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2

where C_k is the subset of samples in D that belong to category k.
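The Gini impurity defined above is straightforward to compute from a list of class labels; the following sketch (illustrative, not the patent's code) counts class frequencies and applies Gini(D) = 1 - Σ p_k².

```python
# Gini impurity of a labeled sample set: 1 minus the sum of squared
# class proportions. 0.0 means a pure set; values grow with mixing.
def gini(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # balanced binary set → 0.5, i.e. 2p(1-p) at p=0.5
print(gini([1, 1, 1, 1]))  # pure set → 0.0
```

A CART split is chosen to minimize the weighted Gini impurity of the two child sets, which is why lower values correspond to purer partitions.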
The process of generation of the random forest
1) Choose a sample set
Assuming that the original sample set has n samples in total, n samples are extracted from it by bootstrap sampling (sampling with replacement) in each round to obtain a training set of size n. During this extraction, some samples may be drawn repeatedly, while others are never drawn at all.
In total there are k rounds of extraction, so the training sets extracted in the rounds are T1, T2, ..., Tk.
2) Generate decision trees
If there are D features in the feature space, then d features (d < D) are randomly selected from the D features to form a new feature set in each round of decision-tree generation, and this new feature set is used to generate the decision tree. k decision trees are generated over the k rounds. Since these k decision trees are random in their selection of training sets and features, they are independent of each other.
3) Combined model
Since the k decision trees generated are independent of each other and the significance of each decision tree is equal, their weights need not be considered when they are combined; equivalently, they can be considered to have the same weight. For the classification problem, the votes of all decision trees therefore determine the final classification result.
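The three steps above (bootstrap sampling, a random feature subset of size d per tree, and equal-weight majority voting) are exactly what scikit-learn's `RandomForestClassifier` implements. The patent does not name an implementation, so the following is a sketch under that assumption, on synthetic data rather than the patent's bank dataset.

```python
# Random forest = bagging over CART trees with per-split feature subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # k: number of trees / bootstrap rounds
    max_features="sqrt",   # d ≈ sqrt(D) features considered per split
    bootstrap=True,        # draw n samples with replacement per tree
    oob_score=True,        # validate on the ~36.8% out-of-bag samples
    random_state=0,
).fit(X, y)

print(forest.oob_score_)   # out-of-bag accuracy estimate
```

Setting `oob_score=True` gives the built-in validation discussed in the next subsection, with no separate validation set required.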
Validation of the model
Validation of the model requires a validation set, but here we do not need to obtain an additional one; we simply use samples from the original sample set that have not been used. When selecting the training set from the original samples, some samples are never selected, and likewise some features are never used during feature selection. We only need to verify the final model using these unused data.
About bootstrap sampling:
Assuming there are n samples in the sample set, the probability of picking any one of them in a single draw is 1/n, and the probability of a sample not being picked in a single draw is (1 - 1/n). Since sampling is with replacement, the draws are independent of each other. Therefore, over n consecutive draws, the probability that a given sample is never selected is (1 - 1/n)^n.
By important limit:
\lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^n = e
So, as n approaches infinity, the limit of this value is 1/e (about 0.368). If the sample size is large, the probability of each sample not being selected in the whole process is about 0.368. As the samples in the sample set are independent of each other, about 36.8% of all samples are never selected. These samples are called out-of-bag (OOB) data and can be used for testing.
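The convergence of (1 - 1/n)^n to 1/e claimed above can be checked numerically in a few lines (an illustrative check, not part of the patented method):

```python
# The probability that a given sample is missed by all n bootstrap draws
# is (1 - 1/n)^n, which tends to 1/e ≈ 0.3679 as n grows.
import math

for n in (10, 100, 10000):
    p_missed = (1 - 1 / n) ** n
    print(n, round(p_missed, 4))

print(round(1 / math.e, 4))  # → 0.3679
```

Already at n = 10 the value is about 0.349, and by n = 10000 it matches 1/e to four decimal places, which justifies treating roughly 36.8% of the data as out-of-bag.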
In addition, there are two parameters that need to be controlled manually. One is the number of trees in the random forest (most of the time it is quite large); the other is the size of d (the recommended value of d is the square root of D).
Procedure
1. Data pre-treatment: first, we deleted the rows containing more than five missing values, then counted the rows and columns (288 and 249, respectively). We then computed the mean, standard deviation, maximum and minimum of the data (for instance, -0.483228, 112.491521, 81.357000 and -890.000000, respectively, for feature J137).
2. Data processing: we filled the missing values with mean values, and normalized the data using Python.
3. Split the data into training and test data: we ran feature selection and split the data into two sets, training data and test data.
4. We set up several models (SVM, LR, RF, ELM).
5. The models were trained and their accuracy and AUC score (test) obtained: random forest (74.28%, 0.693677), logistic regression (64.98%, 0.72), support vector machine (67.51%, 0.7059). Parameters were adjusted to reach the requirement.
6. We picked the best model: Random forest.
7. We predicted the test data using Random Forest.
8. We analyzed the chosen model and obtained the prediction results.
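Steps 3 to 6 of the procedure above can be sketched as a small scikit-learn pipeline. This is an assumed implementation on synthetic data, with model choices mirroring the patent's list (SVM, LR, RF) but not its exact settings or dataset.

```python
# Split the data, train several models, and pick the one with the best
# test-set AUC, as in the procedure's model-comparison steps.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]   # churn probability
    scores[name] = roc_auc_score(y_te, proba)

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On the patent's bank data this comparison, combined with accuracy, F1 and the confusion matrix, led to random forest being selected.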
Testing result
No. | Model | Accuracy (%) | AUC (Test) | F1 | Precision | Recall | Confusion matrix |
---|---|---|---|---|---|---|---|
1 | Random forest | 74.28 | 0.693677 | 0.671674621 | 0.651 | 0.693 | [[9239 2633] [1224 1903]] |
2 | Logistic regression (LR) | 64.98 | 0.72 | 0.443007104 | 0.3311 | 0.6693 | [[7657 4221] [1032 2089]] |
3 | Support vector machine (SVM) | 67.51 | 0.705883131 | 0.626519246 | 0.606 | 0.649 | [[8240 3638] [1235 1886]] |
Claims (3)
1. Method of analysis of bank customer churn based on random forest, wherein, in order to deal with sample imbalance, new data are randomly generated by copying existing data to improve prediction accuracy, and the analysis framework for bank churn problems includes: browsing all data and analyzing the background of the project; processing the raw data and dividing it into training sets and test sets; building models; training the data; using the accuracy on the test set to adjust the parameters and to calculate each model's confusion matrix, AUC value and F1 value; combining these indicators to select the optimal model; and finally analyzing the model.
2. The method according to claim 1, wherein, in order to protect private data, the original data are desensitized, and the average value is then used to fill the missing values in the data processing part.
3. The method according to claim 1, wherein, to prevent over-fitting, the data are regularized, and feature selection is then performed to make the data clearer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101197A AU2019101197A4 (en) | 2019-10-03 | 2019-10-03 | Method of analysis of bank customer churn based on random forest |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101197A AU2019101197A4 (en) | 2019-10-03 | 2019-10-03 | Method of analysis of bank customer churn based on random forest |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2019101197A4 true AU2019101197A4 (en) | 2020-01-23 |
Family
ID=69160468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2019101197A Ceased AU2019101197A4 (en) | 2019-10-03 | 2019-10-03 | Method of analysis of bank customer churn based on random forest |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2019101197A4 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113410861A (en) * | 2020-03-17 | 2021-09-17 | 内蒙古电力(集团)有限责任公司内蒙古电力科学研究院分公司 | Droop control parameter optimization method suitable for multi-terminal flexible direct current system |
CN113591911A (en) * | 2021-06-25 | 2021-11-02 | 南京财经大学 | Cascade multi-class abnormity identification method in logistics transportation process of bulk grain container |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019179403A1 (en) | Fraud transaction detection method based on sequence width depth learning | |
CN108898479B (en) | Credit evaluation model construction method and device | |
CN108351985A (en) | Method and apparatus for large-scale machines study | |
Boonpeng et al. | Decision support system for investing in stock market by using OAA-neural network | |
CN111080442A (en) | Credit scoring model construction method, device, equipment and storage medium | |
CN110059852A (en) | A kind of stock yield prediction technique based on improvement random forests algorithm | |
AU2019101197A4 (en) | Method of analysis of bank customer churn based on random forest | |
CN110135167A (en) | A kind of edge calculations terminal security grade appraisal procedure of random forest | |
CN104809476B (en) | A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition | |
CN108874889A (en) | Objective body search method, system and device based on objective body image | |
CN110930038A (en) | Loan demand identification method, loan demand identification device, loan demand identification terminal and loan demand identification storage medium | |
CN112001788A (en) | Credit card default fraud identification method based on RF-DBSCAN algorithm | |
CN104850868A (en) | Customer segmentation method based on k-means and neural network cluster | |
CN109472453A (en) | Power consumer credit assessment method based on global optimum's fuzzy kernel clustering model | |
Pambudi et al. | Improving money laundering detection using optimized support vector machine | |
Dhonge et al. | IPL cricket score and winning prediction using machine learning techniques | |
CN110324178B (en) | Network intrusion detection method based on multi-experience nuclear learning | |
De Melo Junior et al. | An empirical comparison of classification algorithms for imbalanced credit scoring datasets | |
CN108737429B (en) | Network intrusion detection method | |
CN114519508A (en) | Credit risk assessment method based on time sequence deep learning and legal document information | |
CN106161458B (en) | Network inbreak detection method based on double online extreme learning machines of weighting | |
CN116702132A (en) | Network intrusion detection method and system | |
CN117035983A (en) | Method and device for determining credit risk level, storage medium and electronic equipment | |
CN111899092B (en) | Business data screening method and device based on two-channel model | |
CN109508350A (en) | The method and apparatus that a kind of pair of data are sampled |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |