AU2019101197A4 - Method of analysis of bank customer churn based on random forest - Google Patents

Method of analysis of bank customer churn based on random forest Download PDF

Info

Publication number
AU2019101197A4
AU2019101197A4
Authority
AU
Australia
Prior art keywords
data
model
random forest
training
analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101197A
Inventor
Hanfei Gong
Huaping Hu
Yining Liu
Yundi Liu
Honghao ZHANG
Zhanyu ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gong Hanfei Miss
Hu Huaping Miss
Liu Yining Miss
Original Assignee
Gong Hanfei Miss
Hu Huaping Miss
Liu Yining Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gong Hanfei Miss, Hu Huaping Miss, Liu Yining Miss filed Critical Gong Hanfei Miss
Priority to AU2019101197A priority Critical patent/AU2019101197A4/en
Application granted granted Critical
Publication of AU2019101197A4 publication Critical patent/AU2019101197A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02Banking, e.g. interest calculation or account maintenance

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention lies in the field of data classification. It is a bank account loss recognition system with a wide range of models based on deep learning. The invention consists of the following procedures. Firstly, we preprocessed data via data normalization, missing value processing and also correlation analysis in order to make the implementation of following steps more convenient and efficient. Secondly, the data set having been selected and preprocessed is divided into training set and test set. At the third stage, we applied the training data on various models ranging from Random Forest, Decision Tree, Support vector Machine (SVM), Logistic Regression to Extreme Learning machines (ELM). During this process, we discovered the best functions for each model by using gradient descent and adjusting parameters of the network like base learning rate. Then, testing data is also applied on the each model that we have researched in order to test the accuracy of each.

Description

TITLE
Method of analysis of bank customer churn based on random forest
FIELD OF THE INVENTION
The invention lies in the field of data classification. It is a bank account loss recognition system with a wide range of models based on deep learning.
BACKGROUND
Nowadays, due to a wide spectrum of factors, such as the frequency with which customers experience frustrations and changes in their job status, bank loyalty is consistently declining, which makes banks less stable.
The results of Market Force Information's 2017 Customer Experiences and Competitive Benchmarks Study should serve as a call to action for banks across the country. Despite the best of intentions and millions of dollars invested in customer experience improvements, customer loyalty scores at traditional banks have declined across the board. Not surprisingly, the percentage of consumers who say they intend to switch banks in the next 6 months edged up to 14% in 2017.
2019101197 05 Dec 2019
With the rising probability of banking customer loss, it is crucial for commercial banks to make accurate predictions on how many and which group of banking customers will leave their banks and switch to others.
Consumers remain frustrated with megabanks, and those ready to switch are putting $649 billion in deposits and over $30 billion in revenues in jeopardy [2].
The loss of banking customers of some megabanks is a serious phenomenon since this may probably cause billions of dollars loss for banks. Therefore, our invention which measures the relationship among various variables and whether the banking customer will leave will greatly help banks to avoid risk and prevent losing huge amount of money.
In order to discover one of the most appropriate models to estimate banking customer loss, we applied the data to several different models including Random Forest, Decision Tree, Support Vector Machine (SVM), Logistic Regression and Extreme Learning Machines (ELM), since each model has its own advantages and drawbacks. For example, random forest can handle very high-dimensional (feature) data and helps select the more essential features, with extremely high generalization ability. Support Vector Machine can achieve much better results than other algorithms given a small training sample, hence its high efficiency. Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size) but also its direction of association (positive or negative), which other algorithms cannot express. Extreme Learning Machines offer a high capability of generalization and a fast speed of learning. Every coin has two sides, so we implement all the models above in order to obtain the most suitable algorithm for banking customer loss recognition.
In the end, we implement performance measurement based on the AUC (Area Under the Curve) of the ROC curve (Receiver Operating Characteristic), one of the most important evaluation metrics for checking any classification model's performance. The higher the AUC, the better the model is at distinguishing between bank account loss and maintenance. According to our experimental results, we found that random forest is the most appropriate model for bank account loss estimation.
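As an illustration of the AUC metric described above, the following sketch computes AUC via its rank (Mann-Whitney) interpretation: the probability that a randomly chosen churned sample is scored higher than a randomly chosen retained one. The labels and scores here are hypothetical, not the experiment's data.

```python
# Sketch: AUC as the probability that a random positive (churned) sample
# scores higher than a random negative (retained) sample.
# y_true and y_score are illustrative values, not the patent's data.

def auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # count pairs where the positive outranks the negative (ties count half)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # 1 = customer churned
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]   # predicted churn probabilities
print(auc(y_true, y_score))  # 0.9375
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why the text prefers the model with the higher AUC.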
SUMMARY
In order to study the situation and influencing factors of bank customer loss, and to make more accurate predictions based on relevant indicators to address the shortcomings of the existing banking system, this invention proposes Convolutional Neural Network, K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF) multiple-model reference schemes based on machine learning, using Python. After training the models and adjusting the parameters, the most suitable model is selected according to the confusion matrix, AUC, accuracy, and F1 value.
Our analysis framework for bank churn problems includes: browsing all data and analyzing the background of the project, processing the raw data and dividing it into training sets and test sets, building models (SVM, LR, RF, ELM), training the models, using the accuracy of the test set to adjust the parameters and calculating their respective confusion matrices, AUC values and F1 values, combining various indicators to select the optimal model, and finally performing model analysis.
In order to protect privacy, we first desensitize the original data, and then use the average value to fill the missing values in the data processing part. In addition, in order to prevent over-fitting, we regularize the data, and then perform feature selection to make the data clearer.
Before establishing the various models, we divide the processed data into training sets and test sets, and then import the training sets into the established models (SVM, LR, RF, ELM) for training; finally, according to the model predictions on the test set, the parameters are adjusted so that the model is best suited to our needs. In this process, we separately obtain the comprehensive evaluation indicators of each model: the confusion matrix, AUC value and F1 value, and compare them across models. Finally, we choose random forest as the optimal model and the one most consistent with the research topic of bank customer loss.
DESCRIPTION OF DRAWING
Figure 1 shows the procedure of our project.
DESCRIPTION OF PREFERRED EMBODIMENT
Introduction of principle of random forest
Machine learning can be divided into individual learning and ensemble learning. An individual learner is usually generated from the training data by using an existing algorithm. Ensemble learning, which uses a variety of learning algorithms, can obtain a better predictive performance than any single algorithm. In other words, we combine multiple hypotheses to form a (hopefully) better hypothesis.
Bagging is an ensemble method which adopts random sampling with replacement (so that each model in the ensemble has the same weight when voting) to obtain a training set. The individuals in the ensemble are then generated using this dataset. The differences among individuals are obtained by the Bootstrap re-sampling technique. This method is mainly used for learning algorithms that are unstable (small changes in the data set lead to big changes in the model), such as decision trees.
Random forest (RF) is a variation of bagging and an ensemble of decision trees. Many decision trees are assembled into a forest in a random manner, and each decision tree votes on the final category of a test sample during classification. Its composition can be represented as RF = Bagging + CART.
CART algorithm - classification tree
Random forest can be applied to classification and regression problems; here we introduce the Gini index calculation principle of the CART tree (classification tree), since a classification problem is studied.
The Gini index is the Gini impurity, which represents the probability that a randomly selected sample in the sample set is misclassified. The smaller the Gini index, the lower the probability that a selected sample is misclassified, which means a higher purity of the sample set, and vice versa.
This method can be represented by function:
Gini index (Gini impurity) = the probability for a sample to be selected * the probability for a sample to be misclassified
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2
*Description of this function: p_k represents the probability that the selected sample belongs to category k, and the probability that this sample is misclassified is (1 - p_k).
There are K categories in the sample set, and a randomly selected sample can belong to any of those K categories; as a result, we sum over the categories. For binary classification, Gini(p) = 2p(1 - p).
The Gini index of the sample set D: assume that there are K categories in the set, and let C_k be the subset of samples in D belonging to category k; then:

Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2
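The Gini formulas above can be sketched in Python as follows; the class counts are illustrative, not taken from the bank data.

```python
# Sketch of Gini(D) = 1 - sum((|C_k|/|D|)^2) given per-category counts.
# The counts below are illustrative examples.

def gini(counts):
    """Gini impurity of a sample set given per-category sample counts |C_k|."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Binary case reduces to Gini(p) = 2p(1-p): impurity is maximal at p = 0.5
print(gini([50, 50]))   # 0.5
print(gini([100, 0]))   # 0.0 (a pure set is never misclassified)
```

A CART split is chosen to minimize the weighted Gini impurity of the two child nodes, which is why a smaller index indicates a better (purer) split.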
The process of generation of the random forest
1) Choose a sample set
Assuming that the original sample set has n samples in total, n samples are extracted from the original sample set by Bootstrap (sampling with replacement) in each round to obtain a training set of size n. In this extraction process, some samples may be extracted repeatedly, while others are never extracted at all.
In total there are k rounds of extraction, so the training sets extracted are T_1, T_2, ..., T_k.
2) Generate decision trees
If there are D features in the feature space, then d features (d < D) are randomly selected from the D features to form a new feature set when generating the decision tree in each round, and this new feature set is used to generate the decision tree. There are k decision trees generated over the k rounds. Since these k decision trees are random in their selection of training sets and features, they are independent of each other.
3) Combined model
Since the k decision trees generated are independent of each other, and each decision tree is equally significant, their weights do not need to be considered when they are combined, or they can be considered to have the same weight. So for the classification problem, the votes of all decision trees are used to determine the final classification result.
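The three steps above (bootstrap sampling, random feature subsets, equal-weight voting) can be sketched as follows. This is a minimal illustration using scikit-learn decision trees on a synthetic dataset; the values of k and d and the dataset are hypothetical, not the patent's implementation.

```python
# Sketch of RF = Bagging + CART: k bootstrap training sets T_1..T_k,
# d randomly chosen features per tree, and a majority vote at prediction.
# Dataset and parameters are illustrative.
import random
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
k, d = 25, 3  # k trees; d features per tree, with d < D = 10

rng = random.Random(0)
forest = []
for _ in range(k):
    idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample T_i
    feats = rng.sample(range(X.shape[1]), d)              # random feature subset
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx][:, feats], y[idx])
    forest.append((tree, feats))

def predict(x):
    # every tree gets one equal-weight vote on the final category
    votes = [tree.predict(x[feats].reshape(1, -1))[0] for tree, feats in forest]
    return Counter(votes).most_common(1)[0][0]

print(predict(X[0]))
```

In practice one would use `sklearn.ensemble.RandomForestClassifier`, which implements the same bootstrap-plus-feature-subsampling scheme internally.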
Validation of the model
The validation of the model requires a validation set, but here we do not need to obtain an additional validation set; we just need to select samples from the original sample set that have not been used. When selecting the training set from the original samples, some samples are never selected, and some features are never used during feature selection. We only need to verify the final model using these unused data.
About bootstrap sampling:
Assuming that there are n samples in the sample set, the probability of picking any one of these samples is 1/n, and the probability of a sample not being picked in one draw is (1 - 1/n). Since the sampling is with replacement, the draws are independent of each other. Therefore, over n consecutive draws, the probability of a sample never being selected is (1 - 1/n)^n.
By the important limit:

\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = e
So, as n approaches infinity, the limit of (1 - 1/n)^n is 1/e (about 0.368). If the sample size is large, the probability of each sample never being selected in the whole process is about 0.368. As the samples in the sample set are independent of each other, about 36.8% of all samples are never selected. These samples are called out-of-bag (OOB) data and can be used for testing.
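A quick numerical check of this limit (purely illustrative) shows (1 - 1/n)^n converging to 1/e as n grows:

```python
# Sketch: the probability that a sample is never picked in n bootstrap
# draws, (1 - 1/n)^n, approaches 1/e ≈ 0.368 as n grows.
import math

for n in (10, 100, 10000):
    print(n, (1 - 1 / n) ** n)

print("1/e =", 1 / math.e)  # ≈ 0.3679
```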
In addition, there are two parameters that need to be controlled manually. One is the number of trees in the random forest (most of the time it is quite large); the other is the size of d (a recommended value is the square root of D).
Procedure
1. Data pre-treatment: firstly, we deleted the rows which contain more than five missing values, then we counted the rows and columns (288 and 249 respectively). Then we computed the mean, standard deviation, maximum and minimum values of the data (for instance, -0.483228, 112.491521, 81.357000, -890.000000 respectively for feature J137).
2. Data processing: we filled the missing values with mean values, and normalized the data using Python.
3. Split the data into training data and test data: we ran feature selection and split the data into two sets, training data and test data.
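Steps 1-3 can be sketched with pandas and scikit-learn; the column names and tiny DataFrame below are hypothetical stand-ins for the real bank data.

```python
# Sketch of the pre-processing steps: mean imputation, normalization,
# and a train/test split. The DataFrame and columns are illustrative.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "balance": [1200.0, np.nan, 560.0, 3300.0, np.nan, 90.0],
    "tenure":  [3.0, 7.0, np.nan, 2.0, 5.0, 1.0],
    "churned": [0, 1, 0, 0, 1, 1],
})

X = df.drop(columns="churned")
X = X.fillna(X.mean())  # fill missing values with each column's mean
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)  # normalize

X_train, X_test, y_train, y_test = train_test_split(
    X, df["churned"], test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)
```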
4. We set up several models (SVM, LR, RF, ELM).
5. Several models were trained, and accuracy and AUC score (test) were obtained: random forest (74.28%, 0.693677), Logistic regression (64.98%, 0.72), Support vector machine (67.51%, 0.7059). Parameters were adjusted to reach the requirement.
6. We picked the best model: Random forest.
7. We predicted the test data using Random Forest.
8. We analyzed the model we chose, and prediction results were obtained.
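Steps 4-8 can be sketched as follows. This is a minimal illustration on a synthetic imbalanced dataset; ELM is omitted since scikit-learn has no built-in implementation, and none of the numbers stand in for the patent's actual results.

```python
# Sketch: train several models on one split and compare accuracy, AUC,
# F1 and the confusion matrix to pick the best. Dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# imbalanced classes, roughly mimicking the churned/retained skew
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    score = model.predict_proba(X_te)[:, 1]
    print(name,
          f"acc={accuracy_score(y_te, pred):.3f}",
          f"auc={roc_auc_score(y_te, score):.3f}",
          f"f1={f1_score(y_te, pred):.3f}",
          confusion_matrix(y_te, pred).tolist())
```

The model with the best combination of indicators (here compared on the same held-out test set) would then be selected, mirroring the choice of random forest above.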
Testing result
No. | Model | Accuracy (%) | AUC (Test) | F1 | Precision | Recall | Confusion matrix
1 | Random forest | 74.28 | 0.693677 | 0.671674621 | 0.651 | 0.693 | [[9239 2633] [1224 1903]]
2 | Logistic regression (LR) | 64.98 | 0.72 | 0.443007104 | 0.3311 | 0.6693 | [[7657 4221] [1032 2089]]
3 | Support vector machine (SVM) | 67.51 | 0.705883131 | 0.626519246 | 0.606 | 0.649 | [[8240 3638] [1235 1886]]

Claims (3)

1. Method of analysis of bank customer churn based on random forest, wherein, in order to deal with sample imbalance, new data is randomly generated by copying existing data to improve the accuracy of prediction, and the analysis framework for bank churn problems includes: browsing all data and analyzing the background of the project, processing the raw data and dividing it into training sets and test sets, building models, training the models, using the accuracy of the test set to adjust the parameters and calculate the respective confusion matrix, AUC value and F1 value, combining various indicators to select the optimal model, and finally performing model analysis.
2. The method according to claim 1, wherein, in order to protect private data, the original data is desensitized, and the average value is then used to fill the missing values in the data processing part.
3. The method according to claim 1, wherein, to prevent over-fitting, the data is regularized, and feature selection is then performed to make the data clearer.
AU2019101197A 2019-10-03 2019-10-03 Method of analysis of bank customer churn based on random forest Ceased AU2019101197A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101197A AU2019101197A4 (en) 2019-10-03 2019-10-03 Method of analysis of bank customer churn based on random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101197A AU2019101197A4 (en) 2019-10-03 2019-10-03 Method of analysis of bank customer churn based on random forest

Publications (1)

Publication Number Publication Date
AU2019101197A4 true AU2019101197A4 (en) 2020-01-23

Family

ID=69160468

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101197A Ceased AU2019101197A4 (en) 2019-10-03 2019-10-03 Method of analysis of bank customer churn based on random forest

Country Status (1)

Country Link
AU (1) AU2019101197A4 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113410861A (en) * 2020-03-17 2021-09-17 内蒙古电力(集团)有限责任公司内蒙古电力科学研究院分公司 Droop control parameter optimization method suitable for multi-terminal flexible direct current system
CN113591911A (en) * 2021-06-25 2021-11-02 南京财经大学 Cascade multi-class abnormity identification method in logistics transportation process of bulk grain container


Similar Documents

Publication Publication Date Title
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
CN108898479B (en) Credit evaluation model construction method and device
CN108351985A (en) Method and apparatus for large-scale machines study
Boonpeng et al. Decision support system for investing in stock market by using OAA-neural network
CN111080442A (en) Credit scoring model construction method, device, equipment and storage medium
CN110059852A (en) A kind of stock yield prediction technique based on improvement random forests algorithm
AU2019101197A4 (en) Method of analysis of bank customer churn based on random forest
CN110135167A (en) A kind of edge calculations terminal security grade appraisal procedure of random forest
CN104809476B (en) A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition
CN108874889A (en) Objective body search method, system and device based on objective body image
CN110930038A (en) Loan demand identification method, loan demand identification device, loan demand identification terminal and loan demand identification storage medium
CN112001788A (en) Credit card default fraud identification method based on RF-DBSCAN algorithm
CN104850868A (en) Customer segmentation method based on k-means and neural network cluster
CN109472453A (en) Power consumer credit assessment method based on global optimum&#39;s fuzzy kernel clustering model
Pambudi et al. Improving money laundering detection using optimized support vector machine
Dhonge et al. IPL cricket score and winning prediction using machine learning techniques
CN110324178B (en) Network intrusion detection method based on multi-experience nuclear learning
De Melo Junior et al. An empirical comparison of classification algorithms for imbalanced credit scoring datasets
CN108737429B (en) Network intrusion detection method
CN114519508A (en) Credit risk assessment method based on time sequence deep learning and legal document information
CN106161458B (en) Network inbreak detection method based on double online extreme learning machines of weighting
CN116702132A (en) Network intrusion detection method and system
CN117035983A (en) Method and device for determining credit risk level, storage medium and electronic equipment
CN111899092B (en) Business data screening method and device based on two-channel model
CN109508350A (en) The method and apparatus that a kind of pair of data are sampled

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry