AU2019101197A4 - Method of analysis of bank customer churn based on random forest - Google Patents
- Publication number
- AU2019101197A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- model
- random forest
- training
- analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Landscapes
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Complex Calculations (AREA)
Abstract
The invention lies in the field of data classification. It is a bank customer churn (account loss) recognition system employing a wide range of machine-learning models. The invention consists of the following procedures. First, we preprocess the data via normalization, missing-value processing and correlation analysis, to make the following steps more convenient and efficient. Second, the selected and preprocessed data set is divided into a training set and a test set. Third, we apply the training data to various models, ranging from Random Forest, Decision Tree, Support Vector Machine (SVM) and Logistic Regression to Extreme Learning Machine (ELM). During this process, we search for the best configuration of each model by using gradient descent and adjusting parameters such as the base learning rate. Finally, the test data is applied to each of the researched models to measure its accuracy.
Description
TITLE
Method of analysis of bank customer churn based on random forest
FIELD OF THE INVENTION
The invention lies in the field of data classification. It is a bank customer churn (account loss) recognition system employing a wide range of machine-learning models.
BACKGROUND
Nowadays, due to a wide spectrum of factors, such as the frequency with which customers experience frustrations and changes in their job status, bank loyalty is consistently declining, which makes banks less stable.
The results of Market Force Information's 2017 Customer Experiences and Competitive Benchmarks Study should serve as a call to action for banks across the country. Despite the best of intentions and millions of dollars invested in customer experience improvements, customer loyalty scores at traditional banks have declined across the board. Not surprisingly, the percentage of consumers who say they intend to switch banks in the next 6 months edged up to 14% in 2017.
2019101197 05 Dec 2019
With the rising probability of banking customer loss, it is crucial for commercial banks to make accurate predictions on how many and which group of banking customers will leave their banks and switch to others.
Consumers remain frustrated with megabanks, and those ready to switch are putting $649 billion in deposits and over $30 billion in revenues in jeopardy [2].
The loss of banking customers of some megabanks is a serious phenomenon since this may probably cause billions of dollars loss for banks. Therefore, our invention which measures the relationship among various variables and whether the banking customer will leave will greatly help banks to avoid risk and prevent losing huge amount of money.
In order to discover one of the most appropriate models for estimating banking customer loss, we applied the data to several different models, including Random Forest, Decision Tree, Support Vector Machine (SVM), Logistic Regression and Extreme Learning Machine (ELM), since each model has its own advantages and drawbacks. For example, random forest can handle very high-dimensional (feature) data and helps select the most essential features, with extremely high generalization ability. Support Vector Machine can achieve much better results than other algorithms given a small training sample, and hence high efficiency. Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size) but also its direction of association (positive or negative), which other algorithms cannot express. Extreme Learning Machine, in turn, has high generalization capability and learning speed. Every coin has two sides, so we implement all the models above in order to obtain the most suitable algorithm for banking customer loss recognition.
In the end, we implement performance measurement based on the AUC (Area Under the Curve) of the ROC curve (Receiver Operating Characteristic), one of the most important evaluation metrics for checking any classification model's performance. The higher the AUC, the better the model is at distinguishing between bank account loss and maintenance. According to our experimental results, we found that random forest is the most appropriate model for bank account loss estimation.
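The AUC metric described above can be illustrated with a short, self-contained sketch (pure Python, not taken from the patent's implementation): AUC equals the probability that a randomly chosen positive (churned) sample is ranked above a randomly chosen negative one, with ties counted as half. The labels and scores below are hypothetical.

```python
# AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
# pairs in which the positive sample receives the higher score.
def auc_score(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical churn labels (1 = account lost) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(round(auc_score(y_true, y_score), 3))  # → 0.889
```

A perfect ranking gives AUC = 1.0 and a random one about 0.5, which is why the patent compares models on this scale.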
SUMMARY
In order to study the situation and influencing factors of bank customer loss, and to make more accurate predictions based on relevant indicators so as to address shortcomings in the existing banking system, the invention proposes multiple model reference schemes based on machine learning and implemented in Python, covering Convolutional Neural Network, K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF). After training the models and adjusting the parameters manually, the most suitable model is selected according to the confusion matrix, AUC, accuracy, and F1 value.
Our analysis framework for bank churn problems includes: browsing all data and analyzing the background of the project; processing the raw data and dividing it into training sets and test sets; building models (SVM, LR, RF, ELM); training the data; using the accuracy on the test set to adjust the parameters and to calculate each model's confusion matrix, AUC value and F1 value; combining these indicators to select the optimal model; and finally analyzing the model.
In order to protect private data, we first desensitize the original data, and then use the average value to fill the missing values in the data processing part. In addition, in order to prevent over-fitting, we regularize the data and then perform feature selection to make the data clearer.
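The mean-value imputation described above, together with the normalization mentioned in the Procedure section, can be sketched as follows. This is a minimal illustration in pure Python; the column name and values are hypothetical, not from the patent's bank dataset.

```python
# Mean imputation: replace missing entries (None) with the column mean
# computed over the known values.
def impute_mean(col):
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in col]

# Min-max normalization: map a column linearly into [0, 1].
def normalize(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

balance = [100.0, None, 300.0, 200.0]   # hypothetical account balances
filled = impute_mean(balance)            # None becomes 200.0 (the mean)
scaled = normalize(filled)               # → [0.0, 0.5, 1.0, 0.5]
print(filled, scaled)
```

In practice a library such as pandas or scikit-learn would do both steps, but the arithmetic is exactly this.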
Before establishing the various models, we divide the processed data into training sets and test sets, and then import the training sets into the established models (SVM, LR, RF, ELM) for training. Finally, according to the models' predictions on the test set, the parameters are adjusted so that each model best suits our needs. In this process, we separately obtain the comprehensive evaluation indicators of each model (confusion matrix, AUC value and F1 value) and compare them across models. Finally, we choose random forest as the optimal model and the one most consistent with the research topic of bank customer loss.
DESCRIPTION OF DRAWING
Figure 1 shows the procedure of our project.
DESCRIPTION OF PREFERRED EMBODIMENT
Introduction to the principle of random forest
Machine learning can be divided into individual learning and ensemble learning. An individual learner is usually generated from the training data by an existing algorithm. Ensemble learning, which combines a variety of learning algorithms, can obtain better predictive performance than any single algorithm alone. In other words, we combine multiple hypotheses to form a (hopefully) better hypothesis.
Bagging is an ensemble method which adopts random sampling with replacement (so that each model in the ensemble has the same weight when voting) to obtain a training set. The individuals in the ensemble are then generated from this data set; the differences among individuals are obtained by the bootstrap re-sampling technique. This method is mainly used for learning algorithms that are unstable (small changes in the data set lead to big changes in the model), such as decision trees.
Random forest (RF) is a variation of bagging and an ensemble of decision trees. Many decision trees are assembled into a forest in a random manner, and each decision tree votes on the final category of test samples during classification. Its composition can be represented as RF = Bagging + CART.
CART algorithm - classification tree
Random forest can be applied to classification and regression problems; here we introduce the Gini index calculation principle of the CART tree (classification tree), since a classification problem is studied.
The Gini index is the Gini impurity, which represents the probability that a randomly selected sample in the sample set is misclassified. The smaller the Gini index, the lower the probability that a selected sample is misclassified, and hence the higher the purity of the sample set, and vice versa.
This method can be represented by function:
Gini index (Gini impurity) = the probability for a sample to be selected * the probability for a sample to be misclassified
Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 *
*Description of this function
p_k represents the probability that the selected sample belongs to category k, and the probability that this sample is misclassified is (1 - p_k).
There are K categories in the sample set, and a randomly selected sample can belong to any of those K categories; as a result, we sum over the categories. For binary classification, Gini(p) = 2p(1 - p).
The Gini index of the sample set D: assume that there are K categories in the set; then:
Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2

where C_k is the subset of samples in D that belong to category k.
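The Gini impurity defined above is straightforward to compute from a list of class labels; the following sketch (illustrative, not the patent's code) counts class frequencies and applies Gini(D) = 1 - Σ p_k².

```python
# Gini impurity of a labeled sample set: 1 minus the sum of squared
# class proportions. 0.0 means a pure set; values grow with mixing.
def gini(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # balanced binary set → 0.5, i.e. 2p(1-p) at p=0.5
print(gini([1, 1, 1, 1]))  # pure set → 0.0
```

A CART split is chosen to minimize the weighted Gini impurity of the two child sets, which is why lower values correspond to purer partitions.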
The process of generation of the random forest
1) Choose a sample set
Assuming that the original sample set has n samples in total, n samples are extracted from it by bootstrap sampling (sampling with replacement) in each round to obtain a training set of size n. During this extraction, some samples may be drawn repeatedly, while others are never drawn at all.
In total there are k rounds of extraction, so the training sets extracted in the rounds are T1, T2, ..., Tk.
2) Generate decision trees
If there are D features in the feature space, then d features (d < D) are randomly selected from the D features to form a new feature set in each round of decision-tree generation, and this new feature set is used to generate the decision tree. k decision trees are generated over the k rounds. Since these k decision trees are random in their selection of training sets and features, they are independent of each other.
3) Combined model
Since the k decision trees generated are independent of each other and the significance of each decision tree is equal, their weights need not be considered when they are combined; equivalently, they can be considered to have the same weight. For the classification problem, the votes of all decision trees therefore determine the final classification result.
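The three steps above (bootstrap sampling, a random feature subset of size d per tree, and equal-weight majority voting) are exactly what scikit-learn's `RandomForestClassifier` implements. The patent does not name an implementation, so the following is a sketch under that assumption, on synthetic data rather than the patent's bank dataset.

```python
# Random forest = bagging over CART trees with per-split feature subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # k: number of trees / bootstrap rounds
    max_features="sqrt",   # d ≈ sqrt(D) features considered per split
    bootstrap=True,        # draw n samples with replacement per tree
    oob_score=True,        # validate on the ~36.8% out-of-bag samples
    random_state=0,
).fit(X, y)

print(forest.oob_score_)   # out-of-bag accuracy estimate
```

Setting `oob_score=True` gives the built-in validation discussed in the next subsection, with no separate validation set required.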
Validation of the model
Validation of the model requires a validation set, but here we do not need to obtain an additional one; we simply use samples from the original sample set that have not been used. When selecting the training set from the original samples, some samples are never selected, and likewise some features are never used during feature selection. We only need to verify the final model using these unused data.
About bootstrap sampling:
Assuming there are n samples in the sample set, the probability of picking any one of them in a single draw is 1/n, and the probability of a sample not being picked in a single draw is (1 - 1/n). Since sampling is with replacement, the draws are independent of each other. Therefore, over n consecutive draws, the probability that a given sample is never selected is (1 - 1/n)^n.
By important limit:
\lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^n = e
So, as n approaches infinity, the limit of this value is 1/e (about 0.368). If the sample size is large, the probability of each sample not being selected in the whole process is about 0.368. As the samples in the sample set are independent of each other, about 36.8% of all samples are never selected. These samples are called out-of-bag (OOB) data and can be used for testing.
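The convergence of (1 - 1/n)^n to 1/e claimed above can be checked numerically in a few lines (an illustrative check, not part of the patented method):

```python
# The probability that a given sample is missed by all n bootstrap draws
# is (1 - 1/n)^n, which tends to 1/e ≈ 0.3679 as n grows.
import math

for n in (10, 100, 10000):
    p_missed = (1 - 1 / n) ** n
    print(n, round(p_missed, 4))

print(round(1 / math.e, 4))  # → 0.3679
```

Already at n = 10 the value is about 0.349, and by n = 10000 it matches 1/e to four decimal places, which justifies treating roughly 36.8% of the data as out-of-bag.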
In addition, there are two parameters that need to be controlled manually. One is the number of trees in the random forest (most of the time it is quite large); the other is the size of d (the recommended value of d is the square root of D).
Procedure
1. Data pre-treatment: first, we deleted the rows containing more than five missing values, then counted the rows and columns (288 and 249, respectively). We then computed the mean, standard deviation, maximum and minimum of the data (for instance, -0.483228, 112.491521, 81.357000 and -890.000000, respectively, for feature J137).
2. Data processing: we filled the missing values with mean values, and normalized the data using Python.
3. Split the data into training and test data: we ran feature selection and split the data into two sets, training data and test data.
4. We set up several models (SVM, LR, RF, ELM).
5. The models were trained and their accuracy and AUC score (test) obtained: random forest (74.28%, 0.693677), logistic regression (64.98%, 0.72), support vector machine (67.51%, 0.7059). Parameters were adjusted to reach the requirement.
6. We picked the best model: Random forest.
7. We predicted the test data using Random Forest.
8. We analyzed the chosen model and obtained the prediction results.
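Steps 3 to 6 of the procedure above can be sketched as a small scikit-learn pipeline. This is an assumed implementation on synthetic data, with model choices mirroring the patent's list (SVM, LR, RF) but not its exact settings or dataset.

```python
# Split the data, train several models, and pick the one with the best
# test-set AUC, as in the procedure's model-comparison steps.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]   # churn probability
    scores[name] = roc_auc_score(y_te, proba)

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On the patent's bank data this comparison, combined with accuracy, F1 and the confusion matrix, led to random forest being selected.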
Testing result
No. | Model | Accuracy (%) | AUC (Test) | F1 | Precision | Recall | Confusion matrix |
---|---|---|---|---|---|---|---|
1 | Random forest | 74.28 | 0.693677 | 0.671674621 | 0.651 | 0.693 | [[9239 2633] [1224 1903]] |
2 | Logistic regression (LR) | 64.98 | 0.72 | 0.443007104 | 0.3311 | 0.6693 | [[7657 4221] [1032 2089]] |
3 | Support vector machine (SVM) | 67.51 | 0.705883131 | 0.626519246 | 0.606 | 0.649 | [[8240 3638] [1235 1886]] |
Claims (3)
1. Method of analysis of bank customer churn based on random forest, wherein, in order to deal with sample imbalance, new data are randomly generated by copying existing data to improve prediction accuracy, and the analysis framework for bank churn problems includes: browsing all data and analyzing the background of the project; processing the raw data and dividing it into training sets and test sets; building models; training the data; using the accuracy on the test set to adjust the parameters and to calculate each model's confusion matrix, AUC value and F1 value; combining these indicators to select the optimal model; and finally analyzing the model.
2. The method according to claim 1, wherein, in order to protect private data, the original data are desensitized, and the average value is then used to fill the missing values in the data processing part.
3. The method according to claim 1, wherein, to prevent over-fitting, the data are regularized, and feature selection is then performed to make the data clearer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101197A AU2019101197A4 (en) | 2019-10-03 | 2019-10-03 | Method of analysis of bank customer churn based on random forest |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2019101197A AU2019101197A4 (en) | 2019-10-03 | 2019-10-03 | Method of analysis of bank customer churn based on random forest |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2019101197A4 true AU2019101197A4 (en) | 2020-01-23 |
Family
ID=69160468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2019101197A Ceased AU2019101197A4 (en) | 2019-10-03 | 2019-10-03 | Method of analysis of bank customer churn based on random forest |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2019101197A4 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113410861A (en) * | 2020-03-17 | 2021-09-17 | 内蒙古电力(集团)有限责任公司内蒙古电力科学研究院分公司 | Droop control parameter optimization method suitable for multi-terminal flexible direct current system |
CN113591911A (en) * | 2021-06-25 | 2021-11-02 | 南京财经大学 | Cascade multi-class abnormity identification method in logistics transportation process of bulk grain container |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019179403A1 (en) | Fraud transaction detection method based on sequence width depth learning | |
CN108898479B (en) | Credit evaluation model construction method and device | |
CN108351985A (en) | Method and apparatus for large-scale machines study | |
Boonpeng et al. | Decision support system for investing in stock market by using OAA-neural network | |
CN111080442A (en) | Credit scoring model construction method, device, equipment and storage medium | |
CN110059852A (en) | A kind of stock yield prediction technique based on improvement random forests algorithm | |
AU2019101197A4 (en) | Method of analysis of bank customer churn based on random forest | |
CN110135167A (en) | A kind of edge calculations terminal security grade appraisal procedure of random forest | |
CN104809476B (en) | A kind of multi-target evolution Fuzzy Rule Classification method based on decomposition | |
CN108874889A (en) | Objective body search method, system and device based on objective body image | |
CN110930038A (en) | Loan demand identification method, loan demand identification device, loan demand identification terminal and loan demand identification storage medium | |
CN112001788A (en) | Credit card default fraud identification method based on RF-DBSCAN algorithm | |
CN104850868A (en) | Customer segmentation method based on k-means and neural network cluster | |
CN109472453A (en) | Power consumer credit assessment method based on global optimum's fuzzy kernel clustering model | |
Pambudi et al. | Improving money laundering detection using optimized support vector machine | |
Dhonge et al. | IPL cricket score and winning prediction using machine learning techniques | |
CN110324178B (en) | Network intrusion detection method based on multi-experience nuclear learning | |
De Melo Junior et al. | An empirical comparison of classification algorithms for imbalanced credit scoring datasets | |
CN108737429B (en) | Network intrusion detection method | |
CN114519508A (en) | Credit risk assessment method based on time sequence deep learning and legal document information | |
CN106161458B (en) | Network inbreak detection method based on double online extreme learning machines of weighting | |
CN116702132A (en) | Network intrusion detection method and system | |
CN117035983A (en) | Method and device for determining credit risk level, storage medium and electronic equipment | |
CN111899092B (en) | Business data screening method and device based on two-channel model | |
CN109508350A (en) | The method and apparatus that a kind of pair of data are sampled |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |