CN112419045A - Unbalanced credit user classification method based on oversampling and random forest

Info

Publication number: CN112419045A
Application number: CN202011344142.1A
Authority: CN
Inventors: 陶砚蕴; 黄锐; 岳国旗; 吴澄
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2021-02-26

Abstract

The invention discloses an unbalanced credit user classification method based on oversampling and random forests, which comprises the following steps. Step one: in the data cleaning stage, exploratory data analysis is first carried out, missing values in the data set are processed, and feature engineering is performed on the data set. Step two: in the data transformation stage, the SMOTE oversampling method is applied so that the numbers of default users and fully-paid users in the training set are approximately equal, and discrete variables are one-hot encoded or label encoded. The beneficial effects of the invention are as follows: compared with the traditional credit evaluation methods of P2P lending platforms, the method can quickly and accurately extract users' pre-loan features and classify users without spending a large amount of time on manual review.

Description

Unbalanced credit user classification method based on oversampling and random forest

Technical Field

The invention relates to the field of credit user classification, in particular to an unbalanced credit user classification method based on oversampling and random forests.

Background

P2P lending is a typical representative of Internet finance with inclusiveness as its core idea; it uses the network to connect investors and borrowers directly, which has revitalized lending in the P2P field. At present, however, the P2P market is complex, the quality of borrowers is uneven, and problems such as information asymmetry, adverse selection and the herd effect exist between investors and borrowers, so that the bad-debt rate of P2P lending platforms is high and the number of platforms that have failed and closed increases year by year. It is therefore of great importance to establish a reliable pre-loan user classification system that reduces the user default rate, discriminates between different borrowers reasonably and scientifically, and reduces platform operating risk. In general, a P2P lending platform builds a risk model on the data of first-time borrowers: for users predicted to have poor credit, the lending institution typically charges higher interest and grants a smaller loan amount; for users predicted to have good credit, it charges lower interest and grants a larger loan amount. In the early development stage of P2P platforms, borrower credit was predicted by a combination of machine screening and manual review, with machine screening assisting manual review to improve efficiency. However, because borrowers provide a large amount of information, manual review usually takes a lot of time, and it is strongly affected by subjective preference, so that missed and erroneous reviews are inevitable and the accuracy is usually not high. Introducing a pre-loan classification method for borrowers that is led by machine learning can greatly improve efficiency and accuracy.

In recent years, academic research on P2P lending platforms has continued without interruption. Commonly used methods include the BP neural network, K-means clustering, the support vector machine (SVM) and the decision tree; none of them performs satisfactorily on high-dimensional, large-scale user data, and each has certain defects. The output of a BP neural network is difficult to interpret and its training time is too long; in K-means clustering the value of K is very difficult to choose and the cluster centers must be adjusted continuously, so the method is not suitable for massive data; the support vector machine is extremely sensitive to missing data, has no general solution for nonlinear problems, and is not suitable for massive data; the decision tree is highly interpretable but overfits easily, needs pruning, and occupies a large amount of memory when it has too many levels. The data volume of a lending platform is often very large, user features are diverse, and the user classes are unbalanced, with default users usually a small minority. Moreover, traditional methods have difficulty giving the importance of a user's pre-loan features.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an unbalanced credit user classification method based on oversampling and random forest, which has higher accuracy and good interpretability.

In order to solve the technical problem, the invention provides an unbalanced credit user classification method based on oversampling and random forests, which comprises the following steps:

step one: in the data cleaning stage, exploratory data analysis is first carried out, missing values in the data set are processed, and feature engineering is performed on the data set;

step two: in the data transformation stage, the SMOTE oversampling method is applied so that the numbers of default users and fully-paid users in the training set are approximately equal, and discrete variables are one-hot encoded or label encoded;

step three: in the model training stage, the data are divided into a training set and a test set, a random forest model is built and trained, and its parameters are tuned to obtain the optimal parameters;

step four: the random forest model is used to predict the users in the test set, and an importance ranking of the borrowers' pre-loan features is given.

The invention has the beneficial effects that:

compared with the traditional credit evaluation methods of P2P lending platforms, the method can quickly and accurately extract users' pre-loan features and classify users without consuming a large amount of time on manual review; compared with traditional data sampling methods, the invention adopts the SMOTE oversampling method, which avoids the inductive bias produced during model training by an unbalanced data set and can improve the accuracy of identifying default users; compared with a decision tree model for pre-loan user classification, the random forest model is less prone to overfitting, can be trained in parallel, trains quickly, and can output the importance of features; the method therefore has higher accuracy and good interpretability.

In one embodiment, the data set is the 2007-2018 loan data set published by Lending Club; the data set has 2260668 data items and 145 fields; the label field is loan_status, which represents the loan status and takes 9 distinct values; for pre-loan classification of borrowers, only 2 values, namely Fully Paid and Charged Off, need to be retained.

In one embodiment, "processing the missing values in the data set" specifically includes: if more than 70% of a field's values are missing, the field is deleted; if less than 5% of a field's values are missing, the data items containing the missing values are deleted; continuous fields are filled with the median or the mean; discrete fields are filled with the mode.

In one embodiment, in "performing feature engineering on the data set", a variable is deleted if its missing rate is greater than 0.9, its IV (information value) is less than 0.05, or its correlation with another variable is greater than 0.7.
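For reference, the information value is commonly defined over the bins (or categories) of a feature through the weight of evidence. The notation below is an assumption, since the source does not spell out the formula: $b_i$ and $g_i$ are the numbers of default and fully-paid users in bin $i$, and $B$ and $G$ are the respective totals.

```latex
\mathrm{WOE}_i = \ln\frac{b_i / B}{g_i / G},
\qquad
\mathrm{IV} = \sum_i \left(\frac{b_i}{B} - \frac{g_i}{G}\right)\mathrm{WOE}_i
```

Under this definition every term of the sum is non-negative, and a feature whose bins all have the same default share has an IV of 0, which is why a low IV is used as a deletion criterion.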

In one embodiment, between "processing the missing values in the data set" and "performing feature engineering on the data set", the method further includes performing business understanding on the data set, which specifically includes: understanding all features in the data set and further cleaning the data; deleting discrete fields with too many categories, to prevent the feature space from becoming too large after one-hot encoding; and screening out and deleting post-loan variables, to prevent label leakage.

In one embodiment, "one-hot encoding or label encoding the discrete variables" specifically includes: if the number of categories in a field is less than 2, label encoding is used; otherwise, one-hot encoding is used to convert the discrete field into continuous fields.

In one embodiment, in step three, a random search method is used to obtain the optimal parameters of the random forest model on the data set.

Based on the same inventive concept, the present application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods when executing the program.

Based on the same inventive concept, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of any of the methods.

Based on the same inventive concept, the present application further provides a processor for running a program, wherein the program, when run, performs any one of the methods.

Drawings

FIG. 1 is a flow diagram of the unbalanced credit user classification method based on oversampling and random forests of the present invention.

FIG. 2 is a schematic diagram of SMOTE in the unbalanced credit user classification method based on oversampling and random forests of the present invention.

FIG. 3 is a schematic diagram of the feature importance ranking of the unbalanced credit user classification method based on oversampling and random forests of the present invention.

FIG. 4 is a schematic diagram of the ROC curve of the unbalanced credit user classification method based on oversampling and random forests of the present invention.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

Referring to fig. 1, an unbalanced credit user classification method based on oversampling and random forests includes:

The invention has the beneficial effects that:

A specific application scenario of the present invention is given below:

The process involved in the present invention is shown in FIG. 1.

Data cleansing phase

(1) Reading in a data set

Read in the 2007-2018 loan data set published by Lending Club. The data set has 2260668 data items and 145 fields; the label field is loan_status, which represents the loan status and takes 9 values, as shown in the following table. For pre-loan classification of borrowers, only 2 values, namely Fully Paid and Charged Off, need to be kept.

loan_status (loan status)	Meaning
Fully Paid	Paid off
Current	In repayment
Charged Off	Bad debt
Default	Default
Late (16-30 days)	16-30 days overdue
Late (31-120 days)	31-120 days overdue
In Grace Period	In the grace period
Does not meet the credit policy. Status:Fully Paid	Does not meet the credit policy: paid off
Does not meet the credit policy. Status:Charged Off	Does not meet the credit policy: bad debt
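The label filtering described above can be sketched with pandas as follows. This is a minimal illustration on a toy frame, not the inventors' code; the column loan_amnt, the helper name keep_final_statuses and the binary column target (1 = Charged Off) are assumptions introduced for the example.

```python
import pandas as pd

# Toy stand-in for the Lending Club table (the real data set has 145 fields).
loans = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000],
    "loan_status": ["Fully Paid", "Current", "Charged Off",
                    "Late (16-30 days)", "Fully Paid"],
})

KEEP = ["Fully Paid", "Charged Off"]  # the two statuses retained by the method


def keep_final_statuses(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only loans with a final outcome and add a binary target column."""
    out = df[df["loan_status"].isin(KEEP)].copy()
    # Assumed convention: 1 marks a default (Charged Off), 0 a fully paid loan.
    out["target"] = (out["loan_status"] == "Charged Off").astype(int)
    return out.reset_index(drop=True)


labeled = keep_final_statuses(loans)
print(labeled)
```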

(2) Missing-value handling: the third-party Python library pandas is used to check whether the data set contains missing values and to process them. If more than 70% of a field's values are missing, the field is deleted; if less than 5% of a field's values are missing, the data items containing the missing values are deleted. Continuous fields are filled with the median or the mean; discrete fields are filled with the mode. This constitutes a preliminary cleaning of the data.
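A possible pandas sketch of these three rules is given below. It assumes that fields with a missing rate between the two thresholds are the ones that get filled, and it uses the median for continuous fields; the function name handle_missing and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def handle_missing(df, drop_col_above=0.70, drop_row_below=0.05):
    """Apply the drop-field / drop-row / fill rules to a copy of df."""
    out = df.copy()
    rate = out.isna().mean()  # missing rate per field, computed once up front

    # Rule 1: delete fields with more than 70% missing values.
    out = out.drop(columns=rate[rate > drop_col_above].index)

    # Rule 2: for fields with less than 5% missing, delete the affected rows.
    sparse = [c for c in out.columns if 0 < rate[c] < drop_row_below]
    if sparse:
        out = out.dropna(subset=sparse)

    # Rule 3 (assumed to cover the remaining fields): median for continuous
    # fields, mode for discrete fields.
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out.reset_index(drop=True)


demo = pd.DataFrame({
    "mostly_empty": [1.0] * 20 + [np.nan] * 80,             # 80% missing
    "annual_inc": [np.nan] * 2 + [50000.0] * 98,            # 2% missing
    "dti": [10.0] * 50 + [20.0] * 30 + [np.nan] * 20,       # 20% missing
    "home": ["RENT"] * 60 + ["OWN"] * 20 + [None] * 20,     # 20% missing
})
clean = handle_missing(demo)
print(clean.shape)
```

In the demo, the field mostly_empty is deleted, the two rows without annual_inc are deleted, and the gaps in dti and home are filled.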

(3) Business understanding: all features in the data set are understood and the data are further cleaned. Discrete fields with too many categories are deleted, to prevent the feature space from becoming too large after one-hot encoding; post-loan variables are screened out and deleted, to prevent label leakage.

Table 1 Post-loan variables
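The business-understanding step can be sketched as below. The source gives neither a category-count threshold nor, in this text, the list of post-loan variables, so max_categories=50 and the names in POST_LOAN_FIELDS are purely illustrative assumptions, as is the helper name drop_risky_fields.

```python
import pandas as pd

# Illustrative post-loan field names only; not the contents of Table 1.
POST_LOAN_FIELDS = ["total_pymnt", "recoveries", "last_pymnt_d"]


def drop_risky_fields(df, target="loan_status", max_categories=50,
                      post_loan=POST_LOAN_FIELDS):
    """Drop high-cardinality discrete fields and post-loan (leaky) fields."""
    discrete = df.drop(columns=[target]).select_dtypes(exclude="number").columns
    # Fields that would explode the feature space after one-hot encoding.
    too_many = {c for c in discrete if df[c].nunique() > max_categories}
    # Fields only known after the loan was issued, which would leak the label.
    leaky = {c for c in post_loan if c in df.columns}
    return df.drop(columns=sorted(too_many | leaky))


demo = pd.DataFrame({
    "emp_title": [f"job_{i}" for i in range(60)],   # 60 distinct categories
    "grade": ["A", "B", "C"] * 20,
    "int_rate": [0.1] * 60,
    "recoveries": [0.0] * 60,                       # post-loan information
    "loan_status": ["Fully Paid", "Charged Off"] * 30,
})
reduced = drop_risky_fields(demo)
print(list(reduced.columns))
```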

Data transformation phase

(1) Feature engineering: feature engineering is performed with toad, a third-party Python library. A variable is deleted if its missing rate is greater than 0.9, its IV (information value) is less than 0.05, or its correlation with another variable is greater than 0.7. Finally, 9 features remain.

Table 2 Features finally retained after feature engineering

Feature	Meaning
term	Loan term
int_rate	Loan interest rate
grade	Credit grade
verification_status	Whether annual income has been verified
dti	Debt-to-income ratio
acc_open_past_24mths	Number of trades opened in the past 24 months
avg_cur_bal	Average current balance of all accounts
bc_open_to_buy	Total open to buy on revolving bankcards
debt_settlement_flag	Whether the borrower is working with a debt settlement company
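The three deletion criteria (missing rate, IV, correlation) can be approximated in plain pandas as follows. This is not toad's implementation: the equal-frequency binning, the smoothing constant eps, the rule of dropping the lower-IV member of a correlated pair, and all function and column names are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd


def information_value(feature, target, bins=10, eps=1e-6):
    """IV of one feature against a 0/1 target, using quantile bins if numeric."""
    if pd.api.types.is_numeric_dtype(feature) and feature.nunique() > bins:
        feature = pd.qcut(feature, q=bins, duplicates="drop", labels=False)
    counts = pd.crosstab(feature, target)
    good = counts[0] / counts[0].sum() + eps   # share of fully-paid users per bin
    bad = counts[1] / counts[1].sum() + eps    # share of default users per bin
    return float(((bad - good) * np.log(bad / good)).sum())


def select_features(df, target="target", max_missing=0.9, min_iv=0.05,
                    max_corr=0.7):
    """Drop features by missing rate, then by IV, then by pairwise correlation."""
    y = df[target]
    X = df.drop(columns=[target])
    X = X.loc[:, X.isna().mean() <= max_missing]
    iv = pd.Series({c: information_value(X[c], y) for c in X.columns})
    X = X.loc[:, iv >= min_iv]

    corr = X.select_dtypes("number").corr().abs()
    cols, dropped = list(corr.columns), set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > max_corr:
                # Assumed tie-break: keep the feature with the higher IV.
                dropped.add(a if iv[a] < iv[b] else b)
    return X.drop(columns=sorted(dropped)), iv


rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, n)
int_rate = 2.0 * y + rng.normal(size=n)        # informative synthetic feature
demo = pd.DataFrame({
    "int_rate": int_rate,
    "int_rate_x2": 2.0 * int_rate + rng.normal(scale=0.01, size=n),
    "noise": rng.normal(size=n),
    "mostly_missing": np.where(rng.random(n) < 0.95, np.nan, 1.0),
    "target": y,
})
kept, iv = select_features(demo)
print(list(kept.columns))
```

On this synthetic frame the nearly empty field and the uninformative field are removed, and only one of the two almost identical rate fields survives.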

(2) Discrete feature transformation: the discrete fields are label encoded or one-hot encoded. If the number of categories in a field is less than 2, label encoding is used; otherwise, one-hot encoding is used to convert the discrete field into continuous fields.
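A hedged sketch of the two encodings follows. The threshold is exposed as the parameter label_encode_max, and the default of 2 (binary fields are label encoded) is an interpretation chosen for the example rather than something the text confirms; the helper name encode_discrete is likewise hypothetical.

```python
import pandas as pd


def encode_discrete(df, label_encode_max=2):
    """Label-encode low-cardinality discrete fields, one-hot encode the rest."""
    out = df.copy()
    discrete = out.select_dtypes(exclude="number").columns
    one_hot = []
    for col in discrete:
        if out[col].nunique() <= label_encode_max:
            # Label encoding: categories are mapped to 0, 1, ... in sorted order.
            out[col] = out[col].astype("category").cat.codes
        else:
            one_hot.append(col)
    # One-hot encoding: one 0/1 column per category of each remaining field.
    return pd.get_dummies(out, columns=one_hot, dtype=int)


demo = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "grade": ["A", "B", "C"],
    "int_rate": [0.08, 0.15, 0.11],
})
encoded = encode_discrete(demo)
print(encoded)
```

Label encoding keeps a binary field as a single column, while one-hot encoding avoids imposing an artificial order on fields with several categories.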

(3) Sample balancing: referring to FIG. 2, the SMOTE method is used for sample balancing. The users in the loan data set are unbalanced: the ratio of normal users to default users is about 4:1. The default users are oversampled with the SMOTE method so that the numbers of default users and normal users tend toward balance, which prevents inductive bias during model training.

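A new sample is generated as

$$x_{new} = x + \mathrm{rand}(0,1) \times (\tilde{x} - x)$$

where $x_{new}$ is the newly generated sample, $x$ is the original sample, and $\tilde{x}$ is a sample randomly selected from the k nearest neighbors of the original sample.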
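The interpolation idea behind SMOTE can be sketched in NumPy as below: each synthetic sample lies on the segment between a minority sample and one of its k nearest minority neighbors. This is a simplified illustration under stated assumptions (Euclidean distance, brute-force neighbor search, k=5, hypothetical function name smote); a production pipeline would more likely rely on an existing library implementation.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Brute-force pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # original samples x
    pick = neighbors[base, rng.integers(0, k, size=n_new)]   # random neighbor of x
    gap = rng.random((n_new, 1))                             # rand(0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])


rng = np.random.default_rng(42)
minority = rng.normal(loc=2.0, size=(20, 2))   # 20 synthetic "default" users
synthetic = smote(minority, n_new=60)          # grow the minority from 20 to 80
balanced_minority = np.vstack([minority, synthetic])
print(balanced_minority.shape)                 # (80, 2)
```

With a 4:1 imbalance, generating three synthetic samples per original minority sample, as in the demo, brings the two classes to the same size.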

Model training phase

(1) Model building and parameter tuning: a random forest model is built and its parameters are tuned; the optimal parameters of the model on the data set are obtained by random search (RandomizedSearchCV). The model is then trained with the optimal parameters, the accuracy is output, and the ROC curve (receiver operating characteristic curve) is drawn.
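A minimal scikit-learn sketch of this step is shown below. The data are synthetic (make_classification), and the search space, the number of iterations and the split ratio are assumptions, since the source does not list the parameters that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the 9 retained features; not the Lending Club data.
X, y = make_classification(n_samples=600, n_features=9, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical search space; the source does not specify one.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,             # number of random parameter combinations tried
    cv=3,                 # 3-fold cross-validation on the training set
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_   # refit on the full training set
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(search.best_params_, round(test_accuracy, 4))
```

Random search samples a fixed number of combinations instead of enumerating the whole grid, which keeps tuning time bounded on a large data set.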

(2) Visualization of feature importance: the trained random forest model is used to rank the features in descending order of importance, and the ranking is output as a bar chart (see FIG. 3).
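The ranking can be read from a fitted scikit-learn forest through its feature_importances_ attribute, as sketched below. The feature names are taken from Table 2 only as labels; the data are synthetic, so the resulting order says nothing about the real data set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "term", "int_rate", "grade", "verification_status", "dti",
    "acc_open_past_24mths", "avg_cur_bal", "bc_open_to_buy",
    "debt_settlement_flag",
]

# Synthetic data: the ranking below is illustrative, not a reported result.
X, y = make_classification(n_samples=500, n_features=9, n_informative=4,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance)

# With matplotlib installed, importance.plot.bar() draws the bar chart.
```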

Specific test results of the present invention are given below:

Tests were performed on the Jupyter Notebook platform, as follows:

(1) establishing a random forest model for training:

The prediction accuracy on the training set is 86.53%; the prediction accuracy on the test set is 86.49%.

(2) ROC curves were plotted and the results are shown in figure 4.
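For readers who want to reproduce this kind of plot, the points of an ROC curve and the area under it can be computed with scikit-learn as sketched below. The data are synthetic, so the printed AUC is unrelated to the result shown in FIG. 4.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Lending Club data set.
X, y = make_classification(n_samples=600, n_features=9, n_informative=5,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# The ROC curve needs scores, so use the predicted probability of class 1.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.3f}")

# With matplotlib installed: plt.plot(fpr, tpr) draws the curve, and
# plt.plot([0, 1], [0, 1], "--") adds the chance-level diagonal.
```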

The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of protection of the present invention is not limited thereto. Equivalent substitutions or changes made by those skilled in the art on the basis of the present invention all fall within the scope of protection of the present invention. The scope of protection of the invention is defined by the claims.

Claims

1. An unbalanced credit user classification method based on oversampling and random forests is characterized by comprising the following steps:

2. The unbalanced credit user classification method based on oversampling and random forests as claimed in claim 1, wherein the data set is the 2007-2018 loan data set published by Lending Club; the data set has 2260668 data items and 145 fields; the label field is loan_status, which represents the loan status and takes 9 distinct values; for pre-loan classification of borrowers, only 2 values, namely Fully Paid and Charged Off, need to be retained.

3. The unbalanced credit user classification method based on oversampling and random forests as claimed in claim 1, wherein "processing the missing values in the data set" specifically includes: if more than 70% of a field's values are missing, the field is deleted; if less than 5% of a field's values are missing, the data items containing the missing values are deleted; continuous fields are filled with the median or the mean; discrete fields are filled with the mode.

4. The unbalanced credit user classification method based on oversampling and random forests as claimed in claim 1, wherein in "performing feature engineering on the data set", a variable is deleted if its missing rate is greater than 0.9, its IV (information value) is less than 0.05, or its correlation is greater than 0.7.

5. The unbalanced credit user classification method based on oversampling and random forests as claimed in claim 1, wherein between "processing the missing values in the data set" and "performing feature engineering on the data set" the method further comprises performing business understanding on the data set, which specifically includes: understanding all features in the data set and further cleaning the data; deleting discrete fields with too many categories, to prevent the feature space from becoming too large after one-hot encoding; and screening out and deleting post-loan variables, to prevent label leakage.

6. The unbalanced credit user classification method based on oversampling and random forests as claimed in claim 1, wherein "one-hot encoding or label encoding the discrete variables" specifically includes: if the number of categories in a field is less than 2, label encoding is used; otherwise, one-hot encoding is used to convert the discrete field into continuous fields.

7. The unbalanced credit user classification method based on oversampling and random forests as claimed in claim 1, wherein in step three, the optimal parameters of the random forest model on the data set are obtained using a random search method.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 7.