CN117172910A

CN117172910A - Credit evaluation method and device based on EBM model, electronic equipment and storage medium

Info

Publication number: CN117172910A
Application number: CN202311196623.6A
Authority: CN
Inventors: 刘佳明; 张雪妹; 王刘安
Original assignee: Beijing Institute of Technology BIT; Beijing Technology and Business University
Current assignee: Beijing Institute of Technology BIT; Beijing Technology and Business University
Priority date: 2023-09-15
Filing date: 2023-09-15
Publication date: 2023-12-05

Abstract

A credit assessment method, a credit assessment device, an electronic device and a storage medium based on an EBM model, wherein the credit assessment method comprises the following steps: acquiring first data of an object to be predicted, wherein the first data is credit risk related data of the object to be predicted; inputting the first data into an EBM model, and acquiring a prediction result output by the EBM model, wherein the prediction result comprises a first result indicating whether the object to be predicted defaults after obtaining a loan; the training process of the EBM model comprises the following steps: obtaining a credit risk related dataset, each sample of the dataset comprising a plurality of features; based on the dataset, training the EBM model, the EBM model being the sum of all main effect shape functions and all interaction effect shape functions. The invention can more accurately predict whether a borrower will default.

Description

Credit evaluation method and device based on EBM model, electronic equipment and storage medium

Technical Field

The present invention relates to the field of credit evaluation technologies, and in particular, to a credit evaluation method, apparatus, electronic device, and storage medium based on an EBM model.

Background

With the gradual rise of industries such as third-party payment platforms, banks are promoting personal credit consumption business. This business brings benefits but also exposes banks to corresponding operational risks, and various kinds of personal consumption, such as house mortgages, automobile loans and bank cards, urgently require credit guarantees. However, because the personal credit system is still being improved, the property information of a large number of credit customers is unclear and their economic condition is ambiguous, which gives borrowers with weak awareness of good faith an opportunity to exploit. The resulting risk is borne essentially by commercial banks; loan defaults undermine the stable operation of commercial banks and can lead to serious systemic financial risk. Current research on credit risk assessment models mostly pursues prediction accuracy but ignores the interpretability of decisions. Accuracy and interpretability are often difficult to reconcile (the accuracy-interpretability dilemma), so research is needed on the key factors that cause borrowers to default and on how those factors influence one another.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

In view of the problems existing in the prior art, the invention provides a credit evaluation method, a credit evaluation device, electronic equipment and a storage medium based on an EBM model.

The technical scheme of the invention provides a credit assessment method based on an EBM model, which comprises the following steps:

acquiring first data of an object to be predicted, wherein the first data is credit risk related data of the object to be predicted;

inputting the first data into an EBM model, and acquiring a prediction result output by the EBM model, wherein the prediction result comprises a first result indicating whether the object to be predicted defaults after obtaining a loan;

the training process of the EBM model comprises the following steps:

obtaining a credit risk related dataset, each sample of the dataset comprising a plurality of features;

based on the dataset, the EBM model is trained, the EBM model being the sum of all main effect shape functions and all interaction effect shape functions, the main effect shape functions being shape functions for a single said feature, the interaction effect shape functions being shape functions for two different said features.

Optionally, the predicted outcome further comprises a second outcome comprising a degree of influence of the plurality of features on the first outcome for analyzing a reason for predicting the first outcome.

Optionally, obtaining a credit risk related dataset, each sample of the dataset comprising a plurality of features, further comprising:

obtaining a credit risk related sample set, each sample of the sample set comprising a plurality of items;

and screening the plurality of items through a Lasso algorithm, and taking the screened items as the plurality of characteristics.

Optionally, obtaining a credit risk related sample set, further comprises:

acquiring personal credit data revealed by a bank;

preprocessing the personal credit data to form the sample set;

wherein each sample of the sample set includes at least two items including:

current loan interest rate, company type, working life, whether the borrower owns a house, auditing conditions, loan application type, initial list state of the loan, the number of early repayments by the borrower, the accumulated amount of early repayments by the borrower, work type, number of mortgage accounts, education type, number of family members, number of repayment months of the loan, and whether the borrower defaults.

Optionally, the preprocessing includes at least one of:

missing value processing, outlier processing, data balancing processing and data normalization processing.

Optionally, the formula of the Lasso algorithm is:

$$\min_{\beta}\ \frac{1}{2N}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1$$

where y is the vector indicating whether each borrower defaults, X is the matrix of the plurality of items, β is the coefficient vector, N is the number of samples, and α is the regularization strength hyperparameter.

Optionally, the training process of the EBM model further includes:

after the EBM model training is completed, generating a global interpretation;

wherein the global interpretation includes: the importance of each of said features, and/or the functional relationship of each of said features and said first result.

The technical scheme of the invention also provides a credit assessment device based on the EBM model, which comprises:

the acquisition module is used for acquiring first data of an object to be predicted, wherein the first data is credit risk related data of the object to be predicted;

the prediction module is used for inputting the first data into an EBM model, and obtaining a prediction result output by the EBM model, wherein the prediction result comprises a first result indicating whether the object to be predicted defaults after obtaining a loan;

the training process of the EBM model comprises the following steps:

based on the dataset, training the EBM model, wherein the EBM model is the sum of all main effect shape functions and all interaction effect shape functions, the main effect shape functions are shape functions for a single feature, and the interaction effect shape functions are shape functions for two different features.

The technical scheme of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the credit evaluation method based on the EBM model when executing the program.

The technical solution of the present invention further provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the credit assessment method based on the EBM model as described in any one of the above.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following brief description will be given of the drawings used in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a credit evaluation method based on an EBM model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a credit evaluation device based on an EBM model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a credit risk assessment and interpretation system based on an EBM model according to an embodiment of the present invention;

FIG. 4 is a ranking chart of feature importance provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a shape function between a feature of "current loan interest rate" and a target variable and a distribution of values of the feature itself, provided by an embodiment of the invention;

FIG. 6 is a schematic diagram illustrating a partial interpretation of a sample in a dataset according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an entity structure of an electronic device according to the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The credit evaluation method based on the EBM model provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.

Fig. 1 is a flow chart of a credit evaluation method based on an EBM model according to an embodiment of the present application, as shown in fig. 1, the method for credit evaluation based on an EBM (Explainable Boosting Machine, interpretable machine learning) model according to the technical scheme of the present application includes the following steps:

S110, acquiring first data of an object to be predicted, wherein the first data is credit risk related data of the object to be predicted;

S120, inputting the first data into the EBM model, and obtaining a prediction result output by the EBM model, wherein the prediction result comprises a first result indicating whether the object to be predicted defaults after obtaining a loan;

the training process of the EBM model comprises the following steps:

acquiring a credit risk related dataset, each sample of the dataset comprising a plurality of features;

based on the data set, an EBM model is trained, which is the sum of all main effect shape functions, which are shape functions for a single feature, and all interaction effect shape functions, which are shape functions for two different features.

The EBM model thereby improves the accuracy of loan default prediction for the object to be predicted (the borrower).

Optionally, the prediction result further includes a second result, which includes the degree of influence of each of the plurality of features on the first result and is used for analyzing the reason for the predicted first result. Accordingly, the object to be predicted or the approver can be informed of the main factors for which a loan cannot be granted.

Optionally, acquiring a credit risk related dataset, each sample of the dataset comprising a plurality of features, further comprising:

acquiring a credit risk related sample set, each sample of the sample set comprising a plurality of items;

the plurality of items are screened through the Lasso (least absolute shrinkage and selection operator) algorithm, and the screened items are used as the plurality of features.

Through the item screening described above, the features that influence the result can be controlled, which in turn controls both the amount of computation of the overall algorithm and the specific items that affect the prediction result.

Optionally, obtaining a credit risk related sample set, further comprises:

acquiring personal credit data revealed by a bank;

preprocessing personal credit data to form a sample set;

wherein each sample of the sample set comprises at least two items including:

current loan interest rate, company type, working life, whether the borrower owns a house, auditing conditions, loan application type, initial list state of the loan, the number of early repayments by the borrower, the accumulated amount of early repayments by the borrower, work type, number of mortgage accounts, education type, number of family members, number of repayment months of the loan, and whether the borrower defaults. Whether the borrower defaults is the target variable, and the other items are the explanatory variables.

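Optionally, the preprocessing comprises at least one of:

missing value processing, outlier processing, data balancing processing and data normalization processing.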

Optionally, the formula of the Lasso algorithm is:

$$\min_{\beta}\ \frac{1}{2N}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1$$

where y is the vector indicating whether each borrower defaults, X is the matrix of the multiple items, β is the coefficient vector, N is the number of samples, and α is the regularization strength hyperparameter.

In the fitting process, the regularization strength α, which scales the penalty α‖β‖₁, is adjusted to control the sparsity of the items. The best regularization strength hyperparameter can be found using methods such as cross-validation or grid search.

The coefficients of all items are obtained from the trained Lasso regression model and sorted from largest to smallest by absolute value. A suitable threshold is then determined from prior knowledge and actual requirements, and the items whose coefficient magnitude is larger than the threshold are retained as the screened features.
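The ranking and thresholding just described can be sketched as follows. This is only an illustration: the function name `screen_features`, the feature names and the coefficient values are hypothetical and are not taken from the embodiment.

```python
def screen_features(coefficients, threshold):
    """Rank items by |coefficient| (largest first) and keep those above the threshold."""
    ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [name for name, coef in ranked if abs(coef) > threshold]


# Hypothetical coefficients read from an already fitted Lasso model.
example_coefs = {
    "interest_rate": 0.42,
    "work_years": -0.15,
    "family_size": 0.0,
    "early_repay_count": -0.31,
}

# The threshold itself would come from prior knowledge and practical requirements.
selected = screen_features(example_coefs, threshold=0.1)
print(selected)
```

Items whose coefficient Lasso has already shrunk to zero are dropped for any non-negative threshold, so the threshold only decides among the remaining items.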

Optionally, the training process of the EBM model further includes:

after the EBM model training is completed, generating a global interpretation;

wherein the global interpretation includes: the importance of each feature, and/or the functional relationship of each feature to the first result.

Through global interpretation, the EBM model can be continuously adjusted, so that the emphasis direction of the model is more in line with the requirements of users.

The credit evaluation device based on the EBM model provided by the present invention is described below, and the credit evaluation device based on the EBM model described below and the credit evaluation method based on the EBM model described above can be referred to correspondingly to each other. The apparatus described herein includes a virtual apparatus on a program running device such as a computer or a processor.

Fig. 2 is a schematic structural diagram of a credit assessment device based on an EBM model according to an embodiment of the present invention, and as shown in fig. 2, the technical solution of the present invention further provides a credit assessment device based on an EBM model, where the device includes:

the acquisition module is used for acquiring first data of the object to be predicted, wherein the first data is credit risk related data of the object to be predicted;

the prediction module is used for inputting the first data into the EBM model and obtaining a prediction result output by the EBM model, wherein the prediction result comprises a first result indicating whether the object to be predicted defaults after obtaining a loan;

The training process of the EBM model comprises the following steps:

The embodiment can more accurately predict whether the borrower can default.

In yet another embodiment, the present invention provides a credit risk assessment and interpretation method based on an EBM model, for a financial institution to predict the probability of credit default so as to reduce loan risk and losses, comprising the following steps:

step 1), collecting data from a credit risk related data source and preprocessing it, wherein the preprocessing comprises missing value processing, outlier processing, data balancing and data normalization, so as to obtain a preprocessed dataset;

step 2), performing feature screening on the feature data of the preprocessed dataset using Lasso, retaining the screened features, and taking the screened feature data of the preprocessed dataset as the dataset;

step 3), inputting the credit risk data into the interpretable EBM model, which learns and selects important features and their interactions from the data and generates an interpretable prediction result;

step 4), evaluating and classifying the credit risk of borrowers according to the prediction result of the EBM model, and outputting evaluation indexes such as the borrower's default probability and the prediction accuracy;

step 5), interpreting and analyzing the credit risk assessment result according to information such as the global interpretation and the local interpretation (namely the second result) of the EBM model, and outputting information such as the borrower's credit risk factors, credit risk trends and credit risk suggestions.

The invention collects the real data of borrowers from the bank loan business, performs feature selection based on Lasso after preprocessing the data, and then adopts an EBM method to realize credit risk monitoring and identification work and performs experimental verification. Experimental results show that the method has good effect in bank credit violation prediction application, has good prediction robustness, and provides reasonable explanation for key factors of borrower violations and influence conditions of the factors.

Further, in step 1):

According to the information collection requirements of the actual loan business of the financial institution, the relevant feature data are collected within the bank. The collected information comprises the explanatory variables, which mainly cover three aspects: personal basic information, repayment capability and repayment willingness. Personal basic information mainly comprises features such as age, occupation and education; repayment capability mainly comprises features such as property, wages and social relations; repayment willingness mainly examines the person's previous defaults, the number of early repayments and the accumulated repayment amount.

Specifically, the explanatory variables are the current loan interest rate, the type of company, the working period, whether there is a house, the auditing condition, the loan usage category, the initial list state of the loan, the number of repayment times in advance by the borrower, the accumulated amount of repayment in advance by the borrower, the working type, the number of mortgage accounts, the education type, the number of family members, the repayment month number of the loan, and the like, respectively.

Finally, there is a target variable for determining whether the borrower is default.

The set of explanatory variables is expressed as $X = \{x_1, x_2, \ldots, x_M\}$, where M is the number of explanatory variables, and the label of the target variable takes values in $Y = \{0, 1\}$, where 0 indicates no default and 1 indicates default.

Missing value processing step: the samples with missing values are few, accounting for only 9% of the total sample size, so a large amount of data remains after the samples with missing values are deleted directly; 9142 records are finally retained.

Outlier processing step: because of recording errors and other causes during information collection, samples with obvious outliers appear in the dataset. To avoid interference of abnormal data with the default judgment, a quantile method is used to screen and adjust each feature in the original dataset. The quartiles divide the sorted data into four equal parts: the first quartile (Q1, 25% quantile), the second quartile (Q2, 50% quantile) and the third quartile (Q3, 75% quantile). The data are arranged in ascending order, and then Q1 (the value at the 25% position) and Q3 (the value at the 75% position) are calculated. The interquartile range IQR is the difference between Q3 and Q1 and represents the spread of the middle 50% of the data: IQR = Q3 - Q1. Values that lie below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers, and the corresponding abnormal data are deleted.
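A minimal sketch of the IQR rule for a single feature column is given below. The linear-interpolation definition of the quartiles is an assumption (the text does not say which quantile convention is used), and the sample interest rates are made up for illustration.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an ascending list, with q in [0, 1]."""
    pos = (len(sorted_values) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def iqr_bounds(values):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) for one feature column."""
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def drop_outliers(values):
    """Keep only the values inside the IQR bounds."""
    low, high = iqr_bounds(values)
    return [v for v in values if low <= v <= high]


# Hypothetical interest rates with one obvious recording error.
rates = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 40.0]
cleaned = drop_outliers(rates)
print(cleaned)
```

In a full pipeline the same bounds would be computed per feature and a sample would be deleted when any of its features falls outside them.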

Data balancing step: after the missing values and outliers have been processed, the data still exhibit a serious class imbalance problem, which seriously affects the fitting of the EBM model. The data are therefore balanced with the SMOTE (Synthetic Minority Over-sampling Technique) method. For each minority-class sample $x_i$ in the original data, a sample $x_j$ is selected at random from its nearest neighbors among the minority-class samples; a point is then chosen at random on the line segment connecting $x_i$ and $x_j$, taken as a newly synthesized minority-class sample, and added to the original dataset to alleviate the imbalance. Applying this processing to a minority-class sample x yields new data $x_{new}$ according to the following formula:

$$x_{new} = x + \mathrm{rand}(0, 1) \times (x_n - x)$$

where $x_n$ is a sample selected at random from the k nearest neighbors of x among the minority-class samples, and rand(0, 1) is a random number between 0 and 1, so that $x_{new}$ falls on the line segment between x and $x_n$.
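The interpolation step can be sketched in a few lines. This is a simplified illustration rather than a full SMOTE implementation: the helper names, the Euclidean distance and the toy minority samples are assumptions, and the new point is interpolated from the sample toward its chosen neighbor so that it lies on the connecting segment described above.

```python
import random


def nearest_neighbors(sample, candidates, k):
    """Return the k candidates closest to sample (squared Euclidean distance)."""
    def dist2(other):
        return sum((a - b) ** 2 for a, b in zip(sample, other))

    others = [c for c in candidates if c is not sample]
    return sorted(others, key=dist2)[:k]


def smote_sample(sample, minority, k, rng):
    """Synthesize one point between sample and a randomly chosen near neighbor."""
    neighbor = rng.choice(nearest_neighbors(sample, minority, k))
    gap = rng.random()  # plays the role of rand(0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [8.0, 9.0]]  # hypothetical minority class
synthetic = [smote_sample(x, minority, k=2, rng=rng) for x in minority]
print(synthetic)
```

Each synthetic point is a convex combination of two existing minority samples, so it never leaves the region spanned by the minority class.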

Normalization step: because different explanatory variables have different dimensions and their ranges differ by orders of magnitude, some variables would otherwise be ignored and the result of the data analysis would be affected. Each feature value in the original dataset is therefore normalized and replaced by its normalized value. The normalized value $x'_{ij}$ of the i-th feature of the j-th sample in the original dataset is calculated as:

$$x'_{ij} = \frac{x_{ij} - x_{min}}{x_{max} - x_{min}}$$

where $x_{ij}$ is the i-th feature of the j-th sample in the original dataset, $x_{min}$ is the minimum value of the i-th feature column in the original dataset, and $x_{max}$ is the maximum value of the i-th feature column in the original dataset;
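As an illustration of the min-max normalization described here, the following sketch scales every feature column to the range [0, 1]. The function name, the toy table and the handling of constant columns are assumptions made for the example.

```python
def min_max_normalize(rows):
    """Scale every column of a list-of-rows table to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column has no spread; mapping it to 0.0 is an assumption.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical samples: [interest rate in percent, repayment months].
table = [[5.0, 12.0], [10.0, 36.0], [15.0, 60.0]]
scaled_table = min_max_normalize(table)
print(scaled_table)
```

After scaling, both columns span the same interval, so neither dominates purely because of its unit.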

further, in step 2):

The collected original data contain many explanatory variables of borrowers, some of which contribute little to judging whether a default occurs. The Lasso feature selection method is therefore used here to select features from the original dataset, which on the one hand improves the prediction performance of the model and on the other hand reduces the computational complexity.

The main idea of Lasso for feature selection is simple linear regression plus L1 regularization. In normal linear regression, the optimization objective is to minimize the square error between the predicted value and the actual value. Whereas in Lasso, the optimization objective is to minimize the loss function, which consists of two parts: square error terms (similar to linear regression) and L1 regularization terms. The L1 regularization term is the sum of the absolute values of the model coefficients multiplied by a regularization parameter λ (lambda). The existence of this term constrains the coefficients of the model during the optimization process, tending to compress some coefficients to zero.

The optimization problem of Lasso can be expressed in the following form: minimization: the square error term +λ Σ|β|. Wherein the square error term measures the difference between the model predicted value and the actual value, λ is the regularization parameter, and β represents the coefficient vector of the model. As λ increases, the optimization process may cause the coefficients of the partial features to gradually shrink to zero. This is because regularization term penalizes the absolute value of the coefficients, making the optimization process more prone to select fewer features and compress the coefficients of less important features to zero. Due to the influence of L1 regularization, the coefficients of certain features become zero, thus enabling feature selection. Features with non-zero coefficients are considered to be more important in interpreting the target variables and are therefore preserved. The mathematical expression of Lasso may be equivalent to the following formula:

$$\hat{\beta} = \arg\min_{\beta}\ \lVert y - X\beta\rVert_2^2 + \lambda \sum_{j}\lvert\beta_j\rvert$$

The first part of the formula represents the fit of the original objective function and the second part is the penalty term on the parameters. The penalty coefficient satisfies λ ∈ [0, +∞): the smaller λ is, the weaker the penalty and the more feature variables are retained; the larger λ is, the stronger the penalty and the more coefficients become 0. In short, by introducing the L1 regularization term and adjusting the regularization parameter λ, Lasso shrinks the model coefficients and performs feature selection during optimization, which reduces model complexity and helps prevent overfitting.
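The shrinking effect of λ can be observed with a small coordinate-descent sketch that minimizes the squared error plus λ Σ|β|. Coordinate descent with soft-thresholding is one standard way to solve this problem and is used here as an assumption; the text does not state which solver the embodiment uses. The toy data (no intercept, two features) are hypothetical.

```python
def soft_threshold(value, bound):
    """Shrink value toward zero by bound; values inside [-bound, bound] become 0."""
    if value > bound:
        return value - bound
    if value < -bound:
        return value + bound
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize sum((y - X beta)^2) + lam * sum(|beta_j|) one coordinate at a time."""
    n_features = len(X[0])
    beta = [0.0] * n_features
    for _ in range(sweeps):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j (assumed non-zero)
            for row, target in zip(X, y):
                partial = sum(row[k] * beta[k] for k in range(n_features) if k != j)
                rho += row[j] * (target - partial)
                z += row[j] ** 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Toy data: y = 2 * x1 + 0.1 * x2, so the second feature matters little.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.05, 3.95, 6.05, 7.95]

dense = lasso_coordinate_descent(X, y, lam=0.0)   # no penalty: ordinary least squares
sparse = lasso_coordinate_descent(X, y, lam=2.0)  # penalty removes the weak feature
print(dense, sparse)
```

With λ = 0 both coefficients are recovered, while with λ = 2 the coefficient of the weak feature is set exactly to zero and the strong one is only slightly shrunk, which is the selection behavior the paragraph above describes.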

Further, in step 3):

EBM model training process: the EBM model is an additive model that fits the features with a boosted tree model and looks for interaction effects between features. The personal credit data are input into the EBM model, the single features and the second-order interaction features of the EBM model are determined, the output of each single feature and of each second-order interaction feature is determined, and these outputs are added to obtain the final prediction result.

Identifying the second-order interaction characteristics of the EBM model specifically comprises the following steps:

the EBM model fuses a boosted tree model into a generalized additive model, and the EBM model structure is as follows:

$$g(E(y \mid x)) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{i,j} f_{ij}(x_i, x_j)$$

where $\sum_{i} f_i(x_i)$ is the main effect, $\sum_{i,j} f_{ij}(x_i, x_j)$ is the second-order interaction effect, $\beta_0$ is the intercept, and g is the link function, so that $g(E(y \mid x))$ represents how each feature affects the output of the model.

Given any pair of features $(x_i, x_j)$, training the model yields the interaction shape function $f_{ij}(x_i, x_j)$, while each individual feature has its own shape function $f_i(x_i)$, which serves as a main effect. The first part of the parameters, corresponding to the main effects, is trained first. After the main effects of all features have been trained, the residual is calculated, and the second part of the parameters, corresponding to the interactions $f_{ij}(x_i, x_j)$, is trained with the goal of reducing that residual. If the pair $(x_i, x_j)$ interacts strongly, the interaction $f_{ij}(x_i, x_j)$ reduces the residual substantially.
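The two-stage idea (main effects first, then interactions fitted to the remaining residual) can be illustrated with a deliberately simplified sketch. It swaps the boosted trees for per-bin mean values and uses a squared-error regression target instead of a classification loss; both are assumptions made to keep the example short, and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_bin_means(keys, residuals):
    """Shape function as a lookup table: mean residual per key (bin)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, r in zip(keys, residuals):
        sums[key] += r
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def mse(values):
    """Mean squared value of a residual list."""
    return sum(v * v for v in values) / len(values)


# Toy data: two binned features whose effect on y is a pure interaction.
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
y = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]

intercept = sum(y) / len(y)
residual = [t - intercept for t in y]

# Stage 1: main effects, each fitted to the residual left by the previous terms.
f1 = fit_bin_means(x1, residual)
residual = [r - f1[a] for r, a in zip(residual, x1)]
f2 = fit_bin_means(x2, residual)
residual = [r - f2[b] for r, b in zip(residual, x2)]
mse_main = mse(residual)

# Stage 2: the pairwise term is fitted to what the main effects could not explain.
pairs = list(zip(x1, x2))
f12 = fit_bin_means(pairs, residual)
residual = [r - f12[p] for r, p in zip(residual, pairs)]
mse_pair = mse(residual)
print(mse_main, mse_pair)
```

In this toy case neither main effect can reduce the error, while the pairwise term removes it entirely, which is the situation in which a pair would be judged to interact strongly.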

The binary functions $f_{ij}(x_i, x_j)$ of all feature pairs are screened with the FAST algorithm (an efficient method for ranking candidate pairwise interactions in additive models), and the significant second-order interaction terms are selected; the outputs of the single features and the outputs of the second-order interaction features are then added linearly to obtain the final prediction result of the EBM model.
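The final linear addition can be sketched with lookup-table shape functions. All names and values below (`ebm_predict`, `rate_bin`, `has_house`, the scores and the intercept) are hypothetical, and the logistic link that turns the summed score into a default probability is an assumption appropriate for a binary target.

```python
import math


def ebm_predict(sample, intercept, main_effects, pair_effects):
    """Return (default probability, per-term contributions) for one sample."""
    contributions = {}
    for feature, table in main_effects.items():
        contributions[feature] = table[sample[feature]]
    for (a, b), table in pair_effects.items():
        contributions[f"{a} x {b}"] = table[(sample[a], sample[b])]
    score = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic link (assumed)
    return probability, contributions


# Hypothetical shape functions over binned feature values.
main_effects = {
    "rate_bin": {"low": -0.8, "high": 0.9},
    "has_house": {"yes": -0.4, "no": 0.3},
}
pair_effects = {
    ("rate_bin", "has_house"): {
        ("low", "yes"): -0.1, ("low", "no"): 0.0,
        ("high", "yes"): 0.0, ("high", "no"): 0.2,
    },
}

borrower = {"rate_bin": "high", "has_house": "no"}
probability, contributions = ebm_predict(borrower, -1.0, main_effects, pair_effects)
print(probability, contributions)
```

Because the score is a plain sum, the per-term contributions returned alongside the probability are exactly the kind of per-sample breakdown a local interpretation would display.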

Further, in step 4):

in order to compare the prediction effect of the EBM model on credit risk assessment, 5 common comparison models, namely a decision tree, a K neighbor model, XGBoost, a logistic regression model and a random forest, are selected. Accuracy, recall, precision, and F1 score (i.e., F1-score) were used as criteria for the model.

When credit risk is evaluated and classified, borrowers are assigned to different credit risk categories according to the prediction result of the EBM model. At the same time, the difference between the predicted values and the true values is evaluated in order to assess the predictive performance of the EBM model. A confusion matrix is commonly used for this performance evaluation, as shown in Table 1.

TABLE 1

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, the proportion of all observations in the dataset that the EBM model judges correctly.

Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$, the proportion of correct predictions among the samples that the EBM model predicts as Positive.

Sensitivity: $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$, the proportion of correct predictions among the samples whose true class in the dataset is Positive.

F1-score: $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0; the closer to 1, the better the result. Here Recall = TP/(TP+FN), which is the same quantity as the sensitivity; TP (True Positive) is the number of samples that the model correctly identifies as positive; FN (False Negative) is the number of samples that are actually positive but that the model erroneously identifies as negative; FP (False Positive) and TN (True Negative) are defined analogously for samples that are actually negative.
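The four indicators can be computed directly from the confusion-matrix counts, as in the sketch below. The counts are invented for illustration and are not results from the embodiment; the function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (sensitivity) and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 100 borrowers (positive = default).
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

On imbalanced loan data, accuracy alone can look good while recall on defaulters stays low, which is why precision, recall and F1 are reported together.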

Further, in step 5):

Credit risk decisions are analyzed from the two angles of global interpretation and local interpretation, according to the structural characteristics of the EBM model. Global interpretation explains the results of the EBM model in terms of the feature variables in the dataset. Local interpretation refers to analyzing, for the prediction result of each input sample, which features and interactions affect it, and the extent and direction of their influence. From the shape function $f_u$ of each term of the EBM model, the weights of the single features and the second-order interaction features are calculated, and the single features or interaction features of highest importance are then found by sorting. To calculate the weight, the EBM model uses $\sqrt{E(f_u^2)}$ to measure the importance of $f_u$. The mean of $f_u$ is not considered in the EBM model, so let $E(f_u) = 0$; then $\sqrt{E(f_u^2)} = \sqrt{E(f_u^2) - E(f_u)^2} = \sqrt{\mathrm{Var}(f_u)}$, which is the standard deviation of $f_u$. Therefore, after the data have been normalized, the standard deviation is used as the weight that measures the importance of each term. On this basis the credit risk assessment result is interpreted and analyzed, and information such as the borrower's credit risk factors, credit risk trends and credit risk suggestions is output. Step 4) and step 5) are parallel: the result of step 4) does not affect step 5).
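A small sketch of the global importance ranking follows. It assumes that the importance of a term is the standard deviation of its centred shape-function outputs over the samples, as described above; the term names and scores are hypothetical.

```python
import math


def term_importance(scores):
    """sqrt(E[f_u^2]) after centring f_u, i.e. the standard deviation of its outputs."""
    mean = sum(scores) / len(scores)
    centred = [s - mean for s in scores]
    return math.sqrt(sum(c * c for c in centred) / len(centred))


# Hypothetical shape-function outputs of each term over four samples.
term_scores = {
    "interest_rate": [0.9, -0.8, 0.9, -0.8],
    "has_house": [0.3, -0.4, -0.4, 0.3],
    "interest_rate x has_house": [0.2, -0.1, 0.0, 0.0],
}

ranking = sorted(term_scores, key=lambda t: term_importance(term_scores[t]), reverse=True)
print(ranking)
```

A term whose output barely varies across borrowers ends up near the bottom of the ranking, regardless of whether it is a single feature or an interaction.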

Fig. 3 is a schematic diagram of the data flow of a credit risk assessment and interpretation system based on an EBM model according to an embodiment of the present invention. As shown in Fig. 3, the embodiment of the present invention further provides a credit risk assessment and interpretation system based on an EBM model, the system including:

the data acquisition module is used for acquiring the data related to the bank's personal loan business and sending it to the data preprocessing module, wherein the feature information covers three aspects, namely personal basic information, repayment capability and repayment willingness, and mainly comprises the current loan interest rate, the company type, the working years, whether the borrower owns a house, the audit status, the loan usage category, the initial listing status of the loan, the number of early repayments by the borrower, the accumulated amount of early repayments by the borrower, the job type, the number of mortgage loan accounts, the education type, the number of family members, the loan repayment month, and whether the borrower defaults;

The data preprocessing module is used for receiving the loan data set sent by the data acquisition module, carrying out missing value processing, abnormal value processing, data balance processing and data normalization processing on the characteristic data in the loan data set, and sending the preprocessed loan data set to the EBM model training module;

the feature extraction module is used for carrying out feature screening by using Lasso, retaining the screened features, and taking the screened feature data in the preprocessed data set as a data set;

the EBM model training module is used for receiving the preprocessed loan data set sent by the data preprocessing module, training and validating the model on this data, automatically learning and selecting important features and the interactions among them from the data, and generating an interpretable prediction result;

the credit risk assessment and classification module is used for receiving the prediction result generated by the EBM model training and verifying module and comparing the prediction result with the real result of the data set, and assessing and classifying the credit risk of the borrower;

the credit risk interpretation and analysis module is used for receiving the prediction result generated by the EBM model training and verifying module, and interpreting and analyzing the credit risk assessment result according to the information such as global interpretation and local interpretation of the EBM model;

The user interface module is used for receiving the prediction result generated by the EBM model training and verifying module, the performance evaluation generated by the credit risk evaluation and classification module and the result interpretation generated by the credit risk interpretation and analysis module, displaying the credit risk evaluation and interpretation result to the user and providing an interactive function.
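Part of the data preprocessing module's work can be sketched for a single numeric feature as follows. The text does not specify which missing value and normalization methods are used, so median imputation and min-max scaling are assumptions made here for illustration only, and the function names and sample values are hypothetical.

```python
from statistics import median


def impute_missing(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def min_max_normalize(column):
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical "current loan interest rate" values with one missing entry.
interest_rate = [4.5, None, 7.5, 6.0, 10.5]
cleaned = min_max_normalize(impute_missing(interest_rate))
```

Here the missing entry is filled with the median 6.75 and the column is then scaled to the range 0 to 1. Abnormal value processing and data balance processing would be further steps of the same module and are not shown.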

In order to verify the performance of the method in the EBM model-based credit risk assessment and interpretation system, a verification experiment was performed, for which a real loan data set was collected from a bank. The data set contains 15216 loan records, covering both default and non-default loans.

The invention provides a credit risk assessment and interpretation method based on an EBM model. The EBM model is an interpretable machine learning model that can predict the credit risk of borrowers more accurately by selecting and weighting key evidence. Compared with other machine learning methods, the method achieves the best results in terms of accuracy, precision, recall, sensitivity and the like. This helps to reduce the risk of false positives and false negatives, thereby optimizing loan decisions.

The invention adopts the EBM model to carry out credit risk assessment and interpretation, can fully utilize the credit data of clients, and improves the accuracy and reliability of credit risk assessment. Meanwhile, the invention can also determine the influence factors and the influence degree of the credit risk according to the credit grade of the client output by the EBM model, and explain why a certain borrower is classified into a specific credit risk category. This interpretation helps the financial institution to better understand the decision basis of the model and provides transparency to the customer, increasing trust.

The invention can flexibly adjust the parameters and the feature selection of the EBM model according to different scenes and requirements, thereby realizing personalized credit risk assessment and interpretation of different types of clients. In addition, the invention can realize dynamic optimization of customer credit risk assessment and interpretation by continuously updating the EBM model.

Five common machine learning classification methods are selected as comparison methods for verifying performance, namely decision tree (Decision tree), K nearest neighbor (KNN), XGBoost, logistic regression (Logistic) and random forest (Random forest). Each is compared with the method provided by the invention, and the evaluation indexes adopted are accuracy (Accuracy), precision (Precision), recall (Sensitivity) and F1-score.

In order to give an intuitive view of the loan data set, sample records from the data set are shown in Table 2 below:

TABLE 2

Table 2 shows the 9 explanatory variables selected by Lasso as having the greatest influence on the borrower (current loan interest rate, company type, working years, whether the borrower owns a house, audit status, loan usage category, initial listing status of the loan, number of early repayments by the borrower, and accumulated amount of early repayments by the borrower); y represents the target variable (y=1 indicates default, y=0 indicates non-default).
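How Lasso screens features can be illustrated with a minimal coordinate-descent sketch. The objective assumed here is the common form (1/(2n))·||y − Xw||² + α·||w||₁ without an intercept; this form, the function names and the toy data are assumptions for illustration and are not taken from the formula of the invention. Features whose coefficients are shrunk exactly to zero are discarded.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Toy data: y depends only on the first (centered) feature.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]
weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

In this toy example only the first feature is retained, because the coefficient of the second feature is shrunk to zero by the L1 penalty.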

The data set is divided into training samples and test samples at a ratio of 7:3. The performance comparison (training set : test set = 70:30) between the EBM method of the present invention and the comparison methods (Decision tree, KNN, XGBoost, Logistic and Random forest) is shown in Table 3 below:
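The 7:3 division can be sketched as a shuffled split. The text does not state whether the split is random or stratified, so the random shuffle and the fixed seed are assumptions for illustration, as is the function name `train_test_split`; the integer records merely stand in for the 15216 loan records.

```python
import random


def train_test_split(samples, train_ratio=0.7, seed=42):
    """Shuffle sample indices reproducibly and split them by train_ratio."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(samples) * train_ratio))
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test


# Placeholder records standing in for the 15216 loan records.
records = list(range(15216))
train, test = train_test_split(records)
```

With 15216 records, this yields 10651 training samples and 4565 test samples.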

TABLE 3

As can be seen from the classification results in Table 3, the EBM model achieves the best prediction performance and outperforms the other five methods in accuracy, recall, precision and F1-score. Its accuracy reaches 86.6%, which is about 6% higher than that of the other models; in bank loan business, each 1% improvement in accuracy is known to reduce losses by nearly one million.

The EBM model is used for global interpretation of the data set. After feature selection, the data set contains 9 feature variables, and the target variable indicates whether the loan defaults. The global interpretation of the feature variables covers two levels: the feature importance ranking, and the functional relationship between each feature and the target variable. Fig. 4 is the feature importance ranking chart provided by the embodiment of the present invention. As shown in Fig. 4, the global interpretation shows how strongly the EBM model depends on the different features and helps bank loan officers grasp the key factors influencing the decision, thereby providing more comprehensive insight.

Fig. 5 is a schematic diagram of the shape function between the "current loan interest rate" feature and the target variable, together with the distribution of the values of the feature itself. As shown in Fig. 5, the upper panel shows the relationship between the "current loan interest rate" feature and the target variable, and the lower panel shows the distribution of the "current loan interest rate" feature in the data set. As can be seen from the figure, when the loan interest rate is between 0 and 0.5, the default probability gradually increases; when the loan interest rate exceeds 0.7, the default probability gradually decreases as the loan interest rate increases.

The local interpretation of the EBM model mainly shows how each feature in a sample acts on the result, that is, the score of each feature in the sample and the degree of its influence on the prediction for that sample. Fig. 6 is a schematic diagram of the local interpretation of one sample in the data set according to the embodiment of the present invention. As shown in Fig. 6, the true result of the sample is 0, and the predicted result is also 0. To explain why the model gives this prediction, the bar chart gives the specific score of each feature and orders the features from top to bottom by degree of influence. The features on the right side, such as "current loan interest rate", "working years", "audit status", "whether the borrower owns a house" and "company type", act negatively on the predicted result, and the predicted probability of the model is obtained by adding up the scores of the features. The local interpretation can help bank loan officers analyze the specific reasons why a borrower's loan application fails (the application fails if the model predicts that the borrower will default); it can pinpoint a specific condition and makes it convenient to adjust the decision.
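The additive nature of a local interpretation can be sketched as follows. The per-feature scores and the intercept are hypothetical values, not taken from Fig. 6, and the sketch assumes that the summed scores are a log-odds value converted to a probability with the logistic function, which is the usual convention for EBM classifiers rather than something stated in the text.

```python
import math


def local_explanation(intercept, term_scores):
    """Sum the per-term scores, convert to a probability, rank terms by |score|."""
    logit = intercept + sum(term_scores.values())
    probability = 1.0 / (1.0 + math.exp(-logit))  # assumed logistic link
    ranked = sorted(term_scores.items(), key=lambda item: abs(item[1]),
                    reverse=True)
    return probability, ranked


# Hypothetical scores for one borrower; negative values push toward class 0.
term_scores = {
    "current loan interest rate": -0.9,
    "working years": -0.4,
    "audit status": -0.3,
    "whether the borrower owns a house": 0.2,
}
probability, ranked = local_explanation(intercept=-0.5, term_scores=term_scores)
predicted_label = 1 if probability >= 0.5 else 0  # 1 = default
```

For these illustrative scores the summed log-odds is negative, so the predicted class is 0 (non-default), and the ranked list reproduces the top-to-bottom ordering of a local interpretation bar chart.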

The invention discloses a credit risk assessment and interpretation method and system based on an EBM model, wherein the method comprises the following steps: collecting and preprocessing data from credit risk-related data sources, including borrower personal information, loan information, repayment records, credit scores, etc.; inputting the data into an interpretable machine learning model, the Explainable Boosting Machine (EBM), and generating highly interpretable rules with the EBM; and carrying out credit risk assessment and prediction with the EBM and outputting an interpretation of the prediction, so that a decision maker can understand the basis of the model's judgment. Experimental results show that the invention has good prediction performance and interpretability, and can be widely applied to credit risk assessment and decision support in the financial field.

Fig. 7 is a schematic physical structure diagram of an electronic device according to the present invention, where, as shown in fig. 7, the electronic device may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a credit assessment method based on an EBM model, the method comprising:

the training process of the EBM model comprises the following steps:

Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method of credit assessment based on an EBM model provided by the above methods, the method comprising:

the training process of the EBM model comprises the following steps:

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the EBM model-based credit assessment method provided above, the method comprising:

the training process of the EBM model comprises the following steps:

The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A credit assessment method based on an EBM model, the method comprising:

the training process of the EBM model comprises the following steps:

2. The EBM model-based credit evaluation method of claim 1, wherein the predicted outcome further comprises a second outcome comprising a degree of influence of the plurality of features on the first outcome for analyzing a reason for predicting the first outcome.

3. The EBM model-based credit assessment method according to claim 1, wherein a credit risk-related dataset is obtained, each sample of the dataset comprising a plurality of features, further comprising:

4. The EBM model-based credit assessment method according to claim 3, wherein obtaining a credit risk-related sample set, further comprises:

acquiring personal credit data disclosed by a bank;

preprocessing the personal credit data to form the sample set;

wherein each sample of the sample set includes at least two items including:

5. The EBM model-based credit assessment method according to claim 4, wherein the preprocessing includes at least one of:

6. The EBM model-based credit assessment method according to claim 5, wherein the formula of the Lasso algorithm is:

7. The EBM model-based credit assessment method according to claim 1, wherein the EBM model training process further comprises:

after the EBM model training is completed, generating a global interpretation;

8. A credit assessment device based on an EBM model, the device comprising:

the training process of the EBM model comprises the following steps:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the EBM model-based credit assessment method according to any one of claims 1-7 when the program is executed by the processor.

10. A non-transitory computer readable storage medium, having stored thereon a computer program, which when executed by a processor, implements the steps of the EBM model-based credit assessment method according to any of claims 1-7.