CN112686446A

CN112686446A - Machine learning interpretability-oriented credit default prediction method and system

Info

Publication number: CN112686446A
Application number: CN202011606395.1A
Authority: CN
Inventors: 吴金迪
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-04-20

Abstract

The invention relates to the technical field of credit default prediction, and in particular to a machine learning interpretability-oriented credit default prediction method comprising the following steps: S1, data collection; S2, data preprocessing; S3, data division and training; and S4, model verification. The invention also discloses a machine learning interpretability-oriented credit default prediction system comprising a data acquisition module connected through a signal line to a cleaning and screening module. The cleaning and screening module cleans the input data: if a variable is missing from the data, a small number of non-core records are deleted; if too many records would be deleted, the missing values are instead filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information. The cleaning and screening module is connected by signal to a data dividing module.

Description

Machine learning interpretability-oriented credit default prediction method and system

Technical Field

The invention relates to the field of credit default prediction, in particular to a machine learning interpretability-oriented credit default prediction method and system.

Background

As financial loan markets mature, the demand of small and micro enterprises for loans keeps growing, and the requirements on loan approval efficiency, disbursement time and loan management keep rising. Under existing conditions, saving review time, improving review accuracy and optimizing loan pool management have become major challenges. Pricing the various risks scientifically and reasonably is an important link for a bank's credit department in achieving efficient operational management, reducing operating costs and ensuring the quality and level of customer service. At present there is no good evaluation standard for the credit review of clients, and misjudgment occurs easily.

Disclosure of Invention

The invention aims to overcome the defect of the prior art that credit cannot be accurately evaluated, and provides a machine learning interpretability-oriented credit default prediction method and system.

In order to achieve the above purposes, the technical scheme adopted by the invention is as follows: a machine learning interpretability-oriented credit default prediction method, comprising the steps of:

S1, data collection: the collected data sources comprise business statistical data, credit investigation data provided by a bank and big data provided by a third party;

S2, data preprocessing: the cleaning and screening module cleans the input data; if a variable is missing from the data, a small number of non-core records are deleted; if too many records would be deleted, the missing values are instead filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information;

S3, data division and training: the cleaned data are divided into a plurality of groups, logistic regression, random forest, XGBoost and deep learning are applied to the groups respectively, and the prediction results of the groups are summed and averaged;

S4, model verification: the established model is verified by a verification module; new data are introduced, the prediction results of each group are evaluated separately, and the average is then evaluated to find the optimal model.
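The missing-data rule of step S2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `clean_missing`, the `max_drop_fraction` threshold and the `strategy` switch are hypothetical, records are plain dictionaries with `None` marking a missing value, and the distinction between core and non-core data is ignored. The "mle" branch assumes a Gaussian variable, for which the maximum likelihood estimate of the mean is the sample mean; it does not condition on other fields.

```python
import random
from statistics import fmean


def clean_missing(rows, max_drop_fraction=0.05, strategy="sample", seed=0):
    """Delete or fill records that contain missing values (None).

    max_drop_fraction and strategy are illustrative assumptions,
    not parameters named in the patent.
    """
    rng = random.Random(seed)
    incomplete = [r for r in rows if None in r.values()]

    # Few incomplete records: simply delete them.
    if len(incomplete) <= max_drop_fraction * len(rows):
        return [r for r in rows if None not in r.values()]

    # Too many records would be deleted: fill the gaps instead.
    columns = list(rows[0].keys())
    # Observed values per column; a column with no observed value at all
    # cannot be filled and would raise an error below.
    observed = {c: [r[c] for r in rows if r[c] is not None] for c in columns}

    filled = []
    for r in rows:
        new = dict(r)
        for c in columns:
            if new[c] is None:
                if strategy == "sample":
                    # Draw from the overall empirical distribution of the column.
                    new[c] = rng.choice(observed[c])
                else:
                    # Gaussian assumption: the MLE of the mean is the sample mean.
                    new[c] = fmean(observed[c])
        filled.append(new)
    return filled
```

A production pipeline would more likely condition the fill on the other fields of the same record, which is closer to "maximum likelihood estimation based on other information"; the sketch only shows the delete-or-fill decision.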

Preferably, in S4, the model is verified on a large amount of data by the quantitative analysis module, and the accuracy of the model's predictions is evaluated by the qualitative analysis module.
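The averaging of step S3 and the comparison of step S4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented method: each trained model is represented by a hypothetical callable that returns a default probability, accuracy at a 0.5 threshold is an assumed evaluation metric, and the names `average_prediction`, `accuracy` and `pick_best` do not come from the patent.

```python
from statistics import fmean


def average_prediction(models, x):
    """Mean of the default probabilities predicted by every model for input x."""
    return fmean(model(x) for model in models.values())


def accuracy(predict, samples, threshold=0.5):
    """Share of (x, label) samples whose thresholded prediction matches the label."""
    hits = sum((predict(x) >= threshold) == bool(y) for x, y in samples)
    return hits / len(samples)


def pick_best(models, new_data):
    """Evaluate each model and their average on new data; return the best name and all scores."""
    candidates = dict(models)
    candidates["average"] = lambda x: average_prediction(models, x)
    scores = {name: accuracy(fn, new_data) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first candidate listed
    return best, scores
```

In practice the dictionary passed as `models` would hold the fitted logistic regression, random forest, XGBoost and deep learning models wrapped as probability functions, and `new_data` would be records that were not used for training.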

The invention also discloses a machine learning interpretability-oriented credit default prediction system. The system comprises a data acquisition module connected through a signal line to a cleaning and screening module, which cleans the input data: if a variable is missing from the data, a small number of non-core records are deleted; if too many records would be deleted, the missing values are instead filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information. The cleaning and screening module is connected by signal to a data dividing module, which divides the cleaned data into a plurality of groups. The data dividing module is connected by signal to a system control center, which controls and manages the prediction system. The system control center is connected through signal lines to a model establishing module, which establishes the evaluation model, and to a comprehensive evaluation module, which scores the client's credit risk. The model establishing module is connected through a signal line to a training module, which applies logistic regression, random forest, XGBoost and deep learning to the groups of data respectively and sums and averages the prediction results of the groups. The model establishing module is also connected through a signal line to a verification module, which verifies the established model: new data are introduced, the prediction results of each group are evaluated separately, and the average is then evaluated to find the optimal model. The verification module is connected through signal lines to a quantitative analysis module, which verifies the model on a large amount of data, and to a qualitative analysis module, which evaluates the accuracy of the model's predictions.

Preferably, the data sources collected by the data acquisition module include business statistical data, credit investigation data provided by a bank and big data provided by a third party.

Preferably, the training module is connected with a supervised learning module and an unsupervised learning module through signal lines and trains a denoising gradient boosting tree: unsupervised learning is performed on historical online personal credit information to obtain first data features, and supervised learning is then performed on the first data features to complete the training of the denoising gradient boosting tree model.

Preferably, the system control center comprises a system management host, the system management host is connected with a database through a local area network, the database is used for storing data in the system, the system management host is connected with the data dividing module, the comprehensive evaluation module and the model establishing module through signal lines, and the local area network is provided with a firewall.

Preferably, the database is connected with a timed backup module through a signal line; the timed backup module backs up the data in the database at set times, so that the data can be restored promptly if data in the database are lost and the loss can be recovered. The database is connected with an automatic updating module through a signal line, and the automatic updating module updates the data in the database periodically. The database is also connected with a chart display module through a signal line, and the chart display module aggregates the data in the database and displays them as charts, so that managers of the system can grasp the data visually.

Preferably, the comprehensive evaluation module contains a history evaluation module, an industry evaluation module, a position evaluation module, a region evaluation module, a certificate evaluation module, an income evaluation module and an authenticity evaluation module, and comprehensive evaluation is performed by evaluating the client's historical credit record, the industry in which the client works, the position the client holds, the region in which the client is located, the certificates provided, the income level and the authenticity of the data provided.

Preferably, the comprehensive evaluation module is connected with a data input module through a data line, and the client's data are entered through the data input module. The comprehensive evaluation module is connected with a score establishment module through a signal line, and the score establishment module evaluates the client's credit risk using the generated model. The comprehensive evaluation module is connected with a risk prediction module through a signal line, and the risk prediction module assigns the client a risk level according to the score. The comprehensive evaluation module is connected with a feedback module through a signal line, and the feedback module feeds the scoring result back to the system control center.

Preferably, the system control center is connected with a safety warning module through a signal line, and the safety warning module is connected with a default judgment module and a warning notification module through signal lines. The default judgment module predicts whether the client will default according to the client's scoring result, and the warning notification module issues a notification when the default judgment module determines that the client's default probability exceeds a set value.

Compared with the prior art, the invention has the following beneficial effects:

1. The prediction method of the invention is based on big data analysis. Compared with the traditional scorecard approach, it uses machine learning algorithms to learn the model automatically, is more sensitive to changes in client data and predicts more accurately. The credit model can automatically, quickly and effectively predict whether a default will occur during the credit life cycle, so that approval can be processed quickly.

2. The invention adopts a prediction model and an optimized logistic regression algorithm, satisfies complex credit constraints, and obtains more accurate default probability predictions and risk premium results.

3. Based on the conversion of the default probability and risk premium results, the invention frees auditors from burdensome credit risk assessment, review and pricing, and improves the efficiency of credit approval.

Drawings

FIG. 1 is a block diagram of the system of the present invention;

FIG. 2 is a system block diagram of a system control center of the present invention;

FIG. 3 is a system block diagram of the comprehensive assessment module of the present invention.

Detailed Description

The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments in the following description are given by way of example only, and other obvious variations will occur to those skilled in the art.

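A machine learning interpretability-oriented credit default prediction method, comprising the steps of:

S1, data collection: the collected data sources comprise business statistical data, credit investigation data provided by a bank and big data provided by a third party;

S2, data preprocessing: the cleaning and screening module cleans the input data; if a variable is missing from the data, a small number of non-core records are deleted; if too many records would be deleted, the missing values are instead filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information;

S3, data division and training: the cleaned data are divided into a plurality of groups, logistic regression, random forest, XGBoost and deep learning are applied to the groups respectively, and the prediction results of the groups are summed and averaged;

S4, model verification: the established model is verified by a verification module; new data are introduced, the prediction results of each group are evaluated separately, and the average is then evaluated to find the optimal model.

In S4, the model is verified on a large amount of data by the quantitative analysis module, and the accuracy of the model's predictions is evaluated by the qualitative analysis module.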

As shown in FIGS. 1-3, the invention also discloses a machine learning interpretability-oriented credit default prediction system. The system comprises a data acquisition module connected through a signal line to a cleaning and screening module, which cleans the input data: if a variable is missing from the data, a small number of non-core records are deleted; if too many records would be deleted, the missing values are instead filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information. The cleaning and screening module is connected by signal to a data dividing module, which divides the cleaned data into a plurality of groups. The data dividing module is connected by signal to a system control center, which controls and manages the prediction system. The system control center is connected through signal lines to a model establishing module, which establishes the evaluation model, and to a comprehensive evaluation module, which scores the client's credit risk. The model establishing module is connected through a signal line to a training module, which applies logistic regression, random forest, XGBoost and deep learning to the groups of data respectively and sums and averages the prediction results of the groups. The model establishing module is also connected through a signal line to a verification module, which verifies the established model: new data are introduced, the prediction results of each group are evaluated separately, and the average is then evaluated to find the optimal model. The verification module is connected through signal lines to a quantitative analysis module, which verifies the model on a large amount of data, and to a qualitative analysis module, which evaluates the accuracy of the model's predictions.

The data sources collected by the data collection module comprise business statistical data, credit investigation data provided by a bank and big data provided by a third party.

The training module is connected with a supervised learning module and an unsupervised learning module through signal lines and trains a denoising gradient boosting tree: unsupervised learning is performed on historical online personal credit information to obtain first data features, and supervised learning is then performed on the first data features to complete the training of the denoising gradient boosting tree model.

The system control center comprises a system management host, the system management host is connected with a database through a local area network, the database is used for storing data in the system, the system management host is connected with a data dividing module, a comprehensive evaluation module and a model establishing module through signal lines, and the local area network is provided with a firewall.

The database is connected with a timed backup module through a signal line; the timed backup module backs up the data in the database at set times, so that the data can be restored promptly if data in the database are lost and the loss can be recovered. The database is connected with an automatic updating module through a signal line, and the automatic updating module updates the data in the database periodically. The database is also connected with a chart display module through a signal line, and the chart display module aggregates the data in the database and displays them as charts, so that managers of the system can grasp the data visually.

The comprehensive evaluation module contains a history evaluation module, an industry evaluation module, a position evaluation module, a region evaluation module, a certificate evaluation module, an income evaluation module and an authenticity evaluation module, and comprehensive evaluation is performed by evaluating the client's historical credit record, the industry in which the client works, the position the client holds, the region in which the client is located, the certificates provided, the income level and the authenticity of the data provided.

The comprehensive evaluation module is connected with a data input module through a data line, and the client's data are entered through the data input module. The comprehensive evaluation module is connected with a score establishment module through a signal line, and the score establishment module evaluates the client's credit risk using the generated model. The comprehensive evaluation module is connected with a risk prediction module through a signal line, and the risk prediction module assigns the client a risk level according to the score. The comprehensive evaluation module is connected with a feedback module through a signal line, and the feedback module feeds the scoring result back to the system control center.
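As an illustration of how a risk prediction module might turn a model output into a risk level, the following Python sketch maps a predicted default probability to a grade. The cut-off values, the grade labels and the function name `risk_grade` are hypothetical; the patent does not specify any of them.

```python
from bisect import bisect_right

# Hypothetical cut-offs on the predicted default probability (ascending).
CUTOFFS = [0.05, 0.20, 0.50]
# One more label than cut-offs: below 0.05, [0.05, 0.20), [0.20, 0.50), 0.50 and above.
GRADES = ["low", "medium", "high", "very high"]


def risk_grade(default_probability):
    """Return the risk grade whose probability band contains the given value."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default probability must lie in [0, 1]")
    return GRADES[bisect_right(CUTOFFS, default_probability)]
```

Keeping the cut-offs in a sorted list means the bands can be retuned without touching the lookup logic.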

The system control center is connected with a safety warning module through a signal line, and the safety warning module is connected with a default judgment module and a warning notification module through signal lines. The default judgment module predicts whether the client will default according to the client's scoring result, and the warning notification module issues a notification when the default judgment module determines that the client's default probability exceeds a set value.
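The threshold rule of the default judgment and warning notification modules can be sketched in Python as below. The function name `check_default_alert`, the default `set_value` of 0.5 and the injectable `notify` callback are assumptions made for illustration; the patent only states that a notification is issued when the default probability exceeds a set value.

```python
def check_default_alert(customer_id, default_probability, set_value=0.5, notify=print):
    """Call notify and return True when the default probability exceeds set_value."""
    if default_probability > set_value:
        notify(
            f"Customer {customer_id}: predicted default probability "
            f"{default_probability:.2f} exceeds the set value {set_value:.2f}"
        )
        return True
    return False
```

Passing the notification channel as a callback keeps the judgment rule separate from the delivery mechanism, which mirrors the split between the two modules.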

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

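1. A machine learning interpretability-oriented credit default prediction method, comprising the steps of:

S1, data collection: the collected data sources comprise business statistical data, credit investigation data provided by a bank and big data provided by a third party;

S2, data preprocessing: the cleaning and screening module cleans the input data; if a variable is missing from the data, a small number of non-core records are deleted; if too many records would be deleted, the missing values are instead filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information;

S3, data division and training: the cleaned data are divided into a plurality of groups, logistic regression, random forest, XGBoost and deep learning are applied to the groups respectively, and the prediction results of the groups are summed and averaged;

S4, model verification: the established model is verified by a verification module; new data are introduced, the prediction results of each group are evaluated separately, and the average is then evaluated to find the optimal model.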

2. The machine learning interpretability-oriented credit default prediction method of claim 1, wherein in S4 the model is verified on a large amount of data by the quantitative analysis module, and the accuracy of model prediction is evaluated by the qualitative analysis module.

3. A system applying the machine learning interpretability-oriented credit default prediction method of any one of claims 1-2, comprising a data acquisition module, wherein the data acquisition module is connected with a cleaning and screening module through a signal line; the cleaning and screening module cleans input data, wherein if a variable of the data is missing, a small number of non-core records are deleted, and if too many records would be deleted, the missing values are filled in by sampling from the overall distribution and by maximum likelihood estimation based on other information; the cleaning and screening module is connected by signal with a data dividing module, and the cleaned data are divided into a plurality of groups by the data dividing module; the data dividing module is connected by signal with a system control center, which is used for controlling and managing the prediction system; the system control center is connected through signal lines with a model establishing module and a comprehensive evaluation module, an evaluation model being established by the model establishing module and the credit risk of the client being scored by the comprehensive evaluation module; the model establishing module is connected through a signal line with a training module, which applies logistic regression, random forest, XGBoost and deep learning to the plurality of groups of data respectively and sums and averages the prediction results of the groups; the model establishing module is connected through a signal line with a verification module, which verifies the established model by introducing new data, evaluating the prediction results of each group separately and then evaluating the average to find the optimal model; and the verification module is connected through signal lines with a quantitative analysis module and a qualitative analysis module, the model being verified on a large amount of data by the quantitative analysis module and the accuracy of model prediction being evaluated by the qualitative analysis module.

4. The system of claim 3, wherein the data sources collected by the data acquisition module include business statistical data, credit investigation data provided by a bank and big data provided by a third party.

5. The system of claim 3, wherein the training module is connected with a supervised learning module and an unsupervised learning module through signal lines and trains a denoising gradient boosting tree, unsupervised learning being performed on historical online personal credit information to obtain first data features and supervised learning being performed on the first data features to complete the training of the denoising gradient boosting tree model.

6. The system for predicting the machine-learning interpretable credit default according to claim 3, wherein the system control center comprises a system management host, the system management host is connected with a database through a local area network, the database is used for storing data in the system, the system management host is connected with the data dividing module, the comprehensive evaluation module and the model building module through signal lines, and the local area network is provided with a firewall.

7. The system of claim 6, wherein the database is connected with a timed backup module through a signal line, the timed backup module backing up the data in the database at set times so that the data can be restored promptly if data in the database are lost and the loss can be recovered; the database is connected with an automatic updating module through a signal line, the automatic updating module updating the data in the database periodically; and the database is connected with a chart display module through a signal line, the chart display module aggregating the data in the database and displaying them as charts so that managers of the system can grasp the data visually.

8. The system of claim 3, wherein the comprehensive evaluation module contains a history evaluation module, an industry evaluation module, a position evaluation module, a region evaluation module, a certificate evaluation module, an income evaluation module and an authenticity evaluation module, and comprehensive evaluation is performed by evaluating the client's historical credit record, the industry in which the client works, the position the client holds, the region in which the client is located, the certificates provided, the income level and the authenticity of the data provided.

9. The system for predicting the credit default oriented to the machine learning interpretability as claimed in claim 3, wherein the comprehensive assessment module is connected with a data input module through a data line, the data input module inputs data of a customer, the comprehensive assessment module is connected with a score establishment module through a signal line, the score establishment module evaluates the credit risk of the customer through the generated model, the comprehensive assessment module is connected with a risk prediction module through a signal line, the risk prediction module grades the risk level of the customer through scoring, the comprehensive assessment module is connected with a feedback module through a signal line, and the scoring result can be fed back to the system control center through the feedback module.

10. The system of claim 3, wherein the system control center is connected with a security alarm module through a signal line, the security alarm module is connected with a default judgment module and an alarm notification module through signal lines, the default judgment module predicts whether the customer will default according to the customer's scoring result, and the alarm notification module issues a notification when the default judgment module determines that the customer's default probability exceeds the set value.