CN111192140A - Method and device for predicting customer default probability - Google Patents

Info

Publication number
CN111192140A
Authority
CN
China
Prior art keywords
default
data
probability prediction
default probability
customer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010000782.4A
Other languages
Chinese (zh)
Inventor
刘鹏飞
耿少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202010000782.4A
Publication of CN111192140A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Biology (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

An embodiment of the invention discloses a method and a device for predicting the default probability of a customer. The method comprises: selecting corresponding index data from a pre-stored data set according to the identifier of the customer to be identified, the index data including at least one of enterprise basic information, enterprise credit data, enterprise financial data, enterprise business data and enterprise social security data; and taking the index data as the input of a pre-trained default probability prediction model to predict the default probability of the customer to be identified. In this way, the customer default probability can be predicted directly, and the prediction accuracy of the customer default probability can be improved.

Description

Method and device for predicting customer default probability
Technical Field
The embodiment of the invention relates to big data analysis technology, in particular to a method and a device for predicting customer default probability.
Background
In the credit business, before borrowing enterprises take out loans, banks cannot accurately judge which of them will not default. A default by a borrowing enterprise prevents the bank from obtaining the income due under the terms of the original contract. If a large number of customers default, the commercial bank accumulates more bad debts, which further affects its income, reduces its profit and may even lead to bankruptcy. Thus, for effective risk management, estimating the expected default probability of a borrowing enterprise is the key and the starting point for measuring and estimating the enterprise's credit default risk.
Measuring default probability is the primary prerequisite for a commercial bank's credit risk management and an important driver for improving the quality of that management. To manage credit risk effectively, a commercial bank must carefully consider the default probability of the borrower and the loss rate that may occur. Measuring the borrower's default probability is basic work for credit risk analysis; its ultimate purpose is to reflect truthfully the possibility that the borrower defaults, thereby ensuring that the commercial bank's credit risk management is scientific and effective.
At present, commercial banks in China evaluate the default probability of customers on the basis of enterprise financial index data: a risk value is obtained through an internal rating system and then converted into an initial default probability. However, with this method the weights among the various default indexes are determined with strong subjectivity, and the risk value must be converted into a default probability, so the default probability cannot be measured directly.
Disclosure of Invention
In view of this, an embodiment of the present invention provides a method for predicting a customer default probability, including:
selecting corresponding index data from a pre-stored data set according to the identifier of the customer to be identified; the index data includes at least one of: enterprise basic information, enterprise credit data, enterprise financial data, enterprise business data and enterprise social security data;
and taking the index data as the input of a pre-trained default probability prediction model to predict the default probability of the customer to be identified.
An embodiment of the invention also provides a device for predicting the default probability of a customer, comprising:
a selection unit, configured to select corresponding index data from a pre-stored data set according to the identifier of the customer to be identified;
and a prediction unit, configured to take the index data as the input of a pre-trained default probability prediction model and predict the default probability of the customer to be identified.
An embodiment of the invention also provides a device for predicting the default probability of a customer, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the above method for predicting customer default probability.
An embodiment of the invention also provides a computer-readable storage medium on which an information processing program is stored; when the information processing program is executed by a processor, the steps of the above method for predicting customer default probability are implemented.
The technical scheme provided by the embodiment of the invention can directly predict the default probability of the customer and can improve the prediction precision of the default probability of the customer.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a flow chart illustrating a method for predicting a customer default probability according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for predicting a customer default probability according to another embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for generating a default probability prediction model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for predicting a customer default probability according to another embodiment of the present invention;
FIG. 5 is a graphical representation of AUC obtained during validation of a test set in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating an apparatus for predicting a customer default probability according to an embodiment of the present invention.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
FIG. 1 is a schematic flowchart of a method for predicting a customer default probability according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step 101, selecting corresponding index data from a pre-stored data set according to the identifier of the customer to be identified; the index data includes at least one of: enterprise basic information, enterprise credit data, enterprise financial data, enterprise business data and enterprise social security data;
Step 102, using the index data as the input of a pre-trained default probability prediction model to predict the default probability of the customer to be identified.
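Steps 101 and 102 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `predict_default_probability`, the dictionary-based data set and the callable `model` are hypothetical stand-ins, not part of the patent text.

```python
from typing import Callable, Dict, Mapping

# Hypothetical representation: index data is a mapping from index name to value.
IndexData = Dict[str, float]


def predict_default_probability(
    customer_id: str,
    data_set: Mapping[str, IndexData],
    model: Callable[[IndexData], float],
) -> float:
    """Select index data by customer identifier, then score it with the model."""
    # Step 101: select the index data stored for this customer identifier.
    index_data = data_set[customer_id]
    # Step 102: feed the index data to the pre-trained model.
    probability = model(index_data)
    if not 0.0 <= probability <= 1.0:
        raise ValueError("model output is not a probability")
    return probability
```

Any trained model that maps index data to a value between 0 and 1 could be passed in as `model`; the range check guards against a model that returns a raw score instead of a probability.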
Optionally, the method further comprises:
and carrying out model training on the sample data by utilizing a model training algorithm to obtain the default probability prediction model.
Optionally, performing model training on the sample data by using a model training algorithm to obtain the default probability prediction model includes:
preprocessing the sample data in the sample training set;
performing correlation analysis against the default variable on the preprocessed sample data by using a feature correlation analysis method to obtain a feature candidate set;
and performing model training on the feature candidate set by using a model training algorithm to obtain the default probability prediction model.
Optionally, preprocessing the sample data in the sample training set includes:
performing missing value processing, abnormal value processing and normalization processing on the sample data.
Optionally, performing correlation analysis against the default variable on the preprocessed sample data by using a feature correlation analysis method to obtain a feature candidate set includes:
constructing default features for the sample data according to the business definition of default;
analyzing the degree of correlation between each default feature and the default variable by using the feature correlation analysis method, and selecting the default features with high correlation;
and taking the selected default features as the feature candidate set.
Optionally, performing model training on the feature candidate set by using a model training algorithm to obtain the default probability prediction model includes:
performing model training on the feature candidate set by using the Xgboost algorithm to obtain the default probability prediction model.
Optionally, after the default probability prediction model is obtained, the method further includes:
performing default probability prediction on sample data in a pre-prepared sample test set according to the default probability prediction model;
and evaluating the default probability prediction model by using an ROC curve, the AUC or the KS value.
The technical scheme provided by the embodiment of the invention can directly predict the default probability of the customer based on the default probability prediction model, and can improve the prediction precision of the default probability of the customer without being limited to enterprise financial data.
FIG. 2 is a schematic flowchart of a method for predicting a customer default probability according to another embodiment of the present invention. As shown in FIG. 2, the method includes:
Step 201, performing model training on sample data by using a model training algorithm to obtain a default probability prediction model;
Optionally, performing model training on the sample data by using a model training algorithm to obtain a default probability prediction model includes:
preprocessing the sample data in the sample training set;
performing correlation analysis against the default variable on the preprocessed sample data by using a feature correlation analysis method to obtain a feature candidate set;
and performing model training on the feature candidate set by using a model training algorithm to obtain the default probability prediction model.
Optionally, preprocessing the sample data in the sample training set includes:
performing missing value processing, abnormal value processing and normalization processing on the sample data.
The missing value processing includes: for each field in the sample data, first determining the range of missing values and counting the proportion of missing values in the field. A field with a high missing rate (for example, greater than 90%) is treated as a non-important field and can be deleted directly. For the other numerical fields, missing values are filled with values that have practical meaning. For example, since enterprise credit data and enterprise financial data have practical meaning, missing data in them can be filled with 0.
The abnormal value processing includes: for each field in the sample data, determining the range of abnormal values and counting the proportion of abnormal values in the field. A field with a high abnormal rate (for example, greater than 90%) is treated as a non-important field and is deleted. For the other numerical fields, abnormal values are replaced with values that have practical meaning; since data such as financial loan data have practical meaning, abnormal data can be filled with 0.
The normalization processing includes: formatting data such as dates and times in the sample data uniformly, and normalizing the numerical data to eliminate inconsistent dimensions (units and scales).
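The three preprocessing steps above can be sketched as follows. This is a simplified illustration, not the patented implementation: missing and out-of-range values are handled together as "bad" values, the per-field valid ranges are supplied by the caller, and min-max scaling is assumed for normalization because the text does not name a specific method. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Range = Tuple[float, float]
UNBOUNDED: Range = (float("-inf"), float("inf"))


def is_bad(value: Optional[float], valid: Range) -> bool:
    """A value is bad if it is missing or falls outside its valid range."""
    return value is None or not (valid[0] <= value <= valid[1])


def clean(rows: List[Row], valid_ranges: Dict[str, Range],
          max_bad_rate: float = 0.9) -> List[Dict[str, float]]:
    """Drop fields that are mostly bad, then fill remaining bad values with 0."""
    fields = sorted({name for row in rows for name in row})
    kept = []
    for name in fields:
        valid = valid_ranges.get(name, UNBOUNDED)
        bad = sum(1 for row in rows if is_bad(row.get(name), valid))
        if bad / len(rows) <= max_bad_rate:  # keep fields at or below the threshold
            kept.append(name)
    return [
        {name: 0.0 if is_bad(row.get(name), valid_ranges.get(name, UNBOUNDED))
         else float(row[name]) for name in kept}
        for row in rows
    ]


def min_max_normalize(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Scale every field to [0, 1] so that fields share a common scale."""
    if not rows:
        return []
    result = [dict(row) for row in rows]
    for name in rows[0]:
        values = [row[name] for row in rows]
        low, span = min(values), max(values) - min(values)
        for out in result:
            out[name] = (out[name] - low) / span if span else 0.0
    return result
```

Filling with 0 follows the example in the text; for fields where 0 has no practical meaning, another fill value would be needed.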
A certain amount of sample data is obtained in advance, covering enterprise basic information, enterprise financial data, enterprise credit data, and non-financial data such as tax data and industrial and commercial data. Each sample may consist of enterprise-related data and an enterprise label. In addition, the sample data prepared in advance can be randomly divided into a sample training set and a sample test set according to a preset ratio (for example, 7:3). The samples fall into two classes: default enterprise samples and non-default enterprise samples.
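The random division into a training set and a test set can be sketched as below. The function name `split_samples` and the fixed random seed are assumptions made for reproducibility of the illustration; the 7:3 ratio is the example given in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], train_ratio: float = 0.7,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and cut them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Because default enterprises are usually far fewer than non-default ones, a practical split might be done separately within each class so that both sets keep a similar default rate; the text does not say whether this is done.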
In addition, before the sample data is processed, the data can be aligned by time window. For example, the default enterprise sample data is aligned using the first default time of the default customer as the reference, and the non-default enterprise sample data is aligned using the default time of most default customers as the reference.
Optionally, performing correlation analysis against the default variable on the preprocessed sample data by using a feature correlation analysis method to obtain a feature candidate set includes:
constructing default features for the sample data according to the business definition of default;
analyzing the degree of correlation between each default feature and the default variable by using the feature correlation analysis method, and selecting the default features with high correlation;
and taking the selected default features as the feature candidate set.
The business definition of default is the most important definition in the internal ratings-based approach of the Basel New Capital Accord, and is the basis for estimating credit risk parameters such as probability of default (PD), loss given default (LGD) and exposure at default (EAD).
The default variable indicates whether the customer defaults.
Optionally, constructing default features for the sample data according to the business definition of default includes:
constructing default features based on nominal attributes, and constructing default features based on the business.
An attribute is a data field that represents a feature of a data object. The type of an attribute is determined by the set of values the attribute may take. "Nominal" means "relating to names": the values of a nominal attribute are symbols or names of things, and each value represents a category, code or state. Even when a nominal attribute is coded with numbers, it cannot be treated as a numeric attribute, because its values are not quantitative and it makes no sense to compute the mean or median of a nominal attribute over a data set. What is of interest is the value that occurs most often, called the mode, which is a measure of central tendency.
Specifically, nominal attribute data may be converted into numerical form by one-hot encoding.
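One-hot encoding of a nominal attribute can be illustrated as follows. The function name and the `is_` prefix for the generated columns are hypothetical choices for this sketch, as is the industry example.

```python
from typing import Dict, List, Sequence


def one_hot_encode(values: Sequence[str]) -> List[Dict[str, int]]:
    """Turn each nominal value into a 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]
```

For example, an industry attribute with the values "retail" and "manufacturing" becomes two indicator columns, exactly one of which is 1 for each enterprise.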
Business-based default features are constructed separately for the different kinds of index data. For example, financial data indexes are converted into more expressive attributes according to business knowledge; statistical features are constructed over time slices for credit indexes; and statistical features are likewise constructed for data such as industrial and commercial data.
Optionally, the feature correlation analysis method performs a preliminary correlation analysis between the constructed default features and the default variable, removes the default features with low correlation, and takes the remaining default features as the feature candidate set.
Specifically, for example, the Pearson correlation coefficient, a regularization method and a model-based ranking algorithm are combined to analyze the degree of correlation between each default feature and the default variable, and features with low correlation are removed. The correlation among the features is also analyzed, and attributes that are highly correlated with each other are merged. The LDA method can also be used to reduce the feature dimensionality. The default features obtained after selection or dimensionality reduction are taken as the feature candidate set, that is, the input of the subsequent model training algorithm.
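The Pearson part of this selection can be sketched as below. Only the Pearson coefficient is shown; the regularization and model-based ranking that the text combines with it are omitted. The threshold of 0.1 and the function names are assumptions for illustration and do not come from the patent.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(features: Dict[str, Sequence[float]],
                    default_variable: Sequence[float],
                    threshold: float = 0.1) -> List[str]:
    """Keep features whose absolute correlation with the default variable is high enough."""
    return [
        name for name, column in features.items()
        if abs(pearson(column, default_variable)) >= threshold
    ]
```

A constant feature carries no information about default, so it receives a correlation of 0.0 here and is removed.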
Optionally, performing model training on the feature candidate set by using a model training algorithm to obtain the default probability prediction model includes:
performing model training on the feature candidate set by using the Xgboost algorithm to obtain the default probability prediction model.
Alternatively, the model training algorithm may be any one of the existing model training algorithms other than the Xgboost algorithm.
Parameters of the Xgboost algorithm can be set, and the feature candidate set is then trained with these parameters to obtain the default probability prediction model. For example, the parameters with the best overall effect are selected by cross validation, and the Xgboost parameters are set as follows: the number of iterations is 200, the classifier type is "local", the learning rate eta is 0.1, and the maximum tree depth max_depth is 5.
In this step, whether the customer defaults is used as the target learning variable, and xgboost is trained on the extracted samples to obtain the default probability prediction model.
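The parameter settings above could be written down as a configuration like the following. The values of `eta`, `max_depth` and the number of boosting rounds come from the text; the `objective` value is an assumption, chosen because the target is a yes/no default variable and a probability is wanted as output, and it is not stated in the source.

```python
# Parameters for XGBoost's native training interface.
XGBOOST_PARAMS = {
    "objective": "binary:logistic",  # assumption: not stated in the source text
    "eta": 0.1,                      # learning rate, from the text
    "max_depth": 5,                  # maximum tree depth, from the text
}

# Number of boosting iterations, from the text.
NUM_BOOST_ROUND = 200
```

With the xgboost package installed, these settings would typically be passed as `xgboost.train(XGBOOST_PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)`, where `dtrain` is an `xgboost.DMatrix` built from the feature candidate set and the default labels.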
Step 202, selecting corresponding index data from a pre-stored data set according to the identifier of the customer to be identified;
wherein the index data includes at least one of: enterprise basic information, enterprise credit data, enterprise financial data, enterprise business data and enterprise social security data.
In this step, on top of the enterprise's financial data, non-financial data of the enterprise (such as enterprise business data and tax data) is brought into the default prediction, which can effectively improve the accuracy of default probability prediction.
Step 203, using the index data as the input of the default probability prediction model to predict the default probability of the customer to be identified.
In this step, the default probability is predicted with the default probability prediction model, which effectively solves the problem that commercial banks currently cannot calculate the default probability directly.
Optionally, the selected index data may be aligned by time window: the index data recorded before the prediction time point is selected and used as the input of the default probability prediction model, and the customer's default probability at the prediction time point is calculated.
When the default probability prediction model obtained by training with the Xgboost algorithm is used, the performance problems that traditional methods face with massive data can be effectively addressed, because Xgboost supports distributed computation. The model trained with the xgboost algorithm also improves the precision of default prediction for enterprise customers.
In another embodiment of the present invention, on the basis of the previous embodiment, after obtaining the default probability prediction model, the method further includes: carrying out default probability prediction on sample data in a sample test set prepared in advance according to the default probability prediction model; and evaluating the default probability prediction model by utilizing a ROC curve, an AUC or a KS value.
Specifically, all sample data may be divided into a training set and a test set. The model is trained on the training set, and the test set is then used to measure the predictive performance of the resulting model, that is, how accurate the model is as a whole.
The ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) are often used to evaluate the quality of a binary classifier.
The abscissa of the ROC curve is the false positive rate (FPR), equal to 1 - specificity: the proportion of all actual negative instances that the classifier predicts as positive. The ordinate is the true positive rate (TPR): the proportion of all actual positive instances that the classifier predicts as positive.
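The ROC curve and the AUC described above can be computed with a short sketch. It assumes labels of 1 for default (the positive class) and 0 for non-default, that both classes are present, and that higher scores mean higher predicted default probability; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a model that ranks every defaulting customer above every non-defaulting one.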
KS (Kolmogorov-Smirnov) is used to evaluate the risk discrimination capability of the model. The index measures the difference between the cumulative distributions of good and bad samples. The greater this cumulative difference, the greater the KS index and the stronger the risk discrimination capability of the model. For example, the KS value obtained in one test set validation run is 0.3674.
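The KS value can be computed as the largest gap between the cumulative share of bad samples and the cumulative share of good samples when samples are ranked by predicted score. The sketch below assumes labels of 1 for bad (default) and 0 for good samples, with both classes present; the function name is hypothetical.

```python
from typing import Sequence


def ks_statistic(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Maximum gap between cumulative bad-sample and good-sample proportions."""
    bad_total = sum(labels)
    good_total = len(labels) - bad_total
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    bad_seen = good_seen = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score are accumulated together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                bad_seen += 1
            else:
                good_seen += 1
            i += 1
        gap = abs(bad_seen / bad_total - good_seen / good_total)
        best = max(best, gap)
    return best
```

The gap at each threshold equals the difference between the true positive rate and the false positive rate, so the KS value is also the largest vertical distance between the ROC curve and the diagonal.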
In this embodiment of the invention, the default probability prediction model is evaluated by using the test set, which provides a basis for subsequent model training and further improves the prediction precision of the default probability prediction model.
In another embodiment of the present invention, based on the above embodiments, a method for generating a default probability prediction model is provided. As shown in FIG. 3, the method includes:
1. Default variable definition;
The default variable indicates whether the customer defaults.
Specifically, defining the default variable means giving a business definition of the target problem, with whether the customer defaults as the criterion. The target problem is whether the customer will default, and the business definition specifies what counts as a default; for example, a customer who is overdue for more than 90 days is defined as being in default.
2. Data collection and arrangement;
Data collection and arrangement means obtaining a certain amount of sample data, where the samples cover enterprise basic information, enterprise financial indexes, enterprise credit data, and non-financial indexes such as enterprise tax data and enterprise industrial and commercial data.
A sample may consist of the enterprise-related attribute indexes and the enterprise label.
3. Data time window alignment;
Data time window alignment means aligning the sample data of default customers by time window using each customer's first default time as the reference, and aligning the non-default customers using the default time of most default customers as the reference.
4. Data preprocessing;
Data preprocessing means applying missing value processing, abnormal value processing and normalization processing to the obtained sample data.
5. Data division;
Data division means dividing the sample data from step 4 into a training set and a test set according to a certain ratio.
6. Default feature construction;
Default features can be constructed based on business knowledge, time slicing and other means. For example, the main constructed features are as follows:
1) Financial features;
Financial features represent the operating and liability condition of the enterprise during the observation period and the degree to which the enterprise depends on loans. They are extracted mainly from the customer's financial statements. Examples are as follows:
A. total liability ratio = total liabilities / total assets, which characterizes the long-term liability condition of the enterprise;
B. current liability ratio = current liabilities / current assets, which characterizes the short-term liability condition of the enterprise.
2) Loan and asset features;
Loan and asset features represent the loan-related condition of enterprise M at bank A. They are extracted mainly from financial statements and credit data and are constructed on the basis of time slices. Examples are as follows:
A. loan ratio over the last three months = loan amount at bank A over the last three months / total liabilities for the current period;
B. loan ratio over the last six months = loan amount at bank A over the last six months / total liabilities for the current period;
C. asset ratio over the last three months = asset amount held at bank A over the last three months / total assets for the current period.
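The ratio features listed under 1) and 2) can be sketched as follows. The representation of months as integer indexes, the record layout of (month, amount) pairs, the function names, and the choice to return None for a zero denominator are all assumptions made for this illustration.

```python
from typing import Optional, Sequence, Tuple


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Safe division: None when the denominator is zero."""
    return numerator / denominator if denominator else None


def month_index(year: int, month: int) -> int:
    """Map a calendar month to a consecutive integer so windows are easy to compare."""
    return year * 12 + (month - 1)


def windowed_amount(records: Sequence[Tuple[int, float]], as_of: int, window: int) -> float:
    """Sum the amounts recorded in the last `window` months up to and including `as_of`."""
    return sum(amount for month, amount in records if as_of - window < month <= as_of)


def loan_ratio(loans: Sequence[Tuple[int, float]], as_of: int, window: int,
               total_liabilities: float) -> Optional[float]:
    """Loan amount at the bank in the time slice divided by current total liabilities."""
    return ratio(windowed_amount(loans, as_of, window), total_liabilities)


def total_liability_ratio(total_liabilities: float, total_assets: float) -> Optional[float]:
    """Total liabilities divided by total assets."""
    return ratio(total_liabilities, total_assets)


def current_liability_ratio(current_liabilities: float, current_assets: float) -> Optional[float]:
    """Current liabilities divided by current assets."""
    return ratio(current_liabilities, current_assets)
```

Calling `loan_ratio` with `window=3` and `window=6` gives the three-month and six-month time-slice features from the same loan records.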
3) Community features;
Community features are based on community detection over the graph formed by the guarantee relationships at the time point corresponding to each time slice. The features of the resulting guarantee chains and circles are calculated, and they represent the influence on enterprise M of the loan condition, at bank A, of the members of the guarantee chain or circle to which enterprise M belongs. Examples are as follows:
A. the number of clients in the guarantee circle that are currently overdue at bank A within the time slice;
B. the average number of current overdue occurrences at bank A within the guarantee circle in the time slice.
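The community features can be illustrated with a small graph sketch. The text does not name a community detection algorithm, so connected components of the guarantee graph are used here as a simple stand-in; this, the inclusion of the enterprise itself among the circle members, and all names below are assumptions.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def guarantee_circle(edges: Iterable[Tuple[str, str]], enterprise: str) -> Set[str]:
    """Enterprises reachable from `enterprise` through guarantee relationships."""
    neighbours = defaultdict(set)
    for guarantor, borrower in edges:
        neighbours[guarantor].add(borrower)
        neighbours[borrower].add(guarantor)
    seen = {enterprise}
    queue = deque([enterprise])
    while queue:  # breadth-first search over the undirected guarantee graph
        node = queue.popleft()
        for other in neighbours[node]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def circle_features(edges: Iterable[Tuple[str, str]], overdue_counts: Dict[str, int],
                    enterprise: str) -> Dict[str, float]:
    """Overdue statistics of the guarantee circle for one time slice."""
    members = guarantee_circle(edges, enterprise)
    counts = [overdue_counts.get(member, 0) for member in members]
    return {
        "overdue_clients": sum(1 for count in counts if count > 0),
        "avg_overdue_count": sum(counts) / len(counts),
    }
```

The edges and overdue counts would be rebuilt for each time slice, so the same enterprise can have different circle features at different points in time.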
7. Default feature selection;
A feature correlation analysis method is used to perform a preliminary correlation analysis between the default features constructed in step 6 and the default variable. Default features with low correlation are removed, and the remaining default features form the feature candidate set.
The feature correlation analysis combines several methods, mainly the Pearson correlation coefficient, regularization methods, and model-based feature selection. Model-based feature selection applies a machine learning model, such as a regression model or a random forest model, to select features.
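The Pearson part of this analysis can be illustrated with a small filter that keeps only features whose absolute correlation with the default variable reaches a threshold. The threshold value, the function names, and the sample columns are assumptions; the text does not state a cutoff.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # A constant column carries no linear signal.
    return cov / math.sqrt(var_x * var_y)


def select_features(features, default_flags, threshold):
    """Keep features whose absolute correlation with the default variable reaches the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, default_flags)) >= threshold
    ]


# Hypothetical feature columns and default labels (1 = default).
features = {
    "total_liability_ratio": [0.9, 0.8, 0.3, 0.2],
    "constant_field": [1.0, 1.0, 1.0, 1.0],
}
default_flags = [1, 1, 0, 0]
candidate_set = select_features(features, default_flags, threshold=0.3)
```

Pearson correlation only captures linear association, which is presumably why the text combines it with regularization and model-based selection.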
8. Model training.
Model training uses xgboost to train on the samples in the training set extracted in step 5, with whether the customer defaults as the target learning variable (i.e., the default variable), to obtain a default probability prediction model.
Model training further comprises model evaluation and model monitoring.
Model evaluation is part of the model training process: the sample data in the test set from step 4 is used to verify and evaluate the model. For example, the default probability prediction model can be evaluated using the ROC curve, AUC, or KS value. Model training is iterative and requires repeated rounds of training and evaluation. In addition, after the model goes online, a monitoring process tracks its performance.
Fig. 4 is a flowchart of a method for predicting a customer default probability according to an embodiment of the present invention. As shown in Fig. 4, the method includes:
Step 401, sample data preprocessing;
In this embodiment, data from a commercial bank is used as sample data. The data includes basic enterprise information, enterprise credit data, enterprise financial data, enterprise business data, and enterprise social security data. The samples are divided into two categories: defaulting enterprises and non-defaulting enterprises. A summary of the data is given in Table 1 below:
Table 1: (provided as an image in the original publication: Figure BDA0002353311510000121)
Specifically, the sample data prepared in advance is randomly divided into a training set and a test set at a ratio of 7:3. For each field, the range of missing values is determined and the proportion of missing values is counted. Fields with a high missing rate (e.g., greater than 90%) are treated as non-important fields and deleted. For the other numerical fields, missing values are filled with a value that has practical meaning; because data such as financial and loan figures have practical meaning, missing values are filled with 0 here. Outlier handling is applied to all samples, and outliers are treated as missing values. Values that fall outside business requirements are likewise treated as missing. Many fields (features) are tied to specific businesses: for example, if a certain loan business limits the age range to 20-50 years and a value of 70 appears, that value is regarded as an outlier and treated as a missing value. Data such as dates and times is converted to a uniform format. Numerical data is normalized to eliminate inconsistent scales.
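The preprocessing steps described above can be sketched with plain Python records. This is an illustration under stated assumptions: rows are dictionaries with `None` for missing values, min-max scaling stands in for the unspecified normalization method, and all function and field names are hypothetical.

```python
import random


def split_samples(samples, train_fraction=0.7, seed=0):
    """Randomly split samples into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mark_outliers_missing(rows, field, low, high):
    """Replace values outside the business range [low, high] with None (missing)."""
    for row in rows:
        value = row.get(field)
        if value is not None and not (low <= value <= high):
            row[field] = None


def drop_sparse_fields(rows, max_missing_rate=0.9):
    """Delete fields whose share of missing values exceeds max_missing_rate."""
    fields = {field for row in rows for field in row}
    for field in fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        if missing / len(rows) > max_missing_rate:
            for row in rows:
                row.pop(field, None)


def fill_missing_with_zero(rows):
    """Fill the remaining missing values with 0."""
    for row in rows:
        for field, value in row.items():
            if value is None:
                row[field] = 0.0


def min_max_normalize(rows, field):
    """Rescale one numeric field to the range [0, 1]."""
    values = [row[field] for row in rows]
    low, high = min(values), max(values)
    span = high - low
    for row in rows:
        row[field] = 0.0 if span == 0 else (row[field] - low) / span


# Hypothetical records: "rare" is always missing, and age 70 is out of range.
rows = [
    {"age": 30, "loan": 100.0, "rare": None},
    {"age": 70, "loan": None, "rare": None},
    {"age": 45, "loan": 300.0, "rare": None},
]
mark_outliers_missing(rows, "age", 20, 50)
drop_sparse_fields(rows)
fill_missing_with_zero(rows)
min_max_normalize(rows, "loan")
train_set, test_set = split_samples(range(10))
```

Marking outliers as missing before the fill step means a single rule, filling with 0, covers both cases, which matches the order described in the text.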
Step 402, constructing default characteristics for the sample data in the preprocessed training set;
wherein constructing the default feature may include:
1. Constructing features based on nominal attributes;
Nominal attribute data is quantified numerically using one-hot encoding. For example, the customer's basic information contains a variable for the ownership of the office premises, whose value can be owned, leased, or other. A feature is constructed from this variable: whether the office premises are owned, with value 1 (yes) if owned and 0 otherwise (leased or other).
2. Constructing features based on business knowledge;
Financial data indicators are converted into attributes with higher expressive power according to business knowledge. Financial information characterizes the enterprise's ability to withstand risk, and finance-related features are constructed and selected based on business understanding. For example, the asset-liability ratio is constructed as total liabilities / total assets.
Statistical features are constructed over time slices for credit-type indicators, and also for data such as industry and commerce data. Examples of statistical features include the mean and the variance.
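The office-ownership example can be written out as follows. The sketch shows both a full one-hot encoding of the three values and the single binary feature the example actually constructs; the function names and column prefix are assumptions.

```python
def one_hot(value, categories, prefix):
    """Encode one nominal value as a dict of 0/1 indicator columns."""
    return {f"{prefix}_{category}": int(value == category) for category in categories}


def is_office_owned(value):
    """Binary feature from the example: 1 if owned, 0 for leased or other."""
    return 1 if value == "owned" else 0


OFFICE_CATEGORIES = ["owned", "leased", "other"]  # Values named in the text.
encoded = one_hot("leased", OFFICE_CATEGORIES, prefix="office")
owned_flag = is_office_owned("leased")
```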
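The statistical features over a time slice can be sketched with the standard library. Whether the variance is a population or a sample variance is not stated in the text; population variance is assumed here, and the indicator values are hypothetical.

```python
from statistics import mean, pvariance


def slice_statistics(values):
    """Mean and population variance of one indicator within a time slice."""
    return {"mean": mean(values), "variance": pvariance(values)}


# Hypothetical monthly values of one credit-type indicator in a time slice.
monthly_overdue_amounts = [2.0, 4.0, 4.0, 6.0]
stats = slice_statistics(monthly_overdue_amounts)
```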
Step 403, default feature selection;
Default feature selection performs feature correlation analysis between the constructed default features and the default variable using the Pearson correlation coefficient, regularization methods, model-based ranking algorithms, and similar approaches, keeping default features with high correlation and removing those with low correlation.
Specifically, the degree of correlation between each default feature and the default variable can be analyzed by combining the Pearson correlation coefficient, a regularization method, and a model-based ranking algorithm, and features with low correlation are removed. The correlation among the features themselves is also analyzed, and highly correlated features are merged. The LDA method can also be used to reduce the feature dimensionality. The selected or dimension-reduced features are then used as the input of Xgboost.
Step 404, performing model training by using an Xgboost algorithm;
Specifically, the parameters with the best overall effect are selected by cross-validation. The parameters of the Xgboost algorithm are finally set as follows: the number of iterations is 200, the classifier type is binary, the learning rate eta is 0.1, and the maximum tree depth max_depth is 5. The default probability prediction model is trained with these settings.
Cross-validation is a technique used during model training to select the most reasonable parameters for the algorithm.
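The stated settings can be expressed with the xgboost Python package roughly as follows. The text only says the classifier type is binary, so the `binary:logistic` objective is an assumption, as are the function name and the input layout (a list of numeric feature rows plus a list of 0/1 labels).

```python
# Parameters stated in the text; the exact objective string is an assumption.
XGB_PARAMS = {
    "objective": "binary:logistic",  # Assumed reading of the "binary" classifier type.
    "eta": 0.1,                      # Learning rate.
    "max_depth": 5,                  # Maximum depth of each tree.
}
NUM_BOOST_ROUND = 200  # The "number of iterations" in the text.


def train_default_model(feature_rows, default_flags):
    """Train a booster on numeric feature rows and 0/1 default labels."""
    # Imported lazily so the settings above can be inspected without the packages.
    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.asarray(feature_rows, dtype=float), label=default_flags)
    return xgb.train(XGB_PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)
```

With a logistic objective, calling `predict` on the returned booster yields values between 0 and 1 that can be read as default probabilities.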
Step 405, evaluating the performance of the default probability prediction model;
Evaluating model performance is an essential part of model training and a necessary condition for judging whether the model is usable. Evaluation methods include the KS value, AUC, and the like.
Specifically, the default probability prediction model is used to predict on the sample data in the preprocessed test set, and its performance is evaluated.
Optionally, the ROC curve/AUC can be chosen as the evaluation criterion. The abscissa of the ROC curve is the false positive rate (FPR), i.e., 1 - specificity: the proportion of all actual negative instances that the classifier predicts as positive. The ordinate is the true positive rate (TPR): the proportion of all actual positive instances that the classifier predicts as positive. For example, as shown in Fig. 5, the AUC value obtained during validation on a test set is 0.8552.
Optionally, the KS (Kolmogorov-Smirnov) value may also be selected as the evaluation criterion. KS evaluates the risk discrimination capability of the model; the index measures the difference between the cumulative distributions of good and bad samples. The greater the cumulative difference between good and bad samples, the greater the KS index and the stronger the risk discrimination capability of the model. For example, the KS value obtained during validation on a test set is 0.3674.
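The FPR and TPR definitions, and the AUC derived from them, can be made concrete with a small pure-Python sketch. AUC is computed here through its pairwise interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one); the labels and scores are made-up values, not the patent's data.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels (1 = default) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = rates_at_threshold(labels, scores, threshold=0.5)
auc_value = auc(labels, scores)
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; a rank-based computation gives the same value more efficiently on large test sets.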
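The KS value can be sketched as the largest gap between the cumulative score distributions of the two classes. The function name and sample data are illustrative assumptions.

```python
def ks_statistic(labels, scores):
    """Maximum absolute gap between the cumulative score distributions of the two classes."""
    bad = [s for y, s in zip(labels, scores) if y == 1]
    good = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        bad_cdf = sum(1 for s in bad if s <= threshold) / len(bad)
        good_cdf = sum(1 for s in good if s <= threshold) / len(good)
        best = max(best, abs(bad_cdf - good_cdf))
    return best


# Hypothetical test-set labels (1 = default, i.e. "bad") and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ks_value = ks_statistic(labels, scores)
```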
Step 406, selecting corresponding index data from a pre-stored data set according to the identifier of the customer to be identified;
The index data includes at least one of: basic enterprise information, enterprise credit data, enterprise financial data, enterprise business data, and enterprise social security data.
In this step, non-financial enterprise data (such as enterprise business data and tax data) is included in default prediction in addition to the enterprise's financial data, which can effectively improve the accuracy of default probability prediction.
Step 407, using the index data as an input of the default probability prediction model, and predicting the default probability of the customer to be identified.
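Steps 406 and 407 amount to a lookup followed by a model call, which can be sketched as below. The in-memory dictionary standing in for the pre-stored data set, the feature names, and the fixed logistic function standing in for the trained model are all hypothetical.

```python
import math


def select_index_data(dataset, customer_id, feature_names):
    """Step 406: look up the stored index data for one customer identifier."""
    record = dataset[customer_id]
    return [record[name] for name in feature_names]


def predict_default_probability(model, dataset, customer_id, feature_names):
    """Step 407: feed the selected index data to the model."""
    return model(select_index_data(dataset, customer_id, feature_names))


def stand_in_model(feature_values):
    """Hypothetical stand-in for a trained model: a fixed logistic score."""
    weights = [2.0, 1.0]
    z = sum(w * x for w, x in zip(weights, feature_values)) - 1.5
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-stored data set keyed by customer identifier.
dataset = {"C001": {"total_liability_ratio": 0.6, "loan_ratio_3m": 0.3}}
feature_names = ["total_liability_ratio", "loan_ratio_3m"]
probability = predict_default_probability(stand_in_model, dataset, "C001", feature_names)
```

Fixing the order of `feature_names` matters: the model must receive features in the same order it was trained on.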
In this embodiment, the default probability is predicted using the xgboost method, which addresses the problem that commercial banks currently cannot calculate it directly. Because xgboost supports distributed computation, it can also address the performance problems that traditional methods face with massive data. In addition, the model trained with the xgboost algorithm improves the accuracy of customer default prediction.
Fig. 6 is a schematic structural diagram of an apparatus for predicting a customer default probability according to an embodiment of the present invention. As shown in Fig. 6, the apparatus includes:
a selection unit, configured to select corresponding index data from a pre-stored data set according to the identifier of the customer to be identified;
a prediction unit, configured to take the index data as the input of a pre-trained default probability prediction model and predict the default probability of the customer to be identified.
Optionally, the apparatus further comprises:
a model training unit, configured to perform model training on sample data using a model training algorithm to obtain the default probability prediction model.
Optionally, the model training unit is specifically configured to: preprocess the sample data in a sample training set;
perform correlation analysis against the default variable on the preprocessed sample data using a feature correlation analysis method to obtain a feature candidate set;
and perform model training on the feature candidate set using a model training algorithm to obtain the default probability prediction model.
Optionally, preprocessing the sample data in the sample training set includes:
performing missing value processing, outlier processing, and normalization on the sample data.
Optionally, performing correlation analysis against the default variable on the preprocessed sample data using a feature correlation analysis method to obtain a feature candidate set includes:
constructing default features for the sample data according to the business definition of default;
analyzing the degree of correlation between each default feature and the default variable using a feature correlation analysis method, and selecting the default features with high correlation;
and taking the selected default features as the feature candidate set.
Optionally, performing model training on the feature candidate set using a model training algorithm to obtain the default probability prediction model includes:
performing model training on the feature candidate set using the Xgboost algorithm to obtain the default probability prediction model.
Optionally, the apparatus further includes an evaluation unit, configured to, after the default probability prediction model is obtained, perform default probability prediction on sample data in a pre-prepared sample test set using the default probability prediction model,
and evaluate the default probability prediction model using the ROC curve, AUC, or KS value.
The technical scheme provided by the embodiment of the invention can directly predict the default probability of the customer based on the default probability prediction model, and can improve the prediction precision of the default probability of the customer without being limited to enterprise financial data.
An embodiment of the present invention further provides an apparatus for predicting a customer default probability, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements any of the above methods for predicting a customer default probability.
The embodiment of the present invention further provides a computer-readable storage medium, where an information processing program is stored on the computer-readable storage medium, and when the information processing program is executed by a processor, the information processing program implements the steps of any one of the methods for predicting the customer default probability.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. A method for predicting a customer default probability, comprising:
selecting corresponding index data from a pre-stored data set according to the identifier of a customer to be identified, wherein the index data includes at least one of: basic enterprise information, enterprise credit data, enterprise financial data, enterprise business data, and enterprise social security data;
and taking the index data as the input of a pre-trained default probability prediction model to predict the default probability of the customer to be identified.
2. The method of claim 1, further comprising:
performing model training on sample data using a model training algorithm to obtain the default probability prediction model.
3. The method of claim 1, wherein performing model training on sample data using a model training algorithm to obtain the default probability prediction model comprises:
preprocessing sample data in a sample training set;
performing correlation analysis against the default variable on the preprocessed sample data using a feature correlation analysis method to obtain a feature candidate set;
and performing model training on the feature candidate set using a model training algorithm to obtain the default probability prediction model.
4. The method of claim 3, wherein preprocessing the sample data in the sample training set comprises:
performing missing value processing, outlier processing, and normalization on the sample data.
5. The method of claim 3, wherein performing correlation analysis against the default variable on the preprocessed sample data using the feature correlation analysis method to obtain a feature candidate set comprises:
constructing default features for the sample data according to the business definition of default;
analyzing the degree of correlation between each default feature and the default variable using a feature correlation analysis method, and selecting the default features with high correlation;
and taking the selected default features as the feature candidate set.
6. The method of claim 3, wherein performing model training on the feature candidate set using a model training algorithm to obtain the default probability prediction model comprises:
performing model training on the feature candidate set using the Xgboost algorithm to obtain the default probability prediction model.
7. The method of claim 6, wherein after obtaining the default probability prediction model, the method further comprises:
performing default probability prediction on sample data in a pre-prepared sample test set using the default probability prediction model;
and evaluating the default probability prediction model using the ROC curve, AUC, or KS value.
8. An apparatus for predicting a customer default probability, comprising:
a selection unit, configured to select corresponding index data from a pre-stored data set according to the identifier of a customer to be identified;
and a prediction unit, configured to take the index data as the input of a pre-trained default probability prediction model and predict the default probability of the customer to be identified.
9. An apparatus for predicting a customer default probability, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method for predicting a customer default probability according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon an information processing program which, when executed by a processor, performs the steps of the method for predicting a customer default probability according to any one of claims 1 to 7.
CN202010000782.4A 2020-01-02 2020-01-02 Method and device for predicting customer default probability Withdrawn CN111192140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010000782.4A CN111192140A (en) 2020-01-02 2020-01-02 Method and device for predicting customer default probability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010000782.4A CN111192140A (en) 2020-01-02 2020-01-02 Method and device for predicting customer default probability

Publications (1)

Publication Number Publication Date
CN111192140A true CN111192140A (en) 2020-05-22

Family

ID=70708093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010000782.4A Withdrawn CN111192140A (en) 2020-01-02 2020-01-02 Method and device for predicting customer default probability

Country Status (1)

Country Link
CN (1) CN111192140A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037006A (en) * 2020-07-21 2020-12-04 苏宁金融科技(南京)有限公司 Credit risk identification method and device for small and micro enterprises
CN112270553A (en) * 2020-11-09 2021-01-26 浪潮软件股份有限公司 Malicious registered enterprise behavior identification method and system based on isolated forest algorithm
CN112308294A (en) * 2020-10-10 2021-02-02 北京贝壳时代网络科技有限公司 Default probability prediction method and device
CN112308293A (en) * 2020-10-10 2021-02-02 北京贝壳时代网络科技有限公司 Default probability prediction method and device
CN112434886A (en) * 2020-12-17 2021-03-02 北京环信简益科技有限公司 Method for predicting client mortgage loan default probability
CN113282886A (en) * 2021-05-26 2021-08-20 北京大唐神州科技有限公司 Bank loan default judgment method based on logistic regression

Similar Documents

Publication Publication Date Title
CN111192140A (en) Method and device for predicting customer default probability
Engelmann et al. The Basel II risk parameters: estimation, validation, and stress testing
US20100257092A1 (en) System and method for predicting a measure of anomalousness and similarity of records in relation to a set of reference records
US20170270546A1 (en) Service churn model
CN110866832A (en) Risk control method, system, storage medium and computing device
CN112488496A (en) Financial index prediction method and device
CN112950383B (en) Financial risk monitoring method based on artificial intelligence and related equipment
CN112085593B (en) Credit data mining method for small and medium enterprises
CN116883153A (en) Pedestrian credit investigation-based automobile finance pre-credit rating card development method and terminal
CN116629998A (en) Automatic information counting method and device, electronic equipment and readable storage medium
CN115619539A (en) Pre-loan risk evaluation method and device
CN114757594A (en) Network security risk monetization method, device, terminal and medium
CN112561713A (en) Method and device for anti-fraud recognition of claim settlement in insurance industry
CN111401329A (en) Information flow direction identification method, device, equipment and storage medium
CN110570301A (en) Risk identification method, device, equipment and medium
CN117853254B (en) Accounting platform testing method, device, equipment and storage medium
CN117579329B (en) Method for predicting security exposure risk of organization network, electronic equipment and storage medium
CN113723710B (en) Customer loss prediction method, system, storage medium and electronic equipment
CN113282886B (en) Bank loan default judgment method based on logistic regression
CN116523627A (en) Credit line prediction method, device, equipment, medium and product
CN117474409A (en) Evaluation index weight determination method, apparatus, device and computer storage medium
Nasution et al. Credit Risk Detection in Peer-to-Peer Lending Using CatBoost
CN117829975A (en) Method, device, storage medium and processor for constructing risk identification model
CN115186514A (en) Credit risk model prediction method and device
CN117172910A (en) Credit evaluation method and device based on EBM model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200522
