CN117994017A - Method for constructing retail credit risk prediction model and online credit service Scoredelta model - Google Patents


Info

Publication number
CN117994017A
Authority
CN
China
Prior art keywords
credit
past
months
value
month
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211325815.8A
Other languages
Chinese (zh)
Inventor
刘志玲
王乐
李松润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd
Priority to CN202211325815.8A
Publication of CN117994017A

Landscapes

  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The application relates to an online credit service Scoredelta model and a method for predicting the retail credit risk of an online credit service using the model. The method comprises: a data acquisition step of acquiring retail credit prediction data of a sample to be predicted; a classification step of classifying the sample to be predicted based on a decision tree method to determine a submodel for calculating credit violation probability; and a credit violation probability calculating step, in which the retail credit prediction data are substituted into the credit violation probability submodel to calculate the credit violation probability of the sample to be predicted.

Description

Method for constructing retail credit risk prediction model and online credit service Scoredelta model
Technical Field
The application relates to a credit risk management and control system and a credit risk management and control method, which can assist a financial institution to make more accurate risk decisions and accelerate the digital transformation process of the financial institution. In particular, the present application relates to a method of constructing a retail credit risk prediction model for an online credit service and a retail credit risk prediction model for an online credit service.
Background
In the current environment of rapid consumer credit growth, the manual approval mechanisms of some financial institutions can no longer cope with rising credit demand, so there is a pressing need to strengthen the intelligent risk control capability of financial institutions. Financial institutions wish to build a full-process credit risk mitigation mechanism covering customer prescreening, pre-loan review, in-loan approval, post-loan management, and early-stage collection.
If the scoring system can be developed based on the principles of early recognition, early warning, early discovery and early treatment, credit business can be monitored and managed quickly and conveniently, and the business volume, competitive advantage and asset quality of a financial institution can be improved on the basis of controllable risk.
However, building a scoring system depends heavily on data and technology: the diversity and coverage of the data dimensions, the modeling skills, and the methodology directly influence the final stability and rank-ordering ability of the scoring system. Some financial institutions have accumulated little experience in intelligent risk control for retail business and have weak risk control capability. In practice, owing to factors such as limited data mining and analysis capability and weak risk modeling technology, financial institutions cannot fully exploit the value of their internal data and cannot effectively improve the accuracy and stability of their models. This is also one of the major obstacles that small and medium-sized financial institutions face in their digital transformation.
Disclosure of Invention
The online credit service (also called internet credit service) is a convenient and fast online loan service. Customers can complete the full loan process themselves through electronic channels, including real-time application, approval, signing, disbursement, and repayment, all handled online without any third-party institution or other person. Banks have clear advantages in moving online: first, their accumulated loan business experience, customer resources, and systems; second, the coverage and public awareness of banks' online business have already reached a certain scale. However, converting these advantages into an online loan service takes more than moving the registration window onto the network. Pre-review and pre-submission must also be performed online, and because the customer's real information cannot be verified face to face, business development needs the support of a tool for predicting customer credit risk. Comprehensive intelligent risk management tools are needed not only before the loan but also after it.
Against this background, the credit risk prediction method and system of the application draw on the massive data of a large bank, specifically screen out the customer data of the online credit service, take the online credit service customers within the retail credit customer base as samples, and construct a model for predicting the credit risk of the online credit service, so that the model can be applied to the online credit service in a targeted manner.
Other popular scoring model construction processes on the current market are often limited by small modeling samples, a single data source, and highly homogeneous data dimensions. Meanwhile, most risk assessment models on the market use data with weak financial attributes: they are built on non-credit transaction data, such as smart terminal device data, social platform data, and online shopping data, and on prediction targets other than overdue status, so their predictions often deviate considerably from actual credit overdue outcomes.
The method and system of the application are developed on highly stable, high-coverage data samples and systematically innovate on the relatively mature credit risk control systems currently available. Because internet credit customers in the retail business are used as modeling samples, the method and system can effectively predict the potential credit risk of a financial institution's retail business, and the prediction effect is more pronounced for the online credit business within it.
In particular, the present application develops six credit violation probabilities as prediction targets (i.e., target variables): (1) for the customer group that has applied for credit services and has stable income, the probability that a credit overdue will occur in the future; (2) for the new customer group that has applied for credit services and has no stable income, the probability that a credit overdue will occur; (3) for existing customers who have applied for credit services, have no stable income, and belong to the customer group of economically developed regions, the probability that a credit overdue will occur; (4) for existing customers who have applied for credit services, have no stable income, and belong to the customer group of economically moderately developed regions, the probability that a credit overdue will occur; (5) for existing customers who have applied for credit services, have no stable income, and belong to the customer group of economically less developed regions, the probability that a credit overdue will occur; (6) for the customer group that has not applied for credit services, the probability that a credit overdue will occur.
Compared with existing models on the market, the method and system of the application are considerably improved in both discrimination and stability when predicting whether customers in the internet credit customer group will become 30 or more days overdue.
The application relates to the following technical scheme:
1. A method of calculating retail credit risk, comprising:
A data acquisition step of acquiring retail credit prediction data of a sample to be predicted;
A classification step of classifying the sample to be predicted based on a decision tree method to determine a submodel for calculating credit violation probability;
And a credit violation probability calculating step, wherein the retail credit prediction data are substituted into the credit violation probability submodel to calculate the credit violation probability of the sample to be predicted, and preferably the credit violation probability is the credit violation probability for an online credit service.
2. The method of item 1, further comprising:
after calculating the credit violation probability, calculating a credit score of the sample to be predicted, for calibrating the calculated credit violation probability to a normalized score of 0-1000 points.
3. The method according to item 1 or 2, wherein,
The retail credit prediction data comprises original retail credit prediction data of the sample to be predicted and derived retail credit prediction data obtained by processing the original retail credit prediction data;
preferably, the original retail credit prediction data includes:
Credit card class base data, which is all available data from the sample user's credit card opening process and usage process,
Personal loan class base data, which is all available data on the sample user's loan applications and usage behavior,
Customer base information class base data, which is data on the attributes of the sample user that is not directly related to behavior at the financial institution,
Personal financial attribute class base data, which is the sample user's data on all other financial attributes and financial transactions at the financial institution that are unrelated to credit cards and loans.
4. The method according to any one of items 1 to 3, wherein,
The derived retail credit prediction data obtained by processing the original retail credit prediction data refers to data obtained by processing the collected original retail credit prediction data along a time dimension, a space dimension, a frequency dimension, and a statistical information dimension;
preferably, the derived retail credit prediction data includes, but is not limited to:
derived retail credit prediction data processed based on sample relationship lengths,
Derived retail credit forecast data processed based on time interval class variables,
Derived retail credit forecast data processed based on the degree of sample behavioral frequency,
Derived retail credit prediction data processed based on the sample current point in time conditions,
Derived retail credit forecast data processed based on sample duration behavior,
Derived retail credit prediction data processed from the sample data based on the statistical information dimension.
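As an illustration of item 4, the following is a minimal sketch of how time-window, frequency, and statistical features might be derived from a monthly series. The function names, the oldest-to-newest ordering of the series, and the exact definitions (for example, how a "continuous increase" or "months since the minimum" is counted) are assumptions made for the example and are not specified in the application.

```python
from typing import List, Sequence


def window(series: Sequence[float], months: int) -> List[float]:
    """Return the most recent `months` values (series is ordered oldest to newest)."""
    vals = list(series[-months:])
    if not vals:
        raise ValueError("series must contain at least one value")
    return vals


def window_min(series: Sequence[float], months: int) -> float:
    """Statistical dimension: minimum over the window (e.g. minimum AUM, past 3 months)."""
    return min(window(series, months))


def window_mean(series: Sequence[float], months: int) -> float:
    """Statistical dimension: average over the window."""
    vals = window(series, months)
    return sum(vals) / len(vals)


def months_above(series: Sequence[float], months: int, threshold: float = 0.0) -> int:
    """Frequency dimension: number of months whose value exceeds `threshold`."""
    return sum(1 for v in window(series, months) if v > threshold)


def longest_increasing_run(series: Sequence[float], months: int) -> int:
    """Duration dimension (assumed definition): longest run of consecutive
    month-over-month increases inside the window."""
    vals = window(series, months)
    best = run = 0
    for prev, cur in zip(vals, vals[1:]):
        run = run + 1 if cur > prev else 0
        best = max(best, run)
    return best


def months_since_minimum(series: Sequence[float], months: int) -> int:
    """Time-interval dimension (assumed definition): months elapsed since the
    most recent occurrence of the window minimum (0 means the latest month)."""
    vals = window(series, months)
    lowest = min(vals)
    for offset, v in enumerate(reversed(vals)):
        if v == lowest:
            return offset
    return 0  # unreachable for a non-empty window
```

For example, `months_above(interest_by_month, 12)` would give a count comparable to "the number of months in the past 12 months in which credit card interest was greater than 0", with `interest_by_month` being a hypothetical input list.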
5. The method according to any one of items 1 to 4, wherein,
The retail credit prediction data is selected from one, two, three, four, five, six, seven, or eight of the following:
the maximum overdue count of credit accounts in the past 12 months; the current remaining credit card limit; the total AUM of the current month; the maximum wage in the past 12 months; the average credit limit utilization of personal loans in the past 3 months; the number of months of continuous increase in the personal loan repayment amount in the past 12 months; the number of months in the past 12 months in which credit card interest was greater than 0; the number of months since the minimum point-in-time deposit account balance in the past 12 months; the average revolving credit limit utilization in the past 3 months; the minimum AUM in the past 3 months; the average balance of investment and wealth management accounts in the past 3 months; the minimum point-in-time deposit account balance in the past 3 months; the current point-in-time deposit account balance; the maximum overdue count of the credit card in the past 12 months; the minimum credit limit utilization in the past 12 months; the difference between the monthly minimum credit payment due and the monthly AUM over the past 3 months; the average personal loan repayment amount in the past 12 months; the ratio of the number of months of continuous increase in the personal loan repayment amount over 3 months; the minimum credit limit utilization in the past 3 months; the minimum point-in-time deposit account balance in the past 6 months; the average point-in-time deposit account balance in the past 6 months; the continuous increase in the credit card statement balance in the past 6 months; the minimum credit limit utilization in the past 6 months; the ratio of the monthly repayment amount to the monthly AUM in the past; the number of months in the past 12 months with a repayment rate >80%; the maximum number of consecutive months in the past 12 months with a credit limit utilization >90%; the total personal loan repayment amount; the number of credit accounts whose current utilization exceeds 90%; the number of months an account was held in the past 12 months; the number of wage payments in the past 6 months; and the maximum wage in the past 12 months.
6. The method according to any one of items 1 to 5, wherein,
The step of classifying the sample to be predicted comprises the following sub-steps:
Determining whether the sample to be predicted is a customer who has applied for credit services at a financial institution;
Determining whether the sample to be predicted is a customer with stable income;
Determining whether the sample to be predicted was newly registered within the past m months;
Determining the geographical region in which the sample to be predicted conducts business;
The sample to be predicted is classified based on these sub-steps to determine the submodel for calculating credit violation probability, and the order of the sub-steps can be set arbitrarily provided that the business logic remains reasonable;
The sample to be predicted is preferably classified in the following order:
first, determining whether the sample to be predicted is a customer who has applied for credit services at a financial institution;
then determining whether the sample to be predicted is a customer with stable income;
then determining whether the sample to be predicted was newly registered within the past m months;
and then determining the geographical region in which the sample to be predicted conducts business.
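The preferred order in item 6 can be sketched as a small routing function that maps a sample to one of the six customer groups named in the disclosure. The `Sample` fields, the group labels, the region codes, and the reading of "newly registered within m months" as `months_since_registration <= m` are all hypothetical choices for illustration; the application does not give a value for m, so it is left as a required argument.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    applied_for_credit: bool          # has applied for credit services at the institution
    stable_income: bool               # customer with stable income
    months_since_registration: int    # used for the "new within m months" test
    region: str                       # "developed", "mid", or "less" (assumed codes)


def select_submodel(sample: Sample, m: int) -> str:
    """Walk the decision tree in the preferred order and return a submodel label."""
    # 1. Has the customer applied for credit services?
    if not sample.applied_for_credit:
        return "never_applied"
    # 2. Does the customer have stable income?
    if sample.stable_income:
        return "applied_stable_income"
    # 3. Was the customer newly registered within m months?
    if sample.months_since_registration <= m:
        return "applied_unstable_new"
    # 4. Existing customer without stable income: split by region.
    if sample.region not in ("developed", "mid", "less"):
        raise ValueError(f"unknown region: {sample.region!r}")
    return f"applied_unstable_existing_{sample.region}"
```

Each returned label would then select the coefficients and feature conversions of the corresponding credit violation probability submodel.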
7. The method according to any one of items 1 to 6, wherein,
The retail credit prediction data are feature-converted and then substituted into the credit violation probability submodel to calculate the credit violation probability of the sample to be predicted, wherein the feature conversion step comprises:
selecting a WOE (weight of evidence) mode or a continuous mode for feature conversion based on the feature type of the retail credit prediction data to be substituted into the credit violation probability submodel.
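The application does not spell out how the WOE values are computed. The sketch below uses one widely used definition, WOE = ln((good share in bin) / (bad share in bin)); the binning, the sign convention, and the function name `woe_table` are assumptions for illustration, not the applicant's stated procedure.

```python
import math
from typing import Dict, Tuple


def woe_table(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Compute a WOE value per bin.

    `counts` maps a bin label to (good_count, bad_count), where "bad" means
    the sample became overdue. Assumed convention: ln(good share / bad share).
    """
    total_good = sum(good for good, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    table = {}
    for label, (good, bad) in counts.items():
        if good == 0 or bad == 0:
            # A zero count makes the logarithm undefined; merging bins or
            # smoothing would be needed, which is outside this sketch.
            raise ValueError(f"bin {label!r} has an empty class")
        table[label] = math.log((good / total_good) / (bad / total_bad))
    return table
```

At prediction time, a sample's raw feature value would be mapped to its bin and replaced by that bin's WOE value before entering the submodel.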
8. The method according to item 7, wherein,
The feature conversion in the continuous mode comprises the following steps: the continuous feature conversion is carried out in a mode of directly selecting an original value, calculating the square of original data, calculating the square root of the original data, calculating the cube root of the original data or calculating the natural logarithm of the original data.
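The five continuous conversions listed in item 8 can be sketched as a single dispatch function. The mode names and the error handling are assumptions for the example; in particular, the application does not say how zero or negative inputs to the logarithm or square root are treated, so this sketch simply lets the standard library raise an error.

```python
import math


def continuous_transform(value: float, mode: str) -> float:
    """Apply one of the continuous-mode conversions to a raw feature value."""
    if mode == "original":
        return float(value)
    if mode == "square":
        return float(value) ** 2
    if mode == "sqrt":
        return math.sqrt(value)          # raises ValueError for negative input
    if mode == "cbrt":
        # Sign-preserving cube root, so negative differences are handled.
        return math.copysign(abs(value) ** (1.0 / 3.0), value)
    if mode == "ln":
        return math.log(value)           # raises ValueError for input <= 0
    raise ValueError(f"unknown mode: {mode!r}")
```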
9. The method according to any one of items 1 to 8, wherein,
The credit violation probability submodel is a model constructed based on existing user populations using logistic regression based on sample retail credit prediction data and credit violation probabilities.
10. The method according to any one of items 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four, five, six, or seven of the following: the maximum overdue count of credit accounts in the past 12 months, the current remaining credit card limit, the total AUM of the current month, the maximum wage in the past 12 months, the average credit limit utilization of personal loans in the past 3 months, the number of months of continuous increase in the personal loan repayment amount in the past 12 months, and the number of months in the past 12 months in which credit card interest was greater than 0.
11. The method of item 10, wherein the credit violation probability calculating step comprises:
Performing feature conversion on the maximum overdue count of credit accounts in the past 12 months, the current remaining credit card limit, the total AUM of the current month, the maximum wage in the past 12 months, the average credit limit utilization of personal loans in the past 3 months, the number of months of continuous increase in the personal loan repayment amount in the past 12 months, and the number of months in the past 12 months in which credit card interest was greater than 0,
Preferably, the maximum overdue count of credit accounts in the past 12 months is converted in the WOE mode; the current remaining credit card limit is converted in the continuous mode; the total AUM of the current month is converted in the continuous mode; the maximum wage in the past 12 months is converted in the continuous mode; the average credit limit utilization of personal loans in the past 3 months is converted in the continuous mode; the number of months of continuous increase in the personal loan repayment amount in the past 12 months is converted in the WOE mode; the number of months in the past 12 months in which credit card interest was greater than 0 is converted in the WOE mode;
further preferably, the continuous mode applied to the current remaining credit card limit takes its natural logarithm; the continuous mode applied to the total AUM of the current month takes its square root; the continuous mode applied to the maximum wage in the past 12 months takes its cube root; the continuous mode applied to the average credit limit utilization of personal loans in the past 3 months takes its square.
12. The method according to item 11, wherein,
Substituting the converted values of the seven features, namely the maximum overdue count of credit accounts in the past 12 months, the current remaining credit card limit, the total AUM of the current month, the maximum wage in the past 12 months, the average credit limit utilization of personal loans in the past 3 months, the number of months of continuous increase in the personal loan repayment amount in the past 12 months, and the number of months in the past 12 months in which credit card interest was greater than 0, into a submodel constructed by logistic regression based on sample retail credit prediction data and credit violation probability, to calculate the credit violation probability of the sample to be predicted.
13. The method of item 12, wherein,
the submodel is shown in formula 1 below:
where k is the number of features entering the model, preferably k is 7;
α is the intercept term, the preferred value range is (4.5954, 4.9098), most preferably 4.7526;
β1 is the coefficient corresponding to the maximum overdue count of credit accounts in the past 12 months, the preferred value range is (-0.7309, -0.7094), most preferably -0.7201;
β2 is the coefficient corresponding to the current remaining credit card limit, the preferred value range is (-0.3564, -0.3502), most preferably -0.3533;
β3 is the coefficient corresponding to the total AUM (financial assets) of the current month, the preferred value range is (-0.241, -0.223), most preferably -0.232;
β4 is the coefficient corresponding to the maximum wage in the past 12 months, the preferred value range is (-0.3124, -0.2744), most preferably -0.2934;
β5 is the coefficient corresponding to the average credit limit utilization of personal loans in the past 3 months, the preferred value range is (0.2713, 0.3168), most preferably 0.294;
β6 is the coefficient corresponding to the number of months of continuous increase in the personal loan repayment amount in the past 12 months, the value range is (-1.2154, -1.0571), most preferably -1.1363;
β7 is the coefficient corresponding to the number of months in the past 12 months in which credit card interest was greater than 0, the value range is (-0.2852, -0.2503), most preferably -0.2678;
x1 is the WOE-converted value of the maximum overdue count of credit accounts in the past 12 months generated in the feature conversion step;
x2 is the natural-logarithm-converted value of the current remaining credit card limit generated in the feature conversion step;
x3 is the square-root-converted value of the total AUM (financial assets) of the current month generated in the feature conversion step;
x4 is the cube-root-converted value of the maximum wage in the past 12 months generated in the feature conversion step;
x5 is the square-converted value of the average credit limit utilization of personal loans in the past 3 months generated in the feature conversion step;
x6 is the WOE-converted value of the number of months of continuous increase in the personal loan repayment amount in the past 12 months generated in the feature conversion step;
x7 is the WOE-converted value of the number of months in the past 12 months in which credit card interest was greater than 0 generated in the feature conversion step.
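Formula 1 itself is not reproduced in the text. The sketch below assumes the standard logistic regression form P = 1 / (1 + exp(-(α + Σ βi·xi))), which is consistent with the intercept, coefficients, and feature count defined in item 13, and plugs in the "most preferred" values. The function names and the idea of passing the three WOE values in precomputed are assumptions for illustration.

```python
import math
from typing import List, Sequence

ALPHA = 4.7526  # most preferred intercept from item 13
BETAS = [-0.7201, -0.3533, -0.232, -0.2934, 0.294, -1.1363, -0.2678]  # β1..β7


def build_features(woe_overdue: float, remaining_limit: float, aum_total: float,
                   max_wage: float, loan_utilization: float,
                   woe_increase_months: float, woe_interest_months: float) -> List[float]:
    """Apply the conversions of items 11 and 13; WOE values are assumed precomputed."""
    return [
        woe_overdue,                 # x1: WOE
        math.log(remaining_limit),   # x2: natural logarithm (input must be > 0)
        math.sqrt(aum_total),        # x3: square root
        max_wage ** (1.0 / 3.0),     # x4: cube root (non-negative wage assumed)
        loan_utilization ** 2,       # x5: square
        woe_increase_months,         # x6: WOE
        woe_interest_months,         # x7: WOE
    ]


def default_probability(x: Sequence[float], alpha: float = ALPHA,
                        betas: Sequence[float] = BETAS) -> float:
    """Assumed logistic form: P = 1 / (1 + exp(-(alpha + sum(beta_i * x_i))))."""
    if len(x) != len(betas):
        raise ValueError("feature vector and coefficient vector differ in length")
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    # Numerically stable evaluation of the sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Under this assumed form, the positive β5 means higher loan utilization raises the predicted probability, while larger balances, limits, and wages lower it.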
14. The method according to item 13, wherein,
After the credit violation probability is calculated, the credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
where P is the borrower's default probability generated in the credit violation probability calculating step; A is 443.9036; B is -72.1348; the round function rounds the calculated score to an integer; and finally, scores above 1000 are set to 1000 and scores below 0 are set to 0.
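The score formula is not reproduced in the text. A minimal sketch, assuming the common scorecard form Score = round(A + B·ln(P / (1 − P))) followed by clipping to the 0-1000 range, is shown below. The functional form is an assumption; only A, B, the rounding, and the clipping come from the text.

```python
import math

A = 443.9036   # from item 14
B = -72.1348   # from item 14


def credit_score(p: float) -> int:
    """Map a default probability to a 0-1000 score (assumed log-odds form)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    raw = A + B * math.log(p / (1.0 - p))
    # Python's round() resolves exact .5 ties to the even integer; the text
    # does not say which tie rule the original round function uses.
    return max(0, min(1000, round(raw)))
```

Under this assumed form, B is approximately -50/ln 2, so halving the default odds raises the score by about 50 points: `credit_score(0.5)` gives 444 and `credit_score(1/3)` gives 494.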
15. The method according to any one of items 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four, or five of the following: the number of months since the minimum point-in-time deposit account balance in the past 12 months, the average revolving credit limit utilization in the past 3 months, the minimum AUM in the past 3 months, the average balance of investment and wealth management accounts in the past 3 months, and the minimum point-in-time deposit account balance in the past 3 months.
16. The method of item 15, wherein the credit violation probability calculating step comprises:
Performing feature conversion on the number of months since the minimum point-in-time deposit account balance in the past 12 months, the average revolving credit limit utilization in the past 3 months, the minimum AUM in the past 3 months, the average balance of investment and wealth management accounts in the past 3 months, and the minimum point-in-time deposit account balance in the past 3 months,
Preferably, the number of months since the minimum point-in-time deposit account balance in the past 12 months is converted in the WOE mode; the average revolving credit limit utilization in the past 3 months is converted in the WOE mode; the minimum AUM in the past 3 months is converted in the continuous mode; the average balance of investment and wealth management accounts in the past 3 months is converted in the continuous mode; the minimum point-in-time deposit account balance in the past 3 months is converted in the WOE mode;
further preferably, the continuous mode applied to the minimum AUM in the past 3 months takes its square root, and the continuous mode applied to the average balance of investment and wealth management accounts in the past 3 months takes its square root.
17. The method according to item 15, wherein,
Substituting the converted values of the five features, namely the number of months since the minimum point-in-time deposit account balance in the past 12 months, the average revolving credit limit utilization in the past 3 months, the minimum AUM in the past 3 months, the average balance of investment and wealth management accounts in the past 3 months, and the minimum point-in-time deposit account balance in the past 3 months, into a submodel constructed by logistic regression based on sample retail credit prediction data, to calculate the default probability of the sample to be predicted.
18. The method according to item 17, wherein,
The submodel is shown in the following formula 2:
where k is the number of features entering the model, preferably k is 5;
α is the intercept term, the preferred value range is (-0.6781, -0.4868), most preferably -0.582;
β1 is the coefficient corresponding to the number of months since the minimum point-in-time deposit account balance in the past 12 months, the preferred value range is (-0.7058, -0.529), most preferably -0.617;
β2 is the coefficient corresponding to the average revolving credit limit utilization in the past 3 months, the preferred value range is (0.0666, 0.114), most preferably 0.090;
β3 is the coefficient corresponding to the minimum AUM in the past 3 months, the preferred value range is (-0.6899, -0.4916), most preferably -0.591;
β4 is the coefficient corresponding to the average balance of investment and wealth management accounts in the past 3 months, the preferred value range is (-0.7065, -0.3619), most preferably -0.534;
β5 is the coefficient corresponding to the minimum point-in-time deposit account balance in the past 3 months, the preferred value range is (-0.118, -0.0557), most preferably -0.087;
x1 is the WOE-converted value of the number of months since the minimum point-in-time deposit account balance in the past 12 months generated in the feature conversion step;
x2 is the square-converted value of the average revolving credit limit utilization in the past 3 months generated in the feature conversion step;
x3 is the natural-logarithm-converted value of the minimum AUM in the past 3 months generated in the feature conversion step;
x4 is the square-root-converted value of the average balance of investment and wealth management accounts in the past 3 months generated in the feature conversion step;
x5 is the cube-root-converted value of the minimum point-in-time deposit account balance in the past 3 months generated in the feature conversion step.
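Formula 2 is not reproduced in the text. A hedged reconstruction, assuming the submodel takes the standard logistic regression form implied by the intercept α, the coefficients βi, the converted features xi, and the feature count k defined in item 18, would read as follows; the exact arrangement of the original formula is an assumption.

```latex
P = \frac{1}{1 + \exp\!\left(-\left(\alpha + \sum_{i=1}^{k} \beta_i x_i\right)\right)},
\qquad\text{equivalently}\qquad
\ln\frac{P}{1 - P} = \alpha + \sum_{i=1}^{k} \beta_i x_i, \qquad k = 5 .
```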
19. The method of item 18, wherein,
After the credit violation probability is calculated, the credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
where P is the borrower's default probability generated in the credit violation probability calculating step; A is 443.9036; B is -72.1348; the round function rounds the calculated score to an integer; and finally, scores above 1000 are set to 1000 and scores below 0 are set to 0.
20. The method according to any one of items 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four, five, or six of the following: the current point-in-time deposit account balance, the maximum overdue count of the credit card in the past 12 months, the minimum credit limit utilization in the past 12 months, the difference between the monthly minimum credit payment due and the monthly AUM over the past 3 months, the average personal loan repayment amount in the past 12 months, and the current remaining credit card limit.
21. The method of item 20, wherein the credit violation probability calculating step comprises:
Performing feature conversion on the current point-in-time deposit account balance, the maximum overdue count of the credit card in the past 12 months, the minimum credit limit utilization in the past 12 months, the difference between the monthly minimum credit payment due and the monthly AUM over the past 3 months, the average personal loan repayment amount in the past 12 months, and the current remaining credit card limit,
Preferably, the current point-in-time deposit account balance is converted in the continuous mode; the maximum overdue count of the credit card in the past 12 months is converted in the WOE mode; the minimum credit limit utilization in the past 12 months is converted in the continuous mode; the difference between the monthly minimum credit payment due and the monthly AUM over the past 3 months is converted in the continuous mode; the average personal loan repayment amount in the past 12 months is converted in the continuous mode; the current remaining credit card limit is converted in the continuous mode;
Further preferably, the continuous mode applied to the current point-in-time deposit account balance takes its natural logarithm; the continuous mode applied to the minimum credit limit utilization in the past 12 months takes its square root; the continuous mode applied to the difference between the monthly minimum credit payment due and the monthly AUM over the past 3 months takes its cube root; the continuous mode applied to the average personal loan repayment amount in the past 12 months takes the original value; and the continuous mode applied to the current remaining credit card limit takes its cube root.
22. The method according to item 20, wherein,
Substituting the converted values of the six features, namely the current point-in-time deposit account balance, the maximum overdue count of the credit card in the past 12 months, the minimum credit limit utilization in the past 12 months, the difference between the monthly minimum credit payment due and the monthly AUM over the past 3 months, the average personal loan repayment amount in the past 12 months, and the current remaining credit card limit, into a submodel constructed by logistic regression based on sample retail credit prediction data and credit violation probability, to calculate the credit violation probability of the sample to be predicted.
23. The method of item 22, wherein,
The submodel is shown in equation 3 below:
where k is the number of features entered into the model, preferably k is 6;
α is the intercept term, with a preferred range of (-0.5296, -0.394), most preferably -0.462;
β1 is the coefficient corresponding to the deposit account balance at the current time point, with a preferred range of (-0.1997, -0.1867), most preferably -0.193;
β2 is the coefficient corresponding to the maximum number of overdue periods of the credit card in the past 12 months, with a preferred range of (-0.4853, -0.4469), most preferably -0.466;
β3 is the coefficient corresponding to the minimum credit limit utilization rate in the past 12 months, with a preferred range of (0.1058, 0.1187), most preferably 0.112;
β4 is the coefficient corresponding to the difference between the past-3-month monthly average of the minimum credit amount due in the current month and the monthly average AUM, with a preferred range of (0.0053, 0.0073), most preferably 0.006;
β5 is the coefficient corresponding to the average personal loan repayment amount in the past 12 months, with a preferred range of (-0.0096, -0.0088), most preferably -0.009;
β6 is the coefficient corresponding to the current remaining credit card limit, with a preferred range of (-0.1637, -0.1519), most preferably -0.158;
x1 is the natural-logarithm conversion value of the deposit account balance at the current time point, generated in the feature conversion step;
x2 is the WOE conversion value of the maximum number of overdue periods of the credit card in the past 12 months, generated in the feature conversion step;
x3 is the square-root conversion value of the minimum credit limit utilization rate in the past 12 months, generated in the feature conversion step;
x4 is the cube-root conversion value of the difference between the past-3-month monthly average of the minimum credit amount due in the current month and the monthly average AUM, generated in the feature conversion step;
x5 is the original value of the average personal loan repayment amount in the past 12 months, generated in the feature conversion step;
x6 is the cube-root conversion value of the current remaining credit card limit, generated in the feature conversion step.
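The calculation described by item 23 can be sketched as follows. The equation itself is not reproduced in this text, so the sketch assumes the standard logistic (Sigmoid) form P = 1 / (1 + exp(-(α + Σ βi·xi))), which is consistent with the logistic regression named in item 22 but is an assumption, not a quotation of equation 3. The function name `default_probability` and the example inputs are hypothetical; the coefficients are the most preferred values listed above.

```python
import math

# Most preferred values for the submodel of item 23, as listed in the text.
ALPHA = -0.462
BETAS = [-0.193, -0.466, 0.112, 0.006, -0.009, -0.158]


def default_probability(x, alpha=ALPHA, betas=BETAS):
    """Return P under the ASSUMED logistic form
    P = 1 / (1 + exp(-(alpha + sum(beta_i * x_i))))."""
    if len(x) != len(betas):
        raise ValueError("expected one converted value per coefficient")
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical converted values x1..x6 (illustrative only, not from the source):
# ln(balance), WOE value, sqrt(utilization), cbrt(difference),
# original repayment average, cbrt(remaining limit).
example_x = [math.log(5000.0), 0.3, math.sqrt(0.25), 2.0, 1200.0, 20.0]
p = default_probability(example_x)
```

With every converted value set to zero, the result reduces to the Sigmoid of the intercept alone, which gives a quick sanity check on the sign conventions.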
24. The method of item 23, wherein,
After the credit violation probability is calculated, a credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
where P is the default probability of the borrower generated in the credit violation probability calculation module; A is 443.9036; B is -72.1348; the round function rounds the calculated score to the nearest integer; and finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
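A minimal sketch of the scoring step follows. The score formula is not reproduced in this text, so the sketch assumes the conventional scorecard form Score = round(A + B·ln(P / (1 - P))). This is an assumption; it is suggested by the fact that |B| = 72.1348 equals 50 / ln 2, i.e. a 50-point change per doubling of the odds, and that the assumed form yields 660 at default odds of 1:20. The function name `credit_score` is hypothetical, and "round" is read here as rounding half up.

```python
import math

A = 443.9036   # constant given in the text
B = -72.1348   # constant given in the text


def credit_score(p, a=A, b=B):
    """Score under the ASSUMED form round(a + b * ln(p / (1 - p))),
    then limited to the range [0, 1000] as the text requires."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    raw = a + b * math.log(p / (1.0 - p))
    rounded = math.floor(raw + 0.5)        # round half up (assumption)
    return max(0, min(1000, rounded))      # scores >1000 -> 1000, <0 -> 0
```

Under this assumed form a lower default probability gives a higher score, and the final clamp only affects very small or very large probabilities.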
25. The method according to any one of items 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four, five or six of: the average personal loan repayment amount in the past 12 months, the proportion of months in which the personal loan amount due increased continuously over the past 3 months, the minimum credit limit utilization rate of revolving credit in the past 3 months, the minimum point-in-time balance of the deposit account in the past 6 months, the current remaining credit card limit, and the maximum number of overdue periods of the credit card in the past 12 months.
26. The method of item 25, wherein the credit violation probability calculating step comprises:
performing feature conversion on the average personal loan repayment amount in the past 12 months, the proportion of months in which the personal loan amount due increased continuously over the past 3 months, the minimum credit limit utilization rate of revolving credit in the past 3 months, the minimum point-in-time balance of the deposit account in the past 6 months, the current remaining credit card limit, and the maximum number of overdue periods of the credit card in the past 12 months,
Preferably, the average personal loan repayment amount in the past 12 months is converted in the continuous mode; the proportion of months in which the personal loan amount due increased continuously over the past 3 months is converted in the WOE mode; the minimum credit limit utilization rate of revolving credit in the past 3 months is converted in the continuous mode; the minimum point-in-time balance of the deposit account in the past 6 months is converted in the continuous mode; the current remaining credit card limit is converted in the continuous mode; and the maximum number of overdue periods of the credit card in the past 12 months is converted in the WOE mode;
Further preferably, in the continuous mode, the natural logarithm is taken of the average personal loan repayment amount in the past 12 months; the square is taken of the minimum credit limit utilization rate of revolving credit in the past 3 months; the original value is used for the minimum point-in-time balance of the deposit account in the past 6 months; and the cube root is taken of the current remaining credit card limit.
27. The method according to item 25, wherein,
The converted values of the six features, namely the average personal loan repayment amount in the past 12 months, the proportion of months in which the personal loan amount due increased continuously over the past 3 months, the minimum credit limit utilization rate of revolving credit in the past 3 months, the minimum point-in-time balance of the deposit account in the past 6 months, the current remaining credit card limit, and the maximum number of overdue periods of the credit card in the past 12 months, are substituted into a submodel, constructed by logistic regression on sample retail credit prediction data and credit violation probability, to calculate the violation probability of the sample to be predicted.
28. The method of item 27, wherein,
The submodel is shown in equation 4 below:
where k is the number of features entered into the model, preferably k is 6;
α is the intercept term, with a preferred range of (3.5781, 3.7431), most preferably 3.661;
β1 is the coefficient corresponding to the average personal loan repayment amount in the past 12 months, with a preferred range of (-0.5156, -0.496), most preferably -0.506;
β2 is the coefficient corresponding to the proportion of months in which the personal loan amount due increased continuously over the past 3 months, with a preferred range of (0.6685, 0.6991), most preferably 0.684;
β3 is the coefficient corresponding to the minimum credit limit utilization rate of revolving credit in the past 3 months, with a preferred range of (0.0088, 0.0096), most preferably 0.009;
β4 is the coefficient corresponding to the minimum point-in-time balance of the deposit account in the past 6 months, with a preferred range of (-0.2122, -0.2012), most preferably -0.207;
β5 is the coefficient corresponding to the current remaining credit card limit, with a preferred range of (-0.0588, -0.0419), most preferably -0.050;
β6 is the coefficient corresponding to the maximum number of overdue periods of the credit card in the past 12 months, with a preferred range of (-0.4024, -0.3742), most preferably -0.388;
x1 is the WOE conversion value of the average personal loan repayment amount in the past 12 months, generated in the feature conversion step;
x2 is the WOE conversion value of the proportion of months in which the personal loan amount due increased continuously over the past 3 months, generated in the feature conversion step;
x3 is the WOE conversion value of the minimum credit limit utilization rate of revolving credit in the past 3 months, generated in the feature conversion step;
x4 is the WOE conversion value of the minimum point-in-time balance of the deposit account in the past 6 months, generated in the feature conversion step;
x5 is the WOE conversion value of the current remaining credit card limit, generated in the feature conversion step;
x6 is the WOE conversion value of the maximum number of overdue periods of the credit card in the past 12 months, generated in the feature conversion step.
29. The method of item 28, wherein,
After the credit violation probability is calculated, a credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
where P is the default probability of the borrower generated in the credit violation probability calculation module; A is 443.9036; B is -72.1348; the round function rounds the calculated score to the nearest integer; and finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
30. The method according to any one of items 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four, five, six, seven or eight of: the average point-in-time balance of the deposit account in the past 6 months, the number of months in which the credit card bill balance increased continuously over the past 6 months, the minimum credit limit utilization rate of revolving credit in the past 6 months, the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM, the number of months in the past 12 months in which the credit repayment rate exceeded 80%, the maximum number of consecutive months in the past 12 months in which the credit card limit utilization rate exceeded 90%, the total personal loan repayment amount in the current month, and the number of credit accounts whose current credit limit utilization rate exceeds 90%.
31. The method of item 30, wherein the credit violation probability calculating step comprises:
performing feature conversion on the average point-in-time balance of the deposit account in the past 6 months, the number of months in which the credit card bill balance increased continuously over the past 6 months, the minimum credit limit utilization rate of revolving credit in the past 6 months, the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM, the number of months in the past 12 months in which the credit repayment rate exceeded 80%, the maximum number of consecutive months in the past 12 months in which the credit card limit utilization rate exceeded 90%, the total personal loan repayment amount in the current month, and the number of credit accounts whose current credit limit utilization rate exceeds 90%,
Preferably, the average point-in-time balance of the deposit account in the past 6 months is converted by the continuous conversion method; the number of months in which the credit card bill balance increased continuously over the past 6 months is converted in the WOE mode; the minimum credit limit utilization rate of revolving credit in the past 6 months is converted by the continuous conversion method; the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM is converted by the continuous conversion method; the number of months in the past 12 months in which the credit repayment rate exceeded 80% is converted by the continuous conversion method; the maximum number of consecutive months in the past 12 months in which the credit card limit utilization rate exceeded 90% is converted in the WOE mode; the total personal loan repayment amount in the current month is converted by the continuous conversion method; and the number of credit accounts whose current credit limit utilization rate exceeds 90% is converted by the continuous conversion method;
Further preferably, in the continuous conversion method, the natural logarithm is taken of the average point-in-time balance of the deposit account in the past 6 months; the square root is taken of the minimum credit limit utilization rate of revolving credit in the past 6 months; the natural logarithm is taken of the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM; the cube root is taken of the number of months in the past 12 months in which the credit repayment rate exceeded 80%; the original value is used for the total personal loan repayment amount in the current month; and the square is taken of the number of credit accounts whose current credit limit utilization rate exceeds 90%.
32. The method of item 30, wherein,
The converted values of the eight features, namely the average point-in-time balance of the deposit account in the past 6 months, the number of months in which the credit card bill balance increased continuously over the past 6 months, the minimum credit limit utilization rate of revolving credit in the past 6 months, the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM, the number of months in the past 12 months in which the credit repayment rate exceeded 80%, the maximum number of consecutive months in the past 12 months in which the credit card limit utilization rate exceeded 90%, the total personal loan repayment amount in the current month, and the number of credit accounts whose current credit limit utilization rate exceeds 90%, are substituted into a submodel, constructed by logistic regression on sample retail credit prediction data and credit violation probability, to calculate the violation probability of the sample to be predicted.
33. The method of item 32, wherein,
The submodel is shown in equation 5 below:
where k is the number of features entered into the model, preferably k is 8;
α is the intercept term, with a preferred range of (0.2614, 0.4061), most preferably 0.334;
β1 is the coefficient corresponding to the average point-in-time balance of the deposit account in the past 6 months, with a preferred range of (-0.2041, -0.1896), most preferably -0.197;
β2 is the coefficient corresponding to the number of months in which the credit card bill balance increased continuously over the past 6 months, with a preferred range of (-0.4365, -0.3966), most preferably -0.417;
β3 is the coefficient corresponding to the minimum credit limit utilization rate of revolving credit in the past 6 months, with a preferred range of (0.1747, 0.1912), most preferably 0.183;
β4 is the coefficient corresponding to the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM, with a preferred range of (0.2261, 0.2516), most preferably 0.239;
β5 is the coefficient corresponding to the number of months in the past 12 months in which the credit repayment rate exceeded 80%, with a preferred range of (-0.5716, -0.51), most preferably -0.541;
β6 is the coefficient corresponding to the maximum number of consecutive months in the past 12 months in which the credit card limit utilization rate exceeded 90%, with a preferred range of (-0.3114, -0.2683), most preferably -0.290;
β7 is the coefficient corresponding to the total personal loan repayment amount in the current month, with a preferred range of (-0.3661, -0.281), most preferably -0.324;
β8 is the coefficient corresponding to the number of credit accounts whose current credit limit utilization rate exceeds 90%, with a preferred range of (-0.4713, -0.3529), most preferably -0.412;
x1 is the natural-logarithm conversion value of the average point-in-time balance of the deposit account in the past 6 months, generated in the feature conversion step;
x2 is the WOE conversion value of the number of months in which the credit card bill balance increased continuously over the past 6 months, generated in the feature conversion step;
x3 is the square-root conversion value of the minimum credit limit utilization rate of revolving credit in the past 6 months, generated in the feature conversion step;
x4 is the natural-logarithm conversion value of the ratio of the past-6-month monthly average credit amount due in the current month to the monthly average AUM, generated in the feature conversion step;
x5 is the cube-root conversion value of the number of months in the past 12 months in which the credit repayment rate exceeded 80%, generated in the feature conversion step;
x6 is the WOE conversion value of the maximum number of consecutive months in the past 12 months in which the credit card limit utilization rate exceeded 90%, generated in the feature conversion step;
x7 is the original value of the total personal loan repayment amount in the current month, generated in the feature conversion step;
x8 is the square conversion value of the number of credit accounts whose current credit limit utilization rate exceeds 90%, generated in the feature conversion step.
34. The method of item 33, wherein,
After the credit violation probability is calculated, a credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
where P is the default probability of the borrower generated in the credit violation probability calculation module; A is 443.9036; B is -72.1348; the round function rounds the calculated score to the nearest integer; and finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
35. The method according to any one of items 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four or five of: the number of months in the past 12 months in which a gold account was held, the average balance of the investment and wealth-management account in the past 3 months, the minimum point-in-time balance of the deposit account in the past 3 months, the number of salary payments in the past 6 months, and the maximum salary in the past 12 months.
36. The method of item 35, wherein the credit violation probability calculating step comprises:
performing feature conversion on the number of months in the past 12 months in which a gold account was held, the average balance of the investment and wealth-management account in the past 3 months, the minimum point-in-time balance of the deposit account in the past 3 months, the number of salary payments in the past 6 months, and the maximum salary in the past 12 months,
Preferably, the number of months in the past 12 months in which a gold account was held is converted in the WOE mode; the average balance of the investment and wealth-management account in the past 3 months is converted by the continuous conversion method; the minimum point-in-time balance of the deposit account in the past 3 months is converted by the continuous conversion method; the number of salary payments in the past 6 months is converted in the WOE mode; and the maximum salary in the past 12 months is converted by the continuous conversion method;
Further preferably, in the continuous conversion method, the square root is taken of the average balance of the investment and wealth-management account in the past 3 months; the natural logarithm is taken of the minimum point-in-time balance of the deposit account in the past 3 months; and the original value is used for the maximum salary in the past 12 months.
37. The method according to item 35, wherein,
The converted values of the five features, namely the number of months in the past 12 months in which a gold account was held, the average balance of the investment and wealth-management account in the past 3 months, the minimum point-in-time balance of the deposit account in the past 3 months, the number of salary payments in the past 6 months, and the maximum salary in the past 12 months, are substituted into a submodel, constructed by logistic regression on sample retail credit prediction data and credit violation probability, to calculate the violation probability of the sample to be predicted.
38. The method of item 37, wherein,
The submodel is shown in equation 6 below:
where k is the number of features entered into the model, preferably k is 5;
α is the intercept term, with a preferred range of (0.2884, 1.9611), most preferably 1.125;
β1 is the coefficient corresponding to the number of months in the past 12 months in which a gold account was held, with a preferred range of (-0.721, -0.6281), most preferably -0.675;
β2 is the coefficient corresponding to the average balance of the investment and wealth-management account in the past 3 months, with a preferred range of (-0.0646, -0.0125), most preferably -0.039;
β3 is the coefficient corresponding to the minimum point-in-time balance of the deposit account in the past 3 months, with a preferred range of (-0.1275, -0.0891), most preferably -0.108;
β4 is the coefficient corresponding to the number of salary payments in the past 6 months, with a preferred range of (-0.4612, -0.2386), most preferably -0.350;
β5 is the coefficient corresponding to the maximum salary in the past 12 months, with a preferred range of (-0.3679, -0.1707), most preferably -0.269;
x1 is the WOE conversion value of the number of months in the past 12 months in which a gold account was held, generated in the feature conversion step;
x2 is the square-root conversion value of the average balance of the investment and wealth-management account in the past 3 months, generated in the feature conversion step;
x3 is the natural-logarithm conversion value of the minimum point-in-time balance of the deposit account in the past 3 months, generated in the feature conversion step;
x4 is the WOE conversion value of the number of salary payments in the past 6 months, generated in the feature conversion step;
x5 is the original value of the maximum salary in the past 12 months, generated in the feature conversion step.
39. The method of item 38, wherein,
After the credit violation probability is calculated, a credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
where P is the default probability of the borrower generated in the credit violation probability calculation module; A is 443.9036; B is -72.1348; the round function rounds the calculated score to the nearest integer; and finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
40. An apparatus for calculating retail credit risk, comprising:
a data acquisition module for acquiring retail credit prediction data of a sample to be predicted;
a sample classification module for classifying the sample to be predicted based on a decision tree method to determine a submodel for calculating a credit violation probability;
a credit violation probability calculation module for substituting the retail credit prediction data into the credit violation probability submodel to calculate the credit violation probability of the sample to be predicted, preferably the credit violation probability for an online credit service.
41. The apparatus of item 40, wherein the apparatus performs the steps of the method of calculating retail credit risk of any one of items 1 to 39.
42. A system for calculating retail credit risk, comprising: a memory, a processor, and a program for the method of calculating retail credit risk that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method of calculating retail credit risk as recited in any one of items 1 to 39.
43. A computer storage medium having stored thereon a program for a method of calculating retail credit risk, wherein the program, when executed by a processor, implements the steps of the method of calculating retail credit risk as recited in any one of items 1 to 39.
44. A method of constructing a retail credit risk prediction model for an online credit service, comprising:
a data acquisition step of acquiring raw retail credit prediction data of the samples used for constructing the model;
a data deriving step of producing derived retail credit prediction data by processing the raw retail credit prediction data;
a feature preliminary screening step of preliminarily screening all categories of data, i.e., all features, including the raw retail credit prediction data and the derived retail credit prediction data, to obtain preliminarily screened features;
a preliminary screening data conversion step of determining, for each preliminarily screened feature, which of a WOE conversion mode, a dummy feature conversion mode and a continuous conversion mode is optimal, and converting the feature in the mode so determined;
a feature fine screening step of further screening the converted, preliminarily screened features to obtain finely screened features;
a credit violation probability modeling step of constructing, by logistic regression, a model of the probabilistic relationship between the combination of finely screened features and credit violation, thereby establishing how the credit violation probability is calculated;
wherein the data samples obtained in the data acquisition step are samples of customers using an online credit service.
45. The method of item 44, wherein,
In the data acquisition step, the raw retail credit prediction data obtained for the samples used to construct the model includes:
credit card base data, which is all available data on the sample user's credit card opening and use,
personal loan base data, which is all available data on the sample user's loan applications and loan usage behavior,
customer basic information base data, which is data on the attributes of the sample user that are not directly related to the user's behavior at the financial institution,
personal financial asset base data, which is data on all of the sample user's other financial assets and financial transactions at the financial institution that are unrelated to credit cards and loans.
46. The method of item 44, wherein,
In the data deriving step, the derived retail credit prediction data refers to data obtained by processing the collected raw retail credit prediction data along a time dimension, a space dimension, a frequency dimension and a statistical information dimension;
preferably, the derived retail credit prediction data includes, but is not limited to:
derived retail credit prediction data processed based on the length of the sample's customer relationship,
derived retail credit prediction data processed based on time-interval variables,
derived retail credit prediction data processed based on the frequency of the sample's behavior,
derived retail credit prediction data processed based on the sample's status at the current time point,
derived retail credit prediction data processed based on the sample's sustained behavior,
derived retail credit prediction data obtained by processing the sample data along the statistical information dimension.
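The kinds of derived data listed above can be illustrated with a short sketch. It assumes each raw series is a list of monthly values ordered from oldest to newest, and it reads "continuous increase" and "maximum consecutive months" as the longest unbroken run within the window; both readings, and all function names, are assumptions made for illustration.

```python
from statistics import mean


def window_stats(monthly_values, months):
    """Statistical-dimension features over the most recent `months` values."""
    window = monthly_values[-months:]
    return {"min": min(window), "max": max(window), "mean": mean(window)}


def longest_run(flags):
    """Length of the longest unbroken run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def max_consecutive_months_above(monthly_values, threshold, months=12):
    """Sustained-behavior feature, e.g. utilization rate > 90%."""
    window = monthly_values[-months:]
    return longest_run(v > threshold for v in window)


def months_of_continuous_increase(monthly_values, months=6):
    """Longest run of month-over-month increases within the window."""
    window = monthly_values[-months:]
    return longest_run(b > a for a, b in zip(window, window[1:]))
```

For example, a utilization series of 0.5, 0.95, 0.97, 0.2, 0.91, 0.92, 0.93, 0.1 contains at most three consecutive months above 0.9.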
47. The method according to any one of items 44 to 46, wherein,
The feature preliminary screening step comprises:
a first preliminary screening step of screening the features based on the extent of missing data for each feature of the samples used to construct the model,
a second preliminary screening step of screening out features for which a single value accounts for too high a share of the samples,
a third preliminary screening step of calculating the information value (IV) of each feature and screening the features accordingly,
wherein the first, second and third preliminary screening steps may be performed in any order,
a fourth preliminary screening step of screening the features remaining after the first to third preliminary screening steps using a stepwise discriminant algorithm;
and a fifth preliminary screening step of screening the features remaining after the fourth preliminary screening step based on whether the risk trend of each feature is consistent with the actual outcomes of the samples used for model construction.
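The first three preliminary screening steps can be sketched as follows. The text gives no thresholds, so `max_missing`, `max_top_share` and `min_iv` are hypothetical parameters. The WOE sign convention, ln(non-default share / default share), is likewise an assumption; the IV itself does not depend on it. Bins are assumed to be given as (non-default count, default count) pairs.

```python
import math
from collections import Counter


def missing_rate(values):
    """Share of samples for which the feature is missing (None)."""
    return sum(v is None for v in values) / len(values)


def top_value_share(values):
    """Share of non-missing samples taken by the single most common value."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    return Counter(present).most_common(1)[0][1] / len(present)


def information_value(bins):
    """IV = sum over bins of (good_share - bad_share) * WOE."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            raise ValueError("merge bins so that each holds both outcomes")
        good_share, bad_share = good / total_good, bad / total_bad
        woe = math.log(good_share / bad_share)   # assumed sign convention
        iv += (good_share - bad_share) * woe
    return iv


def passes_screens(values, bins, max_missing=0.9, max_top_share=0.95,
                   min_iv=0.02):
    """Apply the three screens; all thresholds are hypothetical."""
    return (missing_rate(values) <= max_missing
            and top_value_share(values) <= max_top_share
            and information_value(bins) >= min_iv)
```

A feature whose bins carry identical default rates has an IV of zero and would be dropped by the third screen regardless of the threshold chosen.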
48. The method of any one of items 44 to 47, further comprising:
a sample selection step, performed before the data acquisition step, of screening all users to obtain samples of users of the online credit service for model construction,
Preferably, the sample selection step includes classifying all sample users based on a decision tree, the classification criteria including but not limited to:
whether the user is a customer who has applied to register for credit services at the financial institution;
whether the user is a customer with stable income;
whether the user is a new customer or an existing customer;
the geographic area in which the user conducts business.
49. The method according to any one of items 44 to 48, wherein in the preliminary screening data conversion step, the conversion mode of each preliminarily screened feature is determined based on the concentration and the data type of the feature.
50. The method of item 49, wherein,
The preliminary screening data conversion step, in which the determination is based on concentration and data type, comprises:
classifying each feature by data type as a character-type variable or a numerical-type variable,
converting character-type variables in the dummy feature conversion mode,
further classifying numerical-type variables through the following sub-steps:
if the numerical variable takes fewer than n distinct values, converting it in the WOE conversion mode,
if the numerical variable takes more than n distinct values, treating it as a continuous variable with many values and further determining: if the concentration of a single value is greater than m%, the WOE conversion mode is adopted; if the concentration of a single value is less than or equal to m%, the continuous conversion mode is adopted,
Preferably, n and m are both positive integers, with n = 5 to 10 and m = 90 to 99.
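The decision rule of item 50 can be sketched as a small function. The text does not say how a variable with exactly n distinct values is handled; the sketch assigns that case to the second branch, which is an assumption. The function name `choose_conversion` and the defaults n=10, m=95 (both inside the preferred ranges) are illustrative.

```python
from collections import Counter


def choose_conversion(values, n=10, m=95):
    """Return 'dummy', 'woe' or 'continuous' for one feature."""
    present = [v for v in values if v is not None]
    # Character-type variables use the dummy feature conversion mode.
    if any(isinstance(v, str) for v in present):
        return "dummy"
    counts = Counter(present)
    # Numerical variables with fewer than n distinct values use WOE.
    if len(counts) < n:
        return "woe"
    # Otherwise check the concentration of the single most common value.
    top_share = 100.0 * counts.most_common(1)[0][1] / len(present)
    return "woe" if top_share > m else "continuous"
```

A numerical feature that is zero for 99% of samples is therefore binned by WOE even though it has many distinct values, because a continuous transform would be dominated by the single concentrated value.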
51. The method of item 50, further comprising:
For each feature confirmed to adopt the continuous conversion mode, selecting an optimal conversion method, based on the correlation between the feature and credit violations under the different continuous conversions, to perform continuous feature conversion of the feature,
The continuous feature conversion is preferably performed by directly using the original value, calculating the square of the original data, calculating the square root of the original data, calculating the cube root of the original data, or calculating the natural logarithm of the original data.
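As an illustration of item 51, the following minimal sketch scores each of the five listed conversions by the absolute Pearson correlation between the converted feature and the default indicator and keeps the best one. The function names, the use of Pearson correlation as the correlation measure, and the restriction to strictly positive feature values (so that the square root and natural logarithm are defined) are assumptions of this sketch, not details given in the text.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(vx * vy)

# The five candidate conversions listed in the text.
# Assumption: inputs are strictly positive, so sqrt and log are defined.
TRANSFORMS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "sqrt": math.sqrt,
    "cbrt": lambda x: x ** (1.0 / 3.0),
    "log": math.log,
}

def best_transform(values, defaults):
    """Return (name, scores) where name has the largest |correlation|."""
    scores = {
        name: abs(pearson([f(v) for v in values], defaults))
        for name, f in TRANSFORMS.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical usage: a feature column and 0/1 default flags.
name, scores = best_transform([1.0, 3.0, 2.0, 9.0, 7.0], [0, 0, 0, 1, 1])
```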
52. The method of any one of items 44 to 51, wherein the feature fine screening step comprises:
A first fine screening step of screening features, based on a stepwise regression algorithm, according to the significance of the features under the F test and the t test,
A second fine screening step of calculating a variance inflation factor (VIF) for each feature and eliminating features having a higher variance inflation factor,
And a third fine screening step of analyzing, based on logistic regression applied to the features remaining after the first and second fine screening steps, whether each feature's coefficient accords with the expected trend of the prediction result for credit violations, so as to further screen the features.
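The second fine screening step can be sketched as follows. This is a minimal illustration that relies on the standard identity that the VIF of feature j equals the j-th diagonal entry of the inverse of the feature correlation matrix. The helper names are hypothetical, the code assumes no feature column is constant and no two columns are perfectly collinear, and the text does not state what counts as a "higher" factor (a cut-off of 10 is a common rule of thumb, used here only in the comment).

```python
import math

def correlation_matrix(columns):
    """Correlation matrix of a list of equal-length feature columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Features whose VIF exceeds a chosen cut-off (e.g. 10) would be candidates for removal.
```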
53. The method of any one of items 44 to 52, wherein the credit violation probability modeling step substitutes the features filtered by the feature fine screening step into a Sigmoid function for logistic regression to calculate a model of the credit violation probability.
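To make item 53 concrete, the sketch below evaluates a fitted logistic regression by passing a linear combination of the converted features through the Sigmoid function. The coefficient values and names are hypothetical placeholders; fitting the coefficients is not shown.

```python
import math

def default_probability(features, coefficients, intercept):
    """Sigmoid of the linear score: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical coefficients for two converted features.
p = default_probability([0.4, -1.2], coefficients=[0.8, 0.5], intercept=-2.0)
```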
54. An apparatus for constructing a retail credit risk prediction model for an online credit service, the apparatus comprising:
A data acquisition module for acquiring raw retail credit prediction data for a sample used to build a model;
a data derivation module for processing derived retail credit prediction data based on the original retail credit prediction data;
the feature primary screening module is used for carrying out primary screening on all categories comprising original retail credit prediction data and derived retail credit prediction data, namely all features, so as to obtain features after primary screening;
The primary screening data conversion module is used for judging a conversion mode of the features after primary screening so as to confirm that one of a WOE conversion mode, a dummy feature conversion mode and a continuous conversion mode is adopted for carrying out feature conversion, and carrying out feature conversion by adopting an optimal mode for judging each feature after primary screening;
The feature fine screening module is used for carrying out deep screening on the features subjected to the primary screening after feature conversion so as to obtain the features subjected to the fine screening;
A credit violation probability modeling module for model construction by selecting a logistic regression mode for the probability relation between the feature combination after fine screening and the credit violation, and confirming the mode for calculating the credit violation probability,
Wherein the data sample acquired by the data acquisition module is a customer sample using an online credit service.
55. The apparatus of item 54, wherein the apparatus performs the steps of the method of constructing a retail credit risk prediction model of any one of items 44 to 53.
56. A system for constructing a retail credit risk prediction model for an online credit service, the system comprising: a memory, a processor, and a program that is stored on the memory, is executable on the processor, and implements the method of constructing a retail credit risk prediction model, wherein the program, when executed by the processor, implements the steps of the method of constructing a retail credit risk prediction model as recited in any one of items 44 to 53.
Effects of the invention
The method and the system for constructing the retail credit risk prediction model combine a large number of samples of a large financial institution when constructing the model, screen client data of online credit business from the samples as original data, deeply process and derive the original data obtained by the samples, and construct the retail credit general scoring model according to the self characteristics of the original data and the derived data and strong financial attribute information which is difficult to acquire in the market and is contained in the data by utilizing an advanced statistical analysis method and an interpretable machine learning technology.
In addition, before constructing the model, the method first uses a decision tree to classify the samples reasonably and constructs sub-classification models based on the classified samples. Because the financial model is constructed in combination with a decision tree, samples can be effectively classified into actionable categories, so that the credit risk characteristics of different customer groups are better covered and the lack of subgroup representativeness that results from building one model on all samples is avoided.
Further, when the risk prediction model is constructed, the model features are preliminarily screened by stepwise discriminant analysis, which improves the overall efficiency of model development. The stepwise discriminant method better identifies the more important feature variables within the same dimension and greatly reduces the developer's subsequent workload of judging and screening variables one by one according to their trends. It thus improves model development efficiency without affecting the overall model effect and makes the preliminary screening of variables more efficient and accurate.
In model construction, the application combines three modes, namely continuous conversion, WOE conversion, and dummy feature conversion, to reprocess some of the features after preliminary screening. Weighing the advantages and disadvantages of the three conversion modes, it provides an original conversion judging method that selects the optimal feature conversion mode according to parameters such as the data attribute, missing rate, and concentration of each feature, assisted by judgment based on business logic.
In addition, in the method and system of the present application for calculating credit risk probability or scoring credit risk, the Delta model (the Delta series of scorecards, comprising Delta-Delta sub-models or sub-scorecards) is first partitioned based on a decision tree when the model is constructed. When credit risk is calculated, a client sample is therefore first classified, and the most suitable sub-model or sub-scorecard is selected to process it. Each sub-model or sub-scorecard is itself built by the model construction method described above, which avoids two technical defects: the overfitting that arises when a model is built entirely from WOE variables, and the poor adaptation to categorical variables that arises when a model is built entirely from continuous variables. The method is therefore markedly better than existing models at predicting credit risk. Furthermore, the present application targets the online credit service: the sample data are client data of the online credit service screened from massive data, covering both the predictor variables of the internet business and its overdue performance data. Because the risk patterns are derived by statistical principles from data of this specific business scenario, the application shows marked suitability and discrimination for the credit risk of the online credit service.
Drawings
FIG. 1 shows the age distribution of the initial sample population used for modeling in Embodiment 1 of the present application.
Fig. 2 is a graph illustrating the result of the model discrimination effect of embodiment 1 of the present application.
FIG. 3 shows a typical flow chart for classifying all users of a sample based on a decision tree.
Fig. 4 shows an example of a binning confirmation of whether the data meets business trends.
Detailed Description
Credit risk is the risk that a borrower's ability to perform the borrowing contract is reduced or lost because of a change in economic capacity, as distinct from the risk of default caused by intentional fraud. Credit violations occur in all types of retail credit business scenarios and are closely tied to the borrower's personal economic circumstances. The reasons a borrower becomes overdue can be roughly grouped into 4 categories: 1. a short credit history: the borrower has little experience in managing personal finances; 2. temporary forgetfulness: the borrower forgets to repay on time; 3. excessive borrowing: the borrower has relatively low repayment capacity because of a large number of liabilities; 4. major negative events: the borrower's repayment capacity suffers a long-term impact from events such as reduced income, unemployment, or divorce. Each of these reasons can cause a borrower to become overdue to a greater or lesser degree and may lead to more serious default. A credit risk score mines the inherent mathematical relationship between a customer's historical information and the probability of future default, and converts that relationship into a score that quantifies the probability of default.
Credit risk scores in the prior art mainly use credit history data and multi-head data, which reflect the borrower's payment behavior and willingness to pay. (Multi-head data are behavioral statistics on a borrower's loan requests to multiple financial institutions; it is generally considered that the more multi-head requests occur in a short time, the higher the probability of future delinquency.) The sample data used to construct the model of the present application include not only transaction behavior information of the credit service but also asset data that are generally difficult to obtain. The application can therefore comprehensively measure the borrower's repayment capacity, personal qualifications, and the like while reflecting payment behavior and willingness to pay, thereby providing more accurate prediction results.
< General description of model construction method >
In particular, the present application relates to a method of constructing a retail credit risk prediction model for an online credit service, comprising: a data acquisition step of acquiring raw retail credit prediction data of a sample used for constructing a model; a data deriving step of processing derived retail credit prediction data based on the original retail credit prediction data; a feature preliminary screening step of preliminarily screening all categories, i.e., all features, including the original retail credit prediction data and the derived retail credit prediction data to obtain preliminarily screened features; a preliminary screening data conversion step of judging a conversion mode of the preliminarily screened features to confirm that one of a WOE conversion mode, a dummy feature conversion mode and a continuous conversion mode is adopted to perform feature conversion, and performing feature conversion by adopting a judged optimal mode for each preliminarily screened feature; a feature fine screening step, namely performing deep screening on the features subjected to the primary screening after feature conversion to obtain the features subjected to the fine screening; and a credit violation probability modeling step, wherein a mode of logistic regression is selected for the combination of the characteristics after fine screening and the probability relation between the credit violations, so that a model is built, and a method for calculating the credit violation probability is confirmed.
In a specific manner, the method for constructing a retail credit risk prediction model for an online credit service according to the present application may further comprise a sample selection step, performed before the data acquisition step, of screening all users to obtain samples for model construction. Those skilled in the art will understand that, depending on the total number of users and the data available for building the model, they may choose whether to cull user data unsuitable for model building. In addition, one skilled in the art may first select a subset of users as samples and then enlarge the modeling sample according to corresponding rules.
In a particular embodiment, the selected sample consists of customers of the online credit service among the customers who applied to register for credit services within a certain period of time.
< Sample selection step >
In a specific embodiment, the sample selection step screens all users to obtain samples for model construction prior to the data acquisition step. Specifically, the sample selection step includes classifying all users of the sample based on a decision tree, the classification basis including but not limited to: whether a certain user is a customer who has applied to register for credit services at a financial institution; whether a certain user is an income-stable client; whether a certain user is a customer registered at a financial institution who has not made a credit application; whether a certain user is a new client or a stock client; the geographic area in which a certain user handles business; whether a certain user has had a financial institution risk event (e.g., whether a default is currently occurring, whether a house is currently on hold); whether a certain user holds a credit card issued by a financial institution, and/or whether the credit card is in continuous use, and/or whether a credit card or personal loan is being used for revolving credit.
Further, when all users of the sample are classified based on the decision tree, the analysis sample must have a performance variable for modeling. The sample is therefore first divided according to whether the user has applied to register for credit services at the relevant bank, that is, into users who have applied for registration and users who have not.
Credit business is a basic and important asset business of banks, in which profit is earned by issuing loans, collecting principal and interest, and deducting costs. In general, credit business is an important source of bank profit and can be classified into corporate credit business and personal credit business. Corporate credit business includes project loans, working capital loans, small business loans, real estate business loans, etc.; personal credit business includes personal housing loans, personal consumption loans, personal business loans, and the like.
A decision tree is a decision analysis method that, given the known probabilities of various situations, constructs a tree to obtain the probability that the expected net present value is greater than or equal to zero, and thereby evaluates project risk and judges feasibility; it is a graphical method that applies probability analysis intuitively. The present application classifies the modeling samples according to the characteristics of large financial institution data. After the samples are divided into those that have applied for credit services and those that have not, the samples that have applied can be further divided into new clients and stock clients. The criterion for a new client can be set as needed by those skilled in the art; for example, users who applied for credit services within the past m months are treated as new clients, and clients who applied more than m months ago are treated as stock clients.
In a specific embodiment, m may be set to, for example, 3, 6, 9, or 12; m may also be referred to as the account number in the present application.
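The first two splits described above (applied or not applied, then new or stock by the threshold m) can be sketched as a small rule function. The function name, the segment labels, and the default m = 6 are illustrative assumptions; the text only requires that applications within m months count as new and those older than m months count as stock.

```python
def segment_customer(applied_for_credit, months_since_application=None, m=6):
    """Assign a customer to a sub-model segment using the two splits in the text.

    applied_for_credit: True if the customer has applied to register for credit services.
    months_since_application: whole months since that application (ignored otherwise).
    m: threshold in months; 3, 6, 9 and 12 are the example values given in the text.
    """
    if not applied_for_credit:
        return "not_applied"
    if months_since_application is None:
        raise ValueError("months_since_application is required for applicants")
    # Within m months -> new client; more than m months -> stock client.
    return "new" if months_since_application <= m else "stock"

# Hypothetical usage with m = 12.
segment = segment_customer(True, months_since_application=14, m=12)
```

Further splits named in the text (income stability, geographic area, risk events, credit card holding) would add more branches of the same form.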
It is well understood by those skilled in the art that other splitting methods may be selected when splitting the sample, for example, all samples may be split into a customer with a credit card and a customer group without a credit card at the time of splitting, and the customer sample splitting method at the time of model construction may be fully considered and performed according to the modeling requirement.
In the construction process of the model of the application, since the model is required to be constructed for the online credit service, the selected sample is a customer sample of the online credit service in the customer samples applying for the registration credit service in a certain period of time.
In one particular embodiment, 9.1 million customers of the online credit service are screened, as the modeling sample, from 120 million customers who applied to register for credit services. Such a sample is more targeted and representative for predicting the credit risk of an online credit service.
< Original retail Credit data >
In constructing the model of the present application, about 120 kinds of original retail credit prediction data (i.e., 120 basic features) can be obtained from the basic data of a financial institution. These are split or processed along different dimensions so as to capture retail credit risk points as fully as possible while preserving the model effect, generating 3000 derivative features.
When the model of the present application was built, credit card application and behavior data, personal loan application and behavior data, personal financial asset transaction data, and personal customer information data of all retail customers over 4 years were first collected from the full historical data of a large financial institution, covering 730 million people in total, each of whom has data to be processed every month. The data system used to build the model is therefore comprehensive and very large in volume, and the modeling method must be chosen with this in mind. Otherwise, the special sample groups that need attention would be buried in the huge data volume and could not be identified effectively, and the computer program would run slowly or fail to run at all, so that the most suitable prediction model could not be built accurately.
In the data acquisition step, the raw retail credit prediction data of the sample for constructing the model obtained includes: credit card type base data, which is data that is fully available during sample-based credit card creation and use (i.e., credit card application and behavioral data of retail customers); the personal loan-type base data, which is all available data based on the loan application situation and the usage behavior of the sample (i.e., personal loan application and behavior data), the customer base information-type base data, which is data based on the properties of the sample itself but not directly related to the behavior at the financial institution (i.e., personal customer information data), or the personal financial property-type base data, which is all other financial property and financial transaction-type data (personal financial property transaction data) of the sample not related to credit cards and loans at the financial institution.
In one particular embodiment, the credit card type base data includes, but is not limited to: information on credit card bills, credit card cash withdrawals, credit card installment status, interest generated on credit cards, and the like, in different dimensions (such as the time, space, and frequency dimensions). The base data is not limited to the specific types listed; as credit card business changes, those skilled in the art can also cover newly emerging data types in implementation. That is, all types of data available in the course of a user's credit card creation and use can serve as credit card type base data.
In one particular embodiment, the personal loan type base data includes, but is not limited to: the account of the personal loan, the overdue status of the personal loan, the balance of the personal loan, the total amount of the personal loan, the amount due on the personal loan, the amount actually repaid, and the like. The base data is not limited to the specific types listed; as personal loan business changes, those skilled in the art can also cover newly emerging data types. That is, all types of data available from a user's loan application situation and behavior can serve as personal loan type base data.
In a specific embodiment, the customer basic information type base data includes, but is not limited to: sex, age, and the administrative region to which the customer's business belongs. The base data is not limited to the specific categories listed; as customer circumstances and social relationships develop, those skilled in the art can also cover other or new data types in implementation within the scope of the business application scenario. That is, all kinds of data that are based on the attributes of the user sample itself but are not directly related to behavior at the financial institution can serve as customer basic information type base data.
In one particular embodiment, the personal financial asset type base data includes, but is not limited to: AUM (i.e., assets under management), deposits, financing, and payroll disbursement information. The base data is not limited to the specific categories listed; as financial assets change, those skilled in the art can also cover newly emerging data types in practice. That is, all other financial asset and financial transaction data of the sample at the financial institution that are not related to credit cards and loans can serve as personal financial asset type base data.
In one particular embodiment, the original retail credit prediction data of the sample used to construct the model comprise more than 120 data types (i.e., base variables or base features) collected from 730 million people, including but not limited to: information on credit card bills, credit card cash withdrawals, credit card installment status, interest generated on credit cards, and the like, such as account, overdue status, balance, amount, amount due, and amount actually repaid, in different dimensions (such as the time, space, and frequency dimensions); the account of the personal loan, the overdue status of the personal loan, the balance of the personal loan, the total amount of the personal loan, the amount due on the personal loan, the amount actually repaid, and the like; customer basic information including sex, age, and the administrative region to which the business belongs; and AUM (i.e., assets under management), deposits, financing, and payroll disbursement.
< Derived retail credit forecast data >
In the data deriving step of the present application, derived retail credit prediction data are produced from the original retail credit prediction data; that is, they are obtained by processing the collected original data along the time, space, frequency, and statistical information dimensions.
In one embodiment, the derived retail credit prediction data includes, but is not limited to: derived retail credit prediction data processed based on sample relationship lengths, derived retail credit prediction data processed based on time interval class variables, derived retail credit prediction data processed based on sample behavior frequency levels, derived retail credit prediction data processed based on sample current time point conditions, derived retail credit prediction data processed based on sample sustained behaviors, or derived retail credit prediction data processed based on statistical information dimensions. For example, month data of the customer may be acquired, and processing may be performed based on the month data. Processing based on statistical information dimensions in the present application includes taking the maximum, minimum, average, etc. of the data to describe the data situation.
In a specific embodiment, starting from the time dimension, customer relationship length variables, such as the customer's account opening time and maximum account age, are used as a category of derived retail credit prediction data, i.e., as derived features or derived variables.
In a specific embodiment, starting from the time dimension, time interval variables are considered, such as the number of months since the customer's last payment and the number of months since the customer was last overdue, as derivative features or derivative variables.
In a specific embodiment, starting from the frequency dimension, behavior frequency variables are considered, such as the number of repayments in the last X months being > N and the number of times the credit limit was used in the last X months being > N, as derivative features or derivative variables. X and N are not limited and may be any positive integers consistent with business logic.
In a specific embodiment, starting from the time dimension, current time point variables are considered, such as the customer's current-month credit limit and current-month balance, as derivative features or derivative variables.
In a specific manner, starting from the time dimension, continuous behavior variables are considered, such as the maximum number of consecutive overdue occurrences in the last X months being > N and the number of overdue occurrences in the last X months being > N, as derivative features or derivative variables.
In a specific manner, starting from the statistical information dimension, statistical variables are considered, such as the maximum number of overdue periods in the last X months and the average credit limit usage within the last X months, as derivative features or derivative variables.
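The time interval, frequency, and continuous behavior variables described above can be sketched from a per-customer monthly history. The representation used here, a list of monthly overdue flags ordered from oldest to newest with the last entry being the current month, and the function names are assumptions of this sketch.

```python
def overdue_count(history, x):
    """Number of overdue months within the last x months."""
    return sum(1 for flag in history[-x:] if flag)

def max_consecutive_overdue(history, x):
    """Longest run of consecutive overdue months within the last x months."""
    best = run = 0
    for flag in history[-x:]:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

def months_since_last_overdue(history):
    """Months between the current month and the most recent overdue month.

    Returns 0 if the current month is overdue and None if never overdue.
    """
    for offset, flag in enumerate(reversed(history)):
        if flag:
            return offset
    return None

def exceeds(count, n):
    """Turn a count into the 'count > N' indicator form used in the text."""
    return count > n

# Hypothetical 8-month history (1 = overdue in that month), x assumed to be >= 1.
history = [0, 1, 1, 0, 1, 1, 1, 0]
flag = exceeds(overdue_count(history, 6), 3)
```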
It will be apparent to those skilled in the art that the methods of processing derivative variables described above are merely illustrative and may be chosen freely. For example, 120 types of base data may be processed into more than 3000 derived data. In the present application, the terms data, data type, variable, and feature are sometimes used interchangeably, and those skilled in the art can understand them based on common knowledge of statistics.
In the present application, derivative data may be obtained from the original retail credit prediction data by simple or by complex processing. Simply processed data are current-period variables, such as current-month wages and current-month balance, which can be used directly after the data are aggregated. Complex derivative data are produced by time slicing and logical processing of the current-period variables, yielding derivative variables such as the maximum payroll over the past 3 months and the minimum balance over the past 12 months.
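The time-slicing step can be illustrated with a few window statistics over a monthly series. The list layout (oldest month first, current month last) and the helper names are assumptions of this sketch; when fewer months are available than the window asks for, it simply uses what exists.

```python
def window(values, months):
    """Return the most recent `months` entries of a monthly series."""
    if months <= 0:
        raise ValueError("months must be a positive integer")
    return values[-months:]

def max_over(values, months):
    return max(window(values, months))

def min_over(values, months):
    return min(window(values, months))

def mean_over(values, months):
    w = window(values, months)
    return sum(w) / len(w)

# Hypothetical monthly series for one customer.
payroll = [5000, 5200, 0, 5100, 5300, 5250]
balance = [1200, 800, 950, 400, 1000, 700]

derived = {
    "max_payroll_3m": max_over(payroll, 3),
    "min_balance_12m": min_over(balance, 12),
    "avg_payroll_3m": mean_over(payroll, 3),
}
```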
< Feature preliminary screening >
In the preliminary screening data conversion step, the conversion mode of each preliminarily screened feature is judged based on its concentration and data type. The judgment comprises the following steps: each feature is classified by data type as a character-type variable or a numerical-type variable; character-type variables are converted by the dummy feature conversion mode; numerical-type variables are further classified by the following sub-steps. If the numerical variable takes fewer than n distinct values, it is converted by the WOE conversion mode. If it takes more than n distinct values, it is treated as a continuous variable with many values and judged further: if the concentration of a single value is greater than m%, the WOE conversion mode is adopted, and if the concentration of a single value is less than or equal to m%, the continuous conversion mode is adopted. Preferably, n and m are positive integers, where n = 5 to 10 and m = 90 to 99.
For example, n is 5, 6, 7, 8, 9 or 10, and m is 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99.
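The judgment just described can be written as a short decision function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, missing values are represented by None and ignored, concentration is the share of the most frequent non-missing value, and a feature with exactly n distinct values (a case the text leaves open) is treated like one with more than n.

```python
from collections import Counter

def choose_conversion(values, n=10, m=95):
    """Pick 'dummy', 'woe' or 'continuous' for one feature column."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("feature has no observed values")
    # Character-type variables always use dummy feature conversion.
    if any(isinstance(v, str) for v in observed):
        return "dummy"
    counts = Counter(observed)
    # Few distinct numeric values: WOE conversion.
    if len(counts) < n:
        return "woe"
    # Many distinct values: decide by the share of the single most common value.
    top_share_pct = 100.0 * max(counts.values()) / len(observed)
    return "woe" if top_share_pct > m else "continuous"

# Hypothetical usage with n = 5 and m = 90.
mode = choose_conversion([0.0] * 95 + [1.5, 2.5, 3.5, 4.5, 5.5], n=5, m=90)
```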
Specifically, the present application applies several different feature preliminary screening modes to the features, so that a very high-dimensional feature set can be screened effectively. Existing credit risk scores commonly prescreen features by missing rate, concentration, and information value IV.
In a specific embodiment, features are screened based on the data loss of each feature in the modeling sample. Missing rate screening generally deletes variables whose missing rate exceeds 90%, 91%, 92%, 93%, 94%, or 95%; for example, features with a missing rate above 95% may be deleted, or features with a missing rate above 90% may be deleted.
In a specific embodiment, features are screened based on whether a single value accounts for too large a share of the samples of a feature. Concentration screening generally deletes variables in which a single value accounts for 99%, 98%, 97%, 96%, or 95% or more; for example, features in which a single value exceeds 99% may be removed, or features in which a single value exceeds 95% may be removed.
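The data-loss and concentration screens can be sketched together. The names and the dictionary layout are hypothetical; missing values are represented by None, and concentration is computed over non-missing values only, which is an assumption since the text does not say how missing values are treated in that calculation. The defaults use the 95% and 99% examples from the text.

```python
from collections import Counter

def missing_rate(values):
    """Share of samples with no value for the feature."""
    return sum(v is None for v in values) / len(values)

def concentration(values):
    """Share of the most frequent value among non-missing samples."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 1.0
    return max(Counter(observed).values()) / len(observed)

def prescreen(features, max_missing=0.95, max_concentration=0.99):
    """Keep features that pass both the data-loss screen and the concentration screen."""
    return [
        name
        for name, values in features.items()
        if missing_rate(values) <= max_missing
        and concentration(values) <= max_concentration
    ]

# Hypothetical feature table: name -> column of 100 samples.
features = {
    "mostly_missing": [None] * 96 + [1, 2, 3, 4],
    "single_valued": [0] * 100,
    "informative": list(range(100)),
}
kept = prescreen(features)
```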
In one embodiment, the information value (IV) of each feature is calculated to perform a preliminary screening of the features. The IV measures the predictive power of a feature: the larger the IV, the stronger the predictive power. The IV of a single feature is calculated as follows:
$$IV = \sum_{i=1}^{k} \left( \frac{y_i}{y_s} - \frac{n_i}{n_s} \right) \ln\left( \frac{y_i / y_s}{n_i / n_s} \right)$$
where $k$ is the number of groups after the feature is discretized; $y_i$ is the number of non-defaulting clients in the i-th group; $y_s$ is the total number of non-defaulting clients; $n_i$ is the number of defaulting clients in the i-th group; and $n_s$ is the total number of defaulting clients. The IV is interpreted quantitatively as follows: the predictive power of a feature is extremely weak when IV is less than 0.02, weak when IV is between 0.02 and 0.1, good when IV is between 0.1 and 0.3, and strong when IV is greater than 0.3.
Of course, the IV threshold below which features are deleted may also be set to 0.03, 0.04, 0.05, or the like.
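A minimal sketch of the IV calculation follows, assuming the standard definition in terms of the quantities named above: for each group, the difference between its share of non-defaulting clients and its share of defaulting clients, multiplied by the natural logarithm of the ratio of those shares, summed over groups. The function name and the input layout are hypothetical, and the sketch assumes every group contains at least one client of each kind (otherwise the logarithm is undefined and some smoothing would be needed).

```python
import math

def information_value(groups):
    """IV of one discretized feature.

    groups: list of (non_default_count, default_count) pairs, one per group.
    """
    y_s = sum(y for y, _ in groups)  # total non-defaulting clients
    n_s = sum(n for _, n in groups)  # total defaulting clients
    iv = 0.0
    for y_i, n_i in groups:
        good_share = y_i / y_s
        bad_share = n_i / n_s
        iv += (good_share - bad_share) * math.log(good_share / bad_share)
    return iv

# Hypothetical two-group feature: defaults concentrate in the second group.
iv = information_value([(80, 20), (20, 80)])
keep_feature = iv >= 0.02  # 0.02 is the "extremely weak" boundary from the text
```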
In the model construction method of the present application, the step of performing the feature preliminary screening using the deletion rate, the concentration degree, and the information value IV may be performed in any order, for example, the screening may be performed based on the deletion rate, the screening may be performed based on the concentration degree, and the feature preliminary screening may be performed based on the information value IV. Or screening based on concentration, screening based on the deletion rate, and finally performing feature primary screening based on the information value IV. Or screening based on the deletion rate, screening based on the information value IV, and finally performing feature primary screening based on the concentration degree. Or screening based on the information value IV, screening based on the deletion rate, and finally performing feature primary screening based on the concentration degree. Or screening based on the information value IV, then screening based on the concentration degree, and finally performing feature primary screening based on the deletion rate. Or screening based on concentration, screening based on the information value IV, and finally performing feature primary screening based on the deletion rate. The person skilled in the art can select based on the data condition of the sample, so that the characteristic with obvious defects in a certain aspect can be effectively removed based on the three methods, the data dimension can be effectively reduced, and the effect of model construction can be improved.
In a specific mode, the features remaining after screening by the deletion rate, the concentration degree and the information value IV are first screened using a stepwise discriminant algorithm, and are then screened further according to whether the risk characteristic of each feature agrees with the actual outcomes of the samples used for model construction.
On the basis of the above technical scheme, a stepwise discriminant method is introduced to improve the overall efficiency of model development and to make the preliminary screening of variables more efficient and accurate. In actual data, good and bad samples may have nearly identical distributions on a given variable, so that the variable discriminates poorly between them. There may also be a group of variables each of which distinguishes good and bad samples well, but which are redundant if all of them are included in the model because they cover the same data dimensions. To address this, the application uses stepwise discriminant analysis with the Wilks' lambda value as the statistical criterion for entry and removal, and deletes variables that discriminate weakly or are redundant.
The stepwise discriminant method uses the Wilks' lambda criterion to measure the discriminating power of each feature and eliminates, from the features remaining after the three rounds of screening, those that do not meet the set threshold. In the stepwise process, the variable with the strongest discriminating power is added first. As variables are added to the model, the discriminating power of variables introduced earlier may change; if the discriminating power of a variable in the model falls below the threshold, that variable is removed. The process is repeated until all variables in the model meet the Wilks' lambda likelihood ratio criterion and no other variable meets the standard for entering the model.
In the application, screening with the stepwise discriminant method is a key step. To capture as many credit risk points as possible, the models of the present application use a very large amount of data with numerous data dimensions, so variable screening is necessary to reduce the time cost of model development. Existing financial risk models generally prune variables by variable importance, for example by computing importance with measures such as the Gini index or information entropy and selecting the top-ranked variables; stepwise discrimination is rarely used for screening. Compared with the importance-based screening commonly used in the industry, the method retains a large number of variables that are individually less important but carry relatively independent information dimensions.
In one particular approach, more than 3000 derived features may be generated from, for example, 120 types of base data. The three rounds of screening by deletion rate, concentration degree and information value IV delete roughly the 20-30% of features with the poorest data, but more than 2000 features still remain; performing the subsequent fine screening of variables on all of them would seriously reduce development efficiency. Stepwise discrimination was therefore selected as the preferred choice after comparison experiments on several different variable pruning schemes.
For the important features selected by the stepwise discriminant algorithm, further screening is performed based on the risk characteristics (preset by the system) of the various risk points and the actual bad account rate of the sample. It is judged whether the actual bad account rate distribution of each remaining feature conforms to the business trend, and features whose actual bad account rate distribution does not conform are eliminated. Specifically: (1) each remaining target feature is divided into 10-20 bins at the dividing points, and the median value of each bin and the corresponding bad account rate are calculated. (2) The rate of change (slope) is calculated from the median and bad account rate of each bin and those of the preceding bin. (3) Among the rates of change between adjacent bins, the number greater than 0 and the number not equal to 0 are counted, and the percentage of rates of change greater than 0 among those not equal to 0 is calculated. (4) The risk characteristic preset for the feature is obtained and, based on this percentage, it is judged whether the actual trend of the bad account rate is approximately consistent with the trend expected from business logic. The criteria for approximate agreement are as follows. First, if the business logic of the feature is that the bad account rate increases as the feature value increases (e.g., credit line usage), features for which slopes greater than 0 account for less than 70% of the non-zero slopes are eliminated, that is, features that do not show the business trend. Second, if the business logic of the feature is that the bad account rate decreases as the feature value increases (e.g., deposit amount), features for which slopes greater than 0 account for more than 30% of the non-zero slopes are eliminated, that is, features that do not show the business trend.
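The four-step trend check described above can be sketched as follows. This is an illustrative sketch only: the function names, the string labels for the expected trend, and the treatment of a completely flat feature are assumptions, while the 70% and 30% cut-offs are those stated in the text.

```python
def positive_slope_share(bin_medians, bad_rates):
    """Share of positive slopes among the non-zero slopes between adjacent bins."""
    slopes = []
    for i in range(1, len(bin_medians)):
        dx = bin_medians[i] - bin_medians[i - 1]
        if dx == 0:
            continue  # assumption: identical medians are skipped to avoid division by zero
        slopes.append((bad_rates[i] - bad_rates[i - 1]) / dx)
    nonzero = [s for s in slopes if s != 0]
    if not nonzero:
        return None  # no usable trend information
    return sum(1 for s in nonzero if s > 0) / len(nonzero)

def keep_feature(bin_medians, bad_rates, expected_trend):
    """Return True if the observed bad-rate trend roughly matches business logic."""
    share = positive_slope_share(bin_medians, bad_rates)
    if share is None:
        return False  # assumption: a flat feature is treated as showing no trend
    if expected_trend == "increasing":
        return share >= 0.70   # eliminated when below 70%
    if expected_trend == "decreasing":
        return share <= 0.30   # eliminated when above 30%
    raise ValueError("expected_trend must be 'increasing' or 'decreasing'")

# Illustrative 10-bin example with 7 rising and 2 falling steps (made-up numbers).
medians = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rates = [0.01, 0.02, 0.03, 0.04, 0.03, 0.04, 0.05, 0.06, 0.05, 0.06]
share = positive_slope_share(medians, rates)
```

With 7 positive slopes out of 9 non-zero slopes, the share is about 78%, so a feature expected to increase the bad account rate would be kept and one expected to decrease it would be eliminated.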
For example, the following table gives an example of binning in which the feature values are split into 10 bins; the value range of each bin is also summarized in the table. A graph of the bad account rate of each bin against the median of each bin is plotted from the data in the table, as shown in fig. 4. In the example of fig. 4, 7 slopes are greater than 0 and 2 slopes are less than 0, so slopes greater than 0 account for 7/9, or about 78%, of the non-zero slopes. In this way, the features can be further screened based on the risk characteristics (preset by the system) of the various risk points and the actual bad account rate of the sample. Further screening with this binning method effectively selects the features that best conform to the business trend, and thus yields features suitable for modeling.
< Conversion of eigenvalues >
In existing credit risk scoring methods, raw features are mainly processed by WOE conversion (multi-class) and dummy feature conversion (binary), which discretize continuous features (such as age and account age). WOE conversion is an optimal binning scheme based on the modeling samples: continuous features are discretized at the optimal cut points, and the good/bad sample information is embedded in the WOE values. It therefore performs well during model construction, but the binning result may overfit the modeling samples, so that the model effect degrades seriously when applied to the overall population (poor generalization). In addition, binning maps all raw values that fall into the same bin to the single value of that bin, so the ability to distinguish risk among customers within the same interval is lost. Dummy feature conversion is mainly applied to grouping features; its advantage is that it removes any good/bad ordering between different values of the feature, but it becomes extremely complex and redundant when applied to continuous variables.
In the model construction method, the conversion mode of each feature remaining after preliminary screening is judged in order to determine which of WOE conversion, dummy feature conversion and continuous conversion is to be used, and each such feature is then converted using the mode judged to be optimal.
The data conversion step after preliminary screening, based on judgment of the concentration degree and the data type, comprises the following: the features are classified by data type into character-type variables and numerical variables; character-type variables are converted by dummy feature conversion; and numerical variables are classified further by the following sub-steps. If the numerical variable takes fewer than n distinct values, WOE conversion is used. If it takes more than n distinct values, a further judgment is made: if the concentration degree of a single value is greater than m%, WOE conversion is used, and if the concentration degree of a single value is less than or equal to m%, continuous conversion is used. Preferably, n and m are positive integers, with n = 5-10 and m = 90-99.
Specifically, taking educational attainment as an example of a character-type variable, the values of the feature may be primary school, middle school, university, postgraduate, etc. For a numerical variable such as the number of overdue months in the past 3 months, the values are 0, 1, 2 and 3.
In the present application, n may be 5, 6, 7, 8, 9 or 10, and m may be 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99.
In a specific embodiment, n is selected to be 5 and m is selected to be 95.
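The decision rule for choosing a conversion mode can be sketched as follows, using n = 5 and m = 95 as in the specific embodiment. The function name and return labels are hypothetical. Two points are assumptions because the text does not specify them: a variable with exactly n distinct values is passed on to the concentration check, and the concentration degree is computed over non-missing values.

```python
from collections import Counter

def choose_conversion(values, n=5, m=95):
    """Pick 'dummy', 'woe' or 'continuous' for one feature.

    values: raw feature values; None marks a missing value.
    n: distinct-value threshold; m: single-value concentration threshold in percent.
    """
    non_missing = [v for v in values if v is not None]
    # Character-type variables are converted to dummy features.
    if any(isinstance(v, str) for v in non_missing):
        return "dummy"
    counts = Counter(non_missing)
    # Numerical variable with few distinct values: WOE conversion.
    if len(counts) < n:
        return "woe"
    # Many distinct values: check how concentrated the most common value is.
    top_share = 100.0 * max(counts.values()) / len(non_missing)
    return "woe" if top_share > m else "continuous"

# Illustrative inputs only.
education_mode = choose_conversion(["primary school", "university", "middle school"])
overdue_months_mode = choose_conversion([0, 1, 2, 3, 0, 1])
balance_mode = choose_conversion(list(range(100)))
mostly_zero_mode = choose_conversion([0] * 96 + [1, 2, 3, 4])
```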
Specifically, WOE conversion proceeds as follows: the optimal cut points of the feature are found and the value range of the raw feature is divided into several bins; the WOE conversion value of each bin is calculated from the numbers of good and bad customers in that bin; and each raw feature value is replaced by the WOE conversion value of the bin into which it falls, which is then output. For each bin, the WOE value is calculated as follows:
$$\mathrm{WOE}_i = \ln\left(\frac{y_i / y_s}{n_i / n_s}\right)$$
where y_i is the number of non-defaulting customers in the i-th group; y_s is the total number of non-defaulting customers; n_i is the number of defaulting customers in the i-th group; and n_s is the total number of defaulting customers.
The WOE binning process is illustrated using age as an example. The raw feature age takes values from 18 to 50. Binning yields 5 bins: 18 to 24, 25 to 30, 31 to 35, 36 to 42, and 43 to 50 years old. The WOE conversion value of each bin is then calculated from the numbers of good and bad customers in that bin, and each user record is output with the conversion value of the bin into which it falls; for example, an age of 23 is output as the WOE value of the 18 to 24 bin, and an age of 46 as the WOE value of the 43 to 50 bin.
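The age example can be sketched as follows. The bin edges come from the example above; the good and bad counts are made up for illustration, and the WOE convention (logarithm of the good share over the bad share) is an assumption chosen to agree with the later requirement that WOE-converted features receive negative coefficients. Function names are hypothetical.

```python
import math
from bisect import bisect_left

# Inclusive upper edges of the bins 18-24, 25-30, 31-35, 36-42, 43-50.
AGE_UPPER_EDGES = [24, 30, 35, 42, 50]

def bin_index(age, upper_edges=AGE_UPPER_EDGES):
    """Index of the first bin whose upper edge is >= age."""
    return bisect_left(upper_edges, age)

def woe_table(good_counts, bad_counts):
    """WOE value per bin: ln((y_i / y_s) / (n_i / n_s))."""
    y_s, n_s = sum(good_counts), sum(bad_counts)
    return [math.log((y_i / y_s) / (n_i / n_s))
            for y_i, n_i in zip(good_counts, bad_counts)]

# Hypothetical per-bin counts of good and bad customers.
good = [300, 500, 600, 700, 400]
bad = [60, 50, 40, 30, 20]
table = woe_table(good, bad)

woe_for_23 = table[bin_index(23)]  # WOE value of the 18-24 bin
woe_for_46 = table[bin_index(46)]  # WOE value of the 43-50 bin
```

Every customer in the same bin receives the same converted value, which is the loss of within-bin resolution that the application attributes to WOE conversion.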
Dummy feature conversion proceeds as follows: a single categorical feature is converted into as many dummy features as it has distinct values; if a customer takes the value corresponding to a given dummy feature, that dummy feature is 1 and the remaining dummy features are 0.
Dummy feature conversion is illustrated using gender as an example. The raw feature takes the values male and female. After conversion, two dummy features, 'gender-male' and 'gender-female', are generated; if the customer is male, 'gender-male' is marked 1 and 'gender-female' is marked 0. Taking educational attainment as an example, the raw feature takes the values junior college and below, bachelor's degree, and master's degree and above. After conversion, three dummy features are generated: 'education-junior college and below', 'education-bachelor's degree' and 'education-master's degree and above'. If the customer holds a bachelor's degree, 'education-junior college and below' is marked 0, 'education-bachelor's degree' is marked 1, and 'education-master's degree and above' is marked 0.
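A minimal sketch of this conversion, using the gender and education examples above; the helper name `to_dummies` and the `feature-value` naming of the generated columns are illustrative choices.

```python
def to_dummies(feature_name, value, categories):
    """Expand one categorical value into one 0/1 dummy feature per category."""
    return {f"{feature_name}-{c}": int(value == c) for c in categories}

GENDER = ["male", "female"]
EDUCATION = ["junior college and below", "bachelor's degree", "master's degree and above"]

gender_row = to_dummies("gender", "male", GENDER)
education_row = to_dummies("education", "bachelor's degree", EDUCATION)
```

Exactly one dummy feature is 1 for each customer, and no ordering between the categories is implied.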
The continuous conversion mode is as follows: the raw feature is transformed in several continuous ways, including but not limited to taking the raw value directly, the square of the raw data, the square root of the raw data, the cube root of the raw data, and the natural logarithm of the raw data. The correlation coefficient r between each transformed feature value and the overdue label is calculated, the transformation with the largest absolute value of the correlation coefficient is selected, and the transformation selected for the raw feature is output. The correlation coefficient is calculated as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where Σ is the summation symbol; n is the total number of observations; x_i is the converted value of the raw feature of the i-th observation after continuous conversion; x̄ is the mean of the converted values; y_i is the binary feature indicating whether the i-th observation defaulted; and ȳ is the mean of that binary feature. The closer the absolute value of the correlation coefficient is to 1, the more strongly the converted value is related to default, and the better the conversion mode performs. The correlation coefficient r is interpreted as follows: an absolute value of at least 0 and less than 0.3 represents a low degree of correlation, an absolute value of at least 0.3 and less than 0.8 represents a medium degree of correlation, and an absolute value between 0.8 and 1 represents a high degree of correlation.
In the model construction of the application, for each feature confirmed to use continuous conversion, the optimal conversion method is selected based on the correlation between the feature and credit violation under the different continuous conversion modes. Preferably, the continuous conversion is performed by taking the raw value directly, the square of the raw data, the square root of the raw data, the cube root of the raw data, or the natural logarithm of the raw data.
The continuous conversion mode is illustrated using age as an example. The raw feature age takes values from 18 to 50. Continuous conversion yields the raw value, square root, cube root, natural logarithm and so on of the age; the absolute value of the correlation coefficient between each converted value and the good/bad label (prediction target) is then calculated, and the conversion mode with the largest absolute correlation coefficient is selected and its result output. If the cube root of age has a larger absolute correlation coefficient with the good/bad label than the other conversion modes, the cube root of the age is output as the converted value.
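Selecting the continuous conversion by absolute correlation can be sketched as follows. The candidate transformations are the five named in the text; the function names are hypothetical, and the sketch assumes strictly positive feature values (as with age) so that the square root and logarithm are defined, and a feature and label that are not constant.

```python
import math

# The five candidate transformations named in the text.
TRANSFORMS = {
    "original": lambda x: x,
    "square": lambda x: x * x,
    "square_root": math.sqrt,
    "cube_root": lambda x: x ** (1.0 / 3.0),
    "natural_log": math.log,
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_transform(values, labels):
    """Return the transformation name with the largest |r|, plus all scores."""
    scores = {
        name: abs(pearson_r([f(v) for v in values], labels))
        for name, f in TRANSFORMS.items()
    }
    return max(scores, key=scores.get), scores

# Made-up ages and default labels (1 = default), for illustration only.
ages = [18, 22, 27, 31, 36, 40, 45, 50]
defaults = [1, 1, 0, 1, 0, 0, 0, 0]
chosen, all_scores = best_transform(ages, defaults)
```

Unlike WOE conversion, every distinct raw value keeps a distinct converted value, so customers within what would have been one bin remain distinguishable.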
In the prior art, model construction mainly uses a single feature conversion mode, either WOE or dummy features. For example, Chinese patent CN112686749B converts feature values by the WOE method. WOE conversion is an optimal binning scheme based on the modeling samples, in which continuous features are discretized at the optimal cut points. It therefore performs well during model construction, but the binning result may overfit the modeling samples, so that the model effect degrades seriously when applied to the overall population (poor generalization). In addition, binning maps all raw values that fall into the same bin to the single value of that bin, so the ability to distinguish risk among customers within the same interval is lost.
Dummy feature conversion is mainly applied to grouping features, and its advantage is that it removes any ordering between different values of the feature. For example, for the industry in which a customer works, dummy features are the more suitable treatment because there is no obvious ranking between, say, the retail industry and the wholesale industry. By contrast, education levels such as junior college and bachelor's degree do differ in rank, so although dummy features can be used, WOE conversion is more appropriate in practice.
Continuous conversion, on the other hand, avoids the overfitting of the modeling samples that occurs with WOE conversion and generalizes better to the overall sample; and because no interval mapping is performed, far fewer customers in most customer groups are assigned the same single value. However, it cannot be applied to features with poor monotonicity or to discrete features (e.g., occupation, job title).
As described above, the technical solution of the present application takes a new approach by combining continuous conversion, WOE conversion and dummy feature conversion. Part of the carefully screened features are reprocessed, the advantages and disadvantages of the three conversion modes are weighed, and an original conversion judgment method is designed: the optimal conversion mode is selected for each feature according to parameters such as its data attributes, deletion rate and concentration degree, supplemented by judgment based on business logic.
< Logistic regression and feature depth screening based on logistic regression >
In existing credit risk scoring, a logistic regression model is mainly used for model development because of the requirement for model interpretability; software such as SAS, R or Python may be used.
In a specific embodiment, the present application is based on SAS software for model development.
Specifically, logistic regression fits the probability of predicted default using the Sigmoid function:
$$\mathrm{Sigmoid}(Z) = \frac{1}{1 + e^{-Z}}$$
where Z is the linear combination of the model coefficients and the feature conversion values, defined as follows:
$$Z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{k-1} x_{k-1} + \beta_k x_k$$
The probability of predicted default is:
$$P = P(Y = 1 \mid x_1, x_2, x_3, \dots, x_{k-1}, x_k)$$
After fitting, the probability of predicted default satisfies:
$$\ln\left(\frac{P}{1 - P}\right) = Z$$
From the above equation it can be further deduced that:
$$P = \frac{1}{1 + e^{-Z}}$$
Substituting the Z value into the above formula gives the probability P of predicted default.
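The calculation of P from Z can be sketched as follows. The coefficients and feature values are made up for illustration, and the function name is hypothetical.

```python
import math

def default_probability(alpha, betas, xs):
    """P(Y = 1 | x) for a fitted logistic regression.

    alpha: intercept; betas: coefficients beta_1..beta_k;
    xs: converted feature values x_1..x_k.
    """
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative numbers only: one WOE-converted feature (negative coefficient)
# and one continuously converted feature (positive coefficient).
p = default_probability(alpha=-2.0, betas=[-0.8, 1.5], xs=[0.5, 0.9])
```

A larger Z always gives a larger P, which is why the sign of each coefficient can be checked directly against the business trend of its feature.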
The core of logistic regression model construction is feature screening, which proceeds as follows. First, the features are screened in batches by deletion rate, concentration degree and information value IV. Second, all remaining features are checked one by one against the business trend, and only features with the correct business trend are retained. For example, if the bad account rate of the customer group is found to fall as the feature 'loan balance' increases, the feature is considered not to conform to the business trend: in credit business, the higher the loan balance, the higher the customer's exposure at default (EAD) and the greater the risk. The feature is then removed from the feature list. Third, the stepwise regression function of logistic regression is used to eliminate features that are of low importance and highly correlated with other features. Fourth, features are screened by comparing the sign of the training coefficient with the business trend of the feature conversion value, and only features whose coefficient sign conforms to business logic are retained. With 0 defined as a good customer and 1 as a bad customer in the Y label, the training coefficient of a feature whose bad account rate rises monotonically with the feature value (e.g., loan balance) should be positive; conversely, the training coefficient of a feature whose bad account rate falls monotonically with the feature value (e.g., deposit balance) should be negative. A feature whose coefficient sign does not meet these criteria is removed.
Fifth, highly correlated features are further eliminated using the variance inflation factor (VIF), the correlation coefficient and the like. For the VIF, the feature with the highest VIF is eliminated, one at a time, as long as that VIF is greater than 4. For correlation, within the feature pair with the highest correlation coefficient, as long as that coefficient is greater than 0.80, the feature with the lower IV value is eliminated, one at a time. Sixth, features that are unstable because their distribution differs greatly between time points, as measured by the population stability index (PSI), are removed: features with PSI > 0.25 are removed directly, and features with PSI > 0.1 are removed depending on how their removal affects the discriminating power of the model.
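The one-at-a-time elimination by correlation in the fifth step can be sketched as follows. The input layout (a dictionary of IV values and a dictionary of pairwise correlations keyed by feature pair) and the function name are assumptions, as is the use of the absolute value of the correlation coefficient; the 0.80 threshold and the rule of dropping the lower-IV feature come from the text.

```python
def prune_correlated(iv, corr, threshold=0.80):
    """Iteratively drop the lower-IV feature of the most correlated pair.

    iv: dict mapping feature name -> IV value.
    corr: dict mapping frozenset({a, b}) -> correlation coefficient of a and b.
    Returns the set of retained feature names.
    """
    kept = set(iv)
    while True:
        # Pairs of still-retained features whose correlation exceeds the threshold.
        candidates = [
            (abs(r), pair) for pair, r in corr.items()
            if pair <= kept and abs(r) > threshold
        ]
        if not candidates:
            return kept
        _, worst_pair = max(candidates, key=lambda item: item[0])
        # Within the most correlated pair, remove the feature with the lower IV.
        kept.discard(min(worst_pair, key=lambda name: iv[name]))

# Hypothetical feature names and values, for illustration only.
iv_values = {"loan_balance": 0.30, "credit_usage": 0.20, "overdue_months": 0.10}
correlations = {
    frozenset({"loan_balance", "credit_usage"}): 0.90,
    frozenset({"credit_usage", "overdue_months"}): 0.85,
    frozenset({"loan_balance", "overdue_months"}): 0.50,
}
retained = prune_correlated(iv_values, correlations)
```

Recomputing the candidate pairs after each removal matters: here, dropping `credit_usage` also resolves its correlation with `overdue_months`, so `overdue_months` is kept.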
The application screens features in strict accordance with the above rules, which ensures the interpretability and stability of the model and its ability to distinguish good customers from bad ones.
The feature fine screening of the application comprises: a first fine screening step of screening features with a stepwise regression algorithm; a second fine screening step of calculating the variance inflation factor of each feature and eliminating features with a high variance inflation factor; and a third fine screening step of analyzing, based on logistic regression over the features remaining after the first and second fine screening steps, whether each feature coefficient conforms to the expected trend of the credit violation prediction, so as to screen the features further.
Feature fine screening step 1 is based on a stepwise regression algorithm using the F test and the t test. Features are introduced one at a time in descending order of significance, and each time a feature is introduced, the features already selected are tested one by one. A previously introduced feature is rejected when it is no longer significant because of features introduced later. This process is repeated until no feature that meets the significance threshold remains to be entered into the equation and no feature that fails it remains to be removed from the regression equation.
Feature fine screening step 2 eliminates features with a high variance inflation factor, further reducing multicollinearity in the model.
Feature fine screening step 3 compares the risk characteristics (preset by the system) of the various risk points with the signs of the model training coefficients, judges whether the coefficient of each feature remaining in the model conforms to the business trend, eliminates features whose model coefficients do not conform, and iterates again. A specific embodiment of feature fine screening step 3 is as follows: 1. For features converted by WOE, the corresponding model training coefficient should be negative, and WOE-converted features whose training coefficient is positive should be removed. 2. For features converted continuously, if business logic dictates that the bad account rate rises as the feature value increases (e.g., credit line usage rate), the corresponding training coefficient should be positive, and features whose training coefficient is negative should be removed; if business logic dictates that the bad account rate falls as the feature value increases (e.g., deposit amount), the training coefficient should be negative, and features whose training coefficient is positive should be removed. 3. For features converted to dummy features, the bad account rates for the value 1 and the value 0 are compared: if the bad account rate for the value 1 is greater than that for the value 0, the coefficient should be positive; otherwise it should be negative.
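The three sign rules can be sketched as a single check. The function names, argument names and string labels are hypothetical; the rules themselves follow the embodiment described above.

```python
def expected_sign(conversion, business_trend=None, bad_rate_1=None, bad_rate_0=None):
    """Expected sign (+1 or -1) of a training coefficient, with Y = 1 meaning default."""
    if conversion == "woe":
        return -1  # WOE-converted features should have negative coefficients
    if conversion == "continuous":
        if business_trend == "increasing":
            return 1   # bad account rate rises with the feature value
        if business_trend == "decreasing":
            return -1  # bad account rate falls with the feature value
        raise ValueError("business_trend must be 'increasing' or 'decreasing'")
    if conversion == "dummy":
        # Compare the bad account rate of customers with value 1 against value 0.
        return 1 if bad_rate_1 > bad_rate_0 else -1
    raise ValueError("unknown conversion mode")

def sign_is_consistent(coefficient, **kwargs):
    """True if the fitted coefficient has the sign that business logic expects."""
    return coefficient * expected_sign(**kwargs) > 0

# Illustrative coefficients only.
woe_ok = sign_is_consistent(-0.8, conversion="woe")
usage_ok = sign_is_consistent(0.3, conversion="continuous", business_trend="increasing")
deposit_ok = sign_is_consistent(0.3, conversion="continuous", business_trend="decreasing")
```

Features for which the check returns False would be removed and the model refitted, as described in the iteration that follows.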
Feature fine screening step 4 is a data stability monitoring step, which evaluates whether the distribution of individual features and of the overall score shifts significantly between different time points. Features with PSI > 0.25 are removed directly, and features with 0.25 ≥ PSI > 0.1 are removed with caution, depending on how their removal affects the discriminating power of the model.
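The stability check can be sketched as follows. The text gives only the PSI thresholds, so the formula used here is the commonly used PSI definition and is an assumption, as are the function names and the action labels.

```python
import math

def psi(expected_counts, actual_counts):
    """Population stability index between two binned distributions.

    expected_counts: per-bin counts at the reference time point.
    actual_counts: per-bin counts at the comparison time point.
    Assumption: no bin is empty in either distribution.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = e / e_total
        a_share = a / a_total
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total

def psi_action(value):
    """Apply the thresholds from the text."""
    if value > 0.25:
        return "remove"
    if value > 0.1:
        return "review"  # removed with caution, depending on the effect on discrimination
    return "keep"

# Made-up bin counts for one feature at two time points.
shift = psi([50, 50], [80, 20])
decision = psi_action(shift)
```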
The feature list is iterated multiple times through steps 3 and 4 until no feature is added to or removed from the model, which yields the final feature list and its conversion values. Through the above steps, the modeling variables used to construct the model of the application are finally obtained.
In the credit violation probability modeling step of the application, the features selected by the feature fine screening step are substituted into the Sigmoid function, and the credit violation probability model is fitted by logistic regression.
In the prior art, the core issue for any scoring model is whether the data used are representative. For most existing scoring models, information security and cost considerations keep the sample size and the number of bad labels small, so the stability of the model cannot be guaranteed and a given type of customer cannot be modeled separately. Moreover, most scoring models on the market are built on multi-platform lending data and data with weak financial attributes (non-credit transaction data such as smart terminal device data, social platform data and online shopping data), which cannot accurately reflect a customer's assets and repayment capability. The data source used by the application is the full business data of a large bank, so the modeling samples and bad labels are very large in volume. Modeling samples are designed for different customer groups, so that risk differences among customers can be distinguished more finely, while the stability of the data and the model is ensured by out-of-time validation and PSI validation. The application uses personal credit history data and asset data for model development, and thus better reflects the borrower's willingness and ability to repay.
For the in-house scoring models of some small and medium-sized financial institutions, retail credit volume is small or the business was launched late, so the accumulated historical data are insufficient to develop a credit risk model with stable data and strong discriminating power; credit approval using such a model therefore depends heavily on manual review. The limited efficiency of manual review restricts the development of retail credit business, and its subjectivity increases operational risk in the approval process. The model constructed by the application can help financial institutions make digital decisions, improving approval accuracy and speed and reducing these adverse effects.
In terms of feature conversion, the traditional WOE conversion requires continuous features to be coarsely binned and discretized according to the data and the experience of the modeler, so the coarse binning result is strongly influenced by the modeler's subjective judgment; discretizing continuous features may also cause too many customers to fall into the same interval, so that many customers receive the same score. The application encodes the raw features by combining continuous conversion, WOE conversion and dummy feature conversion, which reduces the influence of human factors and single values and enhances the discriminating power of the score.
< Method of calculating retail Credit Risk >
The application further relates to a method for calculating the retail credit risk of the sample to be tested by using the retail credit risk model constructed by the application.
In the present application, retail credit risk refers to the probability of credit violations for the sample under test. In a specific manner, the probability of credit violation refers to the probability of occurrence of a credit overdue, specifically, for example, the probability of occurrence of a credit overdue for 30 days or more, for example, the discrimination and stability of occurrence of a credit overdue for a certain guest group for 30 days or more.
The application relates to a method for calculating retail credit risk, comprising: a data acquisition step of acquiring retail credit prediction data of a sample to be predicted; classifying the sample to be predicted based on a decision tree method to determine a sub-model for calculating credit violation probability; and a credit violation probability calculating step, wherein retail credit prediction data are substituted into a credit violation probability submodel to calculate the credit violation probability of the sample to be predicted.
In the present application, the credit breach probabilities encompass credit breach probabilities for various types of credit services. In a specific embodiment, the credit breach probability is a credit breach probability for an online credit service.
In the present application, the sample to be predicted may be a sample that was used to build the model, i.e., a customer who already held a credit service (including but not limited to a credit card or personal loan service) of the financial institution when the model was built. The sample to be predicted may also be a sample that was not used for modeling, i.e., a customer who did not hold a credit service of the financial institution when the model was built but holds one now. The sample may further be a customer who neither held a credit service of the financial institution when the model was built nor holds one now, but who already holds other financial-asset services. In each of these cases, the method of calculating retail credit risk of the present application can be used to calculate the customer's potential future retail credit risk.
The application relates to a method for calculating retail credit risk, comprising: a data acquisition step of acquiring retail credit prediction data of a sample to be predicted; classifying the sample to be predicted based on a decision tree method to determine a sub-model for calculating credit violation probability; a credit violation probability calculating step of substituting retail credit prediction data into a credit violation probability submodel to calculate the credit violation probability of the sample to be predicted; and after calculating the credit violation probability, calculating the credit score of the sample to be predicted.
The step of calculating the credit score of the sample to be predicted is used for calibrating the calculated credit violation probability to a normalized score of 0-1000 points.
In the method of the present application, the retail credit prediction data includes raw retail credit prediction data of the sample to be predicted and derived retail credit prediction data obtained by processing the raw retail credit prediction data. The raw and derived retail credit prediction data are as described in the model building section.
In the present application, the step of classifying the sample to be predicted includes the following sub-steps: determining whether the sample to be predicted is a customer who has applied for and registered a credit service at a financial institution; determining whether the sample to be predicted is a customer with stable income; determining whether the sample to be predicted was newly registered within the past m months; and determining the geographic area to which the business handled by the sample belongs. The sample to be predicted is classified based on these sub-steps to determine the sub-model for calculating the credit violation probability, and the order of the sub-steps can be set arbitrarily.
In a specific embodiment, m is, for example, 3, or 6, or 9, or 12.
In the present application, fig. 3 shows a flow chart for classifying samples to be predicted, and based on the classification of this step, a submodel for calculating the credit violation probability can be selected. The samples to be predicted are classified as shown in fig. 3 in the following order: firstly, judging whether the sample to be tested is a customer with stable income; then judging whether the sample to be tested is newly registered in m months; and then judging the geographical area to which the sample to be tested handles business. Using such step classification, the best fit sub-model for predicting the sample to be tested can be identified.
In one particular embodiment, the stable income may include payroll income, labor income, and other types of income. In one particular embodiment, the stable income may also be referred to as stable payroll income.
In a specific embodiment, a stable income customer is a customer who has received payroll payments continuously over the past 6 months and whose average monthly payroll is greater than 3000.
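The stable income rule above can be sketched as follows. This is a minimal illustration, assuming payroll history is available as a list of monthly amounts with the most recent month last; the function name and its parameters are hypothetical and do not come from the application.

```python
from typing import Sequence


def has_stable_income(monthly_payroll: Sequence[float],
                      months: int = 6,
                      min_average: float = 3000.0) -> bool:
    """Return True when payroll was received in each of the last `months`
    months and the average monthly payroll exceeds `min_average`.

    Assumption: a month with no payroll payment is recorded as 0.
    """
    recent = list(monthly_payroll)[-months:]
    if len(recent) < months:
        return False  # not enough history to show continuous payment
    if any(amount <= 0 for amount in recent):
        return False  # a gap breaks the "continuously paid" condition
    return sum(recent) / months > min_average
```

For example, six months at 3500 qualifies, while a single missing month or an average of exactly 3000 does not.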
According to the present application, the retail credit prediction data undergoes feature conversion before being substituted into the credit violation probability sub-model to calculate the credit violation probability of the sample to be predicted. The feature conversion step comprises selecting the WOE mode or the continuous mode based on the type of each feature that needs to be substituted into the credit violation probability sub-model.
Methods for feature transformation for WOE mode or continuous mode are well known to those skilled in the art, and reference is made to the description of the application in the model building section for a specific transformation mode. However, the existing model in the field only adopts the WOE mode to perform parameter conversion.
The WOE conversion mode is an optimal binning scheme based on the modeling samples: continuous features are discretized at optimal cut points, so it performs well when the model is built. However, the binning result can overfit the modeling samples, so the model effect degrades seriously (poor generalization) when it is applied to the overall population. In addition, the normalization used in binning maps all original values that fall into the same bin to the single value of that bin, so the ability to distinguish risk among people within the same interval is lost. The dummy feature conversion mode is mainly applied to grouping features and has the advantage of eliminating the difference between different values of the feature. For example, for the industry in which a client works, the dummy feature mode is more suitable because the retail industry and the wholesale industry have no obvious ranking; by contrast, there is a hierarchical difference between specialty and family, so although dummy features can be used, treatment with WOE is in practice more appropriate. The continuous conversion mode avoids the overfitting of modeling samples seen in the WOE mode and generalizes strongly to the overall sample; because no interval mapping is performed, most customer groups are less likely to collapse onto a single value. However, it cannot be applied to features with poor monotonicity or to discrete features (e.g., occupation, job title).
The present application processes the initial features by combining continuous conversion and WOE conversion, weighing the advantages and disadvantages of the two modes. A conversion judging module is designed that selects the optimal feature conversion mode according to parameters such as the feature's data attribute, missing rate, and concentration, assisted by judgment on business logic. The technical scheme of the present application takes a new approach: it combines three modes (continuous conversion, WOE conversion, and dummy feature conversion) to reprocess the carefully screened features, weighs the advantages and disadvantages of the three modes, and provides a conversion judging method that selects the optimal mode according to the same parameters and business-logic judgment. Substituting features converted in the mode so determined into the model for calculating retail credit risk allows the retail credit risk of the sample to be predicted more accurately.
The sub-model for calculating the credit violation probability is a model built by logistic regression on the existing customer population from the sample retail credit prediction data and the credit violation probability, i.e., a model constructed using the method described in the present application.
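To make the single-value-per-bin behaviour of WOE conversion concrete, the following minimal sketch computes WOE values for features that have already been binned. It assumes the convention WOE = ln(share of good customers in the bin / share of bad customers in the bin); the application does not state its formula, and the function name and the `eps` floor for empty cells are hypothetical.

```python
import math
from collections import defaultdict


def woe_and_iv(bins, is_bad, eps=0.5):
    """Return ({bin: WOE}, IV) for binned values and 0/1 bad flags.

    Assumed convention: WOE = ln(good share in bin / bad share in bin).
    eps replaces a zero count so the logarithm stays finite (an assumption).
    Both good and bad customers must be present in the data.
    """
    good = defaultdict(int)
    bad = defaultdict(int)
    for b, y in zip(bins, is_bad):
        if y:
            bad[b] += 1
        else:
            good[b] += 1
    total_good = sum(good.values())
    total_bad = sum(bad.values())
    woe, iv = {}, 0.0
    for b in set(bins):
        good_share = max(good[b], eps) / total_good
        bad_share = max(bad[b], eps) / total_bad
        woe[b] = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe[b]
    return woe, iv
```

Every customer in a bin receives that bin's WOE value, which is why differences between customers inside one interval are no longer visible to the model.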
<Apparatus, System and Computer Storage Medium for Calculating Retail Credit Risk>
The application relates to a device for calculating retail credit risk, comprising: the data acquisition module is used for acquiring retail credit prediction data of a sample to be predicted; a module for classifying the sample to be predicted, for classifying the sample to be predicted based on a decision tree method to determine a sub-model for calculating a credit violation probability; a credit breach probability calculation module for substituting retail credit prediction data into a credit breach probability submodel to calculate a credit breach probability for the sample to be predicted.
The apparatus for calculating retail credit risk of the present application can perform the steps of the method for calculating retail credit risk of the present application.
The application also relates to a system for calculating retail credit risk, which is characterized in that the system for calculating retail credit risk comprises: the system comprises a memory, a processor and the program for calculating retail credit risk stored in the memory and capable of running on the processor, wherein the program for calculating retail credit risk is executed by the processor to realize the steps of the method for calculating retail credit risk according to the application.
The application relates to a computer storage medium, characterized in that the computer storage medium has stored thereon a program for a method of calculating retail credit risk, which program, when being executed by a processor, implements the steps of the method of calculating retail credit risk according to the application.
All of the above description of the method for calculating retail credit risk may be fully applicable to the apparatus, system, and computer storage medium for calculating retail credit risk.
The method for calculating retail credit risk avoids the technical defects of building a model entirely from WOE variables or entirely from continuous variables, and of a model that adapts poorly to categorical variables. As a result, the model can better cover the retail credit risk prediction needs of many different types of customer groups, and can deliver more accurate predictions when it is used for general-purpose credit risk scoring.
Examples
Example 1 collection of modeling samples
In building this embodiment, credit card application and behavior data, personal loan application and behavior data, personal financial asset transaction data, and personal customer information data were collected for all retail customers in the construction bank between 2017 and 2021, 730 million people in total. The modeling sample was confirmed through a professional model design scheme, and the 620 million customer records from 2018 were selected as the analysis sample; the age distribution of these 620 million customers is shown in fig. 1. Since modeling requires a performance variable, the analysis sample is divided into two parts, customers who have applied for a registered credit service and customers who have not, with sample sizes of 120 million and 500 million people respectively.
Specifically, the Delta model scores the online credit service scenario. The model was designed and analyzed on the 9.1 million online credit service customers among the 120 million customers who have applied; once developed, the model is applied online to the full population of 730 million customers, both those who have applied and those who have not, or to other potential customer groups.
Model design includes: 1) Exclusion rules: customer data falling under the established, sold, no-performance, and special-situation exclusions are removed, leaving a modeling population of 2.4 million customers. 2) Time window setting: the following 15 months are taken as the performance period of the target variable Y; because this performance period is of moderate length, the more recent 2019 data is used as the modeling sample. 3) Sampling: the modeling sample is drawn with a good-to-bad sample ratio of 9:1. Different subdivision schemes are designed with a decision tree, the final model subdivision scheme is confirmed by comparing parent and child models, and each subdivision model is then built separately. The model subdivision scheme is designed around splits such as whether the customer has applied for credit, whether the customer has stable payroll income, whether the customer is a new customer, and the region to which the customer belongs. It fully reflects the control variables closely related to risk characteristics in the course of business and can comprehensively match the various customer groups in the market.
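The 9:1 sampling in item 3) can be sketched as follows. This is a minimal illustration under the assumption that all bad customers are kept and good customers are randomly down-sampled; the application does not say which side is sampled, and the record layout and function name are hypothetical.

```python
import random


def sample_good_bad(records, label_key="is_bad", good_per_bad=9, seed=0):
    """Keep every bad record and draw good records at random so that the
    good:bad ratio is at most good_per_bad:1.

    Assumption: down-sampling the good customers is one possible reading
    of "good/bad of 9:1"; it is not stated in the source.
    """
    rng = random.Random(seed)
    bad = [r for r in records if r[label_key] == 1]
    good = [r for r in records if r[label_key] == 0]
    n_good = min(len(good), good_per_bad * len(bad))
    sampled = bad + rng.sample(good, n_good)
    rng.shuffle(sampled)
    return sampled
```

With 1000 good and 10 bad records, the result holds 90 good and 10 bad records.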
In this embodiment, the modeling sample is first divided by the decision tree into the following sub-model modeling samples for the construction of the subsequent sub-models. The first layer of the decision tree divides the modeling sample according to whether stable payroll income exists, distinguishing whether the customer's income is stable. In this embodiment, the customer group without stable payroll income is assigned to the Delta 1 model. For the customer group with stable payroll income, it is determined whether the account age is less than 3 months (a registered credit service has been applied for, but fewer than 3 months have passed since the application); if so, the customer is a new customer and is assigned to the Delta 2 model.
For customers with stable payroll income and an account age of 3 months or more, customers are divided by the administrative region to which they belong. The division of administrative regions may be based on data issued by an authoritative statistics department, on a rating agency, or on a separately constructed financial model. The skilled person will understand that, once the division standard is selected, the customer population is divided without overlap into three different areas; combined with the three situations of low, medium, and high bad account rates, it is further divided into the three subdivision models Delta 3, Delta 4, and Delta 5. The bad account rate is the proportion of bad customers in a sample relative to the total number of customers in that sample.
In addition, in this embodiment, to build a model for the sample group that has not applied for a registered credit service, the customer group with an account age of less than 3 months is selected as an approximate customer group and assigned to the Delta 6 subdivision for model training.
In this embodiment, a stable income customer is a customer who has received payroll payments continuously over the past 6 months and whose average monthly payroll is greater than 3000.
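The routing described in the preceding paragraphs of this embodiment can be sketched as a small function. The sketch follows the wording of those paragraphs; the argument names, the tier labels, and in particular the mapping of low, medium, and high bad-account-rate areas to Delta 3, Delta 4, and Delta 5 in that order are assumptions made for illustration.

```python
def select_submodel(has_applied_credit, stable_income, account_age_months,
                    region_tier, new_customer_months=3, tier_to_model=None):
    """Return the name of the sub-model used to score one customer.

    region_tier is a hypothetical label ("low", "medium" or "high")
    for the bad account rate of the customer's area.
    """
    if tier_to_model is None:
        # Assumed order; the source only lists the tiers and the models.
        tier_to_model = {"low": "Delta 3", "medium": "Delta 4", "high": "Delta 5"}
    if not has_applied_credit:
        return "Delta 6"  # scored with the approximate new-customer model
    if not stable_income:
        return "Delta 1"
    if account_age_months < new_customer_months:
        return "Delta 2"
    return tier_to_model[region_tier]
```

Passing the mapping as a parameter keeps the region division replaceable, since the division standard may come from statistical data, a rating agency, or a separate financial model.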
Specifically, the customer subgroup used to construct the Delta 1 subdivision model has a sample size of about 1.08 million. The group consists mainly of stable income customers who have applied for credit business, and the model is built to predict the probability that these customers become 30 or more days overdue on credit.
Specifically, the customer subgroup used to construct the Delta 2 subdivision model has a sample size of about 40,000. The group consists mainly of new customers without stable income who have applied for credit business, and the model is built to predict the probability that these customers become 30 or more days overdue on credit.
Specifically, the customer subgroup used to construct the Delta 3 subdivision model has a sample size of about 300,000. The group consists mainly of existing customers without stable income who have applied for credit business and belong to an economically less developed area, and the model is built to predict the probability that these customers become 30 or more days overdue on credit.
Specifically, the customer subgroup used to construct the Delta 4 subdivision model has a sample size of about 750,000. The group consists mainly of existing customers without stable income who have applied for credit business and belong to an economically developed area, and the model is built to predict the probability that these customers become 30 or more days overdue on credit.
Specifically, the customer subgroup used to construct the Delta 5 subdivision model has a sample size of about 230,000. The group consists mainly of existing customers without stable income who have applied for credit business and belong to an economically moderately developed area, and the model is built to predict the probability that these customers become 30 or more days overdue on credit.
Specifically, the customer subgroup used to construct the Delta 6 subdivision model has a sample size of about 60,000. The group consists mainly of new customers who have applied for credit business and whose account age is less than 3 months, and the model is built to predict the probability that the sample group without a registered credit service application becomes 30 or more days overdue on credit in the future.
Based on the confirmed customer group samples of each sub-model, historical data of the borrowers is obtained: 1) Credit card data, whose basic fields contain account, overdue, balance, limit, repayment due, actual repayment, and other information under different dimensions such as bill, cash withdrawal, installment, and interest; 2) Personal loan data, containing account, overdue, balance, limit, repayment due, actual repayment, and other information; 3) Customer base information, including sex, age, and the administrative area to which the business belongs; 4) Personal financial asset data, including AUM, deposit, wealth management, and payroll information.
Macroscopic credit risk is split according to the various risk points in credit business, by information dimension, time slicing, and similar means. Variables are derived by a professional feature variable construction method, in the following basic forms: 1) Customer relationship length variables, such as the time since the customer opened an account and the customer's maximum account age; 2) Time interval variables, such as the number of months between the customer's last repayment and the observation point and the number of months between the customer's last overdue event and the current month; 3) Behavior frequency variables, such as the number of times repayment > N in the customer's last X months and the number of times the limit usage rate > N in the customer's last X months; 4) Current time point variables, such as the customer's current-month limit and current-month balance; 5) Statistical value variables, such as the customer's maximum overdue number in the last X months and average limit usage rate in the last X months; 6) Continuous behavior variables, such as the maximum number of consecutive times overdue > N in the customer's last X months and the number of consecutive times the repayment rate > N in the customer's last X months. In total, 3067 features with potential predictive power for customer overdue behavior are derived, such as "current repayment amount", "average overdue number of the last 6 months", and "number of credit card overdue events in the last 12 months". In this context, the observation point is the time at which the sample was taken for modeling, and "current" likewise refers to the sampling cutoff time; the two terms have the same meaning.
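Several of the derived-variable forms listed above can be sketched over a customer's monthly history. This is a minimal illustration, assuming each history is a list of monthly values with the observation month last; the function names are hypothetical and the sketch covers only the time interval, frequency, statistical, and continuous-behavior forms.

```python
def months_since_last(flags):
    """Time interval variable: months between the observation point and the
    most recent month in which the flag is true; None if it never occurred."""
    for offset, flag in enumerate(reversed(flags)):
        if flag:
            return offset
    return None


def count_above(values, threshold, window):
    """Behavior frequency variable: months in the last `window` with value > threshold."""
    return sum(1 for v in values[-window:] if v > threshold)


def window_max(values, window):
    """Statistical value variable: maximum over the last `window` months."""
    return max(values[-window:])


def max_consecutive_above(values, threshold, window):
    """Continuous behavior variable: longest run of consecutive months in the
    last `window` with value > threshold."""
    longest = current = 0
    for v in values[-window:]:
        current = current + 1 if v > threshold else 0
        longest = max(longest, current)
    return longest


# Hypothetical 12-month overdue counts, oldest month first.
overdue = [0, 1, 2, 0, 0, 1, 1, 1, 0, 0, 2, 0]
features = {
    "months_since_last_overdue": months_since_last([v > 0 for v in overdue]),
    "overdue_months_12m": count_above(overdue, 0, 12),
    "max_overdue_12m": window_max(overdue, 12),
    "max_consecutive_overdue_12m": max_consecutive_above(overdue, 0, 12),
}
```

Varying the window X and the threshold N over each basic field is one way a few dozen raw fields can expand into thousands of candidate features.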
Table 1 summary of the basic and derivative variables used in the examples
Example 2 feature preliminary screening
The preliminary screening for the 5 sub-model features of the partitions Delta 1 through Delta 5 in example 1 was performed as follows:
First round of preliminary screening: features with a missing rate exceeding 95% in the data collected in example 1 are removed first; 129 variables are deleted in total, leaving 2938 variables.
Second round of preliminary screening: among the 2938 features remaining after the first round, features in which a single value accounts for more than 99% are removed; 150 variables are removed in total, leaving 2788 variables.
Third round of preliminary screening: the 2788 features remaining after the second round are sorted by feature value (if the feature is a character variable, each value forms its own bin; if the feature is a numerical variable, the values are sorted from small to large), divided into 10-20 equal bins by quantile, and the IV value of each feature is calculated. In general a feature is rejected if its IV value is lower than 0.02; in this embodiment, because the variables are strong financial and asset variables with strong predictive effect, variables with an IV value lower than 0.05 are rejected. In total 842 features are rejected, leaving 1946 feature variables.
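The three screening rounds above can be sketched for a single feature as follows. This is a minimal illustration: missing values are assumed to be represented by None and are dropped before the IV calculation, the single-value share is computed over non-missing values, and zero cells are floored with `eps`; these choices and all function names are assumptions, not details given in the application.

```python
import math
from collections import Counter


def missing_rate(values):
    return sum(v is None for v in values) / len(values)


def top_value_share(values):
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    return Counter(present).most_common(1)[0][1] / len(present)


def quantile_bins(values, n_bins):
    """Equal-frequency bins by rank (tied values may straddle two bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def information_value(bins, is_bad, eps=0.5):
    good, bad = Counter(), Counter()
    for b, y in zip(bins, is_bad):
        if y:
            bad[b] += 1
        else:
            good[b] += 1
    total_good, total_bad = sum(good.values()), sum(bad.values())
    iv = 0.0
    for b in set(bins):
        g = max(good[b], eps) / total_good
        d = max(bad[b], eps) / total_bad
        iv += (g - d) * math.log(g / d)
    return iv


def keep_feature(values, is_bad, max_missing=0.95, max_single=0.99,
                 min_iv=0.05, n_bins=10):
    if missing_rate(values) > max_missing:       # round 1
        return False
    if top_value_share(values) > max_single:     # round 2
        return False
    rows = [(v, y) for v, y in zip(values, is_bad) if v is not None]
    xs = [v for v, _ in rows]
    ys = [y for _, y in rows]
    # Character values form their own bins; numeric values are quantile-binned.
    bins = xs if isinstance(xs[0], str) else quantile_bins(xs, n_bins)
    return information_value(bins, ys) >= min_iv  # round 3
```

The default `min_iv=0.05` mirrors the stricter threshold used in this embodiment; 0.02 would correspond to the general rule.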
Fourth round of preliminary screening: the 1946 features remaining after the third round are further screened with a stepwise discrimination algorithm, and 193 important feature variables can be rapidly screened out through this round. The stepwise discrimination method is described below. In actual data, the mean values of different categories on a given variable may differ very little, in which case classification using that variable will not work well; there are also variables that distinguish the categories well when considered on their own but become redundant once other variables are included in the model. This embodiment therefore adopts a stepwise discrimination method, which better screens out the more important feature variables within the same dimension, greatly reduces the workload of judging and screening variables one by one according to their trends in the fifth round, and improves model development efficiency without affecting the overall model effect.
The stepwise discrimination method uses the Wilks' lambda criterion to measure the discriminating power of each feature and removes, from the features remaining after the three rounds of screening, those that do not meet the set threshold. In the stepwise process, the variable with the strongest discriminating power is added first. As variables are gradually added to the model, the discriminating power of variables introduced earlier may change; if the discriminating power of a variable in the model falls below the threshold, the variable is removed. The process is repeated until all variables in the model meet the Wilks' lambda likelihood ratio criterion and no other variable reaches the standard for entering the model.
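The add-and-remove loop described above can be sketched as follows. This is a simplified illustration, not the procedure used in the embodiment: it enters and removes variables on fixed thresholds for the partial Wilks' lambda (the lambda with a variable divided by the lambda without it), whereas statistical packages usually apply an F test to that ratio. The thresholds `enter` and `remove` and all function names are hypothetical, and the features are assumed to be non-constant and not perfectly collinear.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def scatter(rows):
    """Sums of squares and cross-products of the rows about their mean."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(p)] for i in range(p)]


def wilks_lambda(columns, labels, selected):
    """det(within-group scatter) / det(total scatter); 1.0 for no variables."""
    if not selected:
        return 1.0
    rows = [[columns[j][i] for j in selected] for i in range(len(labels))]
    p = len(selected)
    within = [[0.0] * p for _ in range(p)]
    for group in set(labels):
        s = scatter([r for r, g in zip(rows, labels) if g == group])
        for i in range(p):
            for j in range(p):
                within[i][j] += s[i][j]
    return det(within) / det(scatter(rows))


def stepwise_select(columns, labels, enter=0.9, remove=0.95):
    """Return indices of selected columns (thresholds are hypothetical)."""
    selected = []
    for _ in range(2 * len(columns)):  # guard against cycling
        current = wilks_lambda(columns, labels, selected)
        candidates = [j for j in range(len(columns)) if j not in selected]
        if not candidates or current == 0:
            break
        best = min(candidates,
                   key=lambda j: wilks_lambda(columns, labels, selected + [j]))
        if wilks_lambda(columns, labels, selected + [best]) / current >= enter:
            break  # no remaining variable adds enough discriminating power
        selected.append(best)
        for j in list(selected):  # re-check variables introduced earlier
            rest = [k for k in selected if k != j]
            full = wilks_lambda(columns, labels, selected)
            if full / wilks_lambda(columns, labels, rest) > remove:
                selected = rest
    return selected
```

A smaller partial lambda means the variable explains more of the separation between good and bad customers given the variables already in the model.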
Fifth round of preliminary screening: the 193 important features remaining after the fourth round are further screened based on the (system-preset) risk characteristics of the various risk points and the actual bad account rates of the sample.
It is judged whether the actual bad account rate distribution of each remaining feature conforms to the business trend, and features whose actual bad account rate distribution in the sample does not conform to the business trend are removed. Specifically:
(1) Each remaining target feature is divided into 10-20 equal bins by quantile, and the median value of each bin and the corresponding bad account rate are calculated.
(2) The change rate (slope) is calculated from the median values and corresponding bad account rates of each bin and the preceding bin.
(3) Among the change rates between adjacent bins, the number greater than 0 and the number not equal to 0 are counted, and the percentage of change rates greater than 0 among the change rates not equal to 0 is calculated.
(4) The preset risk characteristic of the feature is obtained, and the percentage calculated in (3) is used to judge whether the actual trend of the bad account rate is approximately consistent with the trend expected from business logic.
The criteria for approximate agreement are as follows:
First, if the business logic trend of the feature is that the bad account rate increases as the feature value increases (e.g., credit limit usage rate), features for which the change rates greater than 0 account for less than 70% of the non-zero change rates are removed, i.e., features that do not show the business trend are removed.
Second, if the business logic trend of the feature is that the bad account rate decreases as the feature value increases (e.g., deposit amount), features for which the change rates greater than 0 account for more than 30% of the non-zero change rates are removed, i.e., features that do not show the business trend are removed.
70 features were removed by the fifth round of screening, leaving 123 features.
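Steps (1) to (4) and the 70%/30% criteria can be sketched as follows. This is a minimal illustration for one numeric feature without missing values; the function names and the "increasing"/"decreasing" labels for the preset business trend are hypothetical.

```python
from statistics import median


def positive_slope_share(values, is_bad, n_bins=10):
    """Share of positive slopes among non-zero slopes between adjacent
    quantile bins, or None when every slope is zero."""
    pairs = sorted(zip(values, is_bad), key=lambda p: p[0])
    n = len(pairs)
    stats = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            bin_median = median(v for v, _ in chunk)
            bad_rate = sum(y for _, y in chunk) / len(chunk)
            stats.append((bin_median, bad_rate))
    slopes = [(r1 - r0) / (m1 - m0)
              for (m0, r0), (m1, r1) in zip(stats, stats[1:]) if m1 != m0]
    nonzero = [s for s in slopes if s != 0]
    if not nonzero:
        return None
    return sum(s > 0 for s in nonzero) / len(nonzero)


def matches_business_trend(values, is_bad, expected, n_bins=10):
    """expected is "increasing" or "decreasing" (preset per feature)."""
    share = positive_slope_share(values, is_bad, n_bins)
    if share is None:
        return False  # assumption: a flat feature shows no trend
    if expected == "increasing":
        return share >= 0.70
    return share <= 0.30
```

A feature such as limit usage rate would be checked with `expected="increasing"`, and a feature such as deposit amount with `expected="decreasing"`.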
For the Delta 6 sub-model divided in example 1, the selected variables are the customer base information variables and the personal financial asset variables (316 feature variables in total), which are used to approximately predict the probability of future credit overdue for customers who have not applied for a registered credit service. The feature preliminary screening steps and the subsequent steps are the same as for the other sub-models.
Example 3 conversion of the features after Primary screening
Feature judging step: an optimal conversion mode is selected according to the concentration, data type, and other properties of the features that passed the preliminary screening. When judging the conversion mode, data types are generally divided into two main classes, character variables and numerical variables. For character variables, the dummy variable conversion mode is generally used. For numerical variables, the WOE conversion mode is used if the variable takes fewer than about 5 distinct values; if it takes more values, either WOE or continuous conversion is used, and the optimal mode is selected by the correlation with the target variable. The concentration of the variable must also be considered: if a continuous variable takes many values but a single value accounts for more than 95%, continuous processing is not performed and the WOE conversion mode is used directly. Features that use different conversion modes are grouped and placed into different data sets. Specifically,
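The judging rule above can be sketched as a small decision function. This is a minimal illustration: the source says only that the correlation with the target variable decides between WOE and continuous conversion, so comparing absolute correlations is an assumption, as are the function and parameter names.

```python
from collections import Counter


def choose_conversion(values, corr_woe, corr_continuous,
                      few_values=5, max_concentration=0.95):
    """Return "dummy", "woe" or "continuous" for one feature.

    corr_woe and corr_continuous are the correlations of the WOE-converted
    and continuously converted feature with the target variable, computed
    elsewhere (an assumption about how the comparison is made).
    """
    present = [v for v in values if v is not None]
    if any(isinstance(v, str) for v in present):
        return "dummy"                      # character variable
    counts = Counter(present)
    if len(counts) < few_values:
        return "woe"                        # few distinct values
    if counts.most_common(1)[0][1] / len(present) > max_concentration:
        return "woe"                        # single value too concentrated
    if abs(corr_continuous) >= abs(corr_woe):
        return "continuous"
    return "woe"
```

For example, an industry code stored as text is routed to dummy conversion, while a balance with many distinct values is routed by the correlation comparison.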
The remaining 123 features in example 2 were first determined, and the following three methods were selected according to the concentration of the features, the data type, and the like.
Feature conversion mode 1: features obtained from the data acquisition module for which the feature judging module determined WOE conversion to be optimal are converted by WOE conversion.
Feature conversion mode 2: features obtained from the data acquisition module for which the feature judging module determined dummy feature conversion to be optimal are converted by dummy feature conversion.
Feature conversion mode 3: features obtained from the data acquisition module for which the feature judging module determined continuous conversion to be optimal are converted by the optimal continuous conversion.
Feature merging module: the data produced by feature conversion modes 1, 2, and 3 are concatenated horizontally.
In this example, 123 features were converted: 44 by WOE conversion, 6 by dummy conversion (expanded into 14 variables), and 73 by continuous conversion. The conversion of the preliminarily screened features for the Delta 6 sub-model is substantially similar.
Example 4 feature depth screening (feature fine screening step)
This embodiment mainly performs the following 4 steps. Steps 1 and 2 remove features with strong multicollinearity from the output of the feature merging module, based on stepwise regression and variance inflation factor (VIF) calculation, to strengthen the robustness of the model. Step 3 removes features whose training coefficient signs in the model do not conform to the business trend. Step 4 uses the population stability index (PSI) to remove unstable features. This embodiment uses the LOGISTIC procedure in SAS for depth screening.
Feature fine screening step 1 is based on a stepwise regression algorithm with F tests and T tests. Features are introduced one at a time in order of significance from high to low, and each time a feature is introduced, the features already selected are tested one by one. A feature introduced earlier is removed when it is no longer significant because of features introduced later. This process is repeated until no feature meeting the significance threshold can be added to the equation and no feature failing the significance threshold remains to be removed from the regression equation. Through this step, the 131 features are reduced to 67 features.
Feature fine screening step 2 removes features with high variance inflation factors to further reduce multicollinearity in the model. Through this step, the 67 features are reduced to 55 features.
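The variance inflation factor used in step 2 can be sketched as follows. The sketch uses the identity that the VIF of feature j equals the j-th diagonal element of the inverse of the feature correlation matrix; the removal threshold of 10 is a common rule of thumb assumed here for illustration, since the application does not state its threshold, and the function names are hypothetical. Features are assumed to be non-constant and not perfectly collinear.

```python
import math


def correlation_matrix(columns):
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def drop_high_vif(columns, names, threshold=10.0):
    """Repeatedly drop the feature with the largest VIF above the threshold."""
    columns, names = list(columns), list(names)
    while len(columns) > 1:
        scores = vif(columns)
        worst = max(range(len(scores)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break
        del columns[worst]
        del names[worst]
    return names
```

Recomputing the factors after each removal matters because dropping one of two collinear features usually brings the other back to an acceptable value.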
Feature fine screening step 3 compares the (system-preset) risk characteristics of the various risk points with the signs of the model training coefficients, judges whether the coefficients of the remaining features in the model conform to the business trend, removes features whose model coefficients do not conform to the business trend, and iterates again.
The specific implementation of feature fine screening step 3 is as follows:
1. For features converted in WOE mode, the corresponding model training coefficient should be negative, and WOE-converted features whose training coefficient is positive are removed.
2. For features converted in continuous mode, if business logic says that the bad account rate should increase as the feature value increases (e.g., credit limit usage rate), the corresponding model training coefficient should be positive, and continuously converted features whose training coefficient is negative are removed; if business logic says that the bad account rate should decrease as the feature value increases (e.g., deposit amount), the corresponding model training coefficient should be negative, and continuously converted features whose training coefficient is positive are removed.
3. For features converted in dummy mode, the bad account rates for the value 1 and the value 0 are compared: if the bad account rate for the value 1 is greater than that for the value 0, the coefficient should be positive; otherwise, the coefficient should be negative.
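The three sign rules above can be sketched as one check. This is a minimal illustration; the function name, the mode labels, and the "increasing"/"decreasing" labels for the preset business trend are hypothetical.

```python
def coefficient_matches_trend(conversion, coefficient, expected_trend=None,
                              bad_rate_when_1=None, bad_rate_when_0=None):
    """Return True when the sign of a trained coefficient agrees with the
    business trend for the given conversion mode."""
    if conversion == "woe":
        return coefficient < 0
    if conversion == "continuous":
        if expected_trend == "increasing":
            return coefficient > 0
        return coefficient < 0
    if conversion == "dummy":
        if bad_rate_when_1 > bad_rate_when_0:
            return coefficient > 0
        return coefficient < 0
    raise ValueError(f"unknown conversion mode: {conversion}")
```

Features for which the check returns False would be removed before the model is refitted in the next iteration.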
Feature fine screening step 4 is a data stability monitoring step that evaluates whether the distributions of individual features and of the overall score shift noticeably between different time points. Features with PSI > 0.25 are removed directly; features with 0.25 ≥ PSI > 0.1 are removed with caution, depending on how their removal affects the discriminating ability of the model.
The feature list is iterated multiple times through steps 3 and 4 until no feature is added to or removed from the model, giving the final feature list and its converted values. Through these steps, the 55 features are reduced to the feature variables used for final modeling; for example, 5, 6, 7, or 8 features may be adopted.
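The stability check in step 4 can be sketched with the usual PSI formula, the sum over bins of (actual share - expected share) × ln(actual share / expected share). The application does not spell out its formula, so this form, the `eps` floor for empty bins, and the function names are assumptions; the 0.25 and 0.1 thresholds are those stated above.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    expected_counts: bin counts at the baseline time point.
    actual_counts: counts for the same bins at the comparison time point.
    """
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / expected_total, eps)
        a_share = max(a / actual_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value


def stability_action(psi_value):
    if psi_value > 0.25:
        return "remove"
    if psi_value > 0.1:
        return "review"  # remove with caution, weighing the loss of discrimination
    return "keep"
```

The same function can be applied to the binned overall score to monitor the model as a whole.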
EXAMPLE 5 construction of Delta 1 model
In general, the stronger the correlation between a feature variable and the target variable, the more it can improve the accuracy of the final model. In this embodiment, Delta 1 is a sub-model of the subdivision model. The sample obtained by decision-tree classification contains about 1.08 million records; the customer group consists mainly of customers with stable income who have applied for credit service, and the model is built to predict the probability (the target variable) that customers in this group become 30 or more days overdue on credit.
In this Example 5, the final modeling result is described taking seven features as an example: the maximum number of overdue periods on credit accounts in the past 12 months, the current remaining credit card limit, the total AUM (financial assets) value for the current month, the maximum salary in the past 12 months, the average line usage rate of personal loans in the past 3 months, the number of months in the past 12 months in which the amount due on personal loans increased continuously, and the number of months in the past 12 months with credit card interest > 0. In the conversion mode selection of the feature conversion step, the 7 model-entry variables have been converted in the appropriate form according to the correlation between each feature and the target variable. In the credit default probability modeling step, the 7 features retained by the feature fine screening step are substituted into a Sigmoid function and fitted by logistic regression (using the LOGISTIC procedure of SAS software) to obtain the model for calculating the credit default probability.
The maximum number of overdue periods on credit accounts in the past 12 months represents the borrower's historical credit performance. For customers with stable income, failing to repay other loans on time in the recent past has a direct negative effect on the performance of future loans and is very likely to lead to future overdue payments. The relationship between this variable and customer risk is as follows: the smaller the maximum number of overdue periods in the past 12 months, the better the borrower's repayment habits and the lower the likelihood of serious default; conversely, the larger the maximum number of overdue periods, the worse the borrower's repayment habits and the higher the likelihood of serious default. The conversion mode is WOE conversion.
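A WOE conversion of a binned feature such as this one can be sketched as follows. The sketch assumes the common convention WOE = ln(share of good accounts in the bin / share of bad accounts in the bin), which matches the rule that WOE-converted features carry negative coefficients; the bin boundaries and counts are hypothetical and are not taken from the application.

```python
import math
from typing import Dict, Tuple


def woe_table(bin_counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Compute the WOE value of each bin.

    `bin_counts` maps a bin label to (good account count, bad account count).
    """
    total_good = sum(good for good, _ in bin_counts.values())
    total_bad = sum(bad for _, bad in bin_counts.values())
    table = {}
    for label, (good, bad) in bin_counts.items():
        if good == 0 or bad == 0:
            # A bin without goods or without bads has no finite WOE; merge it first.
            raise ValueError(f"bin {label!r} must contain both good and bad accounts")
        table[label] = math.log((good / total_good) / (bad / total_bad))
    return table


if __name__ == "__main__":
    # Hypothetical bins for the maximum number of overdue periods in the past 12 months.
    counts = {"0": (9000, 100), "1": (800, 100), "2+": (200, 200)}
    for label, value in woe_table(counts).items():
        print(label, round(value, 4))
```

Under this convention a bin with fewer overdue periods receives a higher WOE value, so a negative coefficient lowers the predicted default probability for it.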
The current credit card residual limit reflects the credit demand of borrowers. For people with stable income, excessive consumption caused by unbalanced balance is likely to lead to credit card expiration and even overdue, so the credit card residual amount can well reflect the consumption requirement of customers. The expression relationship between the variable and the customer risk is that the higher the residual credit card amount is, the lower the credit demand of the borrower is, and the lower the possibility of serious default is; conversely, the lower the current credit card remaining amount, the higher the credit demand on the borrower, and the greater the likelihood of serious default. The conversion mode is natural logarithmic conversion in continuous conversion.
The total value of AUM (financial assets) in the current month reflects the repayment capability of borrowers. For people with stable income, the current total value of AUM (financial assets) represents the recent economic capability of borrowers, and under the condition that the economic capability is sufficient, the possibility of overdue occurrence in the future is low. The expression relationship between the variable and the customer risk is that the lower the total value of AUM (financial asset) in the current month is, the lower the repayment capability of borrowers is, and the higher the possibility of serious default is; conversely, the higher the total value of AUM (financial asset) in the current month, the stronger the borrower's repayment ability and the lower the likelihood of serious default. The conversion mode is square root conversion.
The maximum payroll over the past 12 months represents the borrower's revenue level. For the population with stable income, the maximum payroll value of the past 12 months represents the income level and repayment capability of the borrower in the middle period. The performance relationship between the variable and the customer risk is that the smaller the maximum value of wages in the past 12 months is, the lower the income of borrowers is, the lower the repayment capability is, and the higher the possibility of serious default is; conversely, the greater the maximum payroll over the past 12 months, the greater the borrower's revenue and repayment capacity, and the lower the likelihood of serious violations. The conversion mode is cube root conversion.
The average limit use rate of the personal loan for the past 3 months reflects the credit requirement of borrowers. For people with stable income, a short term high credit demand may lead to excessive liabilities, ultimately leading to overdue occurrences. The performance relationship between the variable and the customer risk is that the lower the average limit use rate of the personal loan for the past 3 months is, the lower the credit demand of the borrower is, and the lower the possibility of serious default is; conversely, the higher the average rate of usage of the individual loan over the past 3 months, the higher the credit demand on the borrower and the greater the likelihood of serious violations. The conversion mode is square conversion in continuous conversion.
The number of months in the past 12 months in which the amount due on personal loans increased continuously reflects the borrower's credit demand. For customers with stable income, credit demand that rises over a period of time shows up directly as a continuously increasing amount due, which can leave the borrower over-indebted and eventually overdue. The relationship between this variable and customer risk is as follows: the smaller the number of months of continuous increase in the past 12 months, the lower the borrower's short-term credit demand and the lower the likelihood of serious default; conversely, the larger the number of months of continuous increase, the higher the borrower's short-term credit demand and the higher the likelihood of serious default. The conversion mode is WOE conversion.
The number of months in the past 12 months with credit card interest > 0 represents the borrower's credit demand. For customers with stable income, a credit card that frequently accrues interest means that the customer has recently made frequent use of interest-bearing (non-interest-free) credit card services and has high credit demand, which may lead to overdue payments in the future. The smaller the number of months with credit card interest > 0 in the past 12 months, the lower the borrower's credit demand and repayment pressure and the lower the likelihood of serious default; conversely, the larger the number of such months, the higher the borrower's credit demand and repayment pressure and the higher the likelihood of serious default. The conversion mode is WOE conversion.
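The two month-count features described above could be derived from monthly series roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the series are ordered oldest to newest, and "continuous increase" is read here as the longest run of consecutive month-over-month increases, which is only one possible interpretation of the feature.

```python
from typing import Sequence


def months_with_positive_interest(monthly_interest: Sequence[float]) -> int:
    """Count the months in which credit card interest was greater than 0."""
    return sum(1 for amount in monthly_interest if amount > 0)


def longest_increase_streak(monthly_amount_due: Sequence[float]) -> int:
    """Length of the longest run of consecutive month-over-month increases."""
    longest = 0
    current = 0
    for previous, latest in zip(monthly_amount_due, monthly_amount_due[1:]):
        if latest > previous:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # an unchanged or lower amount breaks the run
    return longest


if __name__ == "__main__":
    interest = [0, 12.5, 0, 3.0, 0, 0, 0, 0, 0, 0, 0, 1.0]  # hypothetical 12 months
    amount_due = [100, 110, 120, 120, 130, 140, 150, 100, 100, 100, 100, 100]
    print(months_with_positive_interest(interest), longest_increase_streak(amount_due))
```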
In the step of calculating credit-default probability, the probability (P) of characterizing the borrower default is predicted using the following model formula according to the data and coefficient conditions acquired in the feature conversion step:
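The formula referred to here follows from the Sigmoid (logistic regression) form named in the modeling step. A reconstruction under that assumption, using the intercept α, the coefficients β_i, the converted feature values X_i and the feature count k defined in the surrounding text, is:

```latex
P = \frac{1}{1 + e^{-\left(\alpha + \sum_{i=1}^{k} \beta_i X_i\right)}} \tag{1}
```

With this form, a positive coefficient raises the predicted default probability as the converted feature value increases, and a negative coefficient lowers it.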
Where k is the number of features of the model entered, and k is 7 in equation 1.
α is the intercept term, with a value range of (4.5954, 4.9098), most preferably 4.7526; β1 is the coefficient for the maximum number of overdue periods on credit accounts in the past 12 months, with a value range of (-0.7309, -0.7094), most preferably -0.7201; β2 is the coefficient for the current remaining credit card limit, with a value range of (-0.3564, -0.3502), most preferably -0.3533; β3 is the coefficient for the total AUM (financial assets) value for the current month, with a value range of (-0.241, -0.223), most preferably -0.232; β4 is the coefficient for the maximum salary in the past 12 months, with a value range of (-0.3124, -0.2744), most preferably -0.2934; β5 is the coefficient for the average line usage rate of personal loans in the past 3 months, with a value range of (0.2713, 0.3168), most preferably 0.294; β6 is the coefficient for the number of months in the past 12 months in which the amount due on personal loans increased continuously, with a value range of (-1.2154, -1.0571), most preferably -1.1363; β7 is the coefficient for the number of months in the past 12 months with credit card interest > 0, with a value range of (-0.2852, -0.2503), most preferably -0.2678. (Note: the value ranges are taken from the 95% confidence intervals, shown as 95% CI in the table below.)
X1 is the WOE conversion value of the maximum number of overdue periods on credit accounts in the past 12 months, generated by the feature conversion step; X2 is the natural logarithm conversion value of the current remaining credit card limit; X3 is the square root conversion value of the total AUM (financial assets) value for the current month; X4 is the cube root conversion value of the maximum salary in the past 12 months; X5 is the square conversion value of the average line usage rate of personal loans in the past 3 months; X6 is the WOE conversion value of the number of months in the past 12 months in which the amount due on personal loans increased continuously; X7 is the WOE conversion value of the number of months in the past 12 months with credit card interest > 0. The model results for these features are presented in Table 2 below.
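To illustrate how the most preferred coefficients combine with the converted values X1 to X7, the sketch below evaluates the Sigmoid of the linear score. It assumes the standard logistic form P = 1 / (1 + e^-(α + Σ βi·Xi)); the function name and the sample X values are hypothetical, and the inputs are expected to be already converted (WOE, logarithm, roots, square) by the feature conversion step.

```python
import math
from typing import Sequence

# Most preferred values stated in the text for the Delta 1 sub-model.
DELTA1_INTERCEPT = 4.7526
DELTA1_COEFFS = (-0.7201, -0.3533, -0.232, -0.2934, 0.294, -1.1363, -0.2678)


def default_probability(x: Sequence[float],
                        intercept: float = DELTA1_INTERCEPT,
                        coeffs: Sequence[float] = DELTA1_COEFFS) -> float:
    """Sigmoid of the linear score intercept + sum(beta_i * x_i)."""
    if len(x) != len(coeffs):
        raise ValueError("one converted value is required per coefficient")
    z = intercept + sum(beta * value for beta, value in zip(coeffs, x))
    # Numerically stable Sigmoid for large positive or negative scores.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


if __name__ == "__main__":
    converted = [0.5, 8.0, 10.0, 10.0, 0.25, 0.3, 0.2]  # hypothetical X1..X7
    print(default_probability(converted))
```

Raising X5 (the squared usage rate, positive coefficient) increases the result, while raising any of the other converted values decreases it.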
TABLE 2
The P-values for all model features were less than 0.05, indicating that the features were significantly correlated with default performance.
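The significance statement can be cross-checked against the stated value ranges. The sketch below assumes the 95% confidence intervals are symmetric Wald intervals (each most preferred value sits at the midpoint of its range, which is consistent with that assumption), recovers the standard error as (upper − lower) / (2 × 1.96), and computes a two-sided normal P-value. The function name is hypothetical, and the P-values reported in Table 2 itself are not reproduced here.

```python
import math

# (most preferred value, lower bound, upper bound) for the Delta 1 sub-model, from the text.
DELTA1_ESTIMATES = {
    "alpha": (4.7526, 4.5954, 4.9098),
    "beta1": (-0.7201, -0.7309, -0.7094),
    "beta2": (-0.3533, -0.3564, -0.3502),
    "beta3": (-0.232, -0.241, -0.223),
    "beta4": (-0.2934, -0.3124, -0.2744),
    "beta5": (0.294, 0.2713, 0.3168),
    "beta6": (-1.1363, -1.2154, -1.0571),
    "beta7": (-0.2678, -0.2852, -0.2503),
}


def wald_p_value(estimate: float, ci_low: float, ci_high: float,
                 z_crit: float = 1.959964) -> float:
    """Two-sided P-value implied by an estimate and its symmetric 95% interval."""
    standard_error = (ci_high - ci_low) / (2.0 * z_crit)
    z = estimate / standard_error
    # For a standard normal variable, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    for name, (estimate, low, high) in DELTA1_ESTIMATES.items():
        print(name, wald_p_value(estimate, low, high) < 0.05)
```

Because none of the intervals contains zero, every implied P-value falls below 0.05, in line with the statement above.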
EXAMPLE 6 construction of Delta 2 model
Delta 2 is a sub-model of the subdivision model. The sample obtained by decision-tree classification contains about 40,000 records; the customer group consists mainly of new customers without stable income who apply for credit service, and the model is built to predict the probability (the target variable) that customers in this group become 30 or more days overdue on credit.
In this Example 6, the final modeling result is described taking five features as an example: the number of months from the minimum time-point balance of deposit accounts in the past 12 months to the current month, the average line usage rate of cyclic credit in the past 3 months, the minimum AUM in the past 3 months, the average balance of investment and financial accounts in the past 3 months, and the minimum time-point balance of deposit accounts in the past 3 months. In the conversion mode selection of the feature conversion step, the 5 model-entry variables have been converted in the appropriate form according to the correlation between each feature and the target variable. In the credit default probability modeling step, the 5 features retained by the feature fine screening step are substituted into a Sigmoid function and fitted by logistic regression (using the LOGISTIC procedure of SAS software) to obtain the model for calculating the credit default probability.
The number of months from the minimum time-point balance of deposit accounts in the past 12 months to the current month reflects how the borrower's level of funds has changed. For new customers without stable income, the further in the past the minimum deposit balance occurred, the more ample the borrower's current funds are compared with before, and the less likely future borrowing is to become overdue. The relationship between this variable and customer risk is as follows: the larger the number of months between the minimum balance and the current month, the better the change in the borrower's level of funds and the lower the likelihood of serious default; conversely, the smaller the number of months between the minimum balance and the current month, the worse the change in the borrower's level of funds and the higher the likelihood of serious default. The conversion mode is WOE conversion.
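The raw value of this feature, before WOE conversion, could be derived along the following lines. This is a sketch with hypothetical names: balances are assumed to be ordered oldest to newest with the current month last, and when the minimum occurs more than once the most recent occurrence is used, which is an assumption the text does not settle.

```python
from typing import Sequence


def months_since_minimum(balances_oldest_first: Sequence[float]) -> int:
    """Months between the most recent minimum balance and the current (last) month."""
    if not balances_oldest_first:
        raise ValueError("at least one monthly balance is required")
    minimum = min(balances_oldest_first)
    last_minimum_index = max(
        index for index, balance in enumerate(balances_oldest_first) if balance == minimum
    )
    return len(balances_oldest_first) - 1 - last_minimum_index


if __name__ == "__main__":
    # Hypothetical time-point balances for the past 12 months.
    balances = [500, 300, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700]
    print(months_since_minimum(balances))
```

A larger result means the low point lies further in the past, which the text associates with lower risk.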
The average value of the usage rate of the line of 3 months in the past of the cyclic lending represents the credit demand of borrowers. For new customers without steady revenue, the short term high credit demand may lead to excessive liabilities, ultimately leading to the occurrence of overdue. The expression relationship between the variable and the customer risk is that the lower the average value of the usage rate of the line of the 3 months in the past of the cyclic credit, the lower the credit demand of the borrower, the lower the possibility of serious default; conversely, the higher the average value of the usage rate of the line of 3 months in the cycle credit, the higher the credit demand of the borrower, and the higher the probability of serious default. The conversion mode is WOE conversion.
AUM minimum value in the past 3 months reflects the repayment ability of borrowers. For new customers without stable revenue, the minimum AUM over the last 3 months represents the borrower's recent economic capability, and a larger minimum AUM variation indicates more sufficient economic capability and a lower likelihood of future overdue. The performance relationship between the variable and the customer risk is that the lower the AUM minimum value of the past 3 months is, the lower the repayment capability of the borrower is, and the higher the possibility of serious default is; conversely, the higher the AUM minimum over the past 3 months, the greater the borrower's repayment capacity and the lower the likelihood of serious violations. The conversion mode is square root conversion in continuous conversion.
The average balance of the financial account is invested for the past 3 months, so that the investment level and repayment capability of borrowers are reflected. For new customers without stable income, the average balance of the investment financial account over the past 3 months represents the borrower's short-term investment level and repayment capability. The expression relationship between the variable and the client risk is that the lower the average balance of the investment financial account in the past 3 months is, the lower the investment level of the borrower is, the lower the repayment capability is, and the higher the possibility of serious default is; conversely, the greater the average balance of the investment financial account over the past 3 months, the higher the borrower's investment level and the higher the repayment capacity, and the lower the likelihood of serious violations. The conversion mode is square root conversion in continuous conversion.
The minimum balance of the past 3 months deposit account time point represents the deposit level and repayment capability of the borrower. For new customers without stable income, the minimum balance of the past 3 months deposit account time point represents the short-term deposit level and repayment capability of the borrower, and if the balance of the borrower is balanced, the larger the deposit of the borrower is, the smaller the future overdue probability of the borrower is. The expression relationship between the variable and the customer risk is that the lower the minimum balance of the past 3 month deposit account, the lower the deposit level of the borrower, the lower the repayment capability and the higher the possibility of serious default; conversely, the higher the minimum balance of the deposit account over the last 3 months, the higher the borrower's deposit level and the higher the repayment capacity, and the lower the likelihood of serious violations. The conversion mode is WOE conversion.
In the step of calculating credit-default probability, the probability (P) of characterizing the borrower default is predicted using the following model formula according to the data and coefficient conditions acquired in the feature conversion step:
where k is the number of features of the model entered, and k is 5 in equation 2.
α is the intercept term, with a value range of (-0.6781, -0.4868), most preferably -0.582; β1 is the coefficient for the number of months from the minimum time-point balance of deposit accounts in the past 12 months to the current month, with a value range of (-0.7058, -0.529), most preferably -0.617; β2 is the coefficient for the average line usage rate of cyclic credit in the past 3 months, with a value range of (0.0666, 0.114), most preferably 0.090; β3 is the coefficient for the minimum AUM in the past 3 months, with a value range of (-0.6899, -0.4916), most preferably -0.591; β4 is the coefficient for the average balance of investment and financial accounts in the past 3 months, with a value range of (-0.7065, -0.3619), most preferably -0.534; β5 is the coefficient for the minimum time-point balance of deposit accounts in the past 3 months, with a value range of (-0.118, -0.0557), most preferably -0.087. (Note: the value ranges are taken from the 95% confidence intervals, shown as 95% CI in the table below.)
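One way to read these coefficients is as odds ratios: in a logistic regression, raising a converted feature value by one unit multiplies the odds of default, P / (1 − P), by e^β. The sketch below applies this standard identity to the most preferred Delta 2 values; the function name is hypothetical, and the labels X1 to X5 refer to the converted feature values of this sub-model.

```python
import math

# Most preferred coefficient values stated in the text for the Delta 2 sub-model.
DELTA2_COEFFS = {"X1": -0.617, "X2": 0.090, "X3": -0.591, "X4": -0.534, "X5": -0.087}


def odds_ratio(beta: float, step: float = 1.0) -> float:
    """Factor by which the odds of default change when the converted value rises by `step`."""
    return math.exp(beta * step)


if __name__ == "__main__":
    for name, beta in DELTA2_COEFFS.items():
        print(name, round(odds_ratio(beta), 4))
```

Only X2 has an odds ratio above 1, so it is the only feature for which a higher converted value raises the odds of default.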
X1 is a WOE conversion value of the minimum balance of the past 12 month deposit account time points generated in the feature conversion step from the current month number; x2 is the square conversion value of the average value of the usage rate of the line of the past 3 months of the cyclic credit generated in the feature conversion step; x3 is the natural logarithmic conversion value of the AUM minimum value of the past 3 months generated in the feature conversion step; x4 is the square root conversion value of the average balance of the investment financial account of the past 3 months generated in the characteristic conversion step; x5 is the cube root conversion value of the minimum balance of the past 3 month deposit account timepoints generated by the feature conversion step. The model of the partial features is presented in table 3 below:
TABLE 3
The P-values for all model features were less than 0.05, indicating that the features were significantly correlated with default performance.
EXAMPLE 7 construction of Delta 3 model
Delta 3 is a sub-model of the subdivision model. The sample obtained by decision-tree classification contains about 300,000 records; the customer group consists mainly of existing ("stock") customers without stable income who apply for credit service and who are located in economically underdeveloped areas, and the model is built to predict the probability (the target variable) that customers in this group become 30 or more days overdue on credit.
In this Example 7, the final modeling result is described taking six features as an example: the current time-point balance of deposit accounts, the maximum number of overdue periods of the credit card in the past 12 months, the minimum credit line usage rate in the past 12 months, the difference between the monthly minimum amount due on credit and the monthly AUM over the past 3 months, the average repayment amount of personal loans in the past 12 months, and the current remaining credit card limit. In the conversion mode selection of the feature conversion step, the 6 model-entry variables have been converted in the appropriate form according to the correlation between each feature and the target variable. In the credit default probability modeling step, the 6 features retained by the feature fine screening step are substituted into a Sigmoid function and fitted by logistic regression (using the LOGISTIC procedure of SAS software) to obtain the model for calculating the credit default probability.
The balance of the deposit account at the current time point reflects the deposit level and repayment capability of the borrower. For the customers with no stable income in the less developed economic areas, when the income level is larger than the liability level, the customers have a certain deposit balance, the repayment capability of the customers can be reflected, and the possibility of overdue future borrowing is low. The expression relationship between the variable and the customer risk is that the lower the balance of the deposit account at the current time point is, the lower the deposit level of the borrower is, the lower the repayment capability is, and the higher the possibility of serious default is; conversely, the higher the balance of the deposit account at the current time, the higher the deposit level of the borrower and the higher the repayment capability, and the probability of serious default is reduced. The conversion mode is natural logarithmic conversion.
The maximum number of overdue periods of the credit card in the past 12 months represents the borrower's historical credit performance. For customers in economically underdeveloped areas without stable income, failing to repay credit card debt on time in the recent past has a direct negative effect on the performance of future borrowing and is very likely to lead to future overdue payments. The relationship between this variable and customer risk is as follows: the smaller the maximum number of overdue periods of the credit card in the past 12 months, the better the borrower's repayment habits and the lower the likelihood of serious default; conversely, the larger the maximum number of overdue periods, the worse the borrower's repayment habits and the higher the likelihood of serious default. The conversion mode is WOE conversion.
The minimum value of the use rate of the line of credit for the past 12 months reflects the credit requirement of borrowers. For stock customers in less developed economic areas and without stable revenue, a large credit demand over a period of time may result in excessive liabilities, ultimately resulting in overdue occurrences. The expression relationship between the variable and the customer risk is that the lower the minimum value of the line usage rate of the past 12 months of credit is, the lower the credit demand of the borrower is, and the lower the possibility of serious default is; conversely, the higher the minimum value of the line usage of the past 12 months of credit, the higher the credit demand of the borrower, and the higher the likelihood of serious violations. The conversion mode is square root conversion.
The difference between the monthly minimum amount due on credit and the monthly AUM over the past 3 months represents the borrower's debt level and repayment capability. For stock customers in economically underdeveloped areas without stable income, this difference represents the borrower's short-term debt level and repayment capability; once excessive debt cannot be repaid in time, repayment capability becomes unbalanced and the borrowing eventually defaults. The relationship between this variable and customer risk is as follows: the larger the difference between the monthly minimum amount due and the monthly AUM over the past 3 months, the higher the borrower's debt level, the lower the repayment capability, and the higher the likelihood of serious default; conversely, the smaller the difference, the lower the borrower's debt level, the higher the repayment capability, and the lower the likelihood of serious default. The conversion mode is cube root conversion in continuous conversion.
The average value of the repayment amount of the person loan in the past 12 months represents the repayment behavior and the performance habit of the borrower. For stock customers in less developed economic areas and without stable income, the average value of the repayment amount of the personal loan over the past 12 months represents the repayment behavior and the performance habit of the borrower over a period of time. The performance relationship between the variable and the client risk is that the lower the average value of the repayment amount of the personal loan for the past 12 months is, the worse the repayment behavior of the borrower is, and the higher the possibility of serious default is; conversely, the higher the average value of the person's loan over the past 12 months, the better the borrower's repayment activity and the better the performance habit, and the lower the likelihood of serious default. The conversion mode is original value conversion.
The current credit card residual limit reflects the credit demand of borrowers. For customers in underdeveloped economic areas without stable income, excessive consumption caused by unbalanced balance is likely to lead to credit card expiration or even overdue, so the credit card residual amount can well reflect the consumption demands of customers. The expression relationship between the variable and the customer risk is that the higher the residual credit card amount is, the lower the credit demand of the borrower is, and the lower the possibility of serious default is; conversely, the lower the current credit card remaining amount, the higher the credit demand on the borrower, and the greater the likelihood of serious default. The conversion mode is cube root conversion.
In the step of calculating credit-default probability, the probability (P) of characterizing the borrower default is predicted using the following model formula according to the data and coefficient conditions acquired in the feature conversion step:
Where k is the number of features of the model entered, and k is 6 in equation 3.
α is the intercept term, with a value range of (-0.5296, -0.394), most preferably -0.462; β1 is the coefficient for the current time-point balance of deposit accounts, with a value range of (-0.1997, -0.1867), most preferably -0.193; β2 is the coefficient for the maximum number of overdue periods of the credit card in the past 12 months, with a value range of (-0.4853, -0.4469), most preferably -0.466; β3 is the coefficient for the minimum credit line usage rate in the past 12 months, with a value range of (0.1058, 0.1187), most preferably 0.112; β4 is the coefficient for the difference between the monthly minimum amount due on credit and the monthly AUM over the past 3 months, with a value range of (0.0053, 0.0073), most preferably 0.006; β5 is the coefficient for the average repayment amount of personal loans in the past 12 months, with a value range of (-0.0096, -0.0088), most preferably -0.009; β6 is the coefficient for the current remaining credit card limit, with a value range of (-0.1637, -0.1519), most preferably -0.158. (Note: the value ranges are taken from the 95% confidence intervals, shown as 95% CI in the table below.)
X1 is the natural logarithm conversion value of the current time-point balance of deposit accounts, generated by the feature conversion step; X2 is the WOE conversion value of the maximum number of overdue periods of the credit card in the past 12 months; X3 is the square root conversion value of the minimum credit line usage rate in the past 12 months; X4 is the cube root conversion value of the difference between the monthly minimum amount due on credit and the monthly AUM over the past 3 months; X5 is the original (unconverted) value of the average repayment amount of personal loans in the past 12 months; X6 is the cube root conversion value of the current remaining credit card limit.
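The conversions listed for X1 to X6 can be sketched end to end as below. Several points are assumptions rather than details from the application: the dictionary keys, the WOE lookup table, the monetary units, a sign-preserving cube root (the difference behind X4 can be negative), the requirement that the deposit balance be positive before taking the logarithm, and the standard logistic form used for the final probability.

```python
import math
from typing import Dict, List, Sequence

# Most preferred values stated in the text for the Delta 3 sub-model.
DELTA3_INTERCEPT = -0.462
DELTA3_COEFFS = (-0.193, -0.466, 0.112, 0.006, -0.009, -0.158)


def signed_cbrt(value: float) -> float:
    """Cube root that keeps the sign of negative inputs."""
    return math.copysign(abs(value) ** (1.0 / 3.0), value)


def convert_delta3(raw: Dict[str, float], woe_overdue: Dict[int, float]) -> List[float]:
    """Apply the stated conversion mode to each raw feature, in the order X1..X6."""
    return [
        math.log(raw["deposit_balance"]),                # X1: natural log (balance must be > 0)
        woe_overdue[raw["cc_max_overdue_periods_12m"]],  # X2: WOE lookup by bin
        math.sqrt(raw["credit_min_usage_12m"]),          # X3: square root
        signed_cbrt(raw["min_due_minus_aum_3m"]),        # X4: cube root
        float(raw["loan_avg_repayment_12m"]),            # X5: original value
        signed_cbrt(raw["cc_remaining_limit"]),          # X6: cube root
    ]


def delta3_probability(x: Sequence[float]) -> float:
    """Sigmoid of the linear score built from the converted values."""
    if len(x) != len(DELTA3_COEFFS):
        raise ValueError("six converted values are required")
    z = DELTA3_INTERCEPT + sum(beta * value for beta, value in zip(DELTA3_COEFFS, x))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    raw_features = {  # hypothetical raw values in unspecified units
        "deposit_balance": 1.0,
        "cc_max_overdue_periods_12m": 0,
        "credit_min_usage_12m": 0.25,
        "min_due_minus_aum_3m": -8.0,
        "loan_avg_repayment_12m": 100,
        "cc_remaining_limit": 27.0,
    }
    hypothetical_woe = {0: 0.5, 1: -1.0}
    print(delta3_probability(convert_delta3(raw_features, hypothetical_woe)))
```

Because the units of the raw features are not given in the text, the printed probability is illustrative only.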
The model of the partial features is presented in table 4 below:
TABLE 4
The P-values for all model features were less than 0.05, indicating that the features were significantly correlated with default performance.
EXAMPLE 8 construction of Delta 4 model
Delta 4 is a sub-model of the subdivision model. The sample obtained by decision-tree classification contains about 750,000 records; the customer group consists mainly of existing ("stock") customers without stable income who have applied for credit service and who are located in economically developed areas, and the model is built to predict the probability (the target variable) that customers in this group become 30 or more days overdue on credit.
In this Example 8, the final modeling result is described taking six features as an example: the average repayment amount of personal loans in the past 12 months, the proportion of months in the past 3 months in which the amount due on personal loans increased continuously, the minimum line usage rate of cyclic credit in the past 3 months, the minimum time-point balance of deposit accounts in the past 6 months, the current remaining credit card limit, and the maximum overdue amount in the past 12 months. In the conversion mode selection of the feature conversion step, the 6 model-entry variables have been converted in the appropriate form according to the correlation between each feature and the target variable. In the credit default probability modeling step, the 6 features retained by the feature fine screening step are substituted into a Sigmoid function and fitted by logistic regression (using the LOGISTIC procedure of SAS software) to obtain the model for calculating the credit default probability.
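The customer groups described for Delta 1 to Delta 4 suggest a routing step of the following shape. This is only a sketch: the three boolean flags, their order of evaluation and the function name are hypothetical, the actual split variables and thresholds of the decision tree are not given in this passage, and sending every stable-income customer to Delta 1 is a simplification.

```python
def select_submodel(has_stable_income: bool,
                    is_existing_customer: bool,
                    in_developed_area: bool) -> str:
    """Pick the sub-model whose customer group matches the applicant."""
    if has_stable_income:
        return "Delta 1"  # customers with stable income who have applied for credit service
    if not is_existing_customer:
        return "Delta 2"  # new customers without stable income
    if in_developed_area:
        return "Delta 4"  # stock customers without stable income, developed areas
    return "Delta 3"      # stock customers without stable income, underdeveloped areas


if __name__ == "__main__":
    print(select_submodel(has_stable_income=False,
                          is_existing_customer=True,
                          in_developed_area=True))
```

Once a sub-model is selected, the applicant's converted features are scored with that sub-model's own intercept and coefficients.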
The average personal loan repayment amount over the past 12 months represents the borrower's repayment behavior and performance habits. For existing (stock) customers in economically developed areas without stable income, this average reflects the borrower's repayment behavior and performance habits over a period of time. The relationship between this variable and customer risk is as follows: the lower the average personal loan repayment amount over the past 12 months, the worse the borrower's repayment behavior and the higher the likelihood of serious default; conversely, the higher this average, the better the borrower's repayment behavior and performance habits and the lower the likelihood of serious default. The conversion mode is natural logarithmic conversion.
The proportion of months with a continuous increase in the personal loan amount due over the past 3 months represents the borrower's short-term credit demand. For existing (stock) customers in economically developed areas without stable income, rising credit demand over a period of time is directly reflected in a continuously increasing amount due, which may leave the borrower over-indebted and ultimately overdue. The relationship between this variable and customer risk is as follows: the lower this proportion, the lower the borrower's short-term credit demand and the lower the likelihood of serious default; conversely, the higher this proportion, the higher the borrower's short-term credit demand and the higher the likelihood of serious default. The conversion mode is WOE conversion.
The minimum credit line utilization rate of revolving credit over the past 3 months reflects the borrower's short-term credit demand. For existing (stock) customers in economically developed areas without stable income, heavy credit demand over a period of time may lead to excessive debt, which in revolving credit shows up directly as a continuously rising credit line utilization rate and ultimately results in overdue payments. The relationship between this variable and customer risk is as follows: the lower the minimum credit line utilization rate of revolving credit over the past 3 months, the lower the borrower's short-term credit demand and the lower the likelihood of serious default; conversely, the higher this minimum, the higher the borrower's short-term credit demand and the higher the likelihood of serious default. The conversion mode is square conversion.
The minimum point-in-time balance of deposit accounts over the past 6 months represents the borrower's deposit level and repayment capacity. For existing (stock) customers in economically developed areas without stable income, the smaller the minimum deposit balance over a period of time, the greater the pressure on the borrower to repay debt and the higher the likelihood of becoming overdue. The relationship between this variable and customer risk is as follows: the larger the minimum point-in-time balance of deposit accounts over the past 6 months, the better the borrower's deposit level, the stronger the repayment capacity, and the lower the likelihood of serious default; conversely, the smaller this minimum balance, the poorer the borrower's deposit level, the weaker the repayment capacity, and the higher the likelihood of serious default. The conversion mode is original value conversion.
The current remaining credit card limit reflects the borrower's credit demand. For existing (stock) customers in economically developed areas without stable income, heavy recent credit demand may lead to excessive debt, which on credit cards shows up directly as a continuously shrinking available credit limit and ultimately results in overdue payments. The relationship between this variable and customer risk is as follows: the lower the current remaining credit card limit, the stronger the borrower's credit demand, the higher the repayment pressure, and the higher the likelihood of serious default; conversely, the higher the current remaining credit card limit, the weaker the borrower's credit demand, the lower the repayment pressure, and the lower the likelihood of serious default. The conversion mode is square root conversion.
The maximum number of overdue periods of the credit card over the past 12 months represents the borrower's historical credit performance. For existing (stock) customers in economically developed areas without stable income, failure to repay credit card debt on time over a period of time has a direct negative impact on future loan performance and is very likely to result in future overdue payments. The relationship between this variable and customer risk is as follows: the smaller the maximum number of overdue periods of the credit card over the past 12 months, the better the borrower's credit performance habits and the lower the likelihood of serious default; conversely, the larger this maximum, the worse the borrower's credit performance habits and the higher the likelihood of serious default. The conversion mode is WOE conversion.
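The WOE conversion named for several features is not spelled out in this section. The following is a minimal sketch, assuming the common definition WOE = ln(share of non-defaulting customers in a bin / share of defaulting customers in the same bin). The sign convention, the bin labels, and the counts are illustrative assumptions and are not taken from this document.

```python
import math

def woe_table(bins):
    """Map each bin label to its weight of evidence.

    bins: {label: (n_good, n_bad)}; every bin is assumed to contain at
    least one good and one bad customer, so no smoothing is applied.
    Assumed convention: WOE = ln(good share / bad share).
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    table = {}
    for label, (good, bad) in bins.items():
        good_share = good / total_good
        bad_share = bad / total_bad
        table[label] = math.log(good_share / bad_share)
    return table

# Hypothetical bins for a count-type feature (counts are invented).
counts = {"0": (8000, 200), "1": (1500, 150), "2+": (500, 150)}
woe = woe_table(counts)
```

Under this convention a bin with relatively more defaulting customers receives a negative WOE, and the regression then operates on the WOE value instead of the raw bin.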
In the credit violation probability calculation step, the probability (P) characterizing borrower default is predicted from the data obtained in the feature conversion step and the coefficients, using the following model formula:
where k is the number of features entered into the model; in equation 4, k is 6.
α is the intercept term, with a value range of (3.5781, 3.7431) and an optimal value of 3.661; β1 is the coefficient for the average personal loan repayment amount over the past 12 months, with a value range of (-0.5156, -0.496) and an optimal value of -0.506; β2 is the coefficient for the proportion of months with a continuous increase in the personal loan amount due over the past 3 months, with a value range of (0.6685, 0.6991) and an optimal value of 0.684; β3 is the coefficient for the minimum credit line utilization rate of revolving credit over the past 3 months, with a value range of (0.0088, 0.0096) and an optimal value of 0.009; β4 is the coefficient for the minimum point-in-time balance of deposit accounts over the past 6 months, with a value range of (-0.2122, -0.2012) and an optimal value of -0.207; β5 is the coefficient for the current remaining credit card limit, with a value range of (-0.0588, -0.0419) and an optimal value of -0.050; β6 is the coefficient for the maximum number of overdue periods of the credit card over the past 12 months, with a value range of (-0.4024, -0.3742) and an optimal value of -0.388. (Note: the value ranges are the 95% confidence intervals, listed as 95% CI in the table below.)
X1 is the natural logarithmic conversion value of the average personal loan repayment amount over the past 12 months generated in the feature conversion step; X2 is the WOE conversion value of the proportion of months with a continuous increase in the personal loan amount due over the past 3 months generated in the feature conversion step; X3 is the square conversion value of the minimum credit line utilization rate of revolving credit over the past 3 months generated in the feature conversion step; X4 is the original value of the minimum point-in-time balance of deposit accounts over the past 6 months generated in the feature conversion step; X5 is the square root conversion value of the current remaining credit card limit generated in the feature conversion step; X6 is the WOE conversion value of the maximum number of overdue periods of the credit card over the past 12 months generated in the feature conversion step.
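The model formula itself is not reproduced in this text. As a minimal sketch, assuming the standard logistic form P = 1 / (1 + exp(-(α + β1·X1 + ... + βk·Xk))), the Delta 4 probability can be computed from the optimal coefficient values quoted above. The function name and the input vector are hypothetical, and the orientation of the formula (P as the default probability) is an assumption.

```python
import math

# Optimal values quoted above for the Delta 4 sub-model (k = 6).
ALPHA = 3.661
BETAS = [-0.506, 0.684, 0.009, -0.207, -0.050, -0.388]

def default_probability(x, alpha=ALPHA, betas=BETAS):
    """Assumed logistic form: P = sigmoid(alpha + sum(beta_i * x_i)).

    x holds the already converted feature values X1..Xk.
    """
    if len(x) != len(betas):
        raise ValueError("expected one converted value per coefficient")
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    # Numerically stable evaluation of the sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical converted inputs, for illustration only.
p = default_probability([9.2, 0.1, 25.0, 4.0, 80.0, -0.3])
```

The same function applies to the other sub-models by passing their intercept and coefficient lists.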
The model of the partial features is shown in table 5 below:
TABLE 5
The P-values for all model features were less than 0.05, indicating that the features were significantly correlated with default performance.
Example 9 Delta 5 model construction
For Delta 5, a sub-model of the segmented model, the sample obtained by decision-tree classification contains about 230,000 records. The customer segment consists mainly of existing (stock) customers without stable income who have applied for credit services and who are located in economically moderately developed areas. The model is constructed to predict the probability (target variable) that customers in this segment become 30 or more days overdue on credit in the future.
In Example 9, the final modeling result is described using the eight finally confirmed features as an example: the average point-in-time balance of deposit accounts over the past 6 months; the number of months in which the credit card bill balance increased continuously over the past 6 months; the minimum credit line utilization rate of revolving credit over the past 6 months; the ratio of the monthly credit amount due to the monthly average AUM over the past 6 months; the number of months with a credit repayment rate >80% over the past 12 months; the maximum number of consecutive months with a credit card credit line utilization rate >90% over the past 12 months; the total personal loan repayment amount in the current month; and the number of credit accounts whose current credit line utilization rate exceeds 90%. When the conversion mode is selected in the feature conversion step, the 8 model-entry variables are each converted in the form that suits the correlation between the feature and the target variable. In the credit violation probability modeling step, the 8 features selected in the feature fine-screening step are substituted into a Sigmoid function (using the LOGISTIC procedure of SAS software) to perform logistic regression and obtain the model for the credit violation probability.
The average point-in-time balance of deposit accounts over the past 6 months represents the borrower's deposit level and repayment capacity. For existing (stock) customers in economically moderately developed areas without stable income, the smaller the average deposit balance over the past period, the greater the pressure on the borrower to repay debt and the higher the likelihood of becoming overdue. The relationship between this variable and customer risk is as follows: the larger the average point-in-time balance of deposit accounts over the past 6 months, the better the borrower's deposit level, the stronger the repayment capacity, and the lower the likelihood of serious default; conversely, the smaller this average balance, the poorer the borrower's deposit level, the weaker the repayment capacity, and the higher the likelihood of serious default. The conversion mode is natural logarithmic conversion.
The number of months in which the credit card bill balance increased continuously over the past 6 months reflects the borrower's short-to-medium-term credit demand. For existing (stock) customers in economically moderately developed areas without stable income, heavy short-term credit demand may lead to excessive debt, which on credit cards shows up directly as a continuously increasing bill balance and ultimately results in overdue payments. The relationship between this variable and customer risk is as follows: the smaller the number of months in which the credit card bill balance increased continuously over the past 6 months, the lower the borrower's short-to-medium-term credit demand and the lower the likelihood of serious default; conversely, the larger this number, the higher the borrower's short-to-medium-term credit demand and the higher the likelihood of serious default. The conversion mode is WOE conversion.
The minimum credit line utilization rate of revolving credit over the past 6 months reflects the borrower's short-to-medium-term credit demand. For existing (stock) customers in economically moderately developed areas without stable income, heavy credit demand over a period of time may lead to excessive debt, which in revolving credit shows up directly as a continuously rising credit line utilization rate and ultimately results in overdue payments. The relationship between this variable and customer risk is as follows: the lower the minimum credit line utilization rate of revolving credit over the past 6 months, the lower the borrower's short-to-medium-term credit demand and the lower the likelihood of serious default; conversely, the higher this minimum, the higher the borrower's short-to-medium-term credit demand and the higher the likelihood of serious default. The conversion mode is square root conversion.
The ratio of the monthly credit amount due to the monthly average AUM over the past 6 months represents the borrower's debt level and debt-servicing capacity in the short to medium term. For existing (stock) customers in economically moderately developed areas without stable income, once debt has been excessive over a past period and available income cannot cover the repayment gap, repayment capacity becomes unbalanced and a loan default ultimately occurs. The relationship between this variable and customer risk is as follows: the higher the ratio of the monthly credit amount due to the monthly average AUM over the past 6 months (log conversion), the higher the borrower's debt level, the weaker the debt-servicing capacity, and the higher the likelihood of serious default; conversely, the lower this ratio (log conversion), the lower the borrower's debt level, the stronger the debt-servicing capacity, and the lower the likelihood of serious default. The conversion mode is natural logarithmic conversion.
The number of months with a credit repayment rate >80% over the past 12 months represents the borrower's medium-to-long-term credit performance. For existing (stock) customers in economically moderately developed areas without stable income, good credit performance over the medium to long term, with a high repayment rate maintained for a long time, indicates good repayment habits and a low likelihood of unbalanced repayment capacity and loan default. The relationship between this variable and customer risk is as follows: the smaller the number of months with a repayment rate >80% over the past 12 months, the worse the borrower's credit performance habits and the higher the likelihood of serious default; conversely, the larger this number, the better the borrower's credit performance habits and the lower the likelihood of serious default. The conversion mode is cube root conversion.
The maximum number of consecutive months with a credit card credit line utilization rate >90% over the past 12 months reflects the borrower's medium-to-long-term credit demand. For existing (stock) customers in economically moderately developed areas without stable income, heavy credit demand over the medium to long term may lead to excessive debt, which on credit cards shows up directly as a continuously rising credit line utilization rate and ultimately results in overdue payments. The relationship between this variable and customer risk is as follows: the smaller the maximum number of consecutive months with a credit line utilization rate >90% over the past 12 months, the lower the borrower's demand for funds and the lower the likelihood of serious default; conversely, the larger this number, the higher the borrower's demand for funds and the higher the likelihood of serious default. The conversion mode is WOE conversion.
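A feature of the "maximum consecutive months" kind can be derived from a monthly series with a single pass. The sketch below is illustrative only: the function name and the twelve utilization values are hypothetical, and utilization is assumed to be expressed as a fraction of the credit line.

```python
def max_consecutive_months(utilization, threshold=0.90):
    """Longest run of consecutive months with utilization above threshold."""
    longest = 0
    current = 0
    for rate in utilization:
        if rate > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the run is broken, start counting again
    return longest

# Hypothetical 12-month utilization history, oldest month first.
monthly_utilization = [0.95, 0.97, 0.40, 0.92, 0.93, 0.99,
                       0.91, 0.50, 0.20, 0.95, 0.10, 0.30]
streak = max_consecutive_months(monthly_utilization)  # months 4 to 7 form the longest run
```

The resulting count would then be binned and WOE-converted before entering the regression.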
The total personal loan repayment amount in the current month reflects the borrower's repayment behavior and performance habits. For existing (stock) customers in economically moderately developed areas without stable income, the total personal loan repayment amount reflects the borrower's repayment behavior and performance habits over a period of time; the larger the repayment amount, the stronger the customer's repayment capacity and willingness to repay, and the lower the likelihood of future overdue payments. The larger the total personal loan repayment amount in the current month, the better the borrower's repayment capacity and repayment habits and the lower the likelihood of serious default; conversely, the smaller this total, the worse the borrower's repayment capacity and repayment habits and the higher the likelihood of serious default. The conversion mode is original value conversion.
The number of personal credit accounts whose current credit line utilization rate exceeds 90% reflects the borrower's credit demand. For customers in economically moderately developed areas without stable income, a higher personal credit utilization rate indicates higher recent borrowing demand, greater repayment pressure, and a higher likelihood of future overdue payments. The larger the number of personal credit accounts whose current credit line utilization rate exceeds 90%, the higher the borrower's credit demand and the higher the likelihood of serious default; conversely, the smaller this number, the lower the borrower's credit demand and the lower the likelihood of serious default. The conversion mode is square conversion.
In the credit violation probability calculation step, the probability (P) characterizing borrower default is predicted from the data obtained in the feature conversion step and the coefficients, using the following model formula:
where k is the number of features entered into the model; in equation 5, k is 8.
α is the intercept term, with a value range of (0.2614, 0.4061) and an optimal value of 0.334; β1 is the coefficient for the average point-in-time balance of deposit accounts over the past 6 months, with a value range of (-0.2041, -0.1896) and an optimal value of -0.197; β2 is the coefficient for the number of months in which the credit card bill balance increased continuously over the past 6 months, with a value range of (-0.4365, -0.3966) and an optimal value of -0.417; β3 is the coefficient for the minimum credit line utilization rate of revolving credit over the past 6 months, with a value range of (0.1747, 0.1912) and an optimal value of 0.183; β4 is the coefficient for the ratio of the monthly credit amount due to the monthly average AUM over the past 6 months, with a value range of (0.2261, 0.2516) and an optimal value of 0.239; β5 is the coefficient for the number of months with a credit repayment rate >80% over the past 12 months, with a value range of (-0.5716, -0.51) and an optimal value of -0.541; β6 is the coefficient for the maximum number of consecutive months with a credit card credit line utilization rate >90% over the past 12 months, with a value range of (-0.3114, -0.2683) and an optimal value of -0.290; β7 is the coefficient for the total personal loan repayment amount in the current month, with a value range of (-0.3661, -0.281) and an optimal value of -0.324; β8 is the coefficient for the number of credit accounts whose current credit line utilization rate exceeds 90%, with a value range of (-0.4713, -0.3529) and an optimal value of -0.412. (Note: the value ranges are the 95% confidence intervals, listed as 95% CI in the table below.)
X1 is the natural logarithmic conversion value of the average point-in-time balance of deposit accounts over the past 6 months generated in the feature conversion step; X2 is the WOE conversion value of the number of months in which the credit card bill balance increased continuously over the past 6 months generated in the feature conversion step; X3 is the square root conversion value of the minimum credit line utilization rate of revolving credit over the past 6 months generated in the feature conversion step; X4 is the natural logarithmic conversion value of the ratio of the monthly credit amount due to the monthly average AUM over the past 6 months generated in the feature conversion step; X5 is the cube root conversion value of the number of months with a credit repayment rate >80% over the past 12 months generated in the feature conversion step; X6 is the WOE conversion value of the maximum number of consecutive months with a credit card credit line utilization rate >90% over the past 12 months generated in the feature conversion step; X7 is the original value of the total personal loan repayment amount in the current month generated in the feature conversion step; X8 is the square conversion value of the number of personal credit accounts whose current credit line utilization rate exceeds 90% generated in the feature conversion step.
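The continuous conversion modes named for X1 to X8 (natural logarithm, square root, cube root, square, original value) can be sketched as plain functions. This is an illustrative sketch: the dictionary keys and sample values are hypothetical, and the text does not say how zero or negative inputs are handled before the logarithm, so the sketch simply requires positive inputs there. The two WOE features (X2 and X6) are omitted because they need a bin table.

```python
import math

def natural_log(v):
    return math.log(v)  # assumes v > 0; zero handling is not specified in the text

def square_root(v):
    return math.sqrt(v)  # assumes v >= 0

def cube_root(v):
    # Sign-preserving real cube root, valid for negative inputs as well.
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def square(v):
    return v * v

def original(v):
    return v

# Hypothetical feature keys for the six non-WOE Delta 5 features.
CONVERSIONS = {
    "deposit_avg_balance_6m": natural_log,    # X1
    "revolving_min_utilization_6m": square_root,  # X3
    "due_to_aum_ratio_6m": natural_log,       # X4
    "months_repay_rate_gt80_12m": cube_root,  # X5
    "loan_repayment_total_month": original,   # X7
    "accounts_utilization_gt90": square,      # X8
}

raw = {
    "deposit_avg_balance_6m": 12000.0,
    "revolving_min_utilization_6m": 0.36,
    "due_to_aum_ratio_6m": 0.25,
    "months_repay_rate_gt80_12m": 8,
    "loan_repayment_total_month": 3500.0,
    "accounts_utilization_gt90": 2,
}
converted = {name: CONVERSIONS[name](value) for name, value in raw.items()}
```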
The model of the partial features is presented in table 6 below:
TABLE 6
The P-values for all model features were less than 0.05, indicating that the features were significantly correlated with default performance.
Example 10 Delta 6 model construction
For Delta 6, a sub-model of the segmented model, the sample obtained by decision-tree classification contains about 60,000 records. The customer segment consists mainly of new customers whose account age for the applied-for credit service is less than 3 months. The model is constructed to predict the probability (target variable) that customers in this sample group, who have not previously applied for and registered a credit service, become 30 or more days overdue on credit in the future.
In Example 10, the final modeling result is described using the five finally confirmed features as an example: the number of months a gold account was held over the past 12 months; the average balance of investment and wealth management accounts over the past 3 months; the minimum point-in-time balance of deposit accounts over the past 3 months; the number of payroll payments received over the past 6 months; and the maximum payroll value over the past 12 months. When the conversion mode is selected in the feature conversion step, the 5 model-entry variables are each converted in the form that suits the correlation between the feature and the target variable. In the credit violation probability modeling step, the 5 features selected in the feature fine-screening step are substituted into a Sigmoid function (using the LOGISTIC procedure of SAS software) to perform logistic regression and obtain the model for the credit violation probability.
The number of months a gold account was held over the past 12 months reflects the borrower's investment intent and level of funds. For the new-customer population that has not applied for and registered a credit service, a larger number of holding months over the past period indicates a stronger investment tendency and suggests that the borrower has sufficient funds. The relationship between this variable and customer risk is as follows: the larger the number of months a gold account was held over the past 12 months, the stronger the borrower's investment intent and level of funds and the lower the likelihood of serious default; conversely, the smaller this number, the weaker the borrower's investment intent and level of funds and the higher the likelihood of serious default. The conversion mode is WOE conversion.
The average balance of investment and wealth management accounts over the past 3 months reflects the borrower's asset level and repayment capacity. For the new-customer population that has not applied for and registered a credit service, the larger the short-term balance of investment and wealth management accounts, the more ample the borrower's available funds and the lower the likelihood that future borrowing becomes overdue. The relationship between this variable and customer risk is as follows: the higher the average balance of investment and wealth management accounts over the past 3 months, the stronger the borrower's asset level and repayment capacity and the lower the likelihood of serious default; conversely, the lower this average balance, the weaker the borrower's asset level and repayment capacity and the higher the likelihood of serious default. The conversion mode is square root conversion, one of the continuous conversions.
The minimum point-in-time balance of deposit accounts over the past 3 months reflects the borrower's repayment capacity. For the new-customer population that has not applied for and registered a credit service, the minimum AUM over the past 3 months represents the borrower's recent economic capacity; the larger the minimum AUM, the stronger the economic capacity and the lower the likelihood of future overdue payments. The relationship between this variable and customer risk is as follows: the lower the minimum AUM over the past 3 months, the weaker the borrower's repayment capacity and the higher the likelihood of serious default; conversely, the higher the minimum AUM over the past 3 months, the stronger the borrower's repayment capacity and the lower the likelihood of serious default. The conversion mode is natural logarithmic conversion.
The number of payroll payments received over the past 6 months reflects the borrower's income level and repayment capacity. For the new-customer population that has not applied for and registered a credit service, this number represents the borrower's income level and repayment capacity in the short to medium term. The relationship between this variable and customer risk is as follows: the smaller the number of payroll payments received over the past 6 months, the lower the borrower's income level, the weaker the repayment capacity, and the higher the likelihood of serious default; conversely, the larger this number, the higher the borrower's income level, the stronger the repayment capacity, and the lower the likelihood of serious default. The conversion mode is WOE conversion.
The maximum payroll value over the past 12 months represents the borrower's income and repayment capacity. For the new-customer population that has not applied for and registered a credit service, the maximum payroll value over the past 12 months represents the borrower's long-term earning capacity and repayment capacity; the larger the maximum payroll value over a period of time, the higher the borrower's average income level and the stronger the repayment capacity. The relationship between this variable and customer risk is as follows: the lower the maximum payroll value over the past 12 months, the weaker the borrower's earning capacity and repayment capacity and the higher the likelihood of serious default; conversely, the higher the maximum payroll value over the past 12 months, the higher the borrower's income level and repayment capacity and the lower the likelihood of serious default. The conversion mode is original value conversion, one of the continuous conversions.
In the credit violation probability calculation step, the probability (P) characterizing borrower default is predicted from the data obtained in the feature conversion step and the coefficients, using the following model formula:
where k is the number of features entered into the model; in equation 6, k is 5.
α is the intercept term, with a value range of (0.2884, 1.9611) and an optimal value of 1.125; β1 is the coefficient for the number of months a gold account was held over the past 12 months, with a value range of (-0.721, -0.6281) and an optimal value of -0.675; β2 is the coefficient for the average balance of investment and wealth management accounts over the past 3 months, with a value range of (-0.0646, -0.0125) and an optimal value of -0.039; β3 is the coefficient for the minimum point-in-time balance of deposit accounts over the past 3 months, with a value range of (-0.1275, -0.0891) and an optimal value of -0.108; β4 is the coefficient for the number of payroll payments received over the past 6 months, with a value range of (-0.4612, -0.2386) and an optimal value of -0.350; β5 is the coefficient for the maximum payroll value over the past 12 months, with a value range of (-0.3679, -0.1707) and an optimal value of -0.269. (Note: the value ranges are the 95% confidence intervals, listed as 95% CI in the table below.)
X1 is the WOE conversion value of the number of months a gold account was held over the past 12 months generated in the feature conversion step; X2 is the square root conversion value of the average balance of investment and wealth management accounts over the past 3 months generated in the feature conversion step; X3 is the natural logarithmic conversion value of the minimum point-in-time balance of deposit accounts over the past 3 months generated in the feature conversion step; X4 is the WOE conversion value of the number of payroll payments received over the past 6 months generated in the feature conversion step; X5 is the original value of the maximum payroll value over the past 12 months generated in the feature conversion step.
The model of the partial features is presented in table 7 below:
TABLE 7
The P-values for all model features were less than 0.05, indicating that the features were significantly correlated with default performance.
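The statement that every feature is significant at the 0.05 level can be cross-checked against the quoted 95% intervals. The sketch below assumes the intervals are symmetric Wald intervals (estimate ± 1.96 standard errors), which the text does not state. It recovers an approximate standard error, z statistic, and two-sided p-value for the coefficient quoted for the Delta 6 investment-account feature, range (-0.0646, -0.0125) with optimal value -0.039.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution

def wald_from_ci(estimate, lower, upper):
    """Recover (standard error, z, two-sided p) from a symmetric 95% interval."""
    se = (upper - lower) / (2.0 * Z_95)
    z = estimate / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_value

se, z, p_value = wald_from_ci(-0.039, -0.0646, -0.0125)
```

An interval that excludes zero, as every interval quoted in this section does, corresponds to a p-value below 0.05 under this assumption.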
Example 11
The default probability P calculated with each of the above formulas may further be used to calculate a score for any customer.
A scoring step converts the calculated credit violation probability into a score of 0 to 1000 points using a pre-stored default score conversion code.
In the score calculation module, a credit score characterizing the borrower is calculated using the following formula:
where P is the borrower's default probability generated in the credit violation probability calculation module; A is 443.9036; B is -72.1348; and the round function rounds the calculated score to an integer. Finally, scores above 1000 are set to 1000 and scores below 0 are set to 0.
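The scoring formula itself is not reproduced in this text. The sketch below assumes the common log-odds scaling Score = round(A + B·ln(P / (1 - P))), which is an assumption and not confirmed by the source. One observation in its favor is that B = -72.1348 equals -50 / ln 2 to the quoted precision, which would correspond to a 50-point change per doubling of the odds, but that reading is also an inference.

```python
import math

A = 443.9036   # constant quoted in the text
B = -72.1348   # constant quoted in the text

def credit_score(p):
    """Map a default probability in (0, 1) to a 0-1000 score.

    Assumed form: round(A + B * ln(p / (1 - p))), then clipped to [0, 1000].
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    raw = A + B * math.log(p / (1.0 - p))
    rounded = math.floor(raw + 0.5)  # round half up; Python's round() rounds half to even
    return max(0, min(1000, rounded))

score_low_risk = credit_score(0.01)
score_high_risk = credit_score(0.10)
```

Under this assumed form a lower default probability yields a higher score, and the final clipping reproduces the rule that scores above 1000 or below 0 are set to the boundary values.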
The default probability P calculated in any of the above embodiments may be substituted into the formula of the present embodiment to calculate the score.
The KS (Kolmogorov-Smirnov) statistic was proposed by two Soviet mathematicians, A. N. Kolmogorov and N. V. Smirnov. In risk control, KS is commonly used to evaluate the discriminating power of a model. The greater the discrimination, the stronger the model's risk ranking ability.
The KS statistic is based on the empirical cumulative distribution function (ECDF) and is generally defined as:
KS=max(|cum(bad_rate)-cum(good_rate)|)
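The definition above can be sketched as a single pass over customers sorted by score, accumulating the shares of bad and good customers and tracking the largest gap. The function name and the six-customer sample are hypothetical. The result is a fraction between 0 and 1; the KS values quoted in this document appear to be the same quantity multiplied by 100.

```python
def ks_statistic(scores, is_bad):
    """KS = max |cum(bad share) - cum(good share)| over score thresholds."""
    pairs = sorted(zip(scores, is_bad))
    total_bad = sum(1 for flag in is_bad if flag)
    total_good = len(is_bad) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need at least one bad and one good customer")
    cum_bad = 0
    cum_good = 0
    ks = 0.0
    i = 0
    n = len(pairs)
    while i < n:
        j = i
        # Consume all customers that share the same score before comparing,
        # so that tied scores are treated as one threshold.
        while j < n and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / total_bad - cum_good / total_good))
        i = j
    return ks

# Hypothetical scores and default flags (1 = bad, 0 = good).
sample_scores = [300, 400, 500, 600, 700, 800]
sample_flags = [1, 1, 0, 1, 0, 0]
ks = ks_statistic(sample_scores, sample_flags)
```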
A KS curve is drawn according to the prediction model. Taking the overall Delta model as an example, the KS of this embodiment is 67.98, which shows that the model constructed in this embodiment performs well in evaluating customer credit risk. Fig. 2 shows an example of a KS curve with a KS value of 58.
The KS values obtained by applying the model to the development sample and to the held-out validation sample are shown below. Table 8 shows that the model discriminates well on the training set, on the validation set (the sample ratio of training set to validation set is 6:4), and on the whole sample; that is, the discrimination is excellent in all sub-models.
TABLE 8

Claims (56)

1. A method of calculating retail credit risk, comprising:
a data acquisition step of acquiring retail credit prediction data of a sample to be predicted;
a step of classifying the sample to be predicted based on a decision tree method to determine a sub-model for calculating the credit violation probability; and
a credit violation probability calculation step of substituting the retail credit prediction data into the credit violation probability sub-model to calculate the credit violation probability of the sample to be predicted, wherein preferably the credit violation probability is the credit violation probability for an online credit service.
2. The method of claim 1, further comprising:
after calculating the credit violation probability, calculating a credit score of the sample to be predicted, for calibrating the calculated credit violation probability to a normalized score of 0-1000 points.
3. The method according to claim 1 or 2, wherein,
the retail credit prediction data comprises original retail credit prediction data of the sample to be predicted and derived retail credit prediction data processed from the original retail credit prediction data;
preferably, the original retail credit prediction data includes:
credit card class base data, which is all available data from the sample user's credit card opening process and usage process,
personal loan class base data, which is all available data from the sample user's loan applications and usage behavior,
customer basic information class base data, which is data on the attributes of the sample user that are not directly related to behavior at the financial institution,
personal financial asset class base data, which is data on all of the sample user's other financial assets and financial transactions at the financial institution that are not related to credit cards or loans.
4. The method according to any one of claims 1 to 3, wherein,
the derived retail credit prediction data obtained by processing the original retail credit prediction data refers to data obtained by processing the collected original retail credit prediction data along a time dimension, a space dimension, a frequency dimension, and a statistical information dimension;
preferably, the derived retail credit prediction data includes, but is not limited to:
derived retail credit prediction data processed based on sample relationship lengths,
Derived retail credit forecast data processed based on time interval class variables,
Derived retail credit forecast data processed based on the degree of sample behavioral frequency,
Derived retail credit prediction data processed based on the sample current point in time conditions,
Derived retail credit forecast data processed based on sample duration behavior,
Derived retail credit prediction data processed from the sample data based on the statistical information dimension.
5. The method according to any one of claims 1 to 4, wherein,
The retail credit prediction data is selected from one or two or three or four or five or six or seven or eight of the following:
the maximum overdue count of credit accounts in the past 12 months; the current remaining credit card limit; the total AUM value of the current month; the maximum wage in the past 12 months; the average credit limit utilization of personal loans in the past 3 months; the number of months in which the personal loan repayment amount increased consecutively in the past 12 months; the number of months in the past 12 months in which credit card interest was greater than 0; the number of months from the present to the time point of the minimum deposit account balance in the past 12 months; the average revolving credit limit utilization in the past 3 months; the minimum AUM value in the past 3 months; the average balance of investment and wealth-management accounts in the past 3 months; the minimum time-point balance of deposit accounts in the past 3 months; the current time-point deposit account balance; the maximum credit card overdue count in the past 12 months; the minimum credit limit utilization in the past 12 months; the difference between the monthly minimum payable credit amount and the monthly AUM over the past 3 months; the average personal loan repayment amount in the past 12 months; the proportion of months in which the personal loan amount payable increased consecutively in the past 3 months; the minimum revolving credit limit utilization in the past 3 months; the minimum time-point balance of deposit accounts in the past 6 months; the average time-point balance of deposit accounts in the past 6 months; the number of months in which the credit card statement balance increased consecutively in the past 6 months; the minimum revolving credit limit utilization in the past 6 months; the ratio of the monthly credit amount due to the monthly AUM over the past 6 months; the number of months in the past 12 months in which the credit repayment rate was greater than 80%; the maximum number of consecutive months in the past 12 months in which the credit card limit utilization was greater than 90%; the total personal loan repayment amount for the current month; the number of credit accounts whose current credit limit utilization exceeds 90%; the number of months an account was held in the past 12 months; the number of wage payments in the past 6 months; and the maximum wage in the past 12 months.
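As an illustration of how two of the derived features named in this claim might be computed from a monthly series, the following Python sketch counts the months in which a value exceeds a threshold and finds the longest run of consecutive months above a threshold. The function names and the sample series are hypothetical; the application does not specify an implementation.

```python
from typing import Sequence


def months_above(monthly_values: Sequence[float], threshold: float = 0.0) -> int:
    """Count months whose value is strictly greater than the threshold,
    e.g. months in the past 12 months with credit card interest > 0."""
    return sum(1 for v in monthly_values if v > threshold)


def max_consecutive_months_above(
    monthly_values: Sequence[float], threshold: float = 0.9
) -> int:
    """Longest run of consecutive months strictly above the threshold,
    e.g. consecutive months with credit limit utilization > 90%."""
    best = 0
    run = 0
    for v in monthly_values:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best
```

For a hypothetical utilization series `[0.95, 0.92, 0.5, 0.91, 0.93, 0.99, 0.2]`, the longest run above 0.9 is three months.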
6. The method according to any one of claims 1 to 5, wherein,
the step of classifying the sample to be predicted comprises the sub-steps of:
determining whether the sample to be tested is a customer who has applied for credit business at a financial institution;
determining whether the sample to be tested is a customer with stable income;
determining whether the sample to be tested was newly registered within m months;
determining the geographical area in which the sample to be tested handles business;
the sample to be predicted is classified based on these sub-steps to determine the submodel for calculating credit violation probability, wherein the order of the sub-steps may be set arbitrarily provided that reasonable business logic is maintained;
The samples to be predicted are preferably classified in the following order:
firstly, judging whether the sample to be tested is a customer who applies for credit business at a financial institution;
then judging whether the sample to be tested is a customer with stable income;
then judging whether the sample to be tested is newly registered in m months;
and then judging the geographical area to which the sample to be tested handles business.
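The preferred classification order can be pictured as a four-level decision tree that routes each sample to a sub-model. The Python sketch below is only an illustration: the field names, the segment keys, the default value `m = 6`, and the reading of "newly registered within m months" as `months_since_registration <= m` are all assumptions, and an actual tree may merge branches that this sketch keeps separate.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    has_applied_for_credit: bool    # applied for credit business at the institution
    has_stable_income: bool         # customer with stable income
    months_since_registration: int  # months since the customer registered
    region: str                     # geographical area where business is handled


def select_submodel(sample: Sample, m: int = 6) -> str:
    """Return a hypothetical segment key, testing the four criteria
    in the preferred order given in the claim."""
    credit = "credit" if sample.has_applied_for_credit else "no_credit"
    income = "stable" if sample.has_stable_income else "unstable"
    tenure = "new" if sample.months_since_registration <= m else "existing"
    return f"{credit}/{income}/{tenure}/{sample.region}"
```

The returned key would then be used to look up the logistic regression sub-model fitted for that segment.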
7. The method according to any one of claims 1 to 6, wherein,
the retail credit prediction data are feature-converted and then substituted into the credit violation probability submodel to calculate the credit violation probability of the sample to be predicted, wherein the feature conversion step comprises:
selecting a WOE mode or a continuous mode for feature conversion based on the feature type of the retail credit prediction data to be substituted into the credit violation probability submodel.
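As an illustration of the WOE mode, the Python sketch below computes a weight-of-evidence value per bin from labelled development samples. It assumes the common convention WOE = ln(share of non-defaulted samples in the bin / share of defaulted samples in the bin); the application does not state which sign convention or binning it uses, and the function name `woe_table` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def woe_table(bins: Sequence[Hashable], y_true: Sequence[int]) -> Dict[Hashable, float]:
    """Map each bin label to its WOE value.

    bins: bin label of each sample for one feature (binning done beforehand).
    y_true: 1 for a defaulted sample, 0 for a non-defaulted sample.
    """
    good = Counter()
    bad = Counter()
    for b, y in zip(bins, y_true):
        if y == 1:
            bad[b] += 1
        else:
            good[b] += 1

    n_good = sum(good.values())
    n_bad = sum(bad.values())
    table = {}
    for b in set(bins):
        if good[b] == 0 or bad[b] == 0:
            # WOE is undefined for a bin missing one class; such bins
            # would need merging or smoothing, which is not shown here.
            raise ValueError(f"bin {b!r} lacks defaulted or non-defaulted samples")
        table[b] = math.log((good[b] / n_good) / (bad[b] / n_bad))
    return table
```

At prediction time, a sample's raw feature value would be assigned to a bin and replaced by that bin's WOE value before being substituted into the sub-model.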
8. The method of claim 7, wherein,
the feature conversion in the continuous mode comprises: directly taking the original value, or calculating the square, the square root, the cube root, or the natural logarithm of the original data.
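The five continuous conversions named in this claim can be sketched in Python as follows. The mode names are hypothetical labels, and the handling of zero or negative inputs is not specified in the claim; in this sketch the square root and natural logarithm simply raise a domain error for invalid inputs, while the cube root is made to work for negative values.

```python
import math


def continuous_transform(value: float, mode: str) -> float:
    """Apply one of the continuous conversions to a raw feature value."""
    if mode == "raw":
        return value
    if mode == "square":
        return value * value
    if mode == "sqrt":
        return math.sqrt(value)   # raises ValueError for negative input
    if mode == "cbrt":
        # Real cube root that also handles negative values such as differences.
        return math.copysign(abs(value) ** (1.0 / 3.0), value)
    if mode == "ln":
        return math.log(value)    # raises ValueError for zero or negative input
    raise ValueError(f"unknown conversion mode: {mode}")
```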
9. The method according to any one of claims 1 to 8, wherein,
The credit violation probability submodel is a model constructed based on existing user populations using logistic regression based on sample retail credit prediction data and credit violation probabilities.
10. The method according to any one of claims 1 to 9, wherein,
the retail credit prediction data is selected from one, two, three, four, five, six or seven of: the maximum overdue count of credit accounts in the past 12 months, the current remaining credit card limit, the total AUM value of the current month, the maximum wage in the past 12 months, the average credit limit utilization of personal loans in the past 3 months, the number of months in which the personal loan repayment amount increased consecutively in the past 12 months, and the number of months in the past 12 months in which credit card interest was greater than 0.
11. The method of claim 10, wherein the credit violation probability calculating step comprises:
performing feature conversion on the maximum overdue count of credit accounts in the past 12 months, the current remaining credit card limit, the total AUM value of the current month, the maximum wage in the past 12 months, the average credit limit utilization of personal loans in the past 3 months, the number of months in which the personal loan repayment amount increased consecutively in the past 12 months, and the number of months in the past 12 months in which credit card interest was greater than 0,
preferably, the maximum overdue count of credit accounts in the past 12 months is converted in the WOE mode; the current remaining credit card limit is converted in the continuous mode; the total AUM value of the current month is converted in the continuous mode; the maximum wage in the past 12 months is converted in the continuous mode; the average credit limit utilization of personal loans in the past 3 months is converted in the continuous mode; the number of months in which the personal loan repayment amount increased consecutively in the past 12 months is converted in the WOE mode; and the number of months in the past 12 months in which credit card interest was greater than 0 is converted in the WOE mode;
further preferably, the continuous conversion of the current remaining credit card limit takes its natural logarithm; the continuous conversion of the total AUM value of the current month takes its square root; the continuous conversion of the maximum wage in the past 12 months takes its cube root; and the continuous conversion of the average credit limit utilization of personal loans in the past 3 months takes its square.
12. The method of claim 11, wherein,
the converted values of the seven features, namely the maximum overdue count of credit accounts in the past 12 months, the current remaining credit card limit, the total AUM value of the current month, the maximum wage in the past 12 months, the average credit limit utilization of personal loans in the past 3 months, the number of months in which the personal loan repayment amount increased consecutively in the past 12 months, and the number of months in the past 12 months in which credit card interest was greater than 0, are substituted into a submodel constructed by logistic regression based on sample retail credit prediction data and credit violation probability, to calculate the credit violation probability of the sample to be predicted.
13. The method of claim 12, wherein,
wherein the submodel is shown in formula 1 below:
where k is the number of features entered into the model, preferably k is 7;
α is the intercept term, preferably in the range (4.5954, 4.9098), most preferably 4.7526;
β1 is the coefficient corresponding to the maximum overdue count of credit accounts in the past 12 months, preferably in the range (-0.7309, -0.7094), most preferably -0.7201;
β2 is the coefficient corresponding to the current remaining credit card limit, preferably in the range (-0.3564, -0.3502), most preferably -0.3533;
β3 is the coefficient corresponding to the total AUM (financial asset) value of the current month, preferably in the range (-0.241, -0.223), most preferably -0.232;
β4 is the coefficient corresponding to the maximum wage in the past 12 months, preferably in the range (-0.3124, -0.2744), most preferably -0.2934;
β5 is the coefficient corresponding to the average credit limit utilization of personal loans in the past 3 months, preferably in the range (0.2713, 0.3168), most preferably 0.294;
β6 is the coefficient corresponding to the number of months in which the personal loan repayment amount increased consecutively in the past 12 months, with a value range of (-1.2154, -1.0571), most preferably -1.1363;
β7 is the coefficient corresponding to the number of months in the past 12 months in which credit card interest was greater than 0, with a value range of (-0.2852, -0.2503), most preferably -0.2678;
x1 is the WOE conversion value of the maximum overdue number of the past 12 month credit account generated by the feature conversion step;
x2 is the natural logarithmic conversion value of the residual usable credit of the current credit card generated in the feature conversion step;
x3 is a square root converted value of the total value of the current month AUM (financial asset) generated by the feature conversion step;
x4 is the cube root conversion value of the wage maximum value of the past 12 months generated in the feature conversion step;
x5 is the square conversion of the average rate of usage of the credit line of the person loan generated by the feature conversion step for the past 3 months;
x6 is a WOE conversion value of the month number of the personal loan generated in the feature conversion step, which is continuously increased by the addition amount of the past 12 months;
x7 is the WOE conversion value for the number of months for which 12 months interest >0 for the credit card generated by the feature conversion step.
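Formula 1 is not reproduced in this text. As an illustration only, the Python sketch below assumes the standard logistic regression form P = 1 / (1 + exp(-(α + β1·x1 + ... + βk·xk))) and uses the most preferred coefficient values stated in this claim; the function name and the assumed functional form are not confirmed by the application.

```python
import math
from typing import Sequence

# Most preferred values from the claim (intercept and beta1..beta7).
ALPHA = 4.7526
BETAS = [-0.7201, -0.3533, -0.232, -0.2934, 0.294, -1.1363, -0.2678]


def default_probability(
    x: Sequence[float], alpha: float = ALPHA, betas: Sequence[float] = BETAS
) -> float:
    """Credit violation probability under the ASSUMED logistic form.

    x: converted feature values x1..xk in the order given in the claim.
    """
    if len(x) != len(betas):
        raise ValueError("expected one converted feature per coefficient")
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Under this assumed form, a larger x5 (squared loan limit utilization, positive coefficient) raises the probability, while larger values of the features with negative coefficients lower it.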
14. The method of claim 13, wherein,
after the credit violation probability is calculated, the credit score of the sample to be predicted, which characterizes the borrower, is calculated using the following formula:
wherein P is the default probability of the borrower generated in the credit violation probability calculation module, A is 443.9036, and B is -72.1348; the round function rounds the calculated score to an integer; finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
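The scoring formula is not reproduced in this text. As an illustration only, the Python sketch below assumes the common log-odds scaling Score = round(A + B·ln(P / (1 - P))), followed by the clipping to the range 0 to 1000 described in this claim. This functional form is an assumption; it is consistent with the stated constants in that B = -72.1348 is approximately -50 / ln 2, which would correspond to 50 points per doubling of the odds. Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule intended in the application.

```python
import math

A = 443.9036   # constant A from the claim
B = -72.1348   # constant B from the claim


def credit_score(p: float, a: float = A, b: float = B) -> int:
    """Credit score under the ASSUMED form round(a + b * ln(p / (1 - p))),
    then clipped to the range [0, 1000]."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    raw = a + b * math.log(p / (1.0 - p))
    return int(min(1000, max(0, round(raw))))
```

Under this assumption a default probability of 0.5 maps to a score of 444, and lower default probabilities map to higher scores.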
15. The method according to any one of claims 1 to 9, wherein,
The retail credit prediction data is selected from the group consisting of: one, two, three, four or five of the minimum balance of the past 12 month deposit account time points, the average value of the usage rate of the line of 3 months in the past of the cyclic credit, the minimum value of AUM in the past 3 months, the average balance of the investment and financial account in the past 3 months, and the minimum balance of the deposit account time points in the past 3 months.
16. The method of claim 15, wherein the credit violation probability calculating step comprises:
Performing characteristic conversion on the minimum balance of the past 12 month deposit account time point from the current month, the average value of the use rate of the credit line of the past 3 months, the AUM minimum value of the past 3 months, the average balance of the past 3 month investment financial account and the minimum balance of the past 3 month deposit account time point,
Preferably, the minimum balance of the past 12 month deposit account time point is converted from the current month number by adopting a WOE mode; converting the average value of the use rate of the line of the circulation credit for the past 3 months in a WOE mode; converting AUM minimum value of the past 3 months in a continuous mode; converting the average balance of the investment financial account in the past 3 months in a continuous mode; converting the minimum balance of the past 3 month deposit account time point by adopting a WOE mode;
further preferably, the continuous mode conversion of the AUM minimum value for the past 3 months is a square root calculation mode for the AUM minimum value for the past 3 months, and the continuous mode conversion of the average balance of the investment and financial account for the past 3 months is a square root calculation mode for the average balance of the investment and financial account for the past 3 months.
17. The method of claim 15, wherein,
Substituting the values obtained by converting the five characteristics of the minimum balance of the past 12 month deposit account time points, the number of present months, the average value of the use rate of the line of 3 months of the cyclic credit, the AUM minimum value of the past 3 months, the average balance of the past 3 month investment financial account and the minimum balance of the past 3 month deposit account time points into a submodel constructed by adopting logistic regression based on sample retail credit prediction data to calculate the default probability of the sample to be predicted.
18. The method of claim 17, wherein,
The submodel is shown in the following formula 2:
where k is the number of features entered into the model, preferably k is 5;
α is the intercept term, preferably in the range (-0.6781, -0.4868), most preferably -0.582;
β1 is the coefficient corresponding to the number of months from the present to the time point of the minimum deposit account balance in the past 12 months, preferably in the range (-0.7058, -0.529), most preferably -0.617;
β2 is the coefficient corresponding to the average revolving credit limit utilization in the past 3 months, preferably in the range (0.0666, 0.114), most preferably 0.090;
β3 is the coefficient corresponding to the minimum AUM value in the past 3 months, preferably in the range (-0.6899, -0.4916), most preferably -0.591;
β4 is the coefficient corresponding to the average balance of investment and wealth-management accounts in the past 3 months, preferably in the range (-0.7065, -0.3619), most preferably -0.534;
β5 is the coefficient corresponding to the minimum time-point balance of deposit accounts in the past 3 months, preferably in the range (-0.118, -0.0557), most preferably -0.087;
x1 is a WOE conversion value of the minimum balance of the past 12 month deposit account time points generated in the feature conversion step from the current month number;
x2 is the square conversion value of the average value of the usage rate of the line of the past 3 months of the cyclic credit generated in the feature conversion step;
x3 is the natural logarithmic conversion value of the AUM minimum value of the past 3 months generated in the feature conversion step;
x4 is the square root conversion value of the average balance of the investment financial account of the past 3 months generated in the characteristic conversion step;
x5 is the cube root conversion value of the minimum balance of the past 3 month deposit account timepoints generated by the feature conversion step.
19. The method of claim 18, wherein,
after the credit violation probability is calculated, the credit score of the sample to be predicted, which characterizes the borrower, is calculated using the following formula:
wherein P is the default probability of the borrower generated in the credit violation probability calculation module, A is 443.9036, and B is -72.1348; the round function rounds the calculated score to an integer; finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
20. The method according to any one of claims 1 to 9, wherein,
the retail credit prediction data is selected from one, two, three, four, five or six of: the current time-point deposit account balance, the maximum credit card overdue count in the past 12 months, the minimum credit limit utilization in the past 12 months, the difference between the monthly minimum payable credit amount and the monthly AUM over the past 3 months, the average personal loan repayment amount in the past 12 months, and the current remaining credit card limit.
21. The method of claim 20, wherein the credit violation probability calculating step comprises:
performing characteristic conversion on the balance of the deposit account at the current time point, the maximum value of the overdue amount of the past 12 months of credit, the minimum usage rate of the credit amount of the past 12 months of credit, the difference between the minimum payable amount of the past 3 months monthly of credit and monthly AUM of the present month, the average value of the payable amount of the past 12 months of personal loan and the residual amount of the current credit card,
Preferably, converting the balance of the deposit account at the current time point in a continuous mode; converting the maximum value of the expiration number of the past 12 months of the credit card by adopting a WOE mode; converting the minimum value of the credit use rate of the past 12 months in a continuous mode; converting the difference value between the lowest payable amount of the present month of the past 3 months monthly credit and monthly AUM in a continuous mode; converting the average value of the repayment amount of the person loan for the past 12 months in a continuous mode; converting the current credit card residual amount in a continuous mode;
Further preferably, the current time point deposit account balance is converted into a calculation mode of taking natural logarithm, the minimum value of the credit use rate of the past 12 months is converted into a calculation mode of taking square root for the minimum value of the credit use rate of the past 12 months, the difference between the minimum payable amount of the past 3 months monthly credit and monthly AUM is converted into a calculation mode of taking cube root for the difference between the minimum payable amount of the past 3 months monthly credit and monthly AUM, the average value of the personal loan repayment amount of the past 12 months is converted into a calculation mode of taking original value for the average value of the personal loan repayment amount of the past 12 months, and the current credit card residual amount is converted into a calculation mode of taking cube root for the residual credit card by adopting the continuous mode.
22. The method of claim 20, wherein,
Substituting the converted values of the six characteristics of the balance of the deposit account at the current time point, the maximum value of the past 12 month expiration date of the credit card, the minimum value of the credit usage rate of the past 12 month line, the difference value between the minimum amount of the credit due to the past 3 months monthly credit and monthly AUM, the average value of the repayment amount of the personal loan in the past 12 months and the current credit card residual line into a submodel constructed by adopting logistic regression based on sample retail credit prediction data and credit violation probability to calculate the violation probability of the sample to be predicted.
23. The method of claim 22, wherein,
The submodel is shown in equation 3 below:
where k is the number of features entered into the model, preferably k is 6;
α is the intercept term, preferably in the range (-0.5296, -0.394), most preferably -0.462;
β1 is the coefficient corresponding to the current time-point deposit account balance, preferably in the range (-0.1997, -0.1867), most preferably -0.193;
β2 is the coefficient corresponding to the maximum credit card overdue count in the past 12 months, preferably in the range (-0.4853, -0.4469), most preferably -0.466;
β3 is the coefficient corresponding to the minimum credit limit utilization in the past 12 months, preferably in the range (0.1058, 0.1187), most preferably 0.112;
β4 is the coefficient corresponding to the difference between the monthly minimum payable credit amount and the monthly AUM over the past 3 months, preferably in the range (0.0053, 0.0073), most preferably 0.006;
β5 is the coefficient corresponding to the average personal loan repayment amount in the past 12 months, preferably in the range (-0.0096, -0.0088), most preferably -0.009;
β6 is the coefficient corresponding to the current remaining credit card limit, preferably in the range (-0.1637, -0.1519), most preferably -0.158;
x1 is a natural logarithmic conversion value of the balance of the deposit account at the current time point generated in the feature conversion step;
x2 is the WOE conversion value of the maximum value of the past 12 month expiration number of the credit card generated by the feature conversion step;
x3 is a square root conversion value of a minimum value of the credit line usage rate of the past 12 months of the credit generated by the feature conversion step;
x4 is the cube root conversion value of the difference between the lowest payable amount of the past 3 months monthly credit present month and monthly AUM generated by the feature conversion step;
x5 is the original converted value of the average value of the repayment amount of the person loan for the past 12 months, which is generated in the feature conversion step;
x6 is the cube root conversion value of the current credit card residual amount generated by the feature conversion step.
24. The method of claim 23, wherein,
after the credit violation probability is calculated, the credit score of the sample to be predicted, which characterizes the borrower, is calculated using the following formula:
wherein P is the default probability of the borrower generated in the credit violation probability calculation module, A is 443.9036, and B is -72.1348; the round function rounds the calculated score to an integer; finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
25. The method according to any one of claims 1 to 9, wherein,
The retail credit prediction data is selected from the group consisting of: one, two, three, four, five or six of an average value of the amount paid over 12 months of the personal loan, a ratio of the number of months by which the amount to be paid over 3 months of the personal loan is continuously increased, a minimum usage rate of the amount of 3 months of the recurring credit over the past, a minimum balance of the account points of the deposit over 6 months of the past, a current credit card remaining amount, and a maximum value of the number of overdue credit cards over 12 months.
26. The method of claim 25, wherein the credit violation probability calculating step comprises:
The average value of the repayment amount of the personal loan in the past 12 months, the continuously increased month number proportion of the repayment amount of the personal loan in the past 3 months, the minimum usage rate of the credit line in the past 3 months of the cyclic loan, the minimum balance of the deposit account in the past 6 months, the current credit card residual amount and the maximum value of the overdue amount of the credit card in the past 12 months are subjected to characteristic conversion,
Preferably, the average value of the repayment amount of the personal loan for the past 12 months is converted in a continuous mode; the month number duty ratio of continuously increasing the amount of the personal loan added for the past 3 months is converted by adopting a WOE mode; converting the minimum value of the utilization rate of the line of 3 months in the past of the cyclic credit in a continuous mode; converting the minimum balance of the past 6 month deposit account time point in a continuous mode; converting the current credit card residual amount in a continuous mode; converting the maximum value of the expiration number of the past 12 months of the credit card by adopting a WOE mode;
Further preferably, the average value of the repayment amount of the person loan for the past 12 months is converted into a calculation mode using natural logarithm for the average value of the repayment amount of the person loan for the past 12 months, the minimum value of the utilization rate of the credit for the 3 months in the past of the circulation is converted into a calculation mode using square for the minimum value of the utilization rate of the credit for the 3 months in the past of the circulation, the minimum balance of the credit account for the past 6 months is converted into a calculation mode using original value for the minimum balance of the credit account for the past 6 months, and the residual credit of the present credit card is converted into a calculation mode using cube root for the residual credit card by adopting the continuous mode.
27. The method of claim 25, wherein,
The value converted from six characteristics of the average value of the repayment amount of the past 12 months of the personal loan, the continuously increased month number proportion of the repayment amount of the past 3 months of the personal loan, the minimum usage rate of the credit line of the past 3 months of the cyclic loan, the minimum balance of the past 6 months of deposit account time points, the current credit card residual amount and the maximum value of the overdue amount of the past 12 months of the credit card is substituted into a submodel constructed by adopting logistic regression based on sample retail credit prediction data and credit violation probability to calculate the violation probability of the sample to be predicted.
28. The method of claim 27, wherein,
The submodel is shown in equation 4 below:
where k is the number of features entered into the model, preferably k is 6;
α is the intercept term, preferably in the range (3.5781, 3.7431), most preferably 3.661;
β1 is the coefficient corresponding to the average personal loan repayment amount in the past 12 months, preferably in the range (-0.5156, -0.496), most preferably -0.506;
β2 is the coefficient corresponding to the proportion of months in which the personal loan amount payable increased consecutively in the past 3 months, preferably in the range (0.6685, 0.6991), most preferably 0.684;
β3 is the coefficient corresponding to the minimum revolving credit limit utilization in the past 3 months, preferably in the range (0.0088, 0.0096), most preferably 0.009;
β4 is the coefficient corresponding to the minimum time-point balance of deposit accounts in the past 6 months, preferably in the range (-0.2122, -0.2012), most preferably -0.207;
β5 is the coefficient corresponding to the current remaining credit card limit, preferably in the range (-0.0588, -0.0419), most preferably -0.050;
β6 is the coefficient corresponding to the maximum credit card overdue count in the past 12 months, preferably in the range (-0.4024, -0.3742), most preferably -0.388;
x1 is WOE conversion value of average value of repayment amount of the person loan for past 12 months generated in the feature conversion step;
x2 is a WOE conversion value of a month number ratio of a continuous increase in the amount of the person loan generated in the feature conversion step for the past 3 months;
x3 is WOE conversion value of minimum value of the use rate of the line of 3 months in the past of the cyclic credit generated in the feature conversion step;
x4 is the WOE conversion value of the minimum balance of the past 6 month deposit account time points generated in the feature conversion step;
x5 is the WOE conversion value of the residual credit card amount generated in the feature conversion step;
x6 is the WOE conversion value for the maximum number of expiration dates for the past 12 months for the credit card generated by the feature conversion step.
29. The method of claim 28, wherein,
after the credit violation probability is calculated, the credit score of the sample to be predicted, which characterizes the borrower, is calculated using the following formula:
wherein P is the default probability of the borrower generated in the credit violation probability calculation module, A is 443.9036, and B is -72.1348; the round function rounds the calculated score to an integer; finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
30. The method according to any one of claims 1 to 9, wherein,
the retail credit prediction data is selected from one, two, three, four, five, six, seven or eight of: the average time-point balance of deposit accounts in the past 6 months, the number of months in which the credit card statement balance increased consecutively in the past 6 months, the minimum revolving credit limit utilization in the past 6 months, the ratio of the monthly credit amount due to the monthly AUM over the past 6 months, the number of months in the past 12 months in which the credit repayment rate was greater than 80%, the maximum number of consecutive months in the past 12 months in which the credit card limit utilization was greater than 90%, the total personal loan repayment amount for the current month, and the number of credit accounts whose current credit limit utilization exceeds 90%.
31. The method of claim 30, wherein the credit violation probability calculating step comprises:
Performing feature conversion on the average point-in-time balance of deposit accounts over the past 6 months, the number of consecutive months in which the credit card statement balance increased over the past 6 months, the minimum credit line utilization rate of revolving credit over the past 6 months, the ratio of the credit repayment amount due in the current month to the monthly average AUM over the past 6 months, the number of months in the past 12 months in which the credit repayment rate was >80%, the maximum number of consecutive months in the past 12 months in which the credit card credit line utilization rate was >90%, the sum of personal loan repayment amounts in the current month, and the number of credit accounts whose current credit line utilization rate exceeds 90%,
Preferably, the average point-in-time balance of deposit accounts over the past 6 months is converted by a continuous conversion method; the number of consecutive months in which the credit card statement balance increased over the past 6 months is converted in the WOE mode; the minimum credit line utilization rate of revolving credit over the past 6 months is converted by a continuous conversion method; the ratio of the credit repayment amount due in the current month to the monthly average AUM over the past 6 months is converted by a continuous conversion method; the number of months in the past 12 months in which the credit repayment rate was >80% is converted by a continuous conversion method; the maximum number of consecutive months in the past 12 months in which the credit card credit line utilization rate was >90% is converted in the WOE mode; the sum of personal loan repayment amounts in the current month is converted by a continuous conversion method; and the number of credit accounts whose current credit line utilization rate exceeds 90% is converted by a continuous conversion method;
Further preferably, the continuous conversion of the average point-in-time balance of deposit accounts over the past 6 months takes its natural logarithm; the continuous conversion of the minimum credit line utilization rate of revolving credit over the past 6 months takes its square root; the continuous conversion of the ratio of the credit repayment amount due in the current month to the monthly average AUM over the past 6 months takes its natural logarithm; the continuous conversion of the number of months in the past 12 months in which the credit repayment rate was >80% takes its cube root; the continuous conversion of the sum of personal loan repayment amounts in the current month uses the original value; and the continuous conversion of the number of credit accounts whose current credit line utilization rate exceeds 90% takes its square.
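The continuous conversions named in this claim (natural logarithm, square root, cube root, original value, square) can be sketched as a lookup table of functions. The dictionary keys and feature names below are hypothetical short names introduced for illustration. The sketch assumes inputs are already valid for each function (for example, strictly positive for the logarithm); the source does not say how zero or negative values are handled. The two WOE-converted features are left out because they are not continuous conversions.

```python
import math

# Continuous conversions named in the claims, keyed by hypothetical labels.
CONTINUOUS_TRANSFORMS = {
    "identity": lambda v: v,
    "square": lambda v: v * v,
    "sqrt": math.sqrt,   # raises ValueError for negative input
    "cbrt": lambda v: math.copysign(abs(v) ** (1.0 / 3.0), v),
    "log": math.log,     # raises ValueError for input <= 0
}

# Which conversion claim 31 prefers for each continuously converted feature.
# Feature names are hypothetical identifiers, not from the source.
CLAIM_31_CONTINUOUS = {
    "deposit_avg_balance_6m": "log",
    "revolving_min_utilization_6m": "sqrt",
    "due_to_avg_aum_6m": "log",
    "months_repay_rate_gt80_12m": "cbrt",
    "personal_loan_repayment_sum": "identity",
    "accounts_utilization_gt90": "square",
}


def convert_claim_31(raw: dict) -> dict:
    """Apply the preferred continuous conversion to each raw feature value."""
    return {
        name: CONTINUOUS_TRANSFORMS[kind](raw[name])
        for name, kind in CLAIM_31_CONTINUOUS.items()
    }
```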
32. The method of claim 30, wherein,
Substituting the converted values of the eight features, namely the average point-in-time balance of deposit accounts over the past 6 months, the number of consecutive months in which the credit card statement balance increased over the past 6 months, the minimum credit line utilization rate of revolving credit over the past 6 months, the ratio of the credit repayment amount due in the current month to the monthly average AUM over the past 6 months, the number of months in the past 12 months in which the credit repayment rate was >80%, the maximum number of consecutive months in the past 12 months in which the credit card credit line utilization rate was >90%, the sum of personal loan repayment amounts in the current month, and the number of credit accounts whose current credit line utilization rate exceeds 90%, into a submodel constructed by logistic regression based on sample retail credit prediction data and credit violation probability, to calculate the credit violation probability of the sample to be predicted.
33. The method of claim 32, wherein,
The submodel is shown in equation 5 below:
Where k is the number of features entering the model, preferably k is 8;
α is the intercept term, with a preferred value range of (0.2614, 0.4061), most preferably 0.334;
β1 is the coefficient of the average point-in-time balance of deposit accounts over the past 6 months, with a preferred value range of (-0.2041, -0.1896), most preferably -0.197;
β2 is the coefficient of the number of consecutive months in which the credit card statement balance increased over the past 6 months, with a preferred value range of (-0.4365, -0.3966), most preferably -0.417;
β3 is the coefficient of the minimum credit line utilization rate of revolving credit over the past 6 months, with a preferred value range of (0.1747, 0.1912), most preferably 0.183;
β4 is the coefficient of the ratio of the credit repayment amount due in the current month to the monthly average AUM over the past 6 months, with a preferred value range of (0.2261, 0.2516), most preferably 0.239;
β5 is the coefficient of the number of months in the past 12 months in which the credit repayment rate was >80%, with a preferred value range of (-0.5716, -0.51), most preferably -0.541;
β6 is the coefficient of the maximum number of consecutive months in the past 12 months in which the credit card credit line utilization rate was >90%, with a preferred value range of (-0.3114, -0.2683), most preferably -0.290;
β7 is the coefficient of the sum of personal loan repayment amounts in the current month, with a preferred value range of (-0.3661, -0.281), most preferably -0.324;
β8 is the coefficient of the number of credit accounts whose current credit line utilization rate exceeds 90%, with a preferred value range of (-0.4713, -0.3529), most preferably -0.412;
x1 is the natural logarithm conversion value, generated in the feature conversion step, of the average point-in-time balance of deposit accounts over the past 6 months;
x2 is the WOE conversion value, generated in the feature conversion step, of the number of consecutive months in which the credit card statement balance increased over the past 6 months;
x3 is the square root conversion value, generated in the feature conversion step, of the minimum credit line utilization rate of revolving credit over the past 6 months;
x4 is the natural logarithm conversion value, generated in the feature conversion step, of the ratio of the credit repayment amount due in the current month to the monthly average AUM over the past 6 months;
x5 is the cube root conversion value, generated in the feature conversion step, of the number of months in the past 12 months in which the credit repayment rate was >80%;
x6 is the WOE conversion value, generated in the feature conversion step, of the maximum number of consecutive months in the past 12 months in which the credit card credit line utilization rate was >90%;
x7 is the original value, taken in the feature conversion step, of the sum of personal loan repayment amounts in the current month;
x8 is the square conversion value, generated in the feature conversion step, of the number of credit accounts whose current credit line utilization rate exceeds 90%.
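Equation 5 is not reproduced in this text. The following is a minimal sketch that assumes the standard logistic regression form, P = 1 / (1 + exp(-(α + Σ βi·xi))), using the most preferred coefficient values listed in this claim. The sign convention inside the sigmoid is an assumption and may differ from the patent's own equation. The function name `default_probability` is hypothetical, and the inputs are assumed to be the already converted values x1 to x8.

```python
import math

# Most preferred values from the claim.
ALPHA = 0.334
BETAS = [-0.197, -0.417, 0.183, 0.239, -0.541, -0.290, -0.324, -0.412]


def default_probability(x: list) -> float:
    """Return the sigmoid of the linear predictor alpha + sum(beta_i * x_i).

    ASSUMPTION: the standard logistic form with a positive linear predictor
    inside the sigmoid; the patent's Equation 5 is not available here.
    x must hold the eight converted feature values x1..x8 in claim order.
    """
    if len(x) != len(BETAS):
        raise ValueError(f"expected {len(BETAS)} converted features")
    z = ALPHA + sum(b * xi for b, xi in zip(BETAS, x))
    return 1.0 / (1.0 + math.exp(-z))
```

With all converted features equal to zero, the result depends only on the intercept, which makes the function easy to sanity-check.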
34. The method of claim 33, wherein,
After the credit violation probability is calculated, a credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
Wherein P is the default probability of the borrower generated by the credit violation probability calculation module; A is 443.9036; B is -72.1348; the round function rounds the calculated score to an integer value; and finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
35. The method according to any one of claims 1 to 9, wherein,
The retail credit prediction data is selected from one, two, three, four or five of the following: the number of months in the past 12 months in which a gold account was held; the average balance of investment and wealth management accounts over the past 3 months; the minimum point-in-time balance of deposit accounts over the past 3 months; the number of payroll deposits in the past 6 months; and the maximum salary in the past 12 months.
36. The method of claim 35, wherein the credit violation probability calculating step comprises:
Performing feature conversion on the number of months in the past 12 months in which a gold account was held, the average balance of investment and wealth management accounts over the past 3 months, the minimum point-in-time balance of deposit accounts over the past 3 months, the number of payroll deposits in the past 6 months, and the maximum salary in the past 12 months,
Preferably, the number of months in the past 12 months in which a gold account was held is converted in the WOE mode; the average balance of investment and wealth management accounts over the past 3 months is converted by a continuous conversion method; the minimum point-in-time balance of deposit accounts over the past 3 months is converted by a continuous conversion method; the number of payroll deposits in the past 6 months is converted in the WOE mode; and the maximum salary in the past 12 months is converted by a continuous conversion method;
Further preferably, the continuous conversion of the average balance of investment and wealth management accounts over the past 3 months takes its square root; the continuous conversion of the minimum point-in-time balance of deposit accounts over the past 3 months takes its natural logarithm; and the continuous conversion of the maximum salary in the past 12 months uses the original value.
37. The method of claim 35, wherein,
Substituting the converted values of the five features, namely the number of months in the past 12 months in which a gold account was held, the average balance of investment and wealth management accounts over the past 3 months, the minimum point-in-time balance of deposit accounts over the past 3 months, the number of payroll deposits in the past 6 months, and the maximum salary in the past 12 months, into a submodel constructed by logistic regression based on sample retail credit prediction data and credit violation probability, to calculate the credit violation probability of the sample to be predicted.
38. The method of claim 37, wherein,
The submodel is shown in equation 6 below:
Where k is the number of features entering the model, preferably k is 5;
α is the intercept term, with a preferred value range of (0.2884, 1.9611), most preferably 1.125;
β1 is the coefficient of the number of months in the past 12 months in which a gold account was held, with a preferred value range of (-0.721, -0.6281), most preferably -0.675;
β2 is the coefficient of the average balance of investment and wealth management accounts over the past 3 months, with a preferred value range of (-0.0646, -0.0125), most preferably -0.039;
β3 is the coefficient of the minimum point-in-time balance of deposit accounts over the past 3 months, with a preferred value range of (-0.1275, -0.0891), most preferably -0.108;
β4 is the coefficient of the number of payroll deposits in the past 6 months, with a preferred value range of (-0.4612, -0.2386), most preferably -0.350;
β5 is the coefficient of the maximum salary in the past 12 months, with a preferred value range of (-0.3679, -0.1707), most preferably -0.269;
x1 is the WOE conversion value, generated in the feature conversion step, of the number of months in the past 12 months in which a gold account was held;
x2 is the square root conversion value, generated in the feature conversion step, of the average balance of investment and wealth management accounts over the past 3 months;
x3 is the natural logarithm conversion value, generated in the feature conversion step, of the minimum point-in-time balance of deposit accounts over the past 3 months;
x4 is the WOE conversion value, generated in the feature conversion step, of the number of payroll deposits in the past 6 months;
x5 is the original value, taken in the feature conversion step, of the maximum salary in the past 12 months.
39. The method of claim 38, wherein,
After the credit violation probability is calculated, a credit score characterizing the borrower is calculated for the sample to be predicted using the following formula:
Wherein P is the default probability of the borrower generated by the credit violation probability calculation module; A is 443.9036; B is -72.1348; the round function rounds the calculated score to an integer value; and finally, any score above 1000 is set to 1000 and any score below 0 is set to 0.
40. An apparatus for calculating retail credit risk, comprising:
The data acquisition module is used for acquiring retail credit prediction data of a sample to be predicted;
A classification module for classifying the sample to be predicted based on a decision tree method to determine the sub-model used for calculating the credit violation probability;
A credit violation probability calculation module for substituting the retail credit prediction data into the credit violation probability sub-model to calculate the credit violation probability of the sample to be predicted, preferably the credit violation probability being a credit violation probability for an online credit service.
41. The apparatus of claim 40, wherein the apparatus performs the steps of the method of calculating retail credit risk of any of claims 1 to 39.
42. A system for calculating retail credit risk, comprising: a memory, a processor, and a program for the method of calculating retail credit risk that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method of calculating retail credit risk as recited in any one of claims 1 to 39.
43. A computer storage medium storing a program for a method of calculating retail credit risk, wherein the program, when executed by a processor, implements the steps of the method of calculating retail credit risk as recited in any one of claims 1 to 39.
44. A method of constructing a retail credit risk prediction model for an online credit service, comprising:
a data acquisition step of acquiring raw retail credit prediction data of a sample used for constructing a model;
a data derivation step of processing the original retail credit prediction data to obtain derived retail credit prediction data;
A feature preliminary screening step of preliminarily screening all categories of data, i.e., all features, including the original retail credit prediction data and the derived retail credit prediction data, to obtain preliminarily screened features;
a preliminary screening data conversion step of determining, for each preliminarily screened feature, which one of a WOE conversion mode, a dummy feature conversion mode and a continuous conversion mode is to be used, and performing feature conversion on each preliminarily screened feature in the mode determined to be optimal;
A feature fine screening step of further screening the converted, preliminarily screened features to obtain finely screened features;
A credit violation probability modeling step of constructing, by logistic regression, a model of the probabilistic relationship between the combination of finely screened features and credit violation, and thereby determining the manner of calculating the credit violation probability;
wherein the data samples obtained in the data acquisition step are samples of customers using an online credit service.
45. The method of claim 44, wherein,
In the data acquisition step, the obtained raw retail credit prediction data of the samples used for constructing the model includes:
credit card class base data, which is all available data from the sample user's credit card opening and usage,
Personal loan class base data, which is all available data on the sample user's loan applications and loan usage behavior,
Customer base information class base data, which is data on attributes of the sample user that are not directly related to the user's behavior at the financial institution,
Personal financial asset class base data, which is the sample user's data on all other financial assets and financial transactions at the financial institution that are unrelated to credit cards and loans.
46. The method of claim 44, wherein,
In the data derivation step, the derived retail credit prediction data refers to data obtained by processing the collected original retail credit prediction data along a time dimension, a space dimension, a frequency dimension, and a statistical information dimension;
preferably, the derived retail credit prediction data includes, but is not limited to:
derived retail credit prediction data processed based on the length of the sample's relationship,
Derived retail credit prediction data processed based on time interval class variables,
Derived retail credit prediction data processed based on the frequency of the sample's behavior,
Derived retail credit prediction data processed based on the sample's status at the current point in time,
Derived retail credit prediction data processed based on the sample's sustained behavior,
Derived retail credit prediction data processed from the sample data based on the statistical information dimension.
47. The method of any one of claims 44-46, wherein,
The feature preliminary screening step comprises the following steps:
A first preliminary screening step of screening the features based on the extent of missing data for each feature in the samples used to construct the model,
A second preliminary screening step of screening the features based on whether a single value of a feature accounts for too high a share of the samples,
A third preliminary screening step of screening the features by calculating the information value (IV) of each feature;
Wherein the first, second and third preliminary screening steps may be performed in any order,
A fourth preliminary screening step of screening the features that passed the first to third preliminary screening steps using a stepwise discriminant algorithm;
And a fifth preliminary screening step of screening the features that passed the fourth preliminary screening step based on whether the risk characteristics of each feature are consistent with the actual outcomes of the samples used for model construction.
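The three metrics used by the first to third preliminary screening steps can be sketched as follows. The function names are hypothetical, missing values are assumed to be represented as `None`, and the WOE sign convention (good share over bad share) is one common choice rather than something the source specifies. The source gives no cut-off thresholds, so the sketch only computes the metrics and does not decide which features to drop.

```python
import math
from collections import Counter


def missing_rate(values: list) -> float:
    """Share of observations that are missing (None) for one feature."""
    return sum(v is None for v in values) / len(values)


def top_value_share(values: list) -> float:
    """Share of non-missing observations taken by the most common value."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][1] / len(present)


def woe_and_iv(bins: list) -> tuple:
    """Compute per-bin WOE and the total IV for one binned feature.

    bins is a list of (good_count, bad_count) pairs, one per bin.
    ASSUMPTION: WOE_i = ln(good_share_i / bad_share_i); every bin must
    contain at least one good and one bad sample.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            raise ValueError("each bin needs at least one good and one bad")
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv
```

A feature whose bins all have the same good-to-bad ratio gets an IV of zero, which is why IV is used to rank how well a feature separates defaulting from non-defaulting samples.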
48. The method of any one of claims 44-47, further comprising:
a sample selection step, performed prior to the data acquisition step, of screening all users to obtain samples of users of an online credit service for model construction,
Preferably, the sample selection step includes classifying all users based on a decision tree, the classification criteria including but not limited to:
Whether the user is a customer who has applied to register for credit services at a financial institution;
Whether the user is a customer with stable income;
Whether the user is a new customer or an existing customer;
The geographic area in which the user conducts business.
49. The method of any one of claims 44 to 48, wherein in the preliminary screening data conversion step, the conversion mode of each preliminarily screened feature is determined based on the concentration and data type of the feature.
50. The method of claim 49, wherein,
The preliminary screening data conversion step based on the determination of concentration and data type comprises:
classifying each feature by data type into character-type variables and numerical-type variables,
Performing preliminary screening data conversion on the character-type variables in the dummy feature conversion mode,
Further classifying the numerical-type variables by the following sub-steps:
If a numerical-type variable takes fewer than n distinct values, performing preliminary screening data conversion in the WOE conversion mode,
If a numerical-type variable takes more than n distinct values, treating it as a continuous variable with many values and further determining: if the concentration of a single value is greater than m%, adopting the WOE conversion mode; if the concentration of a single value is less than or equal to m%, adopting the continuous conversion mode,
Preferably, n and m are both positive integers, where n = 5 to 10 and m = 90 to 99.
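The decision rule of this claim can be sketched as a small function. The function name `choose_conversion` and the returned labels are hypothetical. The defaults n = 10 and m = 95 are illustrative picks from the preferred ranges, not values stated by the source. The claim does not say what happens when a variable takes exactly n distinct values; this sketch assumes that case goes to the concentration check.

```python
from collections import Counter


def choose_conversion(values: list, n: int = 10, m: int = 95) -> str:
    """Pick a conversion mode for one feature from its non-missing values.

    Returns "dummy", "woe" or "continuous" (hypothetical labels).
    ASSUMPTION: exactly n distinct values is treated like more than n.
    """
    # Character-type variables use the dummy feature conversion mode.
    if any(isinstance(v, str) for v in values):
        return "dummy"
    # Numerical variables with few distinct values use WOE.
    if len(set(values)) < n:
        return "woe"
    # Otherwise check how concentrated the most common single value is.
    top_share_pct = max(Counter(values).values()) / len(values) * 100
    return "woe" if top_share_pct > m else "continuous"
```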
51. The method of claim 50, further comprising:
For each feature determined to use the continuous conversion mode, selecting an optimal conversion method based on the correlation between the feature and credit violation under different continuous conversion methods, and performing continuous feature conversion of the feature accordingly,
The continuous feature conversion is preferably performed by using the original value directly, or by calculating the square, the square root, the cube root, or the natural logarithm of the original data.
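One way to read this selection is to convert the feature with each candidate function and keep the one whose result correlates most strongly with the default flag. The sketch below assumes Pearson correlation and picks the largest absolute value; the source says only "correlation", so both choices are assumptions. The names `CANDIDATES`, `pearson` and `best_transform` are hypothetical, and feature values are assumed to be strictly positive so that the logarithm and square root are defined.

```python
import math

# Candidate continuous conversions listed in the claim.
CANDIDATES = {
    "identity": lambda v: v,
    "square": lambda v: v * v,
    "sqrt": math.sqrt,
    "cbrt": lambda v: math.copysign(abs(v) ** (1.0 / 3.0), v),
    "log": math.log,
}


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def best_transform(values: list, target: list) -> tuple:
    """Return (name, scores) where name has the largest |correlation|.

    ASSUMPTION: "optimal" means highest absolute Pearson correlation
    between the converted feature and the default indicator.
    """
    scores = {
        name: abs(pearson([fn(v) for v in values], target))
        for name, fn in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores
```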
52. The method of any one of claims 44 to 51, wherein the feature fine screening step comprises:
A first fine screening step of screening the features, based on a stepwise regression algorithm, according to the significance of the features as determined by the F test and the T test,
A second fine screening step of calculating a variance inflation factor (VIF) for each feature and eliminating features with a high variance inflation factor,
And a third fine screening step of analyzing, by logistic regression on the features remaining after the first and second fine screening steps, whether each feature's coefficient is consistent with the expected trend in predicting credit violation, so as to further screen the features.
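The variance inflation factor used in the second fine screening step can be sketched without external libraries by using the known identity that the VIF of each feature equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The helper names are hypothetical, and the source gives no elimination threshold, so the sketch only computes the factors.

```python
import math


def _corr_matrix(cols: list) -> list:
    """Pearson correlation matrix of equal-length feature columns."""
    n = len(cols[0])
    centered = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(cols)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def _invert(matrix: list) -> list:
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(cols: list) -> list:
    """Variance inflation factor of each feature column."""
    inverse = _invert(_corr_matrix(cols))
    return [inverse[i][i] for i in range(len(cols))]
```

For two features with correlation r, both factors reduce to 1 / (1 - r^2), so uncorrelated features give a VIF of 1 and nearly collinear features give a very large one.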
53. The method of any one of claims 44 to 52, wherein the credit violation probability modeling step substitutes the features selected by the feature fine screening step into a Sigmoid function for logistic regression to obtain a model for calculating the credit violation probability.
54. An apparatus for constructing a retail credit risk prediction model for an online credit service, the apparatus comprising:
A data acquisition module for acquiring raw retail credit prediction data for a sample used to build a model;
a data derivation module for processing the original retail credit prediction data to obtain derived retail credit prediction data;
a feature preliminary screening module for preliminarily screening all categories of data, i.e., all features, including the original retail credit prediction data and the derived retail credit prediction data, to obtain preliminarily screened features;
A preliminary screening data conversion module for determining, for each preliminarily screened feature, which one of a WOE conversion mode, a dummy feature conversion mode and a continuous conversion mode is to be used, and performing feature conversion on each preliminarily screened feature in the mode determined to be optimal;
A feature fine screening module for further screening the converted, preliminarily screened features to obtain finely screened features;
A credit violation probability modeling module for constructing, by logistic regression, a model of the probabilistic relationship between the combination of finely screened features and credit violation, and thereby determining the manner of calculating the credit violation probability,
Wherein the data samples acquired by the data acquisition module are samples of customers using an online credit service.
55. The apparatus of claim 54, wherein the apparatus performs the steps of the method of constructing a retail credit risk prediction model of any one of claims 44 to 53.
56. A system for constructing a retail credit risk prediction model for an online credit service, the system comprising: a memory, a processor, and a program for the method of constructing a retail credit risk prediction model that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method of constructing a retail credit risk prediction model as claimed in any one of claims 44 to 53.
CN202211325815.8A 2022-10-27 2022-10-27 Method for constructing retail credit risk prediction model and online credit service Scoredelta model Pending CN117994017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211325815.8A CN117994017A (en) 2022-10-27 2022-10-27 Method for constructing retail credit risk prediction model and online credit service Scoredelta model

Publications (1)

Publication Number Publication Date
CN117994017A true CN117994017A (en) 2024-05-07

Family

ID=90897864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211325815.8A Pending CN117994017A (en) 2022-10-27 2022-10-27 Method for constructing retail credit risk prediction model and online credit service Scoredelta model

Country Status (1)

Country Link
CN (1) CN117994017A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination