CN110866819A

CN110866819A - Automatic credit scoring card generation method based on meta-learning

Info

Publication number: CN110866819A
Application number: CN201910991618.1A
Authority: CN
Inventors: 尹昌; 王靖文; 仵伟强; 周金黄; 钟丽莉; 万谊强
Original assignee: Huarong Fusion (beijing) Technology Co Ltd
Current assignee: Huarong Fusion (beijing) Technology Co Ltd
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2020-03-06

Abstract

The invention discloses an automatic credit scoring card generation method based on meta-learning, comprising the following steps. First, the data are cleaned and normalized, association relations among the multiple tables of internal and external data are established through the association variables between data tables, and test data are distinguished from training data. Second, the meta-features of the input data set are calculated, data sets similar to the input data set are identified, and a search space is initialized from the parameter configurations associated with those similar data sets. Third, a parameter tuning strategy based on the Hyperband method is performed, sampling the search space to generate a set of parameter samples. Fourth, each variable is binned; a WOE transformation is applied to the input data based on the binning result, the training data are fitted by logistic regression, and the final scoring formula of the scoring card is constructed from the fitted parameters. The invention reduces the manual effort required in the modeling process, improves modeling efficiency, and ensures the prediction accuracy of the model.

Description

Automatic credit scoring card generation method based on meta-learning

Technical Field

The invention relates to an automatic credit scoring card generation method based on meta-learning, in particular to an automatic modeling method for pre-loan application scoring cards. It belongs to the field of machine learning and data mining, and specifically concerns a modeling method based on the logistic regression algorithm and automated machine learning.

Background

In the financial field, whether in investment and financing or in lending, risk control is always the foundation of the business. Consumer finance serves customers characterized by small loan amounts, a large population, and short loan terms, which makes it widely recognized as one of the highest-risk market segments. As technologies such as artificial intelligence and big data continue to spread, using financial technology to actively collect, analyze, and organize financial data, and to provide more precise risk control for specific customer segments, has become an effective way to address risk control in consumer finance. Anti-fraud is a key part of risk control; a failure in the anti-fraud process can cause severe economic losses. Existing anti-fraud strategies are generated by manual, experience-based judgment. However, with the rapid growth in the number of applicants and the continuing expansion of the dimensionality of application data, purely manual methods find it increasingly difficult to produce effective anti-fraud strategies. With the development of artificial intelligence and the arrival of the data era, data-driven methods are expected to become the mainstream approach to generating anti-fraud strategies.

The credit scoring card model is an advanced technique in consumer credit management and one of the core management technologies of businesses involved in consumer credit, such as banks, credit card companies, personal consumer credit companies, telecommunication companies, and insurance companies. It is widely applied in credit card life cycle management, automobile loan management, housing loan management, personal loan management, and other consumer credit management, and it plays an important role in marketing, credit approval, risk management, account management, and customer relationship management.

The credit scoring model uses data mining and statistical analysis to systematically analyze large volumes of data such as consumers' demographic characteristics, credit history, and transaction records. It mines the behavior patterns and credit characteristics contained in the data, captures the relationship between historical information and future credit performance, builds a predictive model, and summarizes a consumer's expected future credit performance as a credit score. Credit scoring is essentially a pattern recognition classification problem that divides businesses or individual consumers into two categories: those who repay on schedule ("good" customers) and those who default ("bad" customers). Based on historical samples of each category, the characteristics of defaulters and non-defaulters are identified from the known data and summarized into classification rules, and a machine learning model is built to measure the default risk (or default probability) of the borrower and to provide a basis for consumer credit decisions.

Because logistic regression models are interpretable, building the scoring card model with logistic regression is currently the common solution in the industry. A typical modeling flow comprises the following steps.

First, problem preparation: the definitions of defaulting and normal users, the scope and sources of the data, and so on are determined from the historical performance of the specific credit product.

Second, data preparation: the data required for modeling are acquired at this stage. Besides application data and internal enterprise data, external data such as credit bureau data and external scoring data can be acquired. As the number of external data sources grows, selecting the most suitable and most valuable external sources becomes more important. To examine the data and understand their characteristics, modelers usually carry out a series of exploratory data analysis (EDA) tasks: evaluating the univariate statistics of candidate predictor variables and the distribution of their values over the variable range; calculating the default rate distribution within each category or segment of each candidate predictor; and determining the relations among variables through contingency tables, association tables, correlation measures, and the like.

Third, data preprocessing: a great deal of data cleaning and transformation is generally needed to determine a single data set containing all elements required to develop the scoring card and to create predictors, or independent variables, with strong predictive power. Weight of evidence (WOE) transformation is a data preparation step specific to scoring card development. All variables used in the scoring card must undergo WOE transformation: for categorical variables the cardinality must be reduced, and for numerical variables the continuous values must be segmented. This amounts to a coarse classing of all variables in the data.

Fourth, variable selection: the result of data preparation is a modeling view containing many candidate independent variables, but not all of them can be used in the model in practice. Most credit providers have abundant data and therefore need to screen the more predictive variables from hundreds or even thousands of candidates.

Fifth, model development: the standard scoring card is based on a logistic regression model. Logistic regression is essentially an extension of linear regression; it predicts default status by fitting the WOE-transformed independent variables, from which the final score is obtained.

Sixth, model validation: the constructed prediction model must meet several basic requirements. It must reach an acceptable level of accuracy; it must be robust and adapt to a wider range of data sets; and it must be possible to check the model in terms of business variables and predicted values over time. The constructed model therefore needs to undergo multiple validation checks.

A traditional high-quality manual modeling process relies heavily on human involvement, including understanding of the data, substantial domain expertise, and sufficient modeling experience. Data preparation, feature engineering, and similar tasks also consume a great deal of time and effort. As hardware computing speed increases and machine learning algorithms improve, the demand for data across industries grows, and the requirements for processing and analyzing data become increasingly strict.

To solve these problems, an automatic credit scoring card modeling tool is provided that combines automated machine learning with the credit scoring card. It uses meta-learning, an automatic feature derivation module, and related methods to carry out the processes of scoring card modeling, automating data processing, feature engineering, model selection, hyperparameter tuning, model building, and scoring card construction. This greatly reduces the time and effort users spend on repetitive, time-consuming work and improves modeling efficiency. It also lowers the barrier to modeling, so that a well-performing scoring card model can be built without machine learning expertise and the business can be better developed.

Disclosure of Invention

In view of the existing problems, the invention provides an automatic credit scoring card generation method based on meta-learning. The method builds on automated machine learning theory, incorporates practical business experience, and applies machine learning algorithms to the pre-loan scoring card business scenario. Given an input data set that satisfies a specified format and contains association relations, it automatically performs data preprocessing, feature engineering, hyperparameter optimization, model selection, and related functions, and outputs a binary classification prediction for the data. A scoring model is then established using the scoring card functions, and the key files generated during modeling are compiled automatically into a scoring card report.

To achieve this purpose, the automatic credit scoring card generation method based on meta-learning adopts the following technical scheme.

The invention is essentially a process for constructing a credit scoring business logic model based on automated machine learning; its core is the automatic execution of each stage of scoring card modeling. The main stages are data preprocessing, meta-learner construction, controller construction, and scoring card construction, and the specific implementation process is shown in FIG. 1.

The first step: data preprocessing. The following operations are carried out: (1) identifying the type of each variable in each table; (2) finding the association variables and association relations among the tables, where the relations are of four types: one-to-one, one-to-many, many-to-one, and many-to-many; (3) identifying and separating the training data and the test data in the input data; (4) applying missing value handling, outlier detection, and data standardization appropriate to each variable type in the data.

The input data set contains a main table and several associated tables, with timestamped variables. The associated tables hold valuable auxiliary information about the instances in the main table and can be used to improve the predictive performance of the model. Any two tables (main or associated) may be related, and any pair of tables has at most one relation. In the preprocessing stage, the data are cleaned and normalized, association relations among the multiple tables of internal and external data are then established through the association variables between data tables, and test data are distinguished from training data.
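As a minimal sketch of the type-dependent cleaning described above, the following Python fragment imputes and standardizes one numerical column and fills one categorical column. The specific choices (median imputation, z-score standardization, a `MISSING` token) and the function names are illustrative assumptions; the source does not specify which treatments are applied to which variable type.

```python
import statistics

def preprocess_numeric(values):
    """Impute missing values (None) with the median, then z-score standardize.

    Median imputation and z-scoring are assumed choices for illustration only.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)
    # A constant column has zero spread; map it to zeros instead of dividing by 0.
    return [(v - mean) / std if std > 0 else 0.0 for v in filled]

def preprocess_categorical(values, missing_token="MISSING"):
    """Replace missing categories with an explicit token (hypothetical choice)."""
    return [missing_token if v is None else v for v in values]

standardized = preprocess_numeric([1.0, None, 3.0])
categories = preprocess_categorical(["a", None, "b"])
```

Treating a missing category as its own level keeps the information that the value was absent, which the later binning step can then merge or keep as a separate bin.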

The second step: constructing the meta-learner. The meta-features of the input data set are calculated, the meta-learner identifies data sets similar to the input data set, and the search space is initialized from the parameter configurations of those similar data sets.

The meta-learner provides experience-based hyperparameter guidance, initializes the range of the subsequent parameter space search, and improves tuning efficiency by "warm starting" the tuning process. For a given business scenario, the meta-learner records different data sets, their meta-feature data, and the parameter configurations under which the model performed well. Given the meta-features of the data set to be modeled, it finds similar data sets with the K-Nearest Neighbors method and forms the parameter search space from the parameter configurations of those similar data sets. The specific process is as follows:

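The distance between data set meta-feature vectors is first calculated as the Euclidean distance:

$$d(x, x_i) = \sqrt{\sum_{j=1}^{m} \left(x^{(j)} - x_i^{(j)}\right)^2}$$

where $x$ is the meta-feature vector of the input data set, $x_i$ is that of the $i$-th recorded data set, and $m$ is the number of meta-features.

The $k$ recorded data sets nearest to $x$ form the similar sample set $N_k(x)$. The parameter set is obtained by the following formula: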

The parameters provided by the meta-learner include the depth `max_depth` of the subsequent feature synthesis, the feature synthesis primitives `agg_primitives`, and the like.
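The warm-start lookup can be sketched as follows. This is an illustration under stated assumptions, not the patented formula: the knowledge base contents are invented, and the rule for combining neighbor configurations (the min-max range of `max_depth` and the union of `agg_primitives`) is an assumption, since the source's combining formula is not reproduced in this text. In practice the meta-features would likely need scaling before distances are computed.

```python
import math

# Hypothetical knowledge base: meta-feature vectors of past data sets paired
# with a parameter configuration that performed well on each of them.
KNOWLEDGE_BASE = [
    {"meta": [0.003, 1.0, 0.5], "config": {"max_depth": 2, "agg_primitives": ["mean", "max"]}},
    {"meta": [0.004, 0.9, 0.5], "config": {"max_depth": 4, "agg_primitives": ["skew", "min"]}},
    {"meta": [0.200, 0.1, 0.9], "config": {"max_depth": 6, "agg_primitives": ["count"]}},
]

def euclidean(a, b):
    """Euclidean distance between two meta-feature vectors of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def initial_search_space(meta, knowledge_base, k=2):
    """Build a search space from the k nearest recorded data sets.

    Assumed combining rule: depth range spans the neighbors' depths,
    and the primitive set is the union of the neighbors' primitives.
    """
    nearest = sorted(knowledge_base, key=lambda r: euclidean(meta, r["meta"]))[:k]
    depths = [r["config"]["max_depth"] for r in nearest]
    primitives = sorted({p for r in nearest for p in r["config"]["agg_primitives"]})
    return {"max_depth": [min(depths), max(depths)], "agg_primitives": primitives}

space = initial_search_space([0.003, 1.0, 0.5], KNOWLEDGE_BASE, k=2)
```

With these invented records the two nearest neighbors are the first two entries, so the third, dissimilar data set contributes nothing to the search space.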

The third step: constructing the controller. The controller carries out a parameter optimization strategy based on the Hyperband method: it samples the search space to generate a set of parameter samples, screens the parameter samples according to model performance in each evaluation round, and selects the parameter combination that gives the best model performance. Each evaluation round includes constructing features with Featuretools, screening features with a random forest, training the model with LightGBM, and evaluating the performance of the parameters.

The controller's Hyperband-based tuning strategy extends the Successive Halving algorithm proposed by Jamieson & Talwalkar (2015). Successive Halving works as follows: given n hyperparameter combinations, the budget is distributed evenly among them and each is evaluated on validation data; the worse-performing half is eliminated according to the validation results, and the process is repeated until the final best combination is found. Hyperband improves on this idea: it completes the tuning under a resource constraint and allocates more resources in each round to the hyperparameter combinations that perform better.

The output of the meta-learner is taken as the initial search space of Hyperband. The search space is sampled to generate a set of parameter samples, the samples are screened according to model performance in each evaluation round, and the search space is narrowed step by step until the parameter combination that gives the best model performance is selected. The evaluation of each parameter sample comprises feature synthesis, feature screening, and related operations.

Feature synthesis is based on Featuretools, a framework for automatic feature generation that converts a data set into a feature matrix usable for machine learning. Deep feature synthesis stacks multiple transformation and aggregation operations, called feature primitives in Featuretools terminology, to construct new features from data spread across multiple tables. In this way all information about a user can be combined in one table; for example, with the user name as the association variable, features can be derived from the user's application information, loan history, credit card repayment records, and so on.

Featuretools can automatically derive a large number of features, some of which may have little modeling value, and a data set with too many features is prone to overfitting. After training, a random forest yields an importance value for each feature; a threshold is determined from these importances, the features most helpful for model training are selected, and the model is trained on the screened variables. The prediction model is then trained and evaluated with LightGBM, and the parameters are screened according to the model's predictive performance. Finally, the controller outputs the feature synthesis depth `max_depth`, the feature synthesis primitives `agg_primitives`, and the screened feature variables that gave the best model performance during evaluation.
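The elimination loop at the heart of this strategy can be sketched as below. This covers only a single Successive Halving bracket; full Hyperband runs several such brackets with different trade-offs between the number of configurations and the budget per configuration. The search space values, `sample_configs`, and `toy_evaluate` are hypothetical stand-ins: in the described method each evaluation would perform feature synthesis, feature screening, and LightGBM training, with the budget controlling how many resources that evaluation may use.

```python
import random

PRIMITIVES = ["skew", "mode", "max", "mean", "min"]

def sample_configs(n, rng):
    """Draw n random configurations from a (hypothetical) search space."""
    return [
        {
            "max_depth": rng.randint(2, 4),
            "agg_primitives": rng.sample(PRIMITIVES, rng.randint(1, len(PRIMITIVES))),
        }
        for _ in range(n)
    ]

def toy_evaluate(config, budget):
    # Hypothetical stand-in for "synthesize features, screen them, train and
    # validate a model with the given budget". This toy score ignores the budget.
    return -abs(config["max_depth"] - 2) - 0.01 * len(config["agg_primitives"])

def successive_halving(configs, evaluate, min_budget=1):
    """Keep the better half each round, doubling the budget for the survivors."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        budget *= 2  # survivors receive more resources in the next round
    return survivors[0]

configs = sample_configs(8, random.Random(0))
best = successive_halving(configs, toy_evaluate)
```

Starting from 8 samples, the loop keeps 4, then 2, then 1, so weak configurations are discarded after cheap evaluations and only the promising ones are evaluated at higher budgets.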

The fourth step: constructing the scoring card. The scoring card module first bins each variable. A WOE transformation is then applied to the input data based on the binning result, the training data are fitted by logistic regression, and the final scoring formula of the scoring card is constructed from the fitted parameters.

The purpose of binning is to discretize continuous numerical variables into segments and to merge the categories of categorical variables that have many levels. Chi-square binning is a bottom-up, merge-based discretization method that relies on the chi-square test: the pair of adjacent bins with the smallest chi-square value is merged, and this is repeated until a stopping criterion is met.
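A minimal sketch of the merge loop follows, assuming each initial bin is summarized by its (good, bad) user counts and that the stopping criterion is a maximum number of bins. The source leaves the stopping criterion open, so `max_bins` and the example counts are assumptions for illustration.

```python
def chi2_pair(a, b):
    """Chi-square statistic for two adjacent bins, each given as (good, bad) counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for col in (0, 1):
            expected = sum(row) * (a[col] + b[col]) / total
            if expected > 0:
                chi2 += (row[col] - expected) ** 2 / expected
    return chi2

def chi_merge(bins, max_bins):
    """Repeatedly merge the adjacent pair with the smallest chi-square value."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i : i + 2] = [merged]  # replace the pair with the merged bin
    return bins

# Hypothetical initial bins, ordered by the variable's value.
merged_bins = chi_merge([(90, 10), (88, 12), (60, 40), (58, 42), (20, 80)], max_bins=3)
```

Adjacent bins with similar good/bad proportions have a small chi-square value and are merged first, so the surviving boundaries separate groups with clearly different default behavior.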

Based on the binning results, a logistic regression model is constructed using the weight of evidence (WOE) of the bin to which each variable value belongs. The expression of WOE is:

$$WOE_i = \ln\left(\frac{P_{good}}{P_{bad}}\right)$$

where $P_{good}$ is the proportion of good users and $P_{bad}$ is the proportion of bad users.
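The WOE computation can be illustrated as below, under two assumptions that the text does not state explicitly: WOE takes the common form ln(P_good / P_bad), and for each bin P_good is the share of all good users falling in that bin (likewise P_bad for bad users). Every bin is assumed to contain at least one good and one bad user; the counts are hypothetical.

```python
import math

def woe_per_bin(bins):
    """Return the WOE of each bin, given (good, bad) user counts per bin.

    Assumes WOE = ln(P_good / P_bad), where P_good (P_bad) is the bin's share
    of all good (bad) users, and that no count is zero.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    return [math.log((good / total_good) / (bad / total_bad)) for good, bad in bins]

# Hypothetical binned counts for one variable.
woe_values = woe_per_bin([(178, 22), (118, 82), (20, 80)])
```

Under this convention a positive WOE marks a bin where good users are over-represented and a negative WOE marks a bin where bad users are over-represented.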

The result of the model regression is:

$$\ln(odds) = \beta_0 + \beta_1 WOE_1 + \dots + \beta_n WOE_n$$

The score of the scoring card can be expressed as a linear function of the log odds:

$$Score_{total} = A + B \ln(odds)$$

Given PDO, the increase in score when the odds double:

$$Score_{total} + PDO = A + B \ln(2 \cdot odds)$$

The base score A and the scoring coefficient B are obtained by solving these two equations simultaneously.
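Subtracting the first equation from the second gives PDO = B ln 2, so B = PDO / ln 2; A then follows from any reference point at which a chosen score corresponds to chosen odds. The sketch below works this through; the reference values (600 points at odds of 50, PDO of 20) and the function name `calibrate` are hypothetical and chosen only for illustration.

```python
import math

def calibrate(base_score, base_odds, pdo):
    """Solve Score = A + B*ln(odds) and Score + PDO = A + B*ln(2*odds) for A and B."""
    b = pdo / math.log(2)                   # from subtracting the two equations
    a = base_score - b * math.log(base_odds)  # from the reference point
    return a, b

# Hypothetical calibration: 600 points at odds 50, 20 points to double the odds.
A, B = calibrate(base_score=600, base_odds=50, pdo=20)
```

With this calibration, odds of 100 map to 620 points and odds of 25 map to 580 points, since each doubling or halving of the odds moves the score by exactly PDO.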

The final score of the scoring card is obtained from the logistic regression result as follows:

$$Score_{total} = A + B\left(\beta_0 + \beta_1 WOE_1 + \dots + \beta_n WOE_n\right)$$

where A and B are known constants, $\beta_0, \dots, \beta_n$ are the logistic regression coefficients, and $WOE_i$ is the WOE value of the bin to which variable i belongs.
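Because the formula is linear, the total score splits into a base term A + B·β₀ plus one term B·βᵢ·WOEᵢ per variable, which is what allows a scoring card to list points per bin. The following sketch shows that decomposition; all numeric values and the function name `score_breakdown` are hypothetical.

```python
def score_breakdown(a, b, intercept, coefs, woes):
    """Split Score = A + B*(b0 + sum(b_i * WOE_i)) into base and per-variable points."""
    base_points = a + b * intercept
    variable_points = [b * beta * woe for beta, woe in zip(coefs, woes)]
    return base_points, variable_points

# Hypothetical constants, coefficients, and WOE values for one applicant.
base, points = score_breakdown(a=500.0, b=20.0, intercept=1.5,
                               coefs=[0.8, 1.2], woes=[0.5, -0.25])
total = base + sum(points)
```

Here the base contributes 530 points, the first variable adds 8, and the second subtracts 6, giving the same total as evaluating the formula directly.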

The scoring card module evaluates the model against several criteria, such as:

(1) Discrimination

KS value: KS = max(TPR - FPR). The larger the KS value, the stronger the model's ability to discriminate between positive and negative samples.
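A minimal sketch of the KS computation follows, assuming labels of 1 for positive and 0 for negative samples and scores where a higher value means "more likely positive". It sweeps every distinct score as a threshold and records the largest gap between the true positive rate and the false positive rate; the function name and the example data are illustrative.

```python
def ks_statistic(labels, scores):
    """Return max(TPR - FPR) over all score thresholds.

    labels: 1 = positive sample, 0 = negative sample (both classes must occur).
    scores: higher score means the sample is predicted more likely positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all samples tied at this score before measuring the gap.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        best = max(best, tp / n_pos - fp / n_neg)
        i = j
    return best

ks_perfect = ks_statistic([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
ks_useless = ks_statistic([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

A model that separates the classes perfectly reaches a KS of 1, while a model that assigns every sample the same score has a KS of 0.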

(2) Accuracy

Confusion matrix, ROC curve, and AUC value: the higher the AUC value, the more accurate the model's risk prediction.

(3) Stability of the model

PSI: measures the change in the population distribution across score intervals over time; the smaller the change, the more stable the model.
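The source does not give the PSI formula; the sketch below uses the commonly used definition, PSI = Σ (aᵢ - eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the shares of the reference (for example training) and comparison (for example test) populations in score interval i. The counts are hypothetical, and every interval is assumed to be non-empty in both populations.

```python
import math

def psi(expected_counts, actual_counts):
    """Population stability index between two binned score distributions.

    Uses the common definition sum((a - e) * ln(a / e)) over score intervals;
    assumes every interval has a non-zero count in both populations.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share, a_share = e / e_total, a / a_total
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

psi_same = psi([100, 200, 300], [10, 20, 30])   # identical shares
psi_shift = psi([100, 100], [50, 150])          # population moved between intervals
```

Identical distributions give a PSI of 0, and every term of the sum is non-negative, so any shift between intervals increases the index.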

The scoring card module can generate model report files such as the binning results, the scoring results, and the model performance evaluation report.

Compared with the prior art, the automatic credit scoring card generation method based on meta-learning has the following advantages and effects: (1) each modeling step is carried out with automated machine learning, which simplifies the modeling process and reduces the manual effort it requires; (2) a meta-learner initializes the parameter tuning search space, and this warm start saves tuning time and improves modeling efficiency; (3) the feature synthesis parameters are evaluated with the Hyperband tuning method, which ensures the prediction accuracy of the model.

Drawings

FIG. 1 is a flow chart of the automatic scoring card generation method of the present invention.

FIG. 2 is the ROC curve from the scoring card model evaluation.

FIG. 3 is the PSI plot from the scoring card model evaluation.

FIG. 4 is the KS plot from the scoring card model evaluation.

Table 1 shows the meta-feature calculation results (partial).

Table 2 shows the parameter space output results.

Table 3 shows the controller feature generation results (partial).

Table 4 shows the variable binning results (partial).

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and an example. The disclosed embodiment is merely exemplary of the invention and does not limit its full scope. All other embodiments that a person skilled in the art can derive from the embodiment given herein without creative effort fall within the protection scope of the present invention.

Example:

The specific implementation process of the present invention is explained below, taking the credit scoring card application data of a certain finance company as an example.

First, data preprocessing

The input data set typically comprises an application main table and several associated tables that provide external data. By reading the variables of each table, the preprocessing module carries out the following operations: (1) identifying the type of each variable in the tables, such as categorical, numerical, or timestamp; (2) finding the association variables and association relations among the tables, where the relations are of four types: one-to-one, one-to-many, many-to-one, and many-to-many; (3) identifying and separating the training data and the test data in the input data; (4) applying missing value handling, outlier detection, data standardization, and similar operations appropriate to each variable type in the data.

Second, constructing the meta-learner

The meta-learner first calculates the meta-features of the input data and then initializes the search space from the parameter configurations of the similar data sets recorded in the meta-learner. The meta-feature calculation results for the input data are shown in Table 1, and the parameter space output by the meta-learner is shown in Table 2.

Meta-feature name	Meta-feature value
attr_to_inst	0.003
cat_to_num	1.0
freq_class.mean	0.5
inst_to_attr	333.3
nr_attr	6
nr_cat	3
nr_class	2
nr_inst	2000
nr_num	3
num_to_cat	1.0

TABLE 1
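The entries of Table 1 are simple ratios of counts, as the sketch below illustrates. The interpretation of each name (for example `attr_to_inst` as attributes per instance and `freq_class.mean` as the mean relative class frequency) is an assumption inferred from the names and the values in Table 1, and the class split used here is hypothetical.

```python
from collections import Counter

def simple_meta_features(n_numeric, n_categorical, labels):
    """Compute basic data set meta-features from column counts and class labels.

    The meaning assigned to each meta-feature name is an assumed reading of Table 1.
    """
    nr_inst = len(labels)
    nr_attr = n_numeric + n_categorical
    class_freqs = [count / nr_inst for count in Counter(labels).values()]
    return {
        "nr_inst": nr_inst,
        "nr_attr": nr_attr,
        "nr_num": n_numeric,
        "nr_cat": n_categorical,
        "nr_class": len(class_freqs),
        "attr_to_inst": nr_attr / nr_inst,
        "inst_to_attr": nr_inst / nr_attr,
        "cat_to_num": n_categorical / n_numeric if n_numeric else float("nan"),
        "num_to_cat": n_numeric / n_categorical if n_categorical else float("nan"),
        "freq_class.mean": sum(class_freqs) / len(class_freqs),
    }

# 3 numerical and 3 categorical attributes, 2000 instances, hypothetical class split.
features = simple_meta_features(3, 3, [0] * 1800 + [1] * 200)
```

With 6 attributes and 2000 instances this reproduces the ratios in Table 1; note that for a two-class problem the mean class frequency is 0.5 regardless of how imbalanced the classes are.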

Parameter name	Parameter value
max_depth	[2,4]
agg_primitives	['skew','mode','max','mean','min']

TABLE 2

Third, constructing the controller

The controller carries out a parameter optimization strategy based on the Hyperband method: it samples the search space to generate a set of parameter samples, screens the parameter samples according to model performance in each evaluation round, and selects the parameter combination that gives the best model performance. Each evaluation round includes constructing features with Featuretools, screening features with a random forest, training the model with LightGBM, and evaluating the performance of the parameters. In the evaluation, the parameter configuration `max_depth` = 2 with `agg_primitives` = ['skew', 'mean'] performed best. The output of the feature engineering part is shown in Table 3.

TABLE 3

Fourth, constructing the scoring card

The scoring card module first bins each variable; partial binning results are shown in Table 4. A WOE transformation is then applied to the input data based on the binning result, the training data are fitted by logistic regression, and the final scoring formula of the scoring card is constructed from the fitted parameters. The evaluation plots of the scoring card model are shown in FIG. 2, FIG. 3, and FIG. 4.

TABLE 4

According to the binning results and the model evaluation results, the scoring card model reaches a predicted AUC of 0.7805, a PSI of 0.0276 between the training data and the test data, and a KS value of 0.2125. The model thus meets the requirements for risk prediction accuracy, model stability, and the ability to discriminate between positive and negative samples.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the technical scope of the present invention, so that any minor modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the technical scope of the present invention.

Claims

1. An automatic credit scoring card generation method based on meta-learning, characterized in that the method comprises the following steps:

the first step: data preprocessing: cleaning and normalizing the data, then establishing association relations among the multiple tables of internal and external data through the association variables between data tables, and distinguishing test data from training data;

the second step: constructing a meta-learner: calculating the meta-features of an input data set, identifying data sets similar to the input data set with the meta-learner, and initializing a search space from the parameter configurations of those similar data sets;

the third step: constructing a controller: the controller carries out a parameter tuning strategy based on the Hyperband method, samples the search space to generate a set of parameter samples, screens the parameter samples according to model performance in each evaluation round, and selects the parameter combination that gives the best model performance; each evaluation round comprises constructing features with Featuretools, screening the features with a random forest, training a model with LightGBM, and evaluating the performance of the parameters;

the fourth step: constructing a scoring card: the scoring card module first bins each variable, then applies a WOE transformation to the input data based on the binning result, fits the training data by logistic regression, and constructs the final scoring formula of the scoring card from the fitted parameters.

2. The automatic credit scoring card generation method based on meta-learning of claim 1, wherein the data preprocessing comprises: (1) identifying the type of each variable in each table; (2) finding the association variables and association relations among the tables, where the relations are of four types: one-to-one, one-to-many, many-to-one, and many-to-many; (3) identifying and separating the training data and the test data in the input data; (4) applying missing value handling, outlier detection, and data standardization appropriate to each variable type in the data.

3. The automatic credit scoring card generation method based on meta-learning of claim 1, wherein the specific process of identifying the data sets similar to the input data set and initializing the search space from the corresponding parameter configurations is as follows:

4. The automatic credit scoring card generation method based on meta-learning of claim 1, wherein, in the fourth step, the variables are binned automatically by the chi-square binning method, that is, the adjacent bins with the smallest chi-square value are merged until a stopping criterion is met.

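5. The automatic credit scoring card generation method based on meta-learning of claim 1, wherein, in the fourth step, based on the binning result, a logistic regression model is constructed using the weight of evidence (WOE) of the bin to which each variable value belongs; the expression of WOE is:

$$WOE_i = \ln\left(\frac{P_{good}}{P_{bad}}\right)$$

where $P_{good}$ is the proportion of good users and $P_{bad}$ is the proportion of bad users;

the result of the model regression is:

$$\ln(odds) = \beta_0 + \beta_1 WOE_1 + \dots + \beta_n WOE_n$$

the score of the scoring card is expressed as a linear function of the log odds:

$$Score_{total} = A + B \ln(odds)$$

given PDO, the increase in score when the odds double:

$$Score_{total} + PDO = A + B \ln(2 \cdot odds)$$

the base score A and the scoring coefficient B are obtained by solving these two equations simultaneously;

the final score of the scoring card is:

$$Score_{total} = A + B\left(\beta_0 + \beta_1 WOE_1 + \dots + \beta_n WOE_n\right)$$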