CN111861705A

CN111861705A - Financial wind control logistic regression feature screening method and system

Info

Publication number: CN111861705A
Application number: CN202010662195.1A
Authority: CN
Inventors: 林建明
Original assignee: Shenzhen Wuyu Technology Co ltd
Current assignee: Shenzhen Wuyu Technology Co ltd
Priority date: 2020-07-10
Filing date: 2020-07-10
Publication date: 2020-10-30

Abstract

The invention discloses a financial risk-control logistic regression feature screening method and system. The feature screening method comprises the following steps: acquiring P initial variables; generating N initial variable combinations from the acquired initial variables to form N data models; splitting 1 to n parts of data from the data set to serve as evaluation data sets; evaluating each evaluation data set with each of the N data models to obtain the evaluation data of each evaluation data set under each data model; for each evaluation data set, selecting the top-m ranked data models according to the evaluation data obtained from the N data models; judging whether any data model appears in the top-m ranking of every evaluation data set, and if exactly one such data model exists, taking it as the optimal data model; and if at least two such models exist, selecting the best model among the initial variable combinations according to the evaluation index. The method helps ensure that the model generalizes relatively well to future data.

Description

Financial wind control logistic regression feature screening method and system

Technical Field

The invention belongs to the technical field of data processing and relates to a data screening method, in particular to a financial risk-control logistic regression feature screening method and system.

Background

One of the core goals of internet financial risk control is to keep customers' overdue risk within an optimal range. Building machine learning models on users' past behavior data to predict customers' future risk in different scenarios is one of the main means of implementing risk control. In the internet financial risk control scenario, the primary objective of the model is to predict the overdue risk of borrowers over a future period, so whether the model generalizes well across different business scenarios is a main concern of modelers. The generalization ability of a model is closely related to the variable combination the modelers select. This patent therefore focuses on how to select variable combinations with good generalization ability for the logistic regression model, one of the techniques commonly used in the internet finance field.

The Logistic model is one of the main techniques of internet financial risk control. One difficulty in applying it in practice is to screen out a suitable group of variables by comparing the merits of different subsets of a large number of variables, and to build a stable model with good discriminative and generalization ability. At present, research on Logistic regression variable screening focuses mainly on improving the discriminative ability of the model, that is, on reducing the bias of model predictions and pursuing unbiased estimation. For example, many scholars study how omitting or including independent variables that strongly influence the dependent variable affects the prediction accuracy of the model.

In practice, depending on the selection criterion, stepwise regression considers only the significance of the coefficient estimates (based on the p-value) or the goodness of fit of the model (based on the AIC criterion and R²), and gives insufficient weight to the generalization ability of the model. The improved AICc method avoids overfitting to some extent by penalizing the number of included variables, and the BIC criterion introduces a Bayesian approach that imposes a penalty using the posterior distribution of the samples. However, selection by AIC, BIC, and similar criteria is a discrete, unordered process in which variables are either retained or discarded; it often exhibits high variance and therefore cannot effectively reduce the prediction error of the model. Regularization methods essentially add a penalty function to the residual sum of squares to reduce overfitting. However, when the model has many variables, shrinking the coefficients through regularization cannot guarantee the accuracy of the model. Moreover, regularization imposes the same penalty on every regression coefficient, so the degree of coefficient shrinkage cannot be well controlled.

Moreover, the above methods for selecting Logistic regression variables all operate on training set data, so the selected variables mostly reflect patterns in the training data. Reflecting the patterns of one particular data distribution is not sufficient in the field of internet financial risk control, for two reasons. First, under the influence of factors such as policies, markets, and customer channels, the distribution of internet financial customer groups fluctuates widely and changes quickly, and model engineers care about the long-term performance of a model as well as its short-term effect. Second, in industrial practice a company often reuses a model developed for one business scenario in other business scenarios, according to its actual business, and expects it to generalize well. Therefore, in the internet financial risk control scenario, the generalization ability of the Logistic model across different business scenarios and over longer time periods becomes more important.

In view of the above, there is an urgent need to design a new feature screening method to overcome at least some of the above-mentioned disadvantages of the existing feature screening methods.

Disclosure of Invention

The invention provides a financial risk-control logistic regression feature screening method and system, which help ensure that a model generalizes relatively well to future data.

In order to solve the technical problem of the above variable selection, according to one aspect of the present invention, the following technical solutions are adopted:

a financial risk-control logistic regression feature screening method comprises the following steps:

step S1, acquiring all features available for modeling as initial variables;

step S2, randomly generating N initial variable combinations from the initial variables obtained in step S1 by random sampling without replacement, to form N initial data models; the N initial data models are used for evaluating the contribution of variables to the models;

step S3, splitting 1 to n parts of data from a set data set as evaluation data sets;

step S4, evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model; the evaluation indexes comprise KS and AUC, and if the monotonicity of risk across model score groups is of concern, that monotonicity is also used as an evaluation index;

step S5, each of the N data models having an evaluation result on each of the n evaluation data sets, selecting the top-m ranked data models for each evaluation data set;

step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer;

step S7, on the basis of the variable combination corresponding to the optimal data model f₁ selected in step S6, adding the remaining variables one by one for modeling, and selecting the model f₂ with the best generalization ability according to the logic of step S6; the remaining variables are all variables other than those of the optimal data model f₁;

step S8, judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the evaluation result of f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.

According to another aspect of the invention, the following technical scheme is adopted: a financial risk-control logistic regression feature screening method comprises the following steps:

step S1, acquiring all features available for modeling as initial variables;

step S2, generating N initial variable combinations from the initial variables obtained in step S1 to form N data models;

step S4, evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model;

step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations according to the evaluation index; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer.

As an embodiment of the present invention, the method further comprises:

step S7, on the basis of the variable combination corresponding to the optimal data model selected in step S6, adding the remaining variables one by one for modeling, and selecting the model f₂ with the best generalization ability according to the logic of step S6; the remaining variables are all variables other than those of the model f₁;

As an embodiment of the present invention, in step S2, N initial variable combinations are randomly generated from the initial variables obtained in step S1 by random sampling without replacement.

As an embodiment of the present invention, in step S6, if at least two such models exist, the optimal model among the initial variable combinations is selected according to the evaluation ranking average ESA and the evaluation ranking standard deviation ESSD.

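As an embodiment of the present invention, the evaluation ranking average ESA and the evaluation ranking standard deviation ESSD are calculated from the rankings s_i (the formulas are not reproduced in this text), wherein n denotes the number of evaluation data sets and s_i denotes the model's rank by evaluation index on the i-th evaluation data set.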
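Because the formulas themselves are missing from this text, the following is only one reading consistent with the names "ranking average" and "ranking standard deviation" and with the definitions of n and s_i given above. The choice of the population form (division by n rather than n - 1) is an assumption, not something the text confirms.

```latex
\mathrm{ESA} = \frac{1}{n}\sum_{i=1}^{n} s_i,
\qquad
\mathrm{ESSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(s_i - \mathrm{ESA}\right)^{2}}
```

Under this reading, a model ranked 2nd, 3rd and 2nd on three evaluation data sets has ESA = 7/3, and a small ESSD indicates that its ranking is stable across the evaluation data sets.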

According to another aspect of the invention, the following technical scheme is adopted: a financial risk-control logistic regression feature screening system, the feature screening system comprising:

an initial variable acquisition module, used for acquiring all features available for modeling as initial variables;

an initial data model forming module, used for randomly generating N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used for evaluating the contribution of variables to the models;

an evaluation data set selection module, used for splitting 1 to n parts of data from the set data set as evaluation data sets;

a data set evaluation module, used for evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model; the evaluation indexes comprise KS and AUC, and if the monotonicity of risk across model score groups is of concern, that monotonicity is also used as an evaluation index;

a data model selection module, used, each of the N data models having an evaluation result on each of the n evaluation data sets, for selecting the top-m ranked data models for each evaluation data set;

an optimal data model acquisition module, used for judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer;

a generalization-optimal model generation module, used for adding the remaining variables one by one for modeling on the basis of the variable combination corresponding to the optimal data model f₁ selected by the optimal data model acquisition module, and selecting the model f₂ with the best generalization ability according to the logic of the optimal data model acquisition module; the remaining variables are all variables other than those of the optimal data model f₁; and

a recursive modeling module, used for judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the evaluation result of the generalization-optimal model f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.

a data set evaluation module, used for evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model;

a data model selection module, used, each of the N data models having an evaluation result on each of the n evaluation data sets, for selecting the top-m ranked data models for each evaluation data set; and

an optimal data model acquisition module, used for judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer.

As an embodiment of the present invention, the system further includes a generalization-optimal model generation module, used for adding the remaining variables one by one for modeling on the basis of the variable combination corresponding to the optimal data model f₁ selected by the optimal data model acquisition module, and selecting the model f₂ with the best generalization ability according to the logic of the optimal data model acquisition module; the remaining variables are all variables other than those of the optimal data model f₁.

As an embodiment of the present invention, the system further includes a recursive modeling module, used for judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the evaluation result of the generalization-optimal model f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.

The invention has the following beneficial effects: by setting multiple evaluation data sets, the financial risk-control logistic regression feature screening method and system ensure that the selected variable combinations perform well on multiple data sets with different distributions, which makes it more probable that the model generalizes relatively well to future data. Meanwhile, the comprehensive multi-data-set evaluation indexes ESA and ESSD make it possible to effectively select a model with good evaluation results across the evaluation data sets. In addition, the invention can find an initial variable combination with good discriminative and generalization ability through random variable combinations of different proportions (the more random draws are made, the greater the probability of finding the optimal initial variable combination), and, by then adding variables one by one for modeling, can obtain a better variable combination than other variable selection schemes.

Drawings

FIG. 1 is a flowchart of a financial risk-control logistic regression feature screening method according to an embodiment of the present invention.

FIG. 2 is a flowchart of a financial risk-control logistic regression feature screening method according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of the composition of a financial risk-control logistic regression feature screening system according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

For a further understanding of the invention, preferred embodiments are described below by way of example; it is to be understood that the description is intended to further illustrate the features and advantages of the invention, not to limit the scope of the claims.

The description in this section covers only several exemplary embodiments, and the present invention is not limited to the scope of the embodiments described. Interchanging some features of the embodiments with the same or similar prior-art means also falls within the scope of the present disclosure and protection.

The invention discloses a financial risk-control logistic regression feature screening method. FIG. 1 and FIG. 2 are flowcharts of the method in one embodiment of the invention; referring to FIG. 1 and FIG. 2, the feature screening method includes:

Step S1, acquiring all features available for modeling as initial variables;

step S2, generating N initial variable combinations from the initial variables obtained in step S1, to form N data models;

In one embodiment of the present invention, in step S2, N initial variable combinations are randomly generated from the initial variables obtained in step S1 by random sampling without replacement.
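Step S2 can be illustrated with a minimal sketch. The function name, the subset size range (min_size, max_size), the fixed seed, and the rule that duplicate combinations are discarded are all assumptions made for illustration; the text only requires that each combination be drawn by random sampling without replacement.

```python
import random

def sample_variable_combinations(variables, n_combinations, min_size, max_size, seed=0):
    """Draw distinct variable subsets; each subset is sampled without replacement."""
    rng = random.Random(seed)
    combos = set()
    attempts = 0
    max_attempts = n_combinations * 100  # guard against asking for more subsets than exist
    while len(combos) < n_combinations and attempts < max_attempts:
        attempts += 1
        size = rng.randint(min_size, max_size)
        # random.sample never repeats an element, so no variable occurs twice in a subset
        combo = tuple(sorted(rng.sample(variables, size)))
        combos.add(combo)  # the set silently drops a combination drawn twice
    return sorted(combos)

# Hypothetical variable names x1..x10
variables = [f"x{i}" for i in range(1, 11)]
combos = sample_variable_combinations(variables, n_combinations=5, min_size=3, max_size=6)
for combo in combos:
    print(combo)
```

Each returned tuple would then be fitted as one logistic regression model, giving the N initial data models.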

Step S3, splitting 1 to n parts of data from the data set to be used as evaluation data sets;

in one embodiment, the data set refers to all the data that can be obtained, from which 1 to n parts are split off to serve as evaluation data sets. The split can follow the timeline on which the data was generated, or be made randomly, according to business requirements.
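The two splitting modes mentioned above can be sketched as follows. The record layout, the field name apply_date, and the boundary dates are hypothetical; the text does not specify how the timeline is cut.

```python
import random
from datetime import date

def split_by_time(records, boundaries):
    """Split records into len(boundaries) + 1 consecutive time windows.

    boundaries must be sorted; a record falls into window i when exactly
    i boundaries are on or before its date.
    """
    windows = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        idx = sum(rec["apply_date"] >= b for b in boundaries)
        windows[idx].append(rec)
    return windows

def split_randomly(records, n_parts, seed=0):
    """Shuffle a copy of the records and deal them into n_parts groups."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]

records = [
    {"id": 1, "apply_date": date(2020, 1, 15)},
    {"id": 2, "apply_date": date(2020, 2, 20)},
    {"id": 3, "apply_date": date(2020, 3, 5)},
    {"id": 4, "apply_date": date(2020, 3, 31)},
]
eval_sets = split_by_time(records, [date(2020, 2, 1), date(2020, 3, 1)])
random_sets = split_randomly(records, n_parts=2)
```

A time-based split is the more natural choice when the concern is how the model behaves as the customer population drifts.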

Step S4, evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model.

In one embodiment, the evaluation indexes include KS, AUC, and the like; if the monotonicity of risk across model score groups is of concern, that monotonicity is also used as an evaluation index.
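For reference, the two named indexes can be computed as in the sketch below, which uses their standard definitions and plain Python. It assumes labels are coded 1 for overdue and 0 for not overdue and that a higher score means higher predicted risk; the quadratic-time AUC is chosen for readability, not speed.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative
    (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def ks(labels, scores):
    """KS statistic: largest gap between the cumulative score distributions
    of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
print(round(auc(labels, scores), 4))  # 0.8889
print(round(ks(labels, scores), 4))   # 0.6667
```

Applying these functions to every (data model, evaluation data set) pair yields the N by n table of evaluation data that the later ranking steps work from.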

Step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations according to the evaluation index; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer. In one embodiment, the values of m and o may be chosen as required; for example, m may be 10 and o may be 5. Of course, m and o may take other values.
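The top-m intersection with range expansion can be sketched as below. The function name and the representation of rankings as best-first lists of model identifiers are assumptions for illustration; o is taken to be a positive integer, as the text requires.

```python
def select_common_models(rankings, m, o):
    """Find models that appear in the top-`cutoff` list of every evaluation set.

    rankings: one list per evaluation data set, model ids ordered best first.
    Starts with cutoff = m and widens by o until the intersection is non-empty.
    Returns (sorted common models, final cutoff).
    """
    n_models = min(len(r) for r in rankings)
    cutoff = m
    while True:
        common = set(rankings[0][:cutoff])
        for r in rankings[1:]:
            common &= set(r[:cutoff])
        # Stop when a common model is found or the cutoff already covers every model
        if common or cutoff >= n_models:
            return sorted(common), cutoff
        cutoff += o

# Hypothetical rankings of four models on three evaluation data sets
rankings = [
    ["A", "B", "C", "D"],
    ["C", "D", "A", "B"],
    ["D", "C", "B", "A"],
]
common, cutoff = select_common_models(rankings, m=1, o=1)
print(common, cutoff)  # ['C'] 3
```

When the returned list holds a single model it is f₁ directly; when it holds several, a further comparison between them is needed.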

In an embodiment of the present invention, if at least two such models exist, the optimal model among the initial variable combinations is selected according to the evaluation ranking average ESA and the evaluation ranking standard deviation ESSD. In one embodiment of the present invention, ESA and ESSD are calculated from the model's rankings on the n evaluation data sets (the formulas are not reproduced in this text).
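A minimal sketch of choosing between several candidate models by ESA and ESSD follows. Two points are assumptions rather than statements of the text: ESA and ESSD are taken to be the mean and the population standard deviation of a model's ranks, and candidates are compared by lower ESA first with lower ESSD as the tie-breaker, since the text does not say how the two indexes are combined.

```python
from statistics import mean, pstdev

def esa_essd(ranks):
    """Return (ranking average, ranking standard deviation) for one model."""
    return mean(ranks), pstdev(ranks)

def pick_best(candidate_ranks):
    """candidate_ranks: model id -> list of ranks on each evaluation set (1 = best).

    Tuples compare element by element, so ESA decides and ESSD breaks ties.
    """
    return min(candidate_ranks, key=lambda k: esa_essd(candidate_ranks[k]))

# Hypothetical ranks of two candidate models on three evaluation data sets
candidates = {"A": [1, 5, 3], "C": [2, 3, 2]}
best = pick_best(candidates)
print(best)  # C
```

Here model C wins because its average rank (7/3) is lower than that of model A (3), and its ranks are also more stable across the evaluation data sets.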

In an embodiment of the invention, the method further comprises:

step S7, on the basis of the variable combination corresponding to the optimal data model selected in step S6, adding the remaining variables one by one for modeling, and selecting the model f₂ with the best generalization ability according to the logic of step S6; the remaining variables are all variables other than those of the model f₁;

step S8, judging, according to the comprehensive evaluation indexes of the model on the multiple evaluation sets, whether the evaluation result of f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.
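The recursion of steps S7 and S8 can be sketched as a greedy loop. Several parts are simplifications and assumptions: evaluate is a hypothetical callback that fits a model on the given variables and returns one score per evaluation data set (higher is better); the best candidate in each round is chosen by mean score rather than by the full step S6 logic; and "most" is read as a strict majority of evaluation data sets.

```python
def forward_add(base_vars, remaining_vars, evaluate):
    """Greedily add variables while the best candidate improves on most evaluation sets."""
    current = list(base_vars)
    remaining = list(remaining_vars)
    current_scores = evaluate(current)
    while remaining:
        # S7: try each remaining variable on top of the current combination
        trials = {v: evaluate(current + [v]) for v in remaining}
        best_var = max(trials, key=lambda v: sum(trials[v]) / len(trials[v]))
        best_scores = trials[best_var]
        # S8: keep the variable only if the score rises on a strict majority of sets
        improved = sum(new > old for new, old in zip(best_scores, current_scores))
        if improved * 2 <= len(current_scores):
            break
        current.append(best_var)
        remaining.remove(best_var)
        current_scores = best_scores
    return current, current_scores

# Toy stand-in for model fitting: each variable adds a fixed amount to the score
# on each of three evaluation sets (made-up numbers, for illustration only).
GAIN = {
    "x1": [0.30, 0.28, 0.31],
    "x2": [0.05, 0.04, 0.06],
    "x3": [0.02, -0.01, 0.01],
    "x4": [-0.01, -0.02, 0.01],
}

def toy_evaluate(vars_):
    return [sum(GAIN[v][i] for v in vars_) for i in range(3)]

selected, final_scores = forward_add(["x1"], ["x2", "x3", "x4"], toy_evaluate)
print(selected)  # ['x1', 'x2', 'x3']
```

In the toy run, x2 improves all three sets, x3 improves two of three, and x4 improves only one, so the loop stops before adding x4.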

The invention also discloses a financial risk-control logistic regression feature screening system. FIG. 3 is a schematic diagram of the composition of the system in one embodiment of the invention; referring to FIG. 3, the feature screening system includes: an initial variable acquisition module 1, an initial data model forming module 2, an evaluation data set selection module 3, a data set evaluation module 4, a data model selection module 5, and an optimal data model acquisition module 6.

The initial variable acquisition module 1 is used for acquiring all features available for modeling as initial variables.

The initial data model forming module 2 is used for randomly generating N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used for evaluating the contribution of variables to the model.

The evaluation data set selection module 3 is used for splitting 1 to n parts of data from the set data set as evaluation data sets.

The data set evaluation module 4 is used for evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model.

The data model selection module 5 is used, each of the N data models having an evaluation result on each of the n evaluation data sets, for selecting the top-m ranked data models for each evaluation data set.

The optimal data model acquisition module 6 is used for judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer.

In an embodiment of the present invention, the system further includes a generalization-optimal model generation module 7, used for adding the remaining variables one by one for modeling on the basis of the variable combination corresponding to the optimal data model f₁ selected by the optimal data model acquisition module, and selecting the model f₂ with the best generalization ability according to the logic of the optimal data model acquisition module; the remaining variables are all variables other than those of the optimal data model f₁.

In an embodiment of the present invention, the system further includes a recursive modeling module 8, used for judging, according to the comprehensive evaluation indexes of the model on the multiple evaluation sets, whether the evaluation result of the generalization-optimal model f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.

In summary, by setting multiple evaluation data sets, the financial risk-control logistic regression feature screening method and system provided by the invention ensure that the selected variable combinations perform well on multiple data sets with different distributions, which makes it more probable that the model generalizes relatively well to future data. Meanwhile, the comprehensive multi-data-set evaluation indexes ESA and ESSD make it possible to effectively select a model with good evaluation results across the evaluation data sets. In addition, the invention can find an initial variable combination with good discriminative and generalization ability through random variable combinations of different proportions (the more random draws are made, the greater the probability of finding the optimal initial variable combination), and, by then adding variables one by one for modeling, can obtain a better variable combination than other variable selection schemes.

The general flow of financial risk-control modeling is: sample selection; data preprocessing; feature engineering; variable selection; modeling. As is well known, a logistic regression model learns a linear combination of features and predicts new data from what it has learned. However, the field of financial risk control places relatively high requirements on model interpretability, so features are generally not over-derived. For machine learning models, the data and features determine the upper bound of the model, so to some extent the combination of variables selected to build the logistic regression model determines its ultimate performance. Conventional variable selection methods such as correlation analysis, multicollinearity checks, and stepwise regression do not address well the generalization and monotonicity requirements placed on models in the financial field. Selecting variables according to their actual performance in the model gives good control over the model's final performance.
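The monotonicity requirement mentioned above can be checked with a sketch like the following. The text does not define how monotonicity of grouped risk is measured, so the approach here is an assumption: samples are sorted by model score, cut into near-equal groups, and the overdue rate per group is required to be non-decreasing or non-increasing. Labels are assumed to be 1 for overdue and 0 otherwise, with at least as many samples as groups.

```python
def bin_bad_rates(labels, scores, n_bins):
    """Sort samples by score, cut into n_bins near-equal groups, return the bad rate per group."""
    ordered = [y for _, y in sorted(zip(scores, labels))]
    size = len(ordered) / n_bins
    rates = []
    for b in range(n_bins):
        group = ordered[round(b * size): round((b + 1) * size)]
        rates.append(sum(group) / len(group))
    return rates

def is_monotonic(values):
    """True when the sequence never decreases or never increases."""
    inc = all(a <= b for a, b in zip(values, values[1:]))
    dec = all(a >= b for a, b in zip(values, values[1:]))
    return inc or dec

labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rates = bin_bad_rates(labels, scores, n_bins=3)
print([round(r, 3) for r in rates])  # [0.0, 0.667, 1.0]
print(is_monotonic(rates))           # True
```

A model whose group bad rates rise and then fall would fail this check even if its KS and AUC were acceptable.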

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The description and applications of the invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. Effects or advantages referred to in the embodiments may not be reflected in the embodiments due to interference of various factors, and the description of the effects or advantages is not intended to limit the embodiments. Variations and modifications of the embodiments disclosed herein are possible, and alternative and equivalent various components of the embodiments will be apparent to those skilled in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, proportions, and with other components, materials, and parts, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims

1. A financial risk-control logistic regression feature screening method is characterized by comprising the following steps:

step S1, acquiring all features available for modeling as initial variables;

2. A financial risk-control logistic regression feature screening method is characterized by comprising the following steps:

step S1, acquiring all features available for modeling as initial variables;

step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations according to the evaluation index; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer.

3. The method of claim 2, wherein the method comprises:

the method further comprises:

step S8, judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the evaluation result of f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.

4. The method of claim 2, wherein the method comprises:

in step S2, N initial variable combinations are randomly generated from the initial variables acquired in step S1 by random sampling without replacement.

5. The method of claim 2, wherein the method comprises:

in step S6, if at least two such models exist, the optimal model among the initial variable combinations is selected according to the evaluation ranking average ESA and the evaluation ranking standard deviation ESSD.

6. The method of claim 5, wherein the method comprises:

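in step S6, the evaluation ranking average ESA is given by [formula not reproduced in this text];

the evaluation ranking standard deviation ESSD is given by [formula not reproduced in this text].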

7. A financial risk-control logistic regression feature screening system, the feature screening system comprising:

an optimal data model acquisition module, used for judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f₁; if at least two such data models exist, selecting the optimal data model f₁ among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range to the top m + o ranked data models, and repeating until the optimal data model f₁ is selected; wherein o is a positive integer;

8. A financial risk-control logistic regression feature screening system, the feature screening system comprising:

9. The system of claim 8, wherein the financial risk-control logistic regression feature screening system is further configured as follows:

the system also comprises a generalization-optimal model generation module, used for adding the remaining variables one by one for modeling on the basis of the variable combination corresponding to the optimal data model f₁ selected by the optimal data model acquisition module, and selecting the model f₂ with the best generalization ability according to the logic of the optimal data model acquisition module; the remaining variables are all variables other than those of the optimal data model f₁.

10. The system of claim 8, wherein the financial risk-control logistic regression feature screening system is further configured as follows:

the system also comprises a recursive modeling module, used for judging, according to the comprehensive evaluation indexes of the model on the multiple evaluation sets, whether the evaluation result of the generalization-optimal model f₂ is improved on most of the evaluation data sets; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data sets.