CN111861705A - Financial risk control logistic regression feature screening method and system - Google Patents

Info

Publication number
CN111861705A
Authority
CN
China
Prior art keywords
data
evaluation
model
models
optimal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010662195.1A
Other languages
Chinese (zh)
Inventor
林建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wuyu Technology Co ltd
Original Assignee
Shenzhen Wuyu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wuyu Technology Co ltd
Priority to CN202010662195.1A
Publication of CN111861705A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention discloses a financial risk-control logistic regression feature screening method and system. The feature screening method comprises the following steps: acquiring P initial variables; generating N initial variable combinations from the acquired initial variables to form N data models; splitting 1 to n portions of data from the data set to serve as evaluation data sets; evaluating each evaluation data set with each of the N data models to obtain the evaluation data of each evaluation data set under each data model; for each evaluation data set, selecting the top-m ranked data models according to the evaluation data obtained by the N data models; judging whether any data model appears in the top-m ranked data models of every evaluation data set, and if exactly one such data model exists, taking it as the optimal data model; and if at least two exist, selecting the optimal model among the initial variable combinations according to the evaluation indexes. The method better ensures that the model generalizes relatively well to future data.

Description

Financial risk control logistic regression feature screening method and system
Technical Field
The invention belongs to the technical field of data processing, relates to a data screening method, and in particular relates to a financial risk control logistic regression feature screening method and system.
Background
One of the core goals of internet financial risk control is to keep customers' overdue risk within an optimal range. One of the main means of implementing risk control is to use the past behavior data of users and apply machine learning techniques to build models that predict customers' future risk in different scenarios. In the internet financial risk control scenario, the primary objective of a model is to predict the overdue risk of borrowing customers over a future period, so whether the model generalizes well across different business scenarios is a main concern of modelers. The generalization capability of a model is closely related to the variable combination selected by the modelers. This patent therefore focuses on how to select variable combinations with good generalization capability for the logistic regression model, one of the techniques commonly used in internet finance.
The Logistic model is one of the main techniques of internet financial risk control. One difficulty in applying it in practice is to screen out a suitable group of variables by comparing the merits of different subsets of the many available variables, and to build a stable model with good discriminative power and generalization capability. At present, research on variable screening for Logistic regression mainly focuses on how to improve the discriminative power of the model, that is, how to reduce the bias of model predictions and pursue unbiased estimation. For example, many scholars study how omitting or including independent variables that strongly affect the dependent variable influences the prediction accuracy of the model.
In practice, depending on the selection criterion, stepwise regression considers only the significance of the estimated variable coefficients (based on the p-value) or the goodness of fit of the model (based on the AIC criterion and R²), and does not sufficiently consider the generalization capability of the model. The improved AICc method avoids overfitting to some extent by imposing a penalty on the number of included variables, and the BIC criterion introduces a Bayesian approach that imposes a penalty using the posterior distribution of the samples. However, selection by AIC, BIC and similar criteria is a discrete, unordered process in which a variable is either retained or discarded; it often exhibits high variance and therefore cannot effectively reduce the prediction error of the model. Regularization essentially adds a penalty function to the residual sum of squares to reduce overfitting. However, when the model is a large model with many variables, shrinking coefficients through regularization cannot guarantee the accuracy of the model. In addition, regularization imposes the same penalty on every regression coefficient, so the degree of coefficient shrinkage cannot be well controlled.
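For reference, the information criteria named above are commonly written as follows. These are the standard textbook forms, not formulas given in the patent; the symbols k (number of estimated parameters), n_obs (number of training samples) and L-hat (maximized likelihood) are introduced here only for illustration.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n_{\text{obs}} - k - 1}, \qquad
\mathrm{BIC} = k\ln n_{\text{obs}} - 2\ln\hat{L}
```

All three trade goodness of fit on the training data against model size, which is consistent with the point made here that they say little about performance on data with a different distribution.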
Moreover, the above methods of selecting variables for a Logistic regression model all rely on training set data. The selected variables mostly reflect objective patterns in the training data, but capturing the patterns of a single data distribution is not sufficient in the field of internet financial risk control. The reasons are as follows. First, under the influence of factors such as policy, markets and customer channels, the distribution of internet financial customer groups is subject to large fluctuations and rapid change, so model engineers care not only about the short-term effect of a model but also about its long-term performance. Second, in industrial practice, a company often applies a model developed for one business scenario to other business scenarios according to its actual business, and expects it to generalize well. Therefore, in the internet financial risk control scenario, the generalization capability of the Logistic model across different business scenarios and over longer time periods becomes all the more important.
In view of the above, there is an urgent need for a new feature screening method that overcomes at least some of the above disadvantages of existing feature screening methods.
Disclosure of Invention
The invention provides a financial risk-control logistic regression feature screening method and system, which can better ensure that a model has relatively good generalization capability on future data.
To solve the above technical problem of variable selection, according to one aspect of the present invention, the following technical solution is adopted:
a financial risk-control logistic regression feature screening method comprises the following steps:
step S1, acquiring all features available for modeling as initial variables;
step S2, randomly generating N initial variable combinations from the initial variables obtained in step S1 by random sampling without replacement, to form N initial data models; the N initial data models are used to evaluate the contribution of the variables to the model;
step S3, splitting 1 to n portions of data from a given data set to serve as evaluation data sets;
step S4, evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model; the evaluation indexes include KS and AUC, and if the risk monotonicity of the model's score groups is of concern, that monotonicity is also used as an evaluation index;
step S5, each of the N data models having evaluation results on the n evaluation data sets, selecting, for each evaluation data set, the top-m ranked data models;
step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f1; if at least two exist, selecting the optimal data model f1 among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets share no common data model, expanding the selection range to the top (m + o) ranked data models, until an optimal data model f1 is selected; wherein o is a positive integer;
step S7, on the basis of the variable combination corresponding to the optimal data model f1 selected in step S6, adding the remaining variables one at a time for modeling, and selecting the model f2 with the best generalization capability using the logic of step S6; the remaining variables are all variables other than those in the optimal data model f1;
step S8, judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether f2 improves the evaluation result on most of the evaluation data; if so, recursively adding the remaining variables one at a time for modeling, until the evaluation results on most of the evaluation data no longer improve over the previous round.
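Steps S4 and S5 amount to ranking the N models separately on each evaluation set and keeping the best m per set. The following is a minimal sketch, assuming the metric values (for example KS) are already computed and that higher is better; the function names and the layout of `scores` are illustrative and not taken from the patent.

```python
def rank_per_evaluation_set(scores):
    """scores[model_id] is a list of metric values, one per evaluation set
    (higher is better). Returns one ranking (best first) per evaluation set."""
    n_sets = len(next(iter(scores.values())))
    rankings = []
    for i in range(n_sets):
        # Sort by descending score; break ties by model id for determinism.
        ordered = sorted(scores, key=lambda mid: (-scores[mid][i], mid))
        rankings.append(ordered)
    return rankings


def top_m(rankings, m):
    """Keep only the m best models of each evaluation set."""
    return [ranking[:m] for ranking in rankings]


# Hypothetical KS values of three models on two evaluation sets.
example_scores = {"m1": [0.40, 0.35], "m2": [0.42, 0.30], "m3": [0.38, 0.37]}
example_rankings = rank_per_evaluation_set(example_scores)
example_top2 = top_m(example_rankings, 2)
```

Here `m1` is the only model that appears in the top two of both evaluation sets, which is the situation step S6 looks for.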
According to another aspect of the invention, the following technical solution is adopted: a financial risk-control logistic regression feature screening method comprises the following steps:
step S1, acquiring all features available for modeling as initial variables;
step S2, generating N initial variable combinations from the initial variables obtained in step S1, to form N data models;
step S3, splitting 1 to n portions of data from a given data set to serve as evaluation data sets;
step S4, evaluating each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model;
step S5, each of the N data models having evaluation results on the n evaluation data sets, selecting, for each evaluation data set, the top-m ranked data models;
step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f1; if at least two exist, selecting the optimal data model f1 among the initial variable combinations according to the evaluation index; if the top-m ranked models of the evaluation data sets share no common data model, expanding the selection range to the top (m + o) ranked data models, until an optimal data model f1 is selected; wherein o is a positive integer.
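The selection rule in step S6 can be sketched as an intersection of the per-set top-m lists that widens by o whenever it is empty. The function name, the returned tuple and the guard that stops once all models are included are assumptions made for illustration.

```python
def select_common_models(rankings, m, o):
    """rankings: one list per evaluation set, model ids ordered best first.
    Returns (sorted common model ids, cutoff that was finally used)."""
    n_models = min(len(r) for r in rankings)
    cutoff = m
    while True:
        cutoff = min(cutoff, n_models)
        common = set(rankings[0][:cutoff])
        for ranking in rankings[1:]:
            common &= set(ranking[:cutoff])
        # Stop when some model is shared, or when every model is already included.
        if common or cutoff >= n_models:
            return sorted(common), cutoff
        cutoff += o  # widen from top-m to top-(m + o), as in step S6


# Hypothetical rankings of four models on three evaluation sets.
demo_rankings = [
    ["a", "b", "c", "d"],
    ["c", "d", "a", "b"],
    ["d", "c", "b", "a"],
]
demo_common, demo_cutoff = select_common_models(demo_rankings, m=1, o=1)
```

If more than one model is returned, a tie-break such as ESA and ESSD is still needed to pick a single f1.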
As an embodiment of the present invention, the method further comprises:
step S7, on the basis of the variable combination corresponding to the optimal data model f1 selected in step S6, adding the remaining variables one at a time for modeling, and selecting the model f2 with the best generalization capability using the logic of step S6; the remaining variables are all variables other than those in the model f1;
step S8, judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether f2 improves the evaluation result on most of the evaluation data; if so, recursively adding the remaining variables one at a time for modeling, until the evaluation results on most of the evaluation data no longer improve over the previous round.
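Steps S7 and S8 describe a greedy forward addition over the remaining variables. The sketch below is a simplified reading: instead of re-running the full step S6 ranking, it keeps the candidate that improves the largest number of evaluation sets and stops when no candidate improves a strict majority of them. `evaluate` is a hypothetical callback that fits a model on the given variables and returns one score per evaluation set (higher is better); the toy gains are invented for the example.

```python
def forward_addition(base_vars, remaining_vars, evaluate):
    """Greedily add variables while a majority of evaluation sets improve."""
    current = list(base_vars)
    current_scores = evaluate(current)
    remaining = list(remaining_vars)
    while remaining:
        best_var, best_scores, best_wins = None, None, 0
        for var in remaining:
            scores = evaluate(current + [var])
            wins = sum(new > old for new, old in zip(scores, current_scores))
            # Keep the candidate that improves the most evaluation sets.
            if wins > best_wins:
                best_var, best_scores, best_wins = var, scores, wins
        # Stop once no candidate improves a strict majority of the sets.
        if best_var is None or best_wins * 2 <= len(current_scores):
            break
        current.append(best_var)
        remaining.remove(best_var)
        current_scores = best_scores
    return current


# Toy stand-in for model fitting: each variable shifts the score on each set.
toy_gains = {"a": [0.05, 0.04, 0.03], "b": [0.01, -0.02, -0.01], "c": [0.02, 0.02, -0.01]}


def toy_evaluate(variables):
    return [0.5 + sum(toy_gains.get(v, [0, 0, 0])[i] for v in variables) for i in range(3)]


selected = forward_addition(["x"], ["a", "b", "c"], toy_evaluate)
```

In this toy run, `a` and then `c` are added, and `b` is rejected because it improves only one of the three evaluation sets.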
As an embodiment of the present invention, in step S2, the N initial variable combinations are randomly generated from the initial variables obtained in step S1 by random sampling without replacement.
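A minimal sketch of this sampling step, assuming each combination is a subset whose size is drawn between two bounds; `min_size`, `max_size`, the seed and the variable names are illustrative assumptions, not values from the patent.

```python
import random


def random_variable_combinations(variables, n_combinations, min_size, max_size, seed=0):
    """Draw distinct variable subsets; each subset is sampled without replacement."""
    rng = random.Random(seed)
    combos = set()
    attempts, max_attempts = 0, n_combinations * 100  # avoid looping forever
    while len(combos) < n_combinations and attempts < max_attempts:
        attempts += 1
        size = rng.randint(min_size, max_size)
        # random.sample never picks the same variable twice within one draw.
        combos.add(tuple(sorted(rng.sample(variables, size))))
    return sorted(combos)


# Hypothetical feature names.
demo_variables = ["age", "income", "n_loans", "utilization", "tenure"]
demo_combos = random_variable_combinations(demo_variables, 4, min_size=2, max_size=3)
```

Each tuple in `demo_combos` would then be used to fit one initial logistic regression model.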
As an embodiment of the present invention, in step S6, if at least two such data models exist, the optimal model among the initial variable combinations is selected according to the evaluation ranking average (ESA) and the evaluation ranking standard deviation (ESSD).
As an embodiment of the present invention, ESA and ESSD are defined by the following formulas:
[Formula images BDA0002579015670000031 and BDA0002579015670000032: definitions of ESA and ESSD]
where n denotes the number of evaluation data sets, and s_i denotes the rank of the model's evaluation index on the i-th evaluation data set.
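The formula images themselves are not reproduced in this text. A plausible reconstruction from the names of the two indexes and from the symbols n and s_i is the mean and standard deviation of the per-set ranks. The 1/n normalization under the square root is an assumption; the original may use 1/(n - 1).

```latex
\mathrm{ESA} = \frac{1}{n}\sum_{i=1}^{n} s_i, \qquad
\mathrm{ESSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(s_i - \mathrm{ESA}\right)^{2}}
```

Under this reading, a low ESA means the model ranks near the top on average, and a low ESSD means its rank is stable across the evaluation data sets.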
According to another aspect of the invention, the following technical solution is adopted: a financial risk-control logistic regression feature screening system, the feature screening system comprising:
an initial variable acquisition module, configured to acquire all features available for modeling as initial variables;
an initial data model forming module, configured to randomly generate N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used to evaluate the contribution of the variables to the model;
an evaluation data set selection module, configured to split 1 to n portions of data from a given data set to serve as evaluation data sets;
a data set evaluation module, configured to evaluate each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model; the evaluation indexes include KS and AUC, and if the risk monotonicity of the model's score groups is of concern, that monotonicity is also used as an evaluation index;
a data model selection module, configured to select, for each evaluation data set, the top-m ranked data models, each of the N data models having evaluation results on the n evaluation data sets;
an optimal data model acquisition module, configured to judge whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, to take it as the optimal data model f1; if at least two exist, to select the optimal data model f1 among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets share no common data model, to expand the selection range to the top (m + o) ranked data models, until an optimal data model f1 is selected; wherein o is a positive integer;
a generalization-capability optimal model generation module, configured to add the remaining variables one at a time for modeling, on the basis of the variable combination corresponding to the optimal data model f1 selected by the optimal data model acquisition module, and to select the model f2 with the best generalization capability using the logic of the optimal data model acquisition module; the remaining variables are all variables other than those in the optimal data model f1; and
a recursive modeling module, configured to judge, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the generalization-capability optimal model f2 improves the evaluation result on most of the evaluation data; if so, to recursively add the remaining variables one at a time for modeling, until the evaluation results on most of the evaluation data no longer improve over the previous round.
According to another aspect of the invention, the following technical solution is adopted: a financial risk-control logistic regression feature screening system, the feature screening system comprising:
an initial variable acquisition module, configured to acquire all features available for modeling as initial variables;
an initial data model forming module, configured to randomly generate N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used to evaluate the contribution of the variables to the model;
an evaluation data set selection module, configured to split 1 to n portions of data from a given data set to serve as evaluation data sets;
a data set evaluation module, configured to evaluate each evaluation data set in turn with each of the N constructed data models, to obtain the evaluation data of each evaluation data set under each data model;
a data model selection module, configured to select, for each evaluation data set, the top-m ranked data models, each of the N data models having evaluation results on the n evaluation data sets; and
an optimal data model acquisition module, configured to judge whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, to take it as the optimal data model f1; if at least two exist, to select the optimal data model f1 among the initial variable combinations; if the top-m ranked models of the evaluation data sets share no common data model, to expand the selection range to the top (m + o) ranked data models, until an optimal data model f1 is selected; wherein o is a positive integer.
As an embodiment of the present invention, the system further includes a generalization-capability optimal model generation module, configured to add the remaining variables one at a time for modeling, on the basis of the variable combination corresponding to the optimal data model f1 selected by the optimal data model acquisition module, and to select the model f2 with the best generalization capability using the logic of the optimal data model acquisition module; the remaining variables are all variables other than those in the optimal data model f1.
As an embodiment of the present invention, the system further includes a recursive modeling module, configured to judge, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the generalization-capability optimal model f2 improves the evaluation result on most of the evaluation data; if so, to recursively add the remaining variables one at a time for modeling, until the evaluation results on most of the evaluation data no longer improve over the previous round.
The beneficial effects of the invention are as follows. By setting up multiple evaluation data sets, the financial risk-control logistic regression feature screening method and system ensure that the selected variable combination performs well on multiple data sets with different distributions, and make it more probable that the model generalizes relatively well to future data. Meanwhile, the multi-data-set comprehensive evaluation indexes ESA and ESSD make it possible to effectively select a model whose evaluation results are good across the evaluation data sets. In addition, by randomly combining variables in different proportions, the invention can find an initial variable combination with good discriminative power and generalization capability (the more random draws are made, the greater the probability of finding the optimal initial variable combination), and by modeling with variables added one at a time it can obtain a better variable combination than other variable selection schemes.
Drawings
FIG. 1 is a flowchart illustrating a financial risk-control logistic regression feature screening method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a financial risk-control logistic regression feature screening method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating the composition of a financial risk-control logistic regression feature screening system according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
For a further understanding of the invention, reference will now be made to the preferred embodiments of the invention by way of example, and it is to be understood that the description is intended to further illustrate features and advantages of the invention, and not to limit the scope of the claims.
The description in this section is for several exemplary embodiments only, and the present invention is not limited only to the scope of the embodiments described. It is within the scope of the present disclosure and protection that the same or similar prior art means and some features of the embodiments may be interchanged.
The invention discloses a financial risk-control logistic regression feature screening method. FIG. 1 and FIG. 2 are flowcharts of the method in one embodiment of the invention. Referring to FIG. 1 and FIG. 2, the feature screening method includes:
Step S1, acquiring all features available for modeling as initial variables;
Step S2, generating N initial variable combinations from the initial variables obtained in step S1, to form N data models;
In one embodiment of the present invention, in step S2, the N initial variable combinations are randomly generated from the initial variables obtained in step S1 by random sampling without replacement.
Step S3, splitting 1 to n portions of data from the data set to serve as evaluation data sets;
In one embodiment, the data set refers to all data that can be obtained, from which 1 to n portions are split off to serve as evaluation data sets. Depending on business requirements, the split can follow the timeline along which the data were generated, or be made randomly.
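Both splitting modes can be sketched in a few lines. This assumes each record is a dict and that the time-based split cuts the chronologically sorted records into contiguous, near-equal blocks; the field name `apply_date` and both function names are hypothetical.

```python
import random


def split_by_time(records, n_sets, time_key="apply_date"):
    """Sort records chronologically and cut them into n_sets contiguous blocks."""
    ordered = sorted(records, key=lambda r: r[time_key])
    size, extra = divmod(len(ordered), n_sets)
    sets, start = [], 0
    for i in range(n_sets):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        sets.append(ordered[start:end])
        start = end
    return sets


def split_randomly(records, n_sets, seed=0):
    """Shuffle a copy of the records and deal them into n_sets groups."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_sets] for i in range(n_sets)]


# Seven hypothetical loan applications, one per month (ISO dates sort correctly as text).
demo_records = [{"apply_date": "2020-0%d-01" % month, "id": month} for month in range(1, 8)]
time_sets = split_by_time(demo_records, 3)
random_sets = split_randomly(demo_records, 3)
```

A time-based split is the more natural choice when the concern is drift in the customer population, since each block then corresponds to a different period.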
Step S4, evaluating each evaluation data set in turn with each of the N data models constructed above, to obtain the evaluation data of each evaluation data set under each data model.
In one embodiment, the evaluation indexes include KS, AUC and the like, and if the risk monotonicity of the model's score groups is of concern, that monotonicity is also used as an evaluation index.
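For readers unfamiliar with the two metrics, the sketch below computes them in plain Python from their usual definitions. It assumes binary labels with 1 marking an overdue customer, scores where higher means riskier, and at least one sample of each class; the function names are illustrative.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ks(labels, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = cum_pos = cum_neg = 0
    ordered = sorted(zip(scores, labels), reverse=True)
    i = 0
    while i < len(ordered):
        # Consume all samples that share a score before measuring the gap.
        j = i
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            cum_pos += ordered[j][1]
            cum_neg += 1 - ordered[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


# Hypothetical labels and model scores for six customers.
demo_labels = [1, 1, 0, 1, 0, 0]
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
demo_auc = auc(demo_labels, demo_scores)
demo_ks = ks(demo_labels, demo_scores)
```

The pairwise AUC loop is quadratic in the sample size, which is fine for illustration but too slow for production-sized evaluation sets.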
Step S5, each of the N data models having evaluation results on the n evaluation data sets, selecting, for each evaluation data set, the top-m ranked data models;
Step S6, judging whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, taking it as the optimal data model f1; if at least two exist, selecting the optimal data model f1 among the initial variable combinations according to the evaluation index; if the top-m ranked models of the evaluation data sets share no common data model, expanding the selection range to the top (m + o) ranked data models, until an optimal data model f1 is selected; wherein o is a positive integer. In one embodiment, the values of m and o may be chosen as required; for example, m may be 10 and o may be 5. Of course, m and o may take other values.
In an embodiment of the present invention, if at least two such data models exist, the optimal model among the initial variable combinations is selected according to the evaluation ranking average (ESA) and the evaluation ranking standard deviation (ESSD). In one embodiment, ESA and ESSD are defined as follows:
[Formula image BDA0002579015670000071: definitions of ESA and ESSD]
where n denotes the number of evaluation data sets, and s_i denotes the rank of the model's evaluation index on the i-th evaluation data set.
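A small sketch of how the two indexes could break a tie between candidate models. It assumes ESA is the mean and ESSD the population standard deviation of a model's ranks across the evaluation sets, with rank 1 being best, and that ESA is compared first and ESSD second; the text does not specify how the two are combined, so that ordering is an assumption.

```python
from statistics import mean, pstdev


def esa_essd(ranks):
    """ranks: the model's rank on each evaluation set (1 = best)."""
    return mean(ranks), pstdev(ranks)


def pick_best(candidate_ranks):
    """candidate_ranks maps model id -> list of per-set ranks.
    Prefers the lowest ESA, then the lowest ESSD (assumed tie-break order)."""
    return min(candidate_ranks, key=lambda mid: esa_essd(candidate_ranks[mid]))


# Hypothetical ranks of three candidate models on three evaluation sets.
demo_ranks = {"A": [1, 5, 3], "B": [3, 3, 3], "C": [2, 4, 6]}
best_model = pick_best(demo_ranks)
```

Models A and B share the same average rank, but B is chosen because its rank does not vary between evaluation sets.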
In an embodiment of the invention, the method further comprises:
Step S7, on the basis of the variable combination corresponding to the optimal data model f1 selected in step S6, adding the remaining variables one at a time for modeling, and selecting the model f2 with the best generalization capability using the logic of step S6; the remaining variables are all variables other than those in the model f1;
Step S8, judging, according to the comprehensive evaluation indexes of the model on the multiple evaluation sets, whether f2 improves the evaluation result on most of the evaluation data; if so, recursively adding the remaining variables one at a time for modeling, until the evaluation results on most of the evaluation data no longer improve over the previous round.
The invention also discloses a financial risk-control logistic regression feature screening system. FIG. 3 is a schematic diagram of the composition of the system in one embodiment of the invention. Referring to FIG. 3, the feature screening system includes: an initial variable acquisition module 1, an initial data model forming module 2, an evaluation data set selection module 3, a data set evaluation module 4, a data model selection module 5 and an optimal data model acquisition module 6.
The initial variable acquisition module 1 is used to acquire all features available for modeling as initial variables.
The initial data model forming module 2 is used to randomly generate N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used to evaluate the contribution of the variables to the model.
The evaluation data set selection module 3 is used to split 1 to n portions of data from a given data set to serve as evaluation data sets.
The data set evaluation module 4 is used to evaluate each evaluation data set in turn with each of the N constructed data models, so as to obtain the evaluation data of each evaluation data set under each data model.
The data model selection module 5 is used to select, for each evaluation data set, the top-m ranked data models, each of the N data models having evaluation results on the n evaluation data sets.
The optimal data model acquisition module 6 is used to judge whether any data model appears in the top-m ranked data models of every one of the n evaluation data sets; if exactly one such data model exists, it is taken as the optimal data model f1; if at least two exist, the optimal data model f1 is selected among the initial variable combinations; if the top-m ranked models of the evaluation data sets share no common data model, the selection range is expanded to the top (m + o) ranked data models, until an optimal data model f1 is selected; wherein o is a positive integer.
In an embodiment of the present invention, the system further includes a generalization-capability optimal model generation module 7, used to add the remaining variables one at a time for modeling, on the basis of the variable combination corresponding to the optimal data model f1 selected by the optimal data model acquisition module, and to select the model f2 with the best generalization capability using the logic of the optimal data model acquisition module; the remaining variables are all variables other than those in the optimal data model f1.
In an embodiment of the present invention, the system further includes a recursive modeling module 8, used to judge, according to the comprehensive evaluation indexes of the model on the multiple evaluation sets, whether the generalization-capability optimal model f2 improves the evaluation result on most of the evaluation data; if so, the remaining variables are recursively added one at a time for modeling, until the evaluation results on most of the evaluation data no longer improve over the previous round.
In summary, by setting up multiple evaluation data sets, the financial risk-control logistic regression feature screening method and system provided by the invention ensure that the selected variable combination performs well on multiple data sets with different distributions, and make it more probable that the model generalizes relatively well to future data. Meanwhile, the multi-data-set comprehensive evaluation indexes ESA and ESSD make it possible to effectively select a model whose evaluation results are good across the evaluation data sets. In addition, by randomly combining variables in different proportions, the invention can find an initial variable combination with good discriminative power and generalization capability (the more random draws are made, the greater the probability of finding the optimal initial variable combination), and by modeling with variables added one at a time it can obtain a better variable combination than other variable selection schemes.
The general flow of financial risk control modeling is: sample selection; data preprocessing; feature engineering; variable selection; and modeling. As is well known, a logistic regression model learns a linear combination of features and uses what it has learned to predict new data. However, the field of financial risk control places relatively high demands on model interpretability, so features are generally not over-derived. For machine learning models, the data and features determine the upper bound of the model; thus, to some extent, the variable combination selected to build the logistic regression model determines the final performance of the model. Conventional variable selection methods based on correlation, multicollinearity, stepwise regression and the like do not adequately address the requirements for generalization and monotonicity of models in the financial field. Selecting variables according to their actual performance in the model gives good control over the final performance of the model.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The description and applications of the invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. Effects or advantages referred to in the embodiments may not be reflected in the embodiments due to interference of various factors, and the description of the effects or advantages is not intended to limit the embodiments. Variations and modifications of the embodiments disclosed herein are possible, and alternative and equivalent various components of the embodiments will be apparent to those skilled in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, proportions, and with other components, materials, and parts, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims (10)

1. A financial risk control logistic regression feature screening method, characterized by comprising the following steps:
step S1, acquiring all features available for modeling as initial variables;
step S2, randomly generating N initial variable combinations from the initial variables obtained in step S1 by random sampling without replacement, to form N initial data models; the N initial data models are used for evaluating the contribution of variables to the models;
step S3, splitting 1 to n portions of data from a set data set to serve as evaluation data sets;
step S4, evaluating each of the N constructed data models on each evaluation data set in turn, to obtain the evaluation data of each data model on each evaluation data set; the evaluation indexes comprise KS and AUC, and if the risk monotonicity of the model's score groups is of concern, that monotonicity is also used as an evaluation index;
step S5, each of the N data models has n evaluation results, one per evaluation data set; selecting the top-m ranked data models for each evaluation data set;
step S6, among the top-m ranked data models of the n evaluation data sets, judging whether there is a data model that appears in the top m of every evaluation data set; if exactly one such data model exists, taking it as the optimal data model f1; if at least two such data models exist, selecting the optimal data model f1 among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range and selecting the top m + o ranked data models until the optimal data model f1 is selected; wherein o is a positive integer;
step S7, on the basis of the variable combination corresponding to the optimal data model f1 selected in step S6, adding the remaining variables one by one for modeling, and selecting the model f2 with the optimal generalization capability according to the logic of step S6; the remaining variables are all variables other than those of the optimal data model f1;
step S8, judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether f2 improves the evaluation results on most of the evaluation data; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data.
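As an illustration of the KS and AUC evaluation indexes named in step S4, the following is a minimal sketch and not the patented implementation. It assumes binary labels where 1 marks a bad (defaulting) sample and a score where higher means riskier; the function names and the toy data are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a random bad sample outscores a random good one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ks(labels, scores):
    """KS: largest gap between the cumulative bad rate and good rate over all thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


# Hypothetical toy evaluation data set: 1 = bad sample, 0 = good sample.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_value = auc(labels, scores)
ks_value = ks(labels, scores)
```

Each of the N data models would be scored this way on every evaluation data set, giving the per-data-set rankings used in steps S5 and S6.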
2. A financial risk control logistic regression feature screening method, characterized by comprising the following steps:
step S1, acquiring all features available for modeling as initial variables;
step S2, generating N initial variable combinations from the initial variables obtained in step S1, to form N data models;
step S3, splitting 1 to n portions of data from a set data set to serve as evaluation data sets;
step S4, evaluating each of the N constructed data models on each evaluation data set in turn, to obtain the evaluation data of each data model on each evaluation data set;
step S5, each of the N data models has n evaluation results, one per evaluation data set; selecting the top-m ranked data models for each evaluation data set;
step S6, among the top-m ranked data models of the n evaluation data sets, judging whether there is a data model that appears in the top m of every evaluation data set; if exactly one such data model exists, taking it as the optimal data model f1; if at least two such data models exist, selecting the optimal data model f1 among the initial variable combinations according to the evaluation index; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range and selecting the top m + o ranked data models until the optimal data model f1 is selected; wherein o is a positive integer.
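The selection logic of steps S5 and S6 can be sketched as follows. This is an illustrative sketch under stated assumptions: each evaluation data set supplies a list of model identifiers ordered best first, and the function name, the parameter names, and the toy rankings are hypothetical.

```python
def select_common_top_models(rankings, m, o=1):
    """Return (models, cutoff): the models found in the top `cutoff` of every ranking.

    rankings: one list per evaluation data set, model ids ordered best first.
    The cutoff starts at m and grows by o until a common model appears.
    """
    if m < 1 or o < 1:
        raise ValueError("m and o must be positive integers")
    longest = max(len(r) for r in rankings)
    cutoff = m
    while True:
        common = set(rankings[0][:cutoff])
        for ranking in rankings[1:]:
            common &= set(ranking[:cutoff])
        # Stop once a shared model exists or the rankings are exhausted.
        if common or cutoff >= longest:
            return sorted(common), cutoff
        cutoff += o


# Hypothetical rankings of four models on n = 3 evaluation data sets.
rankings = [
    ["A", "B", "C", "D"],
    ["C", "D", "A", "B"],
    ["B", "A", "D", "C"],
]
common_models, used_cutoff = select_common_top_models(rankings, m=1, o=1)
```

When more than one model is returned, the claim calls for a further choice among them according to the evaluation index; that tie-break is left out of this sketch.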
3. The method of claim 2, wherein the method further comprises:
step S7, on the basis of the variable combination corresponding to the optimal data model f1 selected in step S6, adding the remaining variables one by one for modeling, and selecting the model f2 with the optimal generalization capability according to the logic of step S6; the remaining variables are all variables other than those of the model f1;
step S8, judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether f2 improves the evaluation results on most of the evaluation data; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data.
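The forward-addition loop of steps S7 and S8 might be sketched as below. Several points are assumptions rather than the claimed method: "most of the evaluation data" is read as more than half of the evaluation data sets, the best trial in each round is picked by the sum of its scores (a simplification swapped in for the step S6 ranking logic), and `evaluate`, `GAINS`, and the variable names are hypothetical.

```python
def improves_on_most(new_scores, old_scores):
    """True when the new model scores higher on more than half of the evaluation sets."""
    wins = sum(1 for new, old in zip(new_scores, old_scores) if new > old)
    return wins > len(old_scores) / 2


def forward_add(base_vars, remaining_vars, evaluate):
    """Greedily add remaining variables while most evaluation sets keep improving."""
    selected = list(base_vars)
    remaining = list(remaining_vars)
    current = evaluate(selected)
    while remaining:
        # Try each remaining variable on top of the current combination.
        trials = [(evaluate(selected + [v]), v) for v in remaining]
        best_scores, best_var = max(trials, key=lambda t: sum(t[0]))
        if not improves_on_most(best_scores, current):
            break
        selected.append(best_var)
        remaining.remove(best_var)
        current = best_scores
    return selected, current


# Hypothetical per-variable score gains on three evaluation data sets.
GAINS = {
    "age": (0.10, 0.10, 0.10),
    "income": (0.05, 0.04, 0.06),
    "tenure": (0.02, 0.03, -0.01),
    "noise": (-0.01, 0.01, -0.02),
}


def toy_evaluate(variables):
    """Stand-in for fitting a model and scoring it on each evaluation data set."""
    return [sum(GAINS[v][i] for v in variables) for i in range(3)]


selected, final_scores = forward_add(["age"], ["income", "tenure", "noise"], toy_evaluate)
```

In practice `evaluate` would fit a logistic regression on the given variables and return one evaluation index value (for example KS or AUC) per evaluation data set.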
4. The method of claim 2, wherein:
in step S2, the N initial variable combinations are randomly generated from the initial variables acquired in step S1 by random sampling without replacement.
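A minimal sketch of generating the N initial variable combinations follows. It assumes that "without replacement" applies within each combination, so that no variable repeats inside one combination, and that every combination has the same size; the claim fixes neither point, and the function name, parameters, and variable names are hypothetical.

```python
import random


def generate_variable_combinations(variables, n_models, subset_size, seed=0):
    """Draw n_models variable combinations, each sampled without replacement."""
    rng = random.Random(seed)
    # random.sample never repeats an element within a single draw.
    return [sorted(rng.sample(variables, subset_size)) for _ in range(n_models)]


# Hypothetical initial variables from step S1.
initial_variables = ["age", "income", "overdue_count", "credit_limit", "tenure", "inquiries"]
combos = generate_variable_combinations(initial_variables, n_models=5, subset_size=3)
```

Each combination would then be used to fit one initial data model.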
5. The method of claim 2, wherein:
in step S6, if at least two such data models exist, the optimal data model among the initial variable combinations is selected according to the estimated sorting average ESA and the estimated sorting standard deviation ESSD.
6. The method of claim 5, wherein:
in step S6, the estimated sorting average ESA is
Figure FDA0002579015660000021
and the estimated sorting standard deviation ESSD is
Figure FDA0002579015660000022
wherein n denotes the number of evaluation data sets, and s_i denotes the rank of the model's evaluation index on the i-th evaluation data set.
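The formula images referenced in claim 6 are not reproduced in this text. As a hedged sketch only, the code below assumes that ESA is the arithmetic mean of the ranks s_i and that ESSD is their population standard deviation (dividing by n); the original figures may define them differently. The order in which the two indexes are combined (lower ESA first, then lower ESSD) is also an assumption, and the model names and ranks are hypothetical.

```python
import math


def esa(ranks):
    """Estimated sorting average: mean rank over the n evaluation data sets."""
    return sum(ranks) / len(ranks)


def essd(ranks):
    """Estimated sorting standard deviation (population form, an assumption)."""
    mean = esa(ranks)
    return math.sqrt(sum((s - mean) ** 2 for s in ranks) / len(ranks))


# Hypothetical ranks of two candidate models on n = 3 evaluation data sets.
candidates = {"model_a": [1, 3, 2], "model_b": [2, 2, 2]}
# Assumed tie-break: prefer the lower average rank, then the more stable model.
best = min(candidates, key=lambda k: (esa(candidates[k]), essd(candidates[k])))
```

Here both candidates share the same average rank, so the one whose rank varies less across the evaluation data sets is preferred.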
7. A financial risk control logistic regression feature screening system, characterized in that the feature screening system comprises:
an initial variable acquisition module, used for acquiring all features available for modeling as initial variables;
an initial data model forming module, used for randomly generating N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used for evaluating the contribution of variables to the models;
an evaluation data set selection module, used for splitting 1 to n portions of data from a set data set to serve as evaluation data sets;
a data set evaluation module, used for evaluating each of the N constructed data models on each evaluation data set in turn, to obtain the evaluation data of each data model on each evaluation data set; the evaluation indexes comprise KS and AUC, and if the risk monotonicity of the model's score groups is of concern, that monotonicity is also used as an evaluation index;
a data model selection module, used for selecting, from the n evaluation results of each of the N data models (one per evaluation data set), the top-m ranked data models for each evaluation data set;
an optimal data model acquisition module, used for judging whether, among the top-m ranked data models of the n evaluation data sets, there is a data model that appears in the top m of every evaluation data set; if exactly one such data model exists, taking it as the optimal data model f1; if at least two such data models exist, selecting the optimal data model f1 among the initial variable combinations according to the evaluation indexes ESA and ESSD; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range and selecting the top m + o ranked data models until the optimal data model f1 is selected; wherein o is a positive integer;
a generalization-capability optimal model generation module, used for adding the remaining variables one by one for modeling on the basis of the variable combination corresponding to the optimal data model f1 selected by the optimal data model acquisition module, and selecting the model f2 with the optimal generalization capability according to the logic of the optimal data model acquisition module; the remaining variables are all variables other than those of the optimal data model f1; and
a recursive modeling module, used for judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the optimal generalization capability model f2 improves the evaluation results on most of the evaluation data; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data.
8. A financial risk control logistic regression feature screening system, characterized in that the feature screening system comprises:
an initial variable acquisition module, used for acquiring all features available for modeling as initial variables;
an initial data model forming module, used for randomly generating N initial variable combinations from the initial variables acquired by the initial variable acquisition module by random sampling without replacement, to form N initial data models; the N initial data models are used for evaluating the contribution of variables to the models;
an evaluation data set selection module, used for splitting 1 to n portions of data from a set data set to serve as evaluation data sets;
a data set evaluation module, used for evaluating each of the N constructed data models on each evaluation data set in turn, to obtain the evaluation data of each data model on each evaluation data set;
a data model selection module, used for selecting, from the n evaluation results of each of the N data models (one per evaluation data set), the top-m ranked data models for each evaluation data set; and
an optimal data model acquisition module, used for judging whether, among the top-m ranked data models of the n evaluation data sets, there is a data model that appears in the top m of every evaluation data set; if exactly one such data model exists, taking it as the optimal data model f1; if at least two such data models exist, selecting the optimal data model f1 among the initial variable combinations; if the top-m ranked models of the evaluation data sets have no data model in common, expanding the selection range and selecting the top m + o ranked data models until the optimal data model f1 is selected; wherein o is a positive integer.
9. The financial risk control logistic regression feature screening system of claim 8, wherein:
the system further comprises a generalization-capability optimal model generation module, used for adding the remaining variables one by one for modeling on the basis of the variable combination corresponding to the optimal data model f1 selected by the optimal data model acquisition module, and selecting the model f2 with the optimal generalization capability according to the logic of the optimal data model acquisition module; the remaining variables are all variables other than those of the optimal data model f1.
10. The financial risk control logistic regression feature screening system of claim 8, wherein:
the system further comprises a recursive modeling module, used for judging, according to the comprehensive evaluation indexes of the models on the multiple evaluation sets, whether the optimal generalization capability model f2 improves the evaluation results on most of the evaluation data; if so, recursively adding the remaining variables one by one for modeling, until the evaluation results no longer improve over the previous round on most of the evaluation data.
CN202010662195.1A 2020-07-10 2020-07-10 Financial wind control logistic regression feature screening method and system Pending CN111861705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010662195.1A CN111861705A (en) 2020-07-10 2020-07-10 Financial wind control logistic regression feature screening method and system

Publications (1)

Publication Number Publication Date
CN111861705A true CN111861705A (en) 2020-10-30

Family

ID=73153750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010662195.1A Pending CN111861705A (en) 2020-07-10 2020-07-10 Financial wind control logistic regression feature screening method and system

Country Status (1)

Country Link
CN (1) CN111861705A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103262069A (en) * 2010-12-21 2013-08-21 国际商业机器公司 Method and system for predictive modeling
CN106095942A (en) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 Strong variable extracting method and device
CN109359850A (en) * 2018-10-10 2019-02-19 大连诺道认知医学技术有限公司 A kind of method and device generating risk assessment scale
CN109754157A (en) * 2018-11-30 2019-05-14 畅捷通信息技术股份有限公司 A kind of methods of marking and system for reflecting enterprise's health management, financing and increasing letter
CN110223156A (en) * 2019-05-16 2019-09-10 杭州排列科技有限公司 Automation model evolutionary algorithm based on gradually optimal feature selection
CN110298389A (en) * 2019-06-11 2019-10-01 上海冰鉴信息科技有限公司 More wheels circulation feature selection approach and device when training pattern
CN111311400A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Modeling method and system of grading card model based on GBDT algorithm
CN111311402A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 XGboost-based internet financial wind control model
CN111311128A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Consumption financial credit scoring card development method based on third-party data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232951A (en) * 2020-12-17 2021-01-15 中证信用云科技(深圳)股份有限公司 Credit evaluation method, device, equipment and medium based on multi-dimensional cross feature
CN112232951B (en) * 2020-12-17 2021-04-27 中证信用云科技(深圳)股份有限公司 Credit evaluation method, device, equipment and medium based on multi-dimensional cross feature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination