CN117350765A

CN117350765A - Variable determining method and device, storage medium and electronic equipment

Info

Publication number: CN117350765A
Application number: CN202311345123.4A
Authority: CN
Inventors: 张璇; 石爱华; 钱富; 杨菲; 李晓娟; 吴正元
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2023-10-17
Filing date: 2023-10-17
Publication date: 2024-01-05

Abstract

The invention discloses a variable determining method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring a sample set, and randomly sampling the sample set t times to obtain t sub-sample sets; training the logistic regression model by using t subsamples respectively to obtain t reference logistic regression models; acquiring index value sets determined by each reference logistic regression model in the t reference logistic regression models to obtain t index value sets; determining an evaluation value of each of the M variables according to the t index value sets to obtain M evaluation values corresponding to the M variables; and determining N variables from the M variables according to the M evaluation values, wherein the N variables are used for predicting the target event. By adopting the technical scheme, the problems of high repetition rate, complex operation and local optimal solution in the process of selecting the variables by adopting the traditional method are solved.

Description

Variable determining method and device, storage medium and electronic equipment

Technical Field

The embodiment of the invention relates to the field of computers, in particular to a variable determining method and device, a storage medium and electronic equipment.

Background

To better predict the probability of default, an initial variable list is usually constructed from multiple angles, typically covering customer credit, basic information, behavior information and the like. In practice, considering operating efficiency and the convenience of monitoring after the model goes online, efficient variable selection is required for the initially processed high-dimensional data: the explanatory variables that best explain the response variable are selected from a large number of candidates, mainly through statistical methods combined with business practice. Variable selection is an important step in model development; the quality of its result directly affects the quality of the model and therefore has a large influence on statistical analysis and prediction accuracy.

Variable selection is a process of repeatedly studying the relationship between the independent variables and the dependent variable and selecting the variables that are most helpful in distinguishing good samples from bad ones. Variable selection methods fall into two major types: single-variable-based methods and model-based methods. Common single-variable-based methods rely on the correlation coefficient, information entropy, the IV (information value) index, and the clustering methods commonly used in current model development. Model-based methods include stepwise logistic regression, LASSO, machine-learning feature importance, and the like.

Logistic regression is a multivariate analysis method for studying the relationship between the observation results of two or more categories and influencing factors (independent variables), and belongs to probabilistic nonlinear regression. The logistic regression model is a retail business risk metering method adopted by most commercial banks worldwide, and practice proves that the model has wide applicability to large-scale retail banks; among a plurality of metering models, the model constructed by logistic regression has better stability and higher accuracy, and is convenient to explain and develop.

In practice, variables that pass pre-screening are usually introduced into the model by stepwise regression: the model is re-evaluated after each variable is added, and variables that do not improve the model's predictive ability according to the criterion are removed. The stepwise procedure ends when all variables in the model meet the retention criterion and no other variable meets the entry criterion. The retained variables become candidate feature variables of the final model.

In practical projects, a single variable selection method rarely works well on its own. Single-variable-based screening ignores the overall interaction of the variable set on the model, while model-based variable selection may run for a long time and be inefficient. Multiple methods are therefore often combined into a two-step procedure to improve both the effect and the efficiency of variable selection: first, single-variable selection is applied to the data set under relatively loose conditions to filter out some variables and reduce noise in model-based selection; then a model is used to select the variables with better explanatory power.

In project development, for example of a scorecard model, the most common approach is to screen variables multiple times with a stepwise logistic regression model to arrive at the final modeling variables. Specifically, during scorecard development, stepwise logistic regression is often run repeatedly until the modeling variables satisfy every test condition, which requires many repeated steps and considerable manual intervention. To avoid deleting effective feature variables, many variables are still retained after the other screening methods and before stepwise regression, and binning these variables is a large workload. Moreover, stepwise regression is in a sense a greedy algorithm that yields a locally optimal model, so the variables selected by a single logistic regression may not be the optimal combination. In addition, when only the better variable is retained from a group of highly correlated variables, or when an expert selects a replacement variable, IV is often used as the selection criterion, but IV does not reflect how the variable actually performs in a model. Actual development therefore often requires manual judgment and repeated attempts, and suffers from problems such as a heavy binning workload and locally optimal solutions.

Aiming at the problems of high repetition rate, complex operation and poor selection effect of the selection process in the process of selecting the variables by adopting the traditional method in the related technology, no effective solution is proposed at present.

Disclosure of Invention

The embodiment of the invention provides a variable determining method and device, a storage medium and electronic equipment, which at least solve the problems of high repetition rate, complex operation and local optimal solution in the process of selecting variables by adopting a traditional method.

According to an embodiment of the present application, there is provided a variable determining method including: obtaining a sample set, and randomly sampling the sample set t times to obtain t sub-sample sets, wherein the sample set comprises samples of M variables, each sub-sample set in the t sub-sample sets comprises samples of Q variables, the union of the t sub-sample sets comprises the samples of the M variables, and the Q is smaller than the M; training the logistic regression model by using the t sub-samples respectively to obtain t reference logistic regression models; obtaining index value sets determined by each reference logistic regression model in the t reference logistic regression models to obtain t index value sets, wherein each index value set comprises: index values of P indexes corresponding to each variable in the corresponding Q variables; determining an evaluation value of each variable in the M variables according to the t index value sets to obtain M evaluation values corresponding to the M variables; and determining N variables from the M variables according to the M evaluation values, wherein the N variables are used for predicting a target event.

Optionally, the evaluation value of the i-th variable of the M variables is determined by determining the evaluation value of each variable of the M variables, where i is a positive integer less than or equal to M: determining the occurrence times of the ith variable in the t index value sets, and determining P average index values of the P indexes corresponding to the ith variable according to the t index value sets; and carrying out weighted summation on the P average index values and the occurrence times to obtain an evaluation value of the ith variable.

Optionally, determining P average index values of the P indexes corresponding to the ith variable according to the t index value sets includes: determining an average index value of a j index of the P indexes corresponding to the i variable, so as to determine P average index values of the P indexes corresponding to the i variable, wherein j is a positive integer less than or equal to P: obtaining Z index values of the j index corresponding to the i variable from the t index value sets, wherein Z is the occurrence number of the i variable; and determining the average value of the Z index values as the average index value of the j index corresponding to the i variable.

Optionally, determining N variables from the M variables according to the M evaluation values includes: sorting the M evaluation values from large to small, and determining, from the M variables, the N variables whose evaluation values rank highest; or determining, from the M variables, N variables whose evaluation values are larger than a preset threshold value.

Optionally, after obtaining the t index value sets, the method further includes: performing normalization or standardization on the index values of some of the P indexes corresponding to each variable in each index value set.

Optionally, after determining N variables from the M variables according to the M evaluation values, the method further includes: training a logistic regression model by using a target sample set to obtain a target logistic regression model, wherein the target sample set comprises samples of the N variables; and obtaining N variable values of the target object on the N variables, and inputting the N variable values into the target logistic regression model to obtain a prediction result of the target event.

Optionally, the logistic regression model is a model using a stepwise logistic regression algorithm or a model using a logistic regression algorithm with an L1 regularization penalty. In the case that the logistic regression model is a model using a stepwise logistic regression algorithm, the P indexes include: the regression coefficient of a variable, the test chi-square value of a variable, the variance inflation factor of a variable, the adjusted coefficient of determination of the model, and the area under the curve (AUC) of the model.

According to another embodiment of the present application, there is also provided a variable determining apparatus including: the first processing module is used for obtaining a sample set, and carrying out t times of random sampling on the sample set to obtain t sub-sample sets, wherein the sample set comprises samples of M variables, each sub-sample set in the t sub-sample sets comprises samples of Q variables, a union set of the t sub-sample sets comprises the samples of the M variables, and Q is smaller than M; the training module is used for training the logistic regression models by using the t sub-samples respectively to obtain t reference logistic regression models; the acquisition module is configured to acquire an index value set determined by each reference logistic regression model in the t reference logistic regression models, and obtain t index value sets, where each index value set includes: index values of P indexes corresponding to each variable in the corresponding Q variables; the first determining module is used for determining the evaluation value of each variable in the M variables according to the t index value sets to obtain M evaluation values corresponding to the M variables; and the second determining module is used for determining N variables from the M variables according to the M evaluation values, wherein the N variables are used for predicting a target event.

According to a further embodiment of the present application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

According to a further embodiment of the present application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

According to the method and the device, after t times of random sampling are carried out on a sample set, t sub-sample sets are obtained, a logistic regression model is trained by using the t sub-sample sets respectively, t reference logistic regression models are obtained, index value sets determined by each reference logistic regression model in the t reference logistic regression models are obtained, t index value sets are obtained, evaluation values of each variable in M variables are determined according to the t index value sets, M evaluation values corresponding to the M variables are obtained, N variables are determined from the M variables according to the M evaluation values, and N variables are used for predicting a target event. The model is trained by using each group of sub-sample sets, the index value sets determined by the model are obtained at the same time after model fitting, then the evaluation value of each variable is determined according to the t index value sets, and then the variable is selected according to the evaluation values of the variables. By adopting the scheme, the problems of high repetition rate, complex operation and local optimal solution in the process of variable selection by adopting the traditional method are solved, so that the variable selection process is simplified, and the technical effect of improving the reliability of the variable selection result is achieved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a mobile terminal of a variable determining method according to an embodiment of the present application;

FIG. 2 is a flow chart of a variable determination method of an embodiment of the present application;

FIG. 3 is a flow chart of a method of determining an optimal candidate variable set ordering in accordance with an embodiment of the present application;

FIG. 4 is a schematic diagram of an optimal candidate variable set ordering effect according to an embodiment of the present application;

FIG. 5 is a block diagram of a configuration of a variable determining apparatus according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking the mobile terminal as an example, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a variable determining method according to an embodiment of the present application. As shown in fig. 1, a mobile terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, wherein the mobile terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and not limiting of the structure of the mobile terminal described above. For example, the mobile terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.

The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a variable determining method in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

In this embodiment, a variable determining method is provided, fig. 2 is a flowchart of a variable determining method according to an embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:

Step S202, a sample set is obtained, and the sample set is randomly sampled t times to obtain t sub-sample sets, wherein the sample set comprises samples of M variables, each sub-sample set in the t sub-sample sets comprises samples of Q variables, the union of the t sub-sample sets comprises the samples of the M variables, and Q is smaller than M;

It should be noted that, in the random sampling process, each variable has a certain probability of being drawn, so that when the number of samplings is sufficiently large, the t sub-sample sets obtained after sampling together include samples of all M variables.

As an alternative example, the random sampling may be monte carlo sampling.

It should be noted that the M variables are all variables that may affect the target event. The size of Q may be determined according to the size of M, so that t sub-sample sets obtained by extracting t times may cover all variable combinations.

As an alternative example, the sample set may be expressed as (variable 1, variable 2, …, variable M) with a sample size of n and M features; that is, the sample set contains n samples, and each sample has values for the M variables.
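As a non-limiting illustration, the sampling of step S202 can be sketched as follows. The sketch assumes that each sub-sample set is formed by drawing Q of the M variable names uniformly without replacement; the helper name `draw_variable_subsets`, the `seed` parameter and the variable names are hypothetical and do not come from the application. The helper also reports any variable that no draw covered, so that t can be increased if the union does not yet contain all M variables.

```python
import random

def draw_variable_subsets(variables, q, t, seed=None):
    """Draw t random subsets of q variables each (hypothetical helper).

    Returns the subsets and the list of variables that no subset covers.
    """
    if not 0 < q < len(variables):
        raise ValueError("Q must be positive and smaller than M")
    rng = random.Random(seed)
    # Each draw picks q distinct variables uniformly at random.
    subsets = [rng.sample(variables, q) for _ in range(t)]
    covered = set().union(*subsets)
    uncovered = [v for v in variables if v not in covered]
    return subsets, uncovered

variables = [f"variable_{k}" for k in range(1, 11)]  # M = 10 (illustrative)
subsets, uncovered = draw_variable_subsets(variables, q=4, t=50, seed=0)
print(len(subsets), "sub-sample sets drawn; uncovered variables:", uncovered)
```

If `uncovered` is not empty, the number of samplings t is too small for the chosen Q and can be raised until every variable appears in at least one sub-sample set.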

Step S204, training the logistic regression model by using the t sub-samples respectively to obtain t reference logistic regression models;

In one exemplary embodiment, the logistic regression model is a model using a stepwise logistic regression algorithm or a model using a logistic regression algorithm with an L1 regularization penalty.

As an alternative example, the sample set may first need to be preprocessed, for example by missing-rate analysis and imputation, outlier handling and the like, before training the logistic regression model.

It should be noted that fitting the logistic regression model directly yields the actual performance of the variables in the model. Fitting different variable subsets (i.e. sub-sample sets) multiple times avoids the local-optimum problem of a single regression and weakens the mutual influence of similar variables in model fitting.

Step S206, obtaining an index value set determined by each reference logistic regression model in the t reference logistic regression models, to obtain t index value sets, where each index value set includes: index values of P indexes corresponding to each variable in the corresponding Q variables;

As an alternative example, in the case where the logistic regression model is a model using a stepwise logistic regression algorithm, the P indexes include: the regression coefficient of a variable, the test chi-square value of a variable, the variance inflation factor of a variable, the adjusted coefficient of determination of the model, and the area under the curve (Area Under the Curve, abbreviated as AUC) of the model.

That is, the P indexes include variable evaluation indexes and model evaluation indexes. The variable evaluation indexes include, but are not limited to, the regression coefficient coef, the Wald test chi-square value, and the variance inflation factor vif. The model evaluation indexes include, but are not limited to, the adjusted coefficient of determination of the model and the AUC of the model.

It should be noted that the variable evaluation indexes are used to evaluate the quality of the input variables and thus help to select appropriate variables, thereby improving the performance of the model. The model evaluation indexes are used to evaluate the performance of the trained model, which is built by learning the characteristics of the data; they help to select the best model and thereby improve the accuracy and reliability of prediction. Combining the variable evaluation indexes with the model evaluation indexes can further improve the accuracy and reliability of prediction.

Step S208, determining an evaluation value of each variable in the M variables according to the t index value sets to obtain M evaluation values corresponding to the M variables;

and step S210, determining N variables from the M variables according to the M evaluation values, wherein the N variables are used for predicting a target event.

Through the steps S202-S210, the model is trained by using each sub-sample set, and the index value set determined by the model is obtained at the same time after the model is fitted, so as to determine the evaluation value of each variable according to the determined t index value sets, and then the variable is selected according to the evaluation value of the variable. By adopting the scheme, the problems of high repetition rate, complex operation and local optimal solution in the process of variable selection by adopting the traditional method are solved, so that the variable selection process is simplified, and the technical effect of improving the reliability of the variable selection result is achieved.

In an exemplary embodiment, the evaluation value of the ith variable of the M variables may be determined by the following steps S11 to S12 to determine the evaluation value of each variable of the M variables, where i is a positive integer of M or less:

Step S11, determining the occurrence times of the ith variable in the t index value sets, and determining P average index values of the P indexes corresponding to the ith variable according to the t index value sets;

It should be noted that the number of occurrences of the i-th variable is the number of times the variable was drawn during the t samplings. For example, if the number of occurrences of the i-th variable is Z, then Z of the t sub-sample sets contain samples of the i-th variable.

And step S12, carrying out weighted summation on the P average index values and the occurrence times to obtain an evaluation value of the ith variable.

In the weighting process, the weight of each index can be customized. Because an excessively large average index value for some indexes can affect the stability of the model, different weights need to be set for different indexes so that no single index unduly influences the final result.

For better understanding, assume that the logistic regression model is a model using a stepwise logistic regression algorithm and that the P indexes are the regression coefficient, the test chi-square value, the variance inflation factor, the adjusted coefficient of determination, and the area under the curve (AUC).

Regression coefficient: the regression coefficient measures the extent to which an independent variable affects the dependent variable. It represents the effect on the dependent variable of a unit change in the independent variable. The regression coefficient may be positive, negative, or zero: a positive value indicates that an increase in the independent variable is associated with an increase in the dependent variable, a negative value indicates that an increase in the independent variable is associated with a decrease in the dependent variable, and zero indicates that the independent variable has no effect on the dependent variable.

Test chi-square value: the chi-square test is a statistical method for verifying whether there is an association between two or more categorical variables. In the chi-square test, the calculated chi-square value measures the degree of difference between the observed values and the expected values. The larger the chi-square value, the larger the difference between the observed and expected values, i.e. the more significant the association.

Variance inflation factor: the variance inflation factor (VIF) is an indicator used to check whether there is multicollinearity among the independent variables in a regression model. It measures the correlation between each independent variable and the other independent variables, with a larger value indicating a stronger correlation. A VIF of 1 indicates that the variable is uncorrelated with the other independent variables; the VIF rises above 1 as the correlation strengthens, and a VIF greater than 5 or 10 is commonly taken to indicate serious multicollinearity.

Adjusted coefficient of determination: the adjusted coefficient of determination (Adjusted R-squared) is an indicator that measures the goodness of fit of the regression model. It corrects the coefficient of determination (R-squared) by taking the number of independent variables into account. Its value usually lies between 0 and 1, and a value closer to 1 indicates that the model has better explanatory power.

Area under the curve: the AUC ranges from 0 to 1 and measures how well the model separates positive cases from negative cases. The closer the AUC is to 1, the better the performance of the model and the better the distinction between positive and negative cases. Conversely, the closer the AUC is to 0.5, the worse the performance of the model and the weaker its predictive power, 0.5 corresponding to random guessing.
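As a non-limiting illustration of the AUC described above, the following sketch computes it directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The function name `auc_from_scores` and the toy labels and scores are hypothetical; a production implementation would normally rely on a statistics library instead.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both positive and negative cases are required")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]              # 1 = target event occurred (toy data)
scores = [0.1, 0.4, 0.35, 0.8]     # predicted probabilities (toy data)
print(auc_from_scores(labels, scores))  # 3 of 4 pairs ordered correctly: 0.75
```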

In an exemplary embodiment, taking user-defined weights as an example, the regression coefficient and the frequency of entry into the model (i.e. the number of occurrences) are each given a weight of 0.4, and the AUC index is given a weight of 0.15. Because a variable with a larger variance inflation factor harms model stability when it enters the model, the variance inflation factor is given a weight of -0.05. The test chi-square value and the adjusted coefficient of determination are each given a weight of 0.05. The evaluation value of variable i is then 0.4×coef.i + 0.4×fre.i + 0.15×auc.i - 0.05×vif.i + 0.05×wald.i + 0.05×R.i.

Wherein coef.i is the average index value of the regression coefficient corresponding to the i-th variable, fre.i is the number of occurrences corresponding to the i-th variable, auc.i is the average index value of the AUC corresponding to the i-th variable, vif.i is the average index value of the variance inflation factor corresponding to the i-th variable, wald.i is the average index value of the test chi-square value corresponding to the i-th variable, and R.i is the average index value of the adjusted coefficient of determination corresponding to the i-th variable.
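As a non-limiting illustration, the weighted summation with the example weights 0.4, 0.4, 0.15, -0.05, 0.05 and 0.05 can be sketched as follows. The dictionary keys, the function name `evaluation_value` and the input numbers are hypothetical. The sketch further assumes that all inputs, including the number of occurrences fre, have already been brought onto a comparable scale (for example by dividing the occurrence count by t), which the application leaves open.

```python
# Example weights from the embodiment; the key names are hypothetical.
WEIGHTS = {
    "coef": 0.4,    # average regression coefficient
    "fre": 0.4,     # number of occurrences (assumed rescaled)
    "auc": 0.15,    # average model AUC
    "vif": -0.05,   # average variance inflation factor (penalized)
    "wald": 0.05,   # average test chi-square value
    "r": 0.05,      # average adjusted coefficient of determination
}

def evaluation_value(avg_index_values, weights=WEIGHTS):
    """Weighted sum of the average index values of one variable."""
    return sum(w * avg_index_values[name] for name, w in weights.items())

# Made-up, already normalized inputs for a single variable i.
variable_i = {"coef": 0.8, "fre": 0.5, "auc": 0.7,
              "vif": 0.2, "wald": 0.6, "r": 0.3}
score = evaluation_value(variable_i)
print(round(score, 4))  # 0.32 + 0.20 + 0.105 - 0.01 + 0.03 + 0.015 = 0.66
```

Passing a different `weights` dictionary corresponds to customizing the weights of the individual indexes.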

In an exemplary embodiment, the average index value of the j-th index of the P indexes corresponding to the i-th variable may be determined by the following steps S21 to S22 to determine P average index values of the P indexes corresponding to the i-th variable, where j is a positive integer less than or equal to P:

s21, obtaining Z index values of the j index corresponding to the i variable from the t index value sets, wherein Z is the occurrence number of the i variable;

and S22, determining the average value of the Z index values as the average index value of the j index corresponding to the i variable.

It should be noted that, assuming the i-th variable is drawn Z times, the t index value sets contain Z index value sets corresponding to the i-th variable (each of which has index values of the P indexes corresponding to the i-th variable). For each index, the Z index values may be summed and divided by Z to determine the average index value of that index.

As an alternative example, suppose the i-th variable enters the model 200 times in the above t model fittings and the corresponding model AUC values are (AUC1, AUC2, …, AUC200). The average index value of the AUC of the i-th variable is then (AUC1 + AUC2 + … + AUC200)/200.
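As a non-limiting illustration of steps S21 to S22, the sketch below averages each index over the Z index value sets in which a variable appears and returns Z alongside the averages. It assumes that each of the t index value sets is represented as a dictionary mapping a variable name to its index values; this data layout, the function name `average_index_values` and the numbers are hypothetical.

```python
from collections import defaultdict

def average_index_values(index_value_sets):
    """Return ({variable: {index: mean}}, {variable: occurrence count Z})."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for value_set in index_value_sets:            # one set per fitted model
        for variable, indexes in value_set.items():
            counts[variable] += 1                  # variable was drawn once more
            for name, value in indexes.items():
                sums[variable][name] += value
    averages = {v: {n: s / counts[v] for n, s in per_index.items()}
                for v, per_index in sums.items()}
    return averages, dict(counts)

# Three fitted models (t = 3) with made-up index values.
index_value_sets = [
    {"x1": {"auc": 0.70, "coef": 0.2}, "x2": {"auc": 0.70, "coef": 0.5}},
    {"x1": {"auc": 0.80, "coef": 0.4}, "x3": {"auc": 0.80, "coef": 0.1}},
    {"x2": {"auc": 0.60, "coef": 0.3}, "x3": {"auc": 0.60, "coef": 0.3}},
]
averages, occurrences = average_index_values(index_value_sets)
print(occurrences["x1"], averages["x1"])
```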

In an exemplary embodiment, determining N variables from the M variables according to the M evaluation values may be implemented in either of the following two ways:

Mode one: sorting the M evaluation values from large to small, and determining, from the M variables, the N variables whose evaluation values rank highest;

After the evaluation values are sorted from large to small, the resulting variable importance ranking may be used to select the N variables with the highest scores (i.e. the top-ranked evaluation values) for predicting the target event.

Mode two: determining, from the M variables, N variables whose evaluation values are larger than a preset threshold value.
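As a non-limiting illustration, the two selection modes can be sketched as follows, assuming that the M evaluation values are held in a dictionary keyed by variable name. The function names `select_top_n` and `select_above_threshold` and the scores are hypothetical.

```python
def select_top_n(evaluation_values, n):
    """Mode one: the n variables with the largest evaluation values."""
    ranked = sorted(evaluation_values, key=evaluation_values.get, reverse=True)
    return ranked[:n]

def select_above_threshold(evaluation_values, threshold):
    """Mode two: every variable whose evaluation value exceeds the threshold."""
    return [v for v, s in evaluation_values.items() if s > threshold]

# Made-up evaluation values for M = 4 variables.
evaluation_values = {"x1": 0.66, "x2": 0.41, "x3": 0.73, "x4": 0.12}
print(select_top_n(evaluation_values, 2))              # ['x3', 'x1']
print(select_above_threshold(evaluation_values, 0.4))  # ['x1', 'x2', 'x3']
```

In mode one N is fixed in advance, whereas in mode two N follows from the chosen threshold.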

In an exemplary embodiment, after the step S206, the following processing is further required for the acquired index set: and carrying out normalization processing or normalization processing on index values of partial indexes in the P indexes corresponding to each variable in each index value set.

It should be noted that the evaluation indexes output by fitting different models, for example the variance inflation factor and the Wald test chi-square value, may not be comparable across models because the models contain different variables. To make the variable evaluation indexes comparable between models, simple processing such as normalization or standardization is therefore applied, model by model, to some of the evaluation index values of the variables entering each model.

It should be noted that normalization is a data preprocessing technique that converts data of different ranges into a uniform range for easier comparison and analysis. Common methods are min-max normalization and Z-score standardization. Min-max normalization linearly maps the data into the range [0, 1], specifically: normalized value = (original value - minimum)/(maximum - minimum), where minimum and maximum are the minimum and maximum values in the data set, respectively. Normalization prevents differences in data range from distorting the analysis result, makes different indexes comparable, and can improve the convergence speed and performance of machine learning algorithms.

In an exemplary embodiment, assume that variable 1, variable 2 and variable 3 enter the model in the i-th model fitting, and that the resulting regression coefficients of variable 1, variable 2 and variable 3 are coef1, coef2 and coef3, respectively. Since only the importance of a variable to the model matters here, not the sign of its effect, the absolute value is taken first and normalization is performed afterwards. The normalized regression coefficient index of variable 1 is (|coef1| - min(|coef|))/(max(|coef|) - min(|coef|)); normalization makes the result more reliable. Alternatively, assuming that variable 1 is extracted Z times so that Z coefficients are obtained, max(|coef|) is the maximum of the absolute values of the Z coefficients, and min(|coef|) is the minimum of the absolute values of the Z coefficients.
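A minimal sketch of the two preprocessing options applied to regression coefficients is given below. It uses the standard min-max definition (value - minimum)/(maximum - minimum), which maps into [0, 1] as stated above, and the population standard deviation for the Z-score; the function names and the coefficient values are assumptions for illustration.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Linearly map values into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_standardize(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical coefficients coef1, coef2, coef3 from one model fitting.
coefs = [-2.0, 0.5, 1.0]
abs_coefs = [abs(c) for c in coefs]        # sign is discarded first
normalized = min_max_normalize(abs_coefs)  # largest |coef| maps to 1, smallest to 0
standardized = z_score_standardize(abs_coefs)
```

Taking the absolute value before normalizing means that a strongly negative coefficient ranks as highly as a strongly positive one, consistent with the importance-only view described above.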

In an exemplary embodiment, after the N variables are determined from the M variables according to the M evaluation values, the method further includes the following steps S31 to S32:

Step S31: training a logistic regression model by using a target sample set to obtain a target logistic regression model, where the target sample set includes samples of the N variables;

Step S32: obtaining N variable values of a target object on the N variables, and inputting the N variable values into the target logistic regression model to obtain a prediction result of the target event.
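Steps S31 to S32 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the logistic regression is fitted with plain batch gradient descent written out by hand (the embodiment does not prescribe a fitting routine), and the variable names, the pre-scaled toy samples, the learning rate and the epoch count are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_probability(weights, bias, x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical N = 2 selected variables and toy samples (label 1 = event occurs).
selected = ["income", "credit_score"]
samples = [
    ({"income": 0.9, "credit_score": 0.8, "age": 0.3}, 1),
    ({"income": 0.8, "credit_score": 0.9, "age": 0.6}, 1),
    ({"income": 0.7, "credit_score": 0.7, "age": 0.5}, 1),
    ({"income": 0.2, "credit_score": 0.1, "age": 0.4}, 0),
    ({"income": 0.1, "credit_score": 0.3, "age": 0.7}, 0),
    ({"income": 0.3, "credit_score": 0.2, "age": 0.2}, 0),
]
# Step S31: the target sample set keeps only the N selected variables.
rows = [[record[name] for name in selected] for record, _ in samples]
labels = [label for _, label in samples]
weights, bias = train_logistic_regression(rows, labels)

# Step S32: variable values of a target object on the N variables.
target_object = {"income": 0.85, "credit_score": 0.85, "age": 0.4}
probability = predict_probability(
    weights, bias, [target_object[name] for name in selected]
)
```

The unselected variable (`age` in this sketch) is dropped before training and before prediction, so the target model sees exactly the N selected variables in both steps.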

In one exemplary embodiment, where the target event is a determination of whether the target object has repayment capabilities, the N variables may include: age, family members, gender, work units, personal income, household income, credit ability, and the like.

It will be apparent that the embodiments described above are merely some, but not all, embodiments of the invention. For better understanding of the above method, the following description will explain the above process with reference to the examples, but is not intended to limit the technical solution of the embodiments of the present invention, specifically:

The method builds on traditional variable selection methods and borrows from ensemble learning: multiple learners are generated through sample set perturbation, input feature perturbation and similar means, and are combined into a strong learner with better accuracy. This addresses the problems that variable screening for a traditional logistic regression model is repetitive and cumbersome and may yield a locally optimal solution. Specifically, a theoretical method for automatic variable selection by random logistic regression is constructed: multidimensional comprehensive evaluation indexes are established through repeated model learning, an automatic ranking of the optimal candidate variable set is output, feature variables that are more effective and stable for the model are selected, and a reference basis is provided for replacement variables. The process reduces manual intervention by model developers and improves the efficiency of model development and iteration.

Ensemble learning is also called a multi-classifier system. A common approach is to generate multiple learners from a homogeneous weak learner through sample set perturbation, input feature perturbation, output representation perturbation, algorithm parameter perturbation and the like, then combine the learners with some ensemble strategy to obtain a strong learner with better accuracy, which makes the final comprehensive judgment and outputs the final result.

It should be noted that the random logistic regression in the present application may be built on feature subsets (i.e., input feature perturbation), and may at the same time be built on sample subsets (i.e., sample set perturbation). The core idea is to randomly sample the feature set many times so as to select as many feature combinations as possible, perform model fitting on the different feature subsets, manually combine the evaluation criteria for feature importance in the random logistic regression, and finally obtain the ranking of the optimal candidate variable set, so that a better set of modeling variables can be selected conveniently.

Different methods can serve as the 'base learner' in the model fitting process, such as stepwise regression or L1-regularized (L1-penalized) logistic regression. Training subset sampling and feature subset sampling can be carried out at the same time: based on learners of the same type, multiple learners are generated through sample set perturbation, input feature perturbation and similar means, and are then integrated. Feature importance can be evaluated with indexes such as the number of times a feature enters the model, the Wald test chi-square value, the regression coefficient, the AUC value, the F-score and the variance inflation factor. Finally, the feature importance evaluations of the 'base learners' are combined with a strategy such as weighted combination to output a scoring result, which gives the ranking of the optimal candidate variable set. Fig. 3 is a flowchart of the method for determining the ranking of the optimal candidate variable set in the present application. Specifically, the flow includes the following steps:

Step S1: randomly selecting, with replacement, a data set S_i (sample size p < n, number of features q < m) from the training set S (sample size n, number of features m). It should be noted that the features in the embodiments of the present application are the variables described above.

Step S2: training a learner M_i on S_i (an algorithm such as stepwise regression or L1-regularized logistic regression may be used) to obtain the multidimensional evaluation indexes (A_i, B_i, C_i, …) of the q features in the training set S_i;

Step S3: repeating the above steps t times to obtain t base models M_1, M_2, …, M_t and the multidimensional evaluation indexes (A_1 … A_m, B_1 … B_m, C_1 … C_m);

Step S4: applying a comprehensive strategy such as weighting to the multidimensional evaluation indexes of the importance of the m features to obtain the final random logistic regression feature importance score ranking (Q_1, Q_2, …, Q_m).
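The loop formed by steps S1 to S4 can be sketched as follows. This is a sketch under loudly stated assumptions, not the embodiment's implementation: the base learner is passed in as a callable, and the `stand_in_learner` used here is a placeholder that scores each feature by the absolute difference of class means instead of fitting a stepwise or L1-regularized logistic regression. The function names, the single index name "A", the weights and the toy data are all hypothetical.

```python
import random

def stand_in_learner(rows, labels, features):
    """Placeholder for the base learner M_i (NOT a logistic regression).

    Returns feature -> {"A": value}, where A is the absolute difference
    between the class means of that feature on the sampled subset.
    """
    result = {}
    for f in features:
        pos = [r[f] for r, y in zip(rows, labels) if y == 1]
        neg = [r[f] for r, y in zip(rows, labels) if y == 0]
        if not pos or not neg:
            continue  # a one-class subset cannot be scored
        result[f] = {"A": abs(sum(pos) / len(pos) - sum(neg) / len(neg))}
    return result

def random_subset_ranking(rows, labels, features, q, p, t, learner, weights, seed=0):
    rng = random.Random(seed)
    totals = {f: {} for f in features}
    counts = {f: 0 for f in features}
    for _ in range(t):                                        # Step S3: repeat t times
        chosen = rng.sample(features, q)                      # Step S1: q < m features
        picks = [rng.randrange(len(rows)) for _ in range(p)]  # p rows, with replacement
        sub_rows = [rows[k] for k in picks]
        sub_labels = [labels[k] for k in picks]
        indexes = learner(sub_rows, sub_labels, chosen)       # Step S2: fit and score
        for f, values in indexes.items():
            counts[f] += 1
            for name, v in values.items():
                totals[f][name] = totals[f].get(name, 0.0) + v
    scores = {}
    for f in features:                                        # Step S4: weighted combination
        if counts[f] == 0:
            scores[f] = 0.0
            continue
        averaged = sum(weights[name] * total / counts[f]
                       for name, total in totals[f].items())
        scores[f] = averaged + weights.get("FRE", 0.0) * counts[f] / t
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy data: "signal" separates the classes, "weak" barely does, "noise" does not.
rows = [{"signal": float(k % 2), "weak": 0.4 + 0.2 * (k % 2), "noise": (k % 3) / 10}
        for k in range(20)]
labels = [k % 2 for k in range(20)]
ranking, scores = random_subset_ranking(
    rows, labels, ["signal", "weak", "noise"], q=2, p=10, t=30,
    learner=stand_in_learner, weights={"A": 0.8, "FRE": 0.2},
)
```

Keeping the learner pluggable mirrors the statement that different methods can serve as the base learner; only the callable would change if a real logistic regression fit were substituted.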

The ranking of the optimal candidate variable set obtained through the above steps takes multiple factors into account, including the relationship between each variable and the dependent variable and the interactions among variable combinations. A model developer can read the ranking directly and, combining it with business experience, decide on the final modeling variables.

It should be noted that, taking the random logistic regression based on feature set sampling as an example, the model training process is correspondingly described:

Before model training, the full data set needs to be preprocessed, including but not limited to missing-rate analysis, missing-value filling and outlier handling. q features are extracted each time; the specific number can be set according to the number of candidate features. t data subsets S_i are formed by sampling t times with a method such as Monte Carlo sampling, and the number of samplings t is chosen so that the t data subsets substantially cover all feature combinations.
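A minimal sketch of the feature subset sampling, together with a check that the union of the t subsets covers every candidate feature, is shown below. The function name, the feature names and the values of q, t and the seed are assumptions for illustration; the check covers individual features only, not every feature combination.

```python
import random

def sample_feature_subsets(features, q, t, seed=0):
    """Draw t random subsets of q distinct features and report uncovered features."""
    rng = random.Random(seed)
    subsets = [rng.sample(features, q) for _ in range(t)]
    covered = set().union(*subsets)
    missing = [f for f in features if f not in covered]
    return subsets, missing

# Hypothetical candidate list of M = 50 features, q = 20 per draw, t = 100 draws.
features = [f"x{k}" for k in range(50)]
subsets, missing = sample_feature_subsets(features, q=20, t=100, seed=42)
```

If `missing` is not empty, t can be increased or further subsets drawn until every candidate feature has been sampled at least once.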

A model M_i is trained on each data subset S_i. A stepwise logistic regression algorithm can be used as the learner for model fitting. After fitting, the features entering the model M_i are output together with the feature evaluation indexes, which include but are not limited to the regression coefficient (coef), the Wald test chi-square value and the variance inflation factor (VIF for short), and the model evaluation indexes, which include but are not limited to the adjusted R-squared and the model AUC. The process is repeated t times to obtain each variable's frequency of entering the model (Frequentist Estimation Frequency, FRE for short), the importance of each variable, and the interrelationships among variable combinations.

The role of the learner M_i is to obtain the actual performance of the variables in the model directly by fitting the model. At the same time, fitting many times on different variable subsets avoids the locally optimal solution that a single regression may produce and weakens the mutual influence of similar variables during model fitting. After each model fitting, the variable and model evaluation indexes commonly used with logistic regression models can be output, such as the regression coefficient, the chi-square test value and the variance inflation factor of each variable; the importance of the variables and the quality of the model can be judged with these together with common model evaluation indexes such as R-squared and AUC. Each fitting on a data subset S_i yields the evaluation indexes (A_i, B_i, C_i, …) of the model in each dimension, after which a comprehensive ranking of the effectiveness of the candidate features can be obtained through normalization, weighting and similar means for further variable selection.

The following is a corresponding description of feature importance ranking:

Fitting the different models M_i outputs the evaluation indexes at the same time. However, some indexes, such as the variance inflation factor and the Wald test chi-square value, may not be comparable across models because the models contain different variables. To make the feature evaluation indexes comparable between models, simple processing such as normalization or standardization is applied, model by model, to some of the evaluation indexes of the variables entering each model. For example, in the i-th model fitting, the regression coefficients of variables 1, 2 and 3 are coef1, coef2 and coef3, respectively. Here only the importance of a variable to the model is considered, not the sign of its effect, so the absolute value is taken first and normalization is performed afterwards; the normalized regression coefficient index of variable 1 is (|coef1| - min(|coef|))/(max(|coef|) - min(|coef|)).

The results of the t models are then summarized. For example, if variable A enters the model 200 times in the t model fittings and the corresponding model AUC values are (AUC1, AUC2, …, AUC200), the final AUC evaluation index of variable A is (AUC1 + AUC2 + … + AUC200)/200, and its model-entry frequency evaluation index is FRE = 200. In this way, the evaluation index values in each dimension (COEF.i, WALD.i, VIF.i, AUC.i, FRE.i, …) are obtained for each feature of the candidate variable set (m features). Finally, a suitable weighting strategy, chosen in combination with expert experience, merges the multidimensional evaluation indexes into a unified evaluation index. For example, custom weights are used in the practice below: the regression coefficient and the model-entry frequency are each given a weight of 0.4, and the AUC index is given a weight of 0.15; because a variable with a larger variance inflation factor affects stability when it enters the model, the variance inflation factor is given a weight of -0.05. The comprehensive score of variable i is then 0.4 × COEF.i + 0.4 × FRE.i + 0.15 × AUC.i - 0.05 × VIF.i. The candidate variable set is sorted by comprehensive score to obtain the final feature importance ranking, and the 10 to 20 highest-scoring variables are selected for subsequent steps such as sorting and modeling.
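The weighted combination can be sketched as follows, with the 0.15 weight applied to the AUC index in line with the weights listed above. The index values in `index_table` are hypothetical and are assumed to have already been normalized to a common scale (for example, the raw entry count FRE = 200 would first be scaled); the function names are likewise assumptions.

```python
# Custom weights from the example: COEF 0.4, FRE 0.4, AUC 0.15, VIF -0.05.
WEIGHTS = {"COEF": 0.4, "FRE": 0.4, "AUC": 0.15, "VIF": -0.05}

def composite_score(indexes, weights=WEIGHTS):
    """Weighted sum of one variable's (normalized) evaluation indexes."""
    return sum(weights[name] * indexes[name] for name in weights)

def rank_variables(index_table, weights=WEIGHTS):
    """Return (variable, score) pairs sorted from highest to lowest score."""
    scores = {v: composite_score(idx, weights) for v, idx in index_table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical normalized index values for three candidate variables.
index_table = {
    "var_a": {"COEF": 0.9, "FRE": 1.0, "AUC": 0.8, "VIF": 0.1},
    "var_b": {"COEF": 0.2, "FRE": 0.5, "AUC": 0.6, "VIF": 0.9},
    "var_c": {"COEF": 0.6, "FRE": 0.3, "AUC": 0.7, "VIF": 0.2},
}
ranking = rank_variables(index_table)
```

The negative VIF weight lowers the score of variables that are strongly collinear with others, which is how the stability concern described above enters the ranking.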

To verify the effect of variable selection by random logistic regression based on feature set sampling, stepwise regression is used as the 'base learner', and 20 variables are randomly extracted for model fitting in each of many rounds, so that the results substantially cover the whole candidate variable list. The number of times each feature enters the model, its regression coefficient, the model AUC, the variance inflation factor and the like are collected during the random logistic regression fitting, normalized, and weighted in different proportions to obtain the final comprehensive score. Sorting by comprehensive score gives the model-entry order of the optimal candidate variable set, which provides a reference basis for selecting the variables entering the model and for replacing variables later. Fig. 4 is a schematic diagram of the ranking of the optimal candidate variable set obtained with the technical solution of the present application. As can be seen from Fig. 4, the top twenty variables are varied in type, covering the application, in-bank and credit-evaluation categories, and the correlation between them is low. When the top twenty variables are selected directly for logistic regression without considering business experience, the AUC value of the final fitted model is 0.785, which is better than the AUC values of models fitted on variables selected by other conventional methods. This empirical result shows that the candidate variables generated by the random logistic regression variable selection method based on feature set sampling perform well in the final logistic regression model.

In the above variable determining process, as many feature combinations as possible are selected by randomly sampling the feature set many times, model fitting is performed on the different feature subsets, the evaluation criteria for feature importance in the random logistic regression are combined manually, and the ranking of the optimal candidate variable set is finally obtained, which yields a variable determining model that is simple to operate and highly reliable. Specifically, building on traditional variable selection methods and drawing on ensemble learning, a theoretical method for automatic variable selection by random logistic regression is constructed; multidimensional comprehensive evaluation indexes of the features are established, multiple factors are considered together, the optimal candidate variable set is ranked automatically, feature variables that are more effective and stable for the model are selected, and a reference basis is provided for replacing variables.

By this design, the following effects can be achieved:

1. Automatic optimal ranking of candidate feature sets can be achieved in scorecard modeling, and the multidimensional comprehensive evaluation indexes improve the effectiveness of variable selection;

2. Automatic ranking reduces manual intervention by model developers, reduces the operation steps of variable selection and improves the efficiency of model development;

3. The method is easy to update and optimize, for example by changing the base learner or the weighted combination strategy of the indexes, and can be extended to the variable selection process of other modeling work.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.

This embodiment also provides a variable determining device for implementing the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 5 is a block diagram of a variable determining apparatus according to an embodiment of the present application, as in fig. 5, the apparatus includes:

a first processing module 502, configured to obtain a sample set and randomly sample the sample set t times to obtain t sub-sample sets, where the sample set includes samples of M variables, each of the t sub-sample sets includes samples of Q variables, the union of the t sub-sample sets includes the samples of the M variables, and Q is smaller than M;

a training module 504, configured to train a logistic regression model by using each of the t sub-sample sets, to obtain t reference logistic regression models;

an obtaining module 506, configured to obtain an index value set determined by each reference logistic regression model in the t reference logistic regression models, to obtain t index value sets, where each index value set includes: index values of P indexes corresponding to each variable in the corresponding Q variables;

a first determining module 508, configured to determine an evaluation value of each of the M variables according to the t index value sets, to obtain M evaluation values corresponding to the M variables;

a second determining module 510, configured to determine N variables from the M variables according to the M evaluation values, where the N variables are used to predict a target event.

Through the device, each sub-sample set is used for training the model, the index value set determined by the model is obtained at the same time after model fitting, then the evaluation value of each variable is determined according to the determined t index value sets, and then the variable is selected according to the evaluation value of the variable. By adopting the scheme, the problems of high repetition rate, complex operation and local optimal solution in the process of variable selection by adopting the traditional method are solved, so that the variable selection process is simplified, and the technical effect of improving the reliability of the variable selection result is achieved.

In an exemplary embodiment, the first determining module 508 is further configured to determine the evaluation value of the i-th variable of the M variables in the following manner, so as to determine the evaluation value of each of the M variables, where i is a positive integer less than or equal to M: determining the number of occurrences of the i-th variable in the t index value sets, and determining P average index values of the P indexes corresponding to the i-th variable according to the t index value sets; and performing weighted summation on the P average index values and the number of occurrences to obtain the evaluation value of the i-th variable.

In an exemplary embodiment, the first determining module 508 is further configured to determine the average index value of the j-th index of the P indexes corresponding to the i-th variable in the following manner, so as to determine the P average index values of the P indexes corresponding to the i-th variable, where j is a positive integer less than or equal to P: obtaining Z index values of the j-th index corresponding to the i-th variable from the t index value sets, where Z is the number of occurrences of the i-th variable; and determining the average of the Z index values as the average index value of the j-th index corresponding to the i-th variable.

In an exemplary embodiment, the second determining module 510 is further configured to sort the M evaluation values in descending order and determine, from the M variables, the N variables whose evaluation values rank highest; or to determine, from the M variables, N variables whose evaluation values are greater than a preset threshold.

In an exemplary embodiment, the obtaining module 506 is further configured to, after the t index value sets are obtained, perform normalization or standardization on the index values of some of the P indexes corresponding to each variable in each index value set.

In an exemplary embodiment, the apparatus further includes a second processing module, configured to train the logistic regression model using a target sample set after determining N variables from the M variables according to the M evaluation values, to obtain a target logistic regression model, where the target sample set includes samples of the N variables; and obtaining N variable values of the target object on the N variables, and inputting the N variable values into the target logistic regression model to obtain a prediction result of the target event.

In an exemplary embodiment, the logistic regression model is a model using a stepwise logistic regression algorithm or a model using a logistic regression algorithm with an L1 regularization penalty. In the case where the logistic regression model is a model using a stepwise logistic regression algorithm, the P indexes include: the regression coefficient of a variable, the test chi-square value of a variable, the variance inflation factor of a variable, the adjusted coefficient of determination of the model, and the area under the curve (AUC) of the model.

It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.

Embodiments of the present invention also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or various other media capable of storing a computer program.

An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.

For specific examples of this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, which are not repeated here.

It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A variable determining method, comprising:

obtaining a sample set, and randomly sampling the sample set t times to obtain t sub-sample sets, wherein the sample set comprises samples of M variables, each sub-sample set in the t sub-sample sets comprises samples of Q variables, the union of the t sub-sample sets comprises the samples of the M variables, and Q is smaller than M;

training a logistic regression model by using each of the t sub-sample sets to obtain t reference logistic regression models;

obtaining index value sets determined by each reference logistic regression model in the t reference logistic regression models to obtain t index value sets, wherein each index value set comprises: index values of P indexes corresponding to each variable in the corresponding Q variables;

determining an evaluation value of each variable in the M variables according to the t index value sets to obtain M evaluation values corresponding to the M variables;

and determining N variables from the M variables according to the M evaluation values, wherein the N variables are used for predicting a target event.

2. The method of claim 1, wherein determining an evaluation value of each of the M variables according to the t index value sets comprises:

determining an evaluation value of an i-th variable of the M variables in the following manner, so as to determine the evaluation value of each of the M variables, wherein i is a positive integer less than or equal to M:

determining the number of occurrences of the i-th variable in the t index value sets, and determining P average index values of the P indexes corresponding to the i-th variable according to the t index value sets;

and performing weighted summation on the P average index values and the number of occurrences to obtain the evaluation value of the i-th variable.

3. The method of claim 2, wherein determining P average index values of the P indexes corresponding to the i-th variable according to the t index value sets comprises:

determining an average index value of a j-th index of the P indexes corresponding to the i-th variable in the following manner, so as to determine the P average index values of the P indexes corresponding to the i-th variable, wherein j is a positive integer less than or equal to P:

obtaining Z index values of the j-th index corresponding to the i-th variable from the t index value sets, wherein Z is the number of occurrences of the i-th variable;

and determining the average of the Z index values as the average index value of the j-th index corresponding to the i-th variable.

4. The method of claim 1, wherein determining N variables from the M variables based on the M evaluation values comprises:

sorting the M evaluation values in descending order, and determining, from the M variables, the N variables whose evaluation values rank highest; or

determining, from the M variables, N variables whose evaluation values are greater than a preset threshold.

5. The method of claim 1, wherein after deriving the set of t index values, the method further comprises:

performing normalization or standardization on the index values of some of the P indexes corresponding to each variable in each index value set.

6. The method of claim 1, wherein after determining N variables from the M variables based on the M evaluation values, the method further comprises:

training a logistic regression model by using a target sample set to obtain a target logistic regression model, wherein the target sample set comprises samples of the N variables;

and obtaining N variable values of the target object on the N variables, and inputting the N variable values into the target logistic regression model to obtain a prediction result of the target event.

7. The method of claim 1, wherein the logistic regression model is a model using a stepwise logistic regression algorithm or a model using a logistic regression algorithm with an L1 regularization penalty, and wherein, in the case that the logistic regression model is a model using a stepwise logistic regression algorithm, the P indexes include: the regression coefficient of a variable, the test chi-square value of a variable, the variance inflation factor of a variable, the adjusted coefficient of determination of the model, and the area under the curve (AUC) of the model.

8. A variable determining apparatus, comprising:

a first processing module, configured to obtain a sample set and randomly sample the sample set t times to obtain t sub-sample sets, wherein the sample set comprises samples of M variables, each sub-sample set in the t sub-sample sets comprises samples of Q variables, the union of the t sub-sample sets comprises the samples of the M variables, and Q is smaller than M;

a training module, configured to train a logistic regression model by using each of the t sub-sample sets to obtain t reference logistic regression models;

an acquisition module, configured to acquire an index value set determined by each reference logistic regression model in the t reference logistic regression models to obtain t index value sets, wherein each index value set comprises: index values of P indexes corresponding to each variable in the corresponding Q variables;

a first determining module, configured to determine an evaluation value of each variable in the M variables according to the t index value sets to obtain M evaluation values corresponding to the M variables;

and a second determining module, configured to determine N variables from the M variables according to the M evaluation values, wherein the N variables are used for predicting a target event.

9. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the method according to any of the claims 1 to 7.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when the computer program is executed.