CN112116443A - Model generation method and model generation device based on variable grouping and electronic equipment - Google Patents


Info

Publication number
CN112116443A
Authority
CN
China
Prior art keywords
variables
variable
model generation
data set
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910535051.7A
Other languages
Chinese (zh)
Inventor
刘志玲
党亚瑞
李莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sino Credit Information Technology Beijing Co ltd
Original Assignee
Sino Credit Information Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sino Credit Information Technology Beijing Co ltd filed Critical Sino Credit Information Technology Beijing Co ltd
Priority to CN201910535051.7A priority Critical patent/CN112116443A/en
Publication of CN112116443A publication Critical patent/CN112116443A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a model generation method and a model generation device based on variable grouping and an electronic device. The model generation method based on variable grouping comprises the following steps: acquiring a data set comprising a plurality of variables; calculating a predetermined parameter for each of the plurality of variables; sorting the plurality of variables based on a magnitude of the predetermined parameter; for the plurality of variables, dividing all variables into a plurality of variable sets different from each other based on correlation coefficients between the variables and variance expansion factor values of the variables; and generating a model based on each set of variables. In this way, variables can be screened based on preset criteria, and the screened variables are clustered into different groups for model generation, thereby improving data utilization and model robustness.

Description

Model generation method and model generation device based on variable grouping and electronic equipment
Technical Field
The present application relates generally to the field of data processing, and more particularly, to a method and apparatus for generating a model based on variable grouping, and an electronic device.
Background
Credit scoring models are widely used in the field of credit risk and play a key role, particularly in retail credit risk management practice. Specifically, in the credit application stage, automatic decision-making is realized through a strategy based on application scores; in the post-loan management stage, behavior scores and collection scores can be used to design customer management, early-warning, and collection strategies. A credit-score-based decision mechanism helps credit risk managers run the credit business efficiently and objectively.
The application scoring model, the behavior scoring model, and the collection scoring model are all generated from feature variables, and generating such a model involves dimension reduction or screening of the variables.
Traditionally, the variable screening step in model generation selects only a single set of variables for the final model fit, so useful variables that fall short of the screening threshold often end up unavailable for model generation.
Accordingly, there is a need for improved model generation schemes.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the present application provide a model generation method based on variable grouping, a model generation apparatus, and an electronic device, which can screen variables based on a preset standard, and cluster the screened variables into different groups for model generation, thereby improving data utilization and robustness of a model.
According to an aspect of the present application, there is provided a variable grouping-based model generation method, including:
step 1: acquiring a data set comprising a plurality of variables;
step 2: calculating a predetermined parameter for each of the plurality of variables;
step 3: sorting the plurality of variables based on a magnitude of the predetermined parameter;
step 4: acquiring a first variable of the plurality of variables;
step 5: calculating a correlation coefficient between the first variable and each of the remaining variables;
step 6: deleting variables whose correlation coefficients are greater than a first threshold to obtain an initial set of variables relative to the first variable;
step 7: calculating a variance inflation factor value for a last variable in the initial set of variables;
step 8: deleting the last variable from the initial set of variables in response to the variance inflation factor value being greater than a second threshold, and retaining the last variable in the initial set of variables in response to the variance inflation factor value being less than the second threshold;
step 9: repeating steps 7 and 8 above for each variable in the initial set of variables to obtain a first set of variables relative to the first variable;
step 10: repeating steps 3 through 9 for variables of the plurality of variables other than the first set of variables to obtain a plurality of sets of variables; and
step 11: generating a model based on each set of variables.
In the above model generation method based on variable grouping, the predetermined parameter represents the discriminative power of the variable and includes one of the following: Gini index, KS value, ROC value.
In the above model generation method based on variable grouping, sorting the plurality of variables based on the magnitude of the predetermined parameter includes: sorting the variables in descending order of the absolute value of the Gini index of each variable.
In the above model generation method based on variable grouping, generating a model based on each variable set includes: generating a plurality of classifiers for each set of variables; and integrating the plurality of classifiers into a final classifier in an integration method.
In the above method for generating a model based on variable grouping, acquiring a data set including a plurality of variables includes: acquiring an initial data set; determining whether the size of the initial data set is greater than a third threshold; responsive to the size of the initial data set being greater than the third threshold, dividing the initial data set into a training data set and a testing data set; and setting the training data set to the data set comprising a plurality of variables.
According to another aspect of the present application, there is provided a variable grouping-based model generation apparatus, including:
a data acquisition unit for acquiring a data set containing a plurality of variables;
a parameter calculation unit for calculating a predetermined parameter for each of the plurality of variables;
a variable grouping unit for performing the steps of:
step 1: sorting the plurality of variables based on a magnitude of the predetermined parameter;
step 2: acquiring a first variable of the plurality of variables;
step 3: calculating a correlation coefficient between the first variable and each of the remaining variables;
step 4: deleting variables whose correlation coefficients are greater than a first threshold to obtain an initial set of variables relative to the first variable;
step 5: calculating a variance inflation factor value for a last variable in the initial set of variables;
step 6: deleting the last variable from the initial set of variables in response to the variance inflation factor value being greater than a second threshold, and retaining the last variable in the initial set of variables in response to the variance inflation factor value being less than the second threshold;
step 7: repeating steps 5 and 6 above for each variable in the initial set of variables to obtain a first set of variables relative to the first variable;
step 8: repeating steps 1 to 7 for variables of the plurality of variables other than the first set of variables to obtain a plurality of sets of variables; and
a model generation unit for generating a model based on each set of variables.
In the above model generation device based on variable grouping, the predetermined parameter represents the discriminative power of the variable and includes one of the following: Gini index, KS value, ROC value.
In the above model generation device based on variable grouping, sorting the plurality of variables based on the magnitude of the predetermined parameter includes: sorting the variables in descending order of the absolute value of the Gini index of each variable.
In the above-mentioned model generation apparatus based on variable grouping, the model generation unit is configured to: generating a plurality of classifiers for each set of variables; and integrating the plurality of classifiers into a final classifier in an integration method.
In the above model generation device based on variable grouping, the data acquisition unit is configured to: acquiring an initial data set; determining whether the size of the initial data set is greater than a third threshold; responsive to the size of the initial data set being greater than the third threshold, dividing the initial data set into a training data set and a testing data set; and setting the training data set to the data set comprising a plurality of variables.
According to yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the variable grouping-based model generation method as described above.
The variable grouping-based model generation method, the variable grouping-based model generation device and the electronic equipment can screen the variables based on the preset standard, and cluster the screened variables into different groups for model generation, so that the data utilization rate and the robustness of the model are improved.
Drawings
Various other advantages and benefits of the present application will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. It is obvious that the drawings described below are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. Also, like parts are designated by like reference numerals throughout the drawings.
FIG. 1 illustrates a flow diagram of a method of variable grouping-based model generation in accordance with an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a calculation of a Gini index in a variable grouping based model generation method according to an embodiment of the present application;
FIG. 3 illustrates a block diagram of a variable grouping-based model generation apparatus according to an embodiment of the present application;
FIG. 4 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above, in the conventional model generation process, the variable screening process screens out only one set of variables for final model fitting, which generally results in unselected useful variables eventually not being available for model generation, thereby affecting data utilization and affecting model performance.
In view of this technical problem, the basic idea of the present application is to propose a model generation method, a model generation apparatus, and an electronic device based on variable grouping, which sort the variables based on a predetermined parameter characterizing a property of the variables, group all the variables based on the correlations between the variables and their variance inflation factor values to obtain a plurality of mutually disjoint variable sets, and use the plurality of variable sets to generate a model. Because the correlations and the variance inflation factor values among the variables within each variable set are below specific thresholds, inaccurate fitting results caused by multicollinearity among the variables are avoided, and a separate correlation analysis step can be omitted from the model generation process. Moreover, since all variables are grouped into a plurality of mutually disjoint variable sets, all variables can be used in the model generation process, thereby improving data utilization and the robustness of the generated model.
Here, the model to which the present application relates is not limited to the credit scoring model in the credit risk field, but may be other types of models.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
FIG. 1 illustrates a flow diagram of a model generation method based on variable grouping according to an embodiment of the present application. As shown in FIG. 1, the model generation method based on variable grouping according to an embodiment of the present application includes:
S101, acquiring a data set containing a plurality of variables;
S102, calculating a predetermined parameter of each variable in the plurality of variables;
S103, sorting the plurality of variables based on the magnitude of the predetermined parameter;
S104, acquiring a first variable of the plurality of variables;
S105, calculating correlation coefficients between the first variable and the remaining variables;
S106, deleting the variables whose correlation coefficient is greater than a first threshold to obtain an initial variable set relative to the first variable;
S107, calculating a variance inflation factor value of the last variable in the initial variable set;
S108, deleting the last variable from the initial variable set in response to the variance inflation factor value being greater than a second threshold, and retaining the last variable in the initial variable set in response to the variance inflation factor value being less than the second threshold;
S109, repeating S107 and S108 for each variable in the initial variable set to obtain a first variable set relative to the first variable;
S110, repeating S103 to S109 for the variables of the plurality of variables other than the first variable set to obtain a plurality of variable sets; and
S111, generating a model based on each variable set.
In step S101, a data set including a plurality of variables is acquired. For example, for a classification problem, the data set is $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(M)}, y^{(M)})\}$ with $x^{(i)} \in \mathbb{R}^N$ and $y^{(i)} \in \{-1, +1\}$, where $x$ represents the feature variables and $y$ represents the label variable. For example, for a credit card overdue data set, $x$ = (age, income level), $y$ denotes whether the credit card is overdue (-1: not overdue, +1: overdue), $x^{(i)}$ represents the i-th observation, for example (working years: 10, age: 36, income: 5000 yuan), and $y^{(i)}$ is the overdue state (-1/+1) of the i-th observation.
In step S102, a predetermined parameter of each of the plurality of variables is calculated. Here, in the variable grouping-based model generation method according to the embodiment of the present application, the predetermined parameter may be a parameter for representing a property of the variable.
In particular, the predetermined parameter may be the Gini index of each variable. The Gini index represents the ability of the variable to discriminate between sample classes: the larger the Gini index, the stronger the discriminative power of the variable. For a variable x, the Gini index for a given sample is calculated as follows.
First, a cross-table of x against the label variable (-1/1) is made, as exemplified below:
[Table image BDA0002100944920000061: cross-table of x against the label variable]
Here, the samples are divided into "good" samples and "bad" samples based on the properties of the data set. For example, in the data set for determining a customer's credit card overdue risk described above, samples may be divided into default samples (i.e., "bad" samples) and non-default samples (i.e., "good" samples).
Then, the "good sample cumulative percentage" is used as the x-axis and the "bad sample cumulative percentage" is used as the y-axis, as shown in FIG. 2. FIG. 2 illustrates a schematic diagram of the calculation of the Gini index in the variable grouping-based model generation method according to the embodiment of the present application.
Finally, the area A enclosed by the diagonal and the curve shown in FIG. 2 is calculated, and the Gini index value is 2 × A.
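The calculation described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the application: the function name `gini_index` is hypothetical, +1 is assumed to mark a "bad" sample and -1 a "good" sample, the rows of the cross-table are assumed to be the distinct values of x in ascending order, both classes are assumed to be present, and the area is kept signed so that its absolute value can be used for ranking.

```python
import numpy as np

def gini_index(x, y):
    """Signed Gini index of one variable x against labels y in {-1, +1}.

    Assumptions (not stated in the source): +1 = "bad", -1 = "good",
    samples are accumulated in ascending order of x, both classes occur.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    values = np.unique(x)  # sorted distinct values = rows of the cross-table
    bad = np.array([np.sum((x == v) & (y == 1)) for v in values], dtype=float)
    good = np.array([np.sum((x == v) & (y == -1)) for v in values], dtype=float)

    # Cumulative percentages, starting from the origin (0, 0).
    cum_good = np.concatenate(([0.0], np.cumsum(good) / good.sum()))
    cum_bad = np.concatenate(([0.0], np.cumsum(bad) / bad.sum()))

    # Trapezoidal area under the curve (good on the x-axis, bad on the y-axis).
    area_under_curve = np.sum(np.diff(cum_good) * (cum_bad[1:] + cum_bad[:-1]) / 2.0)

    # A is the (signed) area between the curve and the diagonal; Gini = 2 * A.
    return 2.0 * (area_under_curve - 0.5)
```

A variable that separates the two classes perfectly yields a value of magnitude 1 under these assumptions, and a variable with identical class proportions at every value yields 0.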
In addition, in the model generation method based on variable grouping according to the embodiment of the present application, the predetermined parameter may also be another parameter representing a predetermined property of the variable, for example a KS value, an ROC value, or another parameter that reflects the discriminative power of the feature variables.
That is, in the model generation method based on variable grouping according to the embodiment of the present application, the predetermined parameter represents the discriminative power of the variable and includes one of the following: Gini index, KS value, ROC value.
In step S103, the plurality of variables are sorted based on the magnitude of the predetermined parameter. For example, after the Gini index of each variable is calculated, the plurality of variables are sorted in descending order of the absolute value of the Gini index.
Here, in the model generation method based on the variable grouping according to the embodiment of the present application, it is desirable to first obtain the most important variables and then obtain the variable sets on the basis of the variables. Therefore, before grouping the variables, the variables need to be first sorted according to the predetermined parameters. In this way, it can be guaranteed that the most important variables are selected first in the variable grouping process.
That is, in the variable grouping-based model generation method according to the embodiment of the present application, sorting the plurality of variables based on the magnitude of the predetermined parameter includes: sorting the variables in descending order of the absolute value of the Gini index of each variable.
In step S104, a first variable of the plurality of variables is acquired. That is, taking the above sorting based on the Gini index as an example, the variable with the largest absolute value of the Gini index is selected first. For example, assume that the data set includes 10 variables which, after sorting, are variable 1, variable 2, …, variable 10; then variable 1 is selected first.
In step S105, the correlation coefficient between the first variable and each of the remaining variables is calculated. Specifically, the correlation coefficient between a variable A and a variable B is calculated as follows:
$$\rho_{AB} = \frac{\operatorname{Cov}(A, B)}{\sqrt{D(A)}\,\sqrt{D(B)}}$$
where Cov(A, B) is the covariance of A and B, and D(A) and D(B) are the variances of A and B, respectively.
In step S106, the variables whose correlation coefficient is greater than the first threshold are deleted to obtain an initial variable set relative to the first variable. For example, as described above, through steps S104 to S106, an initial variable set based on variable 1 is constructed, denoted for example as variable set $S_0$, which is assumed to include variable 1, variable 3, variable 5, variable 6, and variable 9.
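Steps S104 to S106 can be sketched as follows. This is a minimal sketch under stated assumptions: the names `correlation` and `initial_variable_set` are hypothetical, the data set is assumed to be a dict mapping variable names to 1-D arrays, the names are assumed to be already sorted as in step S103, and the comparison with the first threshold is assumed to use the absolute value of the coefficient (the text does not say whether the sign is considered).

```python
import numpy as np

def correlation(a, b):
    """Correlation coefficient Cov(A, B) / (sqrt(D(A)) * sqrt(D(B)))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / np.sqrt(a.var() * b.var())

def initial_variable_set(data, ordered_names, first_threshold):
    """Keep the first variable plus every variable not too correlated with it.

    data: dict name -> 1-D array (assumed layout).
    ordered_names: names already sorted as in step S103.
    """
    first = ordered_names[0]
    kept = [first]
    for name in ordered_names[1:]:
        # Assumption: the threshold is applied to the absolute correlation.
        if abs(correlation(data[first], data[name])) <= first_threshold:
            kept.append(name)
    return kept
```

Variables that are dropped here are not discarded; they remain available for the later variable sets.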
In step S107, the variance inflation factor (VIF) value of the last variable in the initial variable set is calculated. Here, the variance inflation factor value is used to determine whether multicollinearity exists among the variables; the greater the variance inflation factor value, the more severe the multicollinearity. That is, the least important variable in the initial variable set is selected first; for example, based on the ranking of the Gini indexes, the variable with the smallest Gini index in the initial variable set is selected. In the above example, this is variable 9 in the variable set $S_0$, denoted as variable B. Specifically, the variance inflation factor value of variable B is calculated by taking variable B as the dependent variable, performing a linear regression fit on the other variables in the variable set $S_0$ (variable 1, variable 3, variable 5 and variable 6), and then calculating the coefficient of determination $R^2$ from the fitting result:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $y_i$ is the true value of variable B, $\hat{y}_i$ is the value of variable B estimated by the linear fit, and $\bar{y}$ is the mean of variable B. The variance inflation factor value $\mathrm{VIF}_B$ of variable B is given by the following formula:
$$\mathrm{VIF}_B = \frac{1}{1 - R^2}$$
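The variance inflation factor calculation of step S107 can be sketched with an ordinary least-squares fit. The function name `vif` and the array layout are assumptions made for illustration, and the fit is assumed to include an intercept term, which the text does not specify.

```python
import numpy as np

def vif(target, others):
    """Variance inflation factor of `target` regressed on `others`.

    target: 1-D array with the values of the variable under test.
    others: 1-D or 2-D array (n_samples, n_other_variables).
    Assumption: the linear regression includes an intercept.
    """
    target = np.asarray(target, dtype=float)
    others = np.asarray(others, dtype=float)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coef

    ss_res = np.sum((target - fitted) ** 2)         # residual sum of squares
    ss_tot = np.sum((target - target.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
```

A value close to 1 indicates that the other variables explain almost none of the variable under test, while a very large value indicates that it is nearly a linear combination of them.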
In step S108, the last variable is deleted from the initial variable set in response to the variance inflation factor value being greater than a second threshold, and retained in the initial variable set in response to the variance inflation factor value being less than the second threshold. For example, in the above example, if the variance inflation factor value of variable 9 is greater than the second threshold, variable 9 is deleted from the variable set $S_0$. Conversely, if the variance inflation factor value of variable 9 is less than the second threshold, variable 9 is retained in the variable set $S_0$.
In step S109, steps S107 and S108 are repeated for each variable in the initial variable set to obtain a first variable set relative to the first variable. For example, in the above example, if variable 9 is deleted, variable 6 is fitted by linear regression on variable 1, variable 3 and variable 5; if variable 9 is retained, variable 6 is fitted by linear regression on variable 1, variable 3, variable 5 and variable 9. It is then determined whether the variance inflation factor value of variable 6 is greater than the second threshold. Next, the above steps are repeated for variable 5, variable 3 and variable 1 to obtain the first variable set relative to the first variable. For example, a first variable set $V_0$ based on the first variable is obtained, which is assumed to include variable 1, variable 3 and variable 5.
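The backward pass of steps S107 to S109 can be sketched as a loop that walks the initial variable set from its last (least important) member to its first. The names `vif` and `prune_by_vif` and the dict-based data layout are hypothetical, and a variable whose value equals the second threshold is kept here, a case the text leaves open.

```python
import numpy as np

def vif(target, others):
    """VIF of `target` regressed (with intercept, an assumption) on `others`."""
    target = np.asarray(target, dtype=float)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    ss_res = np.sum((target - design @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

def prune_by_vif(data, initial_set, second_threshold):
    """Test each variable, last to first, against the variables still kept."""
    kept = list(initial_set)
    for name in reversed(initial_set):
        others = [n for n in kept if n != name]
        if not others:
            break  # a single remaining variable has nothing to be regressed on
        regressors = np.column_stack([data[n] for n in others])
        if vif(data[name], regressors) > second_threshold:
            kept.remove(name)  # deleted here, but still available for later sets
    return kept
```

Because each test uses only the variables still kept, deleting variable 9 in the example above changes the regressors used when variable 6 is tested, as the text describes.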
In step S110, steps S103 to S109 are repeated for the variables of the plurality of variables other than the first variable set to obtain a plurality of variable sets. Continuing the above example, variables 2, 4, 7, 8 and 10 were excluded from the variable set $S_0$ when calculating the correlation, and variables 6 and 9 were excluded from the first variable set when calculating the variance inflation factor value, so the variables other than the first variable set are variables 2, 4, 7, 8, 10, 9 and 6. These variables are reordered by the absolute value of the Gini index, for example into variable 2, variable 4, variable 6, variable 7, variable 8, variable 9 and variable 10. Steps S104 to S109 are then repeated, and it is assumed that a variable set $V_1$ is obtained that includes, for example, variable 2, variable 6 and variable 8. Then, for the remaining variables 4, 7, 9 and 10, steps S103 to S109 are repeated, finally yielding a plurality of variable sets $V_0, V_1, \dots, V_{K-1}$, so that the feature variable space is divided into K mutually disjoint subspaces.
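Putting the grouping steps together, the overall loop can be sketched as below. This is an illustrative sketch only: `group_variables` and `_vif` are hypothetical names, the predetermined parameter (for example the Gini index) is assumed to be precomputed and passed in as `scores`, the correlation threshold is assumed to apply to the absolute coefficient, and `np.corrcoef` is used for the correlation coefficient.

```python
import numpy as np

def _vif(target, others):
    """VIF of `target` regressed (with intercept, an assumption) on `others`."""
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    ss_res = np.sum((target - design @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

def group_variables(data, scores, corr_threshold, vif_threshold):
    """Split all variables into mutually disjoint variable sets.

    data: dict name -> 1-D array; scores: dict name -> predetermined parameter.
    """
    data = {k: np.asarray(v, dtype=float) for k, v in data.items()}
    # Sort by descending absolute score. The order is preserved when variables
    # are removed, so re-sorting in each round would give the same result.
    remaining = sorted(data, key=lambda n: abs(scores[n]), reverse=True)
    groups = []
    while remaining:
        first = remaining[0]
        # Correlation filter against the first variable.
        group = [first] + [
            n for n in remaining[1:]
            if abs(np.corrcoef(data[first], data[n])[0, 1]) <= corr_threshold
        ]
        # Backward VIF pass, last variable to first.
        for name in reversed(list(group)):
            others = [n for n in group if n != name]
            if others:
                regressors = np.column_stack([data[n] for n in others])
                if _vif(data[name], regressors) > vif_threshold:
                    group.remove(name)
        groups.append(group)
        remaining = [n for n in remaining if n not in group]
    return groups
```

Each round keeps at least one variable, so the loop terminates, and every variable ends up in exactly one set.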
Finally, in step S111, a model is generated based on each set of variables.
Here, in the variable grouping-based model generation method according to the embodiment of the present application, a model may be generated in different ways based on the obtained variable grouping depending on a specific model generation method, and hereinafter, one example of generating a classification model will be described.
First, for each variable set $V_i$, $i = 0, \dots, K-1$, a classifier $C_i$, $i = 0, \dots, K-1$, is developed. Next, the K classifiers are integrated into a final classifier by an integration method.
The process of one example of an integration method is as follows:
(a) For the training data set $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)})\}$, initialize the weight of each sample to 1/N, and denote the weight distribution as $W_0$.
(b) Using the training data set with weight distribution $W_m$, $m = 0, \dots, K-1$, obtain the classifier $C_m$, and calculate the classification error of the classifier on the N samples, where the classification error is the ratio of the number of misclassified samples of each class to the total number of samples of each class, as shown in the following formula:
Figure BDA0002100944920000091
where $G_m(x^{(i)})$ is the classification result of the classifier $C_m$ on the i-th sample and I is an indicator function, which takes the value 1 if the classifier result differs from the actual result and 0 otherwise.
For a classifier with a small classification error, a relatively high weight is assigned, and the weight of the classifier is calculated by using the following formula:
Figure BDA0002100944920000092
(c) According to the classification results of the classifier $C_m$, update the sample weight distribution:
Figure BDA0002100944920000093
where $Z_m$ is a normalization factor:
Figure BDA0002100944920000094
(d) Repeat step (b) and step (c) for the next classifier until all the classifiers are fitted, obtaining the final weight of each variable group, and integrate the classifiers as shown in the following formula:
Figure BDA0002100944920000095
where sign is the sign function, which takes the value 1 for a positive input and -1 for a negative input.
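The formula images for steps (b) to (d) are not reproduced in this text. The description (sample weights initialized to 1/N, an indicator-based error, a classifier weight that grows as the error shrinks, a normalization factor $Z_m$, and a sign-based final vote) matches the standard AdaBoost procedure. Assuming that procedure is what is meant, and writing $w_{m,i}$ for the weight of the i-th sample in $W_m$, $e_m$ for the classification error and $\alpha_m$ for the classifier weight (symbols introduced here for illustration), the formulas would read as follows. This is a reconstruction under that assumption, not a transcription of the original images.

```latex
% Classification error of classifier C_m under weight distribution W_m
e_m = \sum_{i=1}^{N} w_{m,i} \, I\!\left(G_m(x^{(i)}) \neq y^{(i)}\right)

% Weight of classifier C_m (larger when the error is smaller)
\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}

% Sample weight update
w_{m+1,i} = \frac{w_{m,i}}{Z_m} \exp\!\left(-\alpha_m \, y^{(i)} \, G_m(x^{(i)})\right)

% Normalization factor
Z_m = \sum_{i=1}^{N} w_{m,i} \exp\!\left(-\alpha_m \, y^{(i)} \, G_m(x^{(i)})\right)

% Final integrated classifier
G(x) = \operatorname{sign}\!\left(\sum_{m=0}^{K-1} \alpha_m \, G_m(x)\right)
```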
That is, in the variable grouping-based model generation method according to the embodiment of the present application, generating a model based on each of the variable sets includes: generating a plurality of classifiers for each set of variables; and integrating the plurality of classifiers into a final classifier in an integration method.
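The integration of one classifier per variable set can be sketched as below. Several things here are assumptions rather than content of the application: the weighting scheme is taken to be standard AdaBoost, a weighted least-squares linear classifier stands in for whatever base classifier is actually used, variable sets are given as lists of column indices, and all function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y, w):
    """Weighted least-squares linear classifier (a stand-in base learner)."""
    design = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef

def predict_linear(coef, X):
    design = np.column_stack([np.ones(len(X)), X])
    return np.where(design @ coef >= 0, 1, -1)

def fit_ensemble(X, y, groups):
    """Fit one classifier per variable set, in order, with AdaBoost-style weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # (a) initial sample weights 1/N
    members = []
    for cols in groups:
        coef = fit_linear(X[:, cols], y, w)         # (b) classifier C_m
        pred = predict_linear(coef, X[:, cols])
        err = np.sum(w * (pred != y))               # weighted classification error
        err = min(max(err, 1e-10), 1.0 - 1e-10)     # guard against log(0)
        alpha = 0.5 * np.log((1.0 - err) / err)     # classifier weight
        w = w * np.exp(-alpha * y * pred)           # (c) update sample weights
        w = w / w.sum()                             # normalization factor Z_m
        members.append((cols, coef, alpha))
    return members

def predict_ensemble(members, X):
    """(d) Sign of the weighted vote of all classifiers."""
    score = sum(alpha * predict_linear(coef, X[:, cols])
                for cols, coef, alpha in members)
    return np.where(score >= 0, 1, -1)
```

For example, `fit_ensemble(X, y, [[0, 2], [1]])` would fit a first classifier on columns 0 and 2 and a second on column 1, and `predict_ensemble` would combine them by weighted vote.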
In addition, based on the initially acquired data set size, if the data set is large, the data set may be divided into a training set and a test set, and the model generation method based on variable grouping as described above is performed based on the training set.
That is, in the variable grouping-based model generation method according to the embodiment of the present application, acquiring a data set including a plurality of variables includes: acquiring an initial data set; determining whether the size of the initial data set is greater than a third threshold; responsive to the size of the initial data set being greater than the third threshold, dividing the initial data set into a training data set and a testing data set; and setting the training data set to the data set comprising a plurality of variables.
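The size-dependent split can be sketched as follows. The function name `acquire_dataset`, the random shuffling, and the `test_fraction` and `seed` parameters are assumptions for illustration; the text also does not say what happens when the data set is not larger than the third threshold, so this sketch simply uses the whole data set for training in that case.

```python
import numpy as np

def acquire_dataset(X, y, third_threshold, test_fraction=0.3, seed=0):
    """Return ((X_train, y_train), test), where test is None for small data sets.

    Assumptions: size is measured in number of samples, and the split is a
    random shuffle with a hypothetical default test fraction of 30%.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    if len(y) <= third_threshold:
        return (X, y), None  # assumed fallback: train on everything

    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(round(len(y) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```

The training part returned here would then serve as the data set comprising a plurality of variables in step S101.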
Exemplary devices
Fig. 3 illustrates a block diagram of a variable grouping-based model generation apparatus according to an embodiment of the present application.
As shown in FIG. 3, the variable grouping-based model generation apparatus 200 according to an embodiment of the present application includes:
a data acquisition unit 210 for acquiring a data set containing a plurality of variables;
a parameter calculation unit 220 for calculating a predetermined parameter for each of the plurality of variables in the data set acquired by the data acquisition unit 210;
a variable grouping unit 230 for performing the following steps:
step 1: sorting the plurality of variables based on a magnitude of the predetermined parameter;
step 2: acquiring a first variable of the plurality of variables;
step 3: calculating a correlation coefficient between the first variable and each of the remaining variables;
step 4: deleting variables whose correlation coefficients are greater than a first threshold to obtain an initial set of variables relative to the first variable;
step 5: calculating a variance inflation factor value for a last variable in the initial set of variables;
step 6: deleting the last variable from the initial set of variables in response to the variance inflation factor value being greater than a second threshold, and retaining the last variable in the initial set of variables in response to the variance inflation factor value being less than the second threshold;
step 7: repeating steps 5 and 6 above for each variable in the initial set of variables to obtain a first set of variables relative to the first variable;
step 8: repeating steps 1 to 7 for variables of the plurality of variables other than the first set of variables to obtain a plurality of sets of variables; and
a model generation unit 240 for generating a model based on each variable set obtained by the variable grouping unit 230.
In one example, in the above variable grouping-based model generation apparatus 200, the predetermined parameter represents the discriminative power of a variable and includes one of the following: Gini index, KS (Kolmogorov-Smirnov) value, ROC value.
In one example, in the above variable grouping-based model generation apparatus 200, the sorting of the plurality of variables by the variable grouping unit 230 based on the magnitude of the predetermined parameter includes: sorting the variables in descending order of the absolute value of the Gini index of each variable.
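To make the predetermined parameter concrete, the following Python sketch computes a Gini index and a KS value for a single variable against a binary label and then ranks variables by absolute Gini. It assumes the common credit-scoring definitions (Gini = 2 * AUC - 1, and KS as the maximum gap between the two class-conditional cumulative distributions); the application does not spell out these formulas, and all function names are hypothetical.

```python
def _split(values, labels):
    """Separate a variable's values by binary label (1 = positive, 0 = negative)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg

def auc(values, labels):
    """Chance that a random positive has a larger value than a random negative (ties count 1/2)."""
    pos, neg = _split(values, labels)
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def gini(values, labels):
    """Signed Gini index; negative when larger values indicate the negative class."""
    return 2.0 * auc(values, labels) - 1.0

def ks(values, labels):
    """Largest gap between the cumulative distributions of the two classes."""
    pos, neg = _split(values, labels)
    best = 0.0
    for t in sorted(set(values)):
        cdf_pos = sum(p <= t for p in pos) / len(pos)
        cdf_neg = sum(q <= t for q in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

def rank_by_abs_gini(data, labels):
    """Variable names sorted in descending order of |Gini|."""
    return sorted(data, key=lambda name: abs(gini(data[name], labels)), reverse=True)
```

Sorting on the absolute value means that a variable with a strongly negative Gini index, which separates the classes just as well in the opposite direction, is ranked alongside strongly positive ones.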
In an example, in the above variable grouping-based model generation apparatus 200, the model generation unit 240 is configured to: generate a plurality of classifiers for each set of variables; and integrate the plurality of classifiers into a final classifier using an ensemble method.
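The application does not name a particular ensemble method. As one possible reading, the sketch below combines per-group classifiers by weighted averaging of their predicted probabilities (soft voting). The `make_ensemble` helper and the toy classifiers in the usage note are hypothetical.

```python
def make_ensemble(classifiers, weights=None):
    """Combine classifiers (each mapping a record to a probability) by weighted averaging."""
    if weights is None:
        weights = [1.0] * len(classifiers)  # assumption: equal weights by default
    if len(weights) != len(classifiers):
        raise ValueError("one weight per classifier is required")
    total_weight = sum(weights)

    def final_classifier(record):
        score = sum(w * clf(record) for w, clf in zip(weights, classifiers))
        return score / total_weight

    return final_classifier
```

For example, with `clf_a = lambda r: 0.8 if r["income"] < 3000 else 0.2` trained on one variable set and `clf_b = lambda r: 0.6 if r["late_payments"] > 2 else 0.1` trained on another, `make_ensemble([clf_a, clf_b])` returns a final classifier that averages the two scores for each record.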
In an example, in the above variable grouping-based model generation apparatus 200, the data acquisition unit 210 is configured to: acquire an initial data set; determine whether the size of the initial data set is greater than a third threshold; in response to the size of the initial data set being greater than the third threshold, divide the initial data set into a training data set and a testing data set; and set the training data set as the data set containing a plurality of variables.
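The size check and split performed by the data acquisition unit 210 might look like the following Python sketch. The application does not say what happens when the initial data set is not larger than the third threshold; here it is assumed, purely for illustration, that the whole set is then used for training. The function name, the 30% test fraction, and the fixed seed are likewise hypothetical.

```python
import random

def acquire_training_data(rows, size_threshold, test_fraction=0.3, seed=0):
    """Return (training_rows, testing_rows) according to the size threshold."""
    if len(rows) <= size_threshold:
        # Assumption: a small data set is not split and is used entirely for training.
        return list(rows), []
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

The training rows returned here play the role of the data set containing a plurality of variables that is passed on to the parameter calculation unit 220.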
Here, it can be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above variable grouping-based model generation apparatus 200 have been described in detail in the variable grouping-based model generation method described above with reference to fig. 1 and 2, and thus a repetitive description thereof is omitted.
As described above, the variable grouping-based model generation apparatus 200 according to the embodiment of the present application may be implemented in various terminal devices, such as a server for developing a credit scoring model. In one example, the variable grouping-based model generation apparatus 200 according to the embodiment of the present application may be integrated into the terminal device as a software module and/or a hardware module. For example, the variable grouping-based model generation apparatus 200 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the variable grouping-based model generation apparatus 200 may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the variable grouping-based model generating apparatus 200 and the terminal device may be separate devices, and the variable grouping-based model generating apparatus 200 may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to an agreed data format.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 4.
FIG. 4 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 4, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 11 to implement the variable grouping-based model generation methods of the various embodiments of the present application described above and/or other desired functions. Various content such as data sets, feature variable data, classifier data, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may be, for example, a keyboard, a mouse, or the like.
The output device 14 can output various information to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 4, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the variable grouping-based model generation method according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may include program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the variable grouping-based model generation method according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. It is noted, however, that the advantages, effects, and the like mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are for the purpose of illustration and description only and are not intended to be exhaustive or to limit the application to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (11)

1. A method of model generation based on variable grouping, comprising:
step 1: acquiring a data set comprising a plurality of variables;
step 2: calculating a predetermined parameter for each of the plurality of variables;
step 3: sorting the plurality of variables based on a magnitude of the predetermined parameter;
step 4: acquiring a first variable of the plurality of variables;
step 5: calculating a correlation coefficient between the first variable and each of the remaining variables;
step 6: deleting variables whose correlation coefficients are greater than a first threshold to obtain an initial set of variables relative to the first variable;
step 7: calculating a variance inflation factor (VIF) value for a last variable in the initial set of variables;
step 8: deleting the last variable from the initial set of variables in response to the VIF value being greater than a second threshold, and retaining the last variable in the initial set of variables in response to the VIF value being less than the second threshold;
step 9: repeating steps 7 and 8 above for each variable in the initial set of variables to obtain a first set of variables relative to the first variable;
step 10: repeating steps 3 through 9 for variables of the plurality of variables other than the first set of variables to obtain a plurality of sets of variables; and
step 11: generating a model based on each set of variables.
2. The variable grouping-based model generation method according to claim 1, wherein the predetermined parameter is used to represent a discriminative power of the variable and includes one of: Gini index, KS value, ROC value.
3. The variable grouping-based model generation method according to claim 2, wherein sorting the plurality of variables based on the magnitude of the predetermined parameter comprises:
sorting the variables in descending order of the absolute value of the Gini index of each variable.
4. The variable grouping-based model generation method of claim 1, wherein generating a model based on each variable set comprises:
generating a plurality of classifiers for each set of variables; and
integrating the plurality of classifiers into a final classifier using an ensemble method.
5. The variable grouping-based model generation method according to claim 1, wherein acquiring a data set containing a plurality of variables comprises:
acquiring an initial data set;
determining whether the size of the initial data set is greater than a third threshold;
responsive to the size of the initial data set being greater than the third threshold, dividing the initial data set into a training data set and a testing data set; and
setting the training data set as the data set comprising a plurality of variables.
6. A variable grouping-based model generation apparatus comprising:
a data acquisition unit for acquiring a data set containing a plurality of variables;
a parameter calculation unit for calculating a predetermined parameter for each of the plurality of variables;
a variable grouping unit for performing the steps of:
step 1: sorting the plurality of variables based on a magnitude of the predetermined parameter;
step 2: acquiring a first variable of the plurality of variables;
step 3: calculating a correlation coefficient between the first variable and each of the remaining variables;
step 4: deleting variables whose correlation coefficients are greater than a first threshold to obtain an initial set of variables relative to the first variable;
step 5: calculating a variance inflation factor (VIF) value for a last variable in the initial set of variables;
step 6: deleting the last variable from the initial set of variables in response to the VIF value being greater than a second threshold, and retaining the last variable in the initial set of variables in response to the VIF value being less than the second threshold;
step 7: repeating steps 5 and 6 above for each variable in the initial set of variables to obtain a first set of variables relative to the first variable; and
step 8: repeating steps 1 to 7 for variables of the plurality of variables other than the first set of variables to obtain a plurality of sets of variables; and
a model generation unit for generating a model based on each set of variables.
7. The variable grouping-based model generation apparatus according to claim 6, wherein the predetermined parameter is used to represent a discriminative power of the variable and includes one of: Gini index, KS value, ROC value.
8. The variable grouping-based model generation apparatus according to claim 7, wherein the sorting of the plurality of variables by the variable grouping unit based on the magnitude of the predetermined parameter includes:
sorting the variables in descending order of the absolute value of the Gini index of each variable.
9. The variable grouping-based model generation apparatus according to claim 6, wherein the model generation unit is configured to:
generating a plurality of classifiers for each set of variables; and
integrating the plurality of classifiers into a final classifier using an ensemble method.
10. The variable grouping-based model generation apparatus according to claim 6, wherein the data acquisition unit is configured to:
acquiring an initial data set;
determining whether the size of the initial data set is greater than a third threshold;
responsive to the size of the initial data set being greater than the third threshold, dividing the initial data set into a training data set and a testing data set; and
setting the training data set as the data set comprising a plurality of variables.
11. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the variable grouping based model generation method of any of claims 1-5.
CN201910535051.7A 2019-06-20 2019-06-20 Model generation method and model generation device based on variable grouping and electronic equipment Pending CN112116443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910535051.7A CN112116443A (en) 2019-06-20 2019-06-20 Model generation method and model generation device based on variable grouping and electronic equipment

Publications (1)

Publication Number Publication Date
CN112116443A true CN112116443A (en) 2020-12-22

Family

ID=73795229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910535051.7A Pending CN112116443A (en) 2019-06-20 2019-06-20 Model generation method and model generation device based on variable grouping and electronic equipment

Country Status (1)

Country Link
CN (1) CN112116443A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060161403A1 (en) * 2002-12-10 2006-07-20 Jiang Eric P Method and system for analyzing data and creating predictive models
CN102930158A (en) * 2012-10-31 2013-02-13 哈尔滨工业大学 Variable selection method based on partial least square
US20150088783A1 (en) * 2009-02-11 2015-03-26 Johnathan Mun System and method for modeling and quantifying regulatory capital, key risk indicators, probability of default, exposure at default, loss given default, liquidity ratios, and value at risk, within the areas of asset liability management, credit risk, market risk, operational risk, and liquidity risk for banks
US20180075175A1 (en) * 2015-04-09 2018-03-15 Equifax, Inc. Automated model development process
CN108898479A (en) * 2018-06-28 2018-11-27 中国农业银行股份有限公司 The construction method and device of Credit Evaluation Model
CN109389274A (en) * 2018-05-22 2019-02-26 深圳壹账通智能科技有限公司 Scorecard configuration method, device, equipment and readable storage medium storing program for executing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination