WO2020111423A1

WO2020111423A1 - Method for automatically generating variables for data modeling, and device thereof

Info

Publication number: WO2020111423A1
Application number: PCT/KR2019/007409
Authority: WO
Inventors: 김지훈; 최유리; 유두열
Original assignee: 주식회사 솔리드웨어
Priority date: 2018-11-29
Filing date: 2019-06-19
Publication date: 2020-06-04
Also published as: KR101976689B1

Abstract

Disclosed are a method for automatically generating variables for data modeling, and a device thereof. The device for automatically generating variables analyzes the correlation between a target variable to be predicted and each variable of statistical information, selects a plurality of variables among the variables of the statistical information as candidate variables in descending order of correlation with the target variable, generates a new variable by arbitrarily extracting a certain number of variables from among the candidate variables and combining the arbitrarily extracted variables, and stores the new variable and a value for the new variable in the statistical information.

Description

Variable automatic generation method and device for data modeling

The present invention relates to data modeling, and more particularly, to a method and apparatus for automatically generating various variables used in data modeling.

In data modeling, feature engineering, the process of creating appropriate variables, is very important. Variables for a predictive model are mostly generated through the heuristic judgment of experts with knowledge of the data domain. For example, suppose a credit rating model is built from statistical data such as gender, age, income, number of existing loans, and amount of existing loans. A credit evaluation model can be built using each variable in the statistical data as is, but a domain expert can make the model more accurate by creating a new variable, for instance by dividing the existing loan amount by income. However, variables created from the subjective experience of domain experts are inherently limited, and for statistical data with hundreds or thousands of variables it is practically impossible for experts to examine the relationships among the variables and create new variables.

Variables can also be generated automatically using PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis). However, these automatic variable generation methods aim to improve computational efficiency, and thus have limitations in creating variables that are practically useful for predictive models. In particular, when dimensionality reduction is used, the explanatory power of the resulting variable is lost. For example, in the credit evaluation model described above, if PCA is applied to age and loan amount to create a new variable, the result is a new variable whose basis is the direction of largest variance. The new variable represents the direction in which the variance is large in the joint distribution of age and loan amount, but it is difficult for the user to grasp the meaning of the variable intuitively.

The technical problem to be achieved by the embodiments of the present invention is to provide a method and apparatus for automatically generating a variable that is substantially helpful in data modeling.

In order to achieve the above technical problem, an example of an automatic variable generation method for data modeling according to an embodiment of the present invention includes: analyzing a correlation between a target variable to be predicted and each variable of statistical information; selecting a plurality of variables as candidate variables, in descending order of correlation with the target variable, from among the variables of the statistical information; randomly extracting a predetermined number of variables from the candidate variables; generating a new variable by combining the randomly extracted variables; and storing the new variable in the statistical information.

In order to achieve the above technical problem, an example of a variable automatic generation device for data modeling according to an embodiment of the present invention includes: a correlation analysis unit that analyzes a correlation between a target variable to be predicted and each variable of statistical information; a candidate variable selector that selects a plurality of variables as candidate variables, in descending order of correlation with the target variable, from among the variables of the statistical information; a variable generator that generates a new variable by combining variables randomly extracted from the candidate variables; and a data storage unit that stores the new variable in the statistical information.

According to an embodiment of the present invention, new variables can be generated automatically from the variables existing in the original data without user intervention. Even if the raw data contains hundreds or thousands of variables, or more, it is possible to generate variables that are practically helpful to the predictive model. Because variables are drawn stochastically from the selected candidate variables, diverse new variables can be generated. In addition, because only the selected candidate variables are used rather than all variables, the number of unnecessary combinations considered when generating variables is effectively reduced.

FIG. 1 is a view showing an example of a variable automatic generation device and a statistical information database according to an embodiment of the present invention;

FIG. 2 is a diagram showing an example of statistical information according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating an example of an automatic variable generation method according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an example of a method for determining a correlation between a target variable and each variable of statistical information according to an embodiment of the present invention;

FIG. 5 is a diagram showing an example of a method of selecting a variable to be used for generating a new variable from candidate variables according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating an example of a method of generating a new variable through combining between variables extracted from candidate variables according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating an example of a method for determining the weight and bias of a polynomial defining a newly generated variable according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating an example of a method of combining variables in a rule manner according to an embodiment of the present invention;

FIG. 9 is a diagram showing an example of statistical information in which new variables are generated according to an embodiment of the present invention; and

FIG. 10 is a view showing an example of the configuration of a variable automatic generation device according to an embodiment of the present invention.

Hereinafter, a variable automatic generation method and apparatus for data modeling according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of a variable automatic generation device and a statistical information database according to an embodiment of the present invention.

Referring to FIG. 1, the variable automatic generation device 100 automatically generates at least one new variable for data modeling based on a variable of statistical information stored in the statistical information database 110.

The variable automatic generation device 100 may be implemented as a computing device including a memory, a processor, and an input/output device. For example, the memory loads a software program in which a variable auto-generation algorithm is implemented, and the processor may generate a new variable according to the present embodiment by executing a software program loaded in the memory. An example of an automatic variable generation method is illustrated in FIG. 3.

The statistical information database 110 stores various statistical information. Here, the statistical information means a data set including information about at least one variable. For example, statistical information on bank customers may include information such as gender, age, income, number of existing loans, and amount of existing loans as variables. An example of statistical information is shown in FIG. 2.

FIG. 2 is a view showing an example of statistical information according to an embodiment of the present invention.

Referring to FIG. 2, the statistical information 200 includes at least one sample 230 containing information on at least one variable 210. The statistical information 200 may also include a target variable 220 to be predicted through data modeling (i.e., a predictive model). For example, in the statistical information on bank customers discussed with reference to FIG. 1, the target variable 220 may be customer credit.

FIG. 3 is a flowchart illustrating an example of an automatic variable generation method according to an embodiment of the present invention.

Referring to FIG. 3, the variable automatic generation device 100 determines the correlation between a target variable and each variable of the statistical information (S300). For example, in the example of FIG. 2, the variable automatic generation device 100 determines the correlation between the target variable Y 220 and each of the variables X1, ..., Xm 210 of the statistical information. An example of determining the correlation between variables is shown in FIG. 4.

The variable automatic generation device 100 selects a predetermined number of variables having a high correlation with the target variable as candidate variables (S310). An example of selecting five candidate variables based on a correlation between a target variable and each variable of statistical information is illustrated in FIG. 5. The number of candidate variables may be variously set according to embodiments.

Once the candidate variables are determined, the variable automatic generation device 100 randomly extracts a predetermined number of variables from the candidate variables (S320). FIG. 5 shows an example of randomly extracting three of five candidate variables. The number of variables extracted from the candidate variables may be set differently depending on the embodiment.

In one embodiment, the variable automatic generation device 100 may extract the predetermined number of variables by applying the same extraction probability to every candidate variable. In another embodiment, the variable automatic generation device 100 may assign different extraction probabilities to the candidate variables according to the magnitude of their correlation, so that variables highly correlated with the target variable are more likely to be extracted. For example, the higher the correlation, the higher the extraction probability. An example of extraction with a different extraction probability for each candidate variable is described with reference to FIG. 5.

The variable automatic generation device 100 combines the predetermined number of variables extracted from the candidate variables to generate new variables (S330). The variables may be combined in various ways, such as a linear method, a multiplication method, a division method, or a rule method. Examples of the various combining methods are shown in FIGS. 6 and 8.

The variable automatic generation device 100 stores the new variables in the statistical information. That is, the variable automatic generation device 100 determines the value of each new variable for each sample and reflects it in the statistical information. For example, if new variables G1, G2, ..., G5 are generated, the variable automatic generation device 100 determines the value of each of these variables for every sample in the statistical information of the statistical information database 110 and stores it, as shown in FIG. 9.

The variables newly generated according to the present embodiment are used in data modeling (i.e., a prediction model) for predicting the target variable. For example, various conventional modeling methods, including machine learning, can produce more accurate predictive models by using the newly generated variables.

FIG. 4 is a diagram illustrating an example of a method for determining a correlation between a target variable and each variable of statistical information according to an embodiment of the present invention.

Referring to FIGS. 2 and 4 together, the variable automatic generation device 100 determines the correlation between the target variable Y 220 and each of the variables X1, ..., Xm 210 of the statistical information. For example, the variable automatic generation device 100 may determine the relative importance of each variable 210 with respect to the target variable 220 using an F-test. Various methods other than the F-test may also be applied to this embodiment.

The variable automatic generation device 100 may select a plurality of variables as candidate variables, in descending order of correlation with the target variable 220, from among the variables 210 of the statistical information. For example, if the number of candidate variables is set to 5 and the variables in descending order of correlation with the target variable are X3, X4, X5, X1, X2, the variable automatic generation device 100 selects X3, X4, X5, X1, and X2 as the candidate variables, as shown in FIG. 5.
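The scoring and selection steps described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a univariate regression F-statistic (derived from the squared Pearson correlation) as the F-test, and the function names `f_score` and `select_candidates` and the toy data are hypothetical.

```python
import math

def f_score(x, y):
    """Univariate regression F-statistic of x against y (1 and n-2 degrees of freedom)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information about y
    r2 = min(sxy * sxy / (sxx * syy), 1.0)  # squared correlation, clamped against rounding
    if r2 == 1.0:
        return math.inf
    return r2 / (1.0 - r2) * (n - 2)

def select_candidates(columns, target, k):
    """Return the names of the k variables with the largest F-score, best first."""
    scores = {name: f_score(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: hypothetical values, not taken from the patent figures.
columns = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "X2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "X3": [5.0, 3.0, 6.0, 2.0, 1.0, 4.0],
}
target = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
candidates = select_candidates(columns, target, k=2)
```

Any other relevance measure could be substituted for `f_score` without changing the selection step, in line with the statement that methods other than the F-test may be used.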

FIG. 5 is a diagram illustrating an example of a method of selecting a variable to be used for generating a new variable from candidate variables according to an embodiment of the present invention.

Referring to FIG. 5, the candidate variables 500 are X3, X4, X5, X1, and X2. The variable automatic generation device 100 randomly extracts a number of variables from the candidate variables 500 (530). The number of randomly extracted variables may be set differently depending on the embodiment; in this embodiment it is three.

When the variable automatic generation device 100 randomly extracts three variables from the candidate variables 500, the extraction probability of each variable may differ. To this end, the variable automatic generation device 100 first arranges the candidate variables 500 in descending order of correlation and assigns the importance values 5, 4, 3, 2, 1 to them in that order (510). Here, the importance 510 is a value representing the relative importance of each candidate variable 500 and may be expressed in various forms. For example, depending on the embodiment, the values 10, 8, 6, 4, 2 or 100, 50, 25, 12, 5 may be assigned to the five candidate variables 500 according to the magnitude of their correlation. The magnitudes of the importance values may be modified in various ways.

The variable automatic generation device 100 may assign a different extraction probability to each candidate variable 500 according to the importance 510 assigned to it. When the importance values 5, 4, 3, 2, 1 are assigned as in the present embodiment, the importance 510 may be normalized (520) so that the values sum to 1 and can be treated as probabilities. That is, the importance 510 of each candidate variable 500 is normalized (520) by dividing it by the sum of the importance values (15 = 5 + 4 + 3 + 2 + 1). For example, the importance of the candidate variable X3, which has the highest correlation, is normalized (520) to 5/15.

The variable automatic generation device 100 randomly extracts (530) a predetermined number of variables (three in this embodiment) using the normalized values 520 as extraction probabilities. In this embodiment, candidate variable X3 has an extraction probability of (5/15 * 100)%, and candidate variable X1 has an extraction probability of (2/15 * 100)%. The likelihood that each candidate variable is selected depends on its extraction probability. This is equivalent to randomly drawing marbles from a pocket containing 15 marbles in total: five representing candidate variable X3, four representing X4, three representing X5, two representing X1, and one representing X2.
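The normalization and weighted extraction described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that a variable is not drawn twice (each draw removes the chosen variable from the pool), which the description does not state explicitly.

```python
import random

def extraction_probabilities(candidates):
    """Assign importance k, k-1, ..., 1 by rank and normalize so the values sum to 1."""
    k = len(candidates)
    importance = list(range(k, 0, -1))
    total = sum(importance)
    return {name: imp / total for name, imp in zip(candidates, importance)}

def extract_variables(candidates, count, rng):
    """Draw `count` distinct variables, each draw weighted by extraction probability."""
    probs = extraction_probabilities(candidates)
    pool = list(candidates)
    picked = []
    for _ in range(count):
        weights = [probs[name] for name in pool]
        choice = rng.choices(pool, weights=weights, k=1)[0]
        picked.append(choice)
        pool.remove(choice)  # assumption: a variable is not drawn twice
    return picked

candidates = ["X3", "X4", "X5", "X1", "X2"]  # ordered by decreasing correlation
probs = extraction_probabilities(candidates)
picked = extract_variables(candidates, 3, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the draw reproducible; a different importance scale (for example 10, 8, 6, 4, 2) would only change how `importance` is built.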

FIG. 6 is a diagram illustrating an example of a method of generating a new variable through combining between variables extracted from candidate variables according to an embodiment of the present invention.

Referring to FIG. 6, the variable combining methods include a linear method 600 that linearly combines all or some of the extracted variables, methods 610 and 620 that combine variables by applying arithmetic operations such as division and multiplication, and a rule method 630 that uses a decision tree.

The following description assumes that the three variables X3, X4, and X2 have been extracted from the candidate variables 500 in the example of FIG. 5.

The linear method 600 linearly combines at least two of the extracted variables X3, X4, and X2 to generate a new variable G1. This embodiment shows an example in which all three variables (X3, X4, X2) are linearly combined, but the variable automatic generation device 100 may also generate a new variable consisting of a linear combination of X3 and X4, a new variable consisting of a linear combination of X4 and X2, a new variable consisting of a linear combination of X3 and X2, and so on. As the number of variables extracted from the candidate variables grows, so does the number of ways of combining them. In this case, the variable automatic generation device 100 may generate no more than a predetermined number of new variables.

The division method 610 creates a new variable G2 through division between two or more variables. Variables can be divided in many ways. This embodiment shows an example of generating a polynomial composed of the three terms X3/X4, X3/X2, and X2/X4 as the new variable G2, but many other divisions are possible, such as (X2*X3)/X4 or X2/(X3*X4). A term may consist of one division or several, and the terms can be combined in many ways. The variable automatic generation device 100 may generate every possible combination of divisions as new variables, or it may stop generating new variables once a certain number of them (for example, 10 or 100) have been generated.

The multiplication method 620 creates a new variable G3 through multiplication between two or more variables. As with the division method described above, the multiplication method allows various combinations. The variable automatic generation device 100 may generate every possible combination of multiplications as new variables, or it may stop generating new variables once a certain number of them (for example, 10 or 100) have been generated.
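As a rough illustration of how division and multiplication terms might be enumerated and capped, the following sketch lists simple two-variable ratios and products of two or more variables, then stops after a fixed number of term combinations. The helper names and the choice to enumerate only simple ratios are assumptions made for this example; compound terms such as (X2*X3)/X4 are not covered.

```python
from itertools import combinations, islice, permutations

def ratio_terms(variables):
    """All ordered pairs (numerator, denominator), e.g. X3/X4 and X4/X3."""
    return [f"{a}/{b}" for a, b in permutations(variables, 2)]

def product_terms(variables):
    """All unordered products of two or more variables, e.g. X3*X4."""
    terms = []
    for size in range(2, len(variables) + 1):
        terms += ["*".join(group) for group in combinations(variables, size)]
    return terms

def term_combinations(terms, term_count, limit):
    """Choose `term_count` terms at a time, keeping at most `limit` combinations."""
    return list(islice(combinations(terms, term_count), limit))

extracted = ["X3", "X4", "X2"]
ratios = ratio_terms(extracted)
products = product_terms(extracted)
# Each tuple is the term set of one candidate polynomial; generation stops at 10.
g2_candidates = term_combinations(ratios, term_count=3, limit=10)
```

With three extracted variables there are 6 simple ratios and therefore 20 possible three-term polynomials, so the limit of 10 cuts the enumeration in half.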

New variables created by combining variables are expressed as polynomials. The weight of each term of the polynomial (W3, W3', W3'', W4, W4', W4'', W2, W2', W2'') and the bias can be determined through regression analysis between the new variable and the target variable, an example of which is shown in FIG. 7.

The rule method 630 uses a decision tree and is described with reference to FIG. 8.

The variable combining methods of the present embodiment are examples provided to aid understanding; the present invention is not limited to them, and various other combining methods may be applied depending on the embodiment. For example, the variable automatic generation device 100 may generate a new variable by combining the linear method, the division method, and the multiplication method.

FIG. 7 is a diagram illustrating an example of a method for determining the weight and bias of a polynomial defining a newly generated variable according to an embodiment of the present invention.

Referring to FIGS. 6 and 7, the polynomials created by combining variables randomly extracted from the candidate variables become the new variables G1, G2, and G3. This embodiment shows an example of obtaining the weights (W3, W4, W2) and the bias of the terms of the polynomial constituting the new variable G1 combined by the linear method 600.

The variable automatic generation device 100 may determine the weights and the bias by performing regression analysis on a model consisting of the polynomial 700 that constitutes the new variable and the target variable 220. For example, the variable automatic generation device 100 may use ridge regression as the regression analysis method. In addition, various other regression analysis methods for analyzing a model composed of the target variable 220 and the polynomial 700 may be applied to this embodiment.
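A minimal sketch of determining the weights and bias of the linear polynomial G1 by ridge regression is given below. It assumes the usual closed-form ridge solution with an unpenalized bias, solved by Gaussian elimination; the function names, the penalty value `alpha`, and the toy data are hypothetical and are not taken from the patent.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def ridge_fit(rows, y, alpha=1.0):
    """Fit y ~ bias + sum_j w[j] * rows[i][j], penalizing the weights but not the bias."""
    n, p = len(rows), len(rows[0])
    # Center the features and the target so the bias can be recovered afterwards.
    x_mean = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(p)] for r in rows]
    yc = [v - y_mean for v in y]
    # Normal equations: (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(xc[i][j] * xc[i][k] for i in range(n)) + (alpha if j == k else 0.0)
             for k in range(p)] for j in range(p)]
    rhs = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    w = solve_linear_system(gram, rhs)
    bias = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, bias

# Toy samples with columns (X3, X4, X2); the target is an exact linear function of them.
rows = [[1, 2, 0], [2, 1, 1], [3, 5, 2], [4, 3, 5], [5, 8, 3], [6, 1, 7]]
y = [2.0 * a - 1.0 * b + 0.5 * c + 3.0 for a, b, c in rows]
weights, bias = ridge_fit(rows, y, alpha=1e-9)
# Value of the new variable G1 for every sample.
g1 = [bias + sum(w * v for w, v in zip(weights, row)) for row in rows]
```

With a near-zero penalty the fit recovers the generating coefficients; a larger `alpha` shrinks the weights toward zero, which is the usual reason for preferring ridge regression when the combined terms are correlated with one another.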

FIG. 8 is a diagram illustrating an example of a method of combining variables in a rule manner according to an embodiment of the present invention.

Referring to FIG. 8, the variable automatic generation device 100 may generate a decision tree whose nodes are conditions on the variables extracted from the candidate variables (X3, X4, and X2 in FIG. 5). Various conventional methods for generating a decision tree can be applied to this embodiment. The variable automatic generation device 100 may generate a plurality of decision trees that differ in the position of each variable (root node 800, child nodes 810 and 820, etc.) and in the conditions. However, in view of the computational cost, the variable automatic generation device 100 may generate only a certain number of decision trees.

When the decision tree is generated, the variable automatic generation device 100 classifies each sample of the statistical information shown in FIG. 2 according to the decision tree to identify the samples corresponding to each of the leaves 830, 840, 850, and 860. Then, the variable automatic generation device 100 obtains the average of the target variable over the samples belonging to each leaf, and generates the node conditions of the paths 870 and 880 toward the leaves with the highest and lowest averages as the new variables G4 and G5.

In this embodiment, if the average of the target variable over the samples of the second leaf 840 is the highest and the average over the samples of the fourth leaf 860 is the lowest, the conditions of the path 870 toward the second leaf 840 (X3>0.2 & X4>=1.5) and the conditions of the path 880 toward the fourth leaf 860 (X3<=0.2 & X2<=0.1) are generated as the new variables G4 and G5, respectively.

The variable automatic generation device 100 reflects the new variables G4 and G5 determined by the rule method 630 in the statistical information as shown in FIG. 9, and may enter into the statistical information an indicator (for example, a flag) of whether each sample satisfies the condition of the corresponding variable. For example, the variable automatic generation device 100 may assign the value '1' to a sample that satisfies the condition of the new variable G4 or G5, and the value '0' to a sample that does not.
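The rule method can be sketched as below for a single, already-built decision tree. The thresholds follow the example in the text, read as conditions on the extracted variables X3, X4, and X2. The paths to the two leaves not quoted in the text, the leaf labels, the helper names, and the sample values are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical tree: each leaf is described by the node conditions on the
# path from the root. Only the paths to leaves 840 and 860 are quoted in the
# text; the other two are assumed to be their complements.
LEAF_PATHS = {
    "leaf_830": [("X3", ">", 0.2), ("X4", "<", 1.5)],
    "leaf_840": [("X3", ">", 0.2), ("X4", ">=", 1.5)],
    "leaf_850": [("X3", "<=", 0.2), ("X2", ">", 0.1)],
    "leaf_860": [("X3", "<=", 0.2), ("X2", "<=", 0.1)],
}
OPS = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}

def satisfies(sample, path):
    """True if the sample meets every node condition on the path."""
    return all(OPS[op](sample[var], threshold) for var, op, threshold in path)

def rule_variables(samples, target_key="Y"):
    """Return the paths to the leaves with the highest and lowest mean target."""
    leaf_means = {}
    for leaf, path in LEAF_PATHS.items():
        members = [s[target_key] for s in samples if satisfies(s, path)]
        if members:  # skip leaves that received no sample
            leaf_means[leaf] = mean(members)
    best = max(leaf_means, key=leaf_means.get)
    worst = min(leaf_means, key=leaf_means.get)
    return LEAF_PATHS[best], LEAF_PATHS[worst]

samples = [
    {"X3": 0.5, "X4": 2.0, "X2": 0.3, "Y": 0.9},
    {"X3": 0.4, "X4": 1.0, "X2": 0.2, "Y": 0.5},
    {"X3": 0.1, "X4": 1.7, "X2": 0.4, "Y": 0.4},
    {"X3": 0.1, "X4": 0.9, "X2": 0.05, "Y": 0.1},
    {"X3": 0.3, "X4": 1.8, "X2": 0.0, "Y": 0.8},
]
g4_path, g5_path = rule_variables(samples)
for s in samples:
    s["G4"] = 1 if satisfies(s, g4_path) else 0  # flag: sample is on the best path
    s["G5"] = 1 if satisfies(s, g5_path) else 0  # flag: sample is on the worst path
```

Building the tree itself is left out; any conventional decision-tree learner could supply the paths, and the flagging step would stay the same.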

FIG. 9 is a diagram illustrating an example of statistical information in which a new variable is generated according to an embodiment of the present invention.

Referring to FIG. 9, the variable automatic generation device 100 reflects the newly generated variables (G1, G2, G3, G4, G5) 900 in the statistical information, and determines and stores the value 910 of each new variable for each sample.
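Storing the new variables amounts to evaluating each one for every sample and adding the result as a new column. A minimal sketch follows, assuming the samples are held as dictionaries; the function name and the weights in the G1 definition are hypothetical placeholders for values that would come from the regression step.

```python
def add_new_variables(samples, definitions):
    """Evaluate each new variable for every sample and store it under its name."""
    for sample in samples:
        for name, formula in definitions.items():
            sample[name] = formula(sample)
    return samples

# Hypothetical definitions; real weights and bias would come from regression.
definitions = {
    "G1": lambda s: 0.5 * s["X3"] + 0.2 * s["X4"] - 0.1 * s["X2"] + 1.0,
    "G3": lambda s: s["X3"] * s["X4"],
}
samples = [
    {"X3": 2.0, "X4": 4.0, "X2": 1.0},
    {"X3": 0.0, "X4": 1.0, "X2": 3.0},
]
add_new_variables(samples, definitions)
```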

Referring to FIG. 10, the variable automatic generation device 100 includes a correlation analysis unit 1000, a candidate variable selection unit 1010, a variable generation unit 1020, and a data storage unit 1030.

The correlation analysis unit 1000 analyzes the correlation between the target variable to be predicted and each variable of statistical information. F-test can be used as a correlation analysis method.

The candidate variable selection unit 1010 selects a plurality of variables as candidate variables, in descending order of correlation with the target variable, from among the variables of the statistical information. The candidate variable selection unit 1010 may assign different extraction probabilities to the candidate variables according to the order of correlation so that they are randomly extracted with those probabilities. For example, a higher extraction probability is given to a more highly correlated candidate variable so that it is more likely to be extracted.

The variable generation unit 1020 randomly extracts a predetermined number of variables from the candidate variables and creates new variables by combining the randomly extracted variables. The variable generation unit 1020 may generate, as a new variable, a polynomial including terms consisting of addition, multiplication, or division between the randomly extracted variables, or may generate new variables using a decision tree as shown in FIG. 8. For a new variable composed of a polynomial, the variable generation unit 1020 may determine the bias of the polynomial and the weight of each term through regression analysis between the new variable and the target variable.

The data storage unit 1030 calculates the value of the new variable and stores it in the statistical information. When the new variable takes the form of a conditional statement generated using a decision tree, the data storage unit 1030 may indicate in the statistical information, using a predefined number or character, whether each sample satisfies the node conditions included in the new variable.

The present invention can also be embodied as computer readable codes on a computer readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. In addition, the computer-readable recording medium can be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

The present invention has so far been described with a focus on its preferred embodiments. Those skilled in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered illustrative rather than limiting. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent range should be interpreted as being included in the present invention.

Claims

1. A method for automatically generating a variable for data modeling, the method comprising:

analyzing a correlation between a target variable to be predicted and each variable of statistical information;

selecting a plurality of variables as candidate variables, in descending order of correlation with the target variable, from among the variables of the statistical information;

randomly extracting a predetermined number of variables from the candidate variables;

generating a new variable by combining the randomly extracted variables; and

storing the new variable in the statistical information.

2. The method of claim 1, wherein the randomly extracting comprises:

assigning different extraction probabilities to the candidate variables according to the order of correlation; and

randomly extracting variables according to the extraction probabilities.

3. The method of claim 1, wherein the generating of the new variable comprises:

determining a bias of a polynomial and a weight of each term of the polynomial through regression analysis between the target variable and the polynomial, the polynomial being generated by combining the randomly extracted variables; and

generating the polynomial reflecting the bias and the weights as the new variable.

4. The method of claim 3, wherein the polynomial comprises a term consisting of addition, multiplication, or division between the randomly extracted variables.

5. The method of claim 1, wherein the generating of the new variable comprises:

generating at least one decision tree having, as nodes, conditions on each of the extracted variables;

calculating an average of the target variable for each leaf of the decision tree; and

generating, as the new variable, the node conditions of the path toward the leaf having the highest or lowest average,

and wherein the storing in the statistical information comprises indicating in the statistical information, with a predefined number or character, whether the condition of each node included in the new variable is satisfied.

6. A variable automatic generation device comprising:

a correlation analysis unit that analyzes a correlation between a target variable to be predicted and each variable of statistical information;

a candidate variable selection unit that selects a plurality of variables as candidate variables, in descending order of correlation with the target variable, from among the variables of the statistical information;

a variable generation unit that generates a new variable by combining variables randomly extracted from the candidate variables; and

a data storage unit that stores the new variable in the statistical information.

7. The device of claim 6, wherein the candidate variable selection unit assigns different extraction probabilities to the candidate variables according to the order of correlation so that the candidate variables are extracted with those probabilities.

8. The device of claim 6, wherein the variable generation unit determines a bias of a polynomial and a weight of each term of the polynomial through regression analysis between the target variable and the polynomial, the polynomial consisting of a plurality of terms generated by combining the randomly extracted variables, and generates the polynomial reflecting the bias and the weights as the new variable.

9. The device of claim 6, wherein the variable generation unit generates at least one decision tree having, as nodes, conditions on each of the extracted variables, and generates, as the new variable, the node conditions of the path toward the leaf having the highest or lowest average of the target variable among the leaves of the decision tree.

10. A computer-readable recording medium recording a program for performing the method according to any one of claims 1 to 5.