WO2020111423A1 - Method for automatically generating variables for data modeling, and device thereof - Google Patents

Info

Publication number
WO2020111423A1
Authority
WO
WIPO (PCT)
Prior art keywords
variable
variables
statistical information
candidate
new
Prior art date
Application number
PCT/KR2019/007409
Other languages
French (fr)
Korean (ko)
Inventor
김지훈
최유리
유두열
Original Assignee
주식회사 솔리드웨어
Priority date
Filing date
Publication date
Application filed by 주식회사 솔리드웨어
Publication of WO2020111423A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/15: Correlation function computation including computation of convolution operations
    • G06F17/11: Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Definitions

  • the present invention relates to data modeling, and more particularly, to a method and apparatus for automatically generating various variables used in data modeling.
  • variables for the predictive model are mostly generated by the heuristic judgment of experts with domain knowledge of the data. For example, suppose a credit rating model is built from statistical data such as gender, age, income, number of existing loans, and amount of existing loans. The model can be built using each variable in the statistical data as it is, but an expert in the field can make it more accurate and precise by creating a new variable, the existing loan amount divided by income.
  • the technical problem to be achieved by the embodiments of the present invention is to provide a method and apparatus for automatically generating a variable that is substantially helpful in data modeling.
  • an example of an automatic variable generation method for data modeling includes: analyzing a correlation between a target variable to be predicted and each variable of statistical information; selecting a plurality of variables as candidate variables in descending order of correlation with the target variable from among the variables of the statistical information; randomly extracting a predetermined number of variables from the candidate variables; generating a new variable by combining the randomly extracted variables; and storing the new variable in the statistical information.
  • an example of a variable automatic generation device for data modeling according to an embodiment of the present invention, for achieving the above technical problem, includes: a correlation analysis unit that analyzes, based on statistical information, a correlation between a target variable to be predicted and each variable of the statistical information; a candidate variable selection unit that selects a plurality of variables as candidate variables in descending order of correlation with the target variable from among the variables of the statistical information; a variable generation unit that generates a new variable by combining variables randomly extracted from the candidate variables; and a data storage unit that stores the new variable in the statistical information.
  • new variables can be automatically generated through variables existing in the original data without user intervention. Even if there are hundreds or thousands or more of variables included in the raw data, it is possible to generate variables that are practically helpful to the predictive model.
  • diverse variables can be generated because probabilistically selected candidate variables are used. In addition, unnecessary computation for generating diverse variables can be effectively reduced because only the selected candidate variables are used rather than all variables.
  • FIG. 1 is a view showing an example of a variable automatic generation device and a statistical information database according to an embodiment of the present invention
  • FIG. 2 is a diagram showing an example of statistical information according to an embodiment of the present invention.
  • FIG. 3 is a flowchart illustrating an example of an automatic variable generation method according to an embodiment of the present invention
  • FIG. 4 is a diagram illustrating an example of a method for determining a correlation between a target variable and each variable of statistical information according to an embodiment of the present invention
  • FIG. 5 is a diagram showing an example of a method of selecting a variable to be used for generating a new variable from candidate variables according to an embodiment of the present invention
  • FIG. 6 is a diagram illustrating an example of a method of generating a new variable through combining between variables extracted from candidate variables according to an embodiment of the present invention
  • FIG. 7 is a diagram illustrating an example of a method for determining the weight and bias of a polynomial defining a newly generated variable according to an embodiment of the present invention
  • FIG. 8 is a diagram illustrating an example of a method of combining variables in a rule manner according to an embodiment of the present invention
  • FIG. 9 is a diagram showing an example of statistical information in which new variables are generated according to an embodiment of the present invention.
  • FIG. 10 is a view showing an example of the configuration of a variable automatic generation device according to an embodiment of the present invention.
  • variable automatic generation method and apparatus for data modeling according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.
  • FIG. 1 is a diagram illustrating an example of a variable automatic generation device and a statistical information database according to an embodiment of the present invention.
  • variable automatic generation device 100 automatically generates at least one new variable for data modeling based on a variable of statistical information stored in the statistical information database 110.
  • the variable automatic generation device 100 may be implemented as a computing device including a memory, a processor, and an input/output device.
  • the memory loads a software program in which a variable auto-generation algorithm is implemented
  • the processor may generate a new variable according to the present embodiment by executing a software program loaded in the memory.
  • an example of an automatic variable generation method is illustrated in FIG. 3.
  • the statistical information database 110 stores various statistical information.
  • the statistical information means a data set including information about at least one variable.
  • statistical information on bank customers may include information such as gender, age, income, number of existing loans, and amount of existing loans as variables.
  • An example of statistical information is shown in FIG. 2.
  • FIG. 2 is a view showing an example of statistical information according to an embodiment of the present invention.
  • statistical information 200 includes at least one sample 230 including information on at least one variable 210.
  • the statistical information 200 may include a target variable 220 to be predicted through data modeling (ie, a predictive model).
  • the target variable 220 may be customer credit.
  • FIG. 3 is a flowchart illustrating an example of an automatic variable generation method according to an embodiment of the present invention.
  • the variable automatic generation device 100 determines the correlation between a target variable and each variable of the statistical information (S300). For example, in the example of FIG. 2, the variable automatic generation device 100 determines the correlation between the target variable Y 220 and each variable X1,...,Xm 210 of the statistical information. An example of determining the correlation between variables is shown in FIG. 4.
  • the variable automatic generation device 100 selects a predetermined number of variables having a high correlation with the target variable as candidate variables (S310).
  • an example of selecting five candidate variables based on the correlation between the target variable and each variable of the statistical information is illustrated in FIG. 5.
  • the number of candidate variables may be variously set according to embodiments.
  • the variable automatic generation device 100 randomly extracts a predetermined number of variables from the candidate variables (S320). FIG. 5 shows an example of randomly extracting three of the five candidate variables. The number of variables extracted from the candidate variables may be variously set according to embodiments.
  • variable automatic generation device 100 may extract a certain number of variables by applying the same extraction probability to each candidate variable.
  • alternatively, the variable automatic generation device 100 may assign a different extraction probability to each candidate variable according to the magnitude of its correlation, so that variables highly correlated with the target variable are more likely to be extracted. For example, the higher the correlation, the higher the extraction probability. An example of a method of extracting by assigning a different extraction probability to each candidate variable is described with reference to FIG. 5.
  • the variable automatic generation device 100 combines a predetermined number of variables extracted from candidate variables to generate new variables (S330).
  • the variables may be combined in various ways, such as a linear method, a multiplication method, a division method, or a rule method. Examples of various variable combining methods are shown in FIGS. 6 and 8.
  • the variable automatic generation device 100 stores the new variable in the statistical information. That is, it computes the value of the new variable for each sample and records it in the statistical information. For example, if new variables G1, G2, ..., G5 are generated, the variable automatic generation device 100 computes the value of each of these variables for every sample in the statistical information of the statistical information database 110 and stores it, as shown in FIG. 9.
  • the newly generated variables according to the present embodiment are used for data modeling (i.e., a predictive model) that predicts the target variable.
  • various conventional modeling methods, including machine learning, can produce more accurate predictive models by using the newly created variables.
  • FIG. 4 is a diagram illustrating an example of a method for determining a correlation between a target variable and each variable of statistical information according to an embodiment of the present invention.
  • the variable automatic generation device 100 determines the correlation between the target variable Y 220 and each variable X1,...,Xm 210 of the statistical information.
  • the variable automatic generation device 100 may determine the relative importance of each variable 210 with respect to the target variable 220 using an F-test.
  • various methods other than the F-test can also be applied to this embodiment.
  • the variable automatic generation device 100 may select a plurality of variables as candidate variables in descending order of correlation with the target variable 220 from among the variables 210 of the statistical information. For example, if the number of candidate variables is set to 5 and the variables ranked by correlation with the target variable are X3, X4, X5, X1, X2, the variable automatic generation device 100 selects X3, X4, X5, X1, and X2 as candidate variables, as shown in FIG. 5.
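The ranking and candidate selection described for FIG. 4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the source names the F-test but does not say which form is used, so the univariate linear-regression F statistic, F = r²/(1 − r²) · (n − 2), is an assumption, as are the function names `f_statistic` and `select_candidates` and the toy data.

```python
import math


def f_statistic(x, y):
    """Univariate linear-regression F statistic of feature x against target y.

    Assumed form (not specified in the source): F = r^2 / (1 - r^2) * (n - 2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information about the target
    r2 = (sxy * sxy) / (sxx * syy)
    if r2 >= 1.0:
        return math.inf  # perfectly linear relationship
    return r2 / (1.0 - r2) * (n - 2)


def select_candidates(columns, target, k):
    """Return the names of the k variables with the highest F statistic."""
    scores = {name: f_statistic(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four variables and a target with six samples.
target = [1, 2, 3, 4, 5, 6]
columns = {
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [2, 1, 4, 3, 6, 5],
    "X3": [3, 3, 3, 3, 3, 3],
    "X4": [1, -1, 1, -1, 1, -1],
}
print(select_candidates(columns, target, 3))  # ['X1', 'X2', 'X4']
```

Any other relevance score could be substituted for `f_statistic` without changing the selection step, which matches the statement that methods other than the F-test may be applied.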
  • FIG. 5 is a diagram illustrating an example of a method of selecting a variable to be used for generating a new variable from candidate variables according to an embodiment of the present invention.
  • candidate variables 500 are X3, X4, X5, X1, and X2.
  • the variable automatic generation device 100 randomly extracts (530) a number of variables from the candidate variables 500.
  • the number of randomly extracted variables may be variously set according to embodiments; in this embodiment it is three.
  • when the variable automatic generation device 100 randomly extracts three variables from the candidate variables 500, the extraction probability of each variable may differ. To this end, the variable automatic generation device 100 first arranges the candidate variables 500 in descending order of correlation and assigns the importance values 5, 4, 3, 2, 1 to them in that order (510).
  • the importance 510 is a value representing the relative importance among the candidate variables 500 and may be expressed in various forms. For example, depending on the embodiment, 10, 8, 6, 4, 2 or 100, 50, 25, 12, 5 may be assigned to the five candidate variables 500 according to the magnitude of the correlation. The magnitude of the importance values can be variously modified.
  • the variable automatic generation device 100 may assign a different extraction probability to each candidate variable 500 according to the importance 510 assigned to it.
  • the importance of candidate variable X3, which has the highest correlation, is normalized (520) to 5/15, that is, its importance 5 divided by the sum of all importance values (5+4+3+2+1 = 15).
  • the variable automatic generation device 100 randomly extracts (530) a predetermined number of variables (three in this embodiment) using the normalized values 520 as extraction probabilities.
  • candidate variable X3 has an extraction probability of (5/15 * 100)%.
  • candidate variable X1 has an extraction probability of (2/15 * 100)%.
  • the probability that each candidate variable is selected follows its extraction probability. This is equivalent to randomly drawing marbles from a pocket containing 15 marbles in total: five representing candidate variable X3, four representing X4, three representing X5, two representing X1, and one representing X2.
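The importance-weighted extraction described for FIG. 5 can be sketched as below. The rank-based importance values (5, 4, 3, 2, 1) and their normalization come from the text; drawing without replacement, so that the extracted variables are distinct, is an assumption, since the source does not say how a repeated draw is handled. The function names are hypothetical.

```python
import random


def extraction_probabilities(ranked_candidates):
    """Map each candidate to importance / total, with importance k, k-1, ..., 1 by rank."""
    k = len(ranked_candidates)
    importance = {name: k - i for i, name in enumerate(ranked_candidates)}
    total = sum(importance.values())
    return {name: weight / total for name, weight in importance.items()}


def extract_variables(ranked_candidates, count, rng=None):
    """Draw `count` distinct candidates, favoring those with higher importance.

    Assumption: a drawn candidate is removed from the pool (no replacement),
    and count does not exceed the number of candidates.
    """
    rng = rng or random.Random()
    probs = extraction_probabilities(ranked_candidates)
    pool = list(ranked_candidates)
    chosen = []
    for _ in range(count):
        weights = [probs[name] for name in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


candidates = ["X3", "X4", "X5", "X1", "X2"]  # already sorted by correlation
print(extraction_probabilities(candidates)["X3"])  # 5/15
print(extract_variables(candidates, 3, random.Random(0)))
```

Passing a seeded `random.Random` makes a run reproducible, which is useful when comparing the variables generated from different extractions.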
  • FIG. 6 is a diagram illustrating an example of a method of generating a new variable through combining between variables extracted from candidate variables according to an embodiment of the present invention.
  • variable combining methods include a linear method 600 that linearly combines all or some of the extracted variables, methods 610 and 620 that combine variables by applying various operations (division, multiplication, etc.), and a rule method 630 that uses a decision tree.
  • the linear method 600 linearly combines two or more of the extracted variables X3, X4, and X2 to generate a new variable G1.
  • This embodiment shows an example in which all three variables (X3, X4, X2) are linearly combined, but the variable automatic generation device 100 can also create a new variable consisting of a linear combination of X3 and X4, of X4 and X2, of X3 and X2, and so on. The more variables are extracted from the candidate variables, the more ways there are to combine them. In this case, the variable automatic generation device 100 may generate only up to a predetermined number of new variables.
  • the division method 610 creates a new variable G2 through division between two or more variables.
  • This embodiment shows an example of generating, as the new variable G2, a polynomial composed of the three terms X3/X4, X3/X2, and X2/X4, but other forms such as (X2*X3)/X4 and X2/(X3*X4) are also possible.
  • the variable automatic generation device 100 may generate every possible division combination as a new variable, or it may stop generating new variables once a certain number (for example, 10 or 100) have been generated.
  • the multiplication method 620 creates a new variable G3 through multiplication between two or more variables.
  • as with the division method described above, the multiplication method allows various combinations when generating new variables.
  • the variable automatic generation device 100 may generate every possible multiplication combination as a new variable, or it may stop generating new variables once a certain number (for example, 10 or 100) have been generated.
  • New variables created by combining variables are expressed as polynomials.
  • the weight of each term of the polynomial (W3, W3', W3'', W4, W4', W4'', W2, W2', W2'') and the bias can be determined through regression analysis between the new variable and the target variable, an example of which is shown in FIG. 7.
  • the rule method 630 is a method using a decision tree, which is described below with reference to FIG. 8.
  • the variable combining methods of the present embodiment are examples given to aid understanding; the present invention is not necessarily limited to them, and various other variable combining methods may be applied depending on the embodiment.
  • the variable automatic generation device 100 may generate a new variable by combining the linear method, the division method, and the multiplication method.
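The division and multiplication methods, together with the cap on the number of generated combinations, can be sketched as follows. This is a simplified illustration that enumerates only pairwise terms; the function name `candidate_terms`, the `limit` parameter, and returning NaN for a zero denominator are assumptions not found in the source.

```python
from itertools import combinations, islice, permutations


def candidate_terms(names, limit):
    """Return up to `limit` (label, function) pairs for pairwise ratio and product terms.

    Each function takes one sample (a dict of variable name -> value) and
    returns the value of the term for that sample.
    """
    def generate():
        for a, b in permutations(names, 2):  # ratios are order-sensitive: a/b != b/a
            yield f"{a}/{b}", (
                lambda s, a=a, b=b: s[a] / s[b] if s[b] != 0 else float("nan")
            )
        for a, b in combinations(names, 2):  # products are symmetric: a*b == b*a
            yield f"{a}*{b}", (lambda s, a=a, b=b: s[a] * s[b])

    return list(islice(generate(), limit))


extracted = ["X3", "X4", "X2"]
terms = dict(candidate_terms(extracted, limit=100))
sample = {"X3": 6.0, "X4": 3.0, "X2": 2.0}  # hypothetical sample values
print(sorted(terms))
print(terms["X3/X4"](sample))  # 2.0
print(terms["X3*X2"](sample))  # 12.0
```

Because the generator is consumed lazily through `islice`, generation stops as soon as the cap is reached rather than building every combination first.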
  • FIG. 7 is a diagram illustrating an example of a method for determining the weight and bias of a polynomial defining a newly generated variable according to an embodiment of the present invention.
  • a polynomial created by combining randomly extracted variables from candidate variables becomes new variables G1, G2, and G3.
  • This embodiment shows an example of obtaining the weights (W3, W4, W2) and the bias of the terms of the polynomial constituting the new variable G1 combined by the linear method 600.
  • the variable automatic generation device 100 may determine the weights and the bias by performing regression analysis on a model consisting of the polynomial 700 that constitutes the new variable and the target variable 220.
  • the variable automatic generation device 100 may use ridge regression as a regression analysis method.
  • various regression analysis methods for analyzing a model composed of the target variable 220 and the polynomial 700 may be applied to this embodiment.
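The weight and bias determination for the linear-method variable G1 can be written out as follows. This is a sketch assuming the standard ridge (L2-penalized least squares) formulation; the bias symbol b, the regularization strength λ, the sample index i, and the sample count n are notation introduced here and are not given in the source.

```latex
\begin{align}
G_1 &= W_3 X_3 + W_4 X_4 + W_2 X_2 + b \\
(\hat{W}_3, \hat{W}_4, \hat{W}_2, \hat{b})
  &= \arg\min_{W_3, W_4, W_2, b}
     \sum_{i=1}^{n} \bigl( Y_i - ( W_3 X_{3,i} + W_4 X_{4,i} + W_2 X_{2,i} + b ) \bigr)^2
     + \lambda \left( W_3^2 + W_4^2 + W_2^2 \right)
\end{align}
```

Here Y_i is the target variable of sample i. The bias is left unpenalized, as is conventional for ridge regression; with λ = 0 the expression reduces to ordinary least squares.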
  • FIG. 8 is a diagram illustrating an example of a method of combining variables in a rule manner according to an embodiment of the present invention.
  • the variable automatic generation device 100 may generate a decision tree having a condition for a variable (X3, X4, X2 in FIG. 5) extracted from a candidate variable as a node.
  • Various conventional methods for generating a decision tree can be applied to this embodiment.
  • the variable automatic generation device 100 may generate a plurality of decision trees according to the location of each variable (root node 800, child nodes 810, 820, etc.) and the conditions. However, considering the amount of computation, the variable automatic generation device 100 may generate only a certain number of decision trees.
  • the variable automatic generation device 100 classifies each sample of the statistical information shown in FIG. 2 according to the decision tree to identify the samples that fall into each leaf 830, 840, 850, 860. It then computes the average of the target variable over the samples in each leaf and generates, as new variables G4 and G5, the node conditions along the paths 870 and 880 leading to the leaf with the highest or lowest average.
  • the variable automatic generation device 100 reflects the new variables G4 and G5 determined by the rule method 630 in the statistical information as shown in FIG. 9, and may record in the statistical information an indicator (for example, a flag) of whether each sample satisfies the condition of the corresponding variable.
  • for example, the variable automatic generation device 100 may assign a value of '1' to a sample that satisfies the conditions of the new variables G4 and G5, and a value of '0' to a sample that does not.
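The rule method described for FIG. 8 can be sketched as below, assuming the decision tree has already been built and is given as a list of root-to-leaf paths. The path representation, the thresholds, the sample values, and the function names are all hypothetical; the source does not specify how the tree itself is constructed beyond saying that conventional methods apply.

```python
from statistics import mean


def satisfies(sample, path):
    """True if the sample meets every (variable, operator, threshold) condition on the path."""
    for name, op, threshold in path:
        value = sample[name]
        ok = value <= threshold if op == "<=" else value > threshold
        if not ok:
            return False
    return True


def best_and_worst_paths(samples, target, leaf_paths):
    """Return the paths whose leaves have the highest and lowest mean target value."""
    averages = []
    for path in leaf_paths:
        ys = [y for s, y in zip(samples, target) if satisfies(s, path)]
        if ys:  # skip leaves that receive no samples
            averages.append((mean(ys), path))
    best = max(averages, key=lambda item: item[0])[1]
    worst = min(averages, key=lambda item: item[0])[1]
    return best, worst


def rule_flags(samples, path):
    """Value of the rule variable per sample: 1 if the path conditions hold, else 0."""
    return [1 if satisfies(s, path) else 0 for s in samples]


# Hypothetical two-level tree over X3 (root), X4 and X2 (children).
leaf_paths = [
    [("X3", "<=", 5), ("X4", "<=", 3)],
    [("X3", "<=", 5), ("X4", ">", 3)],
    [("X3", ">", 5), ("X2", "<=", 4)],
    [("X3", ">", 5), ("X2", ">", 4)],
]
samples = [
    {"X3": 1, "X4": 1, "X2": 0},
    {"X3": 1, "X4": 5, "X2": 0},
    {"X3": 9, "X4": 0, "X2": 1},
    {"X3": 9, "X4": 0, "X2": 7},
]
target = [10, 20, 30, 40]

g4_path, g5_path = best_and_worst_paths(samples, target, leaf_paths)
print(rule_flags(samples, g4_path))  # [0, 0, 0, 1]
print(rule_flags(samples, g5_path))  # [1, 0, 0, 0]
```

Keeping each new variable as an explicit list of conditions preserves its readability: G4 here reads directly as "X3 > 5 and X2 > 4".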
  • FIG. 9 is a diagram illustrating an example of statistical information in which a new variable is generated according to an embodiment of the present invention.
  • the variable automatic generation device 100 reflects the newly generated variables (G1, G2, G3, G4, G5) 900 in the statistical information, and computes and stores the value 910 of each new variable for each sample.
  • FIG. 10 is a view showing an example of the configuration of a variable automatic generation device according to an embodiment of the present invention.
  • the variable automatic generation device 100 includes a correlation analysis unit 1000, a candidate variable selection unit 1010, a variable generation unit 1020, and a data storage unit 1030.
  • the correlation analysis unit 1000 analyzes the correlation between the target variable to be predicted and each variable of the statistical information.
  • an F-test can be used as the correlation analysis method.
  • the candidate variable selection unit 1010 selects a plurality of variables as candidate variables in descending order of correlation with the target variable from among the variables of the statistical information.
  • the candidate variable selection unit 1010 may assign a different extraction probability to each candidate variable according to the order of correlation, for use in random extraction. For example, a higher extraction probability is given to a highly correlated candidate variable so that it is more likely to be extracted.
  • the variable generation unit 1020 randomly extracts a predetermined number of variables from the candidate variables and creates new variables by combining the randomly extracted variables.
  • the variable generation unit 1020 may generate, as a new variable, a polynomial whose terms consist of addition, multiplication, or division between the randomly extracted variables, or may generate new variables using a decision tree as shown in FIG. 8. For a new variable composed of a polynomial, the variable generation unit 1020 may determine the bias of the polynomial and the weight of each term through regression analysis between the new variable and the target variable.
  • the data storage unit 1030 calculates a value for the new variable and stores it in the statistical information.
  • when the new variable is in the form of a conditional statement generated using a decision tree, the data storage unit 1030 may indicate in the statistical information, using predefined numbers or characters, whether each sample satisfies each node condition included in the new variable.
  • the present invention can also be embodied as computer readable codes on a computer readable recording medium.
  • the computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.
  • the computer-readable recording medium can be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a method for automatically generating variables for data modeling, and a device thereof. The device for automatically generating variables analyzes the correlation between a target variable to be predicted and each variable of statistical information, selects a plurality of variables among the variables of the statistical information as candidate variables in descending order of correlation with the target variable, generates a new variable by arbitrarily extracting a certain number of variables from among the candidate variables and combining the arbitrarily extracted variables, and stores the new variable and a value for the new variable in the statistical information.

Description

Variable automatic generation method and device for data modeling
The present invention relates to data modeling, and more particularly, to a method and apparatus for automatically generating various variables used in data modeling.
In data modeling, feature engineering, which creates appropriate variables, is a very important process. Variables for a predictive model are mostly generated by the heuristic judgment of experts with domain knowledge of the data. For example, suppose a credit rating model is built from statistical data such as gender, age, income, number of existing loans, and amount of existing loans. The model can be built using each variable in the statistical data as it is, but an expert in the field can make it more accurate and precise by creating a new variable, the existing loan amount divided by income. However, creating such variables is limited by its dependence on the subjective experience of experts in the field. Moreover, for statistical data with hundreds or thousands of variables, it is practically almost impossible for an expert to find the relationships among the variables and create new ones.
Methods exist for automatically generating variables using PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), and the like. However, these methods aim at improving computational efficiency and are therefore limited in producing variables that substantially help a predictive model. In particular, dimensionality reduction causes the resulting variables to lose explanatory power. For example, if PCA is applied to age and loan amount in the credit rating model described above, the new variable takes the direction of largest variance as its basis. The new variable represents the direction in which the distribution of age and loan amount varies most, and it is difficult for a user to grasp the meaning of such a variable intuitively.
The technical problem to be achieved by the embodiments of the present invention is to provide a method and apparatus for automatically generating variables that are substantially helpful in data modeling.
In order to achieve the above technical problem, an example of an automatic variable generation method for data modeling according to an embodiment of the present invention includes: analyzing a correlation between a target variable to be predicted and each variable of statistical information; selecting a plurality of variables as candidate variables in descending order of correlation with the target variable from among the variables of the statistical information; randomly extracting a predetermined number of variables from the candidate variables; generating a new variable by combining the randomly extracted variables; and storing the new variable in the statistical information.
In order to achieve the above technical problem, an example of a variable automatic generation device for data modeling according to an embodiment of the present invention includes: a correlation analysis unit that analyzes, based on statistical information, a correlation between a target variable to be predicted and each variable of the statistical information; a candidate variable selection unit that selects a plurality of variables as candidate variables in descending order of correlation with the target variable from among the variables of the statistical information; a variable generation unit that generates a new variable by combining variables randomly extracted from the candidate variables; and a data storage unit that stores the new variable in the statistical information.
According to an embodiment of the present invention, new variables can be automatically generated from the variables existing in the original data without user intervention. Even when the original data contains hundreds or thousands of variables or more, variables that substantially help the predictive model can be generated. Because candidate variables probabilistically selected from among the variables in the original data are used, diverse variables can be generated. In addition, because only the selected candidate variables are used rather than all variables, unnecessary computation for generating diverse variables can be effectively reduced.
FIG. 1 is a diagram showing an example of an automatic variable generation device and a statistical information database according to an embodiment of the present invention;
FIG. 2 is a diagram showing an example of statistical information according to an embodiment of the present invention;
FIG. 3 is a flowchart showing an example of an automatic variable generation method according to an embodiment of the present invention;
FIG. 4 is a diagram showing an example of a method of determining the correlation between a target variable and each variable of statistical information according to an embodiment of the present invention;
FIG. 5 is a diagram showing an example of a method of selecting, from candidate variables, the variables to be used for generating a new variable according to an embodiment of the present invention;
FIG. 6 is a diagram showing an example of a method of generating a new variable by combining variables extracted from candidate variables according to an embodiment of the present invention;
FIG. 7 is a diagram showing an example of a method of determining the weights and bias of a polynomial defining a newly generated variable according to an embodiment of the present invention;
FIG. 8 is a diagram showing an example of a method of combining variables by a rule method according to an embodiment of the present invention;
FIG. 9 is a diagram showing an example of statistical information to which new variables have been added according to an embodiment of the present invention; and
FIG. 10 is a diagram showing an example of the configuration of an automatic variable generation device according to an embodiment of the present invention.
Hereinafter, a method and a device for automatically generating variables for data modeling according to an embodiment of the present invention are described in detail with reference to the accompanying drawings.
FIG. 1 is a diagram showing an example of an automatic variable generation device and a statistical information database according to an embodiment of the present invention.
Referring to FIG. 1, the automatic variable generation device (100) automatically generates at least one new variable for data modeling based on the variables of the statistical information stored in the statistical information database (110).
The automatic variable generation device (100) may be implemented as a computing device including a memory, a processor, an input/output device, and the like. For example, a software program implementing the automatic variable generation algorithm is loaded into the memory, and the processor executes the loaded program to generate new variables according to the present embodiment. An example of the automatic variable generation method is shown in FIG. 3.
The statistical information database (110) stores various kinds of statistical information. Here, statistical information means a data set containing information on at least one variable. For example, statistical information on bank customers may include information whose variables are gender, age, income, number of existing loans, amount of existing loans, and so on. An example of statistical information is shown in FIG. 2.
FIG. 2 is a diagram showing an example of statistical information according to an embodiment of the present invention.
Referring to FIG. 2, the statistical information (200) includes at least one sample (230) containing information on at least one variable (210). The statistical information (200) may also include a target variable (220) to be predicted through data modeling (that is, by a prediction model). For example, in the bank customer statistics discussed with reference to FIG. 1, the target variable (220) may be customer creditworthiness.
FIG. 3 is a flowchart showing an example of an automatic variable generation method according to an embodiment of the present invention.
Referring to FIG. 3, the automatic variable generation device (100) determines the correlation between the target variable and each variable of the statistical information (S300). For example, in the example of FIG. 2, the automatic variable generation device (100) determines the correlation between the target variable Y (220) and each of the variables X1, ..., Xm (210) of the statistical information. An example of determining the correlation between variables is shown in FIG. 4.
The automatic variable generation device (100) selects a predetermined number of variables having a high correlation with the target variable as candidate variables (S310). FIG. 5 shows an example in which five candidate variables are selected based on the correlation between the target variable and each variable of the statistical information. The number of candidate variables may be set differently depending on the embodiment.
Once the candidate variables are determined, the automatic variable generation device (100) randomly extracts a predetermined number of variables from the candidate variables (S320). FIG. 5 shows an example in which three of the five candidate variables are randomly extracted. The number of variables extracted from the candidate variables may be set differently depending on the embodiment.
In one embodiment, the automatic variable generation device (100) may extract the predetermined number of variables by applying the same extraction probability to every candidate variable. In another embodiment, the automatic variable generation device (100) may assign a different extraction probability to each candidate variable according to the magnitude of its correlation, so that variables highly correlated with the target variable are more likely to be extracted. For example, the higher the correlation, the higher the extraction probability that may be assigned. An example of extraction with different extraction probabilities assigned to the candidate variables is described with reference to FIG. 5.
The automatic variable generation device (100) generates a new variable by combining the predetermined number of variables extracted from the candidate variables (S330). The variables may be combined in various ways, such as a linear method, a multiplication method, a division method, or a rule method. Examples of the various combination methods are shown in FIGS. 6 and 8.
The automatic variable generation device (100) stores the new variable in the statistical information. That is, the automatic variable generation device (100) determines the value of the new variable for each sample and reflects it in the statistical information. For example, if new variables G1, G2, ..., G5 have been generated, the automatic variable generation device (100) determines the value of each of these variables for each sample and stores it in the statistical information of the statistical information database (110), as shown in FIG. 9.
The variables newly generated according to the present embodiment are used in data modeling (that is, a prediction model) that predicts the target variable. For example, various conventional modeling methods, including machine learning, can use the newly generated variables to produce a more accurate prediction model.
FIG. 4 is a diagram showing an example of a method of determining the correlation between a target variable and each variable of statistical information according to an embodiment of the present invention.
Referring to FIGS. 2 and 4 together, the automatic variable generation device (100) determines the correlation between the target variable Y (220) and each of the variables X1, ..., Xm (210) of the statistical information. For example, the automatic variable generation device (100) may use an F-test to determine the relative importance of each variable (210) with respect to the target variable (220). Various methods other than the F-test may also be applied to the present embodiment.
The automatic variable generation device (100) may select, as candidate variables, a plurality of variables from among the variables (210) of the statistical information in descending order of correlation with the target variable (220). For example, if the number of candidate variables is defined as five and the variables in descending order of correlation with the target variable are X3, X4, X5, X1, X2, the automatic variable generation device (100) selects X3, X4, X5, X1, and X2 as the candidate variables, as shown in FIG. 5.
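The ranking and selection described above can be illustrated with a minimal sketch. It assumes that the F-test is the F-statistic of a one-variable linear regression of the target on each variable, computed from the squared Pearson correlation; the specification does not fix the exact form of the test. The function names `f_statistic` and `select_candidates` and the column layout are hypothetical and are introduced only for illustration.

```python
import math


def f_statistic(xs, ys):
    """F-statistic of a one-variable linear regression of ys on xs (assumed form of the F-test)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information about the target
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    if r2 >= 1.0:
        return math.inf  # perfect linear relationship
    return r2 / (1.0 - r2) * (n - 2)


def select_candidates(columns, target, k):
    """Return the names of the k variables with the highest F-statistic, best first."""
    scores = {name: f_statistic(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Ranking by the F-statistic is equivalent here to ranking by the absolute value of the correlation, so either score could be used for the ordering.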
FIG. 5 is a diagram showing an example of a method of selecting, from candidate variables, the variables to be used for generating a new variable according to an embodiment of the present invention.
Referring to FIG. 5, the candidate variables (500) are X3, X4, X5, X1, and X2. The automatic variable generation device (100) randomly extracts (530) a predetermined number of variables from the candidate variables (500). The number of randomly extracted variables may be set differently depending on the embodiment; in the present embodiment it is defined as three.
When randomly extracting three variables from the candidate variables (500), the automatic variable generation device (100) may give each variable a different extraction probability. To this end, the automatic variable generation device (100) may first arrange the candidate variables (500) in descending order of correlation and assign importance values (510) of 5, 4, 3, 2, and 1 to them in that order. Here, the importance (510) is a value representing the relative importance among the candidate variables (500) and may be expressed in various forms. For example, the five candidate variables (500) may be assigned 10, 8, 6, 4, 2 or 100, 50, 25, 12, 5 according to the magnitude of the correlation; the magnitudes of the importance values may be varied depending on the embodiment.
The automatic variable generation device (100) may assign a different extraction probability to each candidate variable (500) according to the importance (510) assigned to it. When importance values (510) of 5, 4, 3, 2, and 1 are assigned as in the present embodiment, the importance values (510) may be normalized (520) so that they sum to 1, which allows a probabilistic treatment. That is, the importance (510) of each candidate variable (500) may be normalized (520) by dividing it by the total importance (15 = 5 + 4 + 3 + 2 + 1). For example, candidate variable X3, which has the highest correlation, is normalized (520) to 5/15.
The automatic variable generation device (100) randomly extracts (530) the predetermined number of variables (three in the present embodiment) using the normalized (520) values as extraction probabilities. In the present embodiment, candidate variable X3 has an extraction probability of (5/15 * 100)%, and candidate variable X1 has an extraction probability of (2/15 * 100)%. The probability that each candidate variable is selected thus differs according to its extraction probability. In terms of marbles, this is equivalent to drawing marbles at random from a bag containing five marbles representing candidate variable X3, four representing X4, three representing X5, two representing X1, and one representing X2 (15 marbles in total, equal to the total importance).
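The normalization and weighted extraction described above can be sketched as follows. The sketch assumes that the three extracted variables must be distinct, so that once a variable is drawn all of its marbles are removed from the bag; the specification does not state this explicitly. The function names `normalize` and `draw_without_replacement` are hypothetical.

```python
import random


def normalize(importance):
    """Divide each importance value by the total so that the values sum to 1."""
    total = sum(importance.values())
    return {name: value / total for name, value in importance.items()}


def draw_without_replacement(importance, count, rng):
    """Draw `count` distinct variables, each draw weighted by the remaining importance values.

    Assumption: a drawn variable is removed from the pool before the next draw.
    """
    pool = dict(importance)
    drawn = []
    for _ in range(count):
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        drawn.append(pick)
        del pool[pick]
    return drawn


# Importance values (510) from the embodiment, in descending order of correlation.
IMPORTANCE = {"X3": 5, "X4": 4, "X5": 3, "X1": 2, "X2": 1}
```

Calling `draw_without_replacement(IMPORTANCE, 3, random.Random())` returns three distinct candidate names; passing a seeded `random.Random` makes the draw reproducible.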
FIG. 6 is a diagram showing an example of a method of generating a new variable by combining variables extracted from candidate variables according to an embodiment of the present invention.
Referring to FIG. 6, the variable combination methods include a linear method (600) that linearly combines all or some of the extracted variables, methods (610, 620) that combine the variables by applying operations such as multiplication and division, and a rule method (630) that uses a decision tree.
The description below is based on the three variables X3, X4, and X2 extracted from the candidate variables (500) in the example of FIG. 5.
The linear method (600) generates a new variable (G1) by linearly combining at least two of the extracted variables (X3, X4, X2). The present embodiment shows an example in which all three variables (X3, X4, X2) are linearly combined, but the automatic variable generation device (100) may also generate a new variable consisting of a linear combination of X3 and X4, a new variable consisting of a linear combination of X4 and X2, a new variable consisting of a linear combination of X3 and X2, and so on. If many variables are extracted from the candidate variables, the number of ways of combining them also grows. In that case, the automatic variable generation device (100) may generate no more than a predetermined number of new variables.
The division method (610) generates a new variable (G2) through division between two or more variables. The variables can be divided in many different ways. The present embodiment shows an example in which a polynomial consisting of the three terms X3/X4, X3/X2, and X2/X4 is generated as the new variable (G2), but there are many other ways of dividing the variables, such as (X2*X3)/X4 and X2/(X3*X4), and many possible forms depending on whether the polynomial consists of one division term or several. The automatic variable generation device (100) may generate every possible division-based combination as a separate new variable, or it may stop generating new variables once a predetermined number (for example, 10 or 100) of new variables has been generated.
The multiplication method (620) generates a new variable (G3) through multiplication between two or more variables. As with the division method described above, a new variable can be generated by the multiplication method in a wide variety of combinations. The automatic variable generation device (100) may generate every possible multiplication-based combination as a separate new variable, or it may stop generating new variables once a predetermined number (for example, 10 or 100) of new variables has been generated.
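The enumeration of division and multiplication terms, and the cap on how many are generated, can be sketched as follows. This is only an illustration under stated assumptions: it enumerates simple ratios of two variables and products of two or more distinct variables, whereas the specification also allows richer forms such as (X2*X3)/X4. The function names and the `limit` parameter are hypothetical.

```python
from itertools import combinations, islice, permutations


def product_terms(names):
    """Yield every product of two or more distinct variables as a tuple of names."""
    for size in range(2, len(names) + 1):
        yield from combinations(names, size)


def ratio_terms(names):
    """Yield every ordered pair (numerator, denominator) of distinct variables."""
    yield from permutations(names, 2)


def capped(terms, limit):
    """Stop generating once `limit` terms have been produced."""
    return list(islice(terms, limit))


def evaluate_product(term, row):
    """Value of a product term for one sample, given as a dict of variable values."""
    value = 1.0
    for name in term:
        value *= row[name]
    return value


def evaluate_ratio(term, row):
    """Value of a ratio term for one sample; assumes the denominator is nonzero."""
    numerator, denominator = term
    return row[numerator] / row[denominator]
```

Because the generators are lazy, `capped(ratio_terms(names), 10)` never builds more terms than the cap, which matches the idea of ending generation once a predetermined number of new variables exists.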
A new variable generated by combining variables is expressed as a polynomial. The weights (W3, W3', W3", W4, W4', W4", W2, W2', W2") applied to the variables in the terms of the polynomial, and the bias, may be determined through regression analysis between the new variable and the target variable. An example is shown in FIG. 7.
The rule method (630) uses a decision tree and is described further with reference to FIG. 8.
The variable combination methods of the present embodiment are examples provided to aid understanding; the present invention is not necessarily limited to them, and various other variable combination methods may be applied depending on the embodiment. For example, the automatic variable generation device (100) may generate a new variable by combining the linear, division, and multiplication methods with one another.
FIG. 7 is a diagram showing an example of a method of determining the weights and bias of a polynomial defining a newly generated variable according to an embodiment of the present invention.
Referring to FIGS. 6 and 7, a polynomial formed by combining variables randomly extracted from the candidate variables becomes a new variable (G1, G2, G3). The present embodiment shows an example of obtaining the weights (W3, W4, W2) of the terms, and the bias, of the polynomial constituting the new variable (G1) combined by the linear method (600).
The automatic variable generation device (100) may determine the weights and the bias by performing regression analysis on a model consisting of the polynomial (700) constituting the new variable and the target variable (200). For example, the automatic variable generation device (100) may use ridge regression as the regression method. Various other regression methods for analyzing a model consisting of the target variable (200) and the polynomial (700) may also be applied to the present embodiment.
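A minimal sketch of fitting the weights and bias of a linear new variable such as G1 by ridge regression is given below. It assumes the common formulation in which the bias is not penalized: the inputs and target are centered, the regularized normal equations are solved, and the bias is recovered from the means. The names `solve`, `ridge_fit`, and the regularization strength `alpha` are hypothetical; the specification does not give a value for the regularization strength.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is nonsingular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(rows, target, alpha):
    """Return (weights, bias) minimizing squared error plus alpha * sum(w^2).

    rows: one list of term values per sample (for G1: [X3, X4, X2]).
    The bias is left unpenalized by centering the data first.
    """
    n, p = len(rows), len(rows[0])
    mean_x = [sum(row[j] for row in rows) / n for j in range(p)]
    mean_y = sum(target) / n
    xc = [[row[j] - mean_x[j] for j in range(p)] for row in rows]
    yc = [y - mean_y for y in target]
    gram = [[sum(r[i] * r[j] for r in xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(p)]
    weights = solve(gram, rhs)
    bias = mean_y - sum(w * mx for w, mx in zip(weights, mean_x))
    return weights, bias
```

With the fitted values, the new variable for a sample is the weighted sum of its term values plus the bias. The same routine applies to the division and multiplication polynomials if each row holds the values of the corresponding terms.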
FIG. 8 is a diagram showing an example of a method of combining variables by a rule method according to an embodiment of the present invention.
Referring to FIG. 8, the automatic variable generation device (100) may generate a decision tree whose nodes are conditions on the variables extracted from the candidate variables (X3, X4, and X2 in FIG. 5). Various conventional methods of generating decision trees may be applied to the present embodiment. The automatic variable generation device (100) may generate a plurality of decision trees that differ in the position at which each variable is placed (root node (800), child nodes (810, 820), and so on) and in the conditions. In view of the amount of computation, however, the automatic variable generation device (100) may generate only a predetermined number of decision trees.
Once a decision tree is generated, the automatic variable generation device (100) classifies each sample of the statistical information described with reference to FIG. 2 according to the decision tree and identifies the samples belonging to each leaf (830, 840, 850, 860). The automatic variable generation device (100) then computes the mean of the target variable over the samples belonging to each leaf, and generates the node conditions on the paths (870, 880) leading to the leaves with the highest or lowest mean as new variables (G4, G5).
In the present embodiment, if the mean over the samples of the second leaf (840) is the highest and the mean over the samples of the fourth leaf (860) is the lowest, the condition of the path (870) leading to the second leaf (840) (X3>0.2 & X4>=1.5) and the condition of the path leading to the fourth leaf (860) (X3<=0.2 & X2<=0.1) are generated as new variables (G4, G5), respectively.
The automatic variable generation device (100) reflects the new variables (G4, G5) determined by the rule method (630) in the statistical information as shown in FIG. 9, and may enter into the statistical information, as a predefined value (for example, a flag), whether each sample satisfies the condition of the corresponding variable. For example, the automatic variable generation device (100) may assign '1' to samples that satisfy the condition of a new variable (G4, G5) and '0' to samples that do not.
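The rule method can be sketched as follows for a single, already constructed decision tree. The thresholds on the paths to the second and fourth leaves follow the example in the text, on the assumption that the root condition tests variable X3; the conditions for the first and third leaves are assumed to be the complements of those paths. The leaf names, the sample layout, and the function names are hypothetical.

```python
# Each leaf is described by the node conditions on the path leading to it.
# leaf2 and leaf4 follow the example in the text; leaf1 and leaf3 are assumed complements.
LEAF_PATHS = {
    "leaf1": [lambda s: s["X3"] > 0.2, lambda s: s["X4"] < 1.5],
    "leaf2": [lambda s: s["X3"] > 0.2, lambda s: s["X4"] >= 1.5],
    "leaf3": [lambda s: s["X3"] <= 0.2, lambda s: s["X2"] > 0.1],
    "leaf4": [lambda s: s["X3"] <= 0.2, lambda s: s["X2"] <= 0.1],
}


def satisfies(sample, path):
    """True if the sample meets every node condition on the path."""
    return all(condition(sample) for condition in path)


def leaf_means(samples, target_key="Y"):
    """Mean of the target variable over the samples falling into each non-empty leaf."""
    means = {}
    for leaf, path in LEAF_PATHS.items():
        values = [s[target_key] for s in samples if satisfies(s, path)]
        if values:
            means[leaf] = sum(values) / len(values)
    return means


def rule_flags(samples):
    """Return per-sample flags for G4 (highest-mean leaf) and G5 (lowest-mean leaf)."""
    means = leaf_means(samples)
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    flags = [
        {"G4": int(satisfies(s, LEAF_PATHS[best])),
         "G5": int(satisfies(s, LEAF_PATHS[worst]))}
        for s in samples
    ]
    return flags, best, worst
```

Each flag is 1 when the sample satisfies the path condition and 0 otherwise, matching the flag convention described above.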
FIG. 9 is a diagram showing an example of statistical information to which new variables have been added according to an embodiment of the present invention.
Referring to FIG. 9, the automatic variable generation device (100) reflects the newly generated variables (G1, G2, G3, G4, G5) (900) in the statistical information, and determines and stores the value (910) of each new variable for each sample.
FIG. 10 is a diagram showing an example of the configuration of an automatic variable generation device according to an embodiment of the present invention.
Referring to FIG. 10, the automatic variable generation device (100) includes a correlation analysis unit (1000), a candidate variable generation unit (1010), a variable generation unit (1020), and a data storage unit (1030).
The correlation analysis unit (1000) analyzes the correlation between the target variable to be predicted and each variable of the statistical information. An F-test may be used as the correlation analysis method.
The candidate variable generation unit (1010) selects, as candidate variables, a plurality of variables from among the variables of the statistical information in descending order of correlation with the target variable. The candidate variable generation unit (1010) may assign different extraction probabilities to the candidate variables in order of correlation and extract variables at random. For example, a higher extraction probability is assigned to a candidate variable with a higher correlation, so that highly correlated candidate variables are more likely to be extracted.
The variable generation unit (1020) randomly extracts a predetermined number of variables from the candidate variables and generates a new variable by combining the randomly extracted variables. The variable generation unit (1020) may generate, as a new variable, a polynomial including terms consisting of addition, multiplication, or division between the randomly extracted variables, or it may generate a new variable using a decision tree as in FIG. 8. For a new variable expressed as a polynomial, the variable generation unit (1020) may determine the bias of the polynomial and the weight of each term through regression analysis between the new variable and the target variable.
The data storage unit (1030) calculates the value of the new variable and stores it in the statistical information. When the new variable is in the form of a conditional statement generated using a decision tree, the data storage unit (1030) may indicate in the statistical information, using a predefined number or character, whether each sample satisfies each node condition included in the new variable.
The present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.
The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within a scope equivalent thereto should be construed as being included in the present invention.

Claims (10)

  1. A method for automatically generating variables for data modeling, the method comprising: analyzing a correlation between a target variable to be predicted and each variable of statistical information;
    selecting, as candidate variables, a plurality of variables from among the variables of the statistical information in descending order of correlation with the target variable;
    randomly extracting a predetermined number of variables from the candidate variables;
    generating a new variable by combining the randomly extracted variables; and
    storing the new variable in the statistical information.
  2. The method of claim 1, wherein the randomly extracting comprises:
    assigning different extraction probabilities to the candidate variables in order of correlation; and
    randomly extracting variables according to the extraction probabilities.
  3. The method of claim 1, wherein the generating of the new variable comprises:
    determining a bias of a polynomial and a weight of each term of the polynomial through regression analysis between the target variable and the polynomial, the polynomial consisting of a plurality of terms generated by combining the randomly extracted variables; and
    generating the polynomial reflecting the bias and the weights as the new variable.
  4. The method of claim 3,
    wherein the polynomial includes a term formed by addition, multiplication, or division between the randomly extracted variables.
  5. The method of claim 1,
    wherein the generating of the new variable comprises:
    generating at least one decision tree whose nodes are conditions on the respective randomly extracted variables;
    calculating an average of the target variable for each leaf of the decision tree; and
    generating, as the new variable, the node conditions on the path leading to the leaf with the highest or lowest average,
    and wherein the storing in the statistical information comprises:
    indicating in the statistical information, with a predefined number or character, whether each node condition included in the new variable is satisfied.
  6. A device for automatically generating variables, the device comprising:
    a correlation analysis unit configured to analyze, based on statistical information, a correlation between a target variable to be predicted and each variable of the statistical information;
    a candidate variable selection unit configured to select a plurality of variables as candidate variables, in descending order of correlation with the target variable, from among the variables of the statistical information;
    a variable generation unit configured to generate a new variable by combining variables randomly extracted from the candidate variables; and
    a data storage unit configured to store the new variable in the statistical information.
  7. The device of claim 6, wherein the candidate variable selection unit
    assigns different extraction probabilities to the candidate variables according to their correlation rank and extracts variables according to those probabilities.
  8. The device of claim 6, wherein the variable generation unit
    determines a bias of a polynomial and a weight of each term of the polynomial through regression analysis between the polynomial, which consists of a plurality of terms generated by combining the randomly extracted variables, and the target variable, and generates, as the new variable, the polynomial to which the bias and the weights have been applied.
  9. The device of claim 6, wherein the variable generation unit
    generates at least one decision tree whose nodes are conditions on the respective randomly extracted variables, and generates, as the new variable, the node conditions on the path leading to the leaf whose average of the target variable is the highest or lowest among the leaves of the decision tree.
  10. A computer-readable recording medium on which a program for performing the method of any one of claims 1 to 5 is recorded.
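
For illustration only, the steps recited in claims 1 to 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the applicant's implementation: the claims do not specify a correlation measure, an extraction-probability scheme, or the terms of the polynomial, so the use of Pearson correlation, linear rank weights, the terms u, v and u*v, and every function and variable name below (`pearson`, `select_candidates`, `draw_variables`, `least_squares`, `make_variable`, `generated_1`) are hypothetical choices made for this example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient (assumed correlation measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_candidates(table, target, k):
    """Claim 1: keep the k variables most correlated with the target."""
    ranked = sorted(table, key=lambda name: abs(pearson(table[name], target)),
                    reverse=True)
    return ranked[:k]


def draw_variables(candidates, n, rng):
    """Claim 2: draw n distinct candidates, favouring higher-ranked ones."""
    pool = list(candidates)
    weights = [len(pool) - i for i in range(len(pool))]  # assumed linear rank weights
    picked = []
    for _ in range(n):
        i = rng.choices(range(len(pool)), weights=weights)[0]
        picked.append(pool.pop(i))  # remove so the same variable is not drawn twice
        weights.pop(i)
    return picked


def least_squares(columns, y):
    """Fit y ~ bias + sum(w_j * column_j) via the normal equations.

    Assumes the columns are not collinear (otherwise a pivot is zero).
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    coef = [a[i][p] / a[i][i] for i in range(p)]
    return coef[0], coef[1:]


def make_variable(table, target, names):
    """Claims 3 and 4: regress the target on polynomial terms of two variables."""
    u, v = table[names[0]], table[names[1]]
    terms = [u, v, [a * b for a, b in zip(u, v)]]  # assumed terms: u, v, u*v
    bias, w = least_squares(terms, target)
    return [bias + sum(wj * t[i] for wj, t in zip(w, terms))
            for i in range(len(target))]


# Synthetic statistical information (made-up data, for demonstration only).
rng = random.Random(0)
x1 = [rng.uniform(1, 5) for _ in range(50)]
x2 = [rng.uniform(1, 5) for _ in range(50)]
noise = [rng.uniform(1, 5) for _ in range(50)]
target = [a * b for a, b in zip(x1, x2)]
table = {"x1": x1, "x2": x2, "noise": noise}

candidates = select_candidates(table, target, k=2)
picked = draw_variables(candidates, n=2, rng=rng)
table["generated_1"] = make_variable(table, target, picked)  # claim 1: store it
print(picked, pearson(table["generated_1"], target))
```

Because the generated variable is an ordinary least-squares fit that includes an intercept, its correlation with the target is never lower than that of any single variable used to build it, which is the sense in which the combination can be more informative than its inputs.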
PCT/KR2019/007409 2018-11-29 2019-06-19 Method for automatically generating variables for data modeling, and device thereof WO2020111423A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020180151225A KR101976689B1 (en) 2018-11-29 2018-11-29 Method and apparatus for automatically generating variables for data modeling
KR10-2018-0151225 2018-11-29

Publications (1)

Publication Number Publication Date
WO2020111423A1 (en) 2020-06-04

Family

ID=66546245

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/007409 WO2020111423A1 (en) 2018-11-29 2019-06-19 Method for automatically generating variables for data modeling, and device thereof

Country Status (2)

Country Link
KR (1) KR101976689B1 (en)
WO (1) WO2020111423A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599403B (en) * 2020-05-22 2023-03-14 电子科技大学 Parallel drug-target correlation prediction method based on sequencing learning
KR20230053384A (en) * 2021-10-14 2023-04-21 주식회사 솔리드웨어 Data visualization method and device
KR20230087097A (en) * 2021-12-09 2023-06-16 주식회사 카카오뱅크 Method for operating credit scoring model using two-stage logistic regression

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007329415A (en) * 2006-06-09 2007-12-20 Fujitsu Ltd Data processing method, data processing program, recording medium recording same program, and data processor
JP2014081750A (en) * 2012-10-16 2014-05-08 Hitachi Ltd Data integration and analysis system
JP2016004525A (en) * 2014-06-19 2016-01-12 株式会社日立製作所 Data analysis system and data analysis method
KR101688412B1 (en) * 2015-09-01 2016-12-21 주식회사 에스원 Method and System for Modeling Prediction of Dependent Variable
KR20180079995A (en) * 2017-01-03 2018-07-11 주식회사 데일리인텔리전스 Method for generating a program that analyzes data based on machine learning

Also Published As

Publication number Publication date
KR101976689B1 (en) 2019-05-09

Similar Documents

Publication Publication Date Title
WO2020111423A1 (en) Method for automatically generating variables for data modeling, and device thereof
US7801837B2 (en) Network analyzer
US20080097937A1 (en) Distributed method for integrating data mining and text categorization techniques
US6394263B1 (en) Autognomic decision making system and method
US6389406B1 (en) Semiotic decision making system for responding to natural language queries and components thereof
CN112270545A (en) Financial risk prediction method and device based on migration sample screening and electronic equipment
US20030177105A1 (en) Gene expression programming algorithm
US20060200436A1 (en) Gene expression programming with enhanced preservation of attributes contributing to fitness
CN111930366B (en) Rule engine implementation method and system based on JIT real-time compilation
US7991617B2 (en) Optimum design management apparatus from response surface calculation and method thereof
CN114386879B (en) Grading and ranking method and system based on multi-product multi-dimensional performance indexes
CN111683010B (en) Method and device for generating double routes based on optical cable network optical path
Liefooghe et al. Dominance, indicator and decomposition based search for multi-objective QAP: landscape analysis and automated algorithm selection
CN116702157B (en) Intelligent contract vulnerability detection method based on neural network
CN110717537B (en) Method and device for training user classification model and executing user classification prediction
binti Oseman et al. Data mining in churn analysis model for telecommunication industry
WO2018172221A1 (en) Method for computer-implemented determination of the performance of a classification model
CN115134294B (en) Method and device for determining standby route and computer readable storage medium
CN114529108B (en) Tree model based prediction method, apparatus, device, medium, and program product
KR20200121039A (en) Electronic device for generating a gene feature vector for gene distributed representation based on a correlation between genes according to cancer and operating method thereof
CN114090721B (en) Method and device for querying and updating data based on natural language data
CN111881287B (en) Classification ambiguity analysis method and device
WO2022211179A1 (en) Optimal model seeking method, and device therefor
CN115134247A (en) Node identification method and device, electronic equipment and computer readable storage medium
CN117041073A (en) Network behavior prediction method, system, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19890908

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19890908

Country of ref document: EP

Kind code of ref document: A1