KR102645454B1

KR102645454B1 - System and method for generating synthetic data automatically

Info

Publication number: KR102645454B1
Application number: KR1020230158899A
Authority: KR
Inventors: 김도현; 이미희; 예지혜; 정성규; 김용재; 성기헌; 박건웅
Original assignee: 주식회사 베가스
Priority date: 2023-11-16
Filing date: 2023-11-16
Publication date: 2024-03-08

Abstract

A system and method for automatically generating synthetic data are provided. According to embodiments of the present invention, constraints representing linear dependency, magnitude (ordering), and hierarchical relationships among a plurality of variables in the original data are detected automatically during synthetic data generation, and the original data is preprocessed and the synthetic data post-processed on the basis of the detected constraints, so that higher-quality synthetic data can be generated quickly. In addition, according to embodiments of the present invention, the generation order of the variables is determined from the results of conditional independence tests on the plurality of variables in the original data, and the synthetic data is generated in that order, so that higher-quality synthetic data can be generated quickly.

Description

System and method for automatically generating synthetic data {SYSTEM AND METHOD FOR GENERATING SYNTHETIC DATA AUTOMATICALLY}

Embodiments of the present invention relate to technology for automatically generating synthetic data from original data in order to secure training data used in various fields.

Recently, as analysis technology based on artificial intelligence (AI) and big data has advanced, efforts to secure training data have continued in various fields. However, the original data underlying such training data contains a large amount of sensitive information, such as an individual's illnesses and salary, which makes it difficult to use as is.

Accordingly, de-identified data has conventionally been created and used by grouping the original data on the basis of specific data within it, or by replacing (masking) specific data in the original data with blanks. In this case, however, information in the original data is lost, so the de-identified data fails to represent the original data.

More recently, to address this problem with de-identified data, synthetic data generated with statistical techniques or neural networks has increasingly been used. With these generation techniques, however, an administrator must check the interrelationships among the variables in the synthetic data one by one and then correct outliers, missing values, and the like, so generating synthetic data takes a long time and its quality suffers. Moreover, the administrator must arbitrarily specify the order of the variables input to the synthetic data generation model, which makes it difficult to generate optimal synthetic data.

Korean Registered Patent Publication No. 10-2560468 (2023.07.24)

Embodiments of the present invention are intended to provide a means of generating high-quality synthetic data more quickly while automating the generation process.

According to an exemplary embodiment, there is provided a system for automatically generating synthetic data, which automatically generates, from original data including data on a plurality of different variables, synthetic data having statistical characteristics corresponding to those of the original data, the system comprising: a constraint detection unit that detects a first constraint representing a linear dependency among the plurality of variables in the original data, and a second constraint representing a magnitude relationship between combinations of the plurality of variables in the original data and a hierarchical relationship among the plurality of variables in the original data; a preprocessing unit that removes from the original data, on the basis of the first constraint, any variable for which the linear dependency is detected; a generation order determination unit that performs a conditional independence test on the plurality of variables included in the original data and determines, on the basis of the test results, the generation order of the variables for generating the synthetic data; a synthetic data generation unit that generates the synthetic data by sequentially inputting data on the remaining variables, excluding the variables removed from the original data, into a synthetic data generation model configured according to the generation order; and a post-processing unit that restores, on the basis of the first constraint, the variables removed by the preprocessing unit and removes, on the basis of the second constraint, outliers in the synthetic data.

The constraint detection unit may perform QR decomposition on the plurality of variables in the original data and detect the first constraint, which represents the linear dependency, on the basis of the values of the diagonal elements of the upper triangular matrix obtained from the QR decomposition.

The upper triangular matrix is an N × N matrix. The constraint detection unit may determine the variable corresponding to the n-th (where n ≤ N) row and column, whose diagonal element of the upper triangular matrix has a value of 0, as the variable for which the linear dependency is detected; derive from the upper triangular matrix n-1 linear equations having that variable as the dependent variable; and calculate from the n-1 linear equations the coefficients that satisfy the linear dependency, thereby detecting the first constraint.
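The QR-based detection described above can be sketched as follows. This is a minimal illustration that assumes NumPy is available, that the data has more rows than columns, and that a small relative tolerance stands in for the exact "diagonal value of 0" test. The function name `detect_linear_dependencies`, the tolerance, and the sample values are hypothetical and do not come from the patent.

```python
import numpy as np

def detect_linear_dependencies(data, rel_tol=1e-10):
    """Map each linearly dependent column index to its coefficients
    over the columns that precede it (0-based indices)."""
    a = np.asarray(data, dtype=float)
    _, r = np.linalg.qr(a)                 # r: N x N upper triangular factor
    diag = np.abs(np.diag(r))
    tol = rel_tol * diag.max()             # assumed stand-in for "equals 0"
    found = {}
    for n in range(1, a.shape[1]):
        if diag[n] <= tol:
            # The first n rows of column n give n equations in the
            # coefficients c: R[:n, :n] @ c = R[:n, n].
            coeffs, *_ = np.linalg.lstsq(r[:n, :n], r[:n, n], rcond=None)
            found[n] = coeffs
    return found

# Hypothetical data: the fourth column is built as 2*X1 - X2 + X3.
x = np.array([[1, 2, 3], [4, 1, 0], [2, 5, 1], [0, 3, 7], [6, 2, 2]], dtype=float)
x4 = 2 * x[:, 0] - x[:, 1] + x[:, 2]
table = np.column_stack([x, x4])
deps = detect_linear_dependencies(table)
```

Here `deps` flags only the fourth column, and its coefficient vector recovers the relationship used to build it. Real data would need a tolerance chosen to suit its scale and noise.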

The constraint detection unit may extract from the original data a plurality of combinations of no more than a set number of variables, detect the magnitude relationship between the combinations on the basis of the minimum and maximum values of each extracted combination, and learn from the original data the hierarchical relationship among no more than a set number of variables.
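One way to read the magnitude-relationship step is sketched below. The text does not spell out how the minimum and maximum of each combination are compared, so this sketch assumes a row-wise rule: combination S is recorded as "less than or equal to" combination T when, in every row, the largest value in S does not exceed the smallest value in T. The function name, the variable names, and the sample rows are hypothetical.

```python
from itertools import combinations

def detect_order_constraints(rows, names, max_size=2):
    """Return pairs (S, T) of disjoint variable combinations such that
    max(S) <= min(T) holds in every row (assumed interpretation)."""
    indices = range(len(names))
    combos = [c for k in range(1, max_size + 1) for c in combinations(indices, k)]
    found = []
    for left in combos:
        for right in combos:
            if set(left) & set(right):
                continue  # combinations must not share a variable
            if all(max(r[i] for i in left) <= min(r[j] for j in right) for r in rows):
                found.append((tuple(names[i] for i in left),
                              tuple(names[j] for j in right)))
    return found

names = ["hire_age", "current_age", "retirement_age"]
rows = [(20, 25, 60), (30, 30, 65), (18, 40, 62)]
found = detect_order_constraints(rows, names)
```

The number of candidate pairs grows quickly with the number of variables, which is presumably why the combination size is capped by a set number.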

The constraint detection unit may repeatedly perform random sampling on the original data, determine for each randomly sampled data set whether the first constraint and the second constraint are detected, and adopt the first constraint and the second constraint as final constraints only when the rate at which each is detected is equal to or greater than a threshold.
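The repeated-sampling check can be sketched as follows. This is an illustration only: the number of rounds, the sample size, the 0.8 threshold, and the helper names `stable_constraint` and `a_le_b` are assumptions, not values from the patent.

```python
import random

def stable_constraint(rows, holds, n_rounds=200, sample_size=10,
                      threshold=0.8, seed=0):
    """Accept a constraint if it holds in at least `threshold` of the
    random subsamples drawn from `rows`."""
    rng = random.Random(seed)
    hits = sum(holds(rng.sample(rows, sample_size)) for _ in range(n_rounds))
    ratio = hits / n_rounds
    return ratio >= threshold, ratio

def a_le_b(sample):
    """Candidate constraint: the first variable never exceeds the second."""
    return all(a <= b for a, b in sample)

# 99 rows satisfy a <= b; one outlier row violates it.
rows = [(i, i + 1) for i in range(99)] + [(500, 1)]
accepted, ratio = stable_constraint(rows, a_le_b)
```

Checked against the full data set, the constraint fails because of the single outlier, yet most subsamples miss that row, so the constraint is still adopted. This is the behavior the threshold is meant to provide when the original data contains outliers.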

The generation order determination unit may obtain conditional independence information and separating sets for the plurality of variables by performing the conditional independence test, derive a CPDAG (Completed Partially Directed Acyclic Graph) for the plurality of variables on the basis of the conditional independence information and the separating sets, and use the CPDAG to determine the generation order of the variables for generating the synthetic data.
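The text does not say which conditional independence test is used. As one common choice for continuous variables, the sketch below applies a Fisher z-test to the partial correlation given a single conditioning variable. The function names and the simulated chain X → Z → Y are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def ci_p_value(x, y, z=None):
    """Two-sided p-value for 'x independent of y' (optionally given z)."""
    r, k = pearson(x, y), 0
    if z is not None:
        rxz, ryz = pearson(x, z), pearson(y, z)
        r = (r - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
        k = 1
    r = max(-0.999999, min(0.999999, r))   # keep atanh finite
    stat = math.sqrt(len(x) - k - 3) * abs(math.atanh(r))
    return 2 * (1 - NormalDist().cdf(stat))

# Simulated chain X -> Z -> Y: X and Y are dependent, but independent given Z.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
z = [xi + rng.gauss(0, 1) for xi in x]
y = [zi + rng.gauss(0, 1) for zi in z]
p_marginal = ci_p_value(x, y)
p_given_z = ci_p_value(x, y, z)
```

In a constraint-based structure search, a large p-value given Z would remove the X–Y edge and record {Z} as the separating set for that pair. The orientation rules that turn the resulting skeleton into a CPDAG are not shown here.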

Each vertex in the CPDAG represents one of the plurality of variables. The generation order determination unit may calculate, on the CPDAG, the number of edges needed to reach each of the remaining vertices from a specific vertex, counting the number of remaining vertices that cannot be reached from the specific vertex as a first NA count; may calculate, on the CPDAG, the number of edges needed to reach the specific vertex from each of the remaining vertices, counting the number of remaining vertices from which the specific vertex cannot be reached as a second NA count; and may determine the generation order of the variables for generating the synthetic data on the basis of the first NA count and the second NA count.

The generation order determination unit may determine the generation order such that the variable corresponding to the specific vertex is generated earlier in the synthetic data generation process the smaller its first NA count and the larger its second NA count.
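The NA-count ordering can be sketched on a small directed graph. The text does not state how the two counts are combined, so this sketch assumes a lexicographic rule: sort by the first NA count ascending, then by the second NA count descending. It also assumes all edges are already directed. The graph and the function names are hypothetical.

```python
from collections import deque

def reachable(graph, start):
    """Vertices reachable from `start` along directed edges (breadth-first)."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

def generation_order(graph):
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    reach = {v: reachable(graph, v) for v in nodes}
    # First NA count: other vertices that cannot be reached from v.
    first_na = {v: len(nodes) - 1 - len(reach[v]) for v in nodes}
    # Second NA count: other vertices from which v cannot be reached.
    second_na = {v: sum(1 for u in nodes if u != v and v not in reach[u])
                 for v in nodes}
    order = sorted(nodes, key=lambda v: (first_na[v], -second_na[v]))
    return order, first_na, second_na

# Hypothetical graph: A -> B -> C and A -> D.
graph = {"A": ["B", "D"], "B": ["C"]}
order, first_na, second_na = generation_order(graph)
```

For this graph the root A reaches every other vertex and is reached by none, so it is generated first, and the resulting order places every variable after its ancestors.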

The post-processing unit may determine data in the synthetic data that does not satisfy the second constraint to be an outlier and remove the outlier.
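A minimal sketch of this outlier removal follows. It assumes the second constraint consists of one hierarchical relationship, modeled as the set of parent/child value pairs seen in the original data, and one magnitude relationship between two numeric variables. The column names and records are hypothetical.

```python
def learn_hierarchy(original_rows, parent, child):
    """Collect the (parent, child) value pairs that occur in the original data."""
    return {(row[parent], row[child]) for row in original_rows}

def remove_outliers(synthetic_rows, allowed_pairs, parent, child, lower, upper):
    """Keep only rows that satisfy both the hierarchy and lower <= upper."""
    return [row for row in synthetic_rows
            if (row[parent], row[child]) in allowed_pairs
            and row[lower] <= row[upper]]

original = [
    {"province": "Gyeonggi", "city": "Suwon", "children": 1, "family": 3},
    {"province": "Gangwon", "city": "Chuncheon", "children": 0, "family": 2},
]
synthetic = [
    {"province": "Gyeonggi", "city": "Suwon", "children": 2, "family": 4},
    {"province": "Gangwon", "city": "Suwon", "children": 1, "family": 3},      # city not in province
    {"province": "Gangwon", "city": "Chuncheon", "children": 5, "family": 2},  # children > family
]
allowed = learn_hierarchy(original, "province", "city")
cleaned = remove_outliers(synthetic, allowed, "province", "city", "children", "family")
```

Only the first synthetic record survives: the second violates the learned hierarchy and the third violates the magnitude relationship.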

According to another exemplary embodiment, there is provided a method for automatically generating synthetic data, which automatically generates, from original data including data on a plurality of different variables, synthetic data having statistical characteristics corresponding to those of the original data, the method comprising: detecting, by a constraint detection unit, a first constraint representing a linear dependency among the plurality of variables in the original data; detecting, by the constraint detection unit, a second constraint representing a magnitude relationship between combinations of the plurality of variables in the original data and a hierarchical relationship among the plurality of variables in the original data; removing from the original data, by a preprocessing unit and on the basis of the first constraint, any variable for which the linear dependency is detected; performing, by a generation order determination unit, a conditional independence test on the plurality of variables included in the original data; determining, by the generation order determination unit and on the basis of the results of the conditional independence test, the generation order of the variables for generating the synthetic data; generating, by a synthetic data generation unit, the synthetic data by sequentially inputting data on the remaining variables, excluding the variables removed from the original data, into a synthetic data generation model configured according to the generation order; restoring, by a post-processing unit and on the basis of the first constraint, the variables removed by the preprocessing unit; and removing, by the post-processing unit and on the basis of the second constraint, outliers in the synthetic data.

The step of detecting the first constraint may include performing QR decomposition on the plurality of variables in the original data and detecting the first constraint, which represents the linear dependency, on the basis of the values of the diagonal elements of the upper triangular matrix obtained from the QR decomposition.

The upper triangular matrix is an N × N matrix. The step of detecting the first constraint may include determining the variable corresponding to the n-th (where n ≤ N) row and column, whose diagonal element of the upper triangular matrix has a value of 0, as the variable for which the linear dependency is detected; deriving from the upper triangular matrix n-1 linear equations having that variable as the dependent variable; and calculating from the n-1 linear equations the coefficients that satisfy the linear dependency, thereby detecting the first constraint.

The step of detecting the second constraint may include extracting from the original data a plurality of combinations of no more than a set number of variables, detecting the magnitude relationship between the combinations on the basis of the minimum and maximum values of each extracted combination, and learning from the original data the hierarchical relationship among no more than a set number of variables.

The method for automatically generating synthetic data may further include, after the step of detecting the first constraint and the step of detecting the second constraint: repeatedly performing, by the constraint detection unit, random sampling on the original data; and determining, by the constraint detection unit, for each randomly sampled data set whether the first constraint and the second constraint are detected, and adopting the first constraint and the second constraint as final constraints only when the rate at which each is detected is equal to or greater than a threshold.

The step of determining the generation order of the variables for generating the synthetic data may include obtaining conditional independence information and separating sets for the plurality of variables by performing the conditional independence test, deriving a CPDAG (Completed Partially Directed Acyclic Graph) for the plurality of variables on the basis of the conditional independence information and the separating sets, and using the CPDAG to determine the generation order of the variables for generating the synthetic data.

Each vertex in the CPDAG represents one of the plurality of variables. The step of determining the generation order of the variables for generating the synthetic data may include calculating, on the CPDAG, the number of edges needed to reach each of the remaining vertices from a specific vertex, counting the number of remaining vertices that cannot be reached from the specific vertex as a first NA count; calculating, on the CPDAG, the number of edges needed to reach the specific vertex from each of the remaining vertices, counting the number of remaining vertices from which the specific vertex cannot be reached as a second NA count; and determining the generation order of the variables for generating the synthetic data on the basis of the first NA count and the second NA count.

The step of determining the generation order of the variables for generating the synthetic data may include determining the generation order such that the variable corresponding to the specific vertex is generated earlier in the synthetic data generation process the smaller its first NA count and the larger its second NA count.

The step of removing outliers in the synthetic data may include determining data in the synthetic data that does not satisfy the second constraint to be an outlier and removing the outlier.

According to embodiments of the present invention, constraints representing linear dependency, magnitude, and hierarchical relationships among a plurality of variables in the original data are detected automatically during synthetic data generation, and the original data is preprocessed and the synthetic data post-processed on the basis of the detected constraints, so that higher-quality synthetic data can be generated quickly.

In addition, according to embodiments of the present invention, the generation order of the variables is determined from the results of conditional independence tests on the plurality of variables in the original data, and the synthetic data is generated in that order, so that higher-quality synthetic data can be generated quickly.

The detection of constraints, the preprocessing, the determination of the generation order of the variables, the generation of the synthetic data, and the post-processing are all performed by automated processes, so the entire sequence of synthetic data generation can be automated.

Figure 1 is a schematic diagram illustrating original data, de-identified data, and synthetic data according to an embodiment of the present invention.
Figure 2 is an example of original data according to an embodiment of the present invention.
Figure 3 is an example of de-identified data according to the prior art.
Figure 4 is an example of synthetic data according to an embodiment of the present invention.
Figure 5 is a block diagram showing the detailed configuration of a system for automatically generating synthetic data according to an embodiment of the present invention.
Figure 6 is an example showing the process by which the first constraint detection unit detects the first constraint according to an embodiment of the present invention.
Figure 7 is an example showing the process by which the second constraint detection unit detects the magnitude relationship of the second constraint according to an embodiment of the present invention.
Figure 8 is an example showing the process by which the second constraint detection unit detects the hierarchical relationship of the second constraint according to an embodiment of the present invention.
Figure 9 is an example showing the process of detecting constraints when the original data contains outliers according to an embodiment of the present invention.
Figure 10 is an example showing a separating set according to an embodiment of the present invention.
Figure 11 is an example showing a UAG (Undirected Acyclic Graph) according to an embodiment of the present invention.
Figure 12 is an example showing a CPDAG (Completed Partially Directed Acyclic Graph) according to an embodiment of the present invention.
Figure 13 is an example showing a matrix corresponding to the CPDAG according to an embodiment of the present invention.
Figure 14 is a flowchart illustrating a method for automatically generating synthetic data according to an embodiment of the present invention.
Figure 15 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.

Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to aid a comprehensive understanding of the methods, devices, and/or systems described herein. However, it is only an example, and the present invention is not limited thereto.

In describing the embodiments of the present invention, a detailed description of known technology related to the present invention will be omitted where it is judged that it could unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users or operators; their definitions should therefore be based on the content of this specification as a whole. The terminology used in the detailed description is only for describing embodiments of the present invention and should in no way be limiting. Unless clearly used otherwise, singular expressions include the plural. In this description, expressions such as "include" or "have" are intended to indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed to exclude the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

Figure 1 is a schematic diagram illustrating original data, de-identified data, and synthetic data according to an embodiment of the present invention. Figure 2 is an example of original data according to an embodiment of the present invention, Figure 3 is an example of de-identified data according to the prior art, and Figure 4 is an example of synthetic data according to an embodiment of the present invention.

Referring to Figure 1, de-identified data and synthetic data can be generated from original data according to an embodiment of the present invention. Here, the original data is the base data for generating the synthetic data according to an embodiment of the present invention and may include data on a plurality of different variables.

As shown in Figure 2, the original data may include, for example, data on a data ID, area code, age, gender, disease, and salary. In the present embodiments, the data ID, area code, age, gender, disease, and salary are each referred to as a "variable," and values such as 13053, 28, A, prostatitis, and 30 are each referred to as "data" for the corresponding variable. The types of variables included in the original data are not particularly limited; in addition to the items shown in Figure 2, they may further include height, weight, number of children, number of family members, place of residence, and so on.

Among the variables, those that can be expressed numerically, such as area code, age, salary, height, number of children, and number of family members, may be referred to as "numeric variables (quantitative variables)." Among numeric variables, those that can be expressed as continuous values, such as height and weight, may be referred to as "continuous variables," and those that can be expressed as discrete values, such as age, number of children, and number of family members, may be referred to as "discrete variables." Variables that are expressed as categories rather than numbers, such as disease and place of residence, may be referred to as "categorical variables (qualitative variables)."

Because such original data may contain personal, that is, sensitive, information such as an individual's illnesses and salary, de-identified data has conventionally been created and used by grouping the original data on the basis of specific data within it or by replacing (masking) specific data in the original data with blanks.

As shown in Figure 3, the original data can be replaced with de-identified data by grouping the data with area code 13053 as 1305* and the data with area code 13068 as 1306*. In this process, however, information in the original data is lost, so the de-identified data fails to represent the original data.
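The conventional grouping and masking described above can be illustrated with a short sketch. The helper names, the number of characters kept, and the sample record are hypothetical; only the area codes 13053 and 13068 come from the text.

```python
def generalize(code, keep=4):
    """Group a code by keeping its first `keep` characters and starring the rest."""
    return code[:keep] + "*" * (len(code) - keep)

def mask(record, fields):
    """Replace the listed fields with blanks."""
    return {key: ("" if key in fields else value) for key, value in record.items()}

record = {"area_code": "13053", "age": 28, "disease": "prostatitis"}
deidentified = mask({**record, "area_code": generalize(record["area_code"])},
                    fields={"disease"})
```

After this step the exact area code and the disease are no longer recoverable, which is the information loss the text points to.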

For this reason, as shown in Figure 4, synthetic data having statistical characteristics corresponding to those of the original data has increasingly been generated and used. In the present embodiments, synthetic data refers to artificially generated data whose statistical characteristics (or distribution) fall within a set error range (for example, within 5%) of those of the original data. As an example, if the statistical distribution of gender in the original data is male:female = 5:5, the distribution of gender in the synthetic data may also be male:female = 5:5. Such synthetic data is highly useful: it can be used to provide various statistically useful information or as training data for providing various services.
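The "within a set error range" criterion can be checked for a categorical variable along the following lines. The text does not define how the 5% is measured, so this sketch assumes it means an absolute difference of at most 0.05 in each category's share. The function names and sample values are hypothetical.

```python
from collections import Counter

def shares(values):
    """Share of each category among the values."""
    counts = Counter(values)
    return {key: count / len(values) for key, count in counts.items()}

def within_tolerance(original, synthetic, tol=0.05):
    """True if every category's share differs by at most `tol` (assumed rule)."""
    p, q = shares(original), shares(synthetic)
    return all(abs(p.get(k, 0.0) - q.get(k, 0.0)) <= tol for k in set(p) | set(q))

original_gender = ["M"] * 50 + ["F"] * 50
close_synthetic = ["M"] * 52 + ["F"] * 48
skewed_synthetic = ["M"] * 70 + ["F"] * 30
```

Here `close_synthetic` passes the check against `original_gender` while `skewed_synthetic` does not.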

Hereinafter, an automatic reproduction data generation system 100, which automates the process of generating reproduction data while generating high-quality reproduction data more quickly, will be described in detail with reference to FIGS. 5 to 13.

FIG. 5 is a block diagram showing the detailed configuration of the automatic reproduction data generation system 100 according to an embodiment of the present invention.

As shown in FIG. 5, the automatic reproduction data generation system 100 according to an embodiment of the present invention includes a constraint detection unit 110, a preprocessing unit 120, a generation order determination unit 130, a reproduction data generation unit 140, and a post-processing unit 150.

The constraint detection unit 110 is a module that detects constraints between variables in the original data, and may include a first constraint detection unit 102 and a second constraint detection unit 104. In the present embodiments, a constraint between variables is a condition expressing a numerical or categorical association that the variables satisfy, and may be, for example, a linear dependency relationship, a magnitude relationship, or a hierarchical (or inclusion) relationship between variables.

The first constraint detection unit 102 detects a first constraint indicating a linear dependency relationship among a plurality of variables in the original data. In the present embodiments, a linear dependency relationship means a relationship in which one variable is defined as a linear combination of other variables. As an example, a linear dependency relationship may be expressed as X₄ = 2X₁ - X₂ + X₃ (where X₁, X₂, X₃, and X₄ each denote different variables). One or more such linear dependency relationships may exist in the original data. In addition, the number of variables forming a linear dependency relationship is not particularly limited.

As will be described later, the first constraint detection unit 102 may perform QR decomposition on the plurality of variables in the original data and detect the first constraint indicating the linear dependency relationship based on the values of the diagonal entries of the upper triangular matrix obtained from the QR decomposition.

The second constraint detection unit 104 detects a second constraint indicating a magnitude relationship between combinations of a plurality of variables in the original data and a hierarchical relationship among a plurality of variables in the original data. In the present embodiments, a magnitude relationship means a relationship indicating which of two combinations of variables in the original data is larger or smaller, for example X < Y, X < Y + Z, or X > Y + Z (where X, Y, and Z each denote different variables). A hierarchical relationship means a relationship representing a hierarchical structure among a plurality of variables in the original data, such as construction ⊃ general construction ⊃ building construction (where construction, general construction, and building construction are each different variables).

As will be described later, the second constraint detection unit 104 may detect the second constraint indicating the magnitude relationship and the hierarchical relationship by extracting, from the original data, a plurality of combinations each consisting of no more than a set number of variables, detecting the magnitude relationship between the extracted combinations based on the minimum and maximum values of each combination, and learning the hierarchical relationship among no more than a set number of variables from the original data.

Meanwhile, the first constraint detection unit 102 and the second constraint detection unit 104 may repeatedly perform random sampling on the original data, determine for each random sample whether the first constraint and the second constraint are detected, and adopt the first constraint and the second constraint as final constraints only when the rate at which each is detected is equal to or greater than a threshold. When outliers exist in the original data, the first and second constraints described above may not be detected correctly. In the present invention, therefore, only a portion of the data in the original data is randomly sampled, constraint detection is repeated on the samples, and a constraint is adopted as a final constraint only when it is detected at or above a certain rate.

As an example, the first constraint detection unit 102 may perform random sampling on the original data 1,000 times, determine for each random sample whether the first constraint is detected, and adopt the first constraint as a final constraint only when the detection rate is 95% or more (that is, 950 times or more).

As another example, the second constraint detection unit 104 may perform random sampling on the original data 500 times, determine for each random sample whether the second constraint is detected, and adopt the second constraint as a final constraint only when the detection rate is 90% or more (that is, 450 times or more).
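The repeated-sampling rule above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the `sample_size` parameter, and the toy detector `x_less_than_y` are all hypothetical, and the source does not say how large each random sample is. Under this reading, each sample must be small relative to the data so that most samples miss the outlier.

```python
import random

def detection_rate(rows, detector, n_trials=1000, sample_size=4, seed=0):
    """Fraction of random samples (without replacement) on which `detector` succeeds."""
    rng = random.Random(seed)
    hits = sum(bool(detector(rng.sample(rows, sample_size))) for _ in range(n_trials))
    return hits / n_trials

def is_final_constraint(rows, detector, threshold=0.95, **kwargs):
    """Adopt the constraint only if it is detected in at least `threshold` of the trials."""
    return detection_rate(rows, detector, **kwargs) >= threshold

def x_less_than_y(sample):
    """Toy detector (assumption): the magnitude relationship X < Y holds on every sampled row."""
    return all(x < y for x, y in sample)

clean = [(i, i + 1) for i in range(199)]
with_outlier = clean + [(73, 0)]   # a single row violating X < Y
half_violating = [(i, i + 1) for i in range(100)] + [(i + 1, i) for i in range(100)]

print(is_final_constraint(with_outlier, x_less_than_y))    # constraint survives one outlier
print(is_final_constraint(half_violating, x_less_than_y))  # constraint is rejected
```

With one outlier in 200 rows and samples of 4 rows, about 98% of samples exclude the outlier, so the constraint clears the 95% threshold. A constraint violated by half of the rows is detected in only a few percent of samples.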

The preprocessing unit 120 removes from the original data the variable for which the linear dependency relationship was detected, based on the first constraint. As an example, when the first constraint detection unit 102 detects the linear dependency relationship X₄ = 2X₁ - X₂ + X₃ (where X₁, X₂, X₃, and X₄ each denote different variables), the preprocessing unit 120 may remove the variable X₄ for which the linear dependency relationship was detected (that is, the dependent variable) from the original data A. Hereinafter, the original data from which the dependent variable (for example, variable X₄) has been removed, that is, the filtered original data, is referred to as A' for convenience.

As will be described later, the reproduction data generation unit 140 may generate reproduction data by sequentially inputting data for at least one of the plurality of variables included in the original data into a set reproduction data generation model. If data for all of the variables included in the original data were input into the reproduction data generation model, the quality of the reproduction data could deteriorate; for example, the variables in the generated reproduction data might not satisfy the linear dependency relationship described above. In addition, the large number of input variables could make generation take an excessive amount of time. In the present invention, therefore, only the data for the variables remaining after the removed variable is excluded are input into the reproduction data generation model, which improves the quality of the reproduction data and minimizes the time required to generate it.

The generation order determination unit 130 performs a conditional independence test on the plurality of variables included in the original data and, based on the result of the conditional independence test, determines the order in which the variables are generated when producing the reproduction data. In general, even when reproduction data is generated from the same variables, its quality may vary depending on the order in which the variables are input into the reproduction data generation model. In particular, the quality of the reproduction data tends to deteriorate further when variables having a relatively small influence on other variables are generated first. In the present invention, therefore, among the plurality of variables included in the original data, variables having a greater influence on other variables are generated first, so that high-quality reproduction data can be generated. A specific method by which the generation order determination unit 130 determines the generation order of the variables will be described in detail later with reference to FIGS. 10 to 13.

The reproduction data generation unit 140 generates reproduction data by sequentially inputting, in the determined generation order, the data for the variables remaining after the removed variable is excluded from the plurality of variables in the original data into the set reproduction data generation model. That is, the reproduction data generation unit 140 generates reproduction data from the filtered original data, inputting the data for the variables in the filtered original data into the reproduction data generation model in the generation order determined by the generation order determination unit 130. The reproduction data generated in this way is referred to as B' for convenience.

Here, the reproduction data generation model may be a model including software such as synthpop (https://www.synthpop.org.uk/), sms (https://cran.r-project.org/web/packages/sms/index.html), or simPop (https://cran.r-project.org/web/packages/simPop/index.html), but is not particularly limited thereto.

The post-processing unit 150 post-processes the reproduction data B' generated by the reproduction data generation unit 140, based on the first constraint and the second constraint described above.

First, the post-processing unit 150 may restore the variable removed by the preprocessing unit 120, based on the first constraint. In the above example, the post-processing unit 150 may restore the variable removed by the preprocessing unit 120, that is, X₄. Since the post-processing unit 150 knows the linear dependency relationship among X₁, X₂, X₃, and X₄, it can restore X₄ from that relationship.
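The removal of the dependent variable during preprocessing and its restoration during post-processing can be sketched as below. The helper names and the dictionary-based row representation are assumptions made for illustration; only the relationship X₄ = 2X₁ - X₂ + X₃ comes from the text.

```python
def drop_dependent(rows, dependent):
    """Preprocessing: remove the dependent variable from every row (A -> A')."""
    return [{k: v for k, v in row.items() if k != dependent} for row in rows]

def restore_dependent(rows, dependent, coeffs):
    """Post-processing: recompute the dependent variable from the stored linear relationship."""
    return [
        {**row, dependent: sum(c * row[name] for name, c in coeffs.items())}
        for row in rows
    ]

# First constraint from the text: X4 = 2*X1 - X2 + X3.
coeffs = {"X1": 2, "X2": -1, "X3": 1}

original = [{"X1": 1, "X2": 2, "X3": 3, "X4": 3}]
filtered = drop_dependent(original, "X4")

# Stand-in for rows produced by the generation model (hypothetical values).
generated = [{"X1": 4, "X2": 1, "X3": 0}]
restored = restore_dependent(generated, "X4", coeffs)
print(filtered, restored)
```

Because X₄ is recomputed rather than generated, the restored rows satisfy the linear dependency relationship exactly.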

Next, the post-processing unit 150 may remove outliers from the reproduction data based on the second constraint. Specifically, the post-processing unit 150 may determine that data in the reproduction data that does not satisfy the second constraint is an outlier, and remove that outlier. As an example, the post-processing unit 150 may determine that data that does not satisfy the second constraint, such as "Bundang-gu, Seoul", is an outlier and remove it from the reproduction data B'. The reproduction data for which post-processing by the post-processing unit 150 has been completed is referred to as B for convenience.
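A minimal sketch of this outlier removal is shown below. It assumes the second constraint is stored as a set of permitted (upper level, lower level) pairs; that representation and the names `allowed_pairs` and `remove_outliers` are hypothetical.

```python
def remove_outliers(rows, allowed_pairs, parent, child):
    """Keep only rows whose (parent, child) combination satisfies the stored constraint."""
    return [row for row in rows if (row[parent], row[child]) in allowed_pairs]

# Assumed form of a learned hierarchical constraint.
allowed_pairs = {("Seoul", "Gangnam-gu"), ("Seongnam-si", "Bundang-gu")}

b_prime = [
    {"city": "Seoul", "district": "Gangnam-gu"},
    {"city": "Seoul", "district": "Bundang-gu"},      # violates the constraint
    {"city": "Seongnam-si", "district": "Bundang-gu"},
]
b = remove_outliers(b_prime, allowed_pairs, "city", "district")
print(b)
```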

FIG. 6 is an example showing a process in which the first constraint detection unit 102 according to an embodiment of the present invention detects the first constraint.

As described above, the first constraint detection unit 102 may perform QR decomposition on the plurality of variables in the original data and detect the first constraint indicating the linear dependency relationship based on the values of the diagonal entries of the upper triangular matrix obtained from the QR decomposition. Here, QR decomposition means the process of expressing a real matrix as the product of an orthogonal matrix and an upper triangular matrix.

FIG. 6(a) shows an example of a matrix of data for a plurality of variables in the original data before QR decomposition, and FIG. 6(b) shows an example of the result of decomposing the matrix shown in (a) into an orthogonal matrix and an upper triangular matrix by QR decomposition.

When QR decomposition is performed on the matrix of data for the variables X₁, X₂, X₃, and X₄ in the original data given as in FIG. 6(a), the matrix may be decomposed into the product of an orthogonal matrix and an upper triangular matrix as in FIG. 6(b). Here, the upper triangular matrix may be an N × N matrix (for example, a 4 × 4 matrix).

At this time, the first constraint detection unit 102 may determine that the variable corresponding to the n-th (where n ≤ N) row and column whose diagonal entry in the upper triangular matrix is 0 is the variable for which a linear dependency relationship has been detected.

Referring to FIG. 6(b), the diagonal entries of the upper triangular matrix are -10, -4, 5, and 0. Accordingly, the first constraint detection unit 102 may determine that the variable corresponding to the fourth row and column, whose diagonal entry is 0, that is, X₄, is the variable for which a linear dependency relationship has been detected.

Thereafter, the first constraint detection unit 102 may detect the first constraint by deriving, from the upper triangular matrix, n-1 linear equations having the variable for which the linear dependency relationship was detected as the dependent variable, and then calculating from the n-1 linear equations the coefficients that satisfy the linear dependency relationship.

In the above example, the first constraint detection unit 102 may detect the first constraint by deriving three linear equations having X₄, the variable for which the linear dependency relationship was detected, as the dependent variable, as shown below, and then calculating from the three linear equations the coefficients (a, b, c) that satisfy the linear dependency relationship.

<Linear dependency relationship to be detected>

X₄ = a*X₁ + b*X₂ + c*X₃

(where a, b, and c are the coefficients of X₁, X₂, and X₃, respectively, that satisfy the linear dependency relationship)

Generalized as an equation, this is as follows.

Xₙ = a*X₁ + b*X₂ + c*X₃ + … + k*Xₙ₋₁

(where a, b, c, …, k are the coefficients of X₁, X₂, X₃, …, Xₙ₋₁, respectively, that satisfy the linear dependency relationship)

<Linear equations>

-21 = a*(-10) + b*(-3) + c*(-4)

3 = a*(0) + b*(-4) + c*(-1)

5 = a*(0) + b*(0) + c*(5)

<Coefficients of the linear equations>

a = 2

b = -1

c = 1

<Detected linear dependency relationship>

X₄ = 2X₁ - X₂ + X₃
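The procedure above can be sketched with NumPy. The matrix `r_example` is reconstructed from the three linear equations and the stated diagonal (-10, -4, 5, 0), since FIG. 6 itself is not reproduced here. The second half applies the same steps to randomly generated data. The function name, the relative tolerance `1e-8` (used because a floating-point QR rarely yields an exact 0), and the restriction to a single dependent column are assumptions of this sketch.

```python
import numpy as np

def dependency_coefficients(r, n):
    """Solve the n equations R[:n, :n] @ coef = R[:n, n] for the dependent column n (0-based)."""
    return np.linalg.solve(r[:n, :n], r[:n, n])

# Upper triangular matrix implied by the three linear equations in the text.
r_example = np.array([
    [-10.0, -3.0, -4.0, -21.0],
    [0.0, -4.0, -1.0, 3.0],
    [0.0, 0.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
])
coef_example = dependency_coefficients(r_example, 3)  # coefficients (a, b, c)

# Same steps on generated data in which X4 = 2*X1 - X2 + X3 holds by construction.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
data = np.column_stack([x, 2 * x[:, 0] - x[:, 1] + x[:, 2]])

_, r = np.linalg.qr(data)                       # r is the 4 x 4 upper triangular factor
diag = np.abs(np.diag(r))
dependent = np.flatnonzero(diag <= 1e-8 * diag.max())  # diagonal entries that are (numerically) 0
coef = dependency_coefficients(r, int(dependent[0]))

print(coef_example, dependent, coef)
```

Because the coefficient block `R[:n, :n]` is itself upper triangular, the equations can be solved by back-substitution starting from the last one, which matches the order c, b, a in which the worked example resolves.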

Meanwhile, FIG. 6 has been described, for convenience, as if one first constraint were detected, but this is merely an example and a plurality of first constraints may be detected. If a plurality of first constraints exist in the original data, a plurality of 0 values may also appear among the diagonal entries of the upper triangular matrix. That is, the number of 0 values among the diagonal entries of the upper triangular matrix indicates the number of first constraints.

As an example, when 0 values appear in both the fourth and the fifth rows and columns of the upper triangular matrix, the first constraint detection unit 102 may detect two first constraints indicating the following linear dependency relationships.

X₄ = a*X₁ + b*X₂ + c*X₃

X₅ = d*X₁ + e*X₂ + f*X₃ + g*X₄

(where d, e, f, and g are the coefficients of X₁, X₂, X₃, and X₄, respectively, that satisfy the linear dependency relationship)

In addition, when at least one of the variables is substituted with another value using a specific formula, dependency relationships between variables that are in a nonlinear relationship can also be detected.

As an example, assuming that the multiplicative relationship X₁ × X₂ = X₃ exists, repeating the foregoing process with log X₁, log X₂, and log X₃ substituted by X₁', X₂', and X₃', respectively, makes it possible to detect the multiplicative relationship among the variables X₁, X₂, and X₃ as well.

log X₁ + log X₂ = log X₃

→ X₁' + X₂' = X₃'
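The log substitution can be checked numerically with the same QR-based steps. This is a sketch that assumes strictly positive data (so that the logarithm is defined) and a relative tolerance of `1e-8` for deciding that a diagonal entry is 0; neither detail is specified in the source.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(1.0, 10.0, size=40)
x2 = rng.uniform(1.0, 10.0, size=40)
x3 = x1 * x2                                    # multiplicative relationship X1 * X2 = X3

# Substitute X' = log X, which turns the product into the sum X1' + X2' = X3'.
logged = np.column_stack([np.log(x1), np.log(x2), np.log(x3)])

_, r = np.linalg.qr(logged)
diag = np.abs(np.diag(r))
dependent = np.flatnonzero(diag <= 1e-8 * diag.max())
n = int(dependent[0])
coef = np.linalg.solve(r[:n, :n], r[:n, n])     # coefficients of X1' and X2'

print(dependent, coef)
```

Coefficients of 1 on both X₁' and X₂' correspond to X₃ = X₁ × X₂ in the original variables.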

In this way, the first constraint detection unit 102 can automatically detect the first constraints between variables more quickly and simply through QR decomposition of the plurality of variables in the original data. The first constraint detection unit 102 may store the detected first constraint in a database (not shown). The preprocessing unit 120 may obtain the filtered original data A' by removing from the original data, in accordance with the first constraint stored in the database, the variable for which the linear dependency relationship was detected.

FIG. 7 is an example showing a process in which the second constraint detection unit 104 according to an embodiment of the present invention detects the magnitude relationship of the second constraint.

As described above, the second constraint detection unit 104 may detect the magnitude relationship between combinations of a plurality of variables in the original data.

FIGS. 7(a), 7(b), and 7(c) each show an example of a magnitude relationship between combinations of a plurality of variables detected by the second constraint detection unit 104.

As shown in FIG. 7, the second constraint detection unit 104 may detect magnitude relationships between combinations of a plurality of variables in the original data, such as X < Y, X < Y + Z, and X > Y + Z (where X, Y, and Z each denote different variables).

At this time, the second constraint detection unit 104 may extract from the original data a plurality of combinations each consisting of no more than a set number of variables (for example, three or fewer), and then detect the magnitude relationship between the combinations based on the minimum and maximum values of each extracted combination.

As an example, the second constraint detection unit 104 may extract the minimum and maximum values of variable X and of variable Y, and detect the magnitude relationship X < Y by comparing the extracted minimum and maximum values with each other.

As another example, the second constraint detection unit 104 may extract the minimum and maximum values of variable X and of variable Y + variable Z, and detect the magnitude relationship X < Y + Z or X > Y + Z by comparing the extracted minimum and maximum values with each other.

Since such magnitude relationships between combinations of variables apply only to numeric variables, the second constraint detection unit 104 may first extract the numeric variables from the plurality of variables in the original data, and then detect the magnitude relationship between combinations based on the minimum and maximum values of each combination of the extracted numeric variables.
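One possible reading of the min/max comparison is sketched below: a relationship "left < right" is reported when the maximum of the left combination is smaller than the minimum of the right combination. This is an assumption about how the extracted values are compared, and it is stricter than a row-by-row comparison. The function names, the use of row-wise sums to form combinations, and the limit `max_vars=3` on the variables involved in one relationship are likewise illustrative.

```python
from itertools import combinations

def combination_values(columns, names):
    """Row-wise sum of the named numeric columns, e.g. ('Y', 'Z') -> Y + Z."""
    return [sum(vals) for vals in zip(*(columns[n] for n in names))]

def detect_magnitude_relations(columns, max_vars=3):
    """Return (left, right) pairs with max(left) < min(right), using at most max_vars variables."""
    names = sorted(columns)
    groups = [g for size in (1, 2) for g in combinations(names, size)]
    ranges = {g: (min(v), max(v)) for g in groups for v in [combination_values(columns, g)]}
    relations = []
    for left in groups:
        for right in groups:
            disjoint = not set(left) & set(right)
            if disjoint and len(left) + len(right) <= max_vars:
                if ranges[left][1] < ranges[right][0]:
                    relations.append((left, right))
    return relations

# Numeric columns only; categorical columns would be filtered out beforehand.
columns = {"X": [1, 2, 3], "Y": [5, 6, 7], "Z": [1, 1, 2]}
relations = detect_magnitude_relations(columns)
for left, right in relations:
    print(" + ".join(left), "<", " + ".join(right))
```

A relationship such as X > Y + Z appears in this representation as the pair with Y + Z on the left and X on the right.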

FIG. 8 is an example showing a process in which the second constraint detection unit 104 according to an embodiment of the present invention detects the hierarchical relationship of the second constraint.

As described above, the second constraint detection unit 104 may learn, from the original data, the hierarchical relationship among no more than a set number of variables. Since such hierarchical relationships apply only to categorical variables, the second constraint detection unit 104 may first extract the categorical variables from the plurality of variables in the original data, and then learn the hierarchical relationship among the extracted categorical variables.

Referring to FIGS. 8(a) and 8(b), the second constraint detection unit 104 may learn from the original data hierarchical relationships such as manufacturing ⊃ electrical equipment manufacturing ⊃ household appliance manufacturing, and manufacturing ⊃ food manufacturing ⊃ processing and preserving of fishery products. The second constraint detection unit 104 may learn the hierarchical structure among variables by analyzing patterns of hierarchical relationships between variables in the original data (for example, Gangnam-gu in Seoul, Bundang-gu in Seongnam-si, and so on).
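The source does not describe how the hierarchical relationship is learned. One simple possibility, shown here purely as an assumption, is to treat a pair of categorical variables as hierarchical when every value of the lower-level variable always co-occurs with exactly one value of the upper-level variable. The function name `learn_hierarchy` is hypothetical.

```python
def learn_hierarchy(rows, parent, child):
    """Return a child -> parent map if each child value has exactly one parent, else None."""
    mapping = {}
    for row in rows:
        # setdefault records the first parent seen; a different parent later breaks the hierarchy.
        if mapping.setdefault(row[child], row[parent]) != row[parent]:
            return None
    return mapping

rows = [
    {"city": "Seoul", "district": "Gangnam-gu"},
    {"city": "Seoul", "district": "Seocho-gu"},
    {"city": "Seongnam-si", "district": "Bundang-gu"},
    {"city": "Seoul", "district": "Gangnam-gu"},
]
city_over_district = learn_hierarchy(rows, parent="city", child="district")
district_over_city = learn_hierarchy(rows, parent="district", child="city")
print(city_over_district, district_over_city)
```

Here city ⊃ district is learned, while the reverse direction is rejected because Seoul appears with more than one district.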

In this manner, the constraint detection unit 110 may detect the first constraint, which indicates a linear dependency relationship among a plurality of variables in the original data, and the second constraint, which indicates a magnitude relationship between combinations of a plurality of variables in the original data and a hierarchical relationship among a plurality of variables in the original data. The constraints detected by the constraint detection unit 110 may be used not only in the preprocessing described above but also in the post-processing of the reproduction data B'.

FIG. 9 is an example showing a process of detecting constraints when an outlier exists in the original data, according to an embodiment of the present invention.

As described above, when outliers exist in the original data, the first and second constraints may not be detected correctly. In the present invention, therefore, only a portion of the data in the original data is randomly sampled, constraint detection is repeated on the samples, and a constraint is adopted as a final constraint only when it is detected at or above a certain rate. Here, it is assumed that an outlier with the value 73 exists in the original data, as shown in FIG. 9(a).

As an example, the first constraint detection unit 102 may perform random sampling on the original data 1,000 times, determine for each random sample whether the first constraint is detected, and adopt the first constraint as a final constraint only when the detection rate is 95% or more (that is, 950 times or more). The second constraint detection unit 104 may detect the second constraint in the same manner.

FIG. 10 is an example showing a separating set according to an embodiment of the present invention.

The generation order determination unit 130 may apply the PC algorithm (Peter-Clark algorithm) to the plurality of variables included in the original data. The PC algorithm is an algorithm for causal inference between variables and can be used to derive the CPDAG described later. To this end, the generation order determination unit 130 may first perform a conditional independence test on the plurality of variables included in the original data. The conditional independence test is a test for obtaining conditional independence information and separating sets for the plurality of variables, and may be, for example, a test technique such as the G-squared test (Neapolitan, 2004) or Conditional Distance Correlation (Wang, Xueqin et al., 2015). At this time, the generation order determination unit 130 may perform the conditional independence test by applying different test techniques depending on the types of the variables included in the original data. As an example, the generation order determination unit 130 may apply different test techniques to numeric variables and to categorical variables.

In the present embodiments, conditional independence information refers to a state in which, given one or more variables, the other variables do not influence one another. For example, if, given variable C, variable A has no influence on variable B and variable B has no influence on variable A, then variable A and variable B can be regarded as conditionally independent given variable C. In this case, the generation order determination unit 130 may obtain the conditional independence information P(A,B|C) = P(A|C)P(B|C). In addition, in the above example, variable C can be defined as a separating set for variable A and variable B, and the generation order determination unit 130 may obtain variable C as the separating set for variable A and variable B.
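The factorization P(A,B|C) = P(A|C)P(B|C) can be verified numerically on a small discrete distribution. The probabilities below are made up for illustration, and the check operates on an exact joint distribution. It is not a statistical test such as the G-squared test named in the text, which would be needed when only sampled data is available.

```python
import numpy as np

# Hypothetical binary variables; A and B each depend only on C.
p_c = np.array([0.4, 0.6])
p_a_given_c = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows indexed by c, columns by a
p_b_given_c = np.array([[0.7, 0.3], [0.5, 0.5]])   # rows indexed by c, columns by b

# Joint distribution P(a, b, c) = P(c) P(a|c) P(b|c), axes ordered (a, b, c).
joint = np.einsum("c,ca,cb->abc", p_c, p_a_given_c, p_b_given_c)

def is_conditionally_independent(joint):
    """Check P(A,B|C) == P(A|C) P(B|C) for every value of C (requires P(C) > 0)."""
    p_ab_c = joint / joint.sum(axis=(0, 1))          # P(a, b | c)
    p_a_c = p_ab_c.sum(axis=1, keepdims=True)        # P(a | c)
    p_b_c = p_ab_c.sum(axis=0, keepdims=True)        # P(b | c)
    return bool(np.allclose(p_ab_c, p_a_c * p_b_c))

p_ab = joint.sum(axis=2)                             # marginal P(a, b)
marginally_independent = bool(
    np.allclose(p_ab, np.outer(p_ab.sum(axis=1), p_ab.sum(axis=0)))
)
print(is_conditionally_independent(joint), marginally_independent)
```

In this example A and B are dependent when C is ignored but independent once C is given, which is the situation in which C serves as a separating set for A and B.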

Referring to FIG. 10, variable A and variable D satisfy conditional independence on the graph when variable B and variable C are given. In this case, variable B and variable C form the separating set of variable A and variable D, because removing variable B and variable C from the graph eliminates every path connecting variable A and variable D.
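The graph criterion described here (removing the separating set leaves no path between the two variables) can be sketched as a reachability check. FIG. 10 is not reproduced in the text, so the edge list below is a hypothetical stand-in in which A reaches D only through B or C; the function name `is_separated` is likewise an assumption.

```python
from collections import deque

# Hypothetical undirected graph standing in for FIG. 10 (the figure itself is
# not reproduced here): A connects to D only through B or C.
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]


def is_separated(edges, source, target, removed):
    """Return True if no path joins source and target once `removed` is deleted."""
    adjacency = {}
    for u, v in edges:
        if u in removed or v in removed:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Breadth-first search from source over the remaining vertices.
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return False
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return True


print(is_separated(EDGES, "A", "D", removed={"B", "C"}))
print(is_separated(EDGES, "A", "D", removed={"B"}))
```

Under these assumed edges, removing only B leaves the path through C intact, which is why both B and C are needed in the separating set.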

In this way, the generation order determination unit 130 can obtain conditional independence information and separating sets for the plurality of variables by performing a conditional independence test on the plurality of variables included in the original data. Thereafter, the generation order determination unit 130 may obtain a first graph, that is, an undirected graph of the data (Undirected Acyclic Graph, UAG), based on the conditional independence information and the separating sets.

FIG. 11 is an example of a UAG (first graph) according to an embodiment of the present invention. As described above, the generation order determination unit 130 may obtain an undirected graph of the data, that is, a UAG, based on the conditional independence information and the separating sets. Because the generation order determination unit 130 knows the conditional independence information and the separating sets between the variables, it can obtain the UAG over the variables on that basis.

Referring to FIG. 11, the generation order determination unit 130 can automatically draw the UAG among variable 1, variable 2, variable 3, variable 4, and variable 5 based on the conditional independence information and the separating sets among these five variables. The generation order determination unit 130 may then derive a second graph, that is, a directed graph of the data in the form of a CPDAG (Completed Partially Directed Acyclic Graph), based on the conditional independence information, the separating sets, and the UAG.

FIG. 12 is an example of a CPDAG (second graph) according to an embodiment of the present invention.

As described above, the generation order determination unit 130 can derive a directed graph of the data, that is, a CPDAG, based on the conditional independence information, the separating sets, and the UAG. Here, each vertex in the CPDAG represents one of the plurality of variables.

Referring to FIG. 12, a CPDAG with the following directions can be derived.

Variable 2 → Variable 3 → Variable 1 → Variable 4

Variable 3 → Variable 2 → Variable 4

Variable 5 → Variable 4

As described above, the generation order determination unit 130 may apply the PC algorithm to the variables to perform causal inference between them and derive the resulting CPDAG.

Thereafter, the generation order determination unit 130 may convert the CPDAG into a corresponding matrix and use the converted matrix to determine the generation order of the variables for generating the synthetic data.

FIG. 13 is an example of a matrix corresponding to the CPDAG according to an embodiment of the present invention.

Referring to FIG. 13, the matrix corresponding to the CPDAG expresses the distance from each vertex on the CPDAG (row) to every other vertex (column), and can be generated from the vertices of the CPDAG in FIG. 12 and the arrows on the line segments connecting them. The distance from a vertex to itself is defined as 0, and the case in which one vertex cannot be reached from another is defined as NA (missing value).

As one example, the distances from vertex 1 to 1, 2 to 2, 3 to 3, 4 to 4, and 5 to 5 are expressed as 0 in the matrix. The distance from vertex 2 to vertex 1 follows the path 2 → 3 → 1 and is therefore expressed as 2, and the distance from vertex 2 to vertex 4 follows the path 2 → 4 and is therefore expressed as 1. Vertex 2 cannot be reached from vertex 1, so that entry is expressed as NA, and vertex 1 cannot be reached from vertex 4, so that entry is also expressed as NA.

In this way, the generation order determination unit 130 can convert the CPDAG shown in FIG. 12 into the matrix shown in FIG. 13. That is, the generation order determination unit 130 can generate the matrix by calculating, on the CPDAG, the number of line segments needed to reach each of the remaining vertices from a specific vertex, and the number of line segments needed to reach the specific vertex from each of the remaining vertices.
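The conversion from the CPDAG to a distance matrix can be sketched as a breadth-first search from each vertex. The arc list below is read off the paths listed for FIG. 12; treating the 2-3 edge as undirected (entered in both directions) is an assumption made here because both 2 → 3 and 3 → 2 appear in those paths. The function name `distance_matrix` and the use of `None` for NA are also illustrative choices, not the patent's implementation.

```python
from collections import deque

# Arcs read off the paths listed for FIG. 12. Assumption: because both
# 2 -> 3 and 3 -> 2 appear in those paths, the 2-3 edge is treated as
# undirected and entered in both directions.
ARCS = [(2, 3), (3, 2), (3, 1), (1, 4), (2, 4), (5, 4)]
VERTICES = [1, 2, 3, 4, 5]


def distance_matrix(vertices, arcs):
    """Row = start vertex, column = end vertex; None stands for NA."""
    successors = {v: [] for v in vertices}
    for u, v in arcs:
        successors[u].append(v)

    matrix = {}
    for start in vertices:
        # Breadth-first search gives the smallest number of line segments.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in successors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        matrix[start] = [dist.get(end) for end in vertices]
    return matrix


for vertex, row in distance_matrix(VERTICES, ARCS).items():
    print(vertex, ["NA" if d is None else d for d in row])
```

With these assumed arcs, the entries mentioned in the text come out as described: the distance from vertex 2 to vertex 1 is 2, from vertex 2 to vertex 4 is 1, and the entries from vertex 1 to vertex 2 and from vertex 4 to vertex 1 are NA.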

Here, the generation order determination unit 130 may count, as the first NA count, the number of remaining vertices that cannot be reached from the specific vertex on the CPDAG, and may count, as the second NA count, the number of remaining vertices from which the specific vertex cannot be reached on the CPDAG.

In the example of FIG. 13, the first NA count, that is, the number of remaining vertices that cannot be reached from a given vertex, is 3 for vertex 1, 1 for vertex 2, 1 for vertex 3, 4 for vertex 4, and 3 for vertex 5.

Likewise, in the example of FIG. 13, the second NA count, that is, the number of remaining vertices from which a given vertex cannot be reached, is 2 for vertex 1, 3 for vertex 2, 3 for vertex 3, 0 for vertex 4, and 4 for vertex 5.

That is, the first NA count for a specific vertex is the number of NAs along the direction of the blue arrow in FIG. 13, and the second NA count for a specific vertex is the number of NAs along the direction of the red arrow in FIG. 13.

In this way, the generation order determination unit 130 can identify the first NA count and the second NA count for each vertex in the matrix, and can determine the generation order of the variables for generating the synthetic data based on the first NA count and the second NA count.

In the above example, the first NA count and the second NA count for each vertex in the matrix are as shown in Table 1 below.

Table 1

| Vertex | First NA count | Second NA count |
|---|---|---|
| Vertex 1 | 3 | 2 |
| Vertex 2 | 1 | 3 |
| Vertex 3 | 1 | 3 |
| Vertex 4 | 4 | 0 |
| Vertex 5 | 3 | 4 |
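The counts in Table 1 can be reproduced by counting NA entries along each row (first NA count) and each column (second NA count) of the matrix. FIG. 13 is not reproduced in the text, so the matrix below is an assumed reconstruction that agrees with the entries quoted in the description and with Table 1; the name `na_counts` is hypothetical.

```python
NA = None  # marker for "unreachable"

# Assumed reconstruction of the FIG. 13 matrix (the figure is not reproduced
# in the text); rows are start vertices 1..5, columns are end vertices 1..5.
MATRIX = [
    [0, NA, NA, 1, NA],
    [2, 0, 1, 1, NA],
    [1, 1, 0, 2, NA],
    [NA, NA, NA, 0, NA],
    [NA, NA, NA, 1, 0],
]


def na_counts(matrix):
    """Return {vertex: (first NA count, second NA count)} with 1-based vertices."""
    size = len(matrix)
    counts = {}
    for i in range(size):
        first = sum(matrix[i][j] is NA for j in range(size))   # NAs along row i
        second = sum(matrix[j][i] is NA for j in range(size))  # NAs along column i
        counts[i + 1] = (first, second)
    return counts


for vertex, (first, second) in na_counts(MATRIX).items():
    print(f"vertex {vertex}: first NA count = {first}, second NA count = {second}")
```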

The generation order determination unit 130 may determine the generation order so that the variable corresponding to the specific vertex is generated earlier in the synthetic data generation process the smaller its first NA count and the larger its second NA count. A small first NA count means that more of the other variables use the information of the variable corresponding to that vertex, so a variable with a smaller first NA count is preferably placed earlier in the generation order. A large second NA count means that the variable corresponding to that vertex is influenced by fewer variables, so a variable with a larger second NA count is likewise preferably placed earlier in the generation order.

In addition, of the first NA count and the second NA count, the generation order determination unit 130 may give priority to the second NA count when determining the generation order.

In Table 1 above, sorting the vertices in descending order of the second NA count gives:

Vertex 5 → Vertex 2, Vertex 3 → Vertex 1 → Vertex 4

Sorting the vertices in ascending order of the first NA count gives:

Vertex 2, Vertex 3 → Vertex 1, Vertex 5 → Vertex 4

Accordingly, the generation order determination unit 130 determines the generation order by considering the second NA count first, and because vertex 2 and vertex 3 have the same second NA count, it may consider the first NA count as the next criterion. That is, the generation order determination unit 130 may determine the generation order by giving priority to the second NA count and, for variables with the same second NA count, by considering the first NA count. In the above example, however, vertex 2 and vertex 3 also have the same first NA count, so the generation order determination unit 130 may select either vertex 2 or vertex 3 at random.

Accordingly, the generation order determination unit 130 may generate the variables in the following generation order, taking into account the first NA count and the second NA count.

<Variable generation order>

Variable 5 → Variable 2 → Variable 3 → Variable 1 → Variable 4
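The ordering rule above (second NA count first, first NA count as the next criterion, random choice for any remaining tie) can be sketched as a sort with a composite key. The counts are copied from Table 1; the function name `generation_order` and the shuffle-then-stable-sort tie-break are assumptions of this sketch rather than the patent's stated implementation.

```python
import random

# (first NA count, second NA count) per vertex, copied from Table 1.
NA_COUNTS = {1: (3, 2), 2: (1, 3), 3: (1, 3), 4: (4, 0), 5: (3, 4)}


def generation_order(na_counts, rng=None):
    """Order vertices by second NA count (descending), then first NA count
    (ascending); any remaining ties are broken at random."""
    rng = rng or random.Random()
    vertices = list(na_counts)
    rng.shuffle(vertices)  # random tie-break; sorted() is stable, so it survives
    return sorted(vertices, key=lambda v: (-na_counts[v][1], na_counts[v][0]))


print(generation_order(NA_COUNTS, random.Random(0)))
```

Vertex 5 always comes first and vertices 1 and 4 always come last; vertices 2 and 3 may appear in either order because both of their counts are equal.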

In the above example, variable 5 can be regarded as having the greatest influence on the other variables relative to the remaining variables. That is, the generation order determination unit 130 can automatically determine the generation order, taking into account the first NA count and the second NA count, so that among the plurality of variables included in the original data, those with greater influence on other variables are generated first.

As described above, the synthetic data generation unit 140 generates synthetic data from the filtered original data (A'), and can generate synthetic data (B') by sequentially inputting the data for the variables in the filtered original data into the synthetic data generation model according to the generation order determined by the generation order determination unit 130. The post-processing unit 150 can then generate the final synthetic data (B) by restoring the variables removed by the preprocessing unit 120 based on the first constraint and removing outliers in the synthetic data (B') based on the second constraint.
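Sequential generation in a fixed variable order can be illustrated with a very small stand-in model. This passage does not specify the synthetic data generation model, so the sketch below substitutes simple empirical conditional sampling for categorical data; the data, the `parents` mapping, and the function name `synthesize` are all hypothetical.

```python
import random


def synthesize(rows, order, parents, n_samples, rng):
    """Generate records one variable at a time, following `order`.

    Each variable is drawn from the original records that agree with the
    already-generated values of its parents (which must appear earlier in
    `order`); if no original record agrees, the draw falls back to all rows.
    """
    synthetic = []
    for _ in range(n_samples):
        record = {}
        for name in order:
            matches = [r for r in rows if all(r[p] == record[p] for p in parents[name])]
            pool = matches or rows
            record[name] = rng.choice(pool)[name]
        synthetic.append(record)
    return synthetic


# Hypothetical categorical original data, generation order, and parent sets.
ORIGINAL = [
    {"region": "north", "plan": "basic", "churn": "no"},
    {"region": "north", "plan": "premium", "churn": "no"},
    {"region": "south", "plan": "basic", "churn": "yes"},
    {"region": "south", "plan": "basic", "churn": "no"},
]
ORDER = ["region", "plan", "churn"]
PARENTS = {"region": [], "plan": ["region"], "churn": ["plan"]}

for record in synthesize(ORIGINAL, ORDER, PARENTS, n_samples=3, rng=random.Random(0)):
    print(record)
```

In this sketch each variable is conditioned only on the variables it depends on, so unrelated variables play no part in one another's generation.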

According to embodiments of the present invention, compared with generating synthetic data under an arbitrarily chosen generation order, the joint probability distribution of the variables can be inferred accurately, and because variables that are unrelated to one another are kept from participating in each other's generation, the generation speed of the synthetic data can be improved.

FIG. 14 is a flowchart illustrating a method for automatically generating synthetic data according to an embodiment of the present invention. Although the illustrated flowchart divides the method into a plurality of steps, at least some of the steps may be performed in a different order, combined with other steps and performed together, omitted, divided into sub-steps, or supplemented with one or more steps not shown.

In step S102, the original data (A) is input to the constraint detection unit 110.

In step S104, the constraint detection unit 110 detects constraints on the plurality of variables in the original data. Specifically, the first constraint detection unit 102 detects a first constraint indicating a linear dependency relationship among the plurality of variables in the original data, and the second constraint detection unit 104 detects a second constraint indicating a magnitude relationship between combinations of the plurality of variables in the original data and a hierarchical relationship among the plurality of variables in the original data.

In step S106, the preprocessing unit 120 removes from the original data (A) the variables for which a linear dependency relationship is detected, based on the first constraint.

In step S108, the generation order determination unit 130 performs a conditional independence test on the plurality of variables included in the original data and determines the generation order of the variables for generating the synthetic data based on the result of the conditional independence test.

In step S110, the synthetic data generation unit 140 generates synthetic data (B') by sequentially inputting, according to the generation order, the data (A') for the variables that remain after excluding the variables removed from the original data (A) into the synthetic data generation model.

In step S112, the post-processing unit 150 generates the final synthetic data (B) by restoring the variables removed by the preprocessing unit 120 based on the first constraint and removing outliers in the synthetic data (B') based on the second constraint.
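The post-processing of step S112 can be sketched as two passes over the generated records: recompute each removed variable from its linear relationship, then drop records that violate a constraint. The column names, the linear relationship, the validity rule, and the function name `post_process` below are hypothetical examples chosen for illustration, not constraints taken from the patent.

```python
def post_process(records, restore, is_valid):
    """Restore removed variables, then drop records that break a constraint.

    restore:  {removed variable: function computing it from a record}
    is_valid: predicate standing in for the second constraint
    """
    final = []
    for record in records:
        restored = dict(record)
        for name, compute in restore.items():
            restored[name] = compute(restored)
        if is_valid(restored):
            final.append(restored)
    return final


# Hypothetical generated data B' with columns "price" and "quantity".
GENERATED = [
    {"price": 10.0, "quantity": 3},
    {"price": 4.0, "quantity": -1},   # violates quantity >= 0
    {"price": 2.5, "quantity": 8},
]

# Hypothetical first constraint: total = 2 * price + quantity (linear dependency).
RESTORE = {"total": lambda r: 2 * r["price"] + r["quantity"]}


# Hypothetical second constraint: quantity is non-negative and total >= price.
def satisfies_second_constraint(r):
    return r["quantity"] >= 0 and r["total"] >= r["price"]


FINAL = post_process(GENERATED, RESTORE, satisfies_second_constraint)
print(FINAL)
```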

FIG. 15 is a block diagram illustrating a computing environment that includes a computing device suitable for use in exemplary embodiments. In the illustrated embodiment, each component may have functions and capabilities other than those described below, and additional components beyond those described below may be included.

The illustrated computing environment 10 includes a computing device 12. In one embodiment, the computing device 12 may be the automatic synthetic data generation system 100, or one or more components included in the automatic synthetic data generation system 100.

The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions, which, when executed by the processor 14, may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, the computer-readable storage medium 16 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 12 and can store the desired information, or a suitable combination thereof.

The communication bus 18 interconnects the various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.

The computing device 12 may also include one or more input/output interfaces 22, which provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 may be connected to the other components of the computing device 12 through the input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various types of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. An exemplary input/output device 24 may be included inside the computing device 12 as one of its components, or may be connected to the computing device 12 as a separate device distinct from it.

Although the present invention has been described in detail above through representative embodiments, those of ordinary skill in the art to which the present invention pertains will understand that various modifications can be made to the above-described embodiments without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims set forth below and by their equivalents.

100: Automatic synthetic data generation system
102: First constraint detection unit
104: Second constraint detection unit
110: Constraint detection unit
120: Preprocessing unit
130: Generation order determination unit
140: Synthetic data generation unit
150: Post-processing unit

Claims

A system for automatically generating, from original data containing data on a plurality of different variables, synthetic data having statistical characteristics corresponding to the original data, the system comprising:
a constraint detection unit that detects a first constraint indicating a linear dependency relationship among the plurality of variables in the original data, and a second constraint indicating a magnitude relationship between combinations of the plurality of variables in the original data and a hierarchical relationship among the plurality of variables in the original data;
a preprocessing unit that removes, from the original data, a variable for which the linear dependency relationship is detected based on the first constraint;
a generation order determination unit that performs a conditional independence test on the plurality of variables included in the original data and determines a generation order of variables for generating the synthetic data based on a result of the conditional independence test;
a synthetic data generation unit that generates the synthetic data by sequentially inputting, into a preset synthetic data generation model and according to the generation order, data on the variables that remain among the plurality of variables after excluding the variable removed from the original data; and
a post-processing unit that restores the variable removed by the preprocessing unit based on the first constraint and removes outliers in the synthetic data based on the second constraint.

The system of claim 1,
wherein the constraint detection unit performs QR decomposition on the plurality of variables in the original data and detects the first constraint indicating the linear dependency relationship based on the values of the diagonal components of the upper triangular matrix obtained from the QR decomposition.

The system of claim 2,
wherein the upper triangular matrix is an N×N matrix, and
the constraint detection unit determines, as the variable for which the linear dependency relationship is detected, the variable corresponding to the n-th (where n ≤ N) row and column whose diagonal component of the upper triangular matrix has a value of 0, derives from the upper triangular matrix n-1 linear equations having the variable for which the linear dependency relationship is detected as the dependent variable, and detects the first constraint by calculating, from the n-1 linear equations, the coefficients that satisfy the linear dependency relationship.
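The detection described in these two claims can be sketched in a few lines: a zero on the diagonal of the upper triangular factor marks a column that depends on the columns before it, and back-substitution on the rows above that entry recovers the coefficients. To stay free of external libraries, this sketch computes the QR factorization with modified Gram-Schmidt, which is a substitution made here and not something the patent specifies; the data, the tolerance, and the function names are likewise assumptions.

```python
import math


def gram_schmidt_qr(columns):
    """Modified Gram-Schmidt on a list of column vectors.

    Returns (Q columns, R) with R upper triangular as a list of rows.
    A column that depends on earlier columns yields a ~0 diagonal entry.
    """
    n = len(columns)
    q_cols, r = [], [[0.0] * n for _ in range(n)]
    for j, col in enumerate(columns):
        v = list(col)
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for qk, vk in zip(q, v)]
        norm = math.sqrt(sum(vk * vk for vk in v))
        r[j][j] = norm
        q_cols.append([vk / norm for vk in v] if norm > 1e-9 else [0.0] * len(v))
    return q_cols, r


def dependency_coefficients(r, j):
    """Back-substitute R[:j, :j] c = R[:j, j]: column j = sum of c[k] * column k."""
    c = [0.0] * j
    for i in range(j - 1, -1, -1):
        s = r[i][j] - sum(r[i][k] * c[k] for k in range(i + 1, j))
        c[i] = s / r[i][i]
    return c


# Hypothetical data: the third variable is exactly 2 * X1 + 3 * X2.
X1 = [1.0, 2.0, 3.0, 4.0]
X2 = [2.0, 1.0, 0.0, 1.0]
X3 = [2 * a + 3 * b for a, b in zip(X1, X2)]

_, R = gram_schmidt_qr([X1, X2, X3])
DEPENDENT = [j for j in range(3) if abs(R[j][j]) < 1e-9]  # 0-based column indices
COEFFS = dependency_coefficients(R, DEPENDENT[0])
print(DEPENDENT, COEFFS)
```

For the third column (n = 3), the back-substitution uses the two rows above the zero diagonal entry, which matches the n-1 linear equations mentioned in the claim.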

The system of claim 1,
wherein the constraint detection unit extracts from the original data a plurality of combinations of no more than a set number of variables, detects the magnitude relationship between the combinations based on the minimum and maximum values of each extracted combination, and learns from the original data the hierarchical relationship among no more than the set number of variables.

The system of claim 1,
wherein the constraint detection unit repeatedly performs random sampling from the original data, determines for each randomly sampled data set whether the first constraint and the second constraint are detected, and determines the first constraint and the second constraint as final constraints only when the rate at which each of the first constraint and the second constraint is detected is greater than or equal to a threshold.
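The repeated-sampling check in this claim can be sketched as a detection rate over random subsamples compared against a threshold. The candidate constraint, the data, the sample size, the number of rounds, the threshold value, and sampling without replacement are all assumptions of this sketch; the claim itself only refers to random sampling and a threshold.

```python
import random


def detection_rate(rows, detect, n_rounds, sample_size, rng):
    """Fraction of random subsamples in which `detect` finds the constraint."""
    hits = 0
    for _ in range(n_rounds):
        sample = rng.sample(rows, sample_size)  # sampling without replacement
        hits += bool(detect(sample))
    return hits / n_rounds


# Hypothetical candidate constraint: column "low" never exceeds column "high".
def low_le_high(sample):
    return all(r["low"] <= r["high"] for r in sample)


# Hypothetical original data: 1 of 50 rows breaks the candidate constraint.
ROWS = [{"low": i, "high": i + 5} for i in range(49)] + [{"low": 9, "high": 1}]

RATE = detection_rate(ROWS, low_le_high, n_rounds=200, sample_size=10, rng=random.Random(0))
THRESHOLD = 0.7  # hypothetical value; the claim only says "a threshold"
IS_FINAL = RATE >= THRESHOLD
print(RATE, IS_FINAL)
```

A constraint that holds for almost all rows is detected in most subsamples and can pass the threshold, whereas one that is frequently violated is detected rarely and is discarded.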

The system of claim 1,
wherein the generation order determination unit obtains conditional independence information and separating sets for the plurality of variables by performing the conditional independence test, derives a CPDAG (Completed Partially Directed Acyclic Graph) for the plurality of variables based on the conditional independence information and the separating sets, and determines the generation order of variables for generating the synthetic data using the CPDAG.

The system of claim 6,
wherein each vertex in the CPDAG represents one of the plurality of variables,
the generation order determination unit calculates the number of line segments needed to reach each of the remaining vertices from a specific vertex on the CPDAG, and counts, as a first NA count, the number of the remaining vertices that cannot be reached from the specific vertex on the CPDAG,
the generation order determination unit calculates the number of line segments needed to reach the specific vertex from each of the remaining vertices on the CPDAG, and counts, as a second NA count, the number of the remaining vertices from which the specific vertex cannot be reached on the CPDAG, and
the generation order determination unit determines the generation order of variables for generating the synthetic data based on the first NA count and the second NA count.

The system of claim 7,
wherein the generation order determination unit determines the generation order so that the variable corresponding to the specific vertex is generated earlier in the synthetic data generation process the smaller the first NA count and the larger the second NA count.

The system of claim 1,
wherein the post-processing unit determines, as the outliers, data in the synthetic data that does not satisfy the second constraint, and removes the outliers.

A method for automatically generating, from original data containing data on a plurality of different variables, synthetic data having statistical characteristics corresponding to the original data, the method comprising:
detecting, in a constraint detection unit, a first constraint indicating a linear dependency relationship among the plurality of variables in the original data;
detecting, in the constraint detection unit, a second constraint indicating a magnitude relationship between combinations of the plurality of variables in the original data and a hierarchical relationship among the plurality of variables in the original data;
removing, in a preprocessing unit, from the original data a variable for which the linear dependency relationship is detected based on the first constraint;
performing, in a generation order determination unit, a conditional independence test on the plurality of variables included in the original data;
determining, in the generation order determination unit, a generation order of variables for generating the synthetic data based on a result of the conditional independence test;
generating, in a synthetic data generation unit, the synthetic data by sequentially inputting, into a preset synthetic data generation model and according to the generation order, data on the variables that remain among the plurality of variables after excluding the variable removed from the original data;
restoring, in a post-processing unit, the variable removed by the preprocessing unit based on the first constraint; and
removing, in the post-processing unit, outliers in the synthetic data based on the second constraint.

In claim 10,
The method for automatically generating reproduction data, wherein the step of detecting the first constraint comprises performing QR decomposition on the plurality of variables in the original data and detecting the first constraint indicating the linear dependency relationship based on the values of the diagonal components of the upper triangular matrix obtained from the QR decomposition.

In claim 11,
The upper triangular matrix is an N×N matrix,
The method for automatically generating reproduction data, wherein, in the step of detecting the first constraint, the variable corresponding to the n-th (where n ≤ N) row and column whose diagonal component of the upper triangular matrix has a value of 0 is determined to be the variable for which the linear dependency relationship is detected, n-1 linear equations having that variable as the dependent variable are derived from the upper triangular matrix, and the first constraint is detected by calculating, from the n-1 linear equations, the coefficients that satisfy the linear dependency relationship.
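The QR-based detection described in the two claims above can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a modified Gram-Schmidt factorization, a numerical tolerance `tol` in place of an exact zero, and 0-based column indices, and the names `r_factor` and `detect_linear_dependency` are hypothetical.

```python
from math import sqrt

def r_factor(columns, tol=1e-9):
    """Upper triangular R from a modified Gram-Schmidt QR of the given columns.

    Assumption: a diagonal entry whose magnitude is at most `tol` is stored as 0.
    """
    n = len(columns)
    q = []                                   # orthonormal directions found so far
    r = [[0.0] * n for _ in range(n)]
    for j, col in enumerate(columns):
        v = [float(x) for x in col]
        for k, qk in enumerate(q):
            r[k][j] = sum(a * b for a, b in zip(qk, v))
            v = [a - r[k][j] * b for a, b in zip(v, qk)]
        norm = sqrt(sum(a * a for a in v))
        independent = norm > tol
        r[j][j] = norm if independent else 0.0
        q.append([a / norm for a in v] if independent else [0.0] * len(v))
    return r

def detect_linear_dependency(columns, tol=1e-9):
    """Map each dependent column index n to coefficients over columns 0..n-1."""
    r = r_factor(columns, tol)
    constraints = {}
    for n in range(len(columns)):
        if r[n][n] != 0.0:
            continue                         # column n is linearly independent
        # Back substitution on the leading block: R[:n, :n] @ coef = R[:n, n]
        coef = [0.0] * n
        for i in reversed(range(n)):
            if r[i][i] == 0.0:
                continue                     # earlier dependent column gets coefficient 0
            s = r[i][n] - sum(r[i][k] * coef[k] for k in range(i + 1, n))
            coef[i] = s / r[i][i]
        constraints[n] = coef
    return constraints

a = [1, 2, 3, 4]
b = [2, 1, 0, 5]
c = [2 * x + 3 * y for x, y in zip(a, b)]    # c = 2a + 3b by construction
found = detect_linear_dependency([a, b, c])
```

Here `found` maps column index 2 to coefficients close to 2 and 3, which is the relationship that a pre-processing step could use to drop column `c` and a post-processing step could use to restore it.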

In claim 10,
The method for automatically generating reproduction data, wherein the step of detecting the second constraint comprises extracting, from the original data, a plurality of combinations each consisting of fewer than a set number of variables, detecting the magnitude relationship between the combinations based on the minimum and maximum values of each of the extracted combinations, and learning, from the original data, the hierarchical relationship among the fewer than the set number of variables.
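As a rough illustration of the magnitude-relationship detection in the claim above, the sketch below compares pairs of single variables through the minimum and maximum of their row-wise difference. This is only one possible reading: the claim does not fix how combinations are formed or compared, and the function name, the row format (a list of dicts), and the example variables are assumptions made for illustration.

```python
from itertools import combinations

def detect_magnitude_constraints(rows, variables):
    """Return (lo, hi) pairs such that row[lo] <= row[hi] holds in every row."""
    found = []
    for a, b in combinations(variables, 2):
        diffs = [row[b] - row[a] for row in rows]
        if min(diffs) >= 0:
            found.append((a, b))             # a <= b in all rows
        elif max(diffs) <= 0:
            found.append((b, a))             # b <= a in all rows
    return found

rows = [
    {"net": 80, "gross": 100, "age": 120},
    {"net": 50, "gross": 50, "age": 30},
    {"net": 10, "gross": 40, "age": 20},
]
pairs = detect_magnitude_constraints(rows, ["net", "gross", "age"])
```

With these hypothetical rows, only the pair `("net", "gross")` is reported, because `age` is sometimes above and sometimes below the other two variables.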

In claim 10,
The method for automatically generating reproduction data, further comprising, after the step of detecting the first constraint and the step of detecting the second constraint:
repeatedly performing, in the constraint detection unit, random sampling from the original data; and
determining, in the constraint detection unit, whether the first constraint and the second constraint are detected in each of the randomly sampled data, and determining the first constraint and the second constraint as final constraints, respectively, only if the rate at which each is detected is equal to or greater than a threshold value.
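The repeated-sampling check in the claim above can be sketched as follows. The sketch assumes sampling without replacement, a caller-supplied `detect` function that returns hashable constraints, and default values for `rounds`, `sample_size`, and `threshold`; none of these details are specified in the claim, and `stable_constraints` is a hypothetical name.

```python
import random
from collections import Counter

def stable_constraints(rows, detect, rounds=100, sample_size=None,
                       threshold=0.9, seed=0):
    """Keep constraints detected in at least `threshold` of the random samples."""
    rng = random.Random(seed)
    size = sample_size or max(1, len(rows) // 2)
    hits = Counter()
    for _ in range(rounds):
        sample = rng.sample(rows, size)      # sampling without replacement (assumption)
        hits.update(set(detect(sample)))     # count each constraint once per round
    return {c for c, n in hits.items() if n / rounds >= threshold}
```

A constraint that holds only by coincidence in part of the original data tends to fail in many samples, so its detection rate falls below the threshold and it is not adopted as a final constraint.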

In claim 10,
The method for automatically generating reproduction data, wherein the step of determining the generation order of variables for generating the reproduction data comprises obtaining conditional independence information and a separating set for the plurality of variables by performing the conditional independence test, deriving a Completed Partially Directed Acyclic Graph (CPDAG) for the plurality of variables based on the conditional independence information and the separating set, and determining the generation order of variables for generating the reproduction data using the CPDAG.

In claim 15,
Each vertex in the CPDAG represents one of the plurality of variables,
The method for automatically generating reproduction data, wherein the step of determining the generation order of variables for generating the reproduction data comprises:
calculating the number of line segments needed to reach each of the remaining vertices from a specific vertex on the CPDAG, and counting, as a first number of NAs, the number of remaining vertices that cannot be reached from the specific vertex on the CPDAG;
calculating the number of line segments needed to reach the specific vertex from each of the remaining vertices on the CPDAG, and counting, as a second number of NAs, the number of remaining vertices from which the specific vertex cannot be reached on the CPDAG; and
determining the generation order of variables for generating the reproduction data based on the first number of NAs and the second number of NAs.

In claim 16,
The method for automatically generating reproduction data, wherein the step of determining the generation order of variables for generating the reproduction data determines the generation order such that the smaller the first number of NAs and the larger the second number of NAs for the variable corresponding to the specific vertex, the earlier that variable is generated in the generation process of the reproduction data.
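The NA-count ordering in the two claims above can be sketched with a breadth-first reachability search. The claims do not state how undirected CPDAG edges are traversed or how the two counts are combined into a single ranking, so the sketch assumes that undirected edges can be walked in both directions and that vertices are sorted by the first number of NAs ascending and then by the second number of NAs descending. All function names are hypothetical.

```python
from collections import deque

def unreachable_count(adj, start):
    """Number of vertices that cannot be reached from `start` by following `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(adj) - len(seen)

def generation_order(vertices, directed, undirected=()):
    fwd = {v: set() for v in vertices}       # edges as drawn
    rev = {v: set() for v in vertices}       # edges reversed
    for u, w in directed:
        fwd[u].add(w)
        rev[w].add(u)
    for u, w in undirected:                  # assumption: traversable both ways
        for a, b in ((u, w), (w, u)):
            fwd[a].add(b)
            rev[b].add(a)
    # First number of NAs: vertices the given vertex cannot reach.
    first_na = {v: unreachable_count(fwd, v) for v in vertices}
    # Second number of NAs: vertices that cannot reach the given vertex.
    second_na = {v: unreachable_count(rev, v) for v in vertices}
    return sorted(vertices, key=lambda v: (first_na[v], -second_na[v]))

order = generation_order(["C", "B", "A"], directed=[("A", "B"), ("B", "C")])
```

For the chain A → B → C, vertex A reaches every other vertex and is reached by none, so it is placed first, followed by B and then C.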

In claim 10,
The method for automatically generating reproduction data, wherein the step of removing outliers from the reproduction data comprises determining data that does not satisfy the second constraint among the data in the reproduction data to be the outliers and removing the outliers.
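The outlier removal in the claim above amounts to filtering the generated rows against the detected constraints. The sketch below covers only magnitude relationships, expressed as hypothetical `(lo, hi)` variable pairs meaning `row[lo] <= row[hi]`; the row format and the function name are assumptions made for illustration.

```python
def remove_outliers(rows, magnitude_constraints):
    """Drop generated rows that violate any (lo, hi) constraint row[lo] <= row[hi]."""
    return [
        row for row in rows
        if all(row[lo] <= row[hi] for lo, hi in magnitude_constraints)
    ]

synthetic = [
    {"net": 70, "gross": 90},
    {"net": 95, "gross": 60},                # violates net <= gross
    {"net": 30, "gross": 30},
]
cleaned = remove_outliers(synthetic, [("net", "gross")])
```

Only the second row is discarded, since it is the only one in which `net` exceeds `gross`.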