KR102456177B1

KR102456177B1 - Method and Device of generating synthetic data with differential privacy

Info

Publication number: KR102456177B1
Application number: KR1020210003251A
Authority: KR
Inventors: 박유랑; 성민동
Original assignee: 연세대학교 산학협력단
Priority date: 2021-01-11
Filing date: 2021-01-11
Publication date: 2022-10-19
Also published as: KR20220144350A; KR20220102168A; KR102578911B1; WO2022149943A1

Abstract

A method and apparatus for generating synthetic data using differential privacy are provided. The method includes: a data classification step of classifying a plurality of data items included in original data into continuous data and categorical data according to type; a first synthetic data generation step of generating first synthetic data by applying a bounded Laplacian method to the continuous data; a second synthetic data generation step of generating second synthetic data by applying the bounded Laplacian method and a post-processing discretization technique to the categorical data; and an overall synthetic data generation step of generating overall synthetic data using the first synthetic data and the second synthetic data.

Description

Method and Device of generating synthetic data with differential privacy

The present invention relates to a method and apparatus for generating synthetic data using differential privacy.

As the importance of protecting personal information in big data and artificial intelligence (AI) training grows, differential privacy is attracting considerable attention as a privacy protection technique.

Differential privacy aims to minimize the risk of exposing a specific individual's personal information while transforming or structuring the data so that it remains meaningfully usable. To this end, it inserts random noise into a data set so that personal information is not exposed to third parties.

However, although the inserted noise prevents specific personal information from being exposed through the training data, it can also degrade the performance of an AI model that uses the data, depending on the amount of noise.

Registered Patent Publication No. 10-1935528, 2019.01.04

An object of the present invention is to provide a method and apparatus for generating synthetic data that sufficiently anonymize data using differential privacy while preserving the usefulness of the data.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method for generating synthetic data using differential privacy according to one aspect of the present invention for solving the above problem may include: a data classification step of classifying a plurality of data items included in original data into continuous data and categorical data according to type; a first synthetic data generation step of generating first synthetic data by applying a bounded Laplacian method to the continuous data; a second synthetic data generation step of generating second synthetic data by applying the bounded Laplacian method and a post-processing discretization technique to the categorical data; and an overall synthetic data generation step of generating overall synthetic data using the first synthetic data and the second synthetic data.

In the present invention, the first synthetic data and the second synthetic data may be generated using an epsilon value for which the accuracy of the overall synthetic data relative to the original data is equal to or greater than a predetermined value.

Here, the predetermined value may be 75% or more. In addition, based on the predetermined value, the epsilon value may be a value within the range of 10³ to 10⁴.

In the present invention, the bounded Laplacian technique may include normalizing all variables to the range of -1 to 1.

In the present invention, the post-processing discretization technique may include probabilistically discretizing the data perturbed by the bounded Laplacian technique.

Here, the probabilistic discretization may include discretization based on a Bernoulli probability given by a Bernoulli distribution function.

In the present invention, the method of generating the synthetic data using differential privacy may be performed by an entity that owns the original data.

An apparatus for generating synthetic data using differential privacy according to another aspect of the present invention for solving the above problem may include: a database that stores original data; and a processor configured to generate synthetic data from the original data using differential privacy. Here, the processor may be configured to classify a plurality of data items included in the original data into continuous data and categorical data according to type, generate first synthetic data by applying a bounded Laplacian method to the continuous data, generate second synthetic data by applying the bounded Laplacian method and a post-processing discretization technique to the categorical data, and generate overall synthetic data using the first synthetic data and the second synthetic data.

In the present invention, the processor may be configured to generate the first synthetic data and the second synthetic data using an epsilon value for which the accuracy of the overall synthetic data relative to the original data is equal to or greater than a predetermined value.

A computer program stored in a computer-readable recording medium according to yet another aspect of the present invention for solving the above problem may be configured, in combination with a computer, to execute: a data classification step of classifying a plurality of data items included in original data into continuous data and categorical data according to type; a first synthetic data generation step of generating first synthetic data by applying a bounded Laplacian method to the continuous data; a second synthetic data generation step of generating second synthetic data by applying the bounded Laplacian method and a post-processing discretization technique to the categorical data; and an overall synthetic data generation step of generating overall synthetic data using the first synthetic data and the second synthetic data.

Other specific details of the invention are included in the detailed description and drawings.

According to the present invention, the noise added for anonymization does not produce out-of-bounds data, and the generated synthetic data can satisfy both anonymity and data usefulness.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart illustrating a method for generating synthetic data using differential privacy according to an example of the present invention.
FIG. 2 is a diagram briefly illustrating a method for generating synthetic data using local differential privacy according to the present invention.
FIGS. 3A to 3C are graphs showing the performance of the bounded Laplacian technique according to the present invention.
FIGS. 4A and 4B are graphs briefly showing the relationship between the epsilon value and the degree of data perturbation according to the present invention.
FIG. 5 is a graph showing the classification accuracy of different machine learning models as a function of the epsilon value according to the present invention.
FIG. 6 is a diagram illustrating an apparatus for generating synthetic data using differential privacy according to another example of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention pertains of the scope of the invention, which is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless specifically stated otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those stated. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the recited elements and every combination of one or more of them. Although "first", "second", and so on are used to describe various elements, these elements are not limited by these terms. The terms are used only to distinguish one element from another. Therefore, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless expressly and specifically defined.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Before describing the present invention in detail, the features of the technology to which the present invention is applicable are described.

Big data is regarded as a key enabling technology for innovation in many fields (e.g., medicine, image processing, and natural language understanding). Data by itself is mostly useless, but much of it becomes meaningfully usable once algorithms such as machine learning (ML) are applied. Unlike rule-based systems, ML algorithms are data-driven and require large amounts of data. In addition, learning systems built on conventional ML approaches require centralized data. To obtain such large amounts of data and build a powerful ML model, data exchange between different organizations becomes essential.

However, exchanging data between different parties raises privacy issues, and concern about privacy violations by large enterprises is growing accordingly. In particular, medical data, most of which contains sensitive information, needs to be adequately protected when shared with third parties. Accordingly, the European Union's General Data Protection Regulation (GDPR) and the United States' Health Insurance Portability and Accountability Act of 1996 (HIPAA) recognize these problems and require that user privacy be strengthened.

Medical data has not only sensitive attributes but also heterogeneous ones. For example, serum glucose level is a continuous value, whereas medical history is generally a categorical value. In addition, a medical record may contain multi-modal values: some test results may be obtained from blood tests, and others from radiographic and physical examinations.

Accordingly, private information (e.g., medical data) must not be shared indiscriminately with third parties. This calls for anonymization, a technique for protecting personal information before it is provided to third-party users. After anonymization, the remaining privacy risk can be assessed using three main metrics: k-anonymity, l-diversity, and t-closeness. De-identification tools such as ARX can provide complete privacy protection through feature generalization and suppression of records. Differential privacy is another approach to data privacy and is a semantic model. Compared with syntactic anonymity, differential privacy requires less domain knowledge and is inherently robust against linkage attacks that exploit domain knowledge.

Based on these characteristics, the present invention describes in detail a method of generating synthetic data for differential privacy that takes into account the feasibility of the synthetic data and the balance between data privacy and utility.

In the method for generating synthetic data using differential privacy according to the present invention, described below with reference to FIGS. 1 to 6, all operations may be performed by a computing device such as a server.

FIG. 1 is a flowchart illustrating a method for generating synthetic data using differential privacy according to an example of the present invention.

As shown in FIG. 1, the method for generating synthetic data using differential privacy according to the present invention may include: a data classification step (S110) of classifying a plurality of data items included in original data into continuous data and categorical data according to type; a first synthetic data generation step (S120) of generating first synthetic data by applying a bounded Laplacian method to the continuous data; a second synthetic data generation step (S130) of generating second synthetic data by applying the bounded Laplacian method and a post-processing discretization technique to the categorical data; and an overall synthetic data generation step (S140) of generating overall synthetic data using the first synthetic data and the second synthetic data.
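The control flow of steps S110 to S140 can be sketched as follows. This is only a structural sketch under stated assumptions: the function name `generate_synthetic_record`, the record layout (a mapping from field name to normalized value), and the `perturb` and `discretize` callables are hypothetical and are not defined in the source.

```python
from typing import Callable, Dict, Set


def generate_synthetic_record(
    record: Dict[str, float],
    categorical_fields: Set[str],
    perturb: Callable[[float], float],
    discretize: Callable[[str, float], int],
) -> Dict[str, float]:
    """Apply S110-S140 to one record (hypothetical layout: field name -> value)."""
    # S110: split the fields into continuous and categorical data.
    continuous = {k: v for k, v in record.items() if k not in categorical_fields}
    categorical = {k: v for k, v in record.items() if k in categorical_fields}

    # S120: first synthetic data = perturbed continuous values.
    first = {k: perturb(v) for k, v in continuous.items()}

    # S130: second synthetic data = perturbed, then discretized categorical values.
    second = {k: discretize(k, perturb(v)) for k, v in categorical.items()}

    # S140: overall synthetic data = both parts combined.
    return {**first, **second}
```

Passing the mechanisms in as callables keeps the sketch independent of any particular noise distribution; a caller would supply a bounded Laplacian sampler for `perturb` and a probabilistic rounding rule for `discretize`.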

In the present invention, the overall synthetic data may include data perturbed so that private information is not exposed to third parties. For convenience of description, synthetic data and perturbed data are therefore assumed below to have the same meaning.

In the present invention, the method of generating synthetic data using differential privacy according to the steps described above may be performed by an apparatus for generating synthetic data using differential privacy. Depending on the embodiment, the apparatus may include an entity that owns the original data or a curator that collects data from the entities that own the original data.

FIG. 2 is a diagram briefly illustrating a method for generating synthetic data using local differential privacy according to the present invention.

In general, the differential privacy techniques applicable to the present invention can be divided into local differential privacy and global differential privacy according to how the noise is added. In local differential privacy, noise is added to each individual data item and an aggregator then collects the perturbed data. In global differential privacy, a separate curator collects the individual data and then adds the noise. Global differential privacy may therefore require the database owner to trust the curator.
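The difference between the two schemes lies in where the noise is added, which the following sketch illustrates for a simple mean. It is an illustration under assumptions, not the method of the source: the function names are hypothetical, plain (unbounded) Laplace noise is used, the curator perturbs the aggregate (one common variant), and `scale` is a free parameter that is not calibrated to any particular epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def local_dp_mean(values, scale, rng):
    """Local scheme: each owner perturbs its own value before reporting it,
    so the aggregator only ever sees noisy values."""
    reports = [v + laplace_noise(scale, rng) for v in values]
    return sum(reports) / len(reports)


def global_dp_mean(values, scale, rng):
    """Global scheme: a trusted curator sees the raw values and adds noise
    after collection (here, to the aggregate)."""
    return sum(values) / len(values) + laplace_noise(scale, rng)
```

In the local variant no raw value leaves its owner, which is why it suits the setting in which neither the curator nor the third party is trusted.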

The synthetic data generation method according to the present invention can be applied with either of the differential privacy schemes described above. In particular, it can be applied to generating synthetic data using differential privacy under a worst-case scenario (e.g., a situation in which neither the curator nor the third party can be trusted).

As an example applicable to the present invention, a data set may contain sensitive information such as diseases, medical records, and insurance status, so a leak in the medical domain can have serious consequences. The present invention therefore discloses a synthetic data generation method that focuses on minimizing the risk of data leakage by trusting no one outside the network.

More specifically, as shown in FIG. 2, an entity owning original data according to a preferred example of the present invention may convert the original data into perturbed data according to a local differential privacy technique and provide it to a third party. To this end, the entity may apply the bounded Laplacian technique (M1) and post-processing discretization (M2) to the original data (x) to generate the perturbed data (z). In this way, the owner of the original data can provide high-fidelity data to a third party while preserving the privacy of the original data.

In the present invention, an epsilon (ε) value for (local) differential privacy may be set. For adjacent data sets Y₁ and Y₂, a function κ may be defined as (ε,δ)-differentially private when it satisfies Equation 1 below.

[Equation 1 and its accompanying condition are not reproduced in this text.]

In other words, the epsilon value in differential privacy characterizes the ratio between the output distributions produced for two adjacent data sets. Accordingly, a smaller epsilon value provides stronger differential privacy protection than a larger one.
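Equation 1 is not reproduced in this text. For reference, the commonly used statement of (ε,δ)-differential privacy, which the definition above appears to follow, is given below. The output set S and the notation Range(κ) are assumptions introduced here and are not taken from the source.

```latex
\Pr[\kappa(Y_1) \in S] \;\le\; e^{\varepsilon}\,\Pr[\kappa(Y_2) \in S] + \delta
\quad \text{for all adjacent } Y_1, Y_2 \text{ and all } S \subseteq \mathrm{Range}(\kappa)
```

With δ = 0 the condition bounds the ratio of the two output probabilities by e^ε, which is why a smaller ε corresponds to stronger protection.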

Based on the above, in step S110 the apparatus for generating synthetic data using differential privacy according to the present invention may classify a plurality of data items included in the original data into continuous data and categorical data according to type. Here, classifying the data by type may include identifying the continuous data and the categorical data separately. In other words, the two kinds of data are distinguished so that the bounded Laplacian technique is applied to the continuous data, while the bounded Laplacian technique and the post-processing discretization technique are applied to the categorical data.

In the present invention, continuous data may include all data whose values are not discrete. Categorical data (or ordinal or nominal data), on the other hand, may include data with discrete values (e.g., heart rates, medico-surgical history).
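One simple way to perform such a split is sketched below. The rule used here (a column is treated as categorical when every value is integer-valued) is a heuristic assumed for illustration; the source does not specify how the types are detected, and the function name and column layout are hypothetical.

```python
def split_by_type(table):
    """Split columns (name -> list of numbers) into continuous and categorical.

    Assumed heuristic: a column whose values are all integer-valued is
    treated as categorical; every other column is treated as continuous.
    """
    continuous, categorical = {}, {}
    for name, values in table.items():
        is_discrete = all(float(v).is_integer() for v in values)
        (categorical if is_discrete else continuous)[name] = values
    return continuous, categorical
```

In practice a schema supplied by the data owner would likely be more reliable than inferring the type from the values.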

In steps S120 and S130, the apparatus for generating synthetic data using differential privacy according to the present invention may apply the bounded Laplacian method to the continuous data and the categorical data, normalizing all variables to the range of -1 to 1. As an example applicable to the present invention, the bounded Laplacian technique of steps S120 and S130 may be performed simultaneously or in parallel.
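The normalization to the range of -1 to 1 can be sketched as a linear rescaling. This sketch assumes that the bounds `lo` and `hi` of each variable are known in advance (for example, from clinical reference ranges) rather than computed from the private data; the function names are hypothetical.

```python
def normalize(values, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def denormalize(values, lo, hi):
    """Inverse of normalize: map values from [-1, 1] back to [lo, hi]."""
    return [(v + 1.0) / 2.0 * (hi - lo) + lo for v in values]
```

For example, `normalize([0, 50, 100], 0, 100)` gives `[-1.0, 0.0, 1.0]`, and `denormalize` maps perturbed values back to the original unit.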

The ordinary Laplacian distribution has unbounded support, so using it as-is can produce illogical values. For example, if the ordinary Laplacian technique is applied to respiratory rates, which cannot be negative, a perturbed respiratory rate may come out negative. The present invention therefore uses the bounded Laplacian technique to solve this problem, and in doing so can also minimize the possibility of data manipulation.
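The out-of-bounds problem is easy to reproduce numerically. The sketch below adds ordinary Laplace noise to a respiratory rate and counts how often the result falls outside a valid range; the value 12, the scale 10, and the function names are arbitrary choices made for illustration only.

```python
import random


def laplace(q, b, rng):
    """Ordinary (unbounded) Laplace sample centred on q with scale b."""
    return q + b * (rng.expovariate(1.0) - rng.expovariate(1.0))


def out_of_bounds_fraction(q, b, lower, upper, n, rng):
    """Fraction of n perturbed values that fall outside [lower, upper]."""
    draws = (laplace(q, b, rng) for _ in range(n))
    return sum(1 for x in draws if x < lower or x > upper) / n


rng = random.Random(0)
# Share of perturbed respiratory rates that end up negative.
negative_share = out_of_bounds_fraction(12.0, 10.0, 0.0, float("inf"), 10_000, rng)
```

Under these illustrative parameters the theoretical share of negative values is 0.5·e^(-1.2), roughly 15%, which a bounded mechanism reduces to zero by construction.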

In the bounded Laplacian technique according to the present invention, the input variable is assumed to lie within the output domain. For given parameters, the probability density function for each input q may be defined as in Equation 2 below.

[Equation 2 and the inline symbols that accompany it are not reproduced in this text.]
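Equation 2 is not reproduced in this text. As a point of reference only, one standard formulation of a bounded Laplace density is shown below. The symbols b (scale), D (bounded output domain), and C_q (normalizing constant) are assumptions introduced here; the source may define its density differently.

```latex
f_{W_q}(x) =
\begin{cases}
\dfrac{1}{C_q}\,\dfrac{1}{2b}\,e^{-\frac{|x-q|}{b}}, & x \in D,\\[6pt]
0, & x \notin D,
\end{cases}
\qquad
C_q = \int_{D} \frac{1}{2b}\,e^{-\frac{|x-q|}{b}}\,dx
```

In this form the density is the ordinary Laplace density centred on q, restricted to D and renormalized so that it integrates to one.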

Here, as an example applicable to the present invention, each of the variables may be given a specific definition (the definitions are not reproduced in this text). The upper bound and the lower bound may be changed in various ways depending on the embodiment. For instance, as another example applicable to the present invention, the upper bound and the lower bound used in the bounded Laplacian technique may be set to n and -n, respectively (where n is a natural number greater than 1). The epsilon (ε) value may be set to a value for which the accuracy of the overall synthetic data relative to the original data is equal to or greater than a predetermined value. How the epsilon value is determined is described in detail later.
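A bounded Laplacian sample can be drawn by rejection: draw ordinary Laplace noise around the input and redraw until the result lies within the bounds. The sketch below assumes this truncate-and-renormalize reading of the technique; the function name, the scale parameter `b`, and the default bounds are assumptions, and the sketch does not address how `b` should be chosen to meet a target epsilon.

```python
import random


def bounded_laplace(q, b, lower=-1.0, upper=1.0, rng=random):
    """Draw one sample from a Laplace(q, b) distribution truncated to [lower, upper].

    Rejection sampling: out-of-range draws are discarded, which yields the
    ordinary Laplace density restricted to the interval and renormalized.
    """
    if not lower <= q <= upper:
        raise ValueError("input must lie within the output domain")
    if b <= 0:
        raise ValueError("scale b must be positive")
    while True:
        # The difference of two unit exponentials is a standard Laplace variate.
        x = q + b * (rng.expovariate(1.0) - rng.expovariate(1.0))
        if lower <= x <= upper:
            return x
```

Rejection sampling is simple but becomes slow when `b` is much larger than the width of the interval; an inverse-CDF sampler would avoid the redraws at the cost of more algebra.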

In step S130, the apparatus for generating synthetic data using differential privacy according to the present invention may additionally apply a post-processing discretization technique to the categorical data to which the bounded Laplacian method has been applied. In this way, the apparatus may probabilistically discretize the data perturbed by the bounded Laplacian technique.

More specifically, when the bounded Laplacian technique is applied to perturb the original data as described above, a given input can map to infinitely many possible output values. However, many variables in the medical domain, such as heart rate and medico-surgical history, are categorical.

Accordingly, the apparatus for generating synthetic data according to the present invention may apply an additional post-processing discretization technique to the categorical data to which the bounded Laplacian technique has been applied, thereby discretizing that data.

As a specific example, a Bernoulli distribution technique may be applied to the categorical data to which the bounded Laplacian technique has been applied (e.g., the intermediate data of FIG. 2). To this end, the perturbed data y can be divided into m pieces, where m is the cardinality of the original input variable and has a positive value. The range [-C, C] is then converted to [0, m] and divided into equal intervals. The value of k for a given perturbed value y may be set to satisfy Equation 3 below.

Based on the calculated k, the Bernoulli probability p may be set to satisfy Equation 4 below. Here, the probability p may denote the distance between two adjacent probabilities.

Finally, taking the Bernoulli probability p into account, the perturbed data y may be discretized to satisfy Equation 5 below.

In the above equation, B denotes the Bernoulli distribution function.
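The discretization described above can be illustrated with a minimal sketch of stochastic rounding. Equations 3 to 5 are not reproduced here, so the details are assumptions: the function name `discretize` is hypothetical, the m categories are assumed to be indexed 0 to m-1, and [-C, C] is mapped linearly onto [0, m-1] (rather than [0, m] as stated in the text) so that every output is a valid category index. The lower neighbour plays the role of k, the fractional part plays the role of p, and one Bernoulli draw decides whether to round up.

```python
import math
import random


def discretize(y, c, m, rng=random):
    """Stochastically round a perturbed value y in [-c, c] to a category index.

    Assumption: categories are indexed 0 .. m-1 and [-c, c] is mapped
    linearly onto [0, m-1]. This is an illustration, not the document's
    Equations 3 to 5.
    """
    if m < 2:
        return 0  # a single category leaves nothing to choose
    scaled = (y + c) / (2 * c) * (m - 1)   # position on the category axis
    scaled = min(max(scaled, 0.0), m - 1)  # guard against floating-point drift
    k = math.floor(scaled)                 # lower neighbouring category
    p = scaled - k                         # Bernoulli probability of rounding up
    bump = 1 if rng.random() < p else 0    # one draw from B(p)
    return min(k + bump, m - 1)
```

With this choice the expected output equals the scaled input, so the rounding adds no systematic bias; a value halfway between two categories lands on either neighbour with equal probability.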

In step S140, the apparatus for generating synthetic data using differential privacy according to the present invention may generate the entire synthetic data using the first synthetic data and the second synthetic data. To this end, the apparatus may select an appropriate epsilon value and generate the entire synthetic data with it. For example, the epsilon value used may be one at which the accuracy of the entire synthetic data relative to the original data is equal to or greater than a predetermined value. As a specific example, the predetermined value may be 75% or more, and the epsilon value may be one value in the range of 10³ to 10⁴ based on that predetermined value. However, these values are only an example and may be set differently according to embodiments. More generally, the predetermined value may be set to X% or more, and the epsilon value may be set to a value within a certain range according to X and the characteristics of the data.
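The selection rule in step S140 can be sketched as follows. The function name `select_epsilon` and the accuracy figures are hypothetical and purely illustrative, not results from this document. The text only requires that the accuracy reach the predetermined value; picking the smallest qualifying epsilon (the strongest privacy among the candidates) is an additional assumption of this sketch.

```python
def select_epsilon(accuracy_by_epsilon, min_accuracy):
    """Return the smallest epsilon whose accuracy reaches min_accuracy, else None."""
    eligible = [eps for eps, acc in accuracy_by_epsilon.items() if acc >= min_accuracy]
    return min(eligible) if eligible else None


# Made-up accuracies for illustration only (epsilon -> accuracy).
measured = {1e1: 0.62, 1e2: 0.70, 1e3: 0.76, 1e4: 0.81}
chosen = select_epsilon(measured, min_accuracy=0.75)
print(chosen)  # 1000.0
```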

Hereinafter, simulation verification results for the above-described synthetic data generation method are described in detail.

First, data from the eICU Collaborative Research Database was used for the simulation verification of the present invention. More specifically, first, to evaluate whether the differential privacy algorithm effectively perturbs the given original data, the similarity between the two data sets was measured using an overlap measure for categorical data and the mean squared error (MSE) for continuous data. Second, to evaluate how differential privacy adversely affects the utility of the data set, the accuracy of predicting mortality after admission to the intensive care unit (ICU) from the APACHE score variables was compared across various epsilon values.

The data set may include categorical data such as intubation, ventilation, dialysis, and medication status (for which cardinality = 2 may apply), and eye, motor, and verbal status (for which cardinality = 6 may apply). The data set may also include continuous data such as urine output, temperature, respiratory rate, sodium, heart rate, mean blood pressure, pH, hematocrit, creatinine, albumin, oxygen pressure, CO2 pressure, blood urea nitrogen (BUN), glucose, bilirubin, and FiO2. The data set initially contained 148,532 patients (rows), but after rows with missing values were removed, data from a total of 4,740 patients (3,597 alive, 1,143 expired) were included. The following machine learning methods were used for mortality prediction: decision tree, K-nearest neighbor, support vector machine, logistic regression, naive Bayes, and random forest. The data were split into training and test sets at a ratio of 80 to 20. All predictions were averaged using five-fold cross-validation, and the Scikit-learn library for the Python programming language was used for this purpose.
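The evaluation setup described above can be sketched with Scikit-learn as follows. This is a minimal illustration under stated assumptions: the eICU data are not available here, so `make_classification` generates stand-in data with a roughly 76% to 24% class balance, and the hyperparameters (for example `max_iter=1000`) are arbitrary choices rather than settings from this document.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): NOT the eICU data set.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.76, 0.24], random_state=0
)

# The six model families named in the text, with illustrative settings.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "k_nearest_neighbor": KNeighborsClassifier(),
    "support_vector_machine": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Mean accuracy over five cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Repeating this loop on data perturbed with different epsilon values would give one accuracy curve per model, which is the kind of comparison the text describes.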

[Synthetic data for validation of bounded Laplacian function]

According to the present invention, a distribution with uniform spacing between -1 and 1 is generated and the bounded Laplacian technique is applied to it. In contrast to the conventional Laplacian technique, whose output range is unbounded, the output of the bounded Laplacian technique is confined to the range from -1 to 1.

FIGS. 3A to 3C are graphs showing the performance of the bounded Laplacian technique according to the present invention. More specifically, FIG. 3A shows a histogram of randomly generated continuous data in the range of -1 to 1, FIG. 3B shows a histogram of randomly generated continuous data that was originally generated in the range of 0 to 9 and then normalized to the range of -1 to 1, and FIG. 3C shows a histogram of the data of FIG. 3B after post-processing discretization was applied. In these examples, it is assumed that a Laplacian technique with ε = 0.1 and δ = 0 was applied.

Referring to FIG. 3A, the output of the conventional Laplacian technique may include values located outside the range, which do not occur with the bounded Laplacian technique. More specifically, to test categorical data and the post-processing discretization technique, in this simulation we generated 100 random integers between 0 and 9 and then normalized them to the range of -1 to 1. With the conventional Laplacian technique, some output values fell outside the range. In contrast, the categorical data to which the bounded Laplacian technique was applied stayed within the given data range, like the continuous data.
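The comparison above can be reproduced in outline with the following sketch. Several points are assumptions: the bounded variant is implemented by resampling until the output falls inside the bounds, which is one common way to bound a Laplace output and is not confirmed to be the formulation defined earlier in this document; the noise scale is taken as the range width 2 divided by ε = 0.1; and the function names are hypothetical. The privacy calibration itself is not addressed by this sketch.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def normalize(value, low, high):
    """Min-max normalize value from [low, high] to [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def bounded_laplace(x, scale, lower=-1.0, upper=1.0, rng=random):
    """Resample x + Laplace noise until the result lies in [lower, upper]."""
    while True:
        y = x + laplace_noise(scale, rng)
        if lower <= y <= upper:
            return y


random.seed(0)
# 100 random integers between 0 and 9, normalized to [-1, 1] as in the text.
originals = [normalize(random.randint(0, 9), 0, 9) for _ in range(100)]
scale = 2.0 / 0.1  # assumed: sensitivity (range width 2) divided by epsilon = 0.1

plain = [x + laplace_noise(scale) for x in originals]
bounded = [bounded_laplace(x, scale) for x in originals]

print(sum(1 for v in plain if abs(v) > 1.0), "plain outputs fall outside [-1, 1]")
print(all(-1.0 <= v <= 1.0 for v in bounded))  # True by construction
```

The bounded outputs are still continuous values, which is why a separate discretization step is needed before they can serve as category labels.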

However, referring to FIG. 3B, some of the perturbed categorical values, much like out-of-bound values, do not exist in the originally given data. Referring to FIG. 3C, when the additional post-processing discretization is applied, all categorical data are guaranteed to fall within the given data range.

[Validation using real-world data]

For this verification, we used the eICU Collaborative Research Database. The difference between the original data and the perturbed data was calculated using the mean squared error (MSE) for continuous data and the misclassification rate for categorical data.
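The two difference measures named above can be written out as a short sketch. The function names are hypothetical, and both functions assume two equal-length, non-empty sequences.

```python
def mean_squared_error(original, perturbed):
    """Average squared difference between paired continuous values."""
    return sum((o - p) ** 2 for o, p in zip(original, perturbed)) / len(original)


def misclassification_rate(original, perturbed):
    """Fraction of paired categorical values that no longer match."""
    return sum(o != p for o, p in zip(original, perturbed)) / len(original)


print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 4/3
print(misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0]))    # 0.25
```

Both measures are 0 when the perturbed data equal the original data and grow as the perturbation becomes stronger.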

FIGS. 4A and 4B are graphs showing the relationship between the epsilon value and the degree of data perturbation according to the present invention. More specifically, FIG. 4A shows this relationship for continuous data, and FIG. 4B shows it for categorical data.

Referring to FIG. 4A, the MSE of the continuous eICU data varies from variable to variable because of the variance among the original values. For example, pH and albumin are similar across individuals, whereas heart rate and glucose differ widely.

Referring to FIG. 4B, among the categorical data, the intubation, ventilation, and dialysis statuses take the value 0 or 1, so the chance level is 0.5. The eye score may range from 0 to 4, the verbal score from 0 to 5, and the motor score from 0 to 6. Accordingly, the misclassification rates differ among variables, particularly when ε is small.

On the other hand, referring to FIGS. 4A and 4B, as ε increases, all perturbed values in both the continuous data and the categorical data approach the original values.

FIG. 5 is a graph showing the classification accuracy of different machine learning models as a function of the epsilon value according to the present invention. In FIG. 5, the performance of the centralized model is indicated by dashed lines.

To simulate data utility as a function of the epsilon value, we built predictive classifiers to predict mortality in the eICU data set. Of the 4,740 patients, 3,597 were alive, which gives a chance level of 76% accuracy. Referring to FIG. 5, the lower the epsilon value, the more severe the data perturbation, and the accuracy fell close to the chance level. Conversely, as the epsilon value increased, classification performance improved and converged to the accuracy obtained with the original data (dashed lines in FIG. 5). This trend was consistent across the different models, and the random forest in particular performed best.
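The 76% chance level follows directly from the class counts given above: a classifier that always predicts the majority class (alive) is correct for every surviving patient. A short check, with hypothetical variable names:

```python
alive, total = 3597, 4740

# Accuracy of always predicting the majority class (alive).
chance_level = alive / total
print(f"{chance_level:.3f}")  # 0.759, i.e. about 76%
```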

As such, the method for generating synthetic data according to the present invention can strengthen the protection of personal information provided by differential privacy while also increasing the usability of the data.

Additionally, when the apparatus for generating synthetic data using differential privacy according to the present invention provides the entire synthetic data generated by the above-described method to a third party (e.g., by publishing it), it may also provide information on the epsilon value that was applied.

FIG. 6 is a diagram illustrating an apparatus for generating synthetic data using differential privacy according to another example of the present invention.

As shown in FIG. 6, the synthetic data generating apparatus (600) that performs the method for generating synthetic data using differential privacy according to the present invention may include a database (610) and a processor (620).

The database (610) may hold not only the original data but also the results generated by the above-described synthetic data generation method (e.g., the first synthetic data, the second synthetic data, and the entire synthetic data). The processor (620) may be connected to the database (610) and configured to perform the above-described synthetic data generation method.

Additionally, a computer program according to the present invention may be stored in a computer-readable recording medium in order to execute, in combination with a computer, the above-described synthetic data generation method.

The above-described program may include code written in a computer language such as C, C++, JAVA, or machine language that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer reads the program and executes the methods implemented by the program. Such code may include functional code related to the functions that define what is needed to execute the methods, and may include execution-procedure control code needed for the processor of the computer to execute those functions according to a predetermined procedure. The code may further include memory-reference code indicating at which location (address) in the internal or external memory of the computer the additional information or media needed for the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with any other remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer and what information or media should be transmitted and received during communication.

The steps of the method or algorithm described in relation to the embodiments of the present invention may be implemented directly in hardware, as a software module executed by hardware, or as a combination of the two. A software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those skilled in the art to which the present invention pertains will understand that the present invention can be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

400: synthetic data generating apparatus
410: database
420: processor

Claims

1. A method of generating synthetic data using differential privacy, performed by a server, comprising:
a data classification step of classifying a plurality of data included in original data into continuous data and categorical data according to type;
a first synthetic data generating step of generating first synthetic data from the continuous data using a bounded Laplacian method;
a second synthetic data generating step of generating second synthetic data from the categorical data using the bounded Laplacian method and a post-processing discretization technique; and
a total synthetic data generating step of generating total synthetic data using the first synthetic data and the second synthetic data,
wherein the first synthetic data and the second synthetic data are generated using an epsilon value at which an accuracy value of the total synthetic data relative to the original data is greater than or equal to a predetermined value.

2. (Deleted)

3. The method of generating synthetic data using differential privacy of claim 1,
wherein the predetermined value is 75% or more, and
the epsilon value is one value within the range of 10³ to 10⁴ based on the predetermined value.

4. The method of generating synthetic data using differential privacy of claim 1,
wherein the bounded Laplacian method includes a technique of normalizing all variables to the range of -1 to 1.

5. The method of generating synthetic data using differential privacy of claim 1,
wherein the post-processing discretization technique includes a technique of probabilistically discretizing the data perturbed by the bounded Laplacian method.

6. The method of generating synthetic data using differential privacy of claim 5,
wherein the probabilistic discretization technique includes a technique of discretizing based on a Bernoulli probability according to a Bernoulli distribution function.

7. The method of generating synthetic data using differential privacy of claim 1,
wherein the method is performed by an entity that owns the original data.

8. An apparatus for generating synthetic data using differential privacy, comprising:
a database for storing original data; and
a processor configured to generate synthetic data using differential privacy from the original data,
wherein the processor is configured to:
classify a plurality of data included in the original data into continuous data and categorical data according to type;
generate first synthetic data from the continuous data using a bounded Laplacian method;
generate second synthetic data from the categorical data using the bounded Laplacian method and a post-processing discretization technique; and
generate total synthetic data using the first synthetic data and the second synthetic data,
wherein the first synthetic data and the second synthetic data are generated using an epsilon value at which an accuracy value of the total synthetic data relative to the original data is greater than or equal to a predetermined value.

9. (Deleted)

10. The apparatus for generating synthetic data using differential privacy of claim 8,
wherein the predetermined value is 75% or more, and
the epsilon value is one value within the range of 10³ to 10⁴ based on the predetermined value.

11. The apparatus for generating synthetic data using differential privacy of claim 8,
wherein the bounded Laplacian method includes a technique of normalizing all variables to the range of -1 to 1.

12. The apparatus for generating synthetic data using differential privacy of claim 8,
wherein the post-processing discretization technique includes a technique of probabilistically discretizing the data perturbed by the bounded Laplacian method.

13. The apparatus for generating synthetic data using differential privacy of claim 12,
wherein the probabilistic discretization technique includes a technique of discretizing based on a Bernoulli probability according to a Bernoulli distribution function.

14. The apparatus for generating synthetic data using differential privacy of claim 8,
wherein the apparatus corresponds to an entity that owns the original data.

15. A computer program stored in a computer-readable recording medium to execute, in combination with a computer:
a data classification step of classifying a plurality of data included in original data into continuous data and categorical data according to type;
a first synthetic data generating step of generating first synthetic data from the continuous data using a bounded Laplacian method;
a second synthetic data generating step of generating second synthetic data from the categorical data using the bounded Laplacian method and a post-processing discretization technique; and
a total synthetic data generating step of generating total synthetic data using the first synthetic data and the second synthetic data,
wherein the first synthetic data and the second synthetic data are generated using an epsilon value at which an accuracy value of the total synthetic data relative to the original data is greater than or equal to a predetermined value.

16. (Deleted)

17. The computer program stored in a computer-readable recording medium of claim 15,
wherein the predetermined value is 75% or more, and
the epsilon value is one value within the range of 10³ to 10⁴ based on the predetermined value.

18. The computer program stored in a computer-readable recording medium of claim 15,
wherein the bounded Laplacian method includes a technique of normalizing all variables to the range of -1 to 1.

19. The computer program stored in a computer-readable recording medium of claim 15,
wherein the post-processing discretization technique includes a technique of probabilistically discretizing the data perturbed by the bounded Laplacian method.

20. The computer program stored in a computer-readable recording medium of claim 19,
wherein the probabilistic discretization technique includes a technique of discretizing based on a Bernoulli probability according to a Bernoulli distribution function.